Typically, the first step is to import the data from its sources into a centralized location: a landing zone. This requires the necessary credentials and access rights for each source. Data import is often done periodically, which may require orchestration via pipelines. For this, we typically use tools like Microsoft Azure Data Factory or, more recently, Microsoft Fabric Data Factory.
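As an illustration, here is a minimal ingestion sketch in Python, assuming a file-based source and a date-partitioned landing zone; the paths are hypothetical, and in practice a Data Factory pipeline would trigger and monitor a job like this on a schedule:

```python
# Minimal sketch of a periodic ingestion job, assuming a file-based source
# and a date-partitioned landing zone; paths and schedule are hypothetical.
import shutil
from datetime import date
from pathlib import Path

SOURCE_DIR = Path("/mnt/source/exports")   # hypothetical source location
LANDING_ZONE = Path("/mnt/landing/sales")  # hypothetical landing zone

def ingest_daily() -> None:
    # Partition each run by load date so re-runs never overwrite history.
    target = LANDING_ZONE / f"load_date={date.today().isoformat()}"
    target.mkdir(parents=True, exist_ok=True)
    for source_file in SOURCE_DIR.glob("*.csv"):
        shutil.copy2(source_file, target / source_file.name)

if __name__ == "__main__":
    ingest_daily()  # typically triggered by an orchestration pipeline, not run by hand
```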
The data never arrives in a perfect state. It often contains duplicates, missing values, corrupted entries, or inconsistent formats. Data cleaning is a crucial step in making the data usable. We typically do this in Python, but can adjust depending on your requirements.
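A minimal cleaning sketch in pandas, with hypothetical column names; the actual rules always depend on your data and requirements:

```python
# Minimal cleaning sketch with pandas; column names are hypothetical.
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()
    # Normalise inconsistent formats before type conversion.
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    # Drop rows where key fields could not be recovered.
    df = df.dropna(subset=["order_id", "order_date"])
    # Fill remaining gaps with a sensible default.
    df["amount"] = df["amount"].fillna(0.0)
    return df
```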
Ultimately, the data must serve your business needs. This usually requires merging data from multiple sources and applying business logic that reflects your KPIs. For maximum flexibility, we also do this in Python, unless instructed otherwise. This part of the data engineering process is typically where cooperation with your team matters most.
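A minimal sketch of merging two sources and applying an example business rule in pandas; the table names, columns, and KPI definitions are hypothetical stand-ins for your own:

```python
# Minimal sketch of merging sources and applying KPI logic with pandas;
# table names, columns, and the business rule are hypothetical.
import pandas as pd

def build_kpis(orders: pd.DataFrame, customers: pd.DataFrame) -> pd.DataFrame:
    merged = orders.merge(customers, on="customer_id", how="left")
    # Example business rule: only completed orders count towards revenue.
    completed = merged[merged["status"] == "completed"]
    completed = completed.assign(month=completed["order_date"].dt.to_period("M"))
    return (
        completed
        .groupby(["region", "month"])
        .agg(revenue=("amount", "sum"), orders=("order_id", "nunique"))
        .reset_index()
    )
```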
Depending on the reporting needs, this is where the data is exported into a suitable format. Whether the priority is frequent updates, easy access (such as via OneLake), fast querying, or direct integration with a dashboarding tool, this is where we handle it. At this stage, it's important that the reporting needs are clear, so we can tailor the solution accordingly.
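A minimal export sketch; Parquet is shown here because OneLake and most dashboarding tools read it natively, but the target format and path (hypothetical below) follow from the reporting needs:

```python
# Minimal export sketch; the output path is hypothetical. The same step could
# instead write CSV, load a database table, or feed a dashboarding tool.
import pandas as pd

def publish(kpis: pd.DataFrame, target_path: str) -> None:
    # Parquet preserves column types and compresses well for frequent refreshes.
    kpis.to_parquet(target_path, index=False)

# Example usage with a hypothetical output location:
# publish(monthly_kpis, "/mnt/gold/monthly_kpis.parquet")
```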