Get certified for free when you join Fabric Data Days 2026 and dive into Fabric, Power BI, SQL, AI, and other essential data skills.
Join nowData Days is here! Join us now for 60+ days of learning, challenges, and connection. Learn more
Co-Author - Abhishek Narain
Building an effective Lakehouse starts with establishing a robust ingestion layer. Ingestion refers to the process of collecting, importing, and processing raw data from various sources into the data lake. Data ingestion is fundamental to the success of a data lake as it enables the consolidation, exploration, and processing of diverse and raw data. It forms the backbone for downstream analytics, machine learning, and reporting activities, equipping organizations with the flexibility and agility to extract meaningful insights from their data. In this blog, we will walk through the key components of a data Ingestion framework based on experience working with different customers while building a lakehouse in Fabric.
Data Ingestion can be divided in 2 types:
Please note that we have only considered batch ingestion. Real time ingestion is not in scope of this blog.
Below graphic represents the basic building blocks (components) of the ingestion framework.
Demystifying_data_Ingestion_in_Fabric_fundamental_components_for_ingesting_Data
Below table details out different components and purpose of each component.
| # | Component Name | Purpose |
|---|---|---|
| 1 | Control Table | Control table is used to control and select an item from the source system to be moved into a data lake. |
| 2 | Data Ingestion | Data Copy component is used to copy the data from source system to data lake. Typically, same format is used for ingesting data from Source to Bronze Layer/Staging layer but we have the flexibility to change the source type during ingestion. |
| 3 | Auditing | Auditing component is used to audit the records ingested into data lake. Auditing includes identifying any errors or issues that may have occurred during the process, as well as reviewing performance metrics to identify areas for improvement. |
| 4 | Notification | Notification component is used to notify in case of success or failure events. This is super critical to ensure that the operation team is notified during all critical events. |
| 5 | Config Management | Config Management component is used for managing the configuration. |
| 6 | Reporting | This component is used for visual reporting of different ingestion metrics. |
Demystifying_data_Ingestion_in_Fabric_fundamental_components_for_ingesting_Data
2. Data ingestion - This component is essential for copying data from the source. A key consideration is datatype mapping from source to target, as Microsoft Fabric stores data in Delta Lake format. For more information on the datatypes supported by Microsoft Fabric, refer to the documentation. There may be scenarios where unsupported data types are encountered. For example, in SQL Server, data types such as geography, money, and datetime2(7) do not have corresponding equivalents in Delta Lake. This must be handled carefully to avoid loss of information when ingesting data into the Microsoft Fabric Lakehouse or warehouse.
3. Auditing - The auditing component is used to inspect the records ingested into the lakehouse. It helps in tracking data lineage, ensuring transparency in data transformations. It logs errors and anomalies, aiding in issue detection and correction. Performance metrics are captured to optimize the transformation processes. It also supports compliance by maintaining historical records of data activities. Additionally, it provides an audit trail for accountability and security.
Fabric's data warehouse can be used to audit all the information. Some of the key fields which can be audited are:
| Field Name | Purpose |
|---|---|
| source_type | Type of source (Blob,ADLS Gen2,Database,AWS S3) |
| event_run_id | Captured from data pipeline run id (Data pipeline system variable) e.g. 36ff2fde-dd85-4b1f-982e-651530df7c90. |
| item_name | Actual Item Captured from source e.g. For files Container/folder name & for Database it will be Database.schemaname.tablename |
| data_read | Total amount of data retrieved from source data store, including data copied in this run and data resumed from the last failed run. |
| data_written | The actual amount of data written/committed to the sink. The size may be different from dataRead size, as it relates to how each data store stores the data. |
| files_read | Total number of files copied from source data store, including files copied in this run and files resumed from the last failed run. Note: In case of database this field will be NA |
| files_written | The actual amount of data written/committed to the sink. The size may be different from data Read size, as it relates to how each data store stores the data. |
| rows_read | The actual amount of data read from the source. |
| rows_written | Number of rows copied to sink. This metric does not apply when copying files as-is without parsing them, for example, when source and sink datasets are binary format type, or other format type with identical settings. |
| data_consistency_verification | Tell the result whether the data has been verified to be consistent between source and destination store after being copied. |
| copy_duration | Total amount of duration take to copy the data. |
| event_start_time | Time when the pipeline started |
| event_end_time | Time when the pipeline ended. |
| source_cutoff_time | Source_cutoff_time will be the incremental pipeline execution time. Incase of full load it will be same as source_cutoff_time in the control table. This will be updated on each incremental load. |
| load_type | Status either Full or Incr (Incremental). |
| status | Status either success or failure. |
| event_triggered_by | Status either Manual or Scheduled. |
| error_details | Contain error codes and messages that pinpoint the location and nature of the problem. |
Demystifying_data_Ingestion_in_Fabric_fundamental_components_for_ingesting_Data
4. Notification - The notification component is utilized to alert in the event of success or failure. This is crucial to ensure the operations team is informed during all critical events. In Microsoft fabric, the Office 365 Outlook activity in Data Factory allows you to send an email with your Office 365 account. We also have the Teams activity in Data Factory which allows you to send a message to a Teams channel or group chat. The message in Office 365 outlook account and Teams can include dynamic expressions to be customized as much as necessary. Please find below a sample Teams notification sample.
Demystifying_data_Ingestion_in_Fabric_fundamental_components_for_ingesting_Data
5. Config management - Config management is used to manage configurations, which are essentially the connection details of the source.
6. Reporting - The ingestion reporting dashboard is a single pane of glass for all the displaying all the information captured by audit. One of the advantages of Fabric is that all audit information captured in data warehouse will be displayed using Power BI. Please find below a sample Teams notification sample.
Demystifying_data_Ingestion_in_Fabric_fundamental_components_for_ingesting_Data
When designing our Ingestion framework, adhere to several fundamental principles to ensure its effectiveness, maintainability, and adaptability.
I hope this blog helped you in understanding how to design an effective ingestion framework for a lake house using Fabric platform.
Special shout out to Gyani, Winnie, Polly, Hideo, and Dharmendra for their continuous support!
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.