This is best Fabric, Power BI, SQL and AI community event. How do we know? The last event sold out! Save €200 with code FABCMTY200.
Register nowA new Data Days event is coming soon! This time we’re going bigger than ever. Fabric, Power BI, SQL, AI and more. Don't miss out.
Data Factory empowers you to ingest, prepare and transform data across your data estate with a modern data integration experience. Whether you are a citizen or professional developer, Data Factory is your one-stop-shop to move or transform data. It offers you intelligent transformations and a rich set of activities from hundreds of cloud and on-premises data sources, and a growing list of output destinations that span across both Microsoft and 3rd party databases, services, and applications.
There are two primary high-level features Data Factory implements: dataflows and pipelines.
In this article, we will focus on Dataflow Gen2 and how it works. Dataflow Gen2 is built using the familiar Power Query experience that's available today across several Microsoft products and services such as Excel, Power BI, Power Platform, Dynamics 365 Insights applications, and more. Power Query empowers all users, ranging from citizen to professional, to perform data ingestion and data transformations across their data estate. With Dataflow Gen2, you can perform joins, aggregations, data cleansing, custom transformations, and much more all from an easy-to-use, highly visual, low-code UI.
Dataflow Gen2 in Fabric is the evolution of Dataflow, redesigned from the ground up with more powerful compute engines, data movement capabilities to help you load data to several Fabric and Azure data destinations. In this blog, we will provide an overview of the Dataflow Gen2 architecture and how the dataflow engine works.
Your data integration project can be as simple as ingesting data from source to destination with little modification, or as complex as blending and transforming data from different structured and unstructured cloud and on-premise data sources. At each end of the spectrum, different strategies are required to either transform data in memory, or land it in staging storage where you can leverage different compute technologies to handle big data volumes. Another challenge in data integration projects is to manage orchestration logic to make it all work in sync. Building such projects requires the same complexity as coding a program flow in any procedural programming language. As with any computer program, it requires trained professionals that can conceptualize and create a program flow, handling all the branches and exceptions, configuring compute and data providers to work together, and is similarly rife with bugs that must be weeded out from the system over time.
The dataflow engine abstracts away traditional ETL and ELT complexities described above. In the next sections, we will review the different technologies the dataflow engine brings together and through examples, we will show how complex decisions and configurations are abstracted away.
Dataflows Gen2 are authored using Power Query. Once you publish a dataflow, the dataflow definition is generated – this is the program that will run once the dataflow is refreshed to produce tables in staging storage and/or output destination. During refresh, the definition of a dataflow is used by the dataflow engine to generate an orchestration plan, manage resources, and orchestrate execution of queries across data sources, gateways, and compute engines, and to create tables in either the staging storage or data destination.
Diagram_that_captures_the_architecture_of_the_dataflow_engine_during_its_executi
The above diagram captures the various components of the dataflow architecture, including the Lakehouse artifact used to stage data being ingested, and Warehouse artifact used as a compute engine and means to write back results to staging or supported output destinations faster. When warehouse compute cannot be used, or when staging is disabled for a query, the mashup engine will extract, transform, or load the data to staging or destination.
Note
When your first Dataflow Gen2 is created in a workspace, dataflow staging Lakehouse and Warehouse are provisioned along with their related SQL Endpoint and Datasets. These artifacts are an implementation detail of Dataflow Gen2 that is currently visible and required for their operations. These artifacts should not be deleted. It is recommended to not leverage them outside of dataflows experience as they will be hidden in the future.
Screenshot_of_a_list_of_staging_artifacts_created_by_the_Dataflow_Gen2_artifact
Simple dataflows may only require moving a small dataset from source to destination with little modification and can be performed in memory by the mashup engine. However, this may not work well for all your ETL projects.
In more complex projects with patterns that involve blending data from multiple cloud or on-prem sources or transforming big data volumes from sources that do not support folding (predicate pushdown), it may be best to first ingest the data into staging and then perform the transformation work using dataflow’s specialized engine, and finally, load the data to an output destination.
Let's review how dataflows work with a few examples:
I would like to build reports and dashboards for the Contoso Fast Food chain. For my report, I need to analyze store and online transactions per customer. My data includes hundreds of millions of records for customers, physical and online store purchases all stored in ADLS Gen2 storage account.
I need to extract the data from the source, clean it, append transactions and merge then with the customer data. Then, perform aggregations and load data to the destination I will use for reporting. As you see in the diagram below, I will:
Note
When authoring a dataflow, a query containing the lightning icon signals that the enhanced compute engine will be used during refresh.
Screenshot_of_the_Power_Query_editor
When I refresh the dataflow, here is what happens behind the scenes:
Dataflow_refresh_process_diagram_that_explains_how_staging_artifacts_are_leverag
One significant improvement in dataflow gen2’s warehouse compute engine architecture is seen when the dataflow engine orchestrates reading from staging, transforming, and writing back results into staging or supported destinations. The engine leverages the DataflowStagingWarehouse to not only fold/push down the computation, but also to write the data results back to staging/destination.
You might not always want to stage your data before loading it to an output destination. Here are a few examples where you might want to load data directly to a destination:
Screenshot_of_the_Power_Query_editor_with_Azure_SQL_Database_as_the_output_desti
In this scenario, I would like to ingest a file containing 100k records to my Azure SQL. I would also like to apply a few transformations like renaming and removing columns and filter out columns that contains invalid data. For these types of transformations and data volume, staging data is not required. I can disable staging for any query by right clicking the query and deselecting the “Enable Load” action. The query name will become italic, and in the diagram view, the query’s outline will be greyed out as seen in the above image.
Behind the scenes, controlling the staging setting of a query will change orchestration from its default setting:
Diagram_of_the_Power_Query_Evaluation_when_Staging_is_Enabled
To orchestration that extracts, transforms, and loads data from source to destination in memory:
Diagram_of_the_Power_Query_Evaluation_when_Staging_is_Enabled
In the Power Query Editor, you can enable or disable staging by right clicking the query and selecting the "Enable staging" option. By default, its enabled, as indicated by the checkmark:
Data_Factory_Spotlight_Dataflow_Gen2
It's important to note that in Dataflow Gen2, disabling staging does not mean data will not be loaded to a destination. This setting controls whether data will or will not be staged. Loading to a destination is managed via the query’s data destination control at the bottom right corner of the editing experience:
Data_destination_set_to_SQL_Server_database
Note
Today, a dataflow must contain at least one load enabled query in order to publish. We are updating the dataflow experience to allow publishing dataflows with no load-enabled queries as long as at least data destination is configured. This change will by available in all production regions by August 2023.
Let's review the difference between the refresh history of the above dataflow, when it is refreshed with and without staging enabled.
As you can see below, when staging is disabled, the refresh history will only contain the activity of loading to the output destination:
Screenshot_of_refresh_history_dialog
And when staging is enabled, refresh history will contain both an entry for loading data into staging and the activity of loading data from staging into the output destination:
Screenshot_of_refresh_history_dialog
When staging is enabled, ingestion will take more time. If there is no need to have data staged, it should be disabled.
When creating dataflows that load data to Fabric Lakehouse or Warehouse, for best performance, it is recommended to co-locate the dataflow and destinations in the same workspace.
The Dataflow Gen2 architecture brings together a range of technologies that help you extract, transform and load data into a variety of Fabric and Azure destinations, with no code and little effort - making it easy to get started with data analytics in Fabric.
Over the next few weeks, we’ll dive deeper into more Data Factory topics.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.