Introduction
In this article, you will learn how to process massive volumes of data using Microsoft Fabric.
Prerequisites
Overview
This architectural overview shows the various Fabric artifacts that we will use during the process.
The above architecture follows the guidelines of the medallion lakehouse architecture, which organizes data in a Lakehouse into distinct layers, following best practices for data management and processing.
Procedure Overview
1. Create a Lakehouse to store the raw data and enriched data in Delta format.
2. Create a Data pipeline to copy the raw data from the on-premises data source to the Lakehouse.
3. Create a Notebook to transform the raw data and build the staging tables.
4. Create a Dataflow Gen2 to normalize the staging tables.
5. Create a Power BI semantic model and build the report.
Procedure
Create a Lakehouse
1. Open app.powerbi.com or app.fabric.microsoft.com and create a new Fabric workspace. Make sure the workspace is hosted on a Fabric/Premium capacity.
2. Click on the Power BI icon at the bottom left and switch to the Data Engineering persona. Click on the Lakehouse icon to create a new Lakehouse, enter a name, select the sensitivity label and click on Create.
3. As soon as you click on Create, a Lakehouse, a SQL analytics endpoint and a default semantic model are automatically created.
You can click on the three dots icon next to the SQL analytics endpoint and click on ‘Copy SQL analytics endpoint’. Note it down; we will use this value in a later section.
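If you prefer to script this step, a Lakehouse can also be created through the Fabric REST API. The sketch below is a minimal Python example; the workspace ID, the Lakehouse name ‘SalesLakehouse’ and the way the access token is obtained are placeholders to replace with your own values.

```python
# Minimal sketch: create a Lakehouse via the Fabric REST API.
# Assumes you already have a Microsoft Entra access token with the
# appropriate Fabric scopes and know your workspace ID.
import requests

workspace_id = "<your-workspace-id>"   # placeholder
access_token = "<your-access-token>"   # e.g. acquired with MSAL / azure-identity

response = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/lakehouses",
    headers={"Authorization": f"Bearer {access_token}"},
    json={"displayName": "SalesLakehouse", "description": "Raw and enriched sales data"},
)
response.raise_for_status()
print(response.json())
```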
Before creating the pipeline:
Create an on-premises gateway connection
a. Open app.fabric.microsoft.com, click on the gear icon at the top right and click on ‘Manage gateways and connections’.
b. Click on the ‘+ New’ button at the top left; a new connection window will open on the right.
After filling in all the details of your data source, click on the ‘Create’ button.
Note: You can only create an on-premises gateway connection after configuring the on-premises data gateway machine. Refer to the link in the Prerequisites section for more information.
Create a Data pipeline
1. Open your Fabric workspace, switch to the Data Engineering persona and click on the pipeline icon to create a new data pipeline.
2. Enter the name of your pipeline, select the appropriate sensitivity label and click on Create.
3. Open the pipeline, click on the ‘Copy data’ dropdown in the top menu and click on ‘Add to canvas’.
4. Click on the added ‘Copy data’ activity; under the ‘General’ section you can give the activity a meaningful name.
5. Switch to the ‘Source’ section and select the on-premises gateway connection from the connection dropdown. Check the box next to ‘Recursively’. If you want the files to be deleted from the on-premises server as soon as they are moved to the Lakehouse, also check the box next to ‘Delete files after completion’.
6. Switch to the ‘Destination’ section. From the ‘Connection’ dropdown, select the Lakehouse that was created before and select the ‘File format’ as Binary.
7. Run the pipeline, check the run status and make sure the files are copied to the Lakehouse (one way to verify this from a notebook is shown in the sketch below).
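Besides checking the pipeline run status, you can also confirm the landed files from a Fabric notebook. This is a minimal sketch; the folder name ‘Files/raw’ is an assumption and should match the destination folder you configured in the Copy data activity.

```python
# Minimal sketch: list the files the pipeline copied into the Lakehouse.
# Assumes the notebook has the Lakehouse attached and that the Copy data
# activity wrote into a folder named "raw" under Files/ (adjust as needed).
from notebookutils import mssparkutils

for f in mssparkutils.fs.ls("Files/raw"):
    print(f.name, f.size)
```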
Now that the data is copied to the Fabric Lakehouse, in the following section we will create a Fabric notebook to transform the data and build the staging tables.
Create a Notebook
1. Switch to the Synapse Data Engineering persona and click on the Notebook icon.
2. Select the appropriate sensitivity label and click on ‘Save and continue’.
3. Open the notebook and, in the left-hand panel, click ‘Add’ to link a Lakehouse.
4. Since we created the Lakehouse in the previous steps, choose the existing Lakehouse option and select it.
5. Import the libraries required to transform the raw data, initialize the relative path of the raw data folder in the Lakehouse, and initialize the schema variable with the columns of the CSV files.
6. Run the Spark code to load the data from the CSV files into Spark DataFrames and create the staging Delta tables (a minimal sketch of steps 5 and 6 follows this list).
7. Make sure the Delta table is created. You can verify it by expanding the Tables folder in the left-hand panel.
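The Spark code referenced in steps 5 and 6 can be sketched roughly as follows. The folder name ‘Files/raw’, the schema columns and the data types are illustrative assumptions to adapt to your own CSV files; ‘Staging_Sales’ is the staging table used later in the dataflow.

```python
# Minimal sketch for steps 5 and 6: read the raw CSV files with an explicit
# schema and write them out as a staging Delta table. The folder name,
# column names and data types are illustrative placeholders.
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, DoubleType, DateType
)

# Relative path of the raw data folder inside the attached Lakehouse
raw_path = "Files/raw"

# Schema describing the columns of the incoming CSV files (adjust as needed)
sales_schema = StructType([
    StructField("OrderID", IntegerType(), True),
    StructField("OrderDate", DateType(), True),
    StructField("CustomerName", StringType(), True),
    StructField("Product", StringType(), True),
    StructField("Quantity", IntegerType(), True),
    StructField("UnitPrice", DoubleType(), True),
])

# 'spark' is the SparkSession that Fabric notebooks create for you.
# Load every CSV file in the folder into a Spark DataFrame.
raw_df = (
    spark.read
    .format("csv")
    .option("header", "true")
    .schema(sales_schema)
    .load(raw_path)
)

# Write the DataFrame as a staging Delta table in the Lakehouse
raw_df.write.mode("overwrite").format("delta").saveAsTable("Staging_Sales")
```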
Now that the raw data has been transformed, let's normalize the fact table using Dataflow Gen2.
Create a Dataflow Gen2
1. Switch to the Data Factory persona and click on the Dataflow Gen2 icon.
2. Click on ‘Get data’ and select SQL Server. Enter the Lakehouse SQL analytics endpoint in the ‘Server’ field, click on Next, select the ‘Staging_Sales’ table created before and click on Next.
3. As per the business logic, and keeping the downstream analytics solutions in mind, normalize the fact table into dimensions and facts.
4. At the bottom right you will find an option to add a Lakehouse as the destination. Click on Add and select the Lakehouse. Enter the table name and click on Next. Choose the update method (append or overwrite) depending on your data pipeline strategy.
5. Select the appropriate data types for each column and click on ‘Save settings’.
6. Repeat steps 4 and 5 for all the other tables and run the dataflow.
7. Once the data refresh is completed, open the Lakehouse or the SQL analytics endpoint to validate the newly created Delta tables.
Note: To reduce the run time of the dataflow, make sure you use the ‘Native SQL’ option and include the transformation logic within the SQL statement. For example, instead of retrieving the whole sales table and then performing the distinct operation in the dataflow, include the DISTINCT operation within the SQL statement itself (a query of the form SELECT DISTINCT <columns> FROM Staging_Sales).
Consider truncating the staging tables and archiving the raw data files in the Lakehouse as soon as the dataflow refresh is completed. This practice can reduce the running time of subsequent runs and reduce cost.
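A rough sketch of that cleanup step is shown below, assuming a staging table named ‘Staging_Sales’ and raw files under ‘Files/raw’; the archive folder name is an illustrative choice.

```python
# Minimal sketch: clear the staging table and archive the raw files once the
# dataflow refresh has completed. Table and folder names are placeholders.
from notebookutils import mssparkutils

# Remove all rows from the staging Delta table (the table itself is kept)
spark.sql("DELETE FROM Staging_Sales")

# Move the processed raw files into an archive folder in the Lakehouse
mssparkutils.fs.mkdirs("Files/archive")
for f in mssparkutils.fs.ls("Files/raw"):
    mssparkutils.fs.mv(f.path, "Files/archive/" + f.name)
```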
Create a Power BI Semantic model and build the Power BI Report
1. Switch to the Power BI persona, open the Lakehouse and at the top you will find the option to create a ‘New semantic model’.
2. Enter a name for your semantic model, select all the production-grade (non-staging) tables and click on Confirm.
3. Open the new semantic model and establish the relationships between the dimension tables and the fact tables.
4. After completing the necessary data modelling, create the appropriate measures for building the dashboard.
5. Click on the ‘New report’ button at the top to create the reports as per your requirements. A sample report would look like this:
Data Refresh
To refresh all the key data artifacts, instead of scheduling them individually or triggering them manually, orchestrating all the activities in the data pipeline would be the best option.
In the same pipeline that we created in the previous steps, include the notebook, dataflow and semantic model refresh activities and schedule the pipeline.
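If you prefer to trigger the semantic model refresh from code rather than the built-in refresh activity, one option is the Power BI REST API. The sketch below is only an illustration; the workspace ID, semantic model ID and access token are placeholders.

```python
# Minimal sketch: queue a refresh of the semantic model through the
# Power BI REST API. IDs and the access token are placeholders.
import requests

group_id = "<workspace-id>"        # workspace (group) hosting the semantic model
dataset_id = "<semantic-model-id>"
access_token = "<access-token>"    # token with Dataset.ReadWrite.All permission

response = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/groups/{group_id}/datasets/{dataset_id}/refreshes",
    headers={"Authorization": f"Bearer {access_token}"},
    json={"notifyOption": "MailOnFailure"},
)
response.raise_for_status()  # a 202 Accepted response means the refresh was queued
```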
You can also consider including failure alerts, so that as soon as any activity in the pipeline fails you will be notified with an email.