tharunkumarRTK

Efficiently Processing Massive Data Volumes using Microsoft Fabric: A Comprehensive Guide

Introduction

In this article, you will learn how to process massive volumes of data using Microsoft Fabric.

Prerequisites

  1. Sample Datasets: Link
  2. Enable Fabric for your organization
  3. Fabric Capacity and User license: Microsoft Fabric License, Buy a Fabric capacity
  4. A virtual machine with the on-premises data gateway installed: Install an on-premises data gateway

    Overview

This architectural overview shows the various Fabric artifacts that we will use during the process.

    tharunkumarRTK_0-1723877761779.png

The above architecture follows the guidelines of the medallion architecture, which organizes data in a Lakehouse into distinct layers, following best practices for data management and processing.

     

    Procedure Overview

1. Create a Lakehouse to store the raw data and enriched data in Delta format.

2. Create a Data pipeline to copy the raw data from the on-premises data source to the Lakehouse.

3. Create a Notebook to transform the raw data and build the staging tables.

4. Create a Dataflow Gen2 to normalize the staging tables.

5. Create a Power BI semantic model and build the report.

     

    Procedure
    Create a Lakehouse

1. Open app.powerbi.com or app.fabric.microsoft.com and create a new Fabric workspace. Make sure the workspace is hosted in a Fabric/Premium capacity.

2. Click on the Power BI icon at the bottom left and switch to the Data Engineering persona, then click on the Lakehouse icon to create a new Lakehouse.

    tharunkumarRTK_1-1723878036921.png


Enter a name, select the sensitivity label, and click on Create.

     

     

3. As soon as you click on Create, a Lakehouse, a SQL analytics endpoint, and a default semantic model will be created automatically.

    tharunkumarRTK_3-1723879047363.png

    tharunkumarRTK_4-1723878114692.png

     

You can click on the three-dots icon next to the SQL analytics endpoint and click on ‘Copy SQL analytics endpoint’. Note it down in your notepad; we will use this information in a later section.

     

    Before creating the pipeline:

    1. Make sure your raw data files are on an on-premises server.
      tharunkumarRTK_5-1723878203391.png
    2. Create an on-premises gateway connection
       a. Open app.fabric.microsoft.com, click on the gear icon at the top right, and open the ‘Manage gateways and connections’ page.

      tharunkumarRTK_6-1723878241060.png

       b. Click on the ‘+ New’ button at the top left; a new connection window will open on the right.

      tharunkumarRTK_7-1723878309405.png

After filling in all the details of your data source, click on the ‘Create’ button.

Note: You can only create an on-premises gateway connection after configuring the on-premises gateway machine. Refer to the link in the prerequisites section for more information.

       

      Create a Data pipeline

1. Open your Fabric workspace, switch to the Data Engineering persona, and click on the Data pipeline icon to create a new data pipeline.

      tharunkumarRTK_8-1723878380622.png

2. Enter the name of your pipeline, select the appropriate sensitivity label, and click on Create.

      tharunkumarRTK_9-1723878423807.png

3. Open the pipeline, then on the top menu click on the ‘Copy data’ dropdown and click on ‘Add to canvas’.

      tharunkumarRTK_10-1723878463089.png

4. Click on the added ‘Copy data’ activity; under the ‘General’ section you can enter a meaningful name for your activity.

       

5. Switch to the ‘Source’ section and select the on-premises gateway connection from the connection dropdown. Check the box next to ‘Recursively’. In case you want to delete the files from the on-premises server as soon as they are moved to the Lakehouse, check the box next to ‘Delete files after completion’.

      tharunkumarRTK_11-1723878540076.png

       

6. Switch to the ‘Destination’ section. From the ‘Connection’ dropdown, select the Lakehouse that was created before, and select the ‘File format’ as Binary.

      tharunkumarRTK_12-1723878587870.png

7. Run the pipeline, check the run status, and make sure the files are copied to the Lakehouse.

       

Now that the data is copied to the Fabric Lakehouse, in the following section we will create a Fabric notebook to transform the data and build the staging tables.

       

      Create a Notebook

1. Switch to the Synapse Data Engineering persona and click on the Notebook icon.

      tharunkumarRTK_13-1723878650795.png

      2. Select the appropriate sensitivity label and click on ‘Save and continue’

      tharunkumarRTK_14-1723878691419.png

3. Open the Notebook and, in the left-hand panel, click ‘Add’ to link the Lakehouse.

      tharunkumarRTK_15-1723878726559.png

4. Since we already created a Lakehouse in the previous steps, choose ‘Existing Lakehouse’ and select it.

5. Import the necessary libraries required to transform the raw data, initialize the relative path of the raw data folder in the Lakehouse, and initialize the schema variable with the columns in the CSV files (an illustrative sketch follows the screenshot below).

      tharunkumarRTK_17-1723878795320.png
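For reference, here is a minimal PySpark sketch of this step. It assumes the raw files are CSVs landed under a Files/raw_sales/ folder in the Lakehouse and contain the columns listed; the folder name and the schema are illustrative only and are not taken from the screenshot above.

```python
# Illustrative sketch of step 5 (assumed folder name and columns):
# import the Spark type classes, point at the raw folder in the attached
# Lakehouse, and declare the schema of the incoming CSV files.
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, DoubleType, DateType
)

# Relative path of the raw data folder in the attached Lakehouse (assumed name)
raw_data_path = "Files/raw_sales/"

# Schema of the CSV files (hypothetical columns - replace with your own)
sales_schema = StructType([
    StructField("OrderID", StringType(), True),
    StructField("OrderDate", DateType(), True),
    StructField("CustomerID", StringType(), True),
    StructField("ProductID", StringType(), True),
    StructField("Quantity", IntegerType(), True),
    StructField("UnitPrice", DoubleType(), True),
])
```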

       

6. Run the Spark code shown in the screenshot below to load the data from the CSV files into Spark DataFrames and create the staging Delta tables (an illustrative sketch follows the screenshot).

      tharunkumarRTK_18-1723878834455.png
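Continuing the illustrative names from the previous sketch, step 6 could look like the following: the CSV files are read into a Spark DataFrame and saved as a staging Delta table (the table name Staging_Sales matches the one referenced later in the dataflow section).

```python
# Read all CSV files from the raw folder into a DataFrame using the schema
# declared above, then persist it as a managed Delta table in the Lakehouse.
df_sales = (
    spark.read
         .format("csv")
         .option("header", "true")
         .schema(sales_schema)
         .load(raw_data_path)
)

# Overwrite keeps the staging table idempotent if the notebook is rerun.
(
    df_sales.write
            .format("delta")
            .mode("overwrite")
            .saveAsTable("Staging_Sales")
)
```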

7. Make sure the Delta table is created. You can verify it by expanding the Tables folder in the left-hand panel, or by running the quick check shown below.
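If you prefer to verify from code as well, a quick check (reusing the illustrative table name from the sketches above) could be:

```python
# List the tables registered in the attached Lakehouse and count the rows
# in the staging table created above.
spark.sql("SHOW TABLES").show()
spark.sql("SELECT COUNT(*) AS row_count FROM Staging_Sales").show()
```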

The raw data has been transformed; let’s normalize the fact table using Dataflow Gen2.

       

Create a Dataflow Gen2

1. Switch to the Data Factory persona and click on the Dataflow Gen2 icon.

       

2. Click on ‘Get data’ and select SQL Server. Enter the Lakehouse SQL analytics endpoint in the ‘Server’ field, click on Next, select the ‘Staging_Sales’ table created before, and click on Next.

      tharunkumarRTK_1-1723878993109.png

       

3. As per the business logic, and keeping the downstream analytics solutions in mind, normalize the fact table into dimension and fact tables.

       

      tharunkumarRTK_4-1723879069130.png

4. At the bottom right you will find an option to add a Lakehouse as the data destination. Click on Add and select the Lakehouse. Enter the table name and click on Next, then choose the update method as append or overwrite depending on the data pipeline strategy.

      tharunkumarRTK_5-1723879100951.png

5. Select the appropriate data types for each column and click on ‘Save settings’.

      6.  Repeat step 4 and step 5 for all other tables and run the dataflow.

      7. Once the data refresh is completed, open the Lakehouse or SQL analytics endpoint to validate the newly created delta tables.

Note: To reduce the run time of the dataflow, make sure you use the ‘Native SQL’ option and include the transformation logic within the SQL statement. For example, in the query shown below, instead of performing the distinct operation in the dataflow after retrieving the whole sales table, we included the distinct operation within the SQL statement.

      tharunkumarRTK_6-1723879288058.png
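The actual query is the one shown in the screenshot above; purely as an illustration (the column names are hypothetical), a native SQL statement that pushes the distinct operation down to the SQL analytics endpoint could look like this:

```sql
-- Push the DISTINCT into the SQL statement so the endpoint returns only the
-- de-duplicated rows, instead of the dataflow retrieving the whole staging
-- table and de-duplicating afterwards.
SELECT DISTINCT
    CustomerID,
    ProductID
FROM dbo.Staging_Sales;
```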

Consider truncating the staging tables and archiving the raw data files in the Lakehouse as soon as the dataflow refresh is completed. This practice can reduce the run time of subsequent runs and reduces cost.
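A minimal sketch of this clean-up, assuming it runs as a notebook activity scheduled after the dataflow refresh; the table and folder names reuse the illustrative ones from the earlier sketches.

```python
# Truncate the staging Delta table and archive the processed raw files inside
# the Lakehouse so the next run only picks up new data.
from notebookutils import mssparkutils

# Remove all rows from the staging table (Delta tables support DELETE FROM).
spark.sql("DELETE FROM Staging_Sales")

# Move the raw files into an archive folder (assumed folder names).
raw_folder = "Files/raw_sales"
archive_folder = "Files/archive/raw_sales"
mssparkutils.fs.mkdirs(archive_folder)
for f in mssparkutils.fs.ls(raw_folder):
    mssparkutils.fs.mv(f.path, f"{archive_folder}/{f.name}")
```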

       

      Create a Power BI Semantic model and build the Power BI Report

1. Switch to the Power BI persona, open the Lakehouse, and at the top you will find an option to create a ‘New semantic model’.

      tharunkumarRTK_7-1723879319294.png

2. Enter a name for your semantic model, select all the production-grade (non-staging) tables, and click on Confirm.

       

      3. Open the new semantic model and establish the relationships between the dimension tables and fact tables.

      tharunkumarRTK_9-1723879400208.png

      4. After completing the necessary data modelling, create appropriate measures for building the dashboard.

5. Click on the ‘New report’ button at the top to create the reports as per the requirements. A sample report would look like this:

       

      tharunkumarRTK_11-1723879473458.png

      Data Refresh

      To refresh all the key data artifacts, instead of scheduling them individually or triggering them manually, orchestrating all the activities in the data pipeline would be the best option.

      In the same pipeline that we created in the previous steps, include the notebook, dataflow and semantic model refresh activities and schedule the pipeline.

      tharunkumarRTK_12-1723879508785.png

You can also consider including failure alerts; as soon as any activity in the pipeline fails, you will be notified with an email.

Comments

Thank you @tharunkumarRTK for posting this, it is really helpful.

 

I would like to ask how we can post articles the way you posted in the community. Where can I find that create option? Basically, I want to know how to be a blog contributor inside this community.

Hi @dharmendars007 

 

Reach out to the community admin @Natalie_iTalent and share the gist of your article; they can grant the permissions.

 

Best Regards

Nice end-to-end example. In my case, my fact table is 900 billion rows, and on an F8 capacity, model creation in the web service fails with errors. DirectQuery using the SQL endpoint does not work, as it appears the SQL endpoint does not support partition pruning.

@KailashPareek 

I think you need to evaluate the SQL queries that are executing against your SQL endpoint. You can use ‘Query activity’ to find all the queries and see if there is any scope for optimization.

It is a simple select * from table where DATE=<> and HOUR=<>, and there is a partition on DATE/HOUR. The SQL endpoint does not seem to use the partition info and select only data from the relevant Parquet files of the Delta table.