HamidBee
Power Participant

Advice Needed: Building a Data Pipeline in Microsoft Fabric for Monthly Transactional Data

Hello everyone,

 

I’m currently working on a data pipeline project in Microsoft Fabric that involves processing transactional data emailed to me monthly, and I’d be grateful for any advice on the best approach to structuring this pipeline.

 

Project Details:

 

  • Data Source: Each month, I receive a non-normalized Excel file containing transaction data via email.
  • Objective: The goal is to automate the ingestion, transformation, and loading process to organize this data into a star schema within a data warehouse, preparing it for analytics and reporting.
  • End Use: Once processed, the data should be optimized for fast querying and analysis.

 

Current Strategy:

 

  1. Ingestion: Using a Copy Activity to move the Excel data from the Fabric Lakehouse folder into a staging table within the data warehouse.
  2. Transformation: Considering both Stored Procedure and Notebook Activities for data transformations. The process includes cleansing, deduplication, and mapping to final dimension and fact tables.
  3. Normalization: After transformations, I aim to load the data into a normalized schema, with tables like DimLocation, DimTransactionType, and FactTransactions (I've sketched this flow below).
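
To make this concrete, here is a rough PySpark sketch of what I have in mind for steps 2 and 3. The table names, column names, and the key column are placeholders; the real file will need its own mapping.

```python
# Rough sketch only: table and column names below are placeholders.
from pyspark.sql import functions as F

# Read the monthly extract staged by the Copy Activity
# (spark is the session a Fabric notebook provides)
staging = spark.read.table("staging_transactions")

# Cleanse and deduplicate on the business key
clean = (
    staging
    .dropDuplicates(["transaction_id"])
    .withColumn("transaction_date", F.to_date("transaction_date"))
    .filter(F.col("amount").isNotNull())
)

# Derive dimensions from the distinct attribute values
# (surrogate key generation omitted for brevity)
dim_location = clean.select("location_code", "location_name").distinct()
dim_txn_type = clean.select("transaction_type").distinct()

# The fact table keeps the measures plus the keys back to the dimensions
fact = clean.select(
    "transaction_id", "location_code", "transaction_type",
    "transaction_date", "amount"
)

# Persist as Lakehouse Delta tables; the final load into the Warehouse
# could remain a Copy Activity or stored procedure step
dim_location.write.mode("overwrite").saveAsTable("DimLocation")
dim_txn_type.write.mode("overwrite").saveAsTable("DimTransactionType")
fact.write.mode("append").saveAsTable("FactTransactions")
```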

 

Questions:

 

  1. Stored Procedures vs. Notebooks for Transformations: Since some transformations are complex, would you recommend prioritizing Notebooks (e.g., PySpark) or Stored Procedures? Or a combination of both?
  2. Monthly File Handling: Given the monthly data drop, is it more efficient to use dynamic pipeline parameters for each file, or should I set up new pipelines as files arrive?
  3. Performance Optimization: Any suggestions for optimizing performance, especially for handling large datasets and ensuring transformations remain cost-effective?

 

Thanks in advance for any insights or experiences you can share.

1 ACCEPTED SOLUTION
Srisakthi
Continued Contributor

Hi @HamidBee ,

 

If you are more familiar with SQL, you can go with the Stored Procedure activity, but for more complex transformations it is better to use Spark.

Stored Procedures vs. Notebooks for Transformations:

For complex transformations it is typically better to use a Notebook (PySpark). You also have the flexibility to configure the Spark settings for the workload, such as executor sizing and dynamic scaling, for better performance and cost efficiency.
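
For example, a rule like "keep only the latest version of each transaction" is only a few lines in a notebook. This is an illustration only, with placeholder table and column names:

```python
# Illustration only: table and column names are placeholders.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

staging = spark.read.table("staging_transactions")

# Keep the most recent row per transaction_id, assuming a load_timestamp column exists
w = Window.partitionBy("transaction_id").orderBy(F.col("load_timestamp").desc())
latest = (
    staging
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
```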

Monthly File Handling:

Data pipelines have Triggers (a preview feature, so check with your internal team); you can fire the trigger on file arrival in Azure Blob Storage and name your files in a pattern that works with dynamic pipeline parameters. If you don't want to use Triggers, you can schedule the pipeline on your monthly cadence and use a parameterized file name for the pickup.
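
As a sketch of the parameterized approach (paths, file layout, and parameter names are just examples), the pipeline can pass the period into the notebook on each run, and the notebook derives the expected file name from it:

```python
# Parameters cell: the pipeline overrides this value on each monthly run.
period = "2025-06"  # example value only

# Expected location of the monthly drop, assuming the default Lakehouse is attached
# so its Files area is mounted at /lakehouse/default/Files
file_path = f"/lakehouse/default/Files/incoming/transactions_{period}.xlsx"

# Read the Excel file with pandas (assumes an Excel engine such as openpyxl is
# available in the runtime), then stage it as a Spark table
import pandas as pd

pdf = pd.read_excel(file_path)
spark.createDataFrame(pdf).write.mode("overwrite").saveAsTable("staging_transactions")
```

If the pickup stays in the Copy Activity instead, the same period pattern can be built with pipeline expression functions such as formatDateTime(utcNow(), 'yyyy-MM').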

Performance Optimization:

Analyse the load pattern and adjust the Spark configuration accordingly for cost effectiveness and performance, using the Fabric Capacity Metrics app to monitor capacity usage.
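
A couple of illustrative starting points only (tune them against what the Capacity Metrics app actually shows; executor sizing and dynamic allocation are set at the environment or session level rather than mid-session):

```python
# Starting points only; a single monthly Excel file is a small batch,
# so the default shuffle parallelism is usually far higher than needed.
spark.conf.set("spark.sql.shuffle.partitions", "32")  # fewer shuffle partitions for a small batch
spark.conf.set("spark.sql.adaptive.enabled", "true")  # AQE coalesces partitions at runtime (often on by default)
```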

 

Thanks,

Srisakthi

 

 

 


2 REPLIES

Thanks for sharing this information.
