I'm building an incremental loading dataflow in Microsoft Fabric to process budget data from Excel files stored in SharePoint. The solution works, but it requires 5+ steps and manual notebook execution, and I suspect we're overcomplicating it. I'm looking for suggestions on whether there's a smarter way to leverage Fabric's built-in features.
Microsoft Fabric's Dataflow Gen2 has incremental refresh support, but I cannot use it because my first query uses a custom Power Query function (`fnGetFiles_Base1`) that:
- Recursively navigates SharePoint folder structures dynamically
- Doesn't hardcode paths (scalable for 20+ departments)
- Uses SharePoint.Contents(), which appears to be incompatible with incremental refresh
MY HYPOTHESIS: Incremental refresh requires direct data source connections, not custom functions with external fetches. Is this correct?
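For reference, here is a simplified sketch of the kind of recursive walk fnGetFiles_Base1 performs. The real function isn't shown here, so the site URL, folder structure, and column handling below are assumptions for illustration only:

```
// Hypothetical sketch of a recursive folder walk over SharePoint.Contents.
// Site URL and column handling are assumptions, not the actual fnGetFiles_Base1.
let
    fnGetFilesSketch = (folder as table) as table =>
        let
            // Files carry a binary in [Content]; folders carry a nested table
            Files      = Table.SelectRows(folder, each [Content] is binary),
            SubFolders = Table.SelectRows(folder, each [Content] is table),
            // Recurse into each sub-folder (@ refers to the function being defined)
            Nested     = List.Transform(SubFolders[Content], each @fnGetFilesSketch(_)),
            Combined   = Table.Combine({Files} & Nested)
        in
            Combined,
    Root   = SharePoint.Contents("https://contoso.sharepoint.com/sites/Budget", [ApiVersion = 15]),
    Result = fnGetFilesSketch(Root)
in
    Result
```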
Our current solution
Step 1: File Discovery
├─ Query: Find_V1_A2_Files (searches for files matching specific naming conventions)
├─ Action: Fetch ALL files from SharePoint + identify by filename pattern
├─ Logic: Uses fnGetFiles_Base1() custom function + filename validation
├─ Output: All files matching the naming convention, plus a custom LoadingTime timestamp column (sketched below)
└─ Destination: Lakehouse (Replace mode)
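A minimal sketch of what Step 1 produces, assuming fnGetFiles_Base1 takes a site URL and returns a listing with a [Name] column (both assumptions), and with placeholder name patterns:

```
// Sketch of Step 1: filename pattern filter + LoadingTime stamp.
// The fnGetFiles_Base1 signature and the "V1"/"A2" patterns are assumptions.
let
    AllFiles = fnGetFiles_Base1("https://contoso.sharepoint.com/sites/Budget"),
    Matching = Table.SelectRows(AllFiles, each Text.Contains([Name], "V1") or Text.Contains([Name], "A2")),
    Stamped  = Table.AddColumn(Matching, "LoadingTime", each DateTimeZone.UtcNow(), type datetimezone)
in
    Stamped
```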
Step 2: Previous Run Reference
├─ Query: Find_V1_A2_Files_Previous (the same query as Step 1; its result is used in the next step)
├─ Action: Cache the previous run's results
└─ Purpose: Enables incremental comparison
Step 3: Incremental Filtering (Manual Implementation)
├─ Query: Find_V1_A2_Files_Previous_Filtered
├─ Logic: JOIN + FILTER (sketched below)
│ - JOIN: Current vs Previous by [Name]
│ - Filter: WHERE [Date modified] > [LoadingTime_Previous]
├─ Output: ONLY new/modified files
└─ No destination (intermediate query)
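A sketch of that Step 3 comparison, assuming the query and column names above and that both timestamps are comparable datetimezone values (the "_Previous" suffix is assumed to be added during expansion):

```
// Sketch of Find_V1_A2_Files_Previous_Filtered: keep only new or modified files.
// Column names follow the post; the expansion details are assumptions.
let
    Current  = Find_V1_A2_Files,
    Previous = Find_V1_A2_Files_Previous,
    Joined   = Table.NestedJoin(Current, {"Name"}, Previous, {"Name"}, "Prev", JoinKind.LeftOuter),
    Expanded = Table.ExpandTableColumn(Joined, "Prev", {"LoadingTime"}, {"LoadingTime_Previous"}),
    // New files have no previous row; modified files have a newer Date modified
    Changed  = Table.SelectRows(
        Expanded,
        each [LoadingTime_Previous] = null or [Date modified] > [LoadingTime_Previous]
    )
in
    Changed
```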
Step 4: Data Processing
├─ Query: Department_V1 (processes V1 files)
├─ Query: Department_V2 (processes V2 files)
├─ Input: Uses Find_V1_A2_Files_Previous_Filtered
├─ Logic (sketched below):
│ - Reads Excel workbooks
│ - Expands data tables
│ - Adds LoadingTime_Prosessed for tracking
└─ Destination: Lakehouse (Append mode)
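A sketch of the Step 4 shape, assuming each filtered row still carries the file binary in a [Content] column and that the budget data sits on a sheet named "Budget" with the columns shown; all of these are assumptions, not details from the post:

```
// Sketch of Department_V1-style processing: parse Excel, expand, stamp.
// Sheet name and expanded column names are placeholders.
let
    Changed  = Find_V1_A2_Files_Previous_Filtered,
    Parsed   = Table.AddColumn(Changed, "Data",
                   each Excel.Workbook([Content], true){[Item = "Budget", Kind = "Sheet"]}[Data]),
    Expanded = Table.ExpandTableColumn(Parsed, "Data", {"Department", "Account", "Amount"}),
    Stamped  = Table.AddColumn(Expanded, "LoadingTime_Prosessed", each DateTimeZone.UtcNow(), type datetimezone)
in
    Stamped
```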
Since we use Append mode, if a file is modified again after initial processing, the same rows (identified by three key columns) get appended again. This creates duplicates that require post-processing deduplication, so the final step is deduplication in a notebook.
Step 5: Deduplication with Notebook
├─ Tool: Python notebook with PySpark
├─ Logic:
│ - Window function: RANK BY (column1, column2, column3)
│ ordered by DESC(LoadingTime_Prosessed)
│ - Filter: Keep only rank = 1
│ - Output: Retain latest version of each record
└─ Action: OVERWRITE table in Lakehouse
1. Can incremental refresh work with REST API-based SharePoint access instead of SharePoint.Contents()?
2. Are we missing a Fabric-native alternative to this architecture?
I would greatly appreciate any feedback or insights from the community.
Yes, incremental refresh in Fabric Dataflow Gen2 requires a data source that supports query folding, and SharePoint.Contents() breaks folding. Once folding is broken, Fabric cannot automatically track deltas or apply range filters, which makes incremental refresh unavailable.
1. REST API-based SharePoint access:
You can implement incremental refresh if you switch to the SharePoint REST API or Graph API via Web.Contents(). This approach lets you control pagination and filtering (?$filter=Modified ge ...), making the source behave like a foldable one and enabling Fabric's incremental refresh to work.
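For illustration, a minimal sketch of such a call, assuming a placeholder site URL, list title, and a hardcoded watermark (in practice the watermark would come from the incremental refresh window or the last stored LoadingTime):

```
// Sketch of a SharePoint REST call with server-side filtering on Modified.
// Site URL, list title, and the watermark value are placeholders.
let
    Watermark = "2025-01-01T00:00:00Z",
    Response  = Web.Contents(
        "https://contoso.sharepoint.com/sites/Budget",
        [
            RelativePath = "_api/web/lists/getbytitle('Documents')/items",
            Query = [
                #"$select" = "FileLeafRef,FileRef,Modified",
                #"$filter" = "Modified ge datetime'" & Watermark & "'"
            ],
            Headers = [Accept = "application/json;odata=nometadata"]
        ]
    ),
    Json  = Json.Document(Response),
    Items = Table.FromRecords(Json[value])
in
    Items
```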
2. Fabric-native alternative:
A more maintainable architecture is:
- Use a Dataflow Gen2 (staging) that calls the SharePoint REST API and lands file metadata (file name, modified date, path) in the Lakehouse in Replace mode.
- Use a second dataflow or Data Pipeline to perform an incremental merge (using ModifiedDate as the watermark) into a curated Lakehouse table.
- Use a notebook or pipeline activity only for business logic, not deduplication; Fabric's Delta merge in the Lakehouse (via Spark SQL MERGE INTO) can handle updates natively.
In short, I can advise the following; you have to decide which is apt for your scenario:
- Incremental refresh is not supported with SharePoint.Contents() or custom recursive functions.
- Switch to the REST API + metadata staging + Lakehouse merge for a Fabric-native, automated, and scalable incremental load.
@Felpan, is the issue resolved now, or are you still facing any difficulties? If you need any additional details or support, please feel free to share.
Thank you.
Hi @Felpan,
I'd like to thank @lbendlin for their valuable contributions to the community and for sharing helpful solutions.
@Felpan, I hope @lbendlin's guidance was useful in resolving your issue. If you have any more questions or need additional support, please don't hesitate to ask. We're here to help.
Best Regards,
Community Support Team
MY HYPOTHESIS: Incremental refresh requires direct data source connections, not custom functions with external fetches. Is this correct?
"requires" is a strong word. "prefers" describes it better. But yes, a Direct Query source is better at limiting the data that needs to travel over the network for the processing of each partition.
Don't use Excel files. Use CSV archives.
Don't use recursive functions. There's really no need for them other than the academic joy.
Use the standard SharePoint.Contents and filter your Folder Path and File Name patterns early (see the sketch after this list)
Avoid any merges or "Combine binaries". Do this yourself. I have a blog entry on that option: Streamlined process for appending multiple files w... - Microsoft Fabric Community
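A minimal sketch of that "filter early" pattern, navigating directly to a known library and folder before touching any binaries (site URL, library, folder, and name pattern are all placeholders, not details from the thread):

```
// Sketch of filtering early with SharePoint.Contents.
// Site URL, library, folder, and the name pattern are placeholders.
let
    Site     = SharePoint.Contents("https://contoso.sharepoint.com/sites/Budget", [ApiVersion = 15]),
    Library  = Site{[Name = "Shared Documents"]}[Content],
    Dept     = Library{[Name = "Finance"]}[Content],
    Matching = Table.SelectRows(Dept, each Text.StartsWith([Name], "V1_") and Text.EndsWith([Name], ".xlsx"))
in
    Matching
```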
"Don't use Excel files. Use CSV archives." Are there CSV archives of Excel files in SharePoint? Can you explain?
Excel is more costly to ingest due to the metadata overhead.
Incremental refresh against non-folding sources will require ALL source data to be fetched before the partition filtering. Using ZIP files makes that much more palatable.