topic Re: Full Load from API Skipping some files in Data Engineering

Full Load from API Skipping some files

max_mrc — Tue, 20 May 2025 02:51:36 GMT

I have created a pipeline in Fabric that does a full load from an API endpoint. The pipeline gets the data from the API endpoint and loads the files into a Lakehouse. I tried it in the sandbox environment and it works perfectly. However, when I ran the pipeline in production, the capacity was exceeded. I upgraded from F32 to F64, but the capacity was still exceeded. I let it run and the pipeline succeeded, but I got only 123,000 of 210,000 records. There is no pattern either; it just skipped a few files in between. Is this caused by the bursting and smoothing that Fabric uses? What is the best way to tackle this issue? Can I make my pipeline use only 80% of the capacity? Taking longer is no problem, but I need the full load to complete.

Re: Full Load from API Skipping some files

v-lgarikapat — Tue, 20 May 2025 11:17:32 GMT

Hi @max_mrc ,

Thanks for reaching out to the Microsoft Fabric community forum.

Causes

API Pagination/Throttling Behaviour

Some API calls may fail silently or be throttled, returning partial data.
If you're calling the API in parallel or in large batches, errors may not be retried correctly, and Fabric might skip the failed chunks.
Check if your pipeline has error-handling or retries on failed API calls.

Fabric Capacity Bursting/Smoothing

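Fabric lets a workload burst above the provisioned capacity and then smooths the recorded consumption over time.
If the accumulated overage grows too large or lasts too long, Fabric throttles the capacity: operations are first delayed and then rejected, and background operations such as pipeline runs are rejected only after sustained overage.
This could contribute to skipped files or data. Throttled operations are usually reported as delayed or rejected, so check the run history of each activity for errors.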

Concurrency / Parallelism

If your pipeline processes API calls in parallel, it can spike your capacity usage.
That spike may lead to task drops or silent failure of lower-priority tasks.

Best Practices & Fixes

1. Limit Concurrency

In your copy activity or loop, set degree of parallelism to a lower number (e.g., 2–4).
Fabric defaults to high parallelism in some cases, which might spike capacity use.
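If the load is run from a Python notebook rather than a pipeline, the same idea can be sketched with a bounded thread pool. This is a minimal sketch, not a Fabric API: `run_bounded` and `fetch_one` are hypothetical names, and `fetch_one` stands for whatever function performs a single API call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_bounded(
    fetch_one: Callable[[T], R],
    items: Iterable[T],
    max_workers: int = 2,
) -> List[R]:
    """Call fetch_one on every item using at most max_workers threads.

    Results come back in input order. An exception raised by any call
    propagates to the caller instead of being dropped.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises worker exceptions
        # when the corresponding result is consumed by list().
        return list(pool.map(fetch_one, items))
```

Keeping `max_workers` small trades run time for a flatter load on both the API and the capacity.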
2. Throttle API Calls

Introduce a wait/delay (e.g., sleep 1s) between each API call or page fetch.
Use a custom Until loop or ForEach with delay to slow down execution and reduce load.
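In a notebook, a paced page loop could look like the sketch below. The response shape is an assumption: `fetch_page` is a hypothetical callable that takes a continuation token and returns a dict with a `records` list and a `next` token (or `None` on the last page). Adapt both to the real API.

```python
import time
from typing import Callable, Dict, Iterator, List, Optional


def fetch_all_pages(
    fetch_page: Callable[[Optional[str]], Dict],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[List[dict]]:
    """Yield each page's records, pausing between consecutive requests.

    fetch_page(token) is assumed (hypothetically) to return
    {"records": [...], "next": <token or None>}.
    """
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        yield page["records"]
        token = page.get("next")
        if token is None:
            break  # last page reached, no further delay needed
        sleep(delay_seconds)  # fixed pause before requesting the next page
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use the default `time.sleep` applies.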

3. Implement Retries and Logging

Use robust retry logic in each API call (3+ retries with exponential backoff).
Log each API call result (success/failure), even in a separate table if needed, to detect any silent skips.
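The retry and logging advice above can be sketched in Python as follows. This is an illustration under assumptions: `call_with_retry`, `label`, and the logger name `api_load` are hypothetical, and the sketch retries on any exception, which a real load would likely narrow to throttling and transient network errors.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("api_load")  # hypothetical logger name


def call_with_retry(
    call: Callable[[], T],
    label: str,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on any exception with exponential backoff.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last error, so a failed chunk cannot be skipped silently.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            result = call()
            log.info("%s succeeded on attempt %d", label, attempt)
            return result
        except Exception as exc:
            log.warning("%s failed on attempt %d: %s", label, attempt, exc)
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Because every attempt is logged with its label, the log can later be compared against the list of expected pages or files to spot gaps.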

4. Partition Your Load

Break the full load into smaller, deterministic partitions (e.g., by date, region, or ID range).
This helps in tracking what's been loaded and allows for easy retry of failed partitions.
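As a rough sketch of date-based partitioning with per-partition tracking, assuming the API can be filtered by a date range: `date_partitions`, `load_partitions`, and `load_one` are hypothetical names, and `load_one(lo, hi)` stands for whatever loads one window.

```python
from datetime import date, timedelta
from typing import Callable, Dict, Iterable, Iterator, Tuple

Window = Tuple[date, date]


def date_partitions(start: date, end: date, days: int) -> Iterator[Window]:
    """Yield half-open [lo, hi) windows of at most `days` days covering [start, end)."""
    lo = start
    while lo < end:
        hi = min(lo + timedelta(days=days), end)
        yield lo, hi
        lo = hi


def load_partitions(
    load_one: Callable[[date, date], None],
    partitions: Iterable[Window],
) -> Dict[Window, str]:
    """Load each window and record "ok" or the error text for it.

    A failure in one window does not stop the others, and the returned
    status map shows exactly which windows need to be re-run.
    """
    status: Dict[Window, str] = {}
    for part in partitions:
        try:
            load_one(*part)
            status[part] = "ok"
        except Exception as exc:
            status[part] = f"failed: {exc}"
    return status
```

Re-running only the windows whose status is not `"ok"` avoids repeating the whole full load after a partial failure.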

5. Monitor Capacity Consumption

Use the Microsoft Fabric Capacity Metrics app to watch capacity unit (CU) consumption, overage, and throttling over time.
Set alerts for near-capacity thresholds to know when you’re close to limits.

6. Use Dataflow Gen2 or Data Pipeline Alternatives

For larger full loads, consider:

Dataflow Gen2 (if available) for chunked API ingestion.
Staging into Blob/ADLS first, then processing into Lakehouse.

Optional: "Take Only 80% of Capacity"

There’s no direct way to say “only use 80% of capacity,” but you can simulate that behaviour by:

Reducing parallelism.
Adding delay/sleep in loops.
Spreading load over more pipeline runs (e.g., time-partitioned loads).
Reducing dataset size per copy activity.

Evaluate and optimize your Microsoft Fabric capacity - Microsoft Fabric | Microsoft Learn
Understand your Fabric capacity throttling - Microsoft Fabric | Microsoft Learn
Plan your capacity size - Microsoft Fabric | Microsoft Learn
Smoothing and Throttling - Microsoft Fabric | Microsoft Learn
Scale your Fabric capacity - Microsoft Fabric | Microsoft Learn

If this post helped resolve your issue, please consider giving it Kudos and marking it as the Accepted Solution. This not only acknowledges the support provided but also helps other community members find relevant solutions more easily.

We appreciate your engagement and thank you for being an active part of the community.

Best regards,
LakshmiNarayana


Re: Full Load from API Skipping some files

gslick — Tue, 20 May 2025 18:43:56 GMT

Instead of a pipeline, could you use a Python notebook to connect to the API?

Re: Full Load from API Skipping some files

v-lgarikapat — Wed, 21 May 2025 11:40:23 GMT

@gslick ,
Thanks for the follow-up question.
Yes. If your API integration involves complex logic, or you are hitting throttling or capacity limits, moving to a Python notebook is a reasonable approach. It gives you more control over pagination, retries, throttling, and error handling, and makes custom logging and batching easier to implement. A notebook can either complement or replace the pipeline, depending on your use case.
Use Pagination with Fabric REST APIs - Microsoft Fabric REST APIs | Microsoft Learn
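One thing a notebook makes easy is a completeness check at the end of the load, so a shortfall such as 123,000 of 210,000 records fails loudly instead of passing as success. The sketch below assumes the expected total is known, for example from a count the API reports; `collect_and_verify` and `expected_total` are hypothetical names.

```python
from typing import Iterable, List


def collect_and_verify(pages: Iterable[List[dict]], expected_total: int) -> List[dict]:
    """Gather records from all pages and fail if the count is off.

    Raises RuntimeError when the number of collected records differs from
    expected_total, so an incomplete load is never reported as a success.
    """
    records: List[dict] = []
    for page in pages:
        records.extend(page)
    if len(records) != expected_total:
        raise RuntimeError(
            f"Incomplete load: got {len(records)} of {expected_total} records"
        )
    return records
```

For very large loads it may be preferable to write each page to the Lakehouse as it arrives and compare only the running count, rather than holding every record in memory.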

We appreciate your engagement and thank you for being an active part of the community.

Best regards,
LakshmiNarayana.

Re: Full Load from API Skipping some files

v-lgarikapat — Mon, 26 May 2025 06:08:29 GMT

Hi @max_mrc ,

If your issue has been resolved, please consider marking the most helpful reply as the accepted solution. This helps other community members who may encounter the same issue to find answers more efficiently.

If you're still facing challenges, feel free to let us know—we’ll be glad to assist you further.

Looking forward to your response.

Best regards,
LakshmiNarayana.