Speed of writing to Fabric Data Warehouse from Fabric Pipeline vs ADF Pipeline

smoqt · 04-26-2024

I need to move a large amount of data from CSVs stored in Azure Blob Storage to a Microsoft Fabric Data Warehouse.

Initially I was planning to use ADF to do this because I may need to use the Tumbling Window Triggers (which to my knowledge are not yet available in Fabric).

Doing so has proven to be quite slow. These files are stored with varying paths within the container, thus requiring the use of wildcard paths to access them. When attempting to use wildcard paths in ADF, I am required to enable staging on the Copy Data activity. If I don't, it fails the pipeline validation.

With staging enabled, it takes a significant amount of time to load the data. For example, loading just 2.6 GB took roughly 2.5 hours.
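To put that figure in perspective, here is a rough back-of-the-envelope calculation of the effective throughput. It is only a sketch: the function name is made up for illustration, and it assumes 1 GB = 1024 MB.

```python
def throughput_mb_per_s(size_mb: float, duration_s: float) -> float:
    """Return the effective throughput in MB/s for a completed copy."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return size_mb / duration_s


# Figures from the run described above: 2.6 GB in roughly 2.5 hours.
size_mb = 2.6 * 1024        # assumes 1 GB = 1024 MB
duration_s = 2.5 * 3600     # hours to seconds
rate = throughput_mb_per_s(size_mb, duration_s)
print(f"{rate:.2f} MB/s")   # prints "0.30 MB/s"
```

That works out to roughly 0.3 MB/s end to end.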

If I were to recreate and run the same pipeline from Microsoft Fabric, would it be faster? Is there a better way?

v-nikhilan-msft · 04-30-2024

Hi @smoqt
The internal team replied as follows:
Moving 2.6 GB of files should generally be much faster than 2.5 hours. The only difference that affects throughput between Fabric and ADF is that in Fabric you can leverage the native workspace staging storage, while in ADF you have to bind staging to an Azure Storage account.

Hope this helps. Please let me know if you have any further questions.

smoqt · 04-30-2024

Thank you. I will continue testing.

v-nikhilan-msft · 04-30-2024

Hi @smoqt

Please do let me know if you have any further questions.

smoqt · 05-01-2024

I wanted to share another example, this time one where I did not receive throttling errors. Writing 123 MB took only 4 minutes to blob storage but 2.5 hours to Fabric.

v-nikhilan-msft

Hi @smoqt
Can you please provide the run IDs for the above two pipelines?

Thanks

smoqt

Example 1:

Pipeline Run ID: ddbd9f52-da90-43f3-afe1-aa1458b65f59

Activity Run ID: c9f1a5e0-0ff3-4402-b317-ef23111b505c

Example 2:

Pipeline Run ID: 31a31bab-42df-4b36-9777-c83ebb758b52

Activity Run ID: dc622eca-8676-4a13-a120-ca1fa962c2e1

v-nikhilan-msft

Thanks for the details @smoqt
The internal team is looking into the issue. Meanwhile, you can try this, as advised by the team:

If the source has too many small files, loading into the Warehouse with the COPY command will be significantly slower. You can use a separate copy job with 'Copy Behavior' = 'merge files' to merge the small files into one single large file first; the performance will then be better.
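To illustrate what merging small files means in practice, here is a minimal local sketch in Python. It is not the pipeline's own merge behavior, just an illustration of the idea; the function name `merge_csv_files` is hypothetical, and it assumes every source CSV has the same header row and is UTF-8 encoded.

```python
import csv
from pathlib import Path
from typing import Iterable


def merge_csv_files(sources: Iterable[Path], target: Path) -> int:
    """Concatenate CSV files into one file, writing the header only once.

    Assumes all source files share the same header row.
    Returns the number of data rows written.
    """
    rows_written = 0
    header = None
    with target.open("w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for source in sources:
            with source.open(newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    # First non-empty file defines the header.
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"Header mismatch in {source}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

For example, `merge_csv_files(sorted(Path("input").rglob("*.csv")), Path("merged.csv"))` would combine every CSV under a hypothetical `input` folder into one file.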

Hope this helps. Please let me know if you have any further questions.

smoqt

Thanks, @v-nikhilan-msft. I will test.

smoqt · 04-30-2024

Here is an example from today.

Currently at 1 hr 12 min for 152 MB.

I see throttling errors and a significant performance difference between writing to Azure Blob Storage and writing to Fabric Warehouse.

I am using the default settings for Maximum DIU and Degree of Copy Parallelism ("Auto").
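For reference, these two settings correspond to the `dataIntegrationUnits` and `parallelCopies` properties in the Copy activity's JSON definition. Below is a sketch of what setting them explicitly might look like, built as a Python dict and printed as JSON. The values 16 and 8 are arbitrary placeholders rather than recommendations, the linked service name and staging path are hypothetical, and whether explicit values help with the throttling here is untested.

```python
import json

# Fragment of a Copy activity's "typeProperties".
# Omitting the first two properties leaves them on "Auto".
copy_type_properties = {
    "dataIntegrationUnits": 16,  # Maximum DIU (placeholder value)
    "parallelCopies": 8,         # Degree of copy parallelism (placeholder value)
    "enableStaging": True,
    "stagingSettings": {
        "linkedServiceName": {
            "referenceName": "StagingBlobStorage",  # hypothetical linked service
            "type": "LinkedServiceReference",
        },
        "path": "staging-container/tmp",  # hypothetical staging path
    },
}

print(json.dumps(copy_type_properties, indent=2))
```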

I will research how to mitigate the throttling errors; however, I'm curious whether the throttling is happening strictly on the Azure side or on the Fabric side.

v-nikhilan-msft · 04-28-2024

Hi @smoqt
Thanks for using Fabric Community.
At this time, we are reaching out to the internal team to get some help on this. We will update you once we hear back from them.
Thanks
