v-cyu
Microsoft Employee

Ways to Efficiently Load Large Amounts of Data from Cosmos Structured Streams?

We are looking for ways to efficiently load data from Cosmos structured streams into a Fabric Lakehouse/KQL Database.

 

For a day's worth of data, there are 12 structured streams at roughly 1 TB and 2 billion rows per stream.
So far, we have tested at the scale of 2 hours, 1 day, and 7 days' worth of data.

 

Current Fabric Capacity: F256


We have been relying on the Data Pipeline Copy activity to copy from each Cosmos structured stream into a Lakehouse table/KQL database.

Questions:

1. Are there ways to configure the copy activity to load multiple structured streams at once?

Right now, it seems that one copy activity can only be configured to copy from one structured stream.

Because of that, to load one day of data, for example, we need to combine the 12 structured streams belonging to the same day into a single structured stream and load that.
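For reference, a minimal sketch of an alternative, assuming the streams could first be staged as files (e.g., Parquet) somewhere a Fabric Spark notebook can read; this is not something the copy activity itself offers, and all paths and the table name below are hypothetical placeholders.

```python
# Sketch only: read 12 staged stream extracts for one day in a single Spark job
# and append them to one Lakehouse table. Assumes the structured streams have
# already been exported as Parquet to a reachable location; the paths and the
# table name are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

day = "2024-04-01"
stream_paths = [
    f"Files/cosmos_export/{day}/stream_{i:02d}/"  # assumed staging layout
    for i in range(12)
]

# DataFrameReader.parquet accepts multiple paths, so one read covers all 12 streams.
df = spark.read.parquet(*stream_paths)

df.write.mode("append").saveAsTable("daily_events")  # table name is an assumption
```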

 

2. Is there a Partition option when copying from a Cosmos structured stream?

[Screenshot: performance tip shown while the copy activity runs (vcyu_1-1714428216354.png)]

While running the copy activity, we are seeing the above performance tip.

However, the "Partition option" setting does not appear to exist in the copy activity for a Cosmos structured stream source.
There is a Partition index option, but that limits the copy activity to rows with that particular index only, which is not what we are looking for.

[Screenshot: copy activity source settings showing the Partition index option (vcyu_2-1714428320126.png)]
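As a sketch of a possible workaround, assuming the load went through a Spark notebook instead of the copy activity: parallelism can be controlled explicitly by repartitioning before the write. The staged path, partition column, and table name below are illustrative assumptions.

```python
# Sketch only: control write parallelism in a notebook when no partition option
# is exposed in the copy activity. Path, column, and table names are assumptions;
# 256 partitions is just a starting point to tune, not a recommendation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("Files/cosmos_export/2024-04-01/")  # assumed staging path

df = df.repartition(256)  # spread rows evenly across 256 tasks/output files

(df.write
   .mode("append")
   .partitionBy("EventDate")   # physical partitioning of the Delta table (assumed column)
   .saveAsTable("daily_events"))
```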

 


3. What would be the best practice for efficiently loading data this large?
To test capacity, we extended the test to load 7 days' worth of data using the method above.

The load to a Lakehouse table took 32 hours, and the load to a KQL database took 21 hours.

However, while doing so, the load to KQL caused us to exceed our capacity limit, bringing the whole environment to a halt.
We are eventually looking at the possibility of loading 18 months of data.
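For illustration, a day-chunked backfill along the following lines might keep each run inside the capacity; it relies on the same staging assumptions as the sketches above, and every path and name is a placeholder.

```python
# Sketch only: backfill day by day instead of in one 7-day (or 18-month) run,
# so each load stays well inside the capacity. Assumes each day's extract sits
# in its own folder; paths and the table name are hypothetical.
from datetime import date, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

start, end = date(2024, 1, 1), date(2024, 1, 7)

d = start
while d <= end:
    day_df = spark.read.parquet(f"Files/cosmos_export/{d.isoformat()}/")
    day_df.write.mode("append").saveAsTable("daily_events")
    d += timedelta(days=1)
```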

Any pointers/recommendations/documentation would be helpful.

4 REPLIES
Anonymous
Not applicable

Hi @v-cyu,

Thanks for using Fabric Community.
At this time, we are reaching out to the internal team to get some help on this.
We will update you once we hear back from them.

Anonymous
Not applicable

Hi @v-cyu,

We got a response from the internal team:

This Cosmos Structured Stream source... this is not Cosmos DB, is it? This sounds like the legacy Microsoft-internal tool called Cosmos. If that's the case, there's not much we can do on that front, but nonetheless, 29 TB is not that big in the grand scheme of things, so a KQL Database should have no issues with it. What I suspect is going on here is that you have a "small files" problem. If that 29 TB is comprised of billions of small files stored on ADLS Gen1, simply reading in all these individual items will lead to memory consumption issues. In other words, if these were nicely compressed Parquet files, we'd have better luck. Can you confirm this hypothesis? Can you check the individual file sizes and the overall number of files to ingest?


Can you please share the above details?
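For anyone wanting to check the small-files hypothesis, one rough way, assuming the staged data is reachable from a Fabric Spark notebook (the path below is a placeholder), is to walk the folder and tally file count and sizes:

```python
# Sketch only: count files and total bytes under the staged folder to see
# whether the data is made up of many small files. Assumes the path is readable
# from a Fabric notebook; mssparkutils ships with Fabric Spark notebooks.
from notebookutils import mssparkutils

def walk(path):
    """Recursively yield file (non-directory) entries under path."""
    for item in mssparkutils.fs.ls(path):
        if item.isDir:
            yield from walk(item.path)
        else:
            yield item

files = list(walk("abfss://<container>@<account>.dfs.core.windows.net/cosmos_export/"))  # placeholder
total_bytes = sum(f.size for f in files)

print(f"file count   : {len(files):,}")
print(f"total size   : {total_bytes / 1e12:.2f} TB")
print(f"avg file size: {total_bytes / max(len(files), 1) / 1e6:.1f} MB")
```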

Anonymous
Not applicable

Hi @v-cyu,

We haven't heard back from you and were just checking to see if you got a chance to look into the last response.

Thanks

Anonymous
Not applicable

Hi @v-cyu,

We haven't heard back from you and were just checking to see if you got a chance to look into the last response.

Thanks
