Hi all,
I landed data into a Lakehouse, with the idea of using shortcuts to then add it to a warehouse. But I've hit a couple of issues with this:
1. It seems like you can't add shortcuts from a Lakehouse to a Warehouse?
2. Using the "Copy" utility in a Pipeline seems VERY slow.
2a. It doesn't allow me to copy straight to the warehouse; I'm forced to write to staging storage first (even though the "running" dialog later says I can, it won't actually allow it).
2b. It seems really slow. I'd heard about how fast the copy utility is supposed to be - but see image below. It's been running for almost 90 minutes and has only moved 235 MB of data:
So my questions are:
1. Should I flip my approach and always land structured data in a Warehouse, then make shortcuts to it in the Lakehouse (vs. vice versa)?
2. Is there something going on that's making the COPY go so slowly? I don't think any of this traffic is going through my gateway servers - it's all happening in the cloud.
Thanks!
Scott
Hello Scott, thanks for reporting this feedback.
We are aware of the specific performance bottleneck you observed and are working on it with high priority. You were able to load 600 million rows in ~2 hours, but there is still scope to improve.
As for why the data is getting staged: your data does need to be staged first, because it may not be in a state where it can be directly ingested by the Data Warehouse COPY command, which has limitations in terms of the data stores it supports, the authentication to the source data, the source file format, etc. Read about the limitations here: https://learn.microsoft.com/en-us/sql/t-sql/statements/copy-into-transact-sql?view=azure-sqldw-latest
So one way to reduce the copy duration is to make your source data compatible with the DW COPY command; then the pipeline UX will not require you to stage your data. The other is to wait for the performance improvements I mentioned earlier. Hope this helps.
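For illustration only, here is a minimal PySpark sketch (not the pipeline itself) of what "compatible source data" could look like: rewriting the Lakehouse table as plain parquet files in an ADLS Gen2 location the COPY command can read. All names are placeholders, and it assumes the Spark session already has access to that storage account.

```python
from pyspark.sql import SparkSession

# In a Fabric notebook the `spark` session already exists; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

# Hypothetical Lakehouse table name -- replace with your own.
df = spark.read.table("my_source_table")

# Write plain parquet files (a file format the COPY command accepts) to an
# ADLS Gen2 path. The storage account/container names are placeholders, and
# the session is assumed to already have permission on that location.
(df.write
   .mode("overwrite")
   .parquet("abfss://staging@mystorageaccount.dfs.core.windows.net/my_source_table/"))
```

With the data sitting in a supported store and format, the copy can then ingest it directly without the extra staging hop.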
Hi @ajarora, appreciate the response! Great to hear the performance is being worked on - it was 600 million rows, but it's also < 250 MB.
Re: the source vs. target compatibility - I'm moving this data from a Lakehouse table to a Warehouse table. Since both are delta parquet files, shouldn't the formats already be compatible?
Sorry for the newb questions, I just did a straight pipeline with a Copy activity to move from a Lakehouse table to a Warehouse table. Kinda thought this would be pretty straightforward. It's a lot of rows (just over a billion altogether), but the size of the files is small, less than half a GB total.
Thanks!
Scott
It seems you have a single source file, which prevents parallelism from kicking in efficiently.
Another issue is that the Warehouse COPY command doesn't support Lakehouse as a source, which is what triggers the staging step.
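One thing you could try on your side today is rewriting the source as several files, so the copy gets multiple parallel readers instead of one. A minimal PySpark sketch, with a hypothetical table name and an arbitrary partition count:

```python
from pyspark.sql import SparkSession

# In a Fabric notebook the `spark` session already exists; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

# Hypothetical table name -- replace with your own.
df = spark.read.table("my_source_table")

# Rewrite the Delta table as multiple files (16 is an arbitrary example)
# so the copy can read it with several parallel readers instead of one.
(df.repartition(16)
   .write
   .mode("overwrite")
   .format("delta")
   .saveAsTable("my_source_table_repartitioned"))
```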
These are the goals we are shooting for:
1. Remove the need for a staging step when copying data from a Lakehouse to a Warehouse, when the source format is already compatible with the COPY command (parquet, csv, etc.).
2. If the source data is not compatible, improve performance through better partitioning, whether for a single file or multiple files.
Hope this helps. Stay tuned.
FYI, one update - it took a total of 2 hours 22 minutes to copy from the Lakehouse to the ADLS Gen2 staging, but only 7 minutes to go from staging to the final warehouse. If the 2 hours 22 minutes could also be cut down to 7 minutes, that would be awesome.
Thanks,
Scott