What is the best way to load a huge amount of data from an Oracle source to a Lakehouse? We are trying to load our FACT tables, which have millions of rows, into a Fabric Lakehouse. Typically, these tables are terabytes in size. We just want to know the best way to approach this with a balance of speed and efficiency. Should we go the Dataflow Gen2 route or the Data Pipeline route?
We are currently testing the pipeline route, but it's taking too long to copy the data. We had to cancel the current run of this pipeline since it was still running after 8 hours.
Hi @trebllaw,
Perhaps you can try the shortcut feature to get the data. A shortcut lets you read data from a remote data source without data movement.
Since Fabric does not currently support creating a shortcut directly to Oracle, you may need to land the Oracle data in a supported data source (e.g. Azure, Dataverse, …), and then use a shortcut to read that data from Azure.
Unify data sources with OneLake shortcuts - Microsoft Fabric | Microsoft Learn
Regards,
Xiaoxin Sheng
As much as possible, we'd like to avoid additional costs from using Azure, since we are already paying for 3 P3 capacities at the moment. That is why we didn't go the staging route, as that would consume Data Lake Gen2 storage, which is an additional cost.
Hi @trebllaw,
Perhaps you can consider using semantic link (SemPy), which can be used to get data from a semantic model.
What is semantic link? - Microsoft Fabric | Microsoft Learn
You can design a Power BI report in Desktop with a model that connects to the Oracle data source, and publish it to a Power BI service workspace that is assigned to a Fabric capacity.
After you assign a gateway to handle its refresh, you can use a notebook to get data from that model and load it into the Lakehouse.
Semantic link and Power BI connectivity - Microsoft Fabric | Microsoft Learn
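If that route is of interest, here is a minimal notebook sketch of the "get data from the model and load it to the Lakehouse" step, assuming a published semantic model named "OracleWarehouse" with a table "FactSales" (both names are placeholders for your own):

```python
# Minimal sketch: pull a table from a published semantic model into the Lakehouse.
# "OracleWarehouse" and "FactSales" are placeholder names for illustration only.
import sempy.fabric as fabric

# Read the table from the semantic model into a FabricDataFrame (pandas-based)
fact_df = fabric.read_table("OracleWarehouse", "FactSales")

# Convert to a Spark DataFrame and persist it as a Delta table in the Lakehouse
spark.createDataFrame(fact_df) \
    .write.format("delta").mode("overwrite").saveAsTable("FactSales")
```

Keep in mind that read_table pulls the whole table through the semantic model into memory, so it suits moderate volumes better than multi-terabyte FACT tables.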
Regards,
Xiaoxin Sheng
Unfortunately, this wouldn't work for us since we are going to be loading a huge amount of data from multiple FACT tables.
The main reason we are trying to use pipelines is that they are advertised as the ideal option for moving large amounts of data, able to load petabytes pretty quickly. Based on our tests, though, we are getting only about 5 KB/s of throughput. In 3.5 hours we only managed to process and transfer about 1.5M rows. We may be doing something wrong here; we are not sure.
Here are some of the things that might be of note:
1. The Oracle source isn't partitioned.
2. We've set the pipeline to dynamic range partitioning and set one of the key fields as the index (a notebook-based comparison is sketched after this list).
3. Parallel copies are automatically set to 3.
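As a cross-check on where the bottleneck is (source, gateway, or copy settings), it can help to compare against a parallel JDBC read in a Fabric notebook. The sketch below is only an illustration under assumptions, not the pipeline copy activity itself: the connection string, credentials, table name SALES_FACT, key column SALES_ID, and its bounds are all placeholders, and the Oracle JDBC driver must be available to the Spark session.

```python
# Sketch: partitioned JDBC read from Oracle into a Lakehouse Delta table.
# All connection details, names, and bounds below are placeholders.
jdbc_url = "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB"

fact_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "SALES_FACT")
    .option("user", "etl_user")
    .option("password", "********")
    .option("driver", "oracle.jdbc.OracleDriver")
    # Range partitioning: Spark issues one range query per partition, in parallel
    .option("partitionColumn", "SALES_ID")   # numeric key used to split the read
    .option("lowerBound", "1")
    .option("upperBound", "2000000000")
    .option("numPartitions", "32")
    .option("fetchsize", "10000")            # larger fetch size reduces round trips
    .load()
)

# Land the data as a Delta table in the Lakehouse
fact_df.write.format("delta").mode("overwrite").saveAsTable("sales_fact")
```

If the notebook read is just as slow, the limit is more likely the source or gateway bandwidth than the copy activity settings.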
Hi,
I'm encountering the same performance issues in my testing. Have you had any success? I'm also loading very large, non-partitioned on-prem Oracle tables into Microsoft Fabric Lakehouses.
Hi @trebllaw,
What type of SKU are you working with? The endpoint and data pipelines also have data read/write limitations. You can check the following links to learn more about them and see if they help with your scenario:
What is Power BI Premium? - Power BI | Microsoft Learn
Data Factory limitations overview - Microsoft Fabric | Microsoft Learn
Regards,
Xiaoxin Sheng
Currently, with our testing, it would be a one-time thing, since we are trying to replicate what we have in the on-prem Oracle warehouse in the Fabric Lakehouse.
In the future, we would also like to update the Lakehouse whenever the warehouse changes; maybe this would be scheduled once a day or so.
It's a direct copy, no transformation whatsoever. Well, does changing the detected data type count? Sometimes Fabric detects the data type of a column as float when it's really an integer, so we change that during the copy.
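If the load ends up going through a notebook, that type correction is a one-line cast at write time. A small sketch, with QUANTITY as a made-up example of a column detected as float that is really an integer:

```python
from pyspark.sql.functions import col

# Hypothetical column: QUANTITY was detected as float but is really an integer
df = spark.read.table("sales_fact")
df = df.withColumn("QUANTITY", col("QUANTITY").cast("int"))
df.write.format("delta").mode("overwrite").saveAsTable("sales_fact_typed")
```

For the daily refresh later on, the same write can be switched to an append or merge filtered on a last-modified column, if the source tables have one.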
Hi @trebllaw,
Some questions to clarify what you're looking for: