Hello Everyone,
I need a suggestion from community members, and I feel this is the right platform to ask.
Bronze Layer:
Our raw source data lands directly in the Files section of a Fabric Lakehouse.
Silver Layer:
A notebook loads the data incrementally from the Lakehouse Files section into Lakehouse tables, which are exposed through the SQL analytics endpoint.
Gold Layer (approach suggestion needed for Fabric Warehouse):
We created a custom structure for all the fact tables.
Now we need to apply the business logic, rules, and data transformations on top of the Silver layer Lakehouse tables, selecting the required columns from multiple tables and loading them into a single fact table. That is my requirement.
Also, the fact tables should receive a full load only the first time; from then on, data should be loaded incrementally into the Fabric Warehouse fact tables based on a change-tracking column.
In other words, data should be upserted on every run, not fully reloaded every time.
Here are the options I am considering. Please let me know whether they are correct, tell me if I am wrong, and suggest a good approach.
1. Dataflows Gen2: I thought of using this, but from what I have seen it cannot upsert data.
2. Spark notebooks: I thought of using these, but I understand that Spark cannot write data from a Lakehouse to a Warehouse, since it supports writing only to a Lakehouse.
3. T-SQL notebooks: I saw these, but they support only Transact-SQL and I am not sure whether they fit. Our data is currently around 50 million records across multiple tables (under 50 GB), but it may grow to 100 GB, so would T-SQL notebooks still work as expected as the data increases?
4. Stored procedures in the Warehouse: I can read the Lakehouse tables using cross-database querying. The same concern applies here: will upserting millions of records through stored procedures run into performance issues?
5. Finally, I thought of using Spark notebooks with PySpark to load the transformed data into Lakehouse tables in a separate schema, and then a pipeline to load the data incrementally into the Warehouse tables. (I would prefer not to go with this approach because it creates an extra copy of the aggregated data in the Lakehouse, increasing its storage footprint, only to load the same data into the Warehouse again.)
These are the approaches I am considering. Can anyone suggest the best one to go with?
My requirement is simple: I have structured source data in Lakehouse tables. I have to use these tables as the source for my transformations, do a one-time full load into the fact tables in the Fabric Warehouse, and from then on load the data incrementally as an upsert: update existing records when the IDs match and insert new records when they do not.
As mentioned, the data size is growing, so I want an approach that will not run into performance issues later, or at least one in which I can handle them if they do occur. Should I go with Spark notebooks, stored procedures, T-SQL notebooks, or another approach?
Any suggestions for the best approach are greatly appreciated.
Thanks in advance,
Solved! Go to Solution.
Hi @pavannarani
Thank you for reaching out to the Microsoft Community Forum regarding your query.
After evaluating the approaches, the best option is to use stored procedures in the Fabric Warehouse to read from the Silver layer Lakehouse tables (via the SQL analytics endpoint), apply the transformations, and upsert the data into the Gold layer fact tables.
This approach supports an initial full load and incremental upserts using MERGE, avoids extra data copies in the Lakehouse, scales comfortably as the data grows toward 100 GB, handles complex business logic in T-SQL, and can be orchestrated via a Fabric Data Pipeline for scheduling and control.
To implement the above approach, create a stored procedure in the Warehouse that reads the Silver Lakehouse tables through cross-database queries, merges new and changed rows into the fact tables, and is scheduled from a Data Pipeline.
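As a rough illustration only (object names such as SilverLakehouse, dbo.Sales, dbo.FactSales and the ModifiedDate watermark column are placeholders, not from the original post), the core of such a procedure could look like the sketch below. It assumes MERGE is available in your Warehouse; if it is not, the same upsert can be written as an UPDATE of matching rows followed by an INSERT of new ones. On the very first run the fact table is empty, so the watermark falls back to a minimum date and the procedure performs the full load automatically.
-- Hypothetical upsert procedure in the Gold Warehouse; all names are placeholders.
CREATE PROCEDURE dbo.usp_Load_FactSales
AS
BEGIN
    MERGE dbo.FactSales AS tgt
    USING (
        -- cross-database read from the Silver Lakehouse via its SQL analytics endpoint
        SELECT s.SaleID, s.CustomerID, s.ProductID, s.Amount, s.ModifiedDate
        FROM SilverLakehouse.dbo.Sales AS s
        WHERE s.ModifiedDate > (SELECT ISNULL(MAX(ModifiedDate), '1900-01-01') FROM dbo.FactSales)
    ) AS src
        ON tgt.SaleID = src.SaleID
    WHEN MATCHED THEN
        UPDATE SET tgt.CustomerID   = src.CustomerID,
                   tgt.ProductID    = src.ProductID,
                   tgt.Amount       = src.Amount,
                   tgt.ModifiedDate = src.ModifiedDate
    WHEN NOT MATCHED THEN
        INSERT (SaleID, CustomerID, ProductID, Amount, ModifiedDate)
        VALUES (src.SaleID, src.CustomerID, src.ProductID, src.Amount, src.ModifiedDate);
END;
A Data Pipeline with a Stored procedure activity can then run this on a schedule.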
For more details, please refer to the official Microsoft documentation:
Incrementally load data from Data Warehouse to Lakehouse - Microsoft Fabric
If my response has resolved your query, please mark it as the Accepted Solution to assist others. Additionally, a 'Kudos' would be appreciated if you found my response helpful.
Thank You
Let's keep this simple. First, there are certain limitations that come with a Fabric Warehouse, which I do not remember at the moment, so please look them up yourself by asking Copilot or another GPT. For this reason, and for maximum future flexibility, I would recommend using a Lakehouse instead of a Warehouse. Later on, if and when needed, you have PySpark available from a notebook with a huge number of libraries to do basically whatever you can think of. Under the hood there is no SQL engine any more, but Delta tables/files. You still get a SQL analytics endpoint where you can perform quick lookups and tests, and create views, stored procedures, and table-valued functions if you must use SQL no matter what. But be aware that all new table creation in the Lakehouse has to be done either from a pipeline Copy data activity or from a notebook.
Step 1 - Bronze layer
As regards the Bronze layer, you could use the Fabric database mirroring capability, which relies on CDC (Change Data Capture), on whatever SQL Server you use. You will find a very good and short explanation of how to use it on this YT channel: Mirroring SQL Server Database into Microsoft Fabric | How to Mirror SQL Server in Microsoft Fabric
Or instead, use a pipeline with a conditional activity:
one branch uses Copy data with a T-SQL script to load all your data, and a second branch performs the incremental ingest, also using Copy data but with a different T-SQL script (see the sketch just below).
So in this example the source is an on-prem SQL Server, or some Azure SQL Server in the cloud.
The sink or destination is the Fabric mirrored DB, i.e. your SQL Server mirrored into the cloud.
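For the incremental branch, the Copy data source query is typically just a watermark filter. A minimal sketch with hypothetical table and column names; in the pipeline, the watermark value would be injected through dynamic content (for example from a Lookup activity), and it is declared here only to keep the script self-contained:
-- Hypothetical incremental source query for the second Copy data branch.
DECLARE @LastWatermark datetime2 = '2024-01-01';
SELECT *
FROM dbo.Orders
WHERE ModifiedDate > @LastWatermark;
The full-load branch simply omits the WHERE clause.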
Step 2 - Silver layer
You will use another pipeline to transform, clean, or perform any other operations you need on the data that now sits in the mirrored DB. Here you can use notebooks or Dataflows Gen2 (which I still do not recommend, since they cost more in CUs and force you to use the M language; why learn yet another language and add another layer of complexity when you can already do everything in Python?).
Source: the Fabric mirrored DB.
Sink: a Lakehouse to store the Silver data if further processing is required after this step by yet another pipeline, which would in turn use another Lakehouse as the final sink for the Gold data. If one pipeline is enough, the output of this pipeline is considered final and stored in the single Lakehouse used as the source for the Gold layer.
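If you go the notebook route for this step, a Silver transformation can be as simple as a Spark SQL CTAS. This is only a sketch: bronze_lakehouse, silver_lakehouse and the column names are placeholders, and it assumes the Bronze/mirrored tables are visible to the notebook (for example through attached Lakehouses or OneLake shortcuts):
%%sql
-- Hypothetical cleaning step; all object names are placeholders.
CREATE OR REPLACE TABLE silver_lakehouse.orders_clean AS
SELECT CAST(OrderID AS BIGINT)  AS OrderID,
       TRIM(CustomerName)       AS CustomerName,
       CAST(OrderDate AS DATE)  AS OrderDate,
       Amount
FROM bronze_lakehouse.orders_raw
WHERE OrderID IS NOT NULL;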
Step 3 - Gold layer
Another pipeline that interacts with a semantic model, or simply a semantic model connecting to the final version of the data using Direct Lake, DirectQuery, or Import mode. For the latter, define an incremental refresh policy from within the Power BI Desktop UI. If you do so, you can then use SSMS to log into the Analysis Services endpoint of the semantic model in Fabric and check that the incremental refresh policy did indeed create the necessary partitions.
Speaking of partitions, if you are going to use Lakehouses only, you need to manage your Delta files from a notebook. By default, a Fabric Lakehouse will optimize your Delta files for you, and you can also use the UI to override the default settings for the VACUUM command. The potential problem here is the 'small files' problem. For example, if you run your pipeline(s) every hour, or even more than once an hour, each Copy data run creates new Delta files, and as you can imagine, after a month of doing this you end up with a lot of files in your folders.
Usually, the higher the ingestion frequency, the smaller each incoming batch, which in turn creates a lot of small files. If this is your case, create a notebook and use the following PySpark and Spark SQL commands (adjusting the numbers to your requirements, of course):
# spark.conf.set("spark.databricks.delta.autoCompact.minNumFiles", "10")
# enable optimized writes (same as delta.autoOptimize.optimizeWrite = 'true')
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
# enable auto-compaction (same as delta.autoOptimize.autoCompact = 'true')
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", "256MB")
spark.conf.set("spark.databricks.delta.optimizeWrite.binSize", "128MB")
To process multiple tables from one notebook, you will need to define pipeline parameters and base parameters on your Notebook activities. Also, because pipeline activities run faster than the Lakehouse I/O catches up, a downstream activity that needs data you just ingested into a Lakehouse might not see the new rows, since the sync takes some time. You can then use a notebook as an intermediary step to force a sync on the Lakehouse, either with the old method published by the Microsoft PM responsible for the Lakehouse/Warehouse part of the product, or with the newly released REST API method (I haven't tried it yet, but since there is a Web activity in pipelines, I would imagine it can be configured to make the same REST API call that would be done from a notebook, perhaps also more cheaply in terms of CUs burned).
%%sql
OPTIMIZE TABLE_0 ZORDER BY (COL_0);
OPTIMIZE TABLE_1 ZORDER BY (COL_0, COL_1, ..., COL_n);
and so on....
%%sql
VACUUM TABLE_0 RETAIN 0 HOURS;
VACUUM TABLE_1 RETAIN 0 HOURS;
and so on....
Now a quick word regarding the VACUUM command and the RETAIN 0 HOURS option: Microsoft does NOT recommend any retention shorter than 7 days. If you later need time travel, or do some parallel processing with a Spark cluster that needs access to files created, say, yesterday, RETAIN 0 HOURS will already have deleted those files, and then boom!
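One practical note on that option: Spark normally refuses to run VACUUM with a retention below the safety threshold, so if you really want RETAIN 0 HOURS you have to disable the check explicitly for the session, with the caveats above firmly in mind:
%%sql
-- Disables the Delta retention safety check for this session only; use with care.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;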
But if you do not need time travel, and only do sequential processing of the data, meaning:
ingest data into mirrored DB --> sequential ETL on data in the order the data comes in --> persist data to storage layer in the same sequential order
then the OPTIMIZE and VACUUM ... RETAIN 0 HOURS commands will get rid of all the small files, and you will be left with compact, optimized snappy.parquet files in the folder, which streamlines the I/O between your semantic model and the data store.
This is of course assuming a workload that runs frequently during a 24-hour period.
One last thing to keep in mind: if using a pipeline with a Copy data activity, configure it to create partitions in the destination using [YEAR], [MONTH], or whatever else makes sense given the data access patterns of the downstream processes. If using a notebook, do the same in PySpark (see the sketch below).
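For the notebook variant, the destination table can be declared with partitions up front. A minimal Spark SQL sketch with placeholder names (the PySpark equivalent is partitionBy on the DataFrame writer):
%%sql
-- Hypothetical partitioned Gold table; Year and Month mirror the Copy data partition option mentioned above.
CREATE TABLE IF NOT EXISTS gold_lakehouse.fact_sales (
    SaleID BIGINT,
    Amount DECIMAL(18,2),
    Year   INT,
    Month  INT
)
USING DELTA
PARTITIONED BY (Year, Month);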
If I miswrote anything or committed egregious errors, feel free to correct me. After all, there is something new to learn in Fabric literally almost every week!
Fabric supports loading data from a Lakehouse to a Warehouse from a notebook by invoking the appropriate libraries.
Hi @pavannarani
Think about it this way:
Fabric SQL = Fabric Data Warehouse = Fabric Lakehouse = Direct Lake Semantic Model = Unified Data Architecture.
To save money, I would start with notebooks and a Lakehouse. You can definitely save money if you know exactly what you are doing.
Also, one notebook = one table (fact or dimension). You don't need to use stored procedures (they are optional now), and you don't need to write MERGE. Consult a good data/solution architect.
I had the same issue, and rather than use a Warehouse I decided to do everything with a Lakehouse, which works. The incremental refresh on the fact tables is set to append mode, of course, after the initial full load of all the data.
I avoid using Dataflow Gen2 to build fact tables as much as possible, since it is more expensive than doing everything at source in T-SQL and in the pipeline.
I also rely on views and user-defined stored procedures, which you can easily define in your Lakehouse SQL analytics endpoint.
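For reference, such objects are plain T-SQL on the SQL analytics endpoint; here is a minimal sketch of a view, with placeholder table and column names:
-- Hypothetical view over Lakehouse tables, created on the SQL analytics endpoint.
CREATE VIEW dbo.vw_FactSales AS
SELECT s.SaleID, s.CustomerID, s.Amount, d.CalendarYear
FROM dbo.Sales AS s
JOIN dbo.DimDate AS d ON d.DateKey = s.DateKey;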
With this approach, you get all the benefits of the Lakehouse technology and none of the annoyances of a Warehouse. It future-proofs your data estate: if some new requirement comes up that makes you realize you should have used Delta tables instead of a SQL Warehouse, you will be in a position of utmost flexibility and able to adapt to any crazy requirement.
Hi @pavannarani
Use a MERGE statement in a stored procedure to load data from the Lakehouse tables into the Warehouse fact tables.
Use a ModifiedDate (or similar) column to pick up only new or changed rows.
Inside the stored procedure, read the Lakehouse tables with three-part naming through the SQL analytics endpoint (see the sketch after this list).
Use a Fabric Data Pipeline to run the stored procedure automatically on a schedule.
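To illustrate the cross-database read and the ModifiedDate filter together (all names are placeholders; the MERGE itself follows the pattern in the accepted solution above):
-- Hypothetical source query inside the Warehouse stored procedure:
-- three-part naming reads the Lakehouse table through its SQL analytics endpoint.
SELECT s.SaleID, s.CustomerID, s.Amount, s.ModifiedDate
FROM SilverLakehouse.dbo.Sales AS s
WHERE s.ModifiedDate > (SELECT ISNULL(MAX(ModifiedDate), '1900-01-01') FROM dbo.FactSales);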