Hi,
in Data Factory of Microsoft Fabric, how could I implement an incremental data load?
In Azure Data Factory I could use a mapping data flow to accomplish such a task.
In Fabric, is there a good practice to follow? Using a data pipeline or a Dataflow Gen2?
Thanks
There are different strategies for incremental loading.
I'm using a medallion architecture with one additional layer: Landing.
There is a Landing lakehouse where I ingest data with the overwrite option.
Here is an example of a data pipeline for one table.
1. Get last WaterMark - read the IncrementalLoadInfo table (columns TableName and LastLoadDate) for the specific table to get the last loaded value:
SELECT COALESCE(
(SELECT [LastLoadDate] FROM [dbo].[IncrementalLoadInfo] WHERE [TableName] = '@{item().TableName}'),
CAST('2020-01-01' AS DATETIME2)
) AS [LastLoadDate];
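For reference, the watermark table might look like this. This is a sketch, not the author's actual DDL: TableName and LastLoadDate appear in the thread, while the third column's name is a placeholder (the UpdateIncrementalLoadInfo procedure below inserts a third NULL value, so some third column must exist):

```sql
-- Hypothetical DDL for the watermark table.
CREATE TABLE [dbo].[IncrementalLoadInfo]
(
    [TableName]    VARCHAR(100) NOT NULL,
    [LastLoadDate] DATETIME2(6) NOT NULL,
    [Comment]      VARCHAR(100) NULL  -- placeholder name, not from the thread
);
```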
2. Loading to Landing - copies the data from the origin to Landing (with Overwrite). The query against the source is (note `>` rather than `=`, so that all rows newer than the watermark are picked up):
SELECT * FROM [dbo].[@{item().TableName}]
WHERE Timestamp > '@{activity('Get last WaterMark').output.firstRow.LastLoadDate}'
3. Create Raw table - simply creates the same table in the warehouse (Raw/Bronze):
CREATE TABLE [dbo].[@{item().TableName}] AS
SELECT *
FROM [Landing].[dbo].[@{item().TableName}];
4. Loading from Landing to Raw - runs the Landing2Raw stored procedure
CREATE PROC [dbo].[Landing2Raw]
    @TableSource NVARCHAR(MAX),
    @TableTarget NVARCHAR(MAX),
    @Condition NVARCHAR(MAX)
AS
BEGIN
    BEGIN TRANSACTION;
    -- Build the "[column] = S.[column]" list from the Landing table's metadata
    DECLARE @SetColumns NVARCHAR(MAX) = '';
    SELECT
        @SetColumns = @SetColumns + QUOTENAME(name) + ' = S.' + QUOTENAME(name) + ', '
    FROM [Landing].[sys].[columns]
    WHERE object_id = OBJECT_ID(@TableSource);
    SET @SetColumns = LEFT(@SetColumns, LEN(@SetColumns) - 1); -- remove trailing comma
    -- Update rows that already exist in the target (the target must be the
    -- aliased table being updated, not a second self-joined reference)
    DECLARE @Update NVARCHAR(MAX) = N'UPDATE T' +
        ' SET ' + @SetColumns +
        ' FROM ' + @TableTarget + ' AS T' +
        ' INNER JOIN ' + @TableSource + ' AS S ON S.' + @Condition + ' = T.' + @Condition;
    EXEC sp_executesql @Update;
    -- Insert rows that are not yet in the target
    DECLARE @Insert NVARCHAR(MAX) = N'INSERT INTO ' + @TableTarget +
        ' SELECT * FROM ' + @TableSource + ' AS S WHERE NOT EXISTS (SELECT 1 FROM ' + @TableTarget + ' WHERE ' + @Condition + ' = S.' + @Condition + ')';
    EXEC sp_executesql @Insert;
    COMMIT TRANSACTION;
END
It takes the data from Landing (the just-uploaded batch) and inserts or updates the data in the Raw warehouse.
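As a hedged illustration, the pipeline's stored-procedure activity might invoke it like this (the table and key names are made up; the three-part source name matches the cross-warehouse query used when creating the Raw table):

```sql
-- Hypothetical call for a Customers table keyed on CustomerId.
EXEC [dbo].[Landing2Raw]
    @TableSource = N'[Landing].[dbo].[Customers]',
    @TableTarget = N'[dbo].[Customers]',
    @Condition   = N'[CustomerId]';
```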
5. Update WaterMark - updates the load date for the table in IncrementalLoadInfo. Here I use a stored procedure as well, because the table might not yet have a record for the table I'm loading:
CREATE PROC [dbo].[UpdateIncrementalLoadInfo]
    @TableName [VARCHAR](100),
    @CollectDate [DATETIME2](6)
AS
BEGIN
    IF EXISTS (SELECT 1 FROM [dbo].[IncrementalLoadInfo] WHERE [TableName] = @TableName)
    BEGIN
        UPDATE [dbo].[IncrementalLoadInfo]
        SET [LastLoadDate] = @CollectDate
        WHERE [TableName] = @TableName;
    END
    ELSE
    BEGIN
        INSERT [dbo].[IncrementalLoadInfo]
        SELECT @TableName, @CollectDate, NULL;
    END;
END
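The pipeline would then call this with the table name and the high-water mark of the batch just loaded, for example (values illustrative only):

```sql
-- Hypothetical call; in the pipeline, @CollectDate would come from the
-- maximum Timestamp of the copied batch rather than a literal.
EXEC [dbo].[UpdateIncrementalLoadInfo]
    @TableName   = 'Customers',
    @CollectDate = '2024-06-01 12:00:00';
```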
Hope this gives you the idea. Ping me if you have more questions.
Note: this approach doesn't handle data deletion (if you remove a record from your origin database)! For that, use soft deletes or consider a CDC pattern.
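A minimal sketch of the soft-delete idea, assuming the source table carries an IsDeleted flag and the same Timestamp column used for the watermark (column and table names here are illustrative, not from the thread):

```sql
-- Instead of physically deleting rows, the origin marks them deleted and
-- bumps the change timestamp, so the incremental query picks the change up
-- like any other update:
--   UPDATE [dbo].[Customers]
--   SET [IsDeleted] = 1, [Timestamp] = SYSUTCDATETIME()
--   WHERE [CustomerId] = @Id;

-- Downstream layers then filter out soft-deleted rows:
SELECT *
FROM [dbo].[Customers]
WHERE [IsDeleted] = 0;
```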
Hi @pmscorca
We haven't heard from you on the last response and were just checking back to see if you have a resolution yet.
If you have a resolution, please share it with the community, as it can be helpful to others.
Otherwise, reply back with more details and we will try to help.
Thanks.
Hi @pmscorca
We haven't heard from you on the last response and were just checking back to see if you have a resolution yet. If you have a resolution, please share it with the community, as it can be helpful to others.
If you have any question relating to the current thread, please let us know and we will try our best to help you.
If you have a question on a different issue, we request that you open a new thread.
Thanks.
Depending on your requirements for the source, the data structure, and availability, you can see whether it can be achieved within a pipeline using the UI; in scenarios requiring greater control and more complex mechanisms, you can use a dataflow. You can also use a pipeline with Spark code of your choice, which provides much more control for CI/CD. It all comes down to the mechanisms in place, the existing expertise available, and the path the organization would like to adopt.
Hi, for example, to ingest a customer table handled as an SCD2, reading data from an on-premises source (e.g. Oracle, CSV files).
Thanks