pmscorca
Post Patron

How to implement incremental data loading in Data Factory

Hi,

in Data Factory for Microsoft Fabric, how can I implement an incremental data load?

In Azure Data Factory I could use a mapping data flow to accomplish such a task.

Is there a good practice to follow in Fabric? Using a data pipeline or a Dataflow Gen2?

Thanks

5 REPLIES
Marusyk
Advocate II

There are different strategies for incremental loading. 

I'm using a medallion architecture with one additional layer: Landing.

[Image: medallion architecture diagram with the additional Landing layer]

There is a Landing lakehouse where I ingest data with the overwrite option.

Here is an example of a data pipeline for one table:

[Image: data pipeline with the five activities described below]

1. Get last WaterMark - reads the IncrementalLoadInfo table (columns TableName and LastLoadDate) for the specific table to get the last loaded value:

 

SELECT COALESCE(
    (SELECT [LastLoadDate] FROM [dbo].[IncrementalLoadInfo] WHERE [TableName] =  '@{item().TableName}'),
    CAST('2020-01-01' AS DATETIME2)
) AS [LastLoadDate];

 

2. Loading to Landing - copies the data from the origin to Landing (with overwrite). The query against the source is:

 

SELECT * FROM [dbo].[@{item().TableName}]
WHERE [Timestamp] > '@{activity('Get last WaterMark').output.firstRow.LastLoadDate}'

 

3. Create Raw table - simply creates the same table, but in the warehouse (Raw/Bronze):

 

CREATE TABLE [dbo].[@{item().TableName}] AS
SELECT *
FROM [Landing].[dbo].[@{item().TableName}];

 

4. Loading from Landing to Raw - runs the Landing2Raw stored procedure:

 

CREATE PROC [dbo].[Landing2Raw]
    @TableSource NVARCHAR(MAX),
    @TableTarget NVARCHAR(MAX),
    @Condition NVARCHAR(MAX)  -- name of the key column used to match rows
AS
BEGIN
    BEGIN TRANSACTION
        -- Build "col1 = S.col1, col2 = S.col2, ..." from the source table's columns
        DECLARE @SetColumns NVARCHAR(MAX) = '';
        SELECT 
            @SetColumns = @SetColumns + QUOTENAME(name) + ' = S.' + QUOTENAME(name) + ', '
        FROM [Landing].[sys].[columns]
        WHERE object_id = OBJECT_ID(@TableSource);
        SET @SetColumns = LEFT(@SetColumns, LEN(@SetColumns) - 1); -- Remove trailing comma
        -- Update rows that already exist in the target
        DECLARE @Update NVARCHAR(MAX) = N'UPDATE ' + @TableTarget +
            ' SET ' + @SetColumns +
            ' FROM ' + @TableSource + ' AS S' +
            ' INNER JOIN ' + @TableTarget + ' AS T ON S.' + @Condition + ' = T.' + @Condition;
        EXEC sp_executesql @Update;
        -- Insert rows that don't exist in the target yet
        DECLARE @Insert NVARCHAR(MAX) = N'INSERT ' + @TableTarget +
            ' SELECT * FROM ' + @TableSource + ' AS S WHERE NOT EXISTS (SELECT 1 FROM ' + @TableTarget + ' WHERE ' + @Condition + ' = S.' + @Condition + ')';
        EXEC sp_executesql @Insert;
    COMMIT TRANSACTION
END

 

It takes the data from Landing (the just-uploaded batch) and inserts or updates it in the Raw warehouse.

5. Update WaterMark - updates the load date for the table in IncrementalLoadInfo. I use a stored procedure here as well, because the table might not yet have a record for the table being loaded:

 

CREATE PROC [dbo].[UpdateIncrementalLoadInfo]
    @TableName [VARCHAR](100),
    @CollectDate [DATETIME2](6)
AS
BEGIN
    IF EXISTS (SELECT 1 FROM [dbo].[IncrementalLoadInfo] WHERE [TableName] = @TableName)
    BEGIN
        UPDATE [dbo].[IncrementalLoadInfo]
        SET [LastLoadDate] = @CollectDate
        WHERE [TableName] = @TableName;
    END
    ELSE
    BEGIN
        INSERT [dbo].[IncrementalLoadInfo]
        SELECT @TableName, @CollectDate, NULL
    END;
END
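The five steps above can be sketched end to end as a minimal in-memory simulation. This is a hypothetical illustration only (table, key, and column names like `Id` and `Timestamp` are assumptions, not part of the pipeline above): read the last watermark, pull only newer rows, upsert them into the Raw table, then advance the watermark.

```python
# Hypothetical sketch of the watermark-driven incremental load described above.
from datetime import datetime

DEFAULT_WATERMARK = datetime(2020, 1, 1)  # same fallback as step 1's COALESCE

def get_last_watermark(load_info: dict, table: str) -> datetime:
    # Step 1: last load date for the table, or the default
    return load_info.get(table, DEFAULT_WATERMARK)

def load_incremental(source_rows, raw_table: dict, load_info: dict,
                     table: str, key: str = "Id") -> int:
    watermark = get_last_watermark(load_info, table)
    # Step 2: only rows changed after the last watermark reach Landing
    batch = [r for r in source_rows if r["Timestamp"] > watermark]
    for row in batch:
        # Steps 3-4: upsert into Raw (insert new keys, update existing ones)
        raw_table[row[key]] = row
    if batch:
        # Step 5: advance the watermark to the newest row just loaded
        load_info[table] = max(r["Timestamp"] for r in batch)
    return len(batch)
```

Running the same load twice illustrates the point of the watermark: the second run finds no rows newer than the stored date and loads nothing.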

 

Hope this helps give you the idea. Ping me if you have more questions.

 

Note: this approach doesn't handle data deletion (if you remove a record from your origin database)! For that, use soft deletes or consider a CDC (change data capture) pattern.
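The soft-delete idea mentioned above can be sketched as follows (a hypothetical illustration; the `IsDeleted` and `Timestamp` column names are assumptions): instead of physically removing a row in the source, flag it, so the row's watermark column moves forward and the incremental load can propagate the deletion.

```python
# Hypothetical soft-delete sketch: flag rows instead of deleting them,
# so the change is visible to a watermark-based incremental load.
from datetime import datetime

def soft_delete(table: list, key_value, now: datetime) -> bool:
    for row in table:
        if row["Id"] == key_value:
            row["IsDeleted"] = True   # flag instead of DELETE
            row["Timestamp"] = now    # bump the watermark column
            return True
    return False

def active_rows(table: list) -> list:
    # Downstream queries filter out flagged rows
    return [r for r in table if not r.get("IsDeleted", False)]
```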

 


Hi @pmscorca 

 

We haven't heard from you since the last response and wanted to check whether you have a resolution yet.
If you do, please share it with the community, as it can be helpful to others.
Otherwise, respond back with more details and we will try to help.

 

Thanks.

Hi @pmscorca 

 

We haven't heard from you since the last response and wanted to check whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others.
If you have any question relating to the current thread, please let us know and we will try our best to help you.
If you have a question on a different issue, we request that you open a new thread.

 

Thanks.

bhanu_bi
Frequent Visitor

Depending on your source, data structure, and availability requirements, you can first see whether the load can be achieved within a pipeline using the UI; in scenarios requiring greater control and more complex mechanisms, you can try a dataflow. You can also use a pipeline with Spark code of your choice, which provides much more control for CI/CD. It all comes down to the mechanisms in place, the existing expertise available, and the path the organization would like to adopt.

Hi, for example: ingesting a customer table to handle as an SCD2, reading data from an on-premises source (e.g. Oracle, CSV files).

Thanks
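The SCD2 case mentioned above can be sketched as a minimal versioning routine (a hypothetical illustration; the `ValidFrom`/`ValidTo`/`Name` column names and open-ended date are assumptions): when a tracked attribute changes, close the current version and append a new one, rather than updating in place.

```python
# Hypothetical SCD2 sketch: close the current version and append a new one
# whenever a tracked attribute changes.
from datetime import datetime

OPEN_END = datetime(9999, 12, 31)  # sentinel "still valid" end date

def scd2_apply(history: list, incoming: dict, now: datetime) -> None:
    # Find the currently open version of this business key, if any
    current = next((h for h in history
                    if h["Id"] == incoming["Id"] and h["ValidTo"] == OPEN_END),
                   None)
    if current and current["Name"] == incoming["Name"]:
        return  # no change, nothing to version
    if current:
        current["ValidTo"] = now  # close the old version
    history.append({**incoming, "ValidFrom": now, "ValidTo": OPEN_END})
```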
