Hi,
in Data Factory of Microsoft Fabric, how could I implement an incremental data load?
In Azure Data Factory I could use a mapping data flow to accomplish such a task.
In Fabric, is there a good practice to follow? Using a data pipeline or a Dataflow Gen2?
Thanks
There are different strategies for incremental loading.
I'm using a medallion architecture with one additional layer: Landing.
There is a Landing lakehouse where I ingest data with the overwrite option.
Here is an example of a data pipeline for one table.
1. Get last WaterMark - read the IncrementalLoadInfo table (columns TableName and LastLoadDate) for the specific table to get the last loaded value:
SELECT COALESCE(
(SELECT [LastLoadDate] FROM [dbo].[IncrementalLoadInfo] WHERE [TableName] = '@{item().TableName}'),
CAST('2020-01-01' AS DATETIME2)
) AS [LastLoadDate];
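For reference, the watermark table might look like this. This is a sketch, not the author's actual DDL: TableName and LastLoadDate appear in the thread, while the third column's name is a placeholder (the UpdateIncrementalLoadInfo procedure below inserts a third NULL value, so some third column must exist):

```sql
-- Hypothetical DDL for the watermark table.
CREATE TABLE [dbo].[IncrementalLoadInfo]
(
    [TableName]    VARCHAR(100) NOT NULL,
    [LastLoadDate] DATETIME2(6) NOT NULL,
    [Comment]      VARCHAR(100) NULL  -- placeholder name, not from the thread
);
```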
2. Loading to Landing - copies the data from the origin to Landing (with Overwrite). The query against the source is (note `>` rather than `=`, so that all rows newer than the watermark are picked up):
SELECT * FROM [dbo].[@{item().TableName}]
WHERE Timestamp > '@{activity('Get last WaterMark').output.firstRow.LastLoadDate}'
3. Create Raw table - simply creates the same table in the warehouse (Raw/Bronze):
CREATE TABLE [dbo].[@{item().TableName}] AS
SELECT *
FROM [Landing].[dbo].[@{item().TableName}];
4. Loading from Landing to Raw - runs the Landing2Raw stored procedure
CREATE PROC [dbo].[Landing2Raw]
    @TableSource NVARCHAR(MAX),
    @TableTarget NVARCHAR(MAX),
    @Condition NVARCHAR(MAX)
AS
BEGIN
    BEGIN TRANSACTION;
    -- Build the "[column] = S.[column]" list from the Landing table's metadata
    DECLARE @SetColumns NVARCHAR(MAX) = '';
    SELECT
        @SetColumns = @SetColumns + QUOTENAME(name) + ' = S.' + QUOTENAME(name) + ', '
    FROM [Landing].[sys].[columns]
    WHERE object_id = OBJECT_ID(@TableSource);
    SET @SetColumns = LEFT(@SetColumns, LEN(@SetColumns) - 1); -- remove trailing comma
    -- Update rows that already exist in the target (the target must be the
    -- aliased table being updated, not a second self-joined reference)
    DECLARE @Update NVARCHAR(MAX) = N'UPDATE T' +
        ' SET ' + @SetColumns +
        ' FROM ' + @TableTarget + ' AS T' +
        ' INNER JOIN ' + @TableSource + ' AS S ON S.' + @Condition + ' = T.' + @Condition;
    EXEC sp_executesql @Update;
    -- Insert rows that are not yet in the target
    DECLARE @Insert NVARCHAR(MAX) = N'INSERT INTO ' + @TableTarget +
        ' SELECT * FROM ' + @TableSource + ' AS S WHERE NOT EXISTS (SELECT 1 FROM ' + @TableTarget + ' WHERE ' + @Condition + ' = S.' + @Condition + ')';
    EXEC sp_executesql @Insert;
    COMMIT TRANSACTION;
END
It takes the data from Landing (the just-uploaded batch) and inserts or updates the data in the Raw warehouse.
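As a hedged illustration, the pipeline's stored-procedure activity might invoke it like this (the table and key names are made up; the three-part source name matches the cross-warehouse query used when creating the Raw table):

```sql
-- Hypothetical call for a Customers table keyed on CustomerId.
EXEC [dbo].[Landing2Raw]
    @TableSource = N'[Landing].[dbo].[Customers]',
    @TableTarget = N'[dbo].[Customers]',
    @Condition   = N'[CustomerId]';
```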
5. Update WaterMark - updates the load date for the table in IncrementalLoadInfo. Here I use a stored procedure as well, because the table might not yet have a record for the table I'm loading:
CREATE PROC [dbo].[UpdateIncrementalLoadInfo]
    @TableName [VARCHAR](100),
    @CollectDate [DATETIME2](6)
AS
BEGIN
    IF EXISTS (SELECT 1 FROM [dbo].[IncrementalLoadInfo] WHERE [TableName] = @TableName)
    BEGIN
        UPDATE [dbo].[IncrementalLoadInfo]
        SET [LastLoadDate] = @CollectDate
        WHERE [TableName] = @TableName;
    END
    ELSE
    BEGIN
        INSERT [dbo].[IncrementalLoadInfo]
        SELECT @TableName, @CollectDate, NULL;
    END;
END
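The pipeline would then call this with the table name and the high-water mark of the batch just loaded, for example (values illustrative only):

```sql
-- Hypothetical call; in the pipeline, @CollectDate would come from the
-- maximum Timestamp of the copied batch rather than a literal.
EXEC [dbo].[UpdateIncrementalLoadInfo]
    @TableName   = 'Customers',
    @CollectDate = '2024-06-01 12:00:00';
```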
Hope this gives you the idea. Ping me if you have more questions.
Note: this approach doesn't handle data deletion (if you remove a record from your origin database)! For that, use soft deletes or consider a CDC pattern.
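A minimal sketch of the soft-delete idea, assuming the source table carries an IsDeleted flag and the same Timestamp column used for the watermark (column and table names here are illustrative, not from the thread):

```sql
-- Instead of physically deleting rows, the origin marks them deleted and
-- bumps the change timestamp, so the incremental query picks the change up
-- like any other update:
--   UPDATE [dbo].[Customers]
--   SET [IsDeleted] = 1, [Timestamp] = SYSUTCDATETIME()
--   WHERE [CustomerId] = @Id;

-- Downstream layers then filter out soft-deleted rows:
SELECT *
FROM [dbo].[Customers]
WHERE [IsDeleted] = 0;
```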
Hi @pmscorca
We haven't heard from you on the last response and were just checking back to see if you have a resolution yet.
If you have a resolution, please share it with the community, as it can be helpful to others.
Otherwise, reply back with more details and we will try to help.
Thanks.
Hi @pmscorca
We haven't heard from you on the last response and were just checking back to see if you have a resolution yet. If you have a resolution, please share it with the community, as it can be helpful to others.
If you have any question relating to the current thread, please let us know and we will try our best to help you.
If you have a question on a different issue, we request that you open a new thread.
Thanks.
Depending on your requirements for the source, the data structure, and availability, you can see whether it can be achieved within a pipeline using the UI; in scenarios requiring greater control and more complex mechanisms, you can use a dataflow. You can also use a pipeline with Spark code of your choice, which provides much more control for CI/CD. It all comes down to the mechanisms in place, the existing expertise available, and the path the organization would like to adopt.
Hi, for example, to ingest a customer table handled as an SCD2, reading data from an on-premises source (e.g. Oracle, CSV files).
Thanks