Solved: QUESTION::COPY DATA::INCREMENTAL APPEND

Element115 · ‎04-09-2024

Is there an ETA for when incremental append will become available for the Copy data activity? For both on-prem and cloud sources?

I am wondering if I should start implementing this by hand only to have to change all my pipelines in a couple of weeks.

Element115 · ‎04-11-2024

EDIT: re the discussion of the capitalization of the JSON key refering to the script return value, I just realized the system is using the alias name the user provides in the SELECT statement; I thought the generated JSON would simply contain a generic key named and spelled 'result' and not the actual SQL alias. Triple face palm moment.

@v-nikhilan-msft

Here is the solution. But first, let me lay it all out from the first step.

0__create a user defined stored procedure in the lakehouse SQL endpoint like so:

CREATE OR ALTER PROCEDURE [dbo].[usp_is_exist_table]
    @table_name VARCHAR(32) = '',
    @result     INT OUTPUT
AS
BEGIN
    IF OBJECT_ID(@table_name, 'U') IS NOT NULL
        SET @result = 1
    ELSE
        SET @result = 0
END

This procedure checks whether or not a table exists. Why? In order to implement incremental refresh within a pipeline, we need to know 1 of 2 things: Does the table exist, OR if it exists, is it not empty and thus there is a sequential index number to get and from which only we shall start loading new data from the source. We will use a pipeline variable later to store the name of the table for whose existence we want to check.

1__create the following pipeline flow:

Because I have a simple scenario, I can simply hard code 1 or 2 table names in their respective pipeline variable. But if this needs to be done for more tables than this, then of course, you would use a control flow loop activity looping over an array variable containing all the table names in the lakehouse (but given the pain of doing even something as simple as this incremental load, I don't think it's worth trying only to find out that suddenly I'm stuck doing even more beta testing on behalf of Microsoft).

2__configuring the pipeline variables: click on the pipeline design surface to make the pipeline config pane appear:

Since I only have 2 tables, intead of setting up a loop and so on, I opted to just hard code the table names in 2 distinct variables (since I already spent way too much time on this than should be necessary. It is left as an exercise for the time when I feel intrepid).

The variable latest_index is going to be used inside the If Condition activity.

3__configuring the Script activity:

For some reason, I can't get a scalar singleton, and so we are stuck with the script output being a result set. (If you figure it out, please let me know.) Note the last Select statement needs an alias or it won't work.

Also note the pipeline variable interpolation in the T-SQL script using @{...}, namely: @{variables('usp_input_1')}:

DECLARE @usp_output INT;
EXEC [dbo].[usp_is_exist_table] @table_name = @{variables('usp_input_1')}, @result = @usp_output OUTPUT;
SELECT @usp_output AS Result;

No script parameters defined in the GUI, and no Advanced settings either.

Also important to know is the JSON output by the Script activity on success because you will need to reference some of its keys to retrieve the output value of the script. It looks like this:

{
	"resultSetCount": 1,
	"recordsAffected": 0,
	"resultSets": [
		{
			"rowCount": 1,
			"rows": [
				{
					"Result": 1
				}
			]
		}
	],
	"outputParameters": {},
	"outputLogs": "",
	"outputLogsLocation": "",
	"outputTruncated": false,
	"executionDuration": 1
}

Note how the word 'result' is spelled: Result, and not result. The reason I mention this is that the JSON output from the other Script activity inside the If Condition activity returns a JSON similar to this one except that the word 'result' starts with a small cap instead of a capitalized 'R.' Yeah, go figure.

4__configuring the If Condition activity:

Click on the expression and the editor opens up showing this code:

@equals(activity('Check MUP_Status exists').output.resultSets[0].rows[0].Result, 1)

Note how after selecting the first option in the list of Activity outputs, we get

activity('Check MUP_Status exists').output

and then we use the '.' operator to reference the JSON structure

.resultSets[0].rows[0].Result

in order to get the output from the Script activity outside the If Condition activity.

5__configuring the Script activity titled 'Get latest status ID' inside the If Condition activity:

Again, the Select statement needs to alias the column returned.

As mentioned above, the JSON output from this Script activity is slightly different in terms of the capitalization of the result key, which is spelled 'result' and not 'Result' as shown below:

{
	"resultSetCount": 1,
	"recordsAffected": 0,
	"resultSets": [
		{
			"rowCount": 1,
			"rows": [
				{
					"result": 670826
				}
			]
		}
	],
	"outputParameters": {},
	"outputLogs": "",
	"outputLogsLocation": "",
	"outputTruncated": false,
	"executionDuration": 1
}

5__configuring the Set variable activity:

The expression (in green above) looks like this and is why I was saying pay attention to the fact that, even though the JSON structure is exactly the same as the first script output, the key 'result' is spelled differently and this is the spelling you must use as shown below (still banging my head against the wall about this one) :

@activity('Get latest status ID').output.resultSets[0].rows[0].result

Whatever the latest ID that has been ingested into the lakehouse for a particular table happens to be, it is stored in the latest_index variable which is used in the next activity, which is a Copy data activity using a T-SQL Select statement with a WHERE clause to filter for the most recently added data to the source DB. I'll leave that last step as a small exercise for the newbie to figure out.

In conclusion: What a mess!!! But there you have it. I hope the result of struggling for the better part of a day+ will prove helpful to the naive and hapless soul who still belabors under the illusion that using Fabric is an easy matter of click-click and we are done. If only!

View solution in original post

Element115 · ‎04-11-2024

EDIT: re the discussion of the capitalization of the JSON key refering to the script return value, I just realized the system is using the alias name the user provides in the SELECT statement; I thought the generated JSON would simply contain a generic key named and spelled 'result' and not the actual SQL alias. Triple face palm moment.

@v-nikhilan-msft

Here is the solution. But first, let me lay it all out from the first step.

0__create a user defined stored procedure in the lakehouse SQL endpoint like so:

CREATE OR ALTER PROCEDURE [dbo].[usp_is_exist_table]
    @table_name VARCHAR(32) = '',
    @result     INT OUTPUT
AS
BEGIN
    IF OBJECT_ID(@table_name, 'U') IS NOT NULL
        SET @result = 1
    ELSE
        SET @result = 0
END

This procedure checks whether or not a table exists. Why? In order to implement incremental refresh within a pipeline, we need to know 1 of 2 things: Does the table exist, OR if it exists, is it not empty and thus there is a sequential index number to get and from which only we shall start loading new data from the source. We will use a pipeline variable later to store the name of the table for whose existence we want to check.

1__create the following pipeline flow:

Because I have a simple scenario, I can simply hard code 1 or 2 table names in their respective pipeline variable. But if this needs to be done for more tables than this, then of course, you would use a control flow loop activity looping over an array variable containing all the table names in the lakehouse (but given the pain of doing even something as simple as this incremental load, I don't think it's worth trying only to find out that suddenly I'm stuck doing even more beta testing on behalf of Microsoft).

2__configuring the pipeline variables: click on the pipeline design surface to make the pipeline config pane appear:

Since I only have 2 tables, intead of setting up a loop and so on, I opted to just hard code the table names in 2 distinct variables (since I already spent way too much time on this than should be necessary. It is left as an exercise for the time when I feel intrepid).

The variable latest_index is going to be used inside the If Condition activity.

3__configuring the Script activity:

For some reason, I can't get a scalar singleton, and so we are stuck with the script output being a result set. (If you figure it out, please let me know.) Note the last Select statement needs an alias or it won't work.

Also note the pipeline variable interpolation in the T-SQL script using @{...}, namely: @{variables('usp_input_1')}:

DECLARE @usp_output INT;
EXEC [dbo].[usp_is_exist_table] @table_name = @{variables('usp_input_1')}, @result = @usp_output OUTPUT;
SELECT @usp_output AS Result;

No script parameters defined in the GUI, and no Advanced settings either.

Also important to know is the JSON output by the Script activity on success because you will need to reference some of its keys to retrieve the output value of the script. It looks like this:

{
	"resultSetCount": 1,
	"recordsAffected": 0,
	"resultSets": [
		{
			"rowCount": 1,
			"rows": [
				{
					"Result": 1
				}
			]
		}
	],
	"outputParameters": {},
	"outputLogs": "",
	"outputLogsLocation": "",
	"outputTruncated": false,
	"executionDuration": 1
}

Note how the word 'result' is spelled: Result, and not result. The reason I mention this is that the JSON output from the other Script activity inside the If Condition activity returns a JSON similar to this one except that the word 'result' starts with a small cap instead of a capitalized 'R.' Yeah, go figure.

4__configuring the If Condition activity:

Click on the expression and the editor opens up showing this code:

@equals(activity('Check MUP_Status exists').output.resultSets[0].rows[0].Result, 1)

Note how after selecting the first option in the list of Activity outputs, we get

activity('Check MUP_Status exists').output

and then we use the '.' operator to reference the JSON structure

.resultSets[0].rows[0].Result

in order to get the output from the Script activity outside the If Condition activity.

5__configuring the Script activity titled 'Get latest status ID' inside the If Condition activity:

Again, the Select statement needs to alias the column returned.

As mentioned above, the JSON output from this Script activity is slightly different in terms of the capitalization of the result key, which is spelled 'result' and not 'Result' as shown below:

{
	"resultSetCount": 1,
	"recordsAffected": 0,
	"resultSets": [
		{
			"rowCount": 1,
			"rows": [
				{
					"result": 670826
				}
			]
		}
	],
	"outputParameters": {},
	"outputLogs": "",
	"outputLogsLocation": "",
	"outputTruncated": false,
	"executionDuration": 1
}

5__configuring the Set variable activity:

The expression (in green above) looks like this and is why I was saying pay attention to the fact that, even though the JSON structure is exactly the same as the first script output, the key 'result' is spelled differently and this is the spelling you must use as shown below (still banging my head against the wall about this one) :

@activity('Get latest status ID').output.resultSets[0].rows[0].result

Whatever the latest ID that has been ingested into the lakehouse for a particular table happens to be, it is stored in the latest_index variable which is used in the next activity, which is a Copy data activity using a T-SQL Select statement with a WHERE clause to filter for the most recently added data to the source DB. I'll leave that last step as a small exercise for the newbie to figure out.

In conclusion: What a mess!!! But there you have it. I hope the result of struggling for the better part of a day+ will prove helpful to the naive and hapless soul who still belabors under the illusion that using Fabric is an easy matter of click-click and we are done. If only!

v-nikhilan-msft · ‎04-10-2024

Hi @Element115
Thanks for using Fabric Community.
As for now we do not have any official information regarding this feature. Any details regarding the new features will be mentioned here: Microsoft Fabric release plan documentation - Microsoft Fabric | Microsoft Learn

Hope this helps. Please let me know if you have any further questions.

v-nikhilan-msft · ‎04-11-2024

Hi @Element115
We haven’t heard from you on the last response and was just checking back to see if your got resolved.
Otherwise, will respond back with the more details and we will try to help.
Thanks

Element115 · ‎04-11-2024

@v-nikhilan-msft Well, I implemented the incremental append by hand, however not quite there yet as it is not clear how to access the output of a stored procedure inside a pipeline execution flow. Not even Copilot is useful in this regard as it keeps giving all sorts of answers that are non-answers.

In the meantime, if you know which expression one should use to set the value of a variable to the output from the Stored procedure activity, please let me know because the following don't seem to work, unless they return some JSON and you gotta know which key in that JSON to reference, which I don't since I couldn't find any documentation on this:

The stored procedure looks like this:

CREATE OR ALTER PROCEDURE [dbo].[usp_is_exist_table]

    @table_name VARCHAR(32) = '',
    @result     INT OUTPUT
AS
BEGIN
    IF OBJECT_ID(@table_name, 'U') IS NOT NULL
        SET @result = 1
    ELSE
        SET @result = 0
END
GO

v-nikhilan-msft · ‎04-11-2024

Hi @Element115
Please refer to these links:

How to process output from stored procedure in Azure Data Factory - Microsoft Q&A
Transform data by using the Stored Procedure activity - Azure Data Factory & Azure Synapse | Microso...

Hope this helps.

Element115 · ‎04-11-2024

Furthermore, even though I follow the instructions from the 2nd link you provided, the Script activity fails because of cast error from NVARCHAR to INT. Yet, this is how I specified the Script activity setting:

Of course, I also tried setting the result parameter to type Int64, didn't work.

So how is one supposed to do this? What are the exact and accurate steps to follow to capture the singleton ouptut of a stored procedure as a scalar value of T-SQL type INT?

Also, code used to call the stored procedure is as follows (not standard btw, but the system doesn't seem to mind)--please advise if a different syntax should be used:

EXEC [dbo].[usp_is_exist_table] table_name, result

Element115 · ‎04-11-2024

@v-nikhilan-msft

First. re your first link: if you look at my sp code, you will see it does not return a result set, but a scalar value, ie an integer equal to either 0 or 1. So, if my sp were to return a result set, I understand from the first link that I should use a Lookup activity instead of a 'Stored procedure' activity.

However, the second link states: 'When the stored procedure has Output parameters, instead of using stored procedure activity, use lookup acitivty and Script activity. Stored procedure activity does not support calling SPs with Output parameter yet.'

But that's not all. Further down, the writer shows the Azure Script activity UI and it does not match the UI we have in Fabric!! In Fabric, there is no 'Stored Procedure' option.

Screenshot 2024-04-11 174628.jpg

So how exactly are we supposed to do this, ie retrieve the stored procedure OUTPUT parameter value?

Now I am pissed off. What on earth is the point of having a 'Stored procedure' activity if OUTPUT params can't even be handled? I mean, seriously, this is so mindboggling that it is beyond unfathomable.

QUESTION::COPY DATA::INCREMENTAL APPEND

Helpful resources

New forum boards available in Real-Time Intelligence.

Fabric Monthly Update - May 2024

Jumpstart your career with the Fabric Career Hub

QUESTION::COPY DATA::INCREMENTAL APPEND

Helpful resources

New forum boards available in Real-Time Intelligence.

Fabric Monthly Update - May 2024