
arpost
Advocate V

Help! Is it possible to use Row_Number with Concurrent Inserts?

Greetings, community. I have a scenario where several stored procedures need to insert data into the same table. Since Fabric doesn't have an Identity/auto-increment solution yet, I am using the ROW_NUMBER trick (example here). This works fine when I can run sequential loads, but I now need to run numerous inserts into the same table at the same time.
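For context, here is a minimal sketch of the pattern I'm describing (the table and column names are illustrative, not my actual schema):

-- Illustrative only: dbo.DimItem (destination) and stg.Item (staged rows) are hypothetical names.
-- New keys start after the current MAX key in the destination table.
INSERT INTO dbo.DimItem (ItemKey, ItemName)
SELECT
    COALESCE(mx.MaxKey, 0) + ROW_NUMBER() OVER (ORDER BY s.ItemName) AS ItemKey,
    s.ItemName
FROM stg.Item AS s
CROSS JOIN (SELECT MAX(ItemKey) AS MaxKey FROM dbo.DimItem) AS mx;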

 

Since the ROW_NUMBER solution depends on getting the Max Key from the destination table, I believe concurrent inserts will conflict since two jobs may start at the same time and use the same max value to start the numbering. ☹️

 

Does anyone know of a way to ensure that duplicate numbers aren't generated? Right now, I'm not aware of any reliable way to do so.

12 REPLIES
v-nikhilan-msft
Community Support

Hi @arpost 
Thanks for using Fabric Community.
Just wanted to share some information that might help.
In Microsoft Fabric, you can generate unique identifiers in a warehouse table using a workaround technique:
Generate unique identifiers in a warehouse table in Microsoft Fabric - Microsoft Fabric | Microsoft ...

This method ensures that each row inserted into the table has a unique identifier, even when multiple stored procedures are inserting data into the same table concurrently.

Hope this helps. Please let me know if you have any further questions.


@v-nikhilan-msft, thank you for that link. The issue is this doesn't ensure unique identifiers. If two jobs run concurrently, they insert duplicate keys. This workaround gets the max key and uses that for the row-numbering.

 

In the example below, the yellow items were script 1 and the red were script 2 executed almost simultaneously:

 

[Image: arpost_0-1711125547936.png, query results showing overlapping keys generated by the two concurrent scripts]

Hi @arpost 
1) Have you considered a control table for storing and managing that key value?

 

2) I agree that for a concurrent load requirement you will need additional logic to maintain and check the high-water-mark value before the insert. If this is a batch process, you can load your data in parallel threads into a staging table and then run a single transform step that uses ROW_NUMBER() to build the identity column while inserting into the final table. This is a much simpler process and gives better throughput.

 

Please try this and let me know if the issue still persists.

Thanks.

Can you expound a bit more on both #1 and #2? Trying to think through how that'd work or solve the issue.

Hi @arpost 
#1
I presume you are working in a Warehouse. Given that delta tables support transactions (Transactions in Warehouse tables - Microsoft Fabric | Microsoft Learn), you can use a (control) table just to persist the max key, using whatever strategy you prefer.

Just a simple example: let's suppose you have a table ControlTable with a column id. The following stored procedure updates the id and returns the new value atomically, so every process must invoke this procedure to get its unique id.

CREATE PROC [dbo].[GetMaxId]
    @currentID INT OUTPUT
AS
BEGIN
    -- Start a new transaction
    BEGIN TRANSACTION;

    -- Set the transaction isolation level to SERIALIZABLE
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

    -- Get the current ID
    SELECT @currentID = id FROM dbo.ControlTable;

    -- Increment the ID
    SET @currentID = @currentID + 1;

    -- Update the table
    UPDATE dbo.ControlTable SET id = @currentID;

    -- Commit the transaction
    COMMIT TRANSACTION;
END
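
A caller would then reserve an id before its insert, something like this (illustrative usage, assuming dbo.ControlTable is seeded with a single row):

DECLARE @newId INT;

-- Each concurrent process calls the procedure to obtain its own id.
EXEC [dbo].[GetMaxId] @currentID = @newId OUTPUT;

-- @newId now holds a value no other caller should have received.
SELECT @newId AS ReservedId;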



Another option would be to use a GUID instead of a number, if a numeric key is not strictly required. You can get a unique identifier by running SELECT NEWID();

Hope this helps. Please let me know if you have any further questions.

Is the use of the Serializable isolation level above supported in Fabric, though? Per this doc, only snapshot isolation level is supported; all others are ignored at time of query execution:

 

https://learn.microsoft.com/en-us/fabric/data-warehouse/transactions#transactional-capabilities

 

I've thought about GUIDs, but my concerns are two-fold:

  1. Millions of records will require millions of GUIDs.
  2. I could be wrong, but wouldn't GUIDs complicate things like dropping tables and reinserting updated records, which is essential given that Fabric doesn't support ALTER TABLE?

 Hi @arpost

The internal team has replied as follows:
Good catch... it appears that the customer is employing an OLTP-like process to populate the warehouse. While warehouses prioritize performance over integrity, it might be beneficial for them to reassess their hydration process and make it sequential to avoid these conflicts. Here is another example, but I am afraid it may show the same issue under concurrent calls.
https://learn.microsoft.com/en-us/fabric/data-warehouse/generate-unique-identifiers

I would recommend the approach I mentioned earlier, in which the stored procedures load concurrently into a staging table and then a CTAS (historical load) or INSERT (incremental load) moves the data into the final table with ROW_NUMBER.
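
As a rough sketch of what I mean (stg.Sale and dbo.FactSale are placeholder names, not your schema), the historical load via CTAS would be a single serialized step:

-- Historical load: build the final table once, with keys 1..N assigned in one statement.
CREATE TABLE dbo.FactSale AS
SELECT
    ROW_NUMBER() OVER (ORDER BY s.SaleDate) AS SaleKey,
    s.SaleDate,
    s.Amount
FROM stg.Sale AS s;

Incremental loads would then use a single INSERT...SELECT that offsets ROW_NUMBER() by the current MAX(SaleKey) in dbo.FactSale.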

Hope this helps.

I appreciate the reply, but that approach still introduces a lot of complications around job sequencing. Do we know if/when Microsoft is planning to add some kind of identity feature?

 

There are often cases where you need primary keys (PKs) generated so you can use those PKs as foreign keys (FKs) in dependent tables. The lack of an auto-generated identity means all of your pipelines have to be built so that they insert into the PK table, wait until all other loads complete, and THEN resume with inserts into the dependent tables that need the PK as an FK. To make matters worse, that means unrelated jobs have to wait on each other (File Type A's load is held up by File Type B's load).

 

Typically, any given pipeline would run through steps of a process for a particular ELT job. So the logic would be something like this:

  1. Raw data would be staged via Copy activity into DW.
  2. "Parent" dimension records would be inserted.
  3. "Child" dimension records would be inserted with FK references to the PK for #2.
  4. Dimension keys for #2 and #3 would be assigned to staged data.
  5. Data with keys would be loaded from stage to final DW tables.

Having to assign PKs at the end for the final table load introduces a huge amount of overhead if you have, say, 15 different data feeds that all insert into the same tables for #2 and #3.

@v-nikhilan-msft, any update on this? Also, I saw mentioned somewhere that some kind of Identity/Primary Key functionality was planned for 2024. Can you confirm this is in Microsoft's scope? I didn't see this listed in any release plan docs.

Hi @arpost
We haven't heard back from you on the last response and wanted to check whether your query has been resolved. If not, please reply and we will follow up with more details to help.
Thanks
