pace-jp
Frequent Visitor

Copy Activity from REST API Duplicating Results

I created a pipeline that copies data from a REST API source to a Lakehouse table. The API uses pagination to loop through 8 pages (50 records each) and returns 384 records, sorted by a unique ID field. However, the pipeline is currently writing 18,273 records to that table each time it runs; individual records are duplicated anywhere between 0 and 700 times. The destination is set up to overwrite data, but I've also dropped the table before rerunning just to make sure it wasn't appending by accident. The results are consistent every time. I assumed it was a problem with the way my pagination was configured, but if it were a pagination issue I'd expect to see some uniformity in the record counts (i.e., similar IDs duplicated the same number of times), and there really is none.


Below is a count of records by unique ID:

[Chart: record count per unique ID]


To check if it was indeed a pagination issue, I changed the destination to a JSON file in the same lakehouse. The results were written to the file correctly without any duplicates. The ids were in order and there were 8 pages each with 50 results as expected. So it must be an issue with the process of writing data to a table.

I did see one Stack Overflow question (https://stackoverflow.com/questions/79278758/duplicate-values-after-copy-activity-in-fabric-pipeline) from someone having the same issue about 20 days ago, but there wasn't a solution posted there either. Is this a known bug? Is there something I should be looking for in the mapping schema (I'm using the default "import schema"), since that is the only difference between writing to a file and writing to a table? Thanks!

2 ACCEPTED SOLUTIONS
v-veshwara-msft
Community Support

Hi @pace-jp,
Thanks for using the Microsoft Fabric Community for posting your query.

Thanks for sharing the chart. The fact that some IDs are duplicated up to 700 times while others aren't duplicated at all suggests this is more than just a pagination issue; something seems to be going wrong when the data is written to the Lakehouse table. Here are a few things you can check to figure out what's happening:

  1. Check the Default Schema Mapping:

    Make sure the table schema (columns and data types) matches what's coming from the API. Pay attention to the unique ID field; if it isn't handled properly in the mapping, it could lead to duplicates.
  2. Parallel Processing:

    Check whether your pipeline is processing or writing data in parallel. If it is, this might cause duplicate rows to be written.
  3. How the Data Is Written:

    Look at how the pipeline writes data to the table. For example, is it writing in batches or one row at a time? Sometimes retries or errors in batch writes can result in duplicates.
  4. Test with a Smaller Dataset:

    Change your pagination to pull just 1 or 2 pages (e.g., 100 records) and see how the table behaves. A smaller dataset will make it easier to spot what's going wrong.
  5. Clean Up the Table:

    You can add a step to remove duplicates from the table after the data is written, using the unique ID field to keep only one copy of each row (see the sketch after this list).
  6. Test Writing the JSON Data:

    Take the JSON output (which worked fine) and try writing it to the Lakehouse table directly. If that also causes duplicates, it confirms the problem lies in the table write process.
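
For step 5, a minimal PySpark dedup sketch might look like the following; the table name api_records and the ID column id are placeholders, not the poster's actual schema:

```python
# Minimal dedup sketch for step 5 -- "api_records" and "id" are
# placeholder names; substitute your actual table and ID column.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.table("api_records")

# Keep exactly one row per unique ID.
deduped = df.dropDuplicates(["id"])

# Write to a separate table; overwriting the table you are reading
# from in the same job can fail in Spark.
deduped.write.mode("overwrite").saveAsTable("api_records_clean")
```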
     

If these workarounds don't resolve the issue, it may indicate a bug. Consider reaching out to Microsoft by raising a support ticket.

If this post helps, please accept it as a solution to help others benefit, and a kudos would be appreciated.
Please reach out if you need further assistance.

Regards,
Vinay.




mattiasdesmet
Resolver II

I observed very similar behavior, and I believe it's a bug in the Fabric Copy Activity: it creates Cartesian products of all arrays in the JSON file. I did find a workaround and described it in detail here: Fabric : Hidden Collection Reference in Copy Activity - Mattias De Smet

Hope it helps!
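
To make the Cartesian-product effect concrete, here is a small hypothetical illustration (plain Python, invented field names): a record containing two independent arrays fans out into len(tags) × len(scores) rows when both arrays are flattened against each other instead of via a single collection reference.

```python
# Hypothetical illustration of the fan-out: one record with a
# 3-element "tags" array and a 2-element "scores" array becomes
# 3 * 2 = 6 rows when both arrays are expanded together.
from itertools import product

record = {"id": 1, "tags": ["a", "b", "c"], "scores": [10, 20]}

rows = [
    {"id": record["id"], "tag": t, "score": s}
    for t, s in product(record["tags"], record["scores"])
]

print(len(rows))  # 6 rows from a single source record
```

Because the multiplier depends on the array lengths in each record, the duplication count varies record by record, which matches the wildly uneven counts in the original chart.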


4 REPLIES

Wow, that is incredibly helpful and a great find. Thank you for sharing!

pace-jp
Frequent Visitor

Thanks for the help, @v-veshwara-msft. It seems to have something to do with a couple of nested arrays in the response leading to a combinatoric explosion of records. I'm going to stick with dropping it into a JSON file and working with the results in a more controlled manner (PySpark and T-SQL).
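
For anyone taking the same route, a sketch of that JSON-first approach in PySpark might look like this; the file path Files/api_output.json and the columns id and items are illustrative, not from the original pipeline:

```python
# Sketch of the JSON-first workaround -- the path and column names
# ("Files/api_output.json", "id", "items") are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode_outer

spark = SparkSession.builder.getOrCreate()

raw = spark.read.option("multiline", "true").json("Files/api_output.json")

# Expand one nested array deliberately with explode_outer, instead of
# letting the table sink cross-join every array at once.
flat = raw.select("id", explode_outer("items").alias("item"))

flat.write.mode("overwrite").saveAsTable("api_records")
```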

