
Stevengood
Frequent Visitor

Warehouse records missing after Dataflow Inserts but there are no Delete or Truncate queries

I have a data warehouse table that seems to have had all of its existing records removed during the first scheduled run of a new data pipeline.

 

I have a Data Pipeline that I have triggered manually with the expected results, but after setting it to run on a schedule, one of the tables was left with only the 14,400 new records plus 7,500 records that were in a staging lakehouse:

Stevengood_4-1738875506380.png

The activities that are listed above are numbered here to help understand the process:

Stevengood_5-1738875534140.png

The process is designed to:

  • The first two dataflows:
    • Make an OData connection, retrieve specific data, and replace the existing records in each table with the new ones
    • Destination: Warehouse
    • Method: Replace
  • The third dataflow connects to an OData endpoint and returns the top 14,400 records ordered by StartTime descending, so that I get the most recent hourly data.
    • Destination: Lakehouse
    • Method: Append
  • The fourth activity triggers another data pipeline.
    • The child data pipeline gets all of those device/circuit records and iterates through the results in a ForEach activity with an inner Copy activity.
      • The Copy activity makes an OData connection and copies each iteration's response records into the Lakehouse.
  • Next I refresh the lakehouse semantic model. I don't think this is strictly necessary, but it seemed reasonable to do before the next dataflow.
  • The next dataflow gets the new records from the lakehouse and the existing records from the warehouse, and creates a delta table via a left anti join so that I only get new records. (I also do some transformations in there so that both sides are compared on the same composite key; see the T-SQL sketch after this list.)
    Stevengood_6-1738875924478.png
    • Delta_HourlyData:
      • Destination: Warehouse
      • Method: Append
    • Delta_MinutelyData:
      • Destination: Warehouse
      • Method: Append
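
For clarity, the delta logic in that final dataflow is equivalent to something like the following T-SQL run against the warehouse. This is only a sketch: LakehouseName, the table names, and the composite-key columns (DeviceId, StartTime) are simplified placeholders for my actual schema.

    -- Rough equivalent of the dataflow's left anti join: rows in the lakehouse
    -- that do not yet exist in the warehouse. All names are placeholders.
    SELECT  lh.DeviceId,
            lh.StartTime,
            lh.Value
    FROM    LakehouseName.dbo.HourlyData AS lh      -- new records landed by the copy activities
    WHERE   NOT EXISTS (
                SELECT 1
                FROM   dbo.HourlyData AS wh         -- existing warehouse records
                WHERE  wh.DeviceId  = lh.DeviceId
                AND    wh.StartTime = lh.StartTime  -- same composite key the dataflow joins on
            );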

 

I have been looking at forums, doing training, and watching videos enough to know that I am probably doing something wrong with my architecture in the first place, but having to make OData connections has really restricted what I could do here. I am not sure how to utilize OneLake in order to automatically create Delta tables.

 

But here is my main problem. Here are the record counts in the Warehouse from yesterday:

Stevengood_7-1738876220952.png

And after the above dataflow ran overnight I have been left with:

Stevengood_8-1738876280850.png

I should have 151,729 records for hourly.

 

I reviewed the Warehouse's queryinsights.exec_requests_history for the window when this pipeline executed. The only DELETE statements are for the CircuitInfo and DeviceInfo tables from the first two dataflows. I don't see any TRUNCATE queries, only the INSERT queries. Also of note, the INSERT statement for HourlyData is sourced from DataflowsStagingLakehouse1, and these are the 7,500 records, which is relatively close to the 7,080 new records I would expect to have been written. The 7,500 from DataflowsStagingLakehouse1 plus, presumably, the 14,400 from the third dataflow get us to the 21,900 records that are now in the Hourly table of the Warehouse.
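
For anyone who wants to run the same kind of check, a query along these lines is what I mean; the column names are the ones queryinsights exposes as far as I know, the date literal is a placeholder for your own run window, and the match is on the statement text, so swap in your table name.

    -- Look for any statements that touched HourlyData during the scheduled run.
    -- The date literal is a placeholder; narrow it to the pipeline's run window.
    SELECT  start_time,
            row_count,
            command
    FROM    queryinsights.exec_requests_history
    WHERE   command LIKE '%HourlyData%'
      AND   ( command LIKE '%DELETE%' OR command LIKE '%TRUNCATE%' OR command LIKE '%INSERT%' )
      AND   start_time >= '2025-02-06'
    ORDER BY start_time;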

 

I double-checked that dataflow to make sure it was in fact set to append to the Lakehouse and not replace the Warehouse, and it is:

Stevengood_9-1738876931721.png

 

I'm really hoping someone smarter than me can help me figure out what is happening here!


8 REPLIES
v-kpoloju-msft
Community Support

Hi @Stevengood,
Thank you for contacting the Microsoft Fabric Community.

 

We understand that you are facing data discrepancies due to the Replace and Append settings in your Dataflows. Please go through the following steps to troubleshoot and resolve this issue.

 

  • Ensure that your Replace dataflows are not running after your Append dataflows, as this may overwrite records in the Data Warehouse. Since two dataflows use Replace, please double-check that this does not accidentally delete required records before the append operation runs.
  • Use Power Query to confirm that all expected records exist before they are inserted into Fabric. The Delta_HourlyData and Delta_MinutelyData queries use a left anti join to filter out existing records. If this filtering is incorrect, records may be missing from the final dataset (see the count check after this list).
  • If manual execution works but scheduled runs fail, check for overlapping schedules or delays between dependent jobs. To automatically manage Delta tables in OneLake, consider using Autoloader instead of manual appending. Direct Lake mode can also facilitate faster data retrieval in Power BI.
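
As a quick way to do that comparison, a minimal count check can be run from the warehouse, assuming the lakehouse SQL endpoint is in the same workspace so it can be referenced by name; LakehouseName and the table names below are placeholders.

    -- Compare how many rows each side holds before and after the scheduled run,
    -- and check the totals against what you expect (e.g. the 151,729 hourly records).
    SELECT 'Lakehouse' AS source, COUNT(*) AS row_count
    FROM   LakehouseName.dbo.HourlyData
    UNION ALL
    SELECT 'Warehouse', COUNT(*)
    FROM   dbo.HourlyData;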

If this post helps, then please give us 'Kudos' and consider accepting it as a solution to help the other members find it more quickly.


Hope this works for you.
Thank you.

Hi @Stevengood,

 

May I ask if you have resolved this issue? If so, please mark the helpful reply and accept it as the solution. This will be helpful for other community members who have similar problems to solve it faster.

 

Thank you.

Hi @v-kpoloju-msft,

 

Not quite the answer to the issue, but it did give me something to think about. In my case my Replace and Append dataflows were all mutually exclusive, so I could rule that out. The queries were showing the results that I was looking for in the previews, but I think the reason the existing records were being deleted was that the existing records in the warehouse did not exist in the lakehouse. I had set up the lakehouse after those records were already in the warehouse.

 

The day after experiencing this, I imported all of the records into the lakehouse and then let the process run as scheduled.  The new records came through as expected.

 

However, the process ended up using more compute than I could afford.

 

After doing some more research, I have actually opted to remove the lakehouse steps from the infrastructure entirely and only move data from the source to the Data Warehouse. I have also opted to remove some of the dataflows from my pipeline in general; they were using more CU than I think was needed. I have instead opted to use Web activities to get the data that will be iterated through in the ForEach loops.

And while perhaps crude, I have now opted to just run a script to delete duplicate records after any import happens.  This achieves the same result that I had wanted:

Stevengood_0-1739371467911.png

Stevengood_1-1739371499260.png

I would have preferred to use a pre-copy script during the Copy activity, but part of the composite key for HourlyData and MinutelyData actually comes from the source (OData connection) of that copy activity. If there were a way to access the Copy activity's source records, then I could perhaps filter out any existing records there in order to limit the number of writes to the warehouse. But I also figure that would require reading the records from the warehouse and doing a bunch of compute to filter them out. So I think that just copying the full set of records in and deleting any duplicates will require less compute and achieve what I am looking for.
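
For anyone doing the same, here is a rough sketch of the kind of de-duplication script I mean. The composite-key columns (DeviceId, StartTime) are placeholders for the real key, and the delete-through-a-CTE pattern is standard T-SQL, so confirm it behaves the same on your warehouse before relying on it.

    -- Keep one row per composite key, delete the rest.
    -- DeviceId/StartTime stand in for the actual composite key columns.
    WITH ranked AS (
        SELECT  ROW_NUMBER() OVER (
                    PARTITION BY DeviceId, StartTime   -- the composite key
                    ORDER BY     (SELECT NULL)         -- no preference among duplicates
                ) AS rn
        FROM    dbo.HourlyData
    )
    DELETE FROM ranked
    WHERE  rn > 1;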

Hi @Stevengood,

 

After thoroughly reviewing the details you provided, here are a few alternative workarounds that may help resolve the issue. Please follow these steps:

 

  • Configure incremental refresh in dataflows to insert only new or updated records into the warehouse, reducing unnecessary writes. Enable incremental refresh in Power Query and set up rules like "Detect changes" using a timestamp column.
  • Use a staging table for filtering existing records and modify the Copy Activity to load data into it first. Then, use a MERGE SQL statement to insert only new records, reducing compute overhead (see the sketch after this list).
  • Leverage DirectQuery mode in Power BI to fetch only the required data at query time, minimizing duplication issues. If the OData source supports change tracking, retrieve only new or updated records to avoid post-copy cleanup.
  • Optimize Power Query transformations by reducing unnecessary steps, scheduling dataflows during off-peak hours, and using native SQL queries where possible.

If this post helps, please give us ‘Kudos’ and consider accepting it as a solution to help other members find it more quickly.

 

Best regards,

Hi @Stevengood,

 

May I ask if you have resolved this issue? If so, please mark the helpful reply and accept it as the solution. This will be helpful for other community members who have similar problems to solve it faster.

 

Thank you.

 

Hello @v-kpoloju-msft,

I appreciate the help. Ultimately, the limitations of the OData source and the compute cost of that many copy activities far exceeded our capacity limits.
In the end, we went with our initial plan of directly connecting to the database that the OData endpoint was serving. We originally did not do this because there was a policy to limit direct connections to this database, but the volume of data we are transferring really requires this direct connection. (The entire pipeline now runs in about 2 minutes and costs very little in CU, so that's great.)

One option we considered was creating a mirrored database connection, but this would require service provider credentials to be added to the server. We did not move forward with that because of uncertainty about whether adding the service provider credential, which applies to the whole server and not at the database level, would wipe out our existing authentication methods. There are several production databases on that server that we do not want to impact.

If these other methods were possible with lower CU costs we would prefer to explore them, but at the moment we are satisfied with the direct connection to the DB.

Hi @Stevengood,

 

After thoroughly reviewing the details you provided, I have identified alternative workarounds that may help resolve the issue. Please follow these steps:

 

  • Use Azure Data Factory or Synapse Pipelines instead of Dataflows for built-in optimizations like parallel copy, compression, and batch processing.
  • Prevent data duplication by using Change Tracking (CT) or Change Data Capture (CDC) to fetch only new or updated records. Utilize MERGE in SQL for incremental changes. If OData supports $delta or $filter, use these queries to retrieve only modified data. A Delta Lake table in OneLake can also help reduce unnecessary writes.
  • Improve performance by handling joins and transformations in SQL rather than Power Query. Partition large tables (e.g., by date-time) for better query efficiency. Apply filters at the source using WHERE clauses in SQL instead of filtering within Power BI to reduce data load.
  • If a direct database connection is restricted, use a private link or create a read-replica with different access policies. Optimize OData queries with Azure API Management (APIM) to cache responses and reduce redundant queries.
  • Prevent duplicates at the source using SQL constraints like ON CONFLICT (PostgreSQL) or MERGE (SQL Server). Alternatively, use a pre-copy script in Copy Activity to filter duplicates before loading data.
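
Regarding the last point, a pre-copy script on the Copy Activity's warehouse sink can be as simple as clearing the window that is about to be reloaded, so the copy cannot create duplicates. The table name and the two-day window below are placeholders; match the window to whatever the copy actually reloads.

    -- Run as the Copy Activity's pre-copy script: remove the rows the copy is
    -- about to re-insert. Table name and window are placeholders.
    DELETE FROM dbo.HourlyData
    WHERE  StartTime >= DATEADD(DAY, -2, SYSUTCDATETIME());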

If this post helps, then please give us 'Kudos' and consider accepting it as a solution to help the other members find it more quickly.

 

Thank you.

Hi @Stevengood,

 

May I ask if you have resolved this issue? If so, please mark the helpful reply and accept it as the solution. This will be helpful for other community members who have similar problems to solve it faster.

 

Thank you.
