I have a data warehouse table that seems to have had all of its existing records removed during the first scheduled run of a new data pipeline.
I have a Data Pipeline that I have successfully triggered manually with the expected results... but after setting it to run on a schedule, one of the tables only had the 14,400 new records along with 7,500 records that were in a staging lakehouse:
The activities listed above are numbered here to help explain the process:
The process is designed to:
I have been looking at forums, doing training, and watching enough videos to know that I am probably doing something wrong with my architecture in the first place... but having to make OData connections has really restricted what I could do here. I am not sure how to use OneLake in order to automatically create Delta tables.
But here is my main problem. Here are the record counts in the Warehouse from yesterday:
And after the above dataflow ran overnight I have been left with:
I should have 151,729 records for hourly.
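A count check of roughly this shape is enough to compare those totals before and after a run (dbo.HourlyData and dbo.MinutelyData below are placeholder names for the Warehouse tables involved):

```sql
-- Hedged sketch: compare Warehouse row counts before and after the scheduled run.
-- dbo.HourlyData and dbo.MinutelyData are placeholder names for the tables in question.
SELECT 'HourlyData'   AS table_name, COUNT(*) AS row_count FROM dbo.HourlyData
UNION ALL
SELECT 'MinutelyData' AS table_name, COUNT(*) AS row_count FROM dbo.MinutelyData;
```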
I reviewed the Warehouse's queryinsights.exec_requests_history for when this pipeline executed. The only Delete statements are for the CircuitInfo and DeviceInfo tables from the first two dataflows. I don't see any Truncate queries, only the Insert queries. Also of note, the Insert statement for HourlyData is sourced from DataflowsStagingLakehouse1, and these are the 7,500 records. That is relatively close to the 7,080 new records I would expect to have been written. The 7,500 from DataflowsStagingLakehouse1, plus presumably the 14,400 from the third dataflow, get us to the 21,900 records that are in the Hourly table of the Warehouse.
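For reference, a query along these lines against the Warehouse's queryinsights.exec_requests_history view will surface any DELETE or TRUNCATE statements from the run window (the start_time and command column names are assumed here, and the date filter is only a placeholder):

```sql
-- Hedged sketch: list DELETE/TRUNCATE statements captured around the scheduled run.
-- Column names (start_time, command) are assumed from queryinsights.exec_requests_history;
-- adjust the filter to match the actual run window.
SELECT start_time,
       command
FROM   queryinsights.exec_requests_history
WHERE  start_time >= DATEADD(day, -1, GETDATE())
  AND  (command LIKE '%DELETE%' OR command LIKE '%TRUNCATE%')
ORDER BY start_time;
```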
I double checked that dataflow to make sure it was in fact set to append to the Lakehouse and not replace the Warehouse and it is:
I'm really hoping someone smarter than me can help me figure out what is happening here!
Hi @Stevengood,
Thank you for contacting the Microsoft Fabric Community.
We understand that you are facing data discrepancies due to the Replace and Append settings in your Dataflows. Please go through the following steps to troubleshoot and resolve this issue.
If this post helps, then please give us 'Kudos' and consider accepting it as a solution to help other members find it more quickly.
Hope this works for you.
Thank you.
Hi @Stevengood,
May I ask if you have resolved this issue? If so, please mark the helpful reply and accept it as the solution. This will help other community members with similar problems solve them faster.
Thank you.
Hi @v-kpoloju-msft,
Not quite the answer to the issue, but it did give me something to think about. In my case, my Replace and Append dataflows were all mutually exclusive, so I could rule that out. The queries were showing the results I was looking for in the previews... but I think the reason the existing records were being deleted was that all of the existing records in the warehouse didn't exist in the lakehouse. I had set up the lakehouse after those records were already in the warehouse.
The day after experiencing this, I imported all of the records into the lakehouse and then let the process run as scheduled. The new records came through as expected.
However, the process ended up using more compute than I could afford.
After doing some more research, I have actually opted to remove the lakehouse steps from the infrastructure entirely and only move data from the source to the Data Warehouse. I have also opted to remove some of the dataflows from my pipeline in general; they were using more CU than I think was needed. I have instead opted to use Web activities to get the data that will be iterated through in the ForEach loops.
And while perhaps crude, I have now opted to just run a script to delete duplicate records after any import happens. This achieves the same result that I had wanted:
I would have preferred to use a Pre-copy script in the Copy activity, but part of the composite key for HourlyData and MinutelyData actually comes from the source (OData connection) of that Copy activity. If there were a way to access the Copy activity's source records, I could perhaps filter out any existing records there to limit the number of writes to the warehouse. But I also figure that would require reading the records from the warehouse and doing a bunch of compute to filter them out. So I think that just copying the full set of records in and deleting any duplicates will require less compute and achieve what I am looking for.
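A minimal sketch of that kind of post-import cleanup, assuming the Warehouse accepts the usual updatable-CTE delete pattern, might look like this (dbo.HourlyData and the key columns DeviceId and ReadingTime are hypothetical placeholders for the real table and composite key):

```sql
-- Hedged sketch: remove duplicate rows after an import, keeping one row per composite key.
-- dbo.HourlyData, DeviceId and ReadingTime are hypothetical placeholders, and this assumes
-- the Warehouse supports deleting through an updatable CTE.
WITH Ranked AS
(
    SELECT ROW_NUMBER() OVER
           (
               PARTITION BY DeviceId, ReadingTime   -- the composite key; same key = duplicate
               ORDER BY (SELECT NULL)               -- no preference, duplicate rows are identical
           ) AS rn
    FROM dbo.HourlyData
)
DELETE FROM Ranked
WHERE rn > 1;   -- keep one row per key, delete the rest
```

Running something like this as a script step right after the copy keeps the whole cleanup inside the pipeline.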
Hi @Stevengood,
After thoroughly reviewing the details you provided, here are a few alternative workarounds that may help resolve the issue. Please follow these steps:
If this post helps, please give us ‘Kudos’ and consider accepting it as a solution to help other members find it more quickly.
Best regards,
Hi @Stevengood,
May I ask if you have resolved this issue? If so, please mark the helpful reply and accept it as the solution. This will help other community members with similar problems solve them faster.
Thank you.
Hello @v-kpoloju-msft,
I appreciate the help. Ultimately, the limitations of the OData source and the compute cost of that many Copy activities just far exceeded our capacity limits.
In the end we went back to our initial plan of directly connecting to the database that the OData endpoint was serving. We originally did not do this because there was a policy to limit direct connections to this database, but the volume of data we are transferring really requires the direct connection. (The entire pipeline now runs in about 2 minutes and costs very little in CU, so that's great.)
One option we considered was creating a mirrored database connection, but this would require service provider credentials to be added to the server. We did not move forward with that because of some uncertainty about whether adding the service provider credential, which applies to the whole server and not at the database level, would wipe out our existing authentication methods. There are several production databases on that server that we do not want to impact.
If these other methods were possible with lower CU costs, we would prefer to explore them... but at the moment we are satisfied with the direct connection to the DB.
Hi @Stevengood,
After thoroughly reviewing the details you provided, I have identified alternative workarounds that may help resolve the issue. Please follow these steps:
If this post helps, then please give us 'Kudos' and consider accepting it as a solution to help other members find it more quickly.
Thank you.
Hi @Stevengood,
May I ask if you have resolved this issue? If so, please mark the helpful reply and accept it as the solution. This will help other community members with similar problems solve them faster.
Thank you.