I have multiple Dataflow Gen2 flows that insert data into the same Lakehouse table. Before inserting the data, I would like to truncate the Lakehouse table. How can I do that? The Lakehouse table already has 40 million rows.
Notebook is the only option to drop the table, right?
What are the repercussions downstream if the destination table doesn't exist? E.g. I'm thinking that dropping the table beforehand and then recreating it via the dataflow may have less overhead.
Notebook is the only option to drop the table, right?
Yep, I think so. But again, a Gen2 dataflow set to "replace" will truncate the table before it starts loading the new data.
Hope this helps!
Scott
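For anyone who does need the notebook route, here is a minimal sketch of what that cell could look like, assuming a Fabric notebook attached to the target Lakehouse and a Delta table named lakehouse_table (placeholder name):

```python
# Minimal sketch - assumes a Fabric notebook attached to the target Lakehouse,
# and a Delta table named "lakehouse_table" (placeholder name).

# Option 1: drop the table entirely; the next dataflow/pipeline load recreates it.
spark.sql("DROP TABLE IF EXISTS lakehouse_table")

# Option 2: keep the table definition and just remove all rows (Delta DELETE).
spark.sql("DELETE FROM lakehouse_table")
```

Either cell can be run from a notebook activity in a pipeline before the dataflows start.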
Just saw this thread pop up again, and I will say that we 100% don't use gen2 dataflows any more. Months ago (March or April 2024 I think) Microsoft released the capability for a copy activity in a pipeline to source on-prem data.
All of our copy jobs use this now, and with dynamic partitioning we're able to load data very quickly - I can load 10+ billion rows of data in a few hours. And we don't have to worry about updating a ton of gen2 dataflows when something changes.
Hope this helps,
Scott
@AndyDDC not really sure - I decided to keep as few moving parts as possible. Didn't really notice any issues with using the "replace" option to clear out the data. That job didn't seem to run abnormally long or anything.
Hope this helps!
Scott
That's exactly what I'm doing - running multiple jobs in parallel to populate a large table (in my case just over 1 billion rows). The first Gen2 dataflow does a "replace", which truncates the data, and then loads its slice. After it's done, I launch the other Gen2 dataflows, all of which append in parallel.
Hope this helps,
Scott
Let me be precise. In one Dataflow Gen2 I have multiple tables. When the Dataflow Gen2 refresh triggers, all the tables within this flow start inserting data into the same Lakehouse table in parallel. I made this decision because my source table has 40 million rows, so I divided this huge table into chunks. I am thinking that we would need to spin up a Spark cluster and run a PySpark script to wipe out all the rows in the Lakehouse table before inserting the data, but I am not sure what script to write in the notebook. I also don't want log files kept for tracing back - the data will always be available in the source system.
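A minimal sketch of that notebook script, assuming the Lakehouse table is named lakehouse_table (placeholder) and the notebook is attached to the same Lakehouse. The VACUUM step is optional and only needed if you also want the old data files removed rather than kept for time travel:

```python
# Minimal sketch - assumes a Fabric notebook attached to the Lakehouse and a
# Delta table named "lakehouse_table" (placeholder name).

# Remove all rows; the table definition stays in place for the parallel appends.
spark.sql("DELETE FROM lakehouse_table")

# Optional: also remove the old underlying data files so they can't be read back
# via time travel. Disabling the retention check is required for RETAIN 0 HOURS
# and should be used with care (it breaks time travel and can affect concurrent readers).
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM lakehouse_table RETAIN 0 HOURS")
```

This could run from a notebook activity at the start of the pipeline, before the append dataflows kick off.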
I just create a copy of the dataflow and set the method to "replace" instead of append. Run that first, and then have the other "Append" dataflows after that (using pipelines can really help keep this straight)
Hope this helps!
Scott
p.s. what I completely DON'T know how to do is delete "some" of the data out. For example "I accidentally loaded Jan 2023 twice, I need to delete just that month and then reload it once". No clue on that - maybe partitioning and deleting the partition files, but that doesn't help if the tables aren't partitioned by the slice you want to delete
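For what it's worth, since Lakehouse tables are Delta tables, a notebook can delete just a slice with a plain predicate, with no partitioning required. A minimal sketch, assuming a table named lakehouse_table with a date column named order_date (both placeholder names):

```python
# Minimal sketch - assumes a Delta Lakehouse table named "lakehouse_table"
# with a date column named "order_date" (placeholder names).
# Delta supports DELETE with an arbitrary predicate, so the table does not
# need to be partitioned by the slice being removed.
spark.sql("""
    DELETE FROM lakehouse_table
    WHERE order_date >= '2023-01-01'
      AND order_date <  '2023-02-01'
""")
```

After the delete, that month can be reloaded with an append dataflow or copy activity.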
Thanks for adding your thoughts and the workaround.