Dinesh09
Frequent Visitor

How to truncate a lakehouse table.

I have multiple Dataflow Gen2 dataflows inserting data into the same lakehouse table. Before inserting the data, I would like to truncate the lakehouse table. How can I do that? The lakehouse table already has 40 million rows.

1 ACCEPTED SOLUTION

Notebook is the only option to drop the table, right?


9 REPLIES
AndyDDC
Most Valuable Professional

What are the repercussions if the destination table didn't exist downstream? E.g., I'm thinking that dropping the table beforehand and then recreating it via the dataflow may have less overhead.

Notebook is the only option to drop the table, right?

Yep, I think so. But again, a Gen2 dataflow set to "replace" will truncate the table before it starts loading the new data.

Hope this helps!

Scott
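
For reference, if you do go the notebook route, the drop itself is a one-liner in Spark SQL. A minimal sketch, with a placeholder table name (the dataflow, or an explicit CREATE TABLE, would then need to recreate the table on its next run):

# Run from a Fabric notebook attached to the lakehouse, where a
# `spark` session is already available. Table name is a placeholder.
spark.sql("DROP TABLE IF EXISTS my_lakehouse_table")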

Just saw this thread pop up again, and I will say that we 100% don't use Gen2 dataflows anymore. Months ago (March or April 2024, I think) Microsoft released the capability for a copy activity in a pipeline to source on-prem data.

 

All of our copy jobs use this now, and with dynamic partitioning we're able to load data very quickly: I can load 10+ billion rows in a few hours. And we don't have to worry about updating a ton of Gen2 dataflows when something changes.

 

Hope this helps,

Scott

@AndyDDC not really sure; I decided to keep as few moving parts as possible. I didn't really notice any issues with using the "replace" option to clear out the data. That job didn't seem to run abnormally long or anything.

 

Hope this helps!

Scott

Scott_Powell
Advocate III

That's exactly what I'm doing: running multiple jobs in parallel to populate a large table (in my case just over 1 billion rows). The first Gen2 dataflow does a "replace", which truncates the data, and then loads its slice. After it's done, I launch the other Gen2 dataflows, all of which append in parallel.

 

Hope this helps,

Scott

Dinesh09
Frequent Visitor

Let me be precise. In one Dataflow Gen2 I have multiple tables. When the dataflow refresh triggers, all the tables within the flow start inserting data into the same lakehouse table in parallel. I made this decision because my source table has 40 million rows, so I divided this huge table into chunks. I am thinking that we would need to spin up a Spark cluster and run a PySpark script to wipe out all the rows in the lakehouse table before inserting the data, but I am not sure what script to write in the notebook. I also don't want log files that would let the old data be traced back; the data will always be available in the source system.
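
A minimal PySpark sketch of what that notebook could look like, assuming the destination is a Delta lakehouse table (the table name is a placeholder). The DELETE wipes the rows; the VACUUM step is only needed because you don't want the old data recoverable through the Delta log:

# Run from a Fabric notebook attached to the lakehouse, where a
# `spark` session is already available. Placeholder table name:
table_name = "my_lakehouse_table"

# Logical truncate: removes all rows, but the Delta transaction log
# still references the old data files (time travel can recover them).
spark.sql(f"DELETE FROM {table_name}")

# Physically remove the old files so they can't be traced back.
# Disabling the retention check allows a 0-hour retention window.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql(f"VACUUM {table_name} RETAIN 0 HOURS")

# The Dataflow Gen2 chunks can then append into the now-empty table.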

Scott_Powell
Advocate III

I just create a copy of the dataflow and set the method to "replace" instead of append. Run that first, and then have the other "append" dataflows run after it (using pipelines can really help keep this straight).

 

Hope this helps!

Scott

 

p.s. What I completely DON'T know how to do is delete "some" of the data out. For example: "I accidentally loaded Jan 2023 twice; I need to delete just that month and then reload it once." No clue on that. Maybe partitioning and deleting the partition files, but that doesn't help if the tables aren't partitioned by the slice you want to delete.
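
For what it's worth, Delta tables do support row-level deletes with a predicate, so that case shouldn't need partition-file surgery. A minimal sketch from a notebook, assuming a hypothetical date column on the table (both names are placeholders):

# Delete just the accidentally double-loaded month, then reload it
# from source; Delta handles this as a transactional row-level delete.
spark.sql("""
    DELETE FROM my_lakehouse_table
    WHERE order_date >= '2023-01-01'
      AND order_date <  '2023-02-01'
""")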

 

Thanks for adding your thoughts and the workaround. Let me be precise. In one Dataflow Gen2 I have multiple tables. When the dataflow refresh triggers, all the tables within the flow start inserting data into the same lakehouse table in parallel. I made this decision because my source table has 40 million rows, so I divided this huge table into chunks. I am thinking that we would need to spin up a Spark cluster and run a PySpark script to wipe out all the rows in the lakehouse table before inserting the data, but I am not sure what script to write in the notebook. I also don't want log files that would let the old data be traced back; the data will always be available in the source system.
