Hi all,
I have a simple use case: a Gen2 Dataflow prepares a table and loads it to a delta table in a Lakehouse using the append mechanism, i.e. each time the dataflow runs, a new parquet file is created.
Now let's say I need to perform some housekeeping, e.g. one of the append loads has a problem and needs to be deleted. How do I do that without re-creating the entire delta table? I know I can delete files under "Files", but how do I delete one or many of the parquet files on which a delta table is based?
Hi @JayJay11 ,
Thanks for using Fabric Community.
Do you want to delete the latest changes/records that were added to the delta table when the dataflow ran?
If so, you can refer to this: Work with Delta Lake table history - Azure Databricks | Microsoft Learn
In case your ask is different, please share more details about your issue so that I can guide you.
Thank you for your quick reply. Though I am not sure I understand the answer, as the link does not tell me how to actually delete one of those files. Is this even possible? If so, do I need to run code, or how do I do it?
Hi @JayJay11 ,
To delete the latest changes made to a delta table, or to restore an older version, you can use Fabric Notebooks and the Delta table time travel feature:
To list the available versions:
DESCRIBE HISTORY gopi_test_lakehouse.sales__1_test
To preview the records of a previous version:
SELECT count(*) FROM gopi_test_lakehouse.sales__1_test VERSION AS OF 6
To restore to a previous version (restoring the table makes the changes permanent):
RESTORE TABLE gopi_test_lakehouse.sales__1_test TO VERSION AS OF 6
To delete older, unused versions of the delta table, you can use VACUUM - Remove unused data files with vacuum | Databricks on AWS
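For example, a minimal sketch in a Fabric Notebook cell (the table name follows the example above; the 168-hour retention window equals the default 7 days and is only an assumption to adjust):
%%sql
-- remove data files that no table version within the retention window still references
VACUUM gopi_test_lakehouse.sales__1_test RETAIN 168 HOURS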
Please feel free to refer to the documents below:
Delta Lake Time Travel | Delta Lake
apache spark - How to roll back delta table to previous version - Stack Overflow
Table utility commands — Delta Lake Documentation
Hope this is helpful. Please let me know in case of further queries.
Thank you very much, it is becoming clearer to me now.
So in order to restore an older version, I use RESTORE and reference the version I want to go back to. Then I must use VACUUM to delete the files that are now unused due to the RESTORE.
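In other words, something like this (just a sketch on my side; the version number is only an example):
%%sql
-- roll the table back to a known-good version, then clean up the now-unreferenced files
RESTORE TABLE bi_buildings TO VERSION AS OF 6;
VACUUM bi_buildings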
When I try to use VACUUM though, I get an error:
%%sql
VACUUM bi_buildings
Error: Job aborted due to stage failure: Task 23 in stage 159.0 failed 4 times, most recent failure: Lost task 23.3 in stage 159.0 (TID 3133) (vm-12a14818 executor 2): java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:208)
at org.apache.spark.sql.delta.commands.VacuumCommandImpl.getDeletionVectorRelativePath(VacuumCommand.scala:537)
at org.apache.spark.sql.delta.commands.VacuumCommandImpl.getDeletionVectorRelativePath$(VacuumCommand.scala:522)
at org.apache.spark.sql.delta.commands.VacuumCommand$.getDeletionVectorRelativePath(VacuumCommand.scala:51)
at org.apache.spark.sql.delta.commands.VacuumCommandImpl.getValidRelativePathsAndSubdirs(VacuumCommand.scala:513)
at org.apache.spark.sql.delta.commands.VacuumCommandImpl.getValidRelativePathsAndSubdirs$(VacuumCommand.scala:500)
at org.apache.spark.sql.delta.commands.VacuumCommand$.getValidRelativePathsAndSubdirs(VacuumCommand.scala:51)
at org.apache.spark.sql.delta.commands.VacuumCommand$.$anonfun$gc$9(VacuumCommand.scala:153)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:764)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Driver stacktrace:
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2799)
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2735)
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2734)
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2734)
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1218)
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1218)
scala.Option.foreach(Option.scala:407)
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1218)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2998)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2937)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2926)
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Hi @JayJay11 ,
This might be due to Delta Lake table corruption: the table's metadata or files might be corrupted, causing issues with the vacuum operation.
Try creating a new table and running the VACUUM command on it.
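For example, a minimal sketch (the new table name is just a placeholder; CREATE TABLE ... AS SELECT copies the current snapshot into a fresh delta table with a clean history):
%%sql
-- copy the current snapshot into a new delta table, then vacuum that one
CREATE TABLE bi_buildings_copy AS SELECT * FROM bi_buildings;
VACUUM bi_buildings_copy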
I didn't encounter any issues while working with VACUUM.
Hope this is helpful.
Hi @JayJay11 ,
We haven't heard from you since the last response and were just checking back to see if you have a resolution yet.
If you have found a resolution, please do share it with the community, as it can be helpful to others.
Otherwise, please respond back with more details and we will try to help.
Hello @Anonymous
No, I could not resolve it; as soon as I run VACUUM, I get the error noted above. I am not sure what it means that the delta files are corrupted.
Also, when I run DESCRIBE HISTORY again after the RESTORE, I still see the old versions. Is this because the VACUUM was not successful?
It is a bit concerning that there is such a random error. Why would the files suddenly be corrupted?
Hi @JayJay11 ,
I am not sure what could be the reason for this behavior.
I can suggest this: Use table maintenance feature to manage delta tables in Fabric - Microsoft Fabric | Microsoft Learn
All your old data older than 7 days (the minimum retention) will get cleaned.
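If you need to remove files younger than the default 7-day retention, something like the sketch below may work; this is only an assumption for your setup, and lowering the retention disables a safety check and breaks time travel to the vacuumed versions:
%%sql
-- assumption: losing time travel to older versions is acceptable;
-- this setting normally blocks retention windows shorter than 7 days
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM bi_buildings RETAIN 24 HOURS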
In case you are facing some challenge with this delta table, try to create a new delta table and copy the data into it (e.g. with the CREATE TABLE ... AS SELECT sketch above) so that you will not face this issue.
Hope this is helpful. Please let me know in case of further queries.
Hello @JayJay11 ,
We haven't heard from you since the last response and were just checking back to see if you have a resolution yet.
If you have found a resolution, please do share it with the community, as it can be helpful to others.
If you have any question relating to the current thread, please do let us know and we will try our best to help you.
If you have any other question on a different issue, we request you to open a new thread.