I am loading many different flat files from the Files section of my lakehouse to delta tables via a Notebook. Most of the files I'm working with are relatively small (from 5 to 50,000 rows), with a handful of them being a couple million rows. I've noticed that the slowest part of the process is writing the files to the delta tables.
Here's what I was using to write the data to each table:
```python
df.write.format("delta").mode("overwrite").saveAsTable(table_name)
```

This was taking from 30 seconds to 3 minutes on average.
I recently tried this:
```python
df.write.format("delta").mode("overwrite").save(f"Tables/{table_name}")
```

I'm getting much quicker write times, averaging 10 to 50 seconds.
I'm trying to understand whether using .save() will cause any longer-term issues I'm not aware of vs .saveAsTable(). Everything I've read about the differences between them seems to apply to Delta lakehouses in general rather than Fabric specifically. For example, one thing I read said that .save() doesn't do certain things on the Hive side, and so it won't let you query the table directly. But I've tested this from the Fabric SQL endpoint, and I'm able to query the tables just fine, so I'm not sure whether issues like that actually apply within Fabric.
Anybody have experience with this, and are there any issues I should watch out for with .save()? The per-table time difference, multiplied across 75 tables, is a huge savings if everything else is equal.
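To put a number on that before committing, the savings can be projected with a small helper like this (a minimal sketch; `timed` just wraps whichever write call you pass it, and the 60-second figure is only an illustration, not a measured value):

```python
import time

def timed(fn) -> float:
    """Run a zero-argument callable and return elapsed seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def projected_savings(per_table_saved_s: float, n_tables: int = 75) -> float:
    """Rough total savings across the whole load, in minutes."""
    return per_table_saved_s * n_tables / 60

# Saving ~60 s per table across 75 tables:
# projected_savings(60) -> 75.0 minutes
```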
Hello @mmcanelly That's an interesting observation! I haven't experienced it myself, but it's probably down to the order in which things happen with each method. Both methods end up producing a managed Delta table, but internally one behaves as a "path-write" (.save) and the other as a "table-write" (.saveAsTable).
With a table-write, Fabric registers the table with the catalog and the SQL endpoint so that it is immediately visible. With a path-write, Spark simply writes to the /Tables area and lets Fabric auto-register it with the catalog, which eventually surfaces the table on the SQL endpoint. That would explain the extra latency you're seeing with the saveAsTable method.
If your use case doesn't demand that the table is immediately available on the SQL endpoint, you could follow the approach of using save() rather than saveAsTable().
I would like to point out that I could not find any Fabric documentation that explains the internal workings. The documentation consistently uses .save to write to /Files and .saveAsTable to write to /Tables.
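Following that suggestion, a batch load over the path-write route could be sketched as below (a minimal sketch: the `df` objects and the surrounding loop are assumptions standing in for the poster's notebook setup, not a documented Fabric API):

```python
def table_path(table_name: str) -> str:
    """Managed Delta tables in a Fabric lakehouse live under the /Tables area."""
    return f"Tables/{table_name}"

def write_table(df, table_name: str) -> None:
    # Path-write: Fabric auto-discovers Delta folders under /Tables,
    # so the table surfaces on the SQL endpoint after a short delay
    # instead of being registered synchronously on every write.
    df.write.format("delta").mode("overwrite").save(table_path(table_name))
```

In a notebook this would be called once per flat file, e.g. `write_table(df, "sales")`, with any required SQL-endpoint metadata refresh done a single time after the whole loop finishes.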
Thank you @deborshi_nag! So it sounds like as long as I refresh the SQL endpoint metadata after making my changes (per something like this: Programmatically Refresh & Sync SQL Analytics Endp... - Microsoft Fabric Community) so that the SQL endpoint sees the latest data right away, .save() should be fine for what I'm doing. Now I'll have to see whether it's still faster once I include that extra step. I suspect it will be, because I'd only refresh once at the end as opposed to updating the catalog with every table write.
Thanks for your insight!