Hi All,
I'm trying to use Spark SQL to run queries against multiple lakehouses from a notebook, not with PySpark cells, only Spark SQL. The same query against the same table returns different results depending on whether I use the delta or parquet prefix, and I'm trying to understand why and what's going on under the hood.
SELECT COUNT(*) FROM delta.`table`
Using delta seems to return the right results, while using parquet returns a lot of duplicates. Why is that? From what I understand, lakehouse tables are Delta tables and the data behind them is stored as Parquet files, so I figured the results would be the same. And is it more efficient/faster to use one over the other?
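The parquet version of the same query, which is the one returning the duplicates, looks like this:
SELECT COUNT(*) FROM parquet.`table`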
Also, if there's any documentation on this, can you please include it? I tried searching for it but couldn't find anything relevant.
Thanks in advance!
The different results between parquet.`table` and delta.`table` in Spark SQL queries are due to how Delta Lake manages metadata and transactions compared to raw Parquet files.
Delta Lake (delta.`table`) is more efficient because it leverages:
- The _delta_log transaction log, so only the Parquet files that belong to the current table snapshot are read
- ACID transactions, which keep results consistent even after updates, deletes, and merges
- Optimizations such as data skipping and Z-ordering
Parquet (parquet.`table`) is less efficient because:
- It reads every Parquet file in the folder, with no transaction log to filter out stale or superseded files
- Rows left behind by earlier updates and deletes are counted again, which is where the duplicates come from
"The goal is to turn data into information, and information into insight." – Carly Fiorina
🔗 Need Power BI help? Connect on LinkedIn: Rohit Kumar’s LinkedIn
Hello @PANDAmonium
When you query a Delta table using the Delta syntax (for example, using
SELECT COUNT(*) FROM delta.`table`), Spark SQL goes through the Delta transaction log (the _delta_log) to determine the current, correct table state. This log records all the changes (inserts, updates, deletes, merges, etc.) that have occurred over time and tells the engine exactly which underlying Parquet files should contribute to the current snapshot of the table.
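If you want to see what the transaction log considers the current table state, you can ask Delta directly. A minimal sketch, assuming a hypothetical table folder Tables/my_table in your lakehouse:

-- numFiles shows how many Parquet files make up the current snapshot
DESCRIBE DETAIL delta.`Tables/my_table`

-- Lists every committed operation (inserts, updates, deletes, merges) recorded in _delta_log
DESCRIBE HISTORY delta.`Tables/my_table`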
On the other hand, when you query the same directory using the Parquet syntax (for example, using
SELECT COUNT(*) FROM parquet.`table`), Spark SQL simply reads the raw Parquet files stored on disk without consulting the Delta transaction log. In many lakehouse implementations, updates and deletes don’t immediately remove or physically overwrite Parquet files. Instead, new Parquet files are added while the _delta_log marks the old ones as no longer active. By bypassing the transaction log, your query ends up reading both current and outdated (or duplicate) data files, which leads to duplicate rows and higher counts.
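To make the difference concrete, here is a sketch of the two counts side by side, again using the hypothetical Tables/my_table path; only the delta form consults _delta_log:

-- Counts only rows in the Parquet files that belong to the current Delta snapshot
SELECT COUNT(*) FROM delta.`Tables/my_table`

-- Counts rows in every Parquet file in the folder, including files the
-- _delta_log has already marked as removed, hence the duplicates
SELECT COUNT(*) FROM parquet.`Tables/my_table`

Running VACUUM on the table eventually deletes those superseded files once they pass the retention period, so how far apart the two counts are depends on how much stale data is still sitting in the folder.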
Delta Lake’s integration with Spark provides optimizations like data skipping, Z-ordering, and efficient handling of updates or deletes. This not only ensures data correctness through ACID transactions but also enhances query performance for interactive and large-scale analytical workloads.
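For example, a hedged sketch of the related maintenance commands against the same hypothetical table; OPTIMIZE compacts small files and ZORDER BY clusters the data on a column you frequently filter on (order_date is just a placeholder column name):

-- Compact small Parquet files and co-locate related data to improve data skipping
OPTIMIZE delta.`Tables/my_table` ZORDER BY (order_date)

-- Physically remove Parquet files that are no longer referenced by the Delta log
VACUUM delta.`Tables/my_table`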
https://delta.io/blog/delta-lake-vs-parquet-comparison
https://learn.microsoft.com/en-us/training/modules/work-delta-lake-tables-fabric
If this is helpful, please accept the answer and give kudos.