Hi All,
I'm trying to use Spark SQL to run queries against multiple lakehouses from a notebook. Not with PySpark cells, only Spark SQL. The same query returns different results depending on the syntax I use, and I'm trying to understand why and what's going on under the hood.
SELECT COUNT(*) FROM delta.`table`
SELECT COUNT(*) FROM parquet.`table`
Using delta seems to get the right results, while using parquet gets a lot of duplicates. Why is that? From what I understand, lakehouse tables are Delta tables while the data behind them is stored as Parquet files, so I figured the results would be the same. Also, is it more efficient/faster to use one over the other?
Also if there's any documentation on it, can you please include it? Tried searching for it, but not getting the right results.
Thanks in advance!
Solved! Go to Solution.
The different results between parquet. and delta. in Spark SQL queries come down to how Delta Lake manages metadata and transactions compared to raw Parquet files.
Delta Lake (delta.`table`) is more efficient because it leverages:
- The _delta_log transaction log, which identifies exactly which Parquet files belong to the current table snapshot
- ACID transactions, so inserts, updates, deletes, and merges are reflected consistently
- Query optimizations such as data skipping and Z-ordering
Parquet (parquet.`table`) is less efficient because:
- It reads every Parquet file in the directory, including stale files left behind by updates and deletes
- Without the transaction log, it cannot distinguish current data from outdated copies, which produces duplicate rows
Hello @PANDAmonium
When you query a Delta table using the Delta syntax (for example, using
SELECT COUNT(*) FROM delta.`table`), Spark SQL goes through the Delta transaction log (the _delta_log) to determine the current, correct table state. This log records all the changes (inserts, updates, deletes, merges, etc.) that have occurred over time and tells the engine exactly which underlying Parquet files should contribute to the current snapshot of the table.
On the other hand, when you query the same directory using the Parquet syntax (for example, using
SELECT COUNT(*) FROM parquet.`table`), Spark SQL simply reads the raw Parquet files stored on disk without consulting the Delta transaction log. In many lakehouse implementations, updates and deletes don’t immediately remove or physically overwrite Parquet files. Instead, new Parquet files are added while the _delta_log marks the old ones as no longer active. By bypassing the transaction log, your query ends up reading both current and outdated (or duplicate) data files, which leads to duplicate rows and higher counts.
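To see the difference side by side, the two query forms look like this (the path `Tables/my_table` is a placeholder for your lakehouse table path):

```sql
-- Resolves the current snapshot through _delta_log; returns the correct count
SELECT COUNT(*) FROM delta.`Tables/my_table`;

-- Scans every Parquet file under the directory, including stale files
-- left behind by updates and deletes, so it can over-count
SELECT COUNT(*) FROM parquet.`Tables/my_table`;
```

The gap between the two counts grows with every update or delete that rewrites files, until the stale files are physically removed.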
Delta Lake’s integration with Spark also provides optimizations like data skipping, Z-ordering, and efficient handling of updates and deletes. This not only ensures data correctness through ACID transactions but also improves query performance for interactive and large-scale analytical workloads.
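As a sketch of how those optimizations are applied in Delta Lake's SQL dialect (the table and column names here are placeholders, not from the original thread):

```sql
-- Compact small files and co-locate rows by a frequently filtered column
-- to improve data skipping
OPTIMIZE my_table ZORDER BY (order_date);

-- Physically delete stale data files past the retention window; after this,
-- even a parquet.`...` read would see fewer duplicates (use with care)
VACUUM my_table;
```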
https://delta.io/blog/delta-lake-vs-parquet-comparison
https://learn.microsoft.com/en-us/training/modules/work-delta-lake-tables-fabric
If this is helpful, please accept the answer and give kudos.