PANDAmonium
Resolver III

Spark SQL to Multiple Lakehouses; parquet.`table` and delta.`table` get different results

Hi All,

 

I'm trying to use Spark SQL to run queries against multiple lakehouses from a notebook, not with PySpark cells, only Spark SQL cells. The two queries below return different results, and I'm trying to understand why and what's going on under the hood.

 

  SELECT COUNT(*) FROM parquet.`table`
 

  SELECT COUNT(*) FROM delta.`table`

 

Using delta seems to return the right results, while using parquet returns a lot of duplicates. Why is that? From what I understand, lakehouse tables are Delta tables and the data behind them is stored as Parquet files, so I figured the results would be the same. Also, is one more efficient/faster to use than the other?

 

Also, if there's any documentation on this, can you please include it? I tried searching for it but couldn't find anything relevant.

 

Thanks in advance!

1 ACCEPTED SOLUTION
rohit1991
Super User

Hi @PANDAmonium ,


The different results between parquet.`table` and delta.`table` in Spark SQL queries are due to how Delta Lake manages metadata and transactions compared to raw Parquet files.

1. Why Does parquet.`table` Return Duplicates?

  • Delta Lake tables store metadata and a transaction log, while raw Parquet folders do not.
  • When you query parquet.`table`, Spark reads every Parquet data file in the directory, including stale file versions that the Delta log has already marked as removed, which leads to duplicate records (see the quick comparison sketched after this list).
  • delta.`table` queries only the latest snapshot of the data, ensuring consistent and correct results.
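
A quick way to see this on your own data is to point both syntaxes at the same table folder and compare the counts. This is only a sketch: the OneLake path below is a placeholder, so substitute the workspace, lakehouse, and table names from your environment.

  -- Placeholder path to one of your lakehouse tables
  SELECT COUNT(*) AS delta_count
  FROM delta.`abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Tables/my_table`;

  SELECT COUNT(*) AS parquet_count
  FROM parquet.`abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Tables/my_table`;

  -- If the table has ever been updated, merged into, or overwritten, parquet_count
  -- will typically be larger, because it also counts files the Delta log has already
  -- marked as removed.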

2. Which is More Efficient: Parquet or Delta?

  • Delta Lake (delta.`table`) is more efficient because it leverages:

    • Transaction Logs: Ensures atomicity and consistency.
    • Compaction and Optimizations: Uses Z-Ordering and Auto-Optimize to reduce small file issues.
    • ACID Transactions: Prevents duplicate or incorrect data reads.
    • Time Travel and Versioning: Enables rollback and historical data queries (see the sketch after this list).
  • Parquet (parquet.`table`) is less efficient because:

    • It does not track metadata or transactions.
    • It may read all versions of a table, leading to inconsistencies.
    • It does not support time travel, schema evolution, or automatic optimizations.
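
If it helps to see what the transaction log tracks, Delta exposes it through plain SQL. A minimal sketch, assuming a Fabric Spark runtime where Delta's SQL time travel syntax is available; the path is the same kind of placeholder as above:

  -- Operations (inserts, merges, overwrites, ...) recorded in the _delta_log
  DESCRIBE HISTORY delta.`abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Tables/my_table`;

  -- Count the rows in an earlier snapshot of the same table (time travel)
  SELECT COUNT(*)
  FROM delta.`abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Tables/my_table` VERSION AS OF 0;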

3. Best Practice: Always Use delta.`table` for Lakehouse Queries

  • Avoid using parquet.`table` unless you are querying raw data files directly.
  • Use Delta Lake (delta.`table`) for managed tables to ensure correct results and optimal performance (a cross-lakehouse example is sketched after this list).
  • If working with Parquet outside of a Delta Lake table, ensure proper file management to prevent reading duplicate or invalid files.
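
Since the original question was about querying multiple lakehouses from one notebook, here is a rough sketch of how that can look in pure Spark SQL. The lakehouse and table names are made up, and the name-based form assumes the other lakehouse is attached to (or visible from) the notebook; the OneLake path with the delta. reader is the fallback that always goes through the transaction log.

  -- Name-based reference (hypothetical lakehouse and table names)
  SELECT COUNT(*) FROM SalesLakehouse.orders;

  -- Path-based reference through the Delta reader, not the raw Parquet reader
  SELECT COUNT(*)
  FROM delta.`abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/SalesLakehouse.Lakehouse/Tables/orders`;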

"The goal is to turn data into information, and information into insight." – Carly Fiorina

🔗 Need Power BI help? Connect on LinkedIn: Rohit Kumar’s LinkedIn


2 REPLIES 2

nilendraFabric
Community Champion

Hello @PANDAmonium 

 

When you query a Delta table using the Delta syntax (for example, SELECT COUNT(*) FROM delta.`table`), Spark SQL goes through the Delta transaction log (the _delta_log) to determine the current, correct table state. This log records all the changes (inserts, updates, deletes, merges, etc.) that have occurred over time and tells the engine exactly which underlying Parquet files should contribute to the current snapshot of the table.
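
One way to see the snapshot that the log resolves to is DESCRIBE DETAIL; a small sketch with a placeholder path:

  -- numFiles reports only the Parquet files in the current snapshot, which is
  -- typically fewer than the total number of files sitting in the folder
  DESCRIBE DETAIL delta.`abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Tables/my_table`;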

 


On the other hand, when you query the same directory using the Parquet syntax (for example, SELECT COUNT(*) FROM parquet.`table`), Spark SQL simply reads the raw Parquet files stored on disk without consulting the Delta transaction log. In many lakehouse implementations, updates and deletes don't immediately remove or physically overwrite Parquet files. Instead, new Parquet files are added while the _delta_log marks the old ones as no longer active. By bypassing the transaction log, your query ends up reading both current and outdated (or duplicate) data files, which leads to duplicate rows and higher counts.
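
A related consequence: once those stale files are physically cleaned up, the two counts should converge. Delta's VACUUM does that cleanup, at the cost of losing time travel to the vacuumed versions. Sketch only, with a placeholder path and the default 7-day retention window:

  -- Physically delete data files removed from the log more than 168 hours ago
  VACUUM delta.`abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Tables/my_table` RETAIN 168 HOURS;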

 

Delta Lake's integration with Spark provides optimizations like data skipping, Z-ordering, and efficient handling of updates and deletes. This not only ensures data correctness through ACID transactions but also enhances query performance for interactive and large-scale analytical workloads.
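
For example, compaction and Z-ordering are exposed as plain SQL commands. The path and column name below are placeholders:

  -- Compact small files and co-locate data on a frequently filtered column
  OPTIMIZE delta.`abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Tables/my_table`
  ZORDER BY (order_date);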

https://delta.io/blog/delta-lake-vs-parquet-comparison

 

https://learn.microsoft.com/en-us/training/modules/work-delta-lake-tables-fabric

 

If this is helpful, please accept the answer and give kudos.

 
