AJAJ
Helper II

Speed - Database vs Delta Files

Hi there,


Is it just me, or have others noticed it too? Maybe I'm wrong. After working for a decade on high-performance SQL Server, I've noticed that Spark notebooks are slower than SPs when the compute is almost the same. I understand the architecture of separating compute from distributed storage.

 

Take the Fabric data warehouse + SPs combination vs the Spark + notebook combination (of course Spark has some extra flexibility). I feel options like SQL Server / SPs fly faster in terms of data processing. Am I wrong? Has anyone tested this?

If there is a 50 GB table and PySpark does df = <<read from file>>, does it move 50 GB of data from disk into memory and then run operations on top of it? (Isn't that double work?) Whereas SPs on databases and warehouses just start working directly on disk based on compiler instructions.

 

Moving to Delta file storage feels like losing an arm: you lose lots of features and have to import libraries just to read and process a table following the industry-standard medallion architecture, whereas SPs just start working.

1 ACCEPTED SOLUTION
nielsvdc
Skilled Sharer

Hi @AJAJ, to answer your question about whether PySpark would read a 50 GB table into memory: no.

 

PySpark uses lazy evaluation: transformations are only recorded as a plan, and no data is actually transformed in memory until an action is called. So if you run spark.read.load("my_files"), PySpark is not actually reading the data at that moment. The same applies to code like df.filter(...) or df.select(...): nothing is executed. Only when you call an action such as df.show() or df.write.save() does PySpark start processing.
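
To make that concrete, here is a minimal PySpark sketch of lazy evaluation. The path Tables/my_table and the column names are placeholders, and spark is the session a Fabric notebook provides automatically.

from pyspark.sql import functions as F

df = spark.read.format("delta").load("Tables/my_table")             # builds a plan, reads no data yet
filtered = df.filter(F.col("amount") > 100).select("id", "amount")  # still only a plan

filtered.show(10)        # action: Spark now reads only the files and columns the plan needs
print(filtered.count())  # another action: the plan runs again unless you cache the result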

 

But it will not read 50 GB into memory at once. PySpark reads and processes data in partitions and spreads the compute load over the available nodes in your cluster (or Fabric capacity). This method is called distributed processing, for those who don't know it yet.
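
A quick way to see that partitioned, distributed side in a notebook (the table path is again a placeholder; the partition count you get depends on file sizes and your capacity):

df = spark.read.format("delta").load("Tables/my_table")
print(df.rdd.getNumPartitions())   # number of chunks Spark can process in parallel across executors

# Optionally rebalance before a heavy transformation; 64 is an arbitrary example value
df = df.repartition(64)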

 

A SQL database is best at serving data instantly when you invoke a SELECT statement, as all the data is always hot (depending on the configuration of your database). This makes SQL ideal for application development, but because of this it is also comparatively expensive in both storage and compute. This is also why SPs work instantly: they are part of the "always-on" environment of SQL.

 

In reporting environments we do not need the data to be "hot" all the time. Most of the time we query the data in batches, once at a specific time during the day. Data lakes and data stored in Fabric then provide the cheapest storage option for your 50 GB of data.

 

In a company you're not (always) there to code; you are there to make the best decisions for that company, which might mean saving the company money or enhancing a business process with new functionality. Choosing between a SQL Server warehouse and a Fabric lakehouse or warehouse might be one of those decisions. But every decision can crown or kill you.

 

Hope this helps. If so, please give a Kudos 👍 and mark as Accepted Solution ✔️.


REPLIES
MohdZaid_
Frequent Visitor

Hey @AJAJ,


You're not imagining it: Spark notebooks and Delta Lake often feel slower than traditional SQL Server stored procedures, especially when you're coming from a decade of tuning high-performance OLTP/OLAP systems. But the reasons are architectural rather than simple inefficiency, and the trade-offs depend on the workload type.

Why SQL Server / Data Warehouse SPs Often Feel Faster
Traditional databases (SQL Server, Fabric DW, Synapse Dedicated, Snowflake, etc.) are built for tightly coupled compute + storage:

The engine knows the table structures, indexes, stats, partitions, data pages, etc.

Queries operate directly on local/attached storage.

The optimizer can make very aggressive, cost-based decisions.

Data rarely needs to be moved into a separate execution engine.

SPs benefit from plan caching and mature optimization.

Result: For single-node, I/O-optimized, relational workloads, SQL engines are extremely fast.

What Spark Is Doing Instead
Spark was designed for distributed processing of large, semi-structured/unstructured data (logs, events, files, ML workloads), not for low-latency SQL.

When Spark runs (a short sketch follows this list):

It loads data from object storage (ADLS/S3/Blob) into executors’ memory.

It executes operations in a distributed DAG (lazy evaluation).

It writes data back to object storage.
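
Putting those three steps together, a minimal bronze-to-silver sketch (the table names, columns, and filter value are hypothetical):

from pyspark.sql import functions as F

raw = spark.read.format("delta").load("Tables/bronze_sales")                # 1. plan to load from object storage
clean = (raw
         .where(F.col("order_date") >= "2024-01-01")                        # 2. transformations extend the lazy DAG
         .withColumn("total", F.col("qty") * F.col("unit_price")))
clean.write.format("delta").mode("overwrite").save("Tables/silver_sales")   # 3. action: execute the DAG and write back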

Yes — reading a 50GB Delta table means scanning some or all of those files.
But Spark uses:

predicate/file pruning

column pruning

caching

data skipping (Delta)

vectorized Parquet readers

This can make scans surprisingly fast, but it’s still a scan-based architecture, not index-based like SQL Server.
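
If you want to check which of those optimizations actually kick in for a given query, the physical plan shows it. A hedged example, assuming a hypothetical Delta table partitioned by a year column:

df = spark.read.format("delta").load("Tables/events")
q = df.where("year = 2025 AND event_type = 'click'").select("event_id", "ts")
q.explain(True)   # look for PartitionFilters / PushedFilters and the pruned column list in the plan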

Is it “double work”?
Not exactly — Spark must read files because:

Object storage doesn’t support random access pages like SQL Server data files

There is no buffer pool, no indexes, no rowstore

Distributed compute requires data locality in memory/executors

So Spark doesn't have the same "just start reading pages off disk and execute" model that a database engine does.

For large-scale analytics, that model is actually a feature, not a flaw — but it’s different.

When Spark Is Slower
Spark will usually be slower for:

Small to medium datasets (<100GB)

Highly selective queries (which a SQL index would solve instantly)

Repeat/interactive workloads where DB caching beats Spark cluster startup

Single-record lookups or OLTP-style patterns

Complex joins on non-partitioned data

When Spark Is Faster
Spark shines when:

Data size exceeds what fits comfortably on a single DB engine

You need distributed compute for ML, ETL, or large transformations

You work with raw data in files (JSON, CSV, logs)

You benefit from cluster parallelism

🧠 The Key Point
Spark is a distributed compute engine designed for scale and flexibility.
SQL Server is a database engine designed for performance and efficiency.

 

Don’t treat Spark as a “slower SQL server.” It’s a different tool entirely.

bao_phan
Advocate I

Hi @AJAJ , for your questions:

- Spark is lazy, which means transformations are not executed immediately; they are only triggered once there is an action (like .show(), display(), .count(), ...).

- Spark will not always perform better than SQL; it only tends to when working with high volumes of data. When you join multiple tables together, you might want to cache data to speed up the process, and in that case Spark might outperform SQL (see the short caching sketch after this list). Also, while working with Spark there are a lot of things you need to pay attention to, like shuffles, caching, bucketing, ..., to make Spark outperform SQL.

- Traditional databases do feel faster because they’re designed for direct query execution using SQL.
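
As a hedged illustration of the caching point above (table and column names are made up), note that cache() only pays off when the cached DataFrame is reused by more than one action:

dim = spark.read.format("delta").load("Tables/dim_customer").cache()   # small, frequently reused table
fact = spark.read.format("delta").load("Tables/fact_sales")

joined = fact.join(dim, "customer_id", "left")
joined.count()                                            # first action materializes the cache
joined.groupBy("country").agg({"amount": "sum"}).show()   # later actions reuse the cached dim data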

Hi @AJAJ ,

Thanks for reaching out to the Microsoft Fabric community forum.

I would also like to take a moment to thank @nielsvdc for actively participating in the forum and for the solutions you've been sharing. Your contributions make a real difference.

I hope the above details help you fix the issue. If you still have any questions or need more help, feel free to reach out. We're always here to support you.

 

 

Best Regards, 
Community Support Team  



tayloramy
Community Champion

Hi @AJAJ

 

This would be a very interesting thing to benchmark. 

 

The Fabric documentation does mention that a COPY statement is the most efficient at loading data into a warehouse if the file is already accessible, so what you're saying makes sense, but I would still love to see the numbers. 

 

If you found this helpful, consider giving some Kudos. If I answered your question or solved your problem, mark this post as the solution.
