AJAJ
Helper I

Speed - Database vs Delta Files

Hi there,


Is it just me, or have others noticed this too? Maybe I'm wrong. After working for a decade on high-performance SQL Server, I've noticed that Spark notebooks are slower than stored procedures (SPs) when the compute is roughly the same. I understand the architecture separates compute from distributed storage.

 

Take the Fabric data warehouse + SPs combination versus the Spark + notebook combination (of course Spark has some extra flexibility). I feel that options like SQL Server with SPs fly through data processing much faster. Am I wrong? Has anyone tested this?

If there is a 50 GB table and PySpark runs df = <<read from file>>, does it move 50 GB of data from disk into memory and then run operations on top of it? (Isn't that double work?) Whereas SPs on databases and warehouses just start working directly on the data on disk based on compiler instructions.

 

Moving to Delta file storage feels like losing an arm: you lose a lot of features and have to import libraries just to read and process a table following the industry-standard medallion architecture, whereas SPs just start working.

4 REPLIES
bao_phan
Advocate I

Hi @AJAJ, for your questions:

- Spark is lazy, which means transformations are not executed immediately; they are only triggered once there is an action (like .show(), display(), .count(), ...).

- Spark will not always perform better than SQL; it tends to win only when working with high volumes of data. When you join multiple tables together you might want to cache data to speed up the process, and in cases like that Spark can outperform SQL. There are also a lot of things you need to pay attention to while working with Spark, like shuffling, caching, and bucketing, to make it outperform SQL (see the sketch after this list).

- Traditional databases do feel faster because they’re designed for direct query execution using SQL.
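
To make the lazy-evaluation and caching points concrete, here is a minimal PySpark sketch; the table and column names ("orders", "amount", "customer_id") are made up for illustration only:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # in a Fabric notebook this session already exists

orders = spark.read.table("orders")                   # lazy: only builds a query plan
big_orders = orders.filter(F.col("amount") > 100)     # lazy transformation, nothing runs yet

big_orders.cache()                                    # mark the filtered result to be kept in memory
big_orders.count()                                    # action: triggers the read, filter, and caching
big_orders.groupBy("customer_id").agg(F.sum("amount")).show()   # action: reuses the cached data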

Hi @AJAJ ,

Thanks for reaching out to the Microsoft Fabric community forum.

 

I would also like to take a moment to thank @nielsvdc for actively participating in the community forum and for the solutions you've been sharing. Your contributions make a real difference.

I hope the above details help you resolve the issue. If you still have any questions or need more help, feel free to reach out. We're always here to support you.

 

 

Best Regards, 
Community Support Team  

nielsvdc
Continued Contributor

Hi @AJAJ, to answer your question about whether PySpark would read a 50 GB table into memory: no.

 

PySpark uses lazy evaluation: it builds up the planned transformations without actually processing any data in memory until an action is called. So if you run code like spark.read.load("my_files"), PySpark is not actually reading the data at that moment. The same goes for code like df.filter(...) or df.select(...); nothing is executed. Only when you call an action such as df.show() or df.write.save(...) does PySpark start processing.
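
As a rough sketch of that sequence (the Delta path "Tables/my_table" and the column names are placeholders, not anything from the original question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.format("delta").load("Tables/my_table")    # lazy: no data is read yet
df = df.filter(df["status"] == "open")                     # lazy: the plan grows
df = df.select("id", "status", "amount")                   # lazy: still nothing executed

df.show(10)                                                # action: Spark now reads and processes
df.write.mode("overwrite").saveAsTable("my_table_open")    # action: writes the result as a table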

 

And it will not read 50 GB into memory at once. PySpark reads and processes data in partitions and spreads the compute load over the available nodes in your cluster (or Fabric capacity), a method called distributed processing, for those who don't know it yet.
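
For example, you can inspect how Spark splits a table into partitions before anything is processed; the table and column names below are illustrative only:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.table("big_table")        # a large Delta table, still lazy
print(df.rdd.getNumPartitions())          # number of chunks Spark will process in parallel

# Each partition is read and processed independently by the executors,
# so the full 50 GB never has to sit in a single node's memory at once.
df.groupBy("region").count().show()       # action: partitions are processed across the cluster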

 

A SQL database is best at serving data the instant you invoke a SELECT statement, as all the data is always hot (depending on the configuration of your database). This makes SQL ideal for application development, but because of this it is also very expensive for both data storage and compute. This is also why SPs work instantly: they are part of the "always-on" environment of SQL.

 

In reporting environments we do not need the data to be "hot" all the time. Most of the time we query the data in batches, once at a specific time during the day. Data lakes and data stored in Fabric then provide the cheapest storage option for your 50 GB of data.

 

In a company you're not (always) there to code; you are there to make the best decisions for that company, which might mean saving the company money or enhancing a business process with new functionality. Choosing between a SQL Server warehouse and a Fabric lakehouse or warehouse might be one of those decisions. But every decision can crown or kill you.

 

Hope this helps. If so, please give a Kudos 👍 and mark as Accepted Solution ✔️.

tayloramy
Community Champion

Hi @AJAJ,

 

This would be a very interesting thing to benchmark. 

 

The Fabric documentation does mention that the COPY statement is the most efficient way to load data into a warehouse if the file is already accessible, so what you're saying makes sense, but I would still love to see the numbers.
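
If anyone wants to collect those numbers, the Spark side could be timed with a simple sketch like the one below (the staged file path and target table name are hypothetical), and the warehouse side would be the equivalent COPY or stored-procedure load timed the same way:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

start = time.time()
df = spark.read.option("header", True).csv("Files/staging/orders.csv")   # hypothetical staged file
df.write.mode("overwrite").format("delta").saveAsTable("orders_bronze")  # hypothetical target table
print(f"Spark load took {time.time() - start:.1f} s")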

 

If you found this helpful, consider giving some Kudos. If I answered your question or solved your problem, mark this post as the solution.
