frithjof_v
Super User

Spark SQL vs. Spark SQL

Hi,

 

I am new to Spark SQL, and I am wondering about the differences between these two ways to use SQL in Fabric Notebooks:

-- Method A -- (spark.sql)
[screenshot of a spark.sql() code cell]
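
(The screenshot isn't reproduced here; a minimal sketch of what such a spark.sql cell might contain, assuming the orders lakehouse table used later in this thread:)

df = spark.sql("SELECT * FROM orders LIMIT 10")
display(df)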

 

 


-- Method B -- (%%sql magic)

[screenshot of a %%sql magic code cell]
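
(Again a sketch rather than the original screenshot; a comparable %%sql cell could look like this:)

%%sql
SELECT * FROM orders LIMIT 10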

 

 

I am wondering about the differences between Method A (spark.sql) and Method B (%%sql):

Is Method A the only one that can be used when working with dataframes?
Is Method A better suited when you need to store the output of the query in a dataframe and perform further transformations on it in subsequent steps?


Method B - is this method more suitable for standalone commands (create lakehouse table, drop lakehouse table, etc.) and/or for doing ad-hoc queries?
Is it possible to store the output of this kind of query in a dataframe or variable in the notebook, and use it in subsequent steps? Or do we need to use Method A (spark.sql) in that case?

 

 

Are there performance differences between these two methods?
Are both methods called Spark SQL?
Do these two methods have different names, so they can be referred to separately in conversations and chats?

 

I am curious to learn more about the differences between these two methods.

Thank you! 😀

11 REPLIES
frithjof_v
Super User

Thank you @Anonymous, 

 

Your reply is really helpful and clarifying! 😀

 

I have a further query:

 

I prefer the simplicity of using the %%sql magic.

I think it's easier to write Spark SQL code in a %%sql magic cell than in a PySpark cell (spark.sql-method).

(However, I realized there are some ways of making it easier to write SQL inside the spark.sql function; see my next reply.)

 

There is one thing I am curious about in your reply, because it seems like an attractive option:

 

"Method B (%%sql Magic)

Purpose: This method is a magic command used within Fabric Notebooks to execute Spark SQL queries directly within the notebook cells. It's more concise and interactive.
DataFrames: While it doesn't directly create DataFrames, you can capture the output using the display() function or assign the query to a variable like any other Python expression."

 

How can I capture the output by using the display() function or assigning the query to a variable?

 

Let's say I have this %%sql magic code in a notebook cell. How can I capture the output of this cell for use in subsequent cells?

 

 

%%sql
SELECT 
    SalesOrderNumber,
    CustomerName,
    OrderDate,
    Item,
    Quantity,
    UnityPrice,
    Tax,
    Year,
    AVG(UnityPrice) OVER (PARTITION BY CustomerName) AS AvgUnitPricePerCustomer
FROM orders
ORDER BY OrderDate DESC;

 

 

One method I found, which seems to work for me, is to use the %%sql magic cell to create a temporary view, which I can then reference in other notebook cells. However, it requires me to add another cell with spark.sql if I want to use the temporary view in further transformations.

 

I can do something like this to create a temporary view:

 

 

%%sql
CREATE OR REPLACE TEMPORARY VIEW vw_temporary AS 
SELECT 
    SalesOrderNumber,
    CustomerName,
    OrderDate,
    Item,
    Quantity,
    UnityPrice,
    Tax,
    Year,
    AVG(UnityPrice) OVER (PARTITION BY CustomerName) AS AvgUnitPricePerCustomer
FROM orders
ORDER BY OrderDate DESC;

 

 

Then I can reference the temporary view inside a spark.sql() function in a PySpark cell, and easily assign it to a dataframe:

 

 

df = spark.sql("SELECT * FROM vw_temporary")
display(df)

 

 

Currently, this is the only way I have been able to use the outputs from %%sql magic code in subsequent steps. It works fine, but this way requires two notebook cells (one for the %%sql magic, and another one to assign it to a dataframe).

 

  • I am curious how I can capture the output from the %%sql magic code by using the display() function or assigning the query to a variable.

  • Also, I am curious if it is possible to combine %%sql magic code and PySpark code inside the same notebook cell?

 

Thank you 😀

I just learned that enclosing the code inside triple quotes  """ """ also works if I want to write multi-line SQL in a PySpark cell (spark.sql-method) 😊

I'm new to this, so I didn't know that before. It is a lot simpler than adding a backslash \ at the end of each line (unless someone tells me there is a keyboard shortcut to add \ to a code block... EDIT: I just realized it's possible to hold the ALT key while clicking at multiple locations in the code with the mouse, and then insert the backslash \ in all of them at once).

 

This works:

 

 

df = spark.sql("""
SELECT *
FROM orders
""")

display(df)

 



This works:

 

 

df = spark.sql("\
SELECT *\
FROM orders\
")

display(df)

 

 

 

This will not work:

 

 

df = spark.sql("
SELECT *
FROM orders
")

display(df)

 

 

Anonymous
Not applicable

Hi @frithjof_v ,

I apologize for this statement - "I am curious how I can capture the output from the %%sql magic code by using the display() function or assigning the query to a variable." 

As you said, creating a temporary view and then using it in a PySpark cell is the only way.
Unfortunately, you cannot directly mix %%sql and PySpark code within a single cell in Fabric Notebooks.

Using triple quotes (""" """) is the recommended way to write multi-line SQL code within the spark.sql function, for better readability and easier line breaks.

 

Hope this is helpful. Please let me know in case of further queries.

Thank you @Anonymous, 

 

I am getting a clearer understanding of what are the possibilities with both methods ☺️

Anonymous
Not applicable

Hi @frithjof_v ,

I hope we have answered your query. Otherwise, please respond back with more details and we will try to help.

@frithjof_v @Anonymous 

 

What are the implications for memory usage when using PySpark with a dataframe and a temp view as opposed to %%sql in a Notebook? 

 

In other words, say your view contains billions of rows.  I am assuming on the backend this is somehow on disk and not in memory.

 

What about when you try to load this view into a dataframe? Does the Python engine manage this intelligently, or will you get an out-of-memory error? Or is it the notebook that handles this? Or...?

I don't have enough knowledge about this to answer, and I am curious about the same. 

 

I can write my assumptions here, and I hope to be corrected if I'm wrong. These are only assumptions and this is brand new for me, but I have started reading a little, so I have some theories at this point:

 

  • About the temp view: The temp view itself is just a piece of code (it holds no data; it is only query code that can be referenced, like a regular SQL view)

  • About the dataframe: Apache Spark (which Fabric notebooks use) uses lazy evaluation. It doesn't load any data when you define a dataframe. You can also do many transformations on the dataframe across many notebook cells, like adding or removing columns, filtering, etc. However, these transformations are only code; they can be optimized by Apache Spark to be performed in a single step, and some transformations can also be pushed back to the source system.

    Lazy evaluation means that the actual loading and processing of data from the data source only happens when you use the dataframe to perform an action (like displaying the data or saving it to a file).

    So Apache Spark will look at the final step (=the action), and then go backwards to trace all the transformation steps until it reaches the first step where you connected to the data source. 
    Apache Spark analyzes all the transformations you are doing in your code, and creates an optimized execution plan. When possible, it pushes transformations (like filtering and column selections) down to the data source (analogous to query folding in Power Query, I guess?).

    So, if you use display(df.limit(10)) then Apache Spark only needs to load enough data from the data source to display the first 10 rows of the dataframe.

    And if you have done transformations on your dataframe, so that when the dataframe arrives at the display() action it has been narrowed down to only contain 2 columns, then Apache Spark will analyze the job and only needs to load these 2 columns from the data source.

    I'm not sure, but this is my understanding for now (a small sketch of how I picture it is below).
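
Sketch (the orders table and column names are taken from my query above; the filter is just an example):

df = spark.read.table("orders")                      # builds a plan, reads nothing yet
df = df.select("CustomerName", "UnityPrice")         # still only a plan (narrowed to 2 columns)
df = df.filter(df.UnityPrice > 0)                    # still only a plan

display(df.limit(10))                                # this action triggers the actual read and processing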

 

 

I am hoping @Anonymous can answer 😀

Anonymous
Not applicable

Hi @Element115 ,

As @frithjof_v  said -

Regardless of whether you use a DataFrame, temporary view, or %%sql magic, the underlying data itself resides on disk. Spark doesn't keep the entire dataset in memory by default.

 

When you create a DataFrame from data, or a temporary view from a DataFrame/query, Spark reads only as much data from disk into memory as is needed to perform the specific operation (e.g., schema inference, initial partitioning). It doesn't load the entire dataset into memory at once.

As you perform operations (filtering, joining, aggregation) on the DataFrame or query results in a temporary view, Spark might shuffle data around or create additional intermediate datasets in memory (depending on the complexity of the operation). The amount of memory used depends on the specific operation and the size of the data being processed in that step.
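
One way to see this in practice (a sketch, assuming the orders table from earlier in the thread and a hypothetical view name vw_orders_slim): register a temporary view from a DataFrame and inspect the plans; until an action runs, both routes are only query plans, not data in memory.

df = spark.read.table("orders").select("CustomerName", "UnityPrice")
df.createOrReplaceTempView("vw_orders_slim")          # registers the plan under a name; moves no data

df.explain()                                          # physical plan for the DataFrame route
spark.sql("SELECT * FROM vw_orders_slim").explain()   # equivalent plan for the SQL route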

Anonymous
Not applicable

Hi @frithjof_v ,

We haven't heard from you since the last response and were just checking back to see if we answered your query.
Otherwise, please respond back with more details and we will try to help.


Anonymous
Not applicable

Hi @frithjof_v ,

Thanks for using Fabric Community.
I'd be glad to explain the differences between Method A (spark.sql) and Method B (%%sql magic) in Spark SQL for Fabric Notebooks:

Method A (spark.sql)

  • Purpose: This method allows you to directly execute Spark SQL queries within your Python code using the spark.sql function.
  • DataFrames: You can create DataFrames as the result of your queries and store them in variables for further manipulation.
  • Use Cases: Ideal for scenarios where you need to:
    • Write complex Spark SQL queries programmatically.
    • Chain and combine multiple SQL queries with DataFrame operations.
    • Store query results in DataFrames for subsequent transformations.
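
(A minimal sketch of that kind of chaining, assuming the orders table used elsewhere in this thread:)

df = spark.sql("SELECT CustomerName, Quantity FROM orders")
totals = df.groupBy("CustomerName").sum("Quantity")   # continue with regular DataFrame operations
display(totals)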

Method B (%%sql Magic)

  • Purpose: This method is a magic command used within Fabric Notebooks to execute Spark SQL queries directly within the notebook cells. It's more concise and interactive.
  • DataFrames: While it doesn't directly create DataFrames, you can capture the output using the display() function or assign the query to a variable like any other Python expression.
  • Use Cases: Well-suited for:
    • Running ad-hoc queries for quick exploration and analysis.
    • Debugging and inspecting intermediate results.
    • Creating lakehouse tables, dropping tables, and other DDL (Data Definition Language) operations.
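
(For instance, a DDL-style cell might look like the following; the sales_summary table name is only illustrative:)

%%sql
-- create (or later drop) a lakehouse table; sales_summary is a made-up name
CREATE TABLE IF NOT EXISTS sales_summary (CustomerName STRING, TotalQuantity INT)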

 

Performance:

There are generally minimal performance differences between the two methods. Both translate your queries into optimized distributed Spark operations. However, complex logic within Method A code might have a slight overhead compared to simpler %%sql queries.

Naming and Terminology:

  • Spark SQL: Both methods leverage Spark SQL's capabilities behind the scenes.
  • Separate Names: Yes, they have distinct names for clarity:
    • spark.sql: Refers to the Python function for programmatic execution.
    • %%sql: Denotes the magic command specific to Fabric Notebooks.

In Summary:

Choose Method A (spark.sql) when you need to:

  • Integrate SQL queries with Python code for complex data processing workflows.
  • Store query results as DataFrames for further transformations.

Use Method B (%%sql) for:

  • Quick ad-hoc queries and exploration within notebooks.
  • DDL operations like creating or dropping tables.

 

Hope this is helpful. Please let me know in case of further queries.
