lsabetta
Frequent Visitor

Spark Session configuration

Hi community,

 

Currently I'm working with an F32 capacity based in Europe.

I have a notebook in which I'm forced to use PySpark because I have to get tables from my lakehouse; otherwise I would use plain Python instead, since my notebook runs code with few transformations and the tables are small.

I'm not a big fan of PySpark because of the time it takes to create the Spark session - it takes at least 5 minutes and I need the code to run fast.

My question is: is there any specific configuration I can make in my Spark environment to create the session faster?

 

Thanks community! 

 

1 ACCEPTED SOLUTION
v-venuppu
Community Support

Hi @lsabetta ,

The issue happens because your Lakehouse table is stored in Delta format, not plain Parquet.
When you overwrite the table, new Parquet files are created, but old ones remain in the folder for versioning.
If you read the folder directly as Parquet, it loads both the old and new files - that’s why you see duplicate or outdated records.

To fix this, make sure you read the table as a Delta table instead of as raw Parquet.
This ensures that only the latest valid version of the data is returned, without mixing older files.
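
For example, a minimal sketch using the open-source deltalake package (no Spark session needed), assuming a default Lakehouse is attached to the notebook; the table name below is a placeholder:

# %pip install deltalake   (if the package is not already available in the environment)
from deltalake import DeltaTable

# With a default Lakehouse attached, tables are mounted under /lakehouse/default/Tables/<table>
dt = DeltaTable("/lakehouse/default/Tables/your_table_name")

# Only the files that belong to the current Delta version are read, so overwritten
# data does not come back as duplicates.
df = dt.to_pandas()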

Thank you.


12 REPLIES
v-venuppu
Community Support

Hi @lsabetta ,

Thank you @BhaveshPatel @BalajiL @anilgavhane for the prompt response.

I wanted to check if you had the opportunity to review the information provided and resolve the issue. Please let us know if you need any further assistance. We are happy to help.

Thank you.

BalajiL
Helper III

@lsabetta: If the concern is that Spark takes a long time to spin up, use a starter pool - it will be up within a few seconds and then you can query it.

BhaveshPatel
Community Champion

Hi @lsabetta 

 

You should use Power BI Dataflow Gen 2. 

 

This is how it works:

Either use Notebooks (Python (pandas) + Data Lake + Delta Lake) or use Power BI Dataflow Gen 2 (UI + UX).

Yes, you can read the tables in a Lakehouse using the Python pandas library (e.g. pd.read_parquet).

Thanks & Regards,
Bhavesh

Love the Self Service BI.
Please use the 'Mark as answer' link to mark a post that answers your question. If you find a reply helpful, please remember to give Kudos.

Hi @BhaveshPatel ,

Could you please give me an example of how to read the tables in a Lakehouse only with python?

Hi Isabella,

 

This is how it works using Pandas:

[screenshot of a pandas example - image not reproduced]
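
A minimal sketch of what such a pandas read could look like, assuming a default Lakehouse is attached to the notebook (the table name is a placeholder):

import pandas as pd

# With a default Lakehouse attached, tables are exposed under /lakehouse/default/Tables/<table>.
# Note: this reads the table's raw Parquet files directly, which is what causes the
# old-plus-new records issue described in the next reply when a Delta table is overwritten.
df = pd.read_parquet("/lakehouse/default/Tables/your_table_name")
print(df.head())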

 

Thanks & Regards,
Bhavesh

Love the Self Service BI.
Please use the 'Mark as answer' link to mark a post that answers your question. If you find a reply helpful, please remember to give Kudos.

Hi @BhaveshPatel 

The problem with reading the Parquet files is that if I overwrite the table, the Parquet read brings back both the new records and the old ones.

anilgavhane
Resolver III

@lsabetta 

Yes, you can read tables from a Microsoft Fabric Lakehouse without creating a Spark session, by using Lakehouse shortcuts and native connectors in Python. Here are a few options:

 

🔹 1. Use the fabric Python SDK (Preview)

Microsoft Fabric offers a Python SDK that allows you to interact with Lakehouse tables directly from a notebook using Pandas, without spinning up Spark.

 

from fabric import LakehouseClient

client = LakehouseClient(workspace_id="your_workspace_id", lakehouse_id="your_lakehouse_id")
df = client.read_table("your_table_name")



 

This avoids Spark entirely and loads the table as a Pandas DataFrame.

 

🔹 2. Use REST APIs or ODBC/JDBC Connectors

You can access Lakehouse tables via:

  • REST API: For metadata and file-level access.
  • ODBC/JDBC: If your Lakehouse is exposed via SQL endpoints (like Warehouse or SQL Analytics).

These methods allow you to query structured data using Python libraries like pyodbc, sqlalchemy, or pandas.read_sql().
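
As a rough sketch of the ODBC route: the server, database, and table names below are placeholders (copy the real SQL analytics endpoint from the Lakehouse settings), and it assumes the Microsoft ODBC Driver 18 for SQL Server is installed and interactive Azure AD sign-in is allowed.

import pandas as pd
import pyodbc

# Placeholder connection details for the Lakehouse SQL analytics endpoint.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=your-endpoint.datawarehouse.fabric.microsoft.com;"
    "Database=YourLakehouse;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)

# Query the table with plain SQL and load the result into a pandas DataFrame.
df = pd.read_sql("SELECT TOP 100 * FROM dbo.your_table_name", conn)
conn.close()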

 

🔹 3. Export Lakehouse Tables to Files

If your Lakehouse tables are stored as Delta or Parquet files:

  • You can mount or access the file path directly using pyarrow, fastparquet, or pandas.read_parquet().

 

import pandas as pd

df = pd.read_parquet("https://yourlakehouseurl/path/to/table")

 

This bypasses Spark and loads the data directly into memory.

v-venuppu
Community Support

Hi @lsabetta ,

Thank you for reaching out to Microsoft Fabric Community.

Thank you @pallavi_r for the prompt response.

The slow startup isn’t from your code or data size - it’s Spark cluster spin-up, which always takes a few minutes.

There’s no Spark config that makes session creation instant.

Below are a few options:

  • If the tables are small, skip Spark and load them directly into Pandas/SQL (faster, no cluster).
  • Keep the Spark session alive instead of restarting it often.
  • Ask your admin if a smaller/faster Spark pool is available.
  • Use Spark only when you need distributed compute; otherwise stick with Python.

In short: you can't make Spark spin up faster, but you can avoid Spark altogether or keep the session warm.

Hi @v-venuppu ,

Thanks for your answer.

My notebooks are written in pandas because my tables are small and the transformations are simple. I could use Python instead of PySpark, but the thing is that I need to read tables from my lakehouse.

Is there any way to read tables from a lakehouse without creating a Spark session?

pallavi_r
Super User

Hi @lsabetta ,

 

Here are a couple of tips to reduce Spark session start-up time: keep high concurrency mode enabled, and detach the session from one notebook when you're done so it frees up for other notebooks and stays alive.

https://www.linkedin.com/pulse/fabric-notebook-performance-hack-reuse-spark-sessions-tiago-balabuch-...

https://blog.fabric.microsoft.com/en-US/blog/supercharge-your-workloads-write-optimized-default-spar...

 

Thanks,

Pallavi
