lsabetta
Frequent Visitor

Spark Session configuration

Hi community,

 

Currently I'm working with an F32 capacity based in Europe.

I have a notebook in which I'm forced to use PySpark because I have to get tables from my lakehouse; otherwise I would use plain Python instead, since my notebook runs code with few transformations and the tables are small.

I'm not a big fan of PySpark because of the time it takes to create the Spark session - it takes at least 5 minutes and I need the code to run fast.

My question is: is there any specific configuration I can make in my Spark environment to create the session faster?

 

Thanks community! 

 

1 ACCEPTED SOLUTION
v-venuppu
Community Support

Hi @lsabetta ,

The issue happens because your Lakehouse table is stored in Delta format, not plain Parquet.
When you overwrite the table, new Parquet files are created, but old ones remain in the folder for versioning.
If you read the folder directly as Parquet, it loads both the old and new files - that’s why you see duplicate or outdated records.

To fix this, make sure you read the table as a Delta table instead of as raw Parquet.
This ensures that only the latest valid version of the data is returned, without mixing older files.
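
For example, a minimal sketch using the open-source deltalake package (no Spark session needed), assuming a default Lakehouse is attached to the notebook; the table name below is a placeholder:

# %pip install deltalake   (if the package is not already available in the environment)
from deltalake import DeltaTable

# With a default Lakehouse attached, tables are mounted under /lakehouse/default/Tables/<table>
dt = DeltaTable("/lakehouse/default/Tables/your_table_name")

# Only the files that belong to the current Delta version are read, so overwritten
# data does not come back as duplicates.
df = dt.to_pandas()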

Thank you.


12 REPLIES
v-venuppu
Community Support

Hi @lsabetta ,

Thank you @BhaveshPatel @BalajiL @anilgavhane for the prompt response.

I wanted to check if you had the opportunity to review the information provided and resolve the issue. Please let us know if you need any further assistance. We are happy to help.

Thank you.

BalajiL
Helper III

@lsabetta: If the concern is that Spark takes a long time to spin up, use a starter pool - it will be up within a few seconds and then you can query it.

BhaveshPatel
Community Champion

Hi @lsabetta 

 

You should use Power BI Dataflow Gen 2. 

 

This is how it works:

Either use Notebooks (Python (pandas) + Data Lake + Delta Lake) or use Power BI Dataflow Gen 2 (UI + UX).

Yes, you can read the tables in a Lakehouse using the Python pandas library (e.g. pd.read_parquet).

Thanks & Regards,
Bhavesh

Love the Self Service BI.
Please use the 'Mark as answer' link to mark a post that answers your question. If you find a reply helpful, please remember to give Kudos.

Hi @BhaveshPatel ,

Could you please give me an example of how to read the tables in a Lakehouse only with python?

Hi Isabella,

 

This is how it works using Pandas:

[screenshot of a pandas example - image not reproduced]
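
A minimal sketch of what such a pandas read could look like, assuming a default Lakehouse is attached to the notebook (the table name is a placeholder):

import pandas as pd

# With a default Lakehouse attached, tables are exposed under /lakehouse/default/Tables/<table>.
# Note: this reads the table's raw Parquet files directly, which is what causes the
# old-plus-new records issue described in the next reply when a Delta table is overwritten.
df = pd.read_parquet("/lakehouse/default/Tables/your_table_name")
print(df.head())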

 

Thanks & Regards,
Bhavesh

Love the Self Service BI.
Please use the 'Mark as answer' link to mark a post that answers your question. If you find a reply helpful, please remember to give Kudos.

Hi @BhaveshPatel 

The problem with reading the Parquet files is that if I overwrite the table, the Parquet read brings back both the new records and the old ones.

anilgavhane
Resolver III

@lsabetta 

Yes, you can read tables from a Microsoft Fabric Lakehouse without creating a Spark session, by using Lakehouse shortcuts and native connectors in Python. Here are a few options:

 

🔹 1. Use the fabric Python SDK (Preview)

Microsoft Fabric offers a Python SDK that allows you to interact with Lakehouse tables directly from a notebook using Pandas, without spinning up Spark.

 

from fabric import LakehouseClient

client = LakehouseClient(workspace_id="your_workspace_id", lakehouse_id="your_lakehouse_id")
df = client.read_table("your_table_name")



 

This avoids Spark entirely and loads the table as a Pandas DataFrame.

 

🔹 2. Use REST APIs or ODBC/JDBC Connectors

You can access Lakehouse tables via:

  • REST API: For metadata and file-level access.
  • ODBC/JDBC: If your Lakehouse is exposed via SQL endpoints (like Warehouse or SQL Analytics).

These methods allow you to query structured data using Python libraries like pyodbc, sqlalchemy, or pandas.read_sql().
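
As a rough sketch of the ODBC route: the server, database, and table names below are placeholders (copy the real SQL analytics endpoint from the Lakehouse settings), and it assumes the Microsoft ODBC Driver 18 for SQL Server is installed and interactive Azure AD sign-in is allowed.

import pandas as pd
import pyodbc

# Placeholder connection details for the Lakehouse SQL analytics endpoint.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=your-endpoint.datawarehouse.fabric.microsoft.com;"
    "Database=YourLakehouse;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)

# Query the table with plain SQL and load the result into a pandas DataFrame.
df = pd.read_sql("SELECT TOP 100 * FROM dbo.your_table_name", conn)
conn.close()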

 

🔹 3. Export Lakehouse Tables to Files

If your Lakehouse tables are stored as Delta or Parquet files:

  • You can mount or access the file path directly using pyarrow, fastparquet, or pandas.read_parquet().

 

import pandas as pd

df = pd.read_parquet("https://yourlakehouseurl/path/to/table")

 

This bypasses Spark and loads the data directly into memory.

v-venuppu
Community Support

Hi @lsabetta ,

Thank you for reaching out to Microsoft Fabric Community.

Thank you @pallavi_r for the prompt response.

The slow startup isn’t from your code or data size - it’s Spark cluster spin-up, which always takes a few minutes.

There’s no Spark config that makes session creation instant.

Below are a few options:

  • If the tables are small, skip Spark and load them directly into Pandas/SQL (faster, no cluster).
  • Keep the Spark session alive instead of restarting it often.
  • Ask your admin if a smaller/faster Spark pool is available.
  • Use Spark only when you need distributed compute; otherwise stick with Python.

In short: you can't make Spark spin up faster, but you can avoid Spark altogether or keep the session warm.

Hi @v-venuppu ,

Thanks for your answer.

My notebooks are written in pandas because my tables are small and the transformations are simple. I could use Python instead of PySpark, but the thing is that I need to read tables from my lakehouse.

Is there any way to read tables from a lakehouse without creating a Spark session?

pallavi_r
Super User

Hi @lsabetta ,

 

Here are a couple of tips to reduce Spark session start-up time: keep high concurrency mode enabled, and detach the session from one notebook when you're done so it frees up for other notebooks and stays alive.

https://www.linkedin.com/pulse/fabric-notebook-performance-hack-reuse-spark-sessions-tiago-balabuch-...

https://blog.fabric.microsoft.com/en-US/blog/supercharge-your-workloads-write-optimized-default-spar...

 

Thanks,

Pallavi
