joshuaking
Regular Visitor

Pandas or Spark

Hi all,

 

I am beginning the research and setup of Fabric in a Trial Environment and had some questions about Spark vs. Pandas.

 

I understand that Spark is more tuned for "Big Data", due to its distributed processing system. But I am trying to decide if my datasets will be large enough to benefit from PySpark or if we should just stick to Pandas.

 

I've done some counting and have found that the majority of my fact tables will not exceed 500,000 rows, and each will only grow by approximately 20,000 new rows per semester (I work in higher ed). We plan to use an F8 capacity, but we don't know how it would handle a Pandas DataFrame of that size because we are currently on the F64 capacity trial. If you were in a similar environment, would you find Pandas or PySpark within your notebooks to be a better fit? Why would you choose the option you chose? Thanks!


A Python Notebook (which doesn't use Spark; it just runs on a single node) is in the release plans:

 

https://learn.microsoft.com/en-us/fabric/release-plan/data-engineering#python-notebook

 

 

It will be suitable for Pandas, Polars, etc., because those libraries run on a single node.

 

People say Polars (whose API is quite similar to Pandas) is more performant than Pandas:

 

https://www.reddit.com/r/MicrosoftFabric/s/awK3fcctsx

 

(See the response from PawarBI)
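
If it helps, here is a minimal sketch of what single-node work looks like with either library; the /lakehouse/default mount path assumes a default Lakehouse is attached to the notebook, and the file name is made up:

import pandas as pd
import polars as pl

# Both libraries run entirely on the notebook's single node
path = "/lakehouse/default/Files/enrollments.parquet"  # hypothetical file

pdf = pd.read_parquet(path)   # Pandas: eager evaluation, largely single-threaded
pldf = pl.read_parquet(path)  # Polars: multi-threaded, with an optional lazy API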

This is very helpful! Thanks for the additional threads!

FabianSchut
Super User

PySpark is more native in Fabric than Pandas, especially for reading from and writing to Tables in a Lakehouse. It is possible to write a Pandas DataFrame directly to a Delta table (https://delta.io/blog/2023-04-01-create-append-delta-lake-table-pandas/), but that still feels more like a workaround than the native way.
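
For illustration, a minimal sketch of that workaround using the deltalake package from the linked post; the table path and column names are placeholders:

import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({"student_id": [1, 2], "credits": [15, 12]})

# mode="append" adds rows; mode="overwrite" replaces the table
write_deltalake("/lakehouse/default/Tables/enrollments", df, mode="append")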

 

However, if you read data from other sources and/or write to destinations other than the Tables section of the Lakehouse, there are scenarios where Pandas will be preferred over PySpark. Your tables are indeed not very large, and Pandas will perform well in these cases. Based on your description, an F8 will perform well with both Pandas and PySpark DataFrames.
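
A rough sketch of the more native route - read with Pandas, then hand off to Spark for the Lakehouse write; the source URL and table name are hypothetical:

import pandas as pd

# Read from an external source with Pandas (hypothetical URL)
pdf = pd.read_csv("https://example.edu/exports/enrollments.csv")

# `spark` is the session Fabric predefines in Spark notebooks
sdf = spark.createDataFrame(pdf)
sdf.write.format("delta").mode("append").saveAsTable("enrollments")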

Yeah, the main reason we fell back on Pandas for this particular CSV is that the format sucked and wasn't playing nicely with the native PySpark ingestion routines.
None of our other CSV sources suffer from these problems.
There are also reasons to use Pandas for file export - it's the easiest way to get a single output file, rather than one file per partition with system-generated names.
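
For example, something along these lines (paths and table name made up):

import pandas as pd

# Small table: pull it to the driver and write one file with Pandas
pdf = spark.read.table("enrollments").toPandas()
pdf.to_csv("/lakehouse/default/Files/exports/enrollments.csv", index=False)

# The Spark route still writes a folder of part-*.csv files with
# system-generated names, even after coalesce(1):
# spark.read.table("enrollments").coalesce(1).write.csv("Files/exports/")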

spencer_sa
Super User

We've got a 1.5M-row, 145-column (poorly formed*) CSV that we ingest monthly using Pandas, to fix it up and then write to a Delta table.
Pandas quite happily handles 1.5M rows on an F8. (We currently run at F32 because our semantic model is >5GB.)

* CRLF in fields, two columns with the same name, columns with bad characters in the header.
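
For anyone curious, a rough sketch of the kind of cleanup Pandas makes easy here; the path and table name are made up, and Pandas' default parser already copes with CRLF inside quoted fields:

import re
import pandas as pd

# Duplicate headers arrive deduplicated as "name", "name.1", etc.
pdf = pd.read_csv("/lakehouse/default/Files/raw/monthly.csv")

# Strip bad characters from headers before the Delta write
pdf.columns = [re.sub(r"[^0-9A-Za-z_]", "_", c) for c in pdf.columns]

spark.createDataFrame(pdf).write.format("delta").mode("overwrite").saveAsTable("monthly_clean")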
