Hi all,
I am beginning the research and setup of Fabric in a Trial Environment and had some questions about Spark vs. Pandas.
I understand that Spark is more tuned for "Big Data" due to its distributed processing model, but I am trying to decide if my datasets will be large enough to benefit from PySpark or if we should just stick to Pandas.
I've done some counting and have found that the majority of my fact tables will not exceed 500,000 rows, and each will only gain approximately 20,000 new rows per semester (I work in higher ed). We plan to use an F8 capacity, but we don't know how it would handle a Pandas DataFrame of that size because we are currently on the F64 trial capacity. If you were in a similar environment, would you find Pandas or PySpark within your notebooks to be a better fit? Why would you choose the option you chose? Thanks!
A Python notebook (which doesn't use Spark; it just runs on a single node) is in the release plans:
https://learn.microsoft.com/en-us/fabric/release-plan/data-engineering#python-notebook
It will be suitable for Pandas, Polars, etc., because those libraries only run on a single node anyway.
People say Polars (which is quite similar to Pandas) is more performant:
https://www.reddit.com/r/MicrosoftFabric/s/awK3fcctsx
(See the response from PawarBI)
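As a rough idea of what that looks like, here is a minimal sketch of reading a Lakehouse Delta table with Polars on a single node. The table name and the /lakehouse/default mount path are assumptions, and it assumes the `polars` and `deltalake` packages are available in the notebook environment:

```python
# Minimal sketch: read a Lakehouse Delta table with Polars (single node).
# The table name and mount path are assumptions for illustration.
import polars as pl

table_path = "/lakehouse/default/Tables/fact_enrollment"  # hypothetical table

df = pl.read_delta(table_path)   # lazy alternative: pl.scan_delta(table_path)
print(df.shape)

# A typical Polars aggregation, analogous to a Pandas groupby.
summary = df.group_by("semester").agg(pl.len().alias("row_count"))
print(summary)
```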
This is very helpful! Thanks for the additional threads!
PySpark is more native to Fabric than Pandas, especially for reading and writing Tables in a Lakehouse. It is possible to write a Pandas DataFrame directly to a Delta table (https://delta.io/blog/2023-04-01-create-append-delta-lake-table-pandas/), but that still feels more like a workaround than the native way.
However, if you read data from other sources and/or write data to destinations other than the Tables section of the Lakehouse, there are scenarios where Pandas will be preferred over PySpark. Your tables are indeed not very large, and Pandas will perform well in these cases. Looking at your description, an F8 will perform well with both Pandas and PySpark DataFrames.
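To illustrate the workaround from the linked blog post, here is a minimal sketch of appending a Pandas DataFrame to a Lakehouse Delta table with the delta-rs `deltalake` package; the column names, table name, and mount path are assumptions:

```python
# Minimal sketch: write a Pandas DataFrame to a Delta table without Spark,
# using the `deltalake` (delta-rs) writer from the linked blog post.
# Table name and Lakehouse mount path are assumptions for illustration.
import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({
    "student_id": [1001, 1002, 1003],
    "semester": ["2024-FA", "2024-FA", "2024-FA"],
    "credits": [15, 12, 9],
})

# Append to (or create) a Delta table in the Lakehouse Tables section.
write_deltalake("/lakehouse/default/Tables/fact_enrollment", df, mode="append")
```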
Yeah, the main reason we fell back on Pandas for this particular CSV is that the format sucked and wasn't playing nicely with the native PySpark ingestion routines.
None of our other CSV sources suffer from these problems.
There are also reasons to use Pandas for file export too - it's the easiest way of getting a single file output, rather than one file per partition with system-generated names.
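A quick sketch of that difference, with paths assumed for illustration:

```python
# Minimal sketch: Pandas writes exactly one file with the name you choose,
# whereas Spark writes a directory of part-* files. Paths are assumptions.
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Single file output with Pandas.
df.to_csv("/lakehouse/default/Files/exports/report.csv", index=False)

# By contrast, Spark produces a folder of part-*.csv files:
# spark_df.write.csv("Files/exports/report_dir", header=True)
```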
We've got a 1.5M row, 145 column (poorly formed*) CSV that we are ingesting monthly using Pandas to then fix and write to a Delta table.
Pandas quite happily handles 1.5M rows on an F8. (We currently run at F32 because our semantic model is >5GB.)
* CRLF characters inside fields, two columns with the same name, and bad characters in the header.
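As a rough illustration of that kind of cleanup (duplicate headers, bad characters in column names, stray CRLFs) before landing the data in Delta, here is a minimal sketch; the file path, column rules, and table name are assumptions:

```python
# Minimal sketch: read a poorly formed CSV with Pandas, clean it up, and
# write it to a Delta table via Spark (the `spark` session is available by
# default in a Fabric notebook). Paths and cleanup rules are assumptions.
import re
import pandas as pd

df = pd.read_csv("/lakehouse/default/Files/raw/monthly_extract.csv",
                 engine="python", dtype=str)

# Pandas already de-duplicates repeated column names (X, X.1, ...); here we
# also replace characters that Delta does not allow in column names.
df.columns = [re.sub(r"[ ,;{}()\n\t=]", "_", c) for c in df.columns]

# Remove stray carriage returns left inside string fields.
df = df.fillna("")
df = df.apply(lambda col: col.str.replace("\r", " ", regex=False))

# Hand off to Spark for the native Delta write.
spark_df = spark.createDataFrame(df)
spark_df.write.mode("overwrite").format("delta").saveAsTable("fact_monthly_extract")
```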