Hi all,
I am beginning the research and setup of Fabric in a Trial Environment and had some questions about Spark vs. Pandas.
I understand that Spark is more tuned for "Big Data" due to its distributed processing model, but I am trying to decide if my datasets will be large enough to benefit from PySpark or if we should just stick to Pandas.
I've done some counting and have found that the majority of my fact tables will not exceed 500,000 rows, and each will only gain approximately 20,000 new rows per semester (I work in higher ed). We plan to use an F8 capacity, but we don't know how it would handle a Pandas DataFrame of that size because we are currently on the F64 trial capacity. If you were in a similar environment, would you find Pandas or PySpark within your notebooks to be a better fit? Why would you choose the option you chose? Thanks!
A Python notebook (which doesn't use Spark; it just runs on a single node) is in the release plans:
https://learn.microsoft.com/en-us/fabric/release-plan/data-engineering#python-notebook
It will be suitable for Pandas, Polars, etc., because those libraries only run on a single node anyway.
People say Polars (which is quite similar to Pandas) is more performant:
https://www.reddit.com/r/MicrosoftFabric/s/awK3fcctsx
(See the response from PawarBI)
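As a rough idea of what that looks like, here is a minimal sketch of reading a Lakehouse Delta table with Polars on a single node. The table name and the /lakehouse/default mount path are assumptions, and it assumes the `polars` and `deltalake` packages are available in the notebook environment:

```python
# Minimal sketch: read a Lakehouse Delta table with Polars (single node).
# The table name and mount path are assumptions for illustration.
import polars as pl

table_path = "/lakehouse/default/Tables/fact_enrollment"  # hypothetical table

df = pl.read_delta(table_path)   # lazy alternative: pl.scan_delta(table_path)
print(df.shape)

# A typical Polars aggregation, analogous to a Pandas groupby.
summary = df.group_by("semester").agg(pl.len().alias("row_count"))
print(summary)
```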
This is very helpful! Thanks for the additional threads!
PySpark is more native to Fabric than Pandas, especially for reading and writing Tables in a Lakehouse. It is possible to write a Pandas DataFrame directly to a Delta table (https://delta.io/blog/2023-04-01-create-append-delta-lake-table-pandas/), but that still feels more like a workaround than the native way.
However, if you read data from other sources and/or write data to destinations other than the Tables section of the Lakehouse, there are scenarios where Pandas will be preferred over PySpark. Your tables are indeed not very large, and Pandas will perform well in these cases. Looking at your description, an F8 will perform well with both Pandas and PySpark DataFrames.
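To illustrate the workaround from the linked blog post, here is a minimal sketch of appending a Pandas DataFrame to a Lakehouse Delta table with the delta-rs `deltalake` package; the column names, table name, and mount path are assumptions:

```python
# Minimal sketch: write a Pandas DataFrame to a Delta table without Spark,
# using the `deltalake` (delta-rs) writer from the linked blog post.
# Table name and Lakehouse mount path are assumptions for illustration.
import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({
    "student_id": [1001, 1002, 1003],
    "semester": ["2024-FA", "2024-FA", "2024-FA"],
    "credits": [15, 12, 9],
})

# Append to (or create) a Delta table in the Lakehouse Tables section.
write_deltalake("/lakehouse/default/Tables/fact_enrollment", df, mode="append")
```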
Yeah, the main reason we fell back on Pandas for this particular CSV is that the format sucked and wasn't playing nicely with the native PySpark ingestion routines.
None of our other CSV sources suffer from these problems.
There are also reasons to use Pandas for file export too - it's the easiest way of getting a single file output, rather than one file per partition with system-generated names.
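A quick sketch of that difference, with paths assumed for illustration:

```python
# Minimal sketch: Pandas writes exactly one file with the name you choose,
# whereas Spark writes a directory of part-* files. Paths are assumptions.
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Single file output with Pandas.
df.to_csv("/lakehouse/default/Files/exports/report.csv", index=False)

# By contrast, Spark produces a folder of part-*.csv files:
# spark_df.write.csv("Files/exports/report_dir", header=True)
```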
We've got a 1.5M row, 145 column (poorly formed*) CSV that we are ingesting monthly using Pandas to then fix and write to a Delta table.
Pandas quite happily handles 1.5M rows on an F8. (We currently run at F32 because our semantic model is >5GB.)
* CRLF characters inside fields, two columns with the same name, and bad characters in the header.
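As a rough illustration of that kind of cleanup (duplicate headers, bad characters in column names, stray CRLFs) before landing the data in Delta, here is a minimal sketch; the file path, column rules, and table name are assumptions:

```python
# Minimal sketch: read a poorly formed CSV with Pandas, clean it up, and
# write it to a Delta table via Spark (the `spark` session is available by
# default in a Fabric notebook). Paths and cleanup rules are assumptions.
import re
import pandas as pd

df = pd.read_csv("/lakehouse/default/Files/raw/monthly_extract.csv",
                 engine="python", dtype=str)

# Pandas already de-duplicates repeated column names (X, X.1, ...); here we
# also replace characters that Delta does not allow in column names.
df.columns = [re.sub(r"[ ,;{}()\n\t=]", "_", c) for c in df.columns]

# Remove stray carriage returns left inside string fields.
df = df.fillna("")
df = df.apply(lambda col: col.str.replace("\r", " ", regex=False))

# Hand off to Spark for the native Delta write.
spark_df = spark.createDataFrame(df)
spark_df.write.mode("overwrite").format("delta").saveAsTable("fact_monthly_extract")
```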