AlexanderPowBI
Advocate I

General question - Pandas vs Spark in Fabric Notebook

Hi,

 

I have a general question about using Fabric notebooks. I'm quite new to this world, so I'm trying to learn best practices and so on.

 

So far, for my data pipelines I have used Spark DataFrames for my transformations and for saving tables, even for small datasets. My understanding is that a Spark DataFrame is required to save the data as a Delta table. However, some of my datasets are quite small.

 

Is there an actual benefit to using pandas instead of Spark and then converting to a Spark DataFrame before saving as a Delta table? Would that save resources? My understanding is that it still runs on a Spark cluster, right?

 

So in general the question is: when should I use pandas over Spark in a Fabric notebook, if Spark has all the functionality I need for the specific job?

 

Best regards,

Alexander 

1 ACCEPTED SOLUTION
v-nikhilan-msft
Community Support

Hi @AlexanderPowBI 
Thanks for using Fabric Community.

The choice between using Pandas and Spark in a Fabric notebook depends on several factors, including the size of your dataset, the complexity of your data processing, and the resources available in your Spark cluster.

Here are some points to consider:

  1. Dataset Size: If your dataset is small enough to fit into memory, then using Pandas could be faster and more efficient because it avoids the overhead of distributing the computation across a Spark cluster. On the other hand, if your dataset is too large to fit into memory, then you’ll need to use Spark.

  2. Data Processing Complexity: Spark DataFrames have many built-in functions for complex data processing and transformations that might not be available or as efficient in Pandas. If you’re doing complex data processing, Spark might be a better choice.

  3. Resource Utilization: Even though a Spark DataFrame operation runs on a Spark cluster, it doesn’t mean it’s always more resource-intensive. Spark is designed to handle big data workloads and can efficiently manage resources. However, if you’re working with small datasets and your computations are not complex, using Pandas could potentially save resources.

  4. Saving as Delta Table: You’re correct that you need a Spark DataFrame to save as a Delta table. If you decide to use Pandas for your computations, you’ll need to convert your Pandas DataFrame to a Spark DataFrame before you can save it as a Delta table.

  5. Resource Considerations:

    • Overhead: Spark introduces some overhead compared to pandas, so for very small datasets, it might not be the most resource-efficient choice.
    • Cluster Management: If you're not actively using Spark for other distributed tasks, you might want to consider resource utilization when using it only for saving Delta tables with small datasets.

In general, if Spark has all the functionality you need for your specific job and you’re comfortable using it, then it’s a good choice. However, if you’re working with small datasets and your data processing isn’t too complex, then using Pandas could potentially be faster and more resource-efficient. 
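As a rough sketch of the pattern described above (transform in pandas, convert to a Spark DataFrame only at write time), the snippet below uses invented sample data and an example table name `sales_summary`; the `spark` session object is assumed to be the one a Fabric notebook provides automatically.

```python
import pandas as pd

# Small dataset: do the transformation in pandas, where it is cheap
# and avoids Spark's distribution overhead.
df = pd.DataFrame({"product": ["a", "b", "a"], "amount": [10, 20, 30]})
summary = df.groupby("product", as_index=False)["amount"].sum()

# Convert to a Spark DataFrame only for the write, since the Delta
# table is saved through the Spark API.
try:
    spark  # provided automatically in a Fabric notebook session
    sdf = spark.createDataFrame(summary)
    sdf.write.format("delta").mode("overwrite").saveAsTable("sales_summary")
except NameError:
    pass  # outside a Spark session, skip the write step
```

The key point is that only the final `createDataFrame` and write touch the Spark cluster; everything before that runs in the driver's memory with pandas.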

Hope this helps. Please let me know if you have any further questions.



Hi @AlexanderPowBI 
Glad that your query got resolved. Please continue using Fabric Community for any help regarding your queries.
