Hi,
I have a general question about using Fabric notebooks. I'm quite new to this world, so I'm trying to learn best practices.
So far in my data pipelines I have used Spark DataFrames for my transformations and table saves, even for small datasets. My understanding is that a Spark DataFrame is required to save the data as a Delta table. However, some of my datasets are quite small.
Is there an actual benefit to using pandas instead of Spark and then converting to a Spark DataFrame before saving as a Delta table? Would that save resources? My understanding is that it still runs on a Spark cluster, right?
So the general question is: when should I use pandas over Spark in a Fabric notebook, if Spark has all the functionality I need for the specific job?
Best regards,
Alexander
Hi @AlexanderPowBI
Thanks for using Fabric Community.
The choice between using Pandas and Spark in a Fabric notebook depends on several factors, including the size of your dataset, the complexity of your data processing, and the resources available in your Spark cluster.
Here are some points to consider:
Dataset Size: If your dataset is small enough to fit into memory, then using Pandas could be faster and more efficient because it avoids the overhead of distributing the computation across a Spark cluster. On the other hand, if your dataset is too large to fit into memory, then you’ll need to use Spark.
Data Processing Complexity: Spark DataFrames have many built-in functions for complex data processing and transformations that might not be available or as efficient in Pandas. If you’re doing complex data processing, Spark might be a better choice.
Resource Utilization: Even though a Spark DataFrame operation runs on a Spark cluster, it doesn’t mean it’s always more resource-intensive. Spark is designed to handle big data workloads and can efficiently manage resources. However, if you’re working with small datasets and your computations are not complex, using Pandas could potentially save resources.
Saving as Delta Table: You're correct that you need a Spark DataFrame to save as a Delta table. If you decide to use Pandas for your computations, you'll need to convert your Pandas DataFrame to a Spark DataFrame before you can save it as a Delta table (see the sketch below).
In general, if Spark has all the functionality you need for your specific job and you're comfortable using it, then it's a good choice. However, if you're working with small datasets and your data processing isn't too complex, then using Pandas could be faster and more resource-efficient.
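To make the Pandas-to-Spark hand-off concrete, here is a minimal sketch of the pattern described above. The file path, column names, and table name are hypothetical placeholders, and `spark` refers to the SparkSession a Fabric notebook provides by default:

```python
import pandas as pd

# Do the lightweight transformation in pandas (single-node, in-memory).
# The path and column names below are illustrative placeholders.
pdf = pd.read_csv("/lakehouse/default/Files/sales.csv")
pdf["revenue"] = pdf["quantity"] * pdf["unit_price"]

# Convert to a Spark DataFrame, since the Delta write goes through Spark.
sdf = spark.createDataFrame(pdf)

# Save to the lakehouse as a managed Delta table.
sdf.write.format("delta").mode("overwrite").saveAsTable("sales_enriched")
```

For comparison, the pure-Spark version would read with `spark.read.csv(...)` and transform with `withColumn(...)`, skipping the conversion step entirely. Either way the final write goes through Spark, so for small datasets the Pandas route mainly avoids distributing the transformation itself.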
Hope this helps. Please let me know if you have any further questions.
Hi @AlexanderPowBI
Glad your query got resolved. Please continue using the Fabric Community for help with any future queries.