In my current project, I am performing entity resolution using Spark. The library I use is Splink, my dataset is huge (around 13M rows), and Splink is naturally a computationally heavy package. In the current setup, running the comparisons for even 500K rows takes a very long time on a Starter Pool. I have an F64 capacity and would appreciate suggestions on a Spark config that would help my use case.
Hi @sreedharshan_10,
Thanks for reaching out to Microsoft Fabric Community Forum!
Since you're using Splink for entity resolution on a large dataset (~13M rows) and seeing slowness with just 500K rows on an F64 capacity within a Starter Pool, this is likely due to both the intensive nature of the workload and the resource limits of Starter Pools in Microsoft Fabric.
For better performance, we recommend moving to a custom Spark pool in Fabric, which allows tuning of executors, memory, and parallelism. As a starting point, you can apply Spark config settings such as raising spark.executor.memory to 16g, spark.driver.memory to 32g, and setting spark.sql.shuffle.partitions to around 300, then adjust based on what you observe.
Consider repartitioning your data and leveraging Splink's blocking rules to reduce the number of candidate comparisons; without blocking, pairwise comparison cost grows roughly quadratically with row count. Microsoft Fabric's integrated Spark monitoring tools can help identify skew, memory pressure, or shuffle bottlenecks.
Glad I could assist! If this answer helped resolve your issue, please mark it as the Accepted Solution and give it Kudos to guide others facing the same concern.
Thank you.
Hi @sreedharshan_10,
May I ask if you have gotten this issue resolved?
If it is solved, please mark the helpful reply as the solution, or share your own solution and accept it; this will help other community members with similar problems find the answer faster.
Thank you for using Microsoft Fabric Community Forum.