sreedharshan_10
Frequent Visitor

Spark sessions for entity resolution

In my current project, I am performing entity resolution with Spark, using the Splink library, and my dataset is huge; Splink is by nature a computationally heavy package. My dataset has around 13M rows, and in the current setup, running the comparisons for just 500K rows takes a very long time on a Starter Pool. I have an F64 capacity and would appreciate suggestions on a Spark config that suits this use case.

1 ACCEPTED SOLUTION
v-sgandrathi
Community Support

Hi @sreedharshan_10,

Thanks for reaching out to the Microsoft Fabric Community Forum!

Since you're using Splink for entity resolution on a large dataset (~13M rows) and seeing slowness with just 500K rows on an F64 capacity within a Starter Pool, the bottleneck is likely a combination of the intensive nature of the workload and the resource limits of Starter Pools in Microsoft Fabric.
For better performance, we recommend moving to a custom Spark pool in Fabric, which allows tuning of executors, memory, and parallelism. You can then apply Spark config settings such as increasing spark.executor.memory to 16g, raising spark.driver.memory to 32g, and setting spark.sql.shuffle.partitions to 300; one way to apply these is sketched below.
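In a Fabric notebook, one way to apply those settings is the %%configure magic in the first cell of the session. A minimal sketch, assuming the memory and partition values above (treat them as starting points to tune against your pool's node size):

%%configure -f
{
    "driverMemory": "32g",
    "executorMemory": "16g",
    "conf": {
        "spark.sql.shuffle.partitions": "300"
    }
}

Note that spark.sql.shuffle.partitions can also be changed mid-session with spark.conf.set("spark.sql.shuffle.partitions", "300"), whereas executor and driver memory must be fixed before the session starts (via %%configure, the pool definition, or a Fabric Environment).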
Consider also repartitioning your data and leveraging Splink's blocking rules to reduce the number of comparisons; a rough example follows. Microsoft Fabric's integrated Spark monitoring tools can then help identify skew, memory pressure, or shuffle bottlenecks.
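As a rough illustration of the blocking-rule approach, here is a minimal sketch written against Splink 4's Spark backend. The table name and the columns first_name, surname, and dob are placeholders for your own matching keys, and exact imports and constructor options vary by Splink version:

import splink.comparison_library as cl
from splink import Linker, SettingsCreator, SparkAPI, block_on

# Repartition up front so candidate-pair generation is spread evenly across executors.
df = spark.read.table("my_lakehouse_table").repartition(300)

# Blocking rules: only record pairs that agree on one of these keys are compared,
# which shrinks the candidate space from ~13M x 13M down to something tractable.
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "surname"),
        block_on("dob"),
    ],
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
    ],
)

db_api = SparkAPI(spark_session=spark)
linker = Linker(df, settings, db_api)

# Model-training steps (e.g. linker.training.estimate_u_using_random_sampling)
# are omitted for brevity; run them before predicting on real data.
pairwise_predictions = linker.inference.predict()

Tight blocking rules are usually a far bigger lever for Splink runtime than Spark memory settings, so it is worth checking how many comparisons each rule generates before scaling up the pool.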

 

Glad I could assist! If this answer helped resolve your issue, please mark it as Accept as Solution and give us Kudos to guide others facing the same concern.

 

Thank you.

 

View solution in original post

2 REPLIES 2
v-sgandrathi
Community Support

Hi @sreedharshan_10,

 

May I ask if you have gotten this issue resolved?

If it is solved, please mark the helpful reply or share your solution and accept it as solution, it will be helpful for other members of the community who have similar problems as yours to solve it faster.

 

Thank you for using Microsoft Fabric Community Forum.
