sreedharshan_10
Frequent Visitor

Spark sessions for entity resolution

In my current project, I am performing entity resolution in Spark using the Splink library, which is naturally a computationally heavy package. My dataset has around 13M rows, and in the current setup, running the comparisons on just 500K rows takes a very long time on a starter pool. I have an F64 capacity, and I would appreciate suggestions on a Spark config that would suit my use case.

1 ACCEPTED SOLUTION
v-sgandrathi
Community Support

Hi @sreedharshan_10,

Thanks for reaching out to the Microsoft Fabric Community Forum!
Since you're running Splink entity resolution on a large dataset (~13M rows) and seeing slowness with just 500K rows on an F64 capacity within a Starter Pool, this is likely due to both the compute-intensive nature of the workload and the fixed resource limits of Starter Pools in Microsoft Fabric.
For better performance, we recommend moving to a custom Spark pool in Fabric, which allows tuning of executors, memory, and parallelism. You can then apply Spark config settings such as increasing spark.executor.memory to 16g and spark.driver.memory to 32g, and setting spark.sql.shuffle.partitions to around 300.
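
To make this concrete, here is a minimal sketch of a session-level configuration cell. It assumes a Fabric notebook, where the Livy-style %%configure magic is supported; the memory, core, and executor values are illustrative starting points to tune against your pool size, not recommended settings:

%%configure -f
{
    "driverMemory": "32g",
    "executorMemory": "16g",
    "executorCores": 8,
    "numExecutors": 8,
    "conf": {
        "spark.sql.shuffle.partitions": "300"
    }
}

Run this in the first cell of the notebook: the -f flag forces the session to restart with the new settings, and properties like executor memory cannot be changed on a live session via spark.conf.set.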
Consider also repartitioning your data and leveraging Splink's blocking rules to reduce the number of pairwise comparisons. Microsoft Fabric provides integrated Spark monitoring tools that can help identify skew, memory pressure, or shuffle bottlenecks.
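
As an illustration of the blocking-rule advice, here is a minimal sketch assuming Splink 4's API (SettingsCreator, SparkAPI, block_on, and the Linker's namespaced inference methods). The table path and the column names first_name, surname, dob, and postcode are hypothetical placeholders for your own schema, and the model-training steps are omitted:

# Minimal sketch, assuming Splink 4 on Fabric Spark; names below are placeholders.
import splink.comparison_library as cl
from splink import Linker, SettingsCreator, SparkAPI, block_on

# Hypothetical source table; repartition so the pairwise comparison work
# spreads evenly across executors instead of piling onto a few partitions.
df = spark.read.format("delta").load("Tables/customers").repartition(300)

settings = SettingsCreator(
    link_type="dedupe_only",
    # Blocking rules: only record pairs that agree on these keys are ever
    # compared, collapsing the ~n^2 candidate space to something tractable.
    blocking_rules_to_generate_predictions=[
        block_on("postcode"),
        block_on("surname", "dob"),
    ],
    comparisons=[
        cl.LevenshteinAtThresholds("first_name", 2),
        cl.ExactMatch("postcode"),
    ],
)

linker = Linker(df, settings, db_api=SparkAPI(spark_session=spark))

# Parameter estimation (u-probabilities, EM training) omitted for brevity;
# predict() scores only the candidate pairs produced by the blocking rules.
predictions = linker.inference.predict()

Tighter blocking rules are usually the biggest lever here: check how many candidate pairs each rule generates on a sample before running predict over the full 13M rows.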

 

Glad I could assist! If this answer helped resolve your issue, please mark it as "Accept as Solution" and give us Kudos to guide others facing the same concern.

 

Thank you.

 


2 REPLIES
v-sgandrathi
Community Support


Hi @sreedharshan_10,

 

May I ask if you have gotten this issue resolved?

If it is solved, please mark the helpful reply as the solution, or share your own solution and accept it, so that other community members with similar problems can resolve them faster.

 

Thank you for using Microsoft Fabric Community Forum.
