In my current project, I am performing entity resolution using Spark. The library I use is Splink, my dataset is huge (around 13M rows), and Splink is naturally a computationally heavy package. In the current setup, running the comparisons for even 500K rows takes a very long time on a Starter Pool. I have an F64 capacity and would appreciate suggestions on a Spark config that would help my use case.
Hi @sreedharshan_10,
Thanks for reaching out to Microsoft Fabric Community Forum!
Since you're using Splink for entity resolution on a large dataset (~13M rows) and seeing slowness with just 500K rows on an F64 capacity within a Starter Pool, this is likely due to both the intensive nature of the workload and the resource limits of Starter Pools in Microsoft Fabric.
For better performance, we recommend moving to a custom Spark pool in Fabric, which allows tuning of executors, memory, and parallelism. As a starting point, you can apply Spark config settings such as raising spark.executor.memory to 16g, spark.driver.memory to 32g, and setting spark.sql.shuffle.partitions to around 300, then adjust based on what you observe.
Consider repartitioning your data and leveraging Splink's blocking rules to reduce the number of candidate comparisons; without blocking, pairwise comparison cost grows roughly quadratically with row count. Microsoft Fabric's integrated Spark monitoring tools can help identify skew, memory pressure, or shuffle bottlenecks.
Glad I could assist! If this answer helped resolve your issue, please mark it as the Accepted Solution and give it Kudos to guide others facing the same concern.
Thank you.
Hi @sreedharshan_10,
May I ask if you have gotten this issue resolved?
If it is solved, please mark the helpful reply as the solution, or share your own solution and accept it; this will help other community members with similar problems find the answer faster.
Thank you for using Microsoft Fabric Community Forum.