Explore and share Fabric Notebooks to boost Power BI insights in the new community notebooks gallery.
Check it out now!Microsoft is giving away 50,000 FREE Microsoft Certification exam vouchers. Get Fabric certified for FREE! Learn more
I am trying to do a left join fuzzy merge and DF Gen2 just too slow and takes hours before failing.
Started with data of over 12 million rows as the left table and right table about 1000 rows, then scaled down the left table to 1000 rows too, thinking it was the size, but I still got the same issue.
Tried to use the notebook and script option, but the large left table is not even showing the data in the lakehouse delta table.
I understand fuzzy matching can be computationally intensive, but it works fine but slow with smaller data in PowerBI power query, thought DF Gen2 will be better and able to handle more data. Staging and Fastcopy did not make any difference.
Kindly help and advice. Thank you.
@nilendraFabric @spencer_sa @all
Solved! Go to Solution.
Hi @Teskay,
Thanks for reaching out to the Microsoft fabric community forum.
It looks like you're facing performance issues with fuzzy merging in Dataflow Gen2, especially when dealing with large datasets. Fuzzy matching is inherently resource-intensive, and DF Gen2 may not be the best tool for handling such operations at scale. You can use Fabric Notebooks with PySpark which is best suited for large datasets and is optimized for distributed processing, instead of performing fuzzy matching in DF Gen2. You can also use a SQL-based approach with Soundex or Levenshtein distance as it runs faster in Lakehouse compared to DF Gen2.
If you still want to go with fuzzy matching then, reduce the number of comparisons by preprocessing. You can apply Standardization like convert all to lowercase or remove spaces and special characters before matching or use blocking techniques for example: compare only rows starting with the same letter.
If I misunderstand your needs or you still have problems on it, please feel free to let us know.
Best Regards,
Hammad.
Community Support Team
If this post helps then please mark it as a solution, so that other members find it more quickly.
Thank you.
As @v-mdharahman says, fuzzy matching is computationally expensive - you effectively have to do a cross-join between the two tables first.
There are a few optimisations you can do - as Hammad mentions you can reduce the number of comparisons. You can also remove exact matches (as equality is much quicker than Levenshtein) and you can also consider broadcasting the smaller table to all nodes (8GB limit).
Pyspark has both Soundex and Levenshtein built into it.
I did a basic proof of concept that took a table of 1 million random 8 character strings and another of 1000 strings and ran a filter on distances of 1 - on an F64 it took around 8 minutes (which was far faster than I had anticipated) and returned a single row (which was not what I was expecting so I'll need to convince myself it did what I wanted.)
Hi @Teskay,
Thanks for reaching out to the Microsoft fabric community forum.
It looks like you're facing performance issues with fuzzy merging in Dataflow Gen2, especially when dealing with large datasets. Fuzzy matching is inherently resource-intensive, and DF Gen2 may not be the best tool for handling such operations at scale. You can use Fabric Notebooks with PySpark which is best suited for large datasets and is optimized for distributed processing, instead of performing fuzzy matching in DF Gen2. You can also use a SQL-based approach with Soundex or Levenshtein distance as it runs faster in Lakehouse compared to DF Gen2.
If you still want to go with fuzzy matching then, reduce the number of comparisons by preprocessing. You can apply Standardization like convert all to lowercase or remove spaces and special characters before matching or use blocking techniques for example: compare only rows starting with the same letter.
If I misunderstand your needs or you still have problems on it, please feel free to let us know.
Best Regards,
Hammad.
Community Support Team
If this post helps then please mark it as a solution, so that other members find it more quickly.
Thank you.
User | Count |
---|---|
11 | |
4 | |
3 | |
2 | |
1 |
User | Count |
---|---|
5 | |
5 | |
4 | |
3 | |
3 |