Hi,
I am currently looking at writing some data cleansing routines to run over our Dynamics CRM data held in Dataverse. I have written a data flow in Azure Data Factory that queries all contacts (around 30K rows in our dev environment, nearer 70K in production) and then applies various cleansing routines, such as finding all lower- or upper-case names and applying sentence case, stripping out trailing punctuation, etc.
In all there are five routines, branched off from the one data source and run in parallel. Running the flow in the dev environment seems to take several hours, so I isolated one routine (finding lower-case names) and, after giving the data flow a medium compute size, it completed in around four hours. Is there a more efficient way of running routines like this, or is what I am doing OK and it's just a case of throwing compute resources at it?
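For illustration, the kind of logic each routine applies is roughly this (plain Python just to sketch the intent; the real work is done with data flow expressions):

```python
import re

def sentence_case(name: str) -> str:
    """Fix names that are entirely lower or upper case, e.g. 'SMITH' -> 'Smith'."""
    if name and (name.islower() or name.isupper()):
        return name[:1].upper() + name[1:].lower()
    return name

def strip_trailing_punctuation(value: str) -> str:
    """Remove trailing punctuation such as '.', ',' or ';' from a field."""
    return re.sub(r"[.,;:!\-]+$", "", value).rstrip()

print(sentence_case("SMITH"))                # Smith
print(strip_trailing_punctuation("Jones."))  # Jones
```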
Hi @eyeballkid
This forum is for discussing issues related to Fabric Data Factory. For Azure Data Factory issues, you can go to the following link, where people more familiar with Azure can help you solve the problem:
Azure Data Factory - Microsoft Community Hub
Here are some possible ideas for your reference:
Partitioning: Ensure your data is partitioned effectively. This can significantly reduce the time taken for transformations by distributing the workload across multiple nodes.
Caching: Use the cache transformation to store intermediate results, which can be reused in subsequent transformations, reducing redundant computations.
Dynamic Scaling: Adjust the compute size dynamically based on the workload. You can set up triggers to scale up during peak loads and scale down during off-peak times.
Integration Runtime Configuration: Optimize your integration runtime settings to ensure they are appropriately sized for your workload. Placing the integration runtime in the same region as your data sources can also reduce latency.
Incremental Processing: Instead of processing all data every time, consider implementing incremental data processing. This way, you only process new or changed data, which can drastically reduce processing time (see the incremental-query sketch below this list).
Parallelism: While you are already running routines in parallel, ensure that the parallelism is optimized. Sometimes, too many parallel processes can lead to resource contention.
Debug Mode: Use the debug mode to test and optimize your data flows with smaller datasets before running them on the full dataset. This can help identify bottlenecks and optimize transformations.
Databricks: For more complex transformations, consider using Azure Databricks. It integrates well with ADF and can handle large-scale data processing more efficiently (a short PySpark sketch of the same cleansing logic follows this list).
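For the incremental processing point, here is a minimal Python sketch of the watermark idea, assuming you query Dataverse through its Web API; the organisation URL, token handling, and selected columns are placeholders for your environment:

```python
import requests
from datetime import datetime, timezone

# Placeholder for your environment URL; authentication is out of scope here.
ORG_URL = "https://yourorg.api.crm.dynamics.com/api/data/v9.2"

def fetch_changed_contacts(last_watermark: datetime, token: str):
    """Return only contacts modified since the last successful cleansing run."""
    params = {
        "$select": "contactid,firstname,lastname",
        # 'modifiedon' is the standard Dataverse audit column.
        "$filter": f"modifiedon gt {last_watermark.isoformat()}",
    }
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}
    response = requests.get(f"{ORG_URL}/contacts", params=params, headers=headers)
    response.raise_for_status()
    return response.json()["value"]

# After a successful run, store datetime.now(timezone.utc) as the new watermark
# so the next run only touches rows changed since then.
```

In ADF itself the same idea is usually implemented by storing the watermark (for example in a small table or pipeline variable) and applying it as a filter on the source.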
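If you do move the heavier transformations to Databricks, a rough PySpark sketch of the sentence-case fix might look like this (the column name and sample data are assumptions, not your actual schema):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for the contacts dataset you would load from Dataverse.
df = spark.createDataFrame([("SMITH",), ("jones",), ("O'Brien",)], ["lastname"])

# Apply sentence case only where the name is entirely upper or lower case,
# leaving mixed-case names (e.g. "O'Brien") untouched.
fixed = df.withColumn(
    "lastname",
    F.when(
        (F.col("lastname") == F.upper("lastname"))
        | (F.col("lastname") == F.lower("lastname")),
        F.concat(
            F.upper(F.substring("lastname", 1, 1)),
            F.lower(F.expr("substring(lastname, 2, length(lastname))")),
        ),
    ).otherwise(F.col("lastname")),
)

fixed.show()
```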
Regards,
Nono Chen
If this post helps, then please consider accepting it as the solution to help the other members find it more quickly.