eyeballkid
Frequent Visitor

Data Cleansing Data Flow

Hi,

I am currently looking at writing some data cleansing routines to run over our Dynamics CRM data held in Dataverse. I have written a data flow in Azure Data Factory that queries all contacts (around 30K rows in our dev environment, nearer 70K in production) and then applies various cleansing routines, such as finding all lower- or upper-case names and applying sentence case, stripping out trailing punctuation, etc.
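
For concreteness, the row-level logic involved looks roughly like the Python sketch below (the actual routines are built as data flow transformations; the function and field names here are purely illustrative):

```python
import string

def sentence_case_if_uniform(name: str) -> str:
    """Apply sentence case only when a name is entirely lower or upper case."""
    if name and (name.islower() or name.isupper()):
        return name.capitalize()
    return name

def strip_trailing_punctuation(value: str) -> str:
    """Remove trailing punctuation such as '.', ',' or '-'."""
    return value.rstrip(string.punctuation + " ") if value else value

def clean_contact(contact: dict) -> dict:
    """Apply both routines to the name fields of a contact record."""
    cleaned = dict(contact)
    for field in ("firstname", "lastname"):
        if cleaned.get(field):
            cleaned[field] = strip_trailing_punctuation(
                sentence_case_if_uniform(cleaned[field])
            )
    return cleaned

print(clean_contact({"firstname": "JOHN", "lastname": "smith."}))
# {'firstname': 'John', 'lastname': 'Smith'}
```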

 

In all, there are five routines branched off from the single data source and run in parallel. Running the whole flow in the dev environment seems to take several hours, so I isolated one routine (finding lower-case names) and, after giving the data flow a medium compute size, it completed in around four hours. Is there a more efficient way of running routines like this, or is what I am doing OK and it's just a case of throwing compute resource at it?

 

[Attached screenshot of the data flow: Screenshot 2024-10-08 105810.png]

1 REPLY
Anonymous
Not applicable

Hi @eyeballkid 

 

This forum is for discussing issues related to Fabric Data Factory. For Azure Data Factory issues, please use the following link, where there are more people with the relevant expertise to help you solve the problem:

 

Azure Data Factory - Microsoft Community Hub

 

Here are some possible ideas for your reference:

 

Partitioning: Ensure your data is partitioned effectively. This can significantly reduce the time taken for transformations by distributing the workload across multiple nodes.


Caching: Use the cache transformation to store intermediate results, which can be reused in subsequent transformations, reducing redundant computations.

 

Mapping data flow performance and tuning guide - Azure Data Factory & Azure Synapse | Microsoft Learn


Dynamic Scaling: Adjust the compute size dynamically based on the workload. You can set up triggers to scale up during peak loads and scale down during off-peak times.


Integration Runtime Configuration: Optimize your integration runtime settings to ensure they are appropriately sized for your workload. Placing the integration runtime in the same region as your data sources can also reduce latency.

 

Incremental Processing: Instead of processing all data every time, consider implementing incremental data processing. This way, you only process new or changed data, which can drastically reduce processing time.
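
For example, if you keep a watermark of the last successful run, only rows with a newer modifiedon value need to be pulled from Dataverse. A minimal Python sketch of the idea (the watermark handling and filter string are illustrative, not a specific ADF feature):

```python
from datetime import datetime, timezone

def build_incremental_filter(last_watermark: datetime) -> str:
    """Build an OData-style filter so only contacts changed since the last
    successful run are pulled (assumes the standard 'modifiedon' column)."""
    return f"modifiedon gt {last_watermark.strftime('%Y-%m-%dT%H:%M:%SZ')}"

last_run = datetime(2024, 10, 1, tzinfo=timezone.utc)  # read from your watermark store
print(build_incremental_filter(last_run))              # modifiedon gt 2024-10-01T00:00:00Z
new_watermark = datetime.now(timezone.utc)             # persist after a successful run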


Parallelism: While you are already running routines in parallel, ensure that the parallelism is optimized. Sometimes, too many parallel processes can lead to resource contention.


Debug Mode: Use the debug mode to test and optimize your data flows with smaller datasets before running them on the full dataset. This can help identify bottlenecks and optimize transformations.

 

Databricks: For more complex transformations, consider using Azure Databricks. It integrates well with ADF and can handle large-scale data processing more efficiently.
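
As a rough illustration only (the column names and punctuation pattern are assumptions, not taken from your data flow), the same cleansing could be expressed in PySpark on Databricks like this:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative sample data; in practice this would be the contacts table.
contacts = spark.createDataFrame(
    [("JOHN", "smith."), ("Jane", "Doe")], ["firstname", "lastname"]
)

def cleanse(col_name: str):
    c = F.col(col_name)
    # Sentence-case only values that are entirely lower or upper case...
    fixed = F.when((c == F.lower(c)) | (c == F.upper(c)), F.initcap(c)).otherwise(c)
    # ...then strip trailing punctuation and whitespace.
    return F.regexp_replace(fixed, r"[.,;:\-\s]+$", "")

cleaned = (contacts
           .withColumn("firstname", cleanse("firstname"))
           .withColumn("lastname", cleanse("lastname")))
cleaned.show()  # JOHN -> John, smith. -> Smith
```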

 

Regards,

Nono Chen

If this post helps, then please consider accepting it as the solution to help other members find it more quickly.
