eyeballkid
Frequent Visitor

Data Cleansing Data Flow

Hi,

I am currently looking at writing some data cleansing routines to run over our Dynamics CRM data held in Dataverse. I have written a data flow in Azure Data Factory that queries all contacts (around 30K rows in our dev environment, nearer 70K in production) and then applies various cleansing routines, such as finding all lower- or upper-case names and applying sentence case, stripping out trailing punctuation, etc.
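
For concreteness, the row-level logic involved looks roughly like the Python sketch below (the actual routines are built as data flow transformations; the function and field names here are purely illustrative):

```python
import string

def sentence_case_if_uniform(name: str) -> str:
    """Apply sentence case only when a name is entirely lower or upper case."""
    if name and (name.islower() or name.isupper()):
        return name.capitalize()
    return name

def strip_trailing_punctuation(value: str) -> str:
    """Remove trailing punctuation such as '.', ',' or '-'."""
    return value.rstrip(string.punctuation + " ") if value else value

def clean_contact(contact: dict) -> dict:
    """Apply both routines to the name fields of a contact record."""
    cleaned = dict(contact)
    for field in ("firstname", "lastname"):
        if cleaned.get(field):
            cleaned[field] = strip_trailing_punctuation(
                sentence_case_if_uniform(cleaned[field])
            )
    return cleaned

print(clean_contact({"firstname": "JOHN", "lastname": "smith."}))
# {'firstname': 'John', 'lastname': 'Smith'}
```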

 

In all, there are five routines branched off from the single data source and run in parallel. Running the whole flow in the dev environment seems to take several hours, so I isolated one routine (finding lower-case names) and, after giving the data flow a medium compute size, it completed in around four hours. Is there a more efficient way of running routines like this, or is what I am doing OK and it's just a case of throwing compute resource at it?

 

[Attached screenshot of the data flow: Screenshot 2024-10-08 105810.png]

1 REPLY
Anonymous
Not applicable

Hi @eyeballkid 

 

This forum is for discussing issues related to Fabric Data Factory. For Azure Data Factory issues, please use the following link, where there are more people with the relevant expertise to help you solve the problem:

 

Azure Data Factory - Microsoft Community Hub

 

Here are some possible ideas for your reference:

 

Partitioning: Ensure your data is partitioned effectively. This can significantly reduce the time taken for transformations by distributing the workload across multiple nodes.


Caching: Use the cache transformation to store intermediate results, which can be reused in subsequent transformations, reducing redundant computations.

 

Mapping data flow performance and tuning guide - Azure Data Factory & Azure Synapse | Microsoft Learn


Dynamic Scaling: Adjust the compute size dynamically based on the workload. You can set up triggers to scale up during peak loads and scale down during off-peak times.


Integration Runtime Configuration: Optimize your integration runtime settings to ensure they are appropriately sized for your workload. Placing the integration runtime in the same region as your data sources can also reduce latency.

 

Incremental Processing: Instead of processing all data every time, consider implementing incremental data processing. This way, you only process new or changed data, which can drastically reduce processing time.
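
For example, if you keep a watermark of the last successful run, only rows with a newer modifiedon value need to be pulled from Dataverse. A minimal Python sketch of the idea (the watermark handling and filter string are illustrative, not a specific ADF feature):

```python
from datetime import datetime, timezone

def build_incremental_filter(last_watermark: datetime) -> str:
    """Build an OData-style filter so only contacts changed since the last
    successful run are pulled (assumes the standard 'modifiedon' column)."""
    return f"modifiedon gt {last_watermark.strftime('%Y-%m-%dT%H:%M:%SZ')}"

last_run = datetime(2024, 10, 1, tzinfo=timezone.utc)  # read from your watermark store
print(build_incremental_filter(last_run))              # modifiedon gt 2024-10-01T00:00:00Z
new_watermark = datetime.now(timezone.utc)             # persist after a successful run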


Parallelism: While you are already running routines in parallel, ensure that the parallelism is optimized. Sometimes, too many parallel processes can lead to resource contention.


Debug Mode: Use the debug mode to test and optimize your data flows with smaller datasets before running them on the full dataset. This can help identify bottlenecks and optimize transformations.

 

Databricks: For more complex transformations, consider using Azure Databricks. It integrates well with ADF and can handle large-scale data processing more efficiently.
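
As a rough illustration only (the column names and punctuation pattern are assumptions, not taken from your data flow), the same cleansing could be expressed in PySpark on Databricks like this:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative sample data; in practice this would be the contacts table.
contacts = spark.createDataFrame(
    [("JOHN", "smith."), ("Jane", "Doe")], ["firstname", "lastname"]
)

def cleanse(col_name: str):
    c = F.col(col_name)
    # Sentence-case only values that are entirely lower or upper case...
    fixed = F.when((c == F.lower(c)) | (c == F.upper(c)), F.initcap(c)).otherwise(c)
    # ...then strip trailing punctuation and whitespace.
    return F.regexp_replace(fixed, r"[.,;:\-\s]+$", "")

cleaned = (contacts
           .withColumn("firstname", cleanse("firstname"))
           .withColumn("lastname", cleanse("lastname")))
cleaned.show()  # JOHN -> John, smith. -> Smith
```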

 

Regards,

Nono Chen

If this post helps, then please consider accepting it as the solution to help other members find it more quickly.
