eyeballkid
Frequent Visitor

Data Cleansing Data Flow

Hi,

I am currently looking at writing some data cleansing routines to run over our Dynamics CRM data held in Dataverse. I have written a data flow in Azure Data Factory that queries for all contacts (around 30K rows in our dev environment, nearer 70K in production) and then applies various cleansing routines, such as finding all-lower/upper-case names and applying sentence case, stripping out trailing punctuation, etc.
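For illustration only, here is roughly what two of those routines might look like in plain Python (a sketch of the logic; the actual routines live in data flow transformations, and names like sentence_case are mine):

```python
import string

def sentence_case(name: str) -> str:
    """Sentence-case a name only if it is entirely lower or upper case."""
    if name.islower() or name.isupper():
        # str.title() is a simplification: it turns "JOHN SMITH" into
        # "John Smith", but would also produce "Mcdonald" from "MCDONALD".
        return name.title()
    return name

def strip_trailing_punctuation(name: str) -> str:
    """Remove punctuation (and stray spaces) from the end of a name."""
    return name.rstrip(string.punctuation + " ")

print(sentence_case("JOHN SMITH"))            # John Smith
print(strip_trailing_punctuation("Smith.,"))  # Smith
```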

 

In all there are five routines that I have branched off from the one data source, and they run in parallel. Running the flow in the dev environment seems to take several hours, so I isolated one routine (finding lower-case names); after giving the data flow a medium compute size, it completed in around four hours. Is there a more efficient way of running routines like this, or is what I am doing OK and it's just a case of throwing compute resource at it?

 

[Attachment: Screenshot 2024-10-08 105810.png]

1 REPLY
Anonymous
Not applicable

Hi @eyeballkid 

 

This forum is for discussing issues related to Fabric Data Factory. For Azure issues, you can go to the following link, where there are more specialists who can help you solve the problem:

 

Azure Data Factory - Microsoft Community Hub

 

Here are some possible ideas for your reference:

 

Partitioning: Ensure your data is partitioned effectively. This can significantly reduce the time taken for transformations by distributing the workload across multiple nodes.
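Since ADF mapping data flows execute on Spark, partitioning behaves much like Spark repartitioning (in the data flow UI this is the Optimize tab on a transformation). A rough PySpark analogy, where the source path and the contactid key column are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()

# Hypothetical contacts extract standing in for the Dataverse source.
contacts = spark.read.parquet("/tmp/contacts")

# Hash-partition on a high-cardinality key so the ~70K rows are spread
# evenly across workers and each cleansing step runs per-partition.
contacts = contacts.repartition(8, "contactid")
```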


Caching: Use the cache transformation to store intermediate results, which can be reused in subsequent transformations, reducing redundant computations.
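As a rough analogy, the cache transformation plays the role of persisting a shared intermediate result so the five branches do not each recompute the source read. In PySpark terms (a sketch; paths and column names are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()
contacts = spark.read.parquet("/tmp/contacts")  # hypothetical extract

# Persist the shared source once so each branch reuses it rather than
# re-reading from Dataverse.
contacts = contacts.cache()

lower_names = contacts.filter(F.col("lastname") == F.lower(F.col("lastname")))
trailing_punct = contacts.filter(F.col("lastname").rlike(r"[\.,;:!\?]$"))
```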

 

Mapping data flow performance and tuning guide - Azure Data Factory & Azure Synapse | Microsoft Learn


Dynamic Scaling: Adjust the compute size dynamically based on the workload. You can set up triggers to scale up during peak loads and scale down during off-peak times.


Integration Runtime Configuration: Optimize your integration runtime settings to ensure they are appropriately sized for your workload. Placing the integration runtime in the same region as your data sources can also reduce latency.

 

Incremental Processing: Instead of processing all data every time, consider implementing incremental data processing. This way, you only process new or changed data, which can drastically reduce processing time.
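For example, a minimal watermark pattern in Python, assuming you persist the last successful run time somewhere and filter the Dataverse source on its modifiedon attribute (the file-based store here is purely illustrative):

```python
from datetime import datetime, timezone

def load_watermark(path: str) -> datetime:
    """Read the timestamp of the last successful run (stored as ISO 8601)."""
    try:
        with open(path) as f:
            return datetime.fromisoformat(f.read().strip())
    except FileNotFoundError:
        # First run: process everything.
        return datetime.min.replace(tzinfo=timezone.utc)

def save_watermark(path: str, ts: datetime) -> None:
    with open(path, "w") as f:
        f.write(ts.isoformat())

# In the pipeline, restrict the source query to rows changed since the
# last run, e.g. via an OData filter against Dataverse.
watermark = load_watermark("watermark.txt")
odata_filter = f"modifiedon gt {watermark.isoformat()}"
```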


Parallelism: While you are already running routines in parallel, ensure that the parallelism is optimized. Sometimes, too many parallel processes can lead to resource contention.


Debug Mode: Use the debug mode to test and optimize your data flows with smaller datasets before running them on the full dataset. This can help identify bottlenecks and optimize transformations.

 

Databricks: For more complex transformations, consider using Azure Databricks. It integrates well with ADF and can handle large-scale data processing more efficiently.
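If you do take this route (or even just as a way to think about restructuring the existing flow), the five routines can usually be folded into one pass over the data rather than five parallel branches, so the source is only scanned once. A hedged PySpark sketch, assuming name columns firstname and lastname:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql import Column

spark = SparkSession.builder.appName("contact-cleansing").getOrCreate()
contacts = spark.read.parquet("/tmp/contacts")  # hypothetical extract

def cleanse(col: Column) -> Column:
    # Sentence-case only fully lower/upper values, then strip trailing punctuation.
    fixed = F.when(
        (col == F.lower(col)) | (col == F.upper(col)), F.initcap(col)
    ).otherwise(col)
    return F.regexp_replace(fixed, r"[\s\.,;:!\?]+$", "")

cleaned = (
    contacts
    .withColumn("firstname", cleanse(F.col("firstname")))
    .withColumn("lastname", cleanse(F.col("lastname")))
)
```

The same consolidation idea applies within a single ADF data flow: one derived-column transformation with conditional expressions can often replace several parallel branches that each carry the full row set.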

 

Regards,

Nono Chen

If this post helps, then please consider accepting it as the solution to help other members find it more quickly.
