Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Get Fabric certified for FREE! Don't miss your chance! Learn more

Automate fuzzy duplicates and outlier removal and missing value imputation

Dear Microsoft Fabric Community,

I’m a strong advocate of Power Query and the Power BI ecosystem. Its capabilities in shaping, merging, and transforming data are impressive. However, after working extensively with complex, real-world datasets, I’ve encountered some critical limitations when it comes to automated data cleaning — especially at scale.

In one recent use case, my dataset contained:

  • Fuzzy duplicates (e.g., “Tom Clark” vs. “Tom Clarke”)

  • Outliers (e.g., Age = 999, or negative income values)

  • Missing values across dozens of columns

While Power Query could help detect issues, resolving them was highly manual and time-consuming. Key limitations I observed:

  • No native support for dynamic median/mode-based imputation

  • No column-wide outlier handling (e.g., using IQR)

  • Fuzzy duplicate detection is not natively supported

  • Manual replacements must be done per column

When facing hundreds of messy columns, this process quickly becomes unmanageable.

My Solution

To address these challenges, I developed an automated data cleaning layer using R, integrated within Power BI. The solution provides:

  • Outlier detection and handling using IQR logic
  • On-the-fly imputation of missing values:
     – Median for numeric columns
     – Mode for categorical columns
  • Fuzzy duplicate detection and removal using string similarity
  • No hardcoding required — fully dataset-agnostic
  • Runs natively inside Power BI using the R script visual or Power Query R steps
  • Full transparency — changes are logged step by step

This tool doesn’t replace Power Query — it extends its capabilities. It empowers analysts to work efficiently with complex, inconsistent data while maintaining full control and auditability.

Suggested Feature Enhancement

I believe Power BI and Microsoft Fabric could benefit from native support for:

  • Dynamic column-wise missing value imputation (mean/median/mode)

  • Built-in outlier handling logic (e.g., via IQR or z-score)

  • Fuzzy duplicate detection with adjustable similarity thresholds

Such enhancements would dramatically reduce the manual burden on analysts and accelerate time to insight.

I would love to share this tool and discuss how it might align with the vision of Microsoft Fabric and Power BI.

Thank You

 

Screenshot (155).pngScreenshot (156).pngScreenshot (157).pngScreenshot (158).pngScreenshot (159).pngScreenshot (160).png

Status: New