Sahir_Maharaj

The Ultimate Guide to Data Imputation for Data Science in Microsoft Fabric

Every data professional has been there... opening a dataset that looks promising, only to be greeted by the dreaded NaN values scattered like potholes on a road you thought was smooth. They’re small, they’re easy to ignore at first glance, but they can derail your entire analysis if left untreated. Missing data doesn’t just make your visualizations look incomplete - it can bias your models, break your pipelines, and reduce stakeholder trust.

 

I’ve seen situations where a high-performing model suddenly tanked in production simply because the missing values weren’t handled correctly during deployment. And if you’re working in environments where decisions are worth millions, you can’t afford to treat imputation as an afterthought.

 

What you will learn: In this edition, we’re exploring how to fill in those gaps in your dataset without losing its integrity. By the time you’re through, you’ll know exactly how to handle missing values using sklearn.impute for quick, reliable fixes, fancyimpute for more advanced, context-aware approaches, and KNNImputer when similarity-based estimates make the most sense. You’ll learn when each technique shines, when it’s best to avoid them, and how to put them into action in Python using a Microsoft Fabric notebook.

 

Read Time: 9 minutes

 

Source: Sahir Maharaj

 

Before we even start with libraries and code, let’s pause and think about why imputation matters. In the simplest terms, imputation is the process of filling in missing values with substituted values so your dataset remains usable. This might sound straightforward, but in reality, it’s about making informed guesses - guesses that preserve the integrity of your dataset. A poor choice here could introduce bias, while a smart one can improve model accuracy and stability.

 

%pip install fancyimpute==0.7.0 cvxpy==1.7.1 cvxopt==1.3.2 osqp==1.0.4 scs==3.2.8 clarabel==0.11.1 ecos==2.0.13

 

For example, imagine you’re building a credit risk model. Some applicants are missing income data. Simply filling in the average income might work in a pinch, but if high-income applicants are systematically missing this field, you’ll skew your model toward underestimating risk. That’s why imputation isn’t just a technical step - it’s a strategic one. When you understand that missing values can be Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR), you start to see why different techniques exist. You can’t treat a random missing value the same way you’d treat a missing value that hides a deeper pattern. This is where the libraries we’re about to discuss start to shine.
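
To make that concrete, here’s a minimal sketch of the kind of quick check I’d run before choosing a technique. The applicants DataFrame and its credit_score and income columns are made up purely for illustration - the idea is simply to compare an observed feature across rows where a value is present versus missing. If the two groups look very different, you’re probably not dealing with MCAR.

import pandas as pd
import numpy as np

# Hypothetical applicant data, just to illustrate a quick missingness check
applicants = pd.DataFrame({
    "credit_score": [720, 580, 650, 800, 690, 610, 770, 560],
    "income":       [85000, 32000, np.nan, np.nan, 54000, 41000, np.nan, 38000]
})

# Overall missing rate for income
print("Income missing rate:", applicants["income"].isna().mean())

# Compare an observed feature across rows where income is present vs. missing.
# A large gap suggests the data is MAR (or worse, MNAR) rather than MCAR.
print(applicants.groupby(applicants["income"].isna())["credit_score"].mean())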

 

Source: Sahir Maharaj

 

Let’s start simple. The sklearn.impute module in scikit-learn is like that dependable colleague who doesn’t overcomplicate things but always delivers. It gives you straightforward strategies: mean, median, most frequent, and constant. These are quick to implement, easy to explain to stakeholders, and often good enough for many practical cases. I often use SimpleImputer as a first pass when exploring a dataset for the first time - it gets me to a baseline quickly without locking me into a complex approach too early.

 

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

raw_df = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 105, 106],
    "age":         [29, 34, np.nan, 41, np.nan, 32],
    "salary":      [52000, np.nan, 61000, np.nan, 73000, 68000],
    "department":  ["Sales", "Finance", "Sales", np.nan, "Tech", "Finance"]
})

print("Original data with missing values:")
print(raw_df)
print("\nMissingness summary:\n", raw_df.isna().sum())

 

We’ll use the same tiny employee dataset throughout: columns for age, salary, and a categorical department, each with some missing values so you can actually see the methods at work. First, I create the dataset and take a quick look at the missingness so I know what I’m dealing with. Then I apply a numeric imputer with the mean strategy and a categorical imputer with most_frequent. This gives a clean, predictable baseline that works well when your goal is “get moving fast” rather than “capture every dependency.”

 

num_imputer = SimpleImputer(strategy="mean")
cat_imputer = SimpleImputer(strategy="most_frequent")

df = raw_df.copy()
df[["age", "salary"]] = num_imputer.fit_transform(df[["age", "salary"]])
df[["department"]]    = cat_imputer.fit_transform(df[["department"]])

print("\nAfter SimpleImputer (mean for numeric, mode for categorical):")
print(df)

 

Once that baseline is in place, I usually wrap the logic in a ColumnTransformer + Pipeline so it’s production-friendly and easy to reuse. This also lets you add things like scaling for downstream models without changing your imputation logic. Notice how I separate numeric and categorical features, apply the right imputer to each, then put everything back together consistently. It’s tidy, explainable, and easy to lift into model training.

 

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

numeric_features = ["age", "salary"]
categorical_features = ["department"]

numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scale",   StandardScaler())  # optional, but handy if you’ll model later
])

categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent"))
    # (you could add OneHotEncoder here later if you’re modeling)
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features)
], remainder="drop")

processed = preprocessor.fit_transform(raw_df)
processed_df = pd.DataFrame(
    processed,
    columns=[f"num__{c}" for c in numeric_features] + [f"cat__{c}" for c in categorical_features]
)

print("\nColumnTransformer output (imputed + scaled numerics, imputed categoricals):")
print(processed_df)
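
If you do go on to model, the same pieces slot straight into a training pipeline. The sketch below is illustrative only - the left_company target and the logistic regression are stand-ins I’ve made up for this example, and I add a OneHotEncoder after the categorical imputer so the classifier gets numeric inputs.

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical target for illustration: did the employee leave?
y = pd.Series([0, 1, 0, 0, 1, 0], name="left_company")

# Reuse the numeric pipeline; add one-hot encoding after the categorical imputer
categorical_pipeline_model = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encode",  OneHotEncoder(handle_unknown="ignore"))
])

model_pipeline = Pipeline(steps=[
    ("preprocess", ColumnTransformer(transformers=[
        ("num", numeric_pipeline, numeric_features),
        ("cat", categorical_pipeline_model, categorical_features)
    ], remainder="drop")),
    ("clf", LogisticRegression(max_iter=1000))
])

model_pipeline.fit(raw_df, y)
print("Training accuracy (toy example):", model_pipeline.score(raw_df, y))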

 

Now that the baseline works, it’s time to respect deeper relationships. fancyimpute provides MICE/IterativeImputer-style approaches that model each feature with missing values as a function of the others. When correlations matter (say, salary tends to rise with age) these approaches often produce more realistic fills than a flat statistic. In practice, I keep categoricals imputed simply (mode) and let the iterative method handle the numeric interplay.

 

Source: Sahir Maharaj

 

Below, I start from a fresh copy of the raw data so the numeric columns still contain their missing values (and you can literally watch the iterative approach work), then apply IterativeImputer from fancyimpute to age and salary only. I keep department imputed via mode so we don’t have to encode categoricals for the numeric-only modeling step. The key point: the imputer treats each numeric column with NaNs as a target in turn and fits a small model using the other features as predictors.

 

from fancyimpute import IterativeImputer

df_fancy = raw_df.copy()

df_fancy[["department"]] = SimpleImputer(strategy="most_frequent").fit_transform(df_fancy[["department"]])

iter_imputer = IterativeImputer(max_iter=15, random_state=42)
df_fancy[["age", "salary"]] = iter_imputer.fit_transform(df_fancy[["age", "salary"]])

print("\nAfter fancyimpute.IterativeImputer on numeric features:")
print(df_fancy)

 

To make this a bit more realistic, I sometimes engineer a simple numeric helper (e.g., an estimated experience_years derived from age) to give the imputer a stronger signal. You’ll see how adding a correlated feature can stabilize the imputed values.

 

Source: Sahir Maharaj

 

Then I run the iterative imputer again and compare the numeric distributions before and after to sanity-check that the filled values look plausible (not outliers or collapsed onto a single mean).

 

df_fancy2 = raw_df.copy()
df_fancy2["experience_years"] = df_fancy2["age"] - 22  # crude proxy; may introduce NaN if age is NaN

df_fancy2[["department"]] = SimpleImputer(strategy="most_frequent").fit_transform(df_fancy2[["department"]])

numeric_cols = ["age", "salary", "experience_years"]
iter_imputer2 = IterativeImputer(max_iter=20, random_state=7)
df_fancy2[numeric_cols] = iter_imputer2.fit_transform(df_fancy2[numeric_cols])

print("\nIterativeImputer with an engineered helper feature:")
print(df_fancy2)

print("\nNumeric summary after iterative imputation:")
print(df_fancy2[numeric_cols].describe())
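
One minimal way to do that before-and-after comparison, using the raw_df and df_fancy2 frames from the cells above:

# Side-by-side summary of the original (with NaNs) vs. the imputed numeric columns.
# The imputed means should stay close to the observed ones, and the spread
# should not collapse toward a single value.
comparison = pd.concat(
    {
        "before": raw_df[["age", "salary"]].describe(),
        "after":  df_fancy2[["age", "salary"]].describe()
    },
    axis=1
)
print(comparison)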

 

KNNImputer fills a missing value for a row by looking at the k most similar rows and averaging their values (for numeric features). This is great when your data naturally clusters. The trick here is that KNN is distance-based, so scaling and feature selection matter. I usually impute/encode categoricals separately (or turn them into dummies) so KNN has clean numeric inputs.

 

Source: Sahir Maharaj

 

Here, I impute the categoricals, scale the numerics, and then run KNNImputer on the numeric subset. You’ll notice I explicitly reintroduce NaNs so the transformation is easy to see. This structure mirrors how I’d do it in a pipeline: deterministic steps for categoricals, then KNN for the numerics that benefit from neighborhood information.

 

from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

df_knn = raw_df.copy()

df_knn[["department"]] = SimpleImputer(strategy="most_frequent").fit_transform(df_knn[["department"]])

# these cells are already NaN in raw_df; setting them again just makes the demo explicit
df_knn.loc[1, "salary"] = np.nan
df_knn.loc[4, "age"]    = np.nan

scale = StandardScaler()
num_cols = ["age", "salary"]
scaled_vals = scale.fit_transform(df_knn[num_cols])

knn_imp = KNNImputer(n_neighbors=2, weights="uniform")
scaled_imputed = knn_imp.fit_transform(scaled_vals)

df_knn[num_cols] = scale.inverse_transform(scaled_imputed)

print("\nAfter KNNImputer (k=2) on scaled numeric features:")
print(df_knn)

 

Sometimes I’ll tune k with a simple “missingness simulation”: hide a small, random subset of known values, impute with various k values, and calculate an error (e.g., RMSE) between the imputed and the true values to see which k behaves best on this dataset. It’s not perfect, but it’s a practical way to pick a reasonable k without guesswork.

 

from sklearn.metrics import mean_squared_error

def pick_best_k(base_df, target_col="salary", k_values=(2, 3, 4, 5), seed=123):
    rng = np.random.default_rng(seed)
    df_tmp = base_df.copy()

    observed_idx = df_tmp[target_col].dropna().index
    mask_size = max(1, int(0.3 * len(observed_idx)))
    masked_idx = rng.choice(observed_idx, size=mask_size, replace=False)

    y_true = df_tmp.loc[masked_idx, target_col].copy()
    df_tmp.loc[masked_idx, target_col] = np.nan

    df_tmp[["department"]] = SimpleImputer(strategy="most_frequent").fit_transform(df_tmp[["department"]])

    scaler = StandardScaler()
    X_num = df_tmp[["age", "salary"]].values
    X_scaled = scaler.fit_transform(X_num)

    best_k, best_rmse = None, float("inf")
    for k in k_values:
        knn = KNNImputer(n_neighbors=k)
        X_imp_scaled = knn.fit_transform(X_scaled)
        X_imp = scaler.inverse_transform(X_imp_scaled)
        df_imp = df_tmp.copy()
        df_imp[["age", "salary"]] = X_imp

        # np.sqrt keeps this working across sklearn versions (the squared= argument was removed in newer releases)
        rmse = np.sqrt(mean_squared_error(y_true, df_imp.loc[masked_idx, target_col]))
        print(f"k={k} -> RMSE on masked {target_col}: {rmse:.2f}")
        if rmse < best_rmse:
            best_rmse, best_k = rmse, k

    return best_k, best_rmse

best_k, best_rmse = pick_best_k(raw_df, target_col="salary", k_values=(2,3,4,5,6))
print(f"\nBest k based on simple masking test: k={best_k} (RMSE={best_rmse:.2f})")

 

You’ve now got tools that can turn a dataset into something you can actually trust! The real power here isn’t just knowing the methods... it’s knowing when to use them.

 

Source: Sahir Maharaj

 

Your next project might not need the fanciest solution, but it will need a thoughtful one. The more you practice, the faster you’ll start seeing patterns in your data and picking the right approach without second-guessing yourself. So don’t let those NaNs sit there another day. Open up your notebook, try these techniques, and see how much more confident you feel about your results. Go ahead and make it happen!

 

Thanks for taking the time to read my post! I’d love to hear what you think and connect with you 🙂