Every dataset hides a story, but not every story is easy to read. If you have ever worked with categorical data, you know exactly what I mean... Categorical variables can feel stubborn - they refuse to fit neatly into models, they don’t behave like numbers, and they often carry hidden meaning that gets lost if handled poorly. Imagine trying to teach a machine learning model the difference between "red," "blue," and "green" without any encoding. To the model, those are just arbitrary labels. It cannot sense the hierarchy, relationships, or importance unless you guide it. And that is where categorical encoding steps in. The way you choose to encode categories can make or break your model’s performance.
What you will learn: In this edition, we will explore one of the most common challenges in feature engineering - how to handle categorical data. I’ll walk you through three different encoding techniques: one-hot encoding, ordinal encoding, and target encoding. Along the way, I’ll show you how each method works, when it makes sense to use it, and how to put it into practice with pandas and scikit-learn. We’ll start simple, then build up to more advanced approaches, so by the time you’re done, you’ll not only know how to transform categories into numbers but also which encoding strategy gives your model the best shot at success.
Read Time: 9 minutes
Source: Sahir Maharaj
The first technique most data professionals learn is one-hot encoding. Think of it as the “safe” choice. It works by creating a new column for each category and marking it with a 1 if the observation belongs to that category, otherwise 0. For example, if you have a column called Color with values ["Red", "Blue", "Green"], one-hot encoding will produce three new columns: Color_Red, Color_Blue, and Color_Green. This ensures that your machine learning model never treats the categories as if they were ranked or ordered. For models like logistic regression or linear regression, one-hot encoding is usually the best starting point.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample data with a single categorical column
df = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Blue", "Red", "Green", "Red"]
})
print("Original DataFrame:")
print(df)

# Option 1: pandas get_dummies creates one indicator column per category
df_onehot = pd.get_dummies(df, columns=["Color"])
print("\nOne-hot encoded using pandas:")
print(df_onehot)

# Option 2: scikit-learn's OneHotEncoder (dense output, keep all categories)
encoder = OneHotEncoder(sparse_output=False, drop=None)
encoded = encoder.fit_transform(df[["Color"]])
df_encoded = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(["Color"]))
print("\nOne-hot encoded using scikit-learn:")
print(df_encoded)

# Attach the encoded columns back to the original DataFrame
df_combined = pd.concat([df, df_encoded], axis=1)
print("\nCombined original + one-hot encoding:")
print(df_combined)
However, one-hot encoding does not come without drawbacks... The biggest issue arises when the number of categories is very large. Imagine encoding city names from around the world. Instead of three neat columns like our Color example, you might suddenly have hundreds of columns. This creates what is often called the curse of dimensionality. Too many columns mean more memory consumption, slower training, and sometimes even worse model performance because the model spreads its learning across too many features.
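To get a feel for how quickly the columns multiply, here is a minimal sketch using a hypothetical City column (the column name and values are purely illustrative, not from a real dataset):

import pandas as pd

# Hypothetical high-cardinality column used only for illustration
df_cities = pd.DataFrame({
    "City": ["Cape Town", "London", "Tokyo", "Nairobi", "London", "Paris", "Tokyo", "Lima"]
})

# Every unique value becomes its own indicator column after one-hot encoding
print("Unique cities:", df_cities["City"].nunique())
df_city_onehot = pd.get_dummies(df_cities, columns=["City"])
print("Columns after one-hot encoding:", df_city_onehot.shape[1])

With only eight rows, this small example already produces six indicator columns; a real-world column with thousands of cities would add thousands of columns in exactly the same way.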
In my own projects, I still find myself reaching for one-hot encoding when I need a reliable baseline. It may not be the most efficient option, but it sets the foundation. From there, I experiment with more advanced methods.
But, not all categorical data is created equal. Sometimes categories carry an inherent order or ranking that one-hot encoding fails to capture. For example, education levels (High School < Bachelor < Master < PhD) or customer satisfaction (Low < Medium < High). Treating these as independent categories using one-hot encoding would ignore the natural progression that exists between them. That is where ordinal encoding becomes useful.
Ordinal encoding assigns each category an integer value based on its order. If we take education as an example, we might assign High School = 1, Bachelor = 2, Master = 3, and PhD = 4 (scikit-learn’s OrdinalEncoder does the same thing but starts counting at 0). This preserves the sense of hierarchy and allows the model to understand that a Master’s degree sits between a Bachelor’s and a PhD. The benefit here is clear: fewer columns and better representation of ordered data. However, the risk comes when you assign incorrect orderings. A wrong sequence can mislead the model and produce poor predictions.
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "Education": ["Bachelor", "PhD", "Master", "High School", "Master", "PhD"]
})
print("Original DataFrame:")
print(df)

# Define the order explicitly so the encoder does not fall back to alphabetical sorting
categories = [["High School", "Bachelor", "Master", "PhD"]]
encoder = OrdinalEncoder(categories=categories)
df["Education_encoded"] = encoder.fit_transform(df[["Education"]])
print("\nOrdinal encoding with custom order:")
print(df)
When I worked on survey datasets, I often noticed analysts defaulted to one-hot encoding even for ordered scales. That was a missed opportunity. By applying ordinal encoding correctly, models could capture subtle shifts in sentiment or preference that would otherwise be lost. But the key was being deliberate: thinking carefully about the order instead of letting the computer decide.
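As a quick sketch of what that deliberateness looks like in practice, here is a hedged example for a satisfaction scale (the column name and ordering are illustrative assumptions, not from a specific survey):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_survey = pd.DataFrame({
    "Satisfaction": ["Medium", "High", "Low", "High", "Medium"]
})

# Spell out the order explicitly instead of letting the encoder sort alphabetically
satisfaction_order = [["Low", "Medium", "High"]]
encoder = OrdinalEncoder(categories=satisfaction_order)
df_survey["Satisfaction_encoded"] = encoder.fit_transform(df_survey[["Satisfaction"]])
print(df_survey)

Alphabetical order would have put High before Low, which is exactly the kind of silent mistake an explicit category list prevents.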
But when you face categorical variables with dozens, hundreds, or even thousands of unique values, neither one-hot encoding nor ordinal encoding feels right. That is where target encoding steps in. Instead of creating new columns or assigning arbitrary ranks, target encoding replaces each category with a statistic derived from the target variable. The most common approach is replacing categories with the mean target value for that category.
Let’s say we are predicting customer churn. If 70% of customers in the “Blue” group churn but only 20% in the “Red” group do, target encoding transforms “Blue” into 0.7 and “Red” into 0.2. This way, the encoding directly carries predictive information about the target. It’s useful for high-cardinality variables like product IDs, postal codes, or user IDs where one-hot encoding would explode a dataset into thousands of columns.
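Before reaching for a library, it helps to see that the unsmoothed version is nothing more than a group mean mapped back onto the column. Here is a minimal pandas-only sketch using the same toy churn data:

import pandas as pd

df = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Blue", "Red", "Green", "Blue", "Red"],
    "Churn": [1, 0, 1, 0, 1, 0, 0, 1]
})

# Mean churn rate per category, mapped back as the encoded value
category_means = df.groupby("Color")["Churn"].mean()
df["Color_mean_encoded"] = df["Color"].map(category_means)
print(df)

The category_encoders library used below wraps the same idea but adds smoothing, so its values will not match these raw means exactly on a tiny dataset.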
%pip install category_encoders
import category_encoders as ce

df = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Blue", "Red", "Green", "Blue", "Red"],
    "Churn": [1, 0, 1, 0, 1, 0, 0, 1]  # Target variable
})
print("Original DataFrame:")
print(df)

# Replace each category with a (smoothed) mean of the target for that category
encoder = ce.TargetEncoder(cols=["Color"])
df["Color_encoded"] = encoder.fit_transform(df["Color"], df["Churn"])
print("\nTarget encoded DataFrame:")
print(df)

# Compare against the raw per-category churn rates
print("\nCategory means:")
print(df.groupby("Color")["Churn"].mean())
In my experience, target encoding can sometimes feel like a magic trick the first time you use it. Suddenly, a variable that seemed unusable because of its sheer size becomes compact and highly predictive. But the trick cuts both ways. If you apply target encoding directly to your training set, you risk overfitting because the model “cheats” by seeing target patterns too early. The fix is to apply it with cross-validation or smoothing.
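Here is a rough sketch of the cross-validation idea: each row is encoded by an encoder that never saw that row’s target value. The fold count and smoothing value below are arbitrary choices for illustration, not a recommendation:

import pandas as pd
import category_encoders as ce
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Blue", "Red", "Green", "Blue", "Red"],
    "Churn": [1, 0, 1, 0, 1, 0, 0, 1]
})

df["Color_oof"] = 0.0
kf = KFold(n_splits=4, shuffle=True, random_state=42)
for train_idx, valid_idx in kf.split(df):
    # Fit only on the training fold, then encode the held-out rows
    encoder = ce.TargetEncoder(cols=["Color"], smoothing=1.0)
    encoder.fit(df.iloc[train_idx]["Color"], df.iloc[train_idx]["Churn"])
    df.loc[df.index[valid_idx], "Color_oof"] = encoder.transform(
        df.iloc[valid_idx]["Color"]
    )["Color"].values

print(df)

By default, a category that happens to be missing from a training fold falls back to the overall target mean, which is one more reason this scheme is more robust than fitting a single encoder on the full training set.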
If you remember one thing from this edition, let it be this: encoding is about giving your model the right perspective on the world. Categorical encoding may seem like a small preprocessing step, but it can have an outsized impact on your machine learning results. One-hot encoding provides safety at the cost of dimensionality. Ordinal encoding captures natural progression but demands careful ordering. Target encoding unlocks high-cardinality data but requires vigilance against overfitting. Each method has its place, and as a data professional, you need to pick the right one for the right context.
So, the next time you open a dataset inside Microsoft Fabric or any Python environment, pause for a moment. Ask yourself: what story are these categories trying to tell? Then choose your encoding wisely, because the quality of your features will determine the strength of your predictions.
Thanks for taking the time to read my post! I’d love to hear what you think and connect with you 🙂