You know that feeling when you’re working with a dataset, and something just doesn’t look right? Maybe a single value is absurdly higher than the rest, or a handful of rows look like they came from another planet entirely. Outliers are like those people in a meeting who derail the conversation - they don’t always mean sabotage, but they can throw off the whole flow if you’re not paying attention.
As a data professional, you've probably been there: a model underperforms, an average looks suspicious, or a chart has one lonely point sitting miles away from the others. That's the quiet influence of outliers at work. Ignore them, and you risk skewed results. Handle them recklessly, and you might erase valuable information. The challenge (or, as I like to call it, the art) is knowing how to detect them, decide if they matter, and handle them in a way that strengthens your analysis rather than weakens it.
What you will learn: In this edition, you will learn how to detect and handle outliers using NumPy, SciPy, and Seaborn inside Microsoft Fabric's Python environment. By the time you're done, you'll know how to spot unusual data points using statistical methods, confirm them visually with clear, informative plots, and decide whether to remove, transform, or cap them based on context. And because finding outliers is only half the story, you'll also learn how to build the instinct to know when those "odd" values are actually your most valuable insights.
Read Time: 9 minutes
The most effective way to understand outliers is to start by creating your own dataset so you can control what’s in it. When I’m teaching this in a 1:1 session, I usually use synthetic data because it removes the complexity of cleaning and wrangling a real dataset. It allows us to focus purely on the detection process without distractions.
import numpy as np
from scipy import stats

# Generate 1,000 values from a normal distribution (mean 50, std 10),
# then append a handful of deliberately extreme values as known outliers.
np.random.seed(42)
data = np.random.normal(50, 10, 1000)
outliers_to_add = [150, 200, -50, 120, -30]
data = np.append(data, outliers_to_add)
print(data[:15])
With the dataset ready, the first method we’ll use is the Z-score. This is a statistical way of measuring how far a data point is from the mean in terms of standard deviations. If a value has a Z-score of 0, it’s exactly at the mean; a Z-score of 2 means it’s two standard deviations away. In many contexts, values with an absolute Z-score greater than 3 are considered outliers, though this threshold can be adjusted depending on the sensitivity you need.
I like this method because it’s simple, quick, and easy to understand - you can explain it to someone who’s never worked with statistics before, and they’ll grasp the concept right away. However, it’s important to remember that it works best when your data roughly follows a normal distribution.
# Flag values whose absolute Z-score exceeds the chosen threshold (here, 3).
z_scores = np.abs(stats.zscore(data))
threshold = 3
outlier_indices_z = np.where(z_scores > threshold)
print(data[outlier_indices_z])
Of course, in the real world, your data often doesn’t behave nicely. It might be skewed, have long tails, or be full of natural clusters that make the Z-score less reliable. That’s when the Interquartile Range (IQR) method comes in. This approach looks at the spread of the middle 50% of your data and flags anything that falls too far outside that range.
It doesn’t assume your data is normal, which is why I often use it alongside Z-scores. I’ve seen cases where the Z-score flags too many points as outliers simply because the distribution was skewed - but the IQR method highlighted only the ones that were truly extreme. Combining both methods gives you a more balanced perspective.
# IQR method: measure the spread of the middle 50% of the data, then flag
# anything more than 1.5 * IQR below Q1 or above Q3.
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outlier_indices_iqr = np.where((data < lower_bound) | (data > upper_bound))
print(data[outlier_indices_iqr])
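Before moving on, it helps to see how much the two methods actually agree on this dataset. Here's a quick, illustrative sketch (the helper names z_flagged, iqr_flagged, and flagged_by_both are just for this example) that intersects the index sets each method produced:
# Illustrative comparison: intersect the indices flagged by each method.
z_flagged = set(outlier_indices_z[0])
iqr_flagged = set(outlier_indices_iqr[0])
flagged_by_both = sorted(z_flagged & iqr_flagged)
flagged_by_one = sorted(z_flagged ^ iqr_flagged)
print(f"Flagged by both methods: {len(flagged_by_both)}")
print(f"Flagged by only one method: {len(flagged_by_one)}")
print(data[flagged_by_both])
Points flagged by both methods are strong candidates for a closer look, while points flagged by only one are often the borderline cases worth checking visually.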
Having completed the numerical checks, it’s now time to see how those results hold up visually. Numbers are powerful, but our eyes are exceptional at spotting patterns that raw statistics can sometimes miss. A visualisation can instantly reveal whether your detected outliers are genuinely extreme or just appear that way due to the shape of your data. Seaborn’s boxplot is my go-to for this because it shows the spread, median, quartiles, and any points that lie outside the whiskers, which are interpreted as outliers.
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 4))
sns.boxplot(x=data)
plt.title("Boxplot of Data with Outliers Highlighted")
plt.xlabel("Values")
plt.show()
When you view a boxplot, those individual dots above or below the whiskers represent the potential outliers. In my own analysis work, I’ve had moments where the statistical methods highlighted dozens of points as outliers, but when plotted, I could see that many of them were only slightly beyond the whiskers and not practically significant.
This is why pairing visual analysis with statistical methods is so powerful - you’re combining objective calculations with human judgment. And sometimes, the opposite happens: the visualisation draws your attention to a group of points far beyond the range, making you realise the problem is much more severe than you thought. If you have multiple variables, creating side-by-side boxplots can give you an immediate comparison.
import pandas as pd
df = pd.DataFrame({
    "Feature_A": np.append(np.random.normal(60, 12, 200), [300, 320, 350]),
    "Feature_B": np.append(np.random.normal(40, 8, 200), [-100, 150, 160])
})
plt.figure(figsize=(10, 5))
sns.boxplot(data=df)
plt.title("Multiple Features with Outliers")
plt.ylabel("Values")
plt.show()
Another powerful use of boxplots is grouping by categories. Let’s say you’re analysing restaurant bills by day of the week. If you plot them all together, you might miss that outliers occur more frequently on weekends. But when you break them down by category (e.g., Monday, Tuesday, Wednesday, etc.), the pattern becomes obvious. Grouped boxplots can reveal whether certain categories naturally have more variability... and that’s crucial because not every outlier is a problem (some might be a normal part of a particular group’s behaviour).
tips = sns.load_dataset("tips")
plt.figure(figsize=(8, 5))
sns.boxplot(x="day", y="total_bill", data=tips)
plt.title("Total Bill Distribution by Day with Outliers")
plt.show()
At this stage, you’ve spotted the outliers. The next challenge is deciding what to do with them. In my own projects, I’ve learned that the context of the dataset and the business problem matters far more than simply following a rigid rule. A point that looks extreme in one dataset might be perfectly reasonable in another.
One of the simplest approaches is removal. This can be appropriate if you know the outlier is an error, such as a negative quantity where one is impossible, or a duplicate entry with inflated numbers. But deletion has consequences - you're losing data, and if the dataset is small, every record matters. I've been in situations where removing even a few outliers significantly changed the results of an analysis, so I always check the impact before deciding to drop them.
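To make that concrete, here's a minimal sketch of removal using the IQR bounds computed earlier (the mask and data_removed names are just illustrative). Comparing the summary statistics before and after is a quick way to see how much the result shifts:
# Sketch: keep only values inside the IQR bounds, then compare the impact.
mask = (data >= lower_bound) & (data <= upper_bound)
data_removed = data[mask]
print(f"Rows before: {data.size}, after: {data_removed.size}")
print(f"Mean before: {data.mean():.2f}, after: {data_removed.mean():.2f}")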
Another option is transformation. When you apply a transformation like a logarithm, you compress large values, bringing them closer to the bulk of the data without removing them.
I find this especially useful when working with financial data, where differences between values can be huge, but you still want to preserve the relative relationships. A transformation keeps the data intact but reduces the dominance of extreme values in statistical calculations.
# Signed log transform: log1p compresses large magnitudes while np.sign
# preserves each value's direction, so negative values stay negative.
data_transformed = np.log1p(np.abs(data)) * np.sign(data)
plt.figure(figsize=(8, 4))
sns.boxplot(x=data_transformed)
plt.title("Boxplot After Log Transformation")
plt.show()
Finally, there’s capping (also known as Winsorizing). This involves setting a maximum and minimum threshold, and replacing any values beyond those limits with the threshold value itself.
It’s a way of softening the impact of outliers while still keeping every record in the dataset. This can be helpful when your stakeholders want a stable view of the data without big jumps caused by a handful of extreme points.
# Cap (Winsorize) at the 1st and 99th percentiles: values beyond these
# limits are replaced with the limit itself.
lower_cap = np.percentile(data, 1)
upper_cap = np.percentile(data, 99)
data_capped = np.clip(data, lower_cap, upper_cap)
plt.figure(figsize=(8, 4))
sns.boxplot(x=data_capped)
plt.title("Boxplot After Capping Outliers")
plt.show()
And, there you have it! Sometimes the “odd” points are where the most important stories are hiding; other times, they’re the noise you need to remove to see the real picture. The most valuable lesson I’ve learned is that outlier handling isn’t a single fixed process - it’s a conversation with your dataset. Every project, every dataset, and every business problem is different, and your approach should adapt accordingly.
The more you practice this workflow, the sharper your instincts will become. So open a notebook, run through these steps, and experiment with different detection thresholds and handling strategies. The next time you see that one point sitting far away from the rest in a plot, you’ll know exactly how to decide whether it’s a problem to solve or a story to tell.
Thanks for taking the time to read my post! I’d love to hear what you think and connect with you 🙂