Sahir_Maharaj

Advanced Anomaly Detection for Data Science in Microsoft Fabric

Think about the last time you were sifting through data and something just felt… off. A spike you didn’t anticipate, a drop that didn’t align, or a handful of records that simply didn’t belong. You likely paused and thought: “What’s going on here?” As data professionals, you know that spotting those anomalies early can make the difference between mitigating risk and missing a critical warning sign. Whether it’s an unexpected revenue dip, a security breach waiting to happen, or sensor readings in manufacturing that signal equipment failure - anomalies are often the early signals of deeper issues. But how do you go from “hmm that looks weird” to a structured, repeatable anomaly-detection pipeline? That’s what I’m going to walk you through in this edition.

 

What you will learn: In this edition, we’re exploring how to detect the unusual, the unexpected, and the truly interesting moments hidden in your data using anomaly detection techniques. By the time you’re done, you’ll understand what makes certain data points stand out, how to identify them using Python, and how to visualize those findings in ways that actually make sense to your audience.

 

Read Time: 8 minutes

 

Source: Sahir Maharaj (https://sahirmaharaj.com)

 

At its core, anomaly detection is about understanding the rhythm of your data and knowing when that rhythm changes. Every dataset tells a story, but not all stories unfold predictably. Some days your metrics glide along in a familiar pattern, and then suddenly, a few data points behave differently. They drift away from the norm, and that’s when the real curiosity begins. Those deviations, often small and easy to overlook, can hold immense value.

 

When I first began working with anomaly detection, I realized that “normal” is a deceptively simple word. What’s normal today might be abnormal tomorrow. In one project, a jump in customer support tickets looked alarming... until we learned it aligned perfectly with a product update rollout. In another case, a subtle dip in sensor readings preceded a hardware fault by weeks. The challenge wasn’t detecting change, but interpreting it correctly.

 

There are three main ways anomalies tend to appear. Point anomalies are single data points that stand out sharply from others, like a customer who suddenly makes an unusually large purchase. Contextual anomalies depend on surrounding conditions - a warm winter day might be typical in one region but strange in another. And collective anomalies emerge only when several points together form a pattern that doesn’t fit historical behavior, such as a cluster of failed transactions occurring within seconds of each other. Understanding these categories is important because each one calls for a different analytical approach.
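To make the three categories concrete, here is a minimal sketch with one toy check per type. The data, column names, and cut-offs (5x the median, 10 degrees, 3 failures in 10 seconds) are all made up for illustration; they are assumptions, not recommended defaults.

```python
import pandas as pd

# Point anomaly: a single purchase far larger than the rest.
purchases = pd.Series([42.0, 38.5, 51.0, 45.2, 40.1, 47.3, 39.9, 980.0])
point_flags = purchases > 5 * purchases.median()  # crude illustrative rule

# Contextual anomaly: the same temperature judged against its own region.
temps = pd.DataFrame({
    "region": ["north"] * 5 + ["south"] * 5,
    "temp_c": [-5.0, -3.0, -4.0, -6.0, 18.0, 17.0, 19.0, 18.0, 16.0, 18.0],
})
region_median = temps.groupby("region")["temp_c"].transform("median")
temps["contextual_flag"] = (temps["temp_c"] - region_median).abs() > 10

# Collective anomaly: failures that are unremarkable alone but suspicious as a burst.
failures = pd.Series(
    1,
    index=pd.to_datetime([
        "2025-01-01 09:00:00",
        "2025-01-01 11:30:00",
        "2025-01-01 14:00:01",
        "2025-01-01 14:00:03",
        "2025-01-01 14:00:04",
        "2025-01-01 14:00:06",
        "2025-01-01 17:45:00",
    ]),
)
burst_count = failures.rolling("10s").sum()  # failures in the trailing 10 seconds
collective_flags = burst_count >= 3

print(purchases[point_flags])
print(temps[temps["contextual_flag"]])
print(failures[collective_flags])
```

Notice that 18 degrees is flagged only in the north, and that no single failed transaction is unusual on its own; only the burst around 14:00 is.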

 


 

What I’ve observed over time is that many data professionals begin their journey with statistical anomaly detection (and for good reason). It’s intuitive, fast, and surprisingly effective when applied correctly. At this stage, the goal isn’t to build an advanced model but to build understanding: to learn how your data behaves under normal conditions. If your dataset is relatively clean, univariate (one metric at a time), and stable over time, statistical techniques like Z-scores and interquartile range (IQR) thresholds often perform remarkably well. They don’t require complex training, and the results are easy to explain to non-technical stakeholders (haha, something that’s often underestimated in analytics!).
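Since the IQR rule is mentioned here, this is a small sketch of how it might look. It uses synthetic data similar to the Z-score example later in this post, and the 1.5 multiplier is the conventional box-plot fence, which I am assuming as a starting point rather than a tuned value.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): 200 typical values plus 3 injected outliers.
rng = np.random.default_rng(42)
values = pd.Series(np.append(rng.normal(50, 5, 200), [20, 85, 90]))

# The IQR is the spread of the middle 50% of the data.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Conventional fences: 1.5 * IQR below Q1 and above Q3.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_iqr_anomaly = (values < lower) | (values > upper)

print(f"Fences: [{lower:.1f}, {upper:.1f}]")
print(values[is_iqr_anomaly])
```

Because the fences are built from quartiles, a handful of extreme values barely moves them, which makes this a forgiving first check.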

 

Statistical anomaly detection works on a simple idea: define the center, measure the spread, and then flag points that stray too far. The Z-score, for example, calculates how many standard deviations a point lies from the mean: z = (value - mean) / standard deviation. The further away a point is, the more “abnormal” it becomes, and a common rule of thumb flags anything beyond 3 standard deviations. It’s a bit like knowing how loud a sound must be before you notice it. I’ve used this in multiple scenarios, from monitoring customer engagement drops to identifying system performance issues. What makes it powerful is not the complexity, but the clarity. You know exactly why a value was flagged.

 


 

And, of course... real-world data is rarely that neat. Many business metrics are skewed: they have long tails or contain outliers by nature. In those cases, the mean gets pulled toward the extremes and the standard deviation inflates, so both become misleading. That’s when I often turn to the median and the median absolute deviation (MAD), which are robust statistics that don’t get distorted by extreme values. Using these, you can measure the typical spread without letting a few large or small values dominate the calculation.

 

Another key aspect is granularity. A single threshold across all data might be too blunt. Instead, segmenting by product line, region, or time of day can yield more meaningful results. For instance, an “unusual” order volume at midnight might be normal for one geography but rare for another. I've learned that statistical techniques are flexible enough to accommodate such segmentation, and this adaptability makes them an excellent foundation before introducing machine learning.
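Here is a rough sketch of what segmentation can look like in pandas. The regions, the `midnight_orders` column, and the numbers are hypothetical; the point is only to compare one global threshold against per-segment thresholds.

```python
import numpy as np
import pandas as pd

# Hypothetical data: region A is busy at midnight, region B is quiet.
rng = np.random.default_rng(0)
orders = pd.DataFrame({
    "region": ["A"] * 100 + ["B"] * 100,
    "midnight_orders": np.concatenate([
        rng.normal(200, 10, 100),
        rng.normal(20, 3, 100),
    ]),
})
# Inject a value that is tiny for region A but extreme for region B.
orders.loc[150, "midnight_orders"] = 60.0

# One global Z-score across all regions.
col = orders["midnight_orders"]
orders["global_z"] = (col - col.mean()) / col.std()

# Z-score computed within each region.
grouped = orders.groupby("region")["midnight_orders"]
orders["segment_z"] = (col - grouped.transform("mean")) / grouped.transform("std")

orders["global_flag"] = orders["global_z"].abs() > 3
orders["segment_flag"] = orders["segment_z"].abs() > 3

print(orders.loc[150, ["region", "midnight_orders", "global_z", "segment_z"]])
```

The injected row sits comfortably between the two regions, so the global Z-score sees nothing unusual, while the per-region score flags it clearly.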

 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data: 200 points around 50 (std 5) plus three injected outliers
np.random.seed(42)
data = np.random.normal(50, 5, 200)
outliers = [20, 85, 90]
data = np.append(data, outliers)

df = pd.DataFrame({'value': data})

# Z-score: distance from the mean in standard deviations; flag |z| > 3
mean, std = df['value'].mean(), df['value'].std()
df['z_score'] = (df['value'] - mean) / std
df['is_z_anomaly'] = df['z_score'].abs() > 3

# Modified Z-score: distance from the median, scaled by the MAD.
# The 0.6745 constant makes it comparable to a Z-score for normal data; flag |score| > 3.5
median = df['value'].median()
mad = np.median(np.abs(df['value'] - median))
df['mad_score'] = 0.6745 * (df['value'] - median) / mad
df['is_mad_anomaly'] = df['mad_score'].abs() > 3.5

# Plot the series and mark the points flagged by each method
plt.figure(figsize=(10, 6))
plt.plot(df['value'], label='Data', linewidth=2)
plt.scatter(df.index[df['is_z_anomaly']], df.loc[df['is_z_anomaly'], 'value'],
            color='red', s=100, label='Z-score Anomalies')
plt.scatter(df.index[df['is_mad_anomaly']], df.loc[df['is_mad_anomaly'], 'value'],
            color='orange', s=100, label='MAD Anomalies', marker='x')
plt.title("Statistical Anomaly Detection (Z-score & MAD)")
plt.xlabel("Index")
plt.ylabel("Value")
plt.legend()
plt.show()
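One thing the example above doesn't show is how badly the plain Z-score can behave on small samples. The sketch below uses ten made-up values to illustrate it: with the sample standard deviation, no point among n values can have a Z-score above (n - 1) / sqrt(n), so with ten points nothing can ever cross a threshold of 3, however extreme it is.

```python
import numpy as np
import pandas as pd

# Ten made-up readings; the last one is obviously wrong.
sample = pd.Series([50, 52, 49, 51, 48, 50, 53, 47, 51, 500], dtype=float)
n = len(sample)

# Classic Z-score (pandas .std() uses the sample standard deviation, ddof=1).
z = (sample - sample.mean()) / sample.std()

# Upper bound on any |z| for a sample of size n.
ceiling = (n - 1) / np.sqrt(n)

# Modified Z-score based on the median and MAD.
median = sample.median()
mad = (sample - median).abs().median()
robust_z = 0.6745 * (sample - median) / mad

print(f"Largest |z|: {z.abs().max():.3f} (ceiling for n={n}: {ceiling:.3f})")
print(f"Largest |modified z|: {robust_z.abs().max():.1f}")
print("Flagged by Z-score:", int((z.abs() > 3).sum()))
print("Flagged by MAD:", int((robust_z.abs() > 3.5).sum()))
```

The outlier inflates the very standard deviation it is measured against, so it hides itself. The median and MAD barely move, which is why the robust score catches it.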

 

Once your data grows more intricate, though, traditional statistical thresholds start losing their precision. That’s when you turn to machine-learning-based anomaly detection, where algorithms learn what “normal” looks like rather than having you define it manually. Two of the most accessible and widely used techniques in this space are Isolation Forest and Local Outlier Factor (LOF).

 

Isolation Forest takes an elegant, almost counterintuitive approach: it doesn’t model normal behavior. Instead, it isolates anomalies directly. It builds many random trees, each of which repeatedly picks a feature and a random split value, and measures how many splits it takes to separate each point from the rest. Because anomalies are few and differ significantly from the rest, they tend to get isolated faster, resulting in shorter average path lengths in the model’s trees. I like to think of it as walking through a dense forest - the more typical trees are grouped closely, while the unusual ones stand apart, easy to spot even in the distance.

 


 

Local Outlier Factor, on the other hand, views the world through density. It measures how tightly packed each data point’s neighborhood is and compares that to the density around its neighbors. If one point sits in a sparse pocket while its peers are clustered closely together, that’s your outlier. This technique excels in datasets where “unusual” depends on proximity rather than raw magnitude.

In practice, I’ve often used these methods together. Isolation Forest is efficient for large, high-dimensional datasets, while LOF is more sensitive and interpretable for smaller samples. One lesson I’ve learned is not to treat their outputs as binary “yes/no” labels. Instead, look at the anomaly scores. A more extreme score doesn’t just say “this is an outlier”; it tells you how strongly the model believes that. Just check which direction is “extreme” in your library: in scikit-learn, for example, lower values from score_samples mean more abnormal.

 

from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: two tight clusters (200 points each) plus 10 uniformly scattered outliers
np.random.seed(42)
X = 0.3 * np.random.randn(200, 2)
X = np.r_[X + 2, X - 2]  # two clusters
X_outliers = np.random.uniform(low=-6, high=6, size=(10, 2))
X = np.concatenate([X, X_outliers], axis=0)

# Isolation Forest: contamination is the expected share of anomalies (here 3%)
iso = IsolationForest(contamination=0.03, random_state=42)
iso_labels = iso.fit_predict(X)  # -1 = anomaly, 1 = normal
iso_outliers = X[iso_labels == -1]

# Local Outlier Factor: compares each point's density to that of its 20 nearest neighbors
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.03)
lof_labels = lof.fit_predict(X)  # -1 = anomaly, 1 = normal
lof_outliers = X[lof_labels == -1]

# Side-by-side comparison of what each method flags
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

axes[0].scatter(X[:, 0], X[:, 1], color='lightblue', s=40, label='Normal')
axes[0].scatter(iso_outliers[:, 0], iso_outliers[:, 1], color='red', s=100, label='Anomaly')
axes[0].set_title("Isolation Forest Anomaly Detection")
axes[0].legend()

axes[1].scatter(X[:, 0], X[:, 1], color='lightgreen', s=40, label='Normal')
axes[1].scatter(lof_outliers[:, 0], lof_outliers[:, 1], color='purple', s=100, label='Anomaly')
axes[1].set_title("Local Outlier Factor (LOF) Detection")
axes[1].legend()

plt.show()
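The example above only uses the -1 / 1 labels. To follow the advice about scores, a sketch like the following pulls the continuous scores out of both scikit-learn models and ranks the points instead of thresholding them. The data here is a new toy set I made up (one cluster plus a single far-away point), not the one plotted above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Toy data: one tight cluster plus a single far-away point at index 200.
rng = np.random.default_rng(42)
inliers = 0.3 * rng.standard_normal((200, 2))
far_point = np.array([[4.0, 4.0]])
X = np.vstack([inliers, far_point])

# Isolation Forest: score_samples returns one score per row; lower = more abnormal.
iso = IsolationForest(random_state=42).fit(X)
iso_scores = iso.score_samples(X)

# LOF: negative_outlier_factor_ is set after fitting; inliers sit near -1,
# and more negative values mean more abnormal.
lof = LocalOutlierFactor(n_neighbors=20).fit(X)
lof_scores = lof.negative_outlier_factor_

# Rank instead of thresholding: the five most suspicious rows per model.
top_iso = np.argsort(iso_scores)[:5]
top_lof = np.argsort(lof_scores)[:5]

print("Isolation Forest, most abnormal rows:", top_iso)
print("LOF, most abnormal rows:", top_lof)
```

Ranking by score lets you review the most suspicious records first and pick a cut-off afterwards, rather than committing to a contamination value up front.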

 

And now we have one of my favourites... time-series data. Here, the notion of “normal” isn’t static; it moves with time. Patterns emerge and fade, seasons cycle, and trends evolve. A spike might be alarming today but perfectly reasonable next quarter. This temporal context means that any anomaly detection approach must first understand how your data changes over time before deciding whether something’s unusual.

 

One of the most powerful and interpretable approaches I’ve used is decomposition: breaking a time series into three components (trend, seasonality, and residual). The trend shows long-term movement, the seasonality captures recurring patterns (like daily or weekly cycles), and the residual contains whatever the other two don’t explain. Once you isolate the residual, anomalies become easier to detect, because they’re the unexplained parts (the signals that don’t fit any known pattern).

 


 

What I like about this approach is how visual it is. You can literally see your anomalies emerge once the noise and seasonality are stripped away. It’s intuitive for stakeholders, too. When presenting to teams, showing a decomposed chart with residual spikes instantly communicates what’s happening (yes, no equations needed). Of course, decomposition isn’t the only path. For more advanced scenarios, models like ARIMA, Prophet, or even deep-learning-based methods like LSTMs can predict future values and flag deviations from forecasts.

 

import pandas as pd
import numpy as np
from statsmodels.tsa.seasonal import STL
import matplotlib.pyplot as plt

# Synthetic daily series: linear trend + 30-day seasonal cycle + Gaussian noise
np.random.seed(42)
date_range = pd.date_range(start='2025-01-01', periods=365, freq='D')
trend = np.linspace(10, 50, 365)
seasonal = 10 * np.sin(2 * np.pi * date_range.dayofyear / 30)
noise = np.random.normal(0, 2, 365)
values = np.array(trend + seasonal + noise, dtype=float)

# Inject three anomalies at known positions
values[60] += 20
values[180] -= 15
values[300] += 25

# STL decomposition; period matches the 30-day seasonal cycle
df = pd.DataFrame({'date': date_range, 'value': values}).set_index('date')
stl = STL(df['value'], period=30)
res = stl.fit()

# Flag points whose residual exceeds 3 standard deviations of all residuals
df['residual'] = res.resid
threshold = 3 * df['residual'].std()
df['is_anomaly'] = df['residual'].abs() > threshold

# Plot the original series with detected anomalies highlighted
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['value'], label='Original Series', linewidth=2)
plt.scatter(df.index[df['is_anomaly']], df.loc[df['is_anomaly'], 'value'],
            color='red', s=100, label='Detected Anomalies')
plt.title("Time-Series Anomaly Detection (STL Decomposition)")
plt.xlabel("Date")
plt.ylabel("Value")
plt.legend()
plt.tight_layout()
plt.show()
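The forecast-and-compare idea behind ARIMA, Prophet, or LSTM detectors can be sketched without any of those models. Below, an exponentially weighted moving average stands in as a deliberately simple one-step-ahead forecaster (my substitution, not one of the models named above), and points are flagged when the forecast error is unusually large. The series, the span of 14, and the 3.5 cut-off are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: gentle upward trend plus noise, with one injected spike.
rng = np.random.default_rng(42)
dates = pd.date_range("2025-01-01", periods=200, freq="D")
series = pd.Series(np.linspace(10, 30, 200) + rng.normal(0, 1, 200), index=dates)
series.iloc[120] += 15

# One-step-ahead forecast: EWMA of past values only.
# shift(1) ensures today's forecast never sees today's value.
forecast = series.ewm(span=14, adjust=False).mean().shift(1)
error = series - forecast  # first value is NaN because there is no prior forecast

# Score the forecast errors with a robust (median / MAD) modified Z-score.
err_median = error.median()
mad = (error - err_median).abs().median()
robust_z = 0.6745 * (error - err_median) / mad
is_anomaly = robust_z.abs() > 3.5

print(series[is_anomaly])
```

Swapping in a stronger forecaster changes only the `forecast` line; the error-scoring step stays the same, which is what makes this pattern easy to grow into.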

 

The beauty of these techniques is that you don’t need massive infrastructure or complex systems to start. A few lines of Python in Microsoft Fabric can reveal patterns you might never have seen before. So, start small: run a Z-score check, visualize a few anomalies, and see what story your data tells. Once you do, you’ll begin to understand how even simple models can surface powerful insights.

 

From there, layering in more advanced methods like Isolation Forests or time-series decomposition becomes a natural evolution. I find that once you visualize anomalies and start connecting them to the real world, it changes the way you think about data altogether. Suddenly, analytics stops being purely retrospective and starts becoming predictive... that’s really the power of anomaly detection - and it’s something you can start exploring TODAY!

 

Thanks for taking the time to read my post! I’d love to hear what you think and connect with you 🙂
