Every dataset tells a story but not every story is easy to read. Sometimes, the real narrative is hidden behind dozens or even hundreds of features, each whispering a fragment of the truth. You scroll through columns of numbers - clicks, sales, engagement scores, satisfaction levels... and it feels like trying to listen to every instrument in an orchestra at once. There’s beauty in the noise, but chaos too.
This is the reality of high-dimensional data. It’s abundant, but overwhelming. I’ve seen many analysts (and at times, my past self included) try to force patterns where none seem to exist - endlessly plotting pairwise charts and correlations, hoping something obvious would jump out. But when there’s too much overlap between features, your intuition starts to blur. That’s when I discovered how powerful Principal Component Analysis (PCA) can be.
What you will learn: In this edition, we will explore Principal Component Analysis (PCA) - what it really means, how it works, and why it’s such a powerful ally. You’ll start with the intuition behind PCA, then see how it actually works under the hood, and look at when and why PCA is worth using, especially in real-world data scenarios where features overlap or patterns are hard to see. Finally, you’ll learn how to interpret the results inside a Fabric notebook.
Read Time: 10 minutes
Source: Sahir Maharaj (https://sahirmaharaj.com)
Let’s start with the essence of PCA and what it actually does. If you’ve ever dealt with correlated variables, you’ll know how repetitive data can get. Two features might measure nearly the same thing: income and spending power, time on app and engagement rate, or weight and height. PCA recognizes that redundancy and merges it intelligently, turning correlated features into compact “principal components” that retain most of the same information (just cleaner!).
I often think of PCA as a form of data translation. Picture your dataset as a room full of people all trying to speak at once. PCA listens, identifies the few voices that truly drive the conversation, and then rewrites the story in their language. It’s not erasing anyone, just clarifying the message. What’s magical is how PCA preserves most of your data’s variance when features are strongly correlated... meaning you lose very little valuable information, even when reducing dozens of variables to just a few components.
To make this practical, say you start with 20 correlated features. PCA can often distill them into 3 or 4 strong, uncorrelated components that explain around 90% of the variance. It’s compression with very little compromise. I discovered this when using PCA on customer satisfaction surveys with dozens of overlapping questions, all subtly related to the same few sentiments. PCA condensed that complexity into a handful of components representing customer confidence, trust, and product value. Suddenly, the patterns were visible!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Simulate two correlated features: weight depends on height plus random noise
np.random.seed(0)
height = np.random.normal(170, 10, 100)
weight = height * 0.6 + np.random.normal(0, 5, 100)
data = pd.DataFrame({'Height': height, 'Weight': weight})
# Plot the raw features to see how closely they move together
plt.figure(figsize=(7,5))
plt.scatter(data['Height'], data['Weight'], alpha=0.7, color='steelblue', edgecolors='black')
plt.title('Height vs Weight (Correlated Features)')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.grid(True)
plt.show()
# Standardize so both features contribute on the same scale
scaler = StandardScaler()
scaled = scaler.fit_transform(data)
# Fit PCA and check how much variance each component explains
pca = PCA(n_components=2)
pca.fit(scaled)
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
print("Principal Components (Axes):")
print(pca.components_)
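If you want to sanity-check those numbers, here’s a minimal sketch, assuming the same synthetic height and weight data as above. For two standardized features with correlation r, the correlation matrix is [[1, r], [r, 1]], and its eigenvalues are 1 + |r| and 1 - |r|, so the explained variance ratios should come out as (1 + |r|) / 2 and (1 - |r|) / 2. The names `predicted_ratios` and `eigen_ratios` are just illustrative.

```python
import numpy as np

# Recreate the same synthetic data as the example above (same seed, same order of draws).
np.random.seed(0)
height = np.random.normal(170, 10, 100)
weight = height * 0.6 + np.random.normal(0, 5, 100)

# Pearson correlation between the two features.
r = np.corrcoef(height, weight)[0, 1]

# For two standardized features the correlation matrix is [[1, r], [r, 1]],
# with eigenvalues 1 + |r| and 1 - |r|. Dividing by their sum (2) gives the ratios.
predicted_ratios = np.array([1 + abs(r), 1 - abs(r)]) / 2

# Cross-check against a direct eigen-decomposition of the correlation matrix.
corr_matrix = np.corrcoef(np.vstack([height, weight]))
eigenvalues = np.linalg.eigvalsh(corr_matrix)[::-1]  # eigvalsh returns ascending order
eigen_ratios = eigenvalues / eigenvalues.sum()

print("Correlation r:", round(r, 3))
print("Predicted explained variance ratios:", predicted_ratios)
print("Ratios from eigenvalues:", eigen_ratios)
```

These ratios should line up with `pca.explained_variance_ratio_` from the block above, since a ratio doesn’t change when every eigenvalue is rescaled by the same constant. The stronger the correlation, the more of the story the first component carries.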
Cool - now that we’ve talked about what PCA does, let’s look at how it does it. At its core, PCA is all about movement - finding the directions where your data varies the most. It starts by looking at how each variable relates to every other through something called a covariance matrix. Think of it like mapping out how strongly every feature leans on the others. PCA then identifies the directions (principal components) that capture the largest patterns of movement in that map. Each of these components is uncorrelated with the others, meaning they don’t overlap or repeat information.
The first component captures the most variance, the second captures as much as possible of what’s left, and each one after that explains progressively smaller pieces of the story. By transforming your data into this new coordinate system, you remove redundancy and sharpen what’s meaningful. When I first learned PCA, I thought of it as rotating a cube in my hands. The data itself never changes, you just turn it until the key features align with your view. That’s what PCA does mathematically... it rotates your data in multi-dimensional space to show you its best angle.
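The steps described above can be sketched by hand with nothing but NumPy. This is a minimal illustration under my own assumptions (a small synthetic dataset, and an explicit eigen-decomposition of the covariance matrix), not a replacement for scikit-learn’s `PCA`. It shows the covariance matrix, the sorted components, and the rotation of the data onto the new axes.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: two features driven by one shared signal, plus one unrelated feature.
signal = rng.normal(0, 1, 300)
X = np.column_stack([
    signal + rng.normal(0, 0.3, 300),
    0.8 * signal + rng.normal(0, 0.3, 300),
    rng.normal(0, 1, 300),
])

# Step 1: standardize each column (zero mean, unit variance).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized features.
cov = np.cov(X_std, rowvar=False)

# Step 3: eigen-decomposition. eigh is meant for symmetric matrices and
# returns eigenvalues in ascending order, so sort them largest first.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # columns are the principal axes

# Step 4: share of total variance carried by each component.
explained_variance_ratio = eigenvalues / eigenvalues.sum()

# Step 5: rotate the data onto the new axes.
scores = X_std @ eigenvectors

# The rotated columns should be uncorrelated: off-diagonals near zero.
scores_cov = np.cov(scores, rowvar=False)

print("Explained variance ratio:", explained_variance_ratio.round(3))
print("Covariance of scores:")
print(scores_cov.round(3))
```

The covariance of the rotated data is diagonal, which is the “no overlap” property in numbers. scikit-learn typically gets to the same components through a singular value decomposition, so the signs of individual axes may differ from what you see here.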
I find that what’s beautiful about PCA is that it’s universal. It doesn’t care whether you’re analyzing social media engagement or gene expression data - the concept remains the same: find the directions where the signal is strongest. That’s why it’s so widely used across fields like finance, biology, and even image recognition. As a data scientist, I often rely on PCA not just for modeling, but for thinking. When I apply it early in an analysis, it instantly tells me whether my data has a dominant pattern or if it’s spread too thin. That intuition saves time, guides modeling choices, and helps me communicate findings in a way non-technical stakeholders can understand.
# Simulate three engagement metrics that all build on the same underlying activity
np.random.seed(123)
likes = np.random.normal(100, 20, 200)
shares = likes * 0.7 + np.random.normal(0, 5, 200)
comments = likes * 0.5 + shares * 0.3 + np.random.normal(0, 3, 200)
engagement = pd.DataFrame({
'Likes': likes,
'Shares': shares,
'Comments': comments
})
# Check how strongly the features move together before running PCA
print("Feature Correlations:")
print(engagement.corr())
# Standardize, then keep all three components to see the full variance breakdown
scaler = StandardScaler()
scaled_engagement = scaler.fit_transform(engagement)
pca = PCA(n_components=3)
pca.fit(scaled_engagement)
print("\nExplained Variance Ratio:", pca.explained_variance_ratio_)
print("Total Variance Explained:", np.sum(pca.explained_variance_ratio_))
# Loadings: the weight of each original feature in each component
loadings = pd.DataFrame(pca.components_, columns=engagement.columns, index=['PC1', 'PC2', 'PC3'])
print("\nFeature contributions to each component:")
print(loadings)
But... knowing how PCA works is one thing. Knowing when to use it is where judgment and experience come in. My recommendation is to reach for PCA when your dataset is wide, noisy, or full of variables that overlap in meaning. Think of financial metrics like “revenue,” “sales,” “turnover,” and “profit.” They’re all connected and they move together. PCA can condense those correlated features into a single “financial strength” component that captures the shared signal.
I’ve seen PCA do wonders in marketing analytics, too. For example, if you’re tracking engagement across dozens of metrics (clicks, time on page, scroll depth, likes, shares, comments), you’re likely measuring variations of the same behavior. Instead of trying to interpret each metric separately, PCA lets you identify the underlying factors, like “active engagement” or “content depth.”
It’s also an excellent choice when you’re preparing data for unsupervised learning, like clustering. In those scenarios, too many correlated features can distort distance metrics and confuse the algorithm. PCA helps by creating cleaner, uncorrelated dimensions that make clustering far more meaningful.
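Here’s a rough sketch of that clustering setup, assuming a hypothetical dataset of six engagement metrics that are all noisy copies of one hidden “activity” level. The two user groups, the metric scales, and the choice of two components and two clusters are illustrative assumptions, not a recipe.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical data: two user groups (casual vs. active), 150 users each.
# 'activity' is a hidden driver; the six metrics are noisy copies of it.
activity = np.concatenate([rng.normal(2, 0.5, 150), rng.normal(6, 0.5, 150)])
metrics = np.column_stack([
    activity * scale + rng.normal(0, 0.5, 300)
    for scale in (1.0, 0.9, 0.8, 1.1, 0.7, 1.2)
])

# Scale, compress to two uncorrelated components, then cluster on those.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(metrics)

print("Cluster sizes:", np.bincount(labels))
```

Wrapping the steps in a pipeline keeps the scaling and the PCA rotation tied to the clustering step, so new data goes through exactly the same transformation before it’s assigned to a cluster.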
That said, PCA isn’t a silver bullet. Yes, that’s right... it’s not ideal for categorical data or for situations where each variable has deep interpretive importance, like regulatory models or causal analysis. PCA trades interpretability of individual features for simplicity and structure. It’s best when your goal is discovery, compression, or visualization (not strict feature accountability).
Over the years, I’ve found that PCA also improves communication with business stakeholders. It’s one thing to say, “Our churn model uses 40 behavioral features.” It’s another to say, “Three behavioral factors explain 90% of the variation in customer behavior.” I find that clarity is powerful and simplifies storytelling too.
# Add two browsing metrics that are correlated with each other
np.random.seed(7)
time_spent = np.random.normal(5, 1, 200)
scroll_depth = time_spent * 0.5 + np.random.normal(0, 0.3, 200)
# Reuse likes, shares and comments from the previous example (200 rows each)
extended_df = pd.DataFrame({
'Likes': likes,
'Shares': shares,
'Comments': comments,
'TimeSpent': time_spent,
'ScrollDepth': scroll_depth
})
# Standardize and reduce five features to three components
scaler = StandardScaler()
scaled = scaler.fit_transform(extended_df)
pca = PCA(n_components=3)
pca.fit(scaled)
# Loadings show how much each feature contributes to each component
loadings = pd.DataFrame(pca.components_, columns=extended_df.columns, index=['PC1', 'PC2', 'PC3'])
print("Component Loadings:")
print(loadings)
print("\nExplained Variance Ratio:", pca.explained_variance_ratio_)
print("Cumulative Variance:", np.cumsum(pca.explained_variance_ratio_))
# Visualize which features drive the first component
plt.figure(figsize=(10,5))
plt.bar(extended_df.columns, loadings.loc['PC1'], color='lightcoral', edgecolor='black')
plt.title('Feature Contributions to Principal Component 1')
plt.ylabel('Loading Value')
plt.grid(True)
plt.show()
The final step in mastering PCA is interpretation. Each principal component represents a hidden dimension in your data - a combination of your original features, weighted according to how strongly each one contributes. The key to interpretation is in those weights, or loadings. They tell you which variables are driving each component.
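One way to read loadings in practice is to rank features by the absolute size of their weight in each component. The sketch below assumes a hypothetical dataset where two independent behaviours (I’ve called them `social` and `reading`) each drive their own group of metrics. The column names mirror the earlier examples, but the data is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Hypothetical data: two independent behaviours, each measured by a few noisy metrics.
social = rng.normal(0, 1, 200)
reading = rng.normal(0, 1, 200)
df = pd.DataFrame({
    "Likes": social + rng.normal(0, 0.3, 200),
    "Shares": social + rng.normal(0, 0.3, 200),
    "Comments": social + rng.normal(0, 0.3, 200),
    "TimeSpent": reading + rng.normal(0, 0.3, 200),
    "ScrollDepth": reading + rng.normal(0, 0.3, 200),
})

pca = PCA(n_components=2)
pca.fit(StandardScaler().fit_transform(df))

loadings = pd.DataFrame(pca.components_, columns=df.columns, index=["PC1", "PC2"])

# Rank features by absolute loading: the sign only shows direction,
# and it can flip between runs or library versions.
for component in loadings.index:
    ranked = loadings.loc[component].abs().sort_values(ascending=False)
    print(component, "is driven mostly by:", list(ranked.index[:3]))
```

When one group of features dominates a component, that’s your cue to give it a name, such as “social engagement” or “reading depth”. The naming is a judgment call on your part; PCA only hands you the weights.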
In my experience, this is where most of the value comes from. You don’t just end up with fewer dimensions... you end up with dimensions that mean something. You can now track, analyze, and even model around these composite indicators. The explained variance ratios tell you how much of your data’s story each component carries. The first few usually explain most of the variance, meaning they contain the bulk of your dataset’s useful information. Understanding that helps you decide how many components to keep for your next step - visualization, modeling, or feature selection.
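To decide how many components to keep, a common approach is to pick the smallest number that passes a cumulative variance target. Here’s a minimal sketch, assuming a 90% target and a hypothetical dataset of 10 features built from 3 hidden factors. It shows both the manual route and scikit-learn’s option of passing the target as a fraction to `n_components`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Hypothetical data: 10 features built from 3 hidden factors plus noise.
factors = rng.normal(0, 1, (500, 3))
mixing = rng.normal(0, 1, (3, 10))
X = factors @ mixing + rng.normal(0, 0.3, (500, 10))
X_scaled = StandardScaler().fit_transform(X)

# Option 1: fit every component and read the cumulative curve yourself.
full_pca = PCA(svd_solver="full").fit(X_scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
k = int(np.argmax(cumulative > 0.90)) + 1  # smallest count that passes 90%

# Option 2: pass the target as a fraction and let scikit-learn choose.
# svd_solver="full" is set explicitly, as the fractional form is documented for it.
auto_pca = PCA(n_components=0.90, svd_solver="full").fit(X_scaled)

print("Cumulative variance:", cumulative.round(3))
print("Components needed for 90% (manual):", k)
print("Components chosen by scikit-learn:", auto_pca.n_components_)
```

The 90% figure is a convention rather than a rule. For a quick 2D plot you might settle for less, while a downstream model may deserve a higher bar.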
And the coolest thing is that PCA also builds confidence. When you can show your audience how data condenses into a few meaningful patterns, you demonstrate mastery - not over the math, but over the message. From my perspective, that’s what great data science is about.
The world of data keeps expanding. More metrics, more features, more dashboards. But in the middle of all this, clarity becomes rare - and that’s exactly what PCA gives you. It reminds you that behind the hundreds of numbers is a much smaller set of forces that actually move the needle. By identifying those forces, you can focus your analysis, simplify your models, and communicate insights with elegance and impact.
And the beauty of running it inside Microsoft Fabric is how approachable it feels. You don’t need a massive dataset or an advanced mathematical background to start. Open a new notebook, load a small table and just try it. The first time you see your data compress into cleaner, simpler patterns, you’ll realize how much time you’ve been spending managing noise instead of uncovering insight.
Thanks for taking the time to read my post! I’d love to hear what you think and connect with you 🙂