It’s easy to get caught up in the thrill of building a machine learning model. You update the features, run experiments, and finally land on an accuracy score that looks impressive. But here’s a reality I see many data professionals learn the hard way: accuracy can be incredibly deceptive. If you’re working on fraud detection, for example, and 98% of transactions are legitimate, then a model that predicts “legit” for everything still hits 98% accuracy. It looks great, but it’s utterly useless. That’s why I’ve learned to go beyond accuracy and look at metrics that actually tell the full story.
What you will learn: In this edition, we will explore model evaluation metrics. By the time you’re through, you’ll know how to make sense of precision, recall, F1-score, AUC, and MCC in plain language, and more importantly, when to reach for each one depending on the problem in front of you. I’ll also show you how to implement these metrics in Python using scikit-learn, and how to bring them to life with visualizations like precision-recall and ROC curves.
Read Time: 8 minutes
Think about it: if your model misses most fraudulent transactions but still labels all the legitimate ones correctly, accuracy looks good, but the business impact is disastrous. That’s where metrics like precision, recall, AUC, F1-score, and Matthews Correlation Coefficient (MCC) come in. These metrics give you a truer picture of performance. They tell you not just whether the model is right, but how it is right (or wrong). To make this concrete, let’s use a fraud detection example.
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, matthews_corrcoef,
                             precision_recall_curve, roc_curve,
                             confusion_matrix, classification_report)
import matplotlib.pyplot as plt
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 1]) # 1 = fraud, 0 = legit
y_pred = np.array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0]) # model predictions
y_proba = np.array([0.05, 0.1, 0.2, 0.9, 0.3, 0.4, 0.15, 0.25, 0.8, 0.35]) # predicted probability of fraud
Precision answers the question: “Of all the transactions flagged as fraud, how many were actually fraud?” In practice, this matters because false alarms cost money and time. If your fraud team investigates 100 flagged transactions and only 10 are truly fraud, that’s 90 wasted efforts. A model with high precision avoids this problem by ensuring most of its positive predictions are correct.
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:\n", cm)
print("Precision:", precision)
print("Recall:", recall)
tn, fp, fn, tp = cm.ravel()
print(f"True Negatives: {tn}, False Positives: {fp}, False Negatives: {fn}, True Positives: {tp}")
But recall flips the perspective. It asks: “Of all the fraudulent transactions, how many did the model catch?” If you’ve got 50 fraud cases and the model only flags 20 of them, that’s a recall of 0.4. From what I’ve seen, companies often care deeply about recall in fraud detection because missing fraud directly impacts revenue. The trade-off, of course, is that higher recall can mean lower precision... catching more fraud at the cost of more false alarms.
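To make that trade-off visible, here’s a minimal sketch that reuses the y_true and y_proba arrays defined above and recomputes precision and recall at two decision thresholds. The 0.5 and 0.3 cutoffs are arbitrary values picked for illustration, not anything the model itself implies.
# Lowering the decision threshold flags more transactions as fraud:
# recall rises while precision tends to fall
for threshold in [0.5, 0.3]:
    y_pred_t = (y_proba >= threshold).astype(int)
    p = precision_score(y_true, y_pred_t, zero_division=0)
    r = recall_score(y_true, y_pred_t)
    print(f"Threshold {threshold:.1f} -> precision: {p:.2f}, recall: {r:.2f}")
On this toy data, dropping the cutoff from 0.5 to 0.3 lifts recall from 0.50 to 1.00 while precision slips from 1.00 to 0.80, which is exactly the tension described below.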
The trade-off between these two metrics is more than just academic. In my experience, the discussion around precision and recall often becomes a business decision. A bank may prefer high recall and accept some false positives because the cost of missing fraud is higher than chasing down a few extra false alarms. On the other hand, an e-commerce company may value precision because annoying loyal customers with false fraud alerts could hurt the brand.
At some point, you’ll likely get stuck in the tug-of-war between precision and recall. That’s when I reach for the F1-score. It’s the harmonic mean of the two, meaning it rewards models that balance both sides instead of excelling at just one. When I worked on churn models, this was the metric that kept things real - it showed me when my recall was tanking even if precision looked great.
f1 = f1_score(y_true, y_pred)
print("F1 Score:", f1)
print("\nClassification Report:\n", classification_report(y_true, y_pred, target_names=["Legit", "Fraud"]))
The F1-score is especially useful when you can’t afford to ignore either false positives or false negatives. In fraud detection, a decent F1-score means your model is catching fraud consistently without overwhelming your fraud team with noise. Still, F1 assumes that false positives and false negatives have equal weight, which isn’t always the case. Sometimes recall still wins the argument.
One thing I’ve observed is that F1 can sometimes be misunderstood by stakeholders. They see a single number and assume it’s the “best score.” But in reality, F1 is a compromise - it forces balance, but it doesn’t tell you if the balance is appropriate for the situation. I’ve had cases where the F1 looked reasonable, but when I looked deeper, the cost of false positives was far higher than the cost of false negatives. In those cases, I explained why recall or precision should take priority over F1, even if it meant reporting a lower score. That’s why I use F1 not as a final answer, but as a guide. It helps me see whether the model is heavily lopsided toward precision or recall.
While precision, recall, and F1 look at performance at one threshold, the ROC curve shows how the model behaves across all thresholds. This is important in fraud detection because you might adjust the cutoff probability depending on how aggressive you want to be. If your risk tolerance changes, the ROC curve shows you what happens when you slide that decision point up or down.
auc = roc_auc_score(y_true, y_proba)
fpr, tpr, thresholds = roc_curve(y_true, y_proba)
print("ROC AUC:", auc)
plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, marker='.', label=f'AUC = {auc:.2f}')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.grid(True)
plt.show()
# Walk the thresholds scikit-learn evaluated (the first one is a sentinel so the curve starts at (0, 0))
for t, f, r in zip(thresholds, fpr, tpr):
    print(f"Threshold: {t:.2f}, FPR: {f:.2f}, TPR: {r:.2f}")
AUC values close to 1 mean the model is solid at distinguishing fraud from legit transactions across thresholds. AUC around 0.5 means the model is basically guessing. What I like about AUC is that it doesn’t get fooled by class imbalance. Even if fraud is only 2% of transactions, a good AUC still shows me whether the model understands the difference.
But I’ve also learned to be cautious with AUC. A model can look great with a strong AUC score but still fail at the threshold you care about most. I once worked on a fraud model with an AUC of 0.92, which seemed fantastic. But at the business’s chosen threshold, the precision was so low that investigators were drowning in false positives. AUC was telling the big-picture story, but the local detail at one threshold was far more important.
That’s why I almost always pair ROC-AUC analysis with precision-recall curves. The ROC curve shows me the overall separability of classes, but the precision-recall curve shows me how the model holds up when fraud cases are rare. Together, they give me confidence that the model isn’t just good on paper, but practical in production.
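If you want to reproduce that pairing, here’s a minimal sketch using the precision_recall_curve function that’s already imported above; it plots precision against recall across the same predicted probabilities used for the ROC curve, which is usually the harsher view when positives are rare.
# Precision-recall curve across all thresholds (complements the ROC view)
prec, rec, pr_thresholds = precision_recall_curve(y_true, y_proba)

plt.figure(figsize=(6, 4))
plt.plot(rec, prec, marker='.')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.grid(True)
plt.show()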
One of the most underrated metrics I use is the Matthews Correlation Coefficient (MCC). It takes all four pieces of the confusion matrix into account: true positives, false positives, true negatives, and false negatives. This makes it balanced even when fraud cases are rare, which is often the case in real datasets.
mcc = matthews_corrcoef(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
print("MCC:", mcc)
print("Confusion Matrix:\n", cm)
if mcc == 1:
    print("Perfect predictions.")
elif mcc == 0:
    print("Model is no better than random guessing.")
elif mcc < 0:
    print("Model is making actively wrong predictions.")
else:
    print("Model has some predictive power, but needs improvement.")
What makes MCC valuable is that it doesn’t let you cheat. A model that predicts everything as “legit” might score well on accuracy, but MCC will expose it with a score near zero. I’ve had models that looked great in terms of recall or F1 but showed their true colors once MCC was calculated. It’s like a truth serum for imbalanced datasets.
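Here’s a tiny illustration of that trap, using a made-up batch of 1,000 transactions with a 2% fraud rate rather than the ten-row example above. A model that never flags anything scores 98% accuracy, yet MCC comes out as 0 (scikit-learn reports 0 when the formula’s denominator is undefined).
from sklearn.metrics import accuracy_score

# Synthetic data: 980 legit transactions, 20 fraudulent ones
y_all = np.array([0] * 980 + [1] * 20)
# A lazy "model" that labels everything as legit
y_lazy = np.zeros_like(y_all)

print("Accuracy:", accuracy_score(y_all, y_lazy))  # 0.98
print("MCC:", matthews_corrcoef(y_all, y_lazy))    # 0.0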
Another reason I like MCC is because it gives me a single number that actually respects balance. Unlike F1, which focuses only on positives, MCC accounts for the entire dataset. This makes it particularly valuable when fraud cases are extremely rare. I once saw a model with 99% accuracy but an MCC of just 0.02 - that single number told me the model was essentially useless.
If you’ve never tried MCC, I recommend adding it to your toolkit. Even if you don’t use it as your main metric, it’s an excellent way to validate whether your model is genuinely balanced. Think of it as a backstop that prevents you from getting too excited about a misleading metric. More than once, MCC has stopped me from deploying a model that would have collapsed in production.
Precision, recall, F1, AUC, and MCC each shine light on different parts of the picture. Together, they turn evaluation from a single number into a full story about how your model really behaves. The best part is, these metrics help you align with business needs. Sometimes recall wins, sometimes precision, sometimes a balance is required.
The context changes, but the toolkit stays the same. If you’re evaluating models today, don’t stop at accuracy. Dig deeper, run the metrics, and plot the curves. The payoff is clear: models that not only look good but deliver real-world impact. So, the next time you see “95% accuracy” in your results, pause and ask: does it really mean what I think it does?
Thanks for taking the time to read my post! I’d love to hear what you think and connect with you 🙂