Building a Student Dropout Prediction System Using...

Olayemi_Awofe · 11-04-2025

Background

This project is inspired by a real-life scenario in which student disengagement went undetected. It simulates how EdTech platforms can use Microsoft Fabric to detect early warning signs and support at-risk learners.

Architecture Overview

The Fabric workspace integrates:

Data Engineering workload (PySpark)
Machine Learning workload
Delta Tables stored in OneLake
SQL Views for analysts and data scientists

Data Generator
    ↓
Bronze Layer (Raw Data)
    ↓
Silver Layer (Cleaned Aggregations)
    ↓
Gold Layer (ML Features + Labels)
    ↓
Model Training / Power BI Dashboard

🟤 Bronze Layer – Synthetic Data Generation

The Bronze layer creates millions of realistic records with attributes such as:

Demographics (age, gender, region, device)
Engagement (logins, session length, discussion activity)
Performance (grades, submissions)
Behavioral metrics (motivation, stress, attendance)

Each dataset is stored in Delta format:

students.write.mode("overwrite").saveAsTable("bronze.student_demographics")
activity.write.mode("overwrite").partitionBy("week").saveAsTable("bronze.student_activity_logs")
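The generator itself is not shown here, so the following is only a minimal plain-Python sketch of how such records could be produced reproducibly. The category values, value ranges, and the helper names (make_student, make_activity, generate) are assumptions made for illustration, not the project's actual generator.

```python
import random

# Hypothetical category values, chosen only for illustration
GENDERS = ["female", "male", "other"]
REGIONS = ["north", "south", "east", "west"]
DEVICES = ["mobile", "desktop", "tablet"]


def make_student(student_id, rng):
    """Return one synthetic demographics record."""
    return {
        "student_id": student_id,
        "age": rng.randint(17, 45),
        "gender": rng.choice(GENDERS),
        "region": rng.choice(REGIONS),
        "device": rng.choice(DEVICES),
    }


def make_activity(student_id, week, rng):
    """Return one synthetic weekly activity record for a student."""
    return {
        "student_id": student_id,
        "week": week,
        "logins_per_week": rng.randint(0, 14),
        "session_minutes": round(rng.uniform(5.0, 120.0), 1),
    }


def generate(n_students, n_weeks, seed=42):
    """Generate demographics and weekly activity rows from a fixed seed."""
    rng = random.Random(seed)
    students = [make_student(i, rng) for i in range(n_students)]
    activity = [
        make_activity(s["student_id"], week, rng)
        for s in students
        for week in range(1, n_weeks + 1)
    ]
    return students, activity


students_rows, activity_rows = generate(n_students=3, n_weeks=4)
print(len(students_rows), len(activity_rows))  # 3 12
```

Seeding the random generator keeps every run identical, which helps when downstream layers are rebuilt. At the volumes described above, the rows would be produced on the Spark cluster rather than in a single Python process; this sketch only shows the shape of the records.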

⚪ Silver Layer – Aggregation & Cleaning

The Silver layer consolidates records per student:

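from pyspark.sql import functions as F

# One row per student: mean weekly logins and their population standard deviation
activity_agg = (
    spark.table("bronze.student_activity_logs")
    .groupBy("student_id")
    .agg(
        F.avg("logins_per_week").alias("avg_logins"),
        F.stddev_pop("logins_per_week").alias("login_volatility")
    )
)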

This produces one conformed, analytics-ready row per student for modeling.
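To make the two aggregates concrete, here is a small plain-Python equivalent of the groupBy above, using statistics.mean and statistics.pstdev (population standard deviation, the same quantity stddev_pop computes). The sample rows and the aggregate_logins helper are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Made-up weekly login counts for two students
rows = [
    {"student_id": 1, "logins_per_week": 5},
    {"student_id": 1, "logins_per_week": 5},
    {"student_id": 2, "logins_per_week": 2},
    {"student_id": 2, "logins_per_week": 8},
]


def aggregate_logins(rows):
    """Group rows by student and compute the mean and population std dev."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["student_id"]].append(row["logins_per_week"])
    return {
        student_id: {
            "avg_logins": mean(values),
            "login_volatility": pstdev(values),
        }
        for student_id, values in grouped.items()
    }


result = aggregate_logins(rows)
# Both students average 5 logins per week, but only student 2 is volatile:
# student 1 has login_volatility 0.0, student 2 has login_volatility 3.0
```

Two students with the same average can behave very differently week to week, which is why the volatility column is kept alongside the mean.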

🟡 Gold Layer – ML Features & Labels

The Gold layer merges the aggregated features and computes a synthetic dropout probability from a weighted risk signal:

risk_signal = (
    1.2*(1 - F.col("avg_video_completion")) +
    1.0*F.col("login_volatility") +
    0.8*(1 - F.col("attendance_rate")) +
    1.0*(F.when(F.col("avg_score_all") < 55, 1).otherwise(0)) +
    0.6*(F.when(F.col("grade_trend") < -5, 1).otherwise(0)) +
    0.5*(F.when(F.col("stress_avg") > 6, 1).otherwise(0)) +
    0.4*(F.when(F.col("motivation_avg") < 5, 1).otherwise(0))
)
# Logistic (sigmoid) transform of the shifted risk signal; -1.2 is the intercept
p_dropout = 1 / (1 + F.exp(-(-1.2 + risk_signal)))

Students with p_dropout > 0.35 are labeled as at-risk.
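As a worked example, the same scoring rule can be evaluated in plain Python for individual students. The weights, the -1.2 intercept, and the 0.35 threshold come from the snippet above; the two sample students, the function names, and the assumption that login_volatility is already on a small scale are made up for illustration.

```python
import math

AT_RISK_THRESHOLD = 0.35  # threshold stated in the article


def logistic(x):
    """Standard logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-x))


def dropout_probability(s):
    """Apply the article's weighted risk signal to one student record."""
    risk_signal = (
        1.2 * (1 - s["avg_video_completion"])
        + 1.0 * s["login_volatility"]
        + 0.8 * (1 - s["attendance_rate"])
        + 1.0 * (1 if s["avg_score_all"] < 55 else 0)
        + 0.6 * (1 if s["grade_trend"] < -5 else 0)
        + 0.5 * (1 if s["stress_avg"] > 6 else 0)
        + 0.4 * (1 if s["motivation_avg"] < 5 else 0)
    )
    return logistic(-1.2 + risk_signal)


def is_at_risk(s):
    """Label a student as at-risk when the probability exceeds the threshold."""
    return dropout_probability(s) > AT_RISK_THRESHOLD


# Hypothetical students, values chosen only to exercise the formula
engaged = {
    "avg_video_completion": 0.9, "login_volatility": 0.2,
    "attendance_rate": 0.95, "avg_score_all": 78,
    "grade_trend": 1.0, "stress_avg": 4, "motivation_avg": 7,
}
struggling = {
    "avg_video_completion": 0.4, "login_volatility": 0.5,
    "attendance_rate": 0.6, "avg_score_all": 50,
    "grade_trend": -8, "stress_avg": 7, "motivation_avg": 4,
}

print(f"{dropout_probability(engaged):.2f}", is_at_risk(engaged))        # 0.30 False
print(f"{dropout_probability(struggling):.2f}", is_at_risk(struggling))  # 0.94 True
```

With an intercept of -1.2, a student with a zero risk signal gets a probability of about 0.23, and the 0.35 cut-off is crossed once the risk signal exceeds roughly 0.58. A single triggered flag such as avg_score_all < 55 is therefore enough to label a student at-risk.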

⚙️ Performance & Optimization

Fabric’s distributed Spark engine enables scalable synthetic data generation:

spark.conf.set("spark.sql.shuffle.partitions", "800")

Delta maintenance commands such as OPTIMIZE, optionally with a ZORDER BY clause, compact small files and co-locate related rows, which improves read efficiency on large tables.
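For reference, a compaction statement can be assembled and submitted from a notebook as sketched below. The optimize_statement helper is hypothetical, and Z-ordering on student_id is an assumption; the article does not say which columns were used.

```python
def optimize_statement(table, zorder_cols=()):
    """Build a Delta OPTIMIZE statement, optionally with a ZORDER BY clause."""
    statement = f"OPTIMIZE {table}"
    if zorder_cols:
        statement += f" ZORDER BY ({', '.join(zorder_cols)})"
    return statement


sql = optimize_statement("gold.student_dropout_features", ["student_id"])
print(sql)  # OPTIMIZE gold.student_dropout_features ZORDER BY (student_id)

# In a notebook with an active Spark session, the statement would be run with:
# spark.sql(sql)
```

Z-ordering is usually most useful on high-cardinality columns that appear in filters or joins. Partition columns, such as week on the activity log table, cannot be used in ZORDER BY.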

📈 Consumption & Analytics

The Gold table is published for consumption:

CREATE OR REPLACE VIEW gold.vw_student_dropout_features AS
SELECT * FROM gold.student_dropout_features;

Analysts can query the view via the SQL endpoint, while data scientists connect directly to the ML workload for model training.

🔭 Future Enhancements

Integrate Microsoft Fabric ML models directly from the Gold dataset
Build a Power BI dashboard for cohort-level dropout visualization
Add temporal engagement trends for early disengagement signals

✅ Conclusion

This project illustrates how Microsoft Fabric unifies data engineering and data science workflows to build reproducible, ML-ready data systems.

By simulating a realistic EdTech dataset, the pipeline demonstrates Fabric’s ability to:

Handle high-volume data efficiently
Support feature engineering across layers
Enable seamless ML experimentation

Ultimately, it’s a step toward data-driven education, helping platforms identify and re-engage students before they drop out. In Fabric, every dataset can tell a story if you build the right pipeline to listen.

Author: Olayemi O Awofe

GitHub Repository: Click here
Tags: #MicrosoftFabric #DataEngineering #FabricDataEngineering #MachineLearning #DeltaTables #EducationAnalytics #SyntheticData #MedallionArchitecture