Inspired by a real-life scenario in which student disengagement went undetected, this project simulates how EdTech platforms can use Microsoft Fabric to detect early warning signs and support at-risk learners.
Architecture Overview
The Fabric workspace integrates:
Data Engineering workload (PySpark)
Machine Learning workload
Delta Tables stored in OneLake
SQL Views for analysts and data scientists
Data Generator
↓
Bronze Layer (Raw Data)
↓
Silver Layer (Cleaned Aggregations)
↓
Gold Layer (ML Features + Labels)
↓
Model Training / Power BI Dashboard

The Bronze layer creates millions of realistic records with attributes such as:
Demographics (age, gender, region, device)
Engagement (logins, session length, discussion activity)
Performance (grades, submissions)
Behavioral metrics (motivation, stress, attendance)
Each dataset is stored in Delta format:
students.write.mode("overwrite").saveAsTable("bronze.student_demographics")
activity.write.mode("overwrite").partitionBy("week").saveAsTable("bronze.student_activity_logs")
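The record generation described above can be sketched in plain Python. This is only an illustrative sketch, not the project's actual generator: the function name `generate_students`, the value ranges, and the category lists are assumptions chosen for the example, and only a few of the attributes listed above are included.

```python
import random

def generate_students(n, seed=42):
    """Return n synthetic student rows as dictionaries (illustrative schema)."""
    rng = random.Random(seed)  # seeded so repeated runs give identical data
    regions = ["north", "south", "east", "west"]   # hypothetical categories
    devices = ["mobile", "desktop", "tablet"]      # hypothetical categories
    rows = []
    for student_id in range(1, n + 1):
        rows.append({
            "student_id": student_id,
            "age": rng.randint(16, 45),
            "region": rng.choice(regions),
            "device": rng.choice(devices),
            # Gaussian logins, clipped at zero so counts stay non-negative
            "logins_per_week": max(0, round(rng.gauss(5, 2))),
            "attendance_rate": round(rng.uniform(0.3, 1.0), 3),
            "stress": rng.randint(1, 10),
        })
    return rows

students_rows = generate_students(1000)
```

In a Fabric notebook, rows like these could be handed to `spark.createDataFrame(students_rows)` before the `saveAsTable` calls above. At the scale of millions of records, generating the rows inside Spark itself (for example starting from `spark.range`) would likely be the better fit.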
The Silver layer consolidates records per student:
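from pyspark.sql import functions as F

# One row per student: mean weekly logins and their population standard deviation
activity_agg = (
    spark.table("bronze.student_activity_logs")
    .groupBy("student_id")
    .agg(
        F.avg("logins_per_week").alias("avg_logins"),
        F.stddev_pop("logins_per_week").alias("login_volatility"),
    )
)

This yields conformed, analytics-ready tables with one row per student for modeling.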
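To make the two aggregates concrete, here is a small plain-Python sketch of the same per-student computation. The sample rows are hypothetical, and `statistics.pstdev` stands in for Spark's `stddev_pop` (both compute the population standard deviation).

```python
from collections import defaultdict
from statistics import fmean, pstdev

# Hypothetical weekly log rows: (student_id, logins_per_week)
logs = [
    (1, 4), (1, 6), (1, 5),   # steady engagement
    (2, 0), (2, 10),          # erratic engagement
]

# Group the weekly values by student, mirroring groupBy("student_id")
by_student = defaultdict(list)
for student_id, logins in logs:
    by_student[student_id].append(logins)

# Mean and population standard deviation per student
activity_agg = {
    sid: {"avg_logins": fmean(vals), "login_volatility": pstdev(vals)}
    for sid, vals in by_student.items()
}
```

Both students average five logins per week, but the second has a much higher `login_volatility`, which is the kind of difference the average alone would hide.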
The Gold layer merges aggregated features and computes dropout probability:
from pyspark.sql import functions as F

# Weighted sum of continuous signals and binary threshold flags
risk_signal = (
    1.2 * (1 - F.col("avg_video_completion")) +
    1.0 * F.col("login_volatility") +
    0.8 * (1 - F.col("attendance_rate")) +
    1.0 * F.when(F.col("avg_score_all") < 55, 1).otherwise(0) +
    0.6 * F.when(F.col("grade_trend") < -5, 1).otherwise(0) +
    0.5 * F.when(F.col("stress_avg") > 6, 1).otherwise(0) +
    0.4 * F.when(F.col("motivation_avg") < 5, 1).otherwise(0)
)
# logistic(x) = 1 / (1 + exp(-x)), written out with Spark column functions
p_dropout = 1 / (1 + F.exp(-(-1.2 + risk_signal)))

Students with p_dropout > 0.35 are labeled as at-risk.
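A worked example in plain Python may help show how the score behaves. The weights, the -1.2 offset, and the 0.35 threshold come from the formula above; the two student profiles are hypothetical, and `logistic` is assumed to be the standard sigmoid.

```python
import math

def logistic(x):
    """Standard sigmoid: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dropout_probability(f):
    """Apply the Gold-layer risk formula to one student's feature dict."""
    risk_signal = (
        1.2 * (1 - f["avg_video_completion"])
        + 1.0 * f["login_volatility"]
        + 0.8 * (1 - f["attendance_rate"])
        + 1.0 * (1 if f["avg_score_all"] < 55 else 0)
        + 0.6 * (1 if f["grade_trend"] < -5 else 0)
        + 0.5 * (1 if f["stress_avg"] > 6 else 0)
        + 0.4 * (1 if f["motivation_avg"] < 5 else 0)
    )
    return logistic(-1.2 + risk_signal)

# Hypothetical profiles, for illustration only
engaged = {
    "avg_video_completion": 0.9, "login_volatility": 0.2,
    "attendance_rate": 0.95, "avg_score_all": 78,
    "grade_trend": 1.0, "stress_avg": 4, "motivation_avg": 7,
}
struggling = {
    "avg_video_completion": 0.4, "login_volatility": 1.5,
    "attendance_rate": 0.6, "avg_score_all": 50,
    "grade_trend": -8, "stress_avg": 7, "motivation_avg": 4,
}

p_engaged = dropout_probability(engaged)
p_struggling = dropout_probability(struggling)
at_risk = {"engaged": p_engaged > 0.35, "struggling": p_struggling > 0.35}
```

For the engaged profile the risk signal is 0.36, which gives a probability of roughly 0.30 and stays under the threshold. The struggling profile trips all four flags and lands well above it. With every term at zero the score is logistic(-1.2), about 0.23, so the offset sets the baseline below the 0.35 cut-off.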
Fabric’s distributed Spark engine enables scalable synthetic data generation:
spark.conf.set("spark.sql.shuffle.partitions", "800")

Delta optimizations such as OPTIMIZE with ZORDER BY compact small files and co-locate related rows, which improves read efficiency for large workloads.
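As a rough sketch of how the Delta maintenance step might be issued from a notebook, the helper below builds an `OPTIMIZE ... ZORDER BY` statement. The helper name `build_optimize_sql` and the choice of `student_id` as the Z-order column are assumptions for illustration; the original post does not say which columns were used.

```python
def build_optimize_sql(table, zorder_cols=()):
    """Build a Delta OPTIMIZE statement, optionally with a ZORDER BY clause."""
    stmt = f"OPTIMIZE {table}"
    if zorder_cols:
        stmt += f" ZORDER BY ({', '.join(zorder_cols)})"
    return stmt

# Hypothetical choice: cluster the Gold table on the student key
optimize_stmt = build_optimize_sql("gold.student_dropout_features", ["student_id"])

# In a notebook with an active Spark session, this would be run as:
# spark.sql(optimize_stmt)
```

Z-ordering tends to pay off on columns that appear often in filters or joins, so the right column depends on how the table is actually queried.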
The Gold table is published for consumption:
CREATE OR REPLACE VIEW gold.vw_student_dropout_features AS
SELECT * FROM gold.student_dropout_features;

Analysts can query the view through the SQL endpoint, while data scientists connect directly to the ML workload for model training.
Next steps:
Integrate Microsoft Fabric ML models directly from the Gold dataset
Build a Power BI dashboard for cohort-level dropout visualization
Add temporal engagement trends for early disengagement signals
This project illustrates how Microsoft Fabric unifies data engineering and data science workflows to build reproducible, ML-ready data systems.
By simulating a realistic EdTech dataset, the pipeline demonstrates Fabric’s ability to:
Handle high-volume data efficiently
Support feature engineering across layers
Enable seamless ML experimentation
Ultimately, it’s a step toward data-driven education, helping platforms identify and re-engage students before they drop out.

In Fabric, every dataset can tell a story if you build the right pipeline to listen.
Author: Olayemi O Awofe
Github Repository: Click here
Tags: #MicrosoftFabric #DataEngineering #FabricDataEngineering #MachineLearning #DeltaTables #EducationAnalytics #SyntheticData #MedallionArchitecture