This project is inspired by a real-life scenario in which student disengagement went undetected. It simulates how EdTech platforms can use Microsoft Fabric to detect early warning signs and support at-risk learners.
Architecture Overview
The Fabric workspace integrates:
Data Engineering workload (PySpark)
Machine Learning workload
Delta Tables stored in OneLake
SQL Views for analysts and data scientists
Data Generator
↓
Bronze Layer (Raw Data)
↓
Silver Layer (Cleaned Aggregations)
↓
Gold Layer (ML Features + Labels)
↓
Model Training / Power BI Dashboard
The Bronze layer creates millions of realistic records with attributes such as:
Demographics (age, gender, region, device)
Engagement (logins, session length, discussion activity)
Performance (grades, submissions)
Behavioral metrics (motivation, stress, attendance)
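The attribute groups above could be generated along the following lines. This is a minimal plain-Python sketch, not the project's actual generator: the function names (`make_student`, `make_students`), the category values, and every value range are assumptions chosen only for illustration.

```python
import random

# Hypothetical category values; the original post does not list them.
REGIONS = ["north", "south", "east", "west"]
DEVICES = ["mobile", "desktop", "tablet"]


def make_student(student_id, rng):
    """Build one synthetic student record with illustrative value ranges."""
    return {
        "student_id": student_id,
        # Demographics
        "age": rng.randint(17, 45),
        "gender": rng.choice(["F", "M", "other"]),
        "region": rng.choice(REGIONS),
        "device": rng.choice(DEVICES),
        # Engagement
        "logins_per_week": rng.randint(0, 14),
        "session_minutes": round(rng.uniform(5, 120), 1),
        # Behavioral metrics
        "attendance_rate": round(rng.uniform(0.3, 1.0), 2),
        "stress": rng.randint(1, 10),
        "motivation": rng.randint(1, 10),
    }


def make_students(n, seed=42):
    """Generate n records from a seeded generator so runs are reproducible."""
    rng = random.Random(seed)
    return [make_student(i, rng) for i in range(n)]


students = make_students(1000)
```

Seeding the generator keeps the synthetic dataset reproducible between runs. At the millions-of-rows scale described here, generating the rows with Spark itself (for example, starting from `spark.range` and adding column expressions) is likely to scale better than building a Python list on the driver.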
Each dataset is stored in Delta format:
students.write.mode("overwrite").saveAsTable("bronze.student_demographics")
activity.write.mode("overwrite").partitionBy("week").saveAsTable("bronze.student_activity_logs")
The Silver layer consolidates records per student:
from pyspark.sql import functions as F

# One row per student: mean weekly logins and their population standard deviation
activity_agg = (
    spark.table("bronze.student_activity_logs")
    .groupBy("student_id")
    .agg(
        F.avg("logins_per_week").alias("avg_logins"),
        F.stddev_pop("logins_per_week").alias("login_volatility")
    )
)
This produces conformed, analytics-ready tables for modeling.
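To make the two aggregates concrete, the sketch below reproduces the same per-student mean and population standard deviation in plain Python on a handful of rows. The sample rows and the helper name `aggregate_logins` are hypothetical; in the pipeline itself, Spark performs this computation.

```python
from collections import defaultdict
from statistics import fmean, pstdev

# Hypothetical weekly rows: (student_id, logins_per_week)
rows = [
    ("s1", 5), ("s1", 5), ("s1", 5),
    ("s2", 1), ("s2", 9), ("s2", 2),
]


def aggregate_logins(rows):
    """Group rows by student, then compute mean and population std dev."""
    by_student = defaultdict(list)
    for student_id, logins in rows:
        by_student[student_id].append(logins)
    return {
        sid: {
            "avg_logins": fmean(vals),        # counterpart of F.avg
            "login_volatility": pstdev(vals),  # counterpart of F.stddev_pop
        }
        for sid, vals in by_student.items()
    }


activity_agg = aggregate_logins(rows)
```

The two students have similar averages (5 and 4 logins per week), but only `s2` shows volatility (about 3.56 versus 0 for `s1`). The average alone would hide that irregular login pattern.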
The Gold layer merges aggregated features and computes dropout probability:
# Weighted sum of continuous signals and binary threshold flags
risk_signal = (
    1.2*(1 - F.col("avg_video_completion")) +
    1.0*F.col("login_volatility") +
    0.8*(1 - F.col("attendance_rate")) +
    1.0*(F.when(F.col("avg_score_all") < 55, 1).otherwise(0)) +
    0.6*(F.when(F.col("grade_trend") < -5, 1).otherwise(0)) +
    0.5*(F.when(F.col("stress_avg") > 6, 1).otherwise(0)) +
    0.4*(F.when(F.col("motivation_avg") < 5, 1).otherwise(0))
)
# logistic(x) = 1 / (1 + exp(-x)); the helper is not shown in this excerpt
p_dropout = logistic(-1.2 + risk_signal)
Students with p_dropout > 0.35 are labeled as at-risk.
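As a worked example of the scoring rule, the sketch below evaluates the same weights and thresholds in plain Python for two made-up students. The `logistic` definition here is the standard sigmoid and is an assumption about what the project's helper does; the two student profiles are hypothetical.

```python
import math


def logistic(x):
    """Standard sigmoid: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def risk_signal(s):
    """Same weights and thresholds as the Gold-layer expression."""
    return (
        1.2 * (1 - s["avg_video_completion"])
        + 1.0 * s["login_volatility"]
        + 0.8 * (1 - s["attendance_rate"])
        + 1.0 * (1 if s["avg_score_all"] < 55 else 0)
        + 0.6 * (1 if s["grade_trend"] < -5 else 0)
        + 0.5 * (1 if s["stress_avg"] > 6 else 0)
        + 0.4 * (1 if s["motivation_avg"] < 5 else 0)
    )


def p_dropout(s):
    return logistic(-1.2 + risk_signal(s))


def is_at_risk(s, threshold=0.35):
    return p_dropout(s) > threshold


# Hypothetical profiles for illustration only
engaged = {
    "avg_video_completion": 0.9, "login_volatility": 0.2,
    "attendance_rate": 0.95, "avg_score_all": 78, "grade_trend": 1.0,
    "stress_avg": 4, "motivation_avg": 7,
}
struggling = {
    "avg_video_completion": 0.4, "login_volatility": 1.5,
    "attendance_rate": 0.6, "avg_score_all": 50, "grade_trend": -8.0,
    "stress_avg": 7, "motivation_avg": 4,
}

print(is_at_risk(engaged), is_at_risk(struggling))
```

With an intercept of -1.2, the 0.35 cutoff corresponds to a `risk_signal` of roughly 0.58. An average score below 55 (weight 1.0) is therefore enough on its own to flag a student, while low motivation alone (weight 0.4) is not.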
Fabric’s distributed Spark engine enables scalable synthetic data generation:
spark.conf.set("spark.sql.shuffle.partitions", "800")
Delta optimizations such as OPTIMIZE (file compaction) and ZORDER BY (data clustering) improve query efficiency for large workloads.
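For reference, the Delta command takes the form `OPTIMIZE <table> ZORDER BY (<columns>)`. The small helper below only composes that statement as a string; the helper name and the choice of `student_id` as the clustering column are assumptions, not something stated in the original project.

```python
def optimize_statement(table, zorder_cols=None):
    """Compose a Delta OPTIMIZE statement, optionally with ZORDER BY."""
    stmt = f"OPTIMIZE {table}"
    if zorder_cols:
        stmt += " ZORDER BY (" + ", ".join(zorder_cols) + ")"
    return stmt


# Assumed clustering column: student_id (the table is already partitioned
# by week, and partition columns cannot be used for Z-ordering).
stmt = optimize_statement("bronze.student_activity_logs", ["student_id"])
print(stmt)

# In a Fabric notebook, the statement could then be run with:
#   spark.sql(stmt)
```

Z-ordering is most useful on columns that appear often in filters or joins, which is why a per-student key is a plausible candidate here.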
The Gold table is published for consumption:
CREATE OR REPLACE VIEW gold.vw_student_dropout_features AS
SELECT * FROM gold.student_dropout_features;
Analysts can query the view through the SQL endpoint, while data scientists connect directly to the ML workload for model training.
Next Steps
Integrate Microsoft Fabric ML models directly from the Gold dataset
Build a Power BI dashboard for cohort-level dropout visualization
Add temporal engagement trends for early disengagement signals
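One possible starting point for the temporal engagement trends mentioned in the last item above is a least-squares slope over a student's weekly values. This is only a sketch of one option: the function name `trend_slope` and the sample series are hypothetical, and the project may define its trend features differently.

```python
def trend_slope(values):
    """Least-squares slope of values against week index 0..n-1.

    A negative slope means the metric is declining week over week.
    Returns 0.0 when fewer than two points are available.
    """
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


# Hypothetical weekly login counts for one student
weekly_logins = [8, 6, 4, 2]
slope = trend_slope(weekly_logins)  # logins lost per week, on average
```

A steadily negative slope on logins or session length could flag disengagement earlier than a whole-term average, which smooths a recent decline away.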
This project illustrates how Microsoft Fabric unifies data engineering and data science workflows to build reproducible, ML-ready data systems.
By simulating a realistic EdTech dataset, the pipeline demonstrates Fabric’s ability to:
Handle high-volume data efficiently
Support feature engineering across layers
Enable seamless ML experimentation
Ultimately, it’s a step toward data-driven education, helping platforms identify and re-engage students before they drop out.
In Fabric, every dataset can tell a story if you build the right pipeline to listen.
Author: Olayemi O Awofe
Github Repository: Click here
Tags: #MicrosoftFabric #DataEngineering #FabricDataEngineering #MachineLearning #DeltaTables #EducationAnalytics #SyntheticData #MedallionArchitecture