11-24-2025 11:20 AM - last edited 11-24-2025 21:01 PM
Urban sustainability has become a critical challenge as cities grow rapidly and environmental pressures increase. This project provides a comprehensive end-to-end data science workflow to analyze, visualize, and model sustainability indicators across global cities.
The goal is to explore questions such as -
Which cities are leading in sustainability?
How do green cover, renewable energy usage, and air quality influence sustainability scores?
Can we cluster cities based on their sustainability profiles?
Which factors contribute most to sustainability outcomes?
The dataset contains multiple sustainability-related features such as -
🌿 Green Cover Percentage.
🏭 Air Quality Index.
🔋 Renewable Energy Usage.
🚇 Public Transport Efficiency.
🏘 Population Density.
🌞 Solar & Wind Energy Capacity.
🌍 Urban Sustainability Score (Target variable).
This project incorporates multiple machine learning techniques to predict Urban Sustainability Scores using diverse city-level environmental, infrastructural, and socio-economic features. The modeling step helps quantify how different factors contribute to sustainability outcomes across global cities.
High performance on structured/tabular data.
Ability to handle non-linear relationships.
Built-in regularization to prevent overfitting.
It predicts the Urban Sustainability Score based on variables such as -
Green cover percentage.
Air quality index.
Waste recycling efficiency.
Renewable energy adoption.
Urban mobility efficiency, etc.
Strong generalization.
Robustness to noise.
Reliable feature-importance insights.
Using two models enables a deeper understanding of variable influence and ensures consistency of predictions.
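The post does not include the training code, so the following is only a minimal sketch of fitting two tree-based regressors and comparing them on held-out data. Everything in it is an assumption made for illustration: the data is synthetic, the three features and the toy relationship are invented, and scikit-learn's `GradientBoostingRegressor` and `RandomForestRegressor` stand in for the project's actual models (the tech stack lists XGBoost, so `xgboost.XGBRegressor` could presumably be swapped in where it is installed).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_cities = 300

# Synthetic stand-ins for city-level features (hypothetical, not the real dataset).
# Columns: green cover %, renewable energy share %, air quality index.
X = rng.uniform(0, 100, size=(n_cities, 3))

# Assumed toy relationship: more green cover and renewables raise the score,
# a higher (worse) AQI lowers it, plus some noise.
y = 0.4 * X[:, 0] + 0.5 * X[:, 1] - 0.3 * X[:, 2] + rng.normal(0, 2, n_cities)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Two stand-in models; the project's own model choices may differ.
models = {
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=42),
}

predictions = {}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    scores[name] = r2_score(y_test, predictions[name])

# Correlation between the two models' held-out predictions:
# a value near 1 means the models broadly agree with each other.
agreement = np.corrcoef(
    predictions["gradient_boosting"], predictions["random_forest"]
)[0, 1]
```

Comparing `scores` side by side and checking `agreement` is one simple way to see whether two models tell a consistent story before interpreting either of them.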
MAE
RMSE
R² Score
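For reference, the three metrics can be written out directly from their definitions. This is a small illustrative sketch in plain Python; the function name `regression_metrics` and the sample numbers are made up for the example and are not results from the project.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R² for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Mean absolute error: average size of the errors, in score units.
    mae = sum(abs(e) for e in errors) / n

    # Root mean squared error: penalizes large errors more heavily than MAE.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # R²: share of the target's variance explained by the predictions.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")  # undefined for constant targets

    return {"mae": mae, "rmse": rmse, "r2": r2}


# Made-up scores, purely for illustration.
actual = [60.0, 70.0, 80.0, 90.0]
predicted = [62.0, 68.0, 83.0, 89.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

In practice, `sklearn.metrics` provides `mean_absolute_error`, `mean_squared_error` and `r2_score`, which compute the same quantities.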
Residual analysis was performed to check for model bias, heteroscedasticity, and distributional behavior.
The following diagnostic plots were generated -
Shows whether residuals follow a normal distribution.
Helps identify skewness or model bias.
Highlights outliers that the model struggles to predict.
Evaluates homoscedasticity (constant variance).
Ideally, residuals should scatter randomly around zero.
Patterns in this plot suggest missing features or non-linearity.
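Together, these diagnostics help confirm that the model's errors are unbiased and have roughly constant variance, which supports its reliability alongside the held-out metrics.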
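The two kinds of plot described above (a residual distribution and a residuals-versus-predicted scatter) could be produced roughly as follows. This is a sketch under assumptions: the helper name `plot_residual_diagnostics` is hypothetical, the demo data is synthetic, and the project's actual plots may have been built differently (for example with Seaborn or Plotly).

```python
import matplotlib

matplotlib.use("Agg")  # headless-safe backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np


def plot_residual_diagnostics(y_true, y_pred):
    """Draw a residual histogram and a residuals-vs-predicted scatter plot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

    # Left panel: shape of the residual distribution (skew, outliers).
    ax_hist.hist(residuals, bins=20)
    ax_hist.set(title="Residual distribution", xlabel="Residual", ylabel="Count")

    # Right panel: residuals should scatter evenly around the zero line.
    ax_scatter.scatter(y_pred, residuals, s=12, alpha=0.7)
    ax_scatter.axhline(0.0, color="black", linewidth=1)
    ax_scatter.set(
        title="Residuals vs. predicted", xlabel="Predicted score", ylabel="Residual"
    )

    fig.tight_layout()
    return fig, residuals


# Synthetic demo data (not project results): predictions plus symmetric noise.
rng = np.random.default_rng(1)
y_pred_demo = rng.uniform(40, 90, size=200)
y_true_demo = y_pred_demo + rng.normal(0, 3, size=200)
fig, residuals = plot_residual_diagnostics(y_true_demo, y_pred_demo)
```

A funnel shape in the right-hand panel would point to non-constant variance, while a curve would hint at a non-linear effect the model has missed.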
Both models provide insights into which factors most affect urban sustainability -
Renewable energy usage.
Pollution control metrics.
Public transport efficiency.
Waste management quality.
Green cover.
This analysis helps policymakers prioritize impactful interventions.
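As an illustration of how such a ranking can be read off a fitted tree ensemble, here is a minimal sketch using scikit-learn's impurity-based `feature_importances_`. The column names, the synthetic data and the choice of `RandomForestRegressor` are assumptions for the example; the toy target is built so that the first feature dominates, which says nothing about the project's real findings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical column names, chosen only for this example.
feature_names = [
    "renewable_energy_usage",
    "green_cover_pct",
    "public_transport_efficiency",
]

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(400, len(feature_names)))

# Assumed toy target in which the first feature matters most by construction.
y = 0.8 * X[:, 0] + 0.3 * X[:, 1] + 0.05 * X[:, 2] + rng.normal(0, 1, 400)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1; a higher value means
# the feature accounted for more of the variance reduction across the trees.
ranking = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances can be skewed by correlated or high-cardinality features, so `sklearn.inspection.permutation_importance` is a common cross-check.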
Cities are grouped using KMeans and hierarchical clustering, with PCA for dimensionality reduction.
| Category | Tools |
|---|---|
| Languages | Python |
| Visualization | Matplotlib, Seaborn, Plotly |
| Data Processing | Pandas, NumPy |
| Machine Learning | Scikit-Learn, XGBoost |
| Clustering | KMeans, Hierarchical Clustering |
| Dimensionality Reduction | PCA |
| Environment | Kaggle Notebook |
1) Cities with higher renewable energy usage significantly outperform others in sustainability scores.
2) Air quality and green cover are the two most dominant factors.
3) PCA & clustering revealed 3 distinct types of sustainable cities -
Renewable-heavy cities.
Cities with high green cover but poor air quality.
Balanced mid-tier sustainability cities.
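The PCA-plus-clustering step behind a finding like this might look roughly like the sketch below. It is an assumption-laden illustration: the three synthetic city profiles are invented to loosely mirror the groups named above, the features are limited to three hypothetical columns, and the project's real preprocessing and parameters are not shown in the post.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Three synthetic city profiles (hypothetical).
# Columns: green cover %, renewable energy share %, air quality index.
centers = np.array(
    [
        [30.0, 80.0, 60.0],   # renewable-heavy
        [70.0, 30.0, 140.0],  # high green cover, poor air quality
        [45.0, 45.0, 90.0],   # balanced mid-tier
    ]
)
X = np.vstack([center + rng.normal(0, 4, size=(50, 3)) for center in centers])

# Standardize first so no single feature dominates the distance computation.
X_scaled = StandardScaler().fit_transform(X)

# Project to two principal components, e.g. for a 2-D scatter plot.
components = PCA(n_components=2).fit_transform(X_scaled)

# Group cities into three clusters on the standardized features.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=7)
labels = kmeans.fit_predict(X_scaled)

cluster_sizes = np.bincount(labels, minlength=3)
print(cluster_sizes)
```

Plotting `components` colored by `labels` gives a quick visual check of cluster separation; on real data, the number of clusters would normally be chosen with something like an elbow plot or silhouette scores instead of being fixed in advance.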
This project demonstrates the power of data-driven sustainability analysis, combining -
Strong EDA.
Rich interactive visualizations.
Dimensionality reduction.
Clustering.
Predictive modeling.
The results provide a holistic understanding of what makes a city sustainable and how future urban spaces can be designed.
https://github.com/iamhriturajsaha/URBAN-SUSTAINABILITY-ANALYSIS