The ultimate Microsoft Fabric, Power BI, Azure AI, and SQL learning event: Join us in Stockholm, September 24-27, 2024.
Save €200 with code MSCUST on top of early bird pricing!
Find everything you need to get certified on Fabric—skills challenges, live sessions, exam prep, role guidance, and more. Get started
Please assist.
How do I deal with missing data in a data source(https://drive.google.com/file/d/1w_-i4JvdcYApxtI3QD3MUOE52NYYvcxx/view?usp=sharing ), any tips or tricks?
My analysis so far:
Reports To Code has 462 missing values.
Time Sheet Status has 90 missing values.
The missing values in the "Reports To Code" and "Time Sheet Status" columns might be of particular concern because they suggest missing administrative data that is likely expected to be present for each record.
The summary statistics for the numerical columns provide the following insights:
Actual Billable Hours For Week: Mean is around 14.23 hours with a maximum of 58.34 hours, suggesting some high variations in billable hours, which may be expected depending on workload and contracts.
Actual Hours For Week: The average is higher than billable hours at approximately 26.85 hours, with a maximum of 74.33 hours, which might include non-billable work.
Approved By: This seems to be a categorical field that might be encoded as numeric, representing different approvers or departments with IDs ranging from 4 to 172.
Billable Hours For Week: On average, around 19.15 hours, with a maximum of 93.5 hours, which is quite high and may warrant further investigation to ensure accuracy.
Daily Capacity: Most values are around 7.5 to 8 hours, aligning with typical workday expectations.
Weekly Capacity: Generally 37.5 to 40 hours, which is standard for a full-time workweek.
Weekly Leave: Averages to approximately 3.68 hours, with a maximum of 40 hours, indicating that some entries represent a full week of leave.
Columns related to specific services such as Data & Development Services - Weekly Actual Billable Hours and Professional Services - Weekly Actual Billable Hours have mean values that suggest active tracking of hours for these services, with a reasonable spread.
Utilization Rate:
Billable Utilization: The percentage of actual billable hours to total available hours.
Overall Utilization: The percentage of total actual hours (billable and non-billable) to total available hours.
Billable Hours:
Total Billable Hours: The total number of hours billed to clients.
Average Billable Hours per Employee: The average number of billable hours per employee or per role.
Capacity Utilization:
Average Daily Capacity Utilization: Average number of hours worked per day divided by the daily capacity.
Weekly Capacity Utilization: Total actual hours for the week divided by the total weekly capacity.
Revenue Efficiency:
Revenue per Billable Hour: Total revenue divided by total billable hours.
Revenue per Employee: Total revenue divided by the number of employees.
Leave and Absenteeism:
Average Weekly Leave: The average hours of leave taken per week.
Leave Rate: The percentage of hours taken as leave out of total available hours.
Project Performance:
Solved! Go to Solution.
Hi @Yrstruly2021 ,
Several things you can do:
At the end of the day, it is your call which approach to follow. If this is work-related, there usually there usually are company guidelines on what to do with missing data. If not, why not ask your immediate superior?
Proud to be a Super User!
Hi, @Yrstruly2021
May I ask you to solve your problem, you can refer to @danextian suggestion, if there is still confusion you can refer to the following suggestion about dealing with missing values
1.Identify the Types of Missing Data(There are three main types of missing data)
MCAR (Missing Completely at Random): In this scenario, the missingness occurs randomly across all variables and observations. It’s like losing Lego pieces during a game—each missing piece doesn’t affect the information about other pieces.
MAR (Missing at Random): Here, the probability of missing data depends on other variables in the dataset. For example, in a survey, data scientists who don’t frequently upgrade their skills might skip certain questions related to new technologies.
MNAR (Missing Not at Random): This is the most challenging scenario. The probability of missing data varies for different values of the same variable, and the reasons can be unknown to us (e.g., couples with a bad relationship avoiding certain questions in a survey).
2.Techniques to Handle Missing Data:
Imputation: Impute missing values with estimated or predicted values. Common methods include mean imputation, median imputation, or regression-based imputation.
Dropping Rows: If the missing data is minimal, consider removing rows with missing values. However, be cautious not to lose too much information.
Categorical Variables:
For categorical variables, consider creating a new category for missing values.
Alternatively, use predictive models to impute missing categorical values.
Numerical Variables:
For numerical variables, impute missing values using statistical measures (mean, median, etc.).
Consider using machine learning models (e.g., k-nearest neighbors) for imputation.
Advanced Techniques:
Multiple Imputation: Generate multiple imputed datasets and combine results.
Interpolation: Use time-series data to interpolate missing values.
Deep Learning: Train neural networks to predict missing values.
3.Tips and Tricks:
Limit Follow-ups: Minimize the number of follow-ups during data collection to reduce missing data.
User-Friendly Forms: Design user-friendly data collection forms to encourage complete responses.
Data Validation: Implement validation techniques to catch errors during data entry.
Offer Incentives: Provide incentives to participants for accurate and complete data submission.
Best Regards,
Yang
Community Support Team
If there is any post helps, then please consider Accept it as the solution to help the other members find it more quickly.
If I misunderstand your needs or you still have problems on it, please feel free to let us know. Thanks a lot!
How to get your questions answered quickly -- How to provide sample data in the Power BI Forum
Hi, @Yrstruly2021
May I ask you to solve your problem, you can refer to @danextian suggestion, if there is still confusion you can refer to the following suggestion about dealing with missing values
1.Identify the Types of Missing Data(There are three main types of missing data)
MCAR (Missing Completely at Random): In this scenario, the missingness occurs randomly across all variables and observations. It’s like losing Lego pieces during a game—each missing piece doesn’t affect the information about other pieces.
MAR (Missing at Random): Here, the probability of missing data depends on other variables in the dataset. For example, in a survey, data scientists who don’t frequently upgrade their skills might skip certain questions related to new technologies.
MNAR (Missing Not at Random): This is the most challenging scenario. The probability of missing data varies for different values of the same variable, and the reasons can be unknown to us (e.g., couples with a bad relationship avoiding certain questions in a survey).
2.Techniques to Handle Missing Data:
Imputation: Impute missing values with estimated or predicted values. Common methods include mean imputation, median imputation, or regression-based imputation.
Dropping Rows: If the missing data is minimal, consider removing rows with missing values. However, be cautious not to lose too much information.
Categorical Variables:
For categorical variables, consider creating a new category for missing values.
Alternatively, use predictive models to impute missing categorical values.
Numerical Variables:
For numerical variables, impute missing values using statistical measures (mean, median, etc.).
Consider using machine learning models (e.g., k-nearest neighbors) for imputation.
Advanced Techniques:
Multiple Imputation: Generate multiple imputed datasets and combine results.
Interpolation: Use time-series data to interpolate missing values.
Deep Learning: Train neural networks to predict missing values.
3.Tips and Tricks:
Limit Follow-ups: Minimize the number of follow-ups during data collection to reduce missing data.
User-Friendly Forms: Design user-friendly data collection forms to encourage complete responses.
Data Validation: Implement validation techniques to catch errors during data entry.
Offer Incentives: Provide incentives to participants for accurate and complete data submission.
Best Regards,
Yang
Community Support Team
If there is any post helps, then please consider Accept it as the solution to help the other members find it more quickly.
If I misunderstand your needs or you still have problems on it, please feel free to let us know. Thanks a lot!
How to get your questions answered quickly -- How to provide sample data in the Power BI Forum
Hi @Yrstruly2021 ,
Several things you can do:
At the end of the day, it is your call which approach to follow. If this is work-related, there usually there usually are company guidelines on what to do with missing data. If not, why not ask your immediate superior?
Proud to be a Super User!
Join the community in Stockholm for expert Microsoft Fabric learning including a very exciting keynote from Arun Ulag, Corporate Vice President, Azure Data.
Check out the August 2024 Power BI update to learn about new features.
Learn from experts, get hands-on experience, and win awesome prizes.
User | Count |
---|---|
113 | |
77 | |
76 | |
47 | |
38 |
User | Count |
---|---|
147 | |
115 | |
65 | |
64 | |
53 |