Solved: Re: Calculating Coefficient of Variation

Schmidtmayer · ‎03-13-2020

Hello everyone 😃

I do have a table containing the following columns:

Personal Number, Date, Team, Project, TimeCategory, Time

There are multiple Personal Numbers per Team, multiple Teams per Project and also multiple TimeCategories per Date and Personal Number.

It is my goal to determine whether the high qutoas of illness are due to some extreme examples of high illness rates or a case of illness rates being high in general.

In order to answer this question, I want to use a statistical tool called the coefficient of variation. This is the standard deviation divided by the expected value.

https://en.wikipedia.org/wiki/Coefficient_of_variation

Because of different contract lengths, I will not use some equal distribution of the quotas.
I will use the illness quota per employee as a random variable with probability being equal to the proportion of the employees hours to the hours of the team/project within the selected time period.

Times relevant may be filtered using the column TimeCategory.

I know how to use CALUCATE() im combination with FILTER(), ALL(), ALLEXCEPT(), but I am lost here.
My problem is: The data must be dynamically preaggregated by Personal Number in order to get the right results. And I do not not know how to do this.

The formula itself will look like this:

CV = (((Illness Quota of employee A- Illness Quota of Team)^2)*(Total Hours of Employee A)/(Total Hours of Team) +.....+((Illness Quota of employee Y- Illness Quota of Team)^2)*(Total Hours of Employee Y)/(Total Hours of Team) )/(IllnessQuota of Team)

As said: My problem is the fact, that the data are not preaggregated by Personal Number.

Need help!

Anonymous · ‎03-13-2020

All filtering is one-way only from a dimension to the fact table.

// Dimensions:
// 		Employee connected to FactTable[PIN] (1:*)
// 		Calendar connected to FactTable[Date] (1:*)
// 		Team connected to FactTable[TeamId] (1:*)
// 		Project connected to FactTable[ProjectId] (1:*)
// 		TimeCategory connected to FactTable[TimeCategoryId] (1:*)
// All *Id fields are hidden in dimensions.
//
// All columns in the FactTable must be hidden.
// Only measures can be visible. All slicing is
// done through dimensions. This is the correct
// star schema model.

// Define the following measures. They work for any
// slice for any dimension. In particular for 
// slices on Employee.

[Total] = SUM( FactTable[Hours] )

[Illness] =
	CALCULATE(
		[Total],
		KEEPFILTERS( TimeCategory[TimeCategoryId] = "Illness" )
	)

[IllnessQuota] = DIVIDE( [Illness], [Total] )
	
[Proportion To Team] =
var __totalForEmps = [Total]
var __teamsOfEmps =
	summarize(
		FactTable,
		Team[TeamId]
	)
var __totalForTeams =
	calculate(
		[Total],
		__teamsOfEmps,
		all( Employee ),
		all( Team )
	)
var __result =
	divide( __totalForEmps, __totalForTeams )
return
	__result
	
	
[VC for Team] =
var __oneTeamVisible = hasonevalue( Team[TeamId] )
var __team = 
	SUMMARIZE(
		FactTable,
		Team[TeamId]
	)
var __employees =
	SUMMARIZE(
		FactTable,
		Employee[PIN]
	)
var __numerator =
	SQRT(
		SUMX(
		
			__employees,
			
			var __iqForEmp = [IllnessQuota]
			var __iqForTeam =
				calculate(
					[IllnessQuota],
					__team,
					all( Team ),
					all( Employee )
				)
			var __propToTeam = [Proportion To Team]
			var __result =
				__propToTeam
					* POWER( __iqForEmp - __iqForTeam, 2 )
			return
				__result
			
		)
	)
var __denominator = 
	calculate(
		[IllnessQuota],
		__team,
		all( Team ),
		all( Employee )
	)
var __varCoeff =
	DIVIDE( __numerator, __denominator )
return
	if( __oneTeamVisible, __varCoeff )

Once you've implemented the correct model, please let me know how it goes. Thanks 🙂

Best

D

View solution in original post

Greg_Deckler · ‎03-13-2020

I'll have to look, I believe I cover Covariance in my upcoming book, DAX Cookbook. Comes out next week. But would need sample data. Please see this post regarding How to Get Your Question Answered Quickly: https://community.powerbi.com/t5/Community-Blog/How-to-Get-Your-Question-Answered-Quickly/ba-p/38490

Follow on LinkedIn
@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!: Power BI Cookbook Third Edition (Color)

DAX is easy, CALCULATE makes DAX hard...

Greg_Deckler · ‎03-13-2020

Ah yes, here is the formula that I was using for covariance.

Covariance = 
    VAR __Table = 'R04_Table'
    VAR __Count = COUNTROWS(__Table)
    VAR __AvgA = AVERAGEX(__Table,[A])
    VAR __AvgB = AVERAGEX(__Table,[B])
    VAR __Table1 = 
        ADDCOLUMNS(
            __Table,
            "__Covariance",
            DIVIDE(
                ([A] - __AvgA) * ([B] - __AvgB),
                __Count
            )
        )
RETURN
    SUMX(__Table1,[__Covariance])

Follow on LinkedIn
@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!: Power BI Cookbook Third Edition (Color)

DAX is easy, CALCULATE makes DAX hard...

Anonymous · ‎03-13-2020

Hi there.

Mate, would you please make this more understandable? Please bear in mind that we are not working with your model and what's easily graspable to you is not necessarily easy to grasp for us. You are referring in your formulas to fields that are nowhere to be found in your description. The table at the top you are mentioning does not give us almost any information at all. For instance, where and what is 'illness quota of employee'? I understand there are dimensions and some fact tables in your model but could you be more explicit, please? State what tables you have, what measures, what columns... and how they are linked together.

If ppl don't understand your problem, you won't get any replies.

Best
D

Schmidtmayer · ‎03-13-2020

First of all, thanks for the reply.
I will be more specific.

There is only one table til now, having the mentioned columns:

Personal Number - Unique ID to identify the employees
Date - no further infos required, I suppose
Team - the current team of the employee
Project -the current project of the employee
TimeCategory - Classification of hours, includes the categories Illness, Productive, Vacation
Hours - Registered time of the employee

In the following let X be the set of Personal Numbers, T be the set of Teams

For x from X we define:

Total(x) = CALCULATE(SUM(Hours); Personal Number = x)
Illness(x) = CALCULATE(SUM(Hours); Personal Number =x; TimeCategory = Illness)

For t from T we define:

Total(t) = CALCULATE(SUM(Hours); Team = t)
Illness(t) = CALCULATE(SUM(Hours); Team = t; TimeCategory = Illness)

Now, the quotas:
IllnessQuota(x) = Illness(x)/Total(x)
IllnessQuota(t) = Illness(t)/Total(t)

Lastly:
Proportion(x) = Total(x)/Total(t)

Remark: Every employee is a member of just one team.

Then I wish to calculate

VC(t) =
SQRT(
SUM(
(IllnessQuota(x) - IllnessQuota(t))^2×Proportion(x)
)
)
/IllnessQuota(t)

Just these x shall be included, which where a part of t at this time.

A table with the following is also present:

Personal Number, Team, Start

Giving the complete history of teams. The current team of x is mentioned in the first table, so I think this one should not be needed.

CV(t) should be visualized in a bar diagramm having years on the x axis, with Drill Down to month and having a filter for teams.

Anonymous · ‎03-13-2020

All filtering is one-way only from a dimension to the fact table.

// Dimensions:
// 		Employee connected to FactTable[PIN] (1:*)
// 		Calendar connected to FactTable[Date] (1:*)
// 		Team connected to FactTable[TeamId] (1:*)
// 		Project connected to FactTable[ProjectId] (1:*)
// 		TimeCategory connected to FactTable[TimeCategoryId] (1:*)
// All *Id fields are hidden in dimensions.
//
// All columns in the FactTable must be hidden.
// Only measures can be visible. All slicing is
// done through dimensions. This is the correct
// star schema model.

// Define the following measures. They work for any
// slice for any dimension. In particular for 
// slices on Employee.

[Total] = SUM( FactTable[Hours] )

[Illness] =
	CALCULATE(
		[Total],
		KEEPFILTERS( TimeCategory[TimeCategoryId] = "Illness" )
	)

[IllnessQuota] = DIVIDE( [Illness], [Total] )
	
[Proportion To Team] =
var __totalForEmps = [Total]
var __teamsOfEmps =
	summarize(
		FactTable,
		Team[TeamId]
	)
var __totalForTeams =
	calculate(
		[Total],
		__teamsOfEmps,
		all( Employee ),
		all( Team )
	)
var __result =
	divide( __totalForEmps, __totalForTeams )
return
	__result
	
	
[VC for Team] =
var __oneTeamVisible = hasonevalue( Team[TeamId] )
var __team = 
	SUMMARIZE(
		FactTable,
		Team[TeamId]
	)
var __employees =
	SUMMARIZE(
		FactTable,
		Employee[PIN]
	)
var __numerator =
	SQRT(
		SUMX(
		
			__employees,
			
			var __iqForEmp = [IllnessQuota]
			var __iqForTeam =
				calculate(
					[IllnessQuota],
					__team,
					all( Team ),
					all( Employee )
				)
			var __propToTeam = [Proportion To Team]
			var __result =
				__propToTeam
					* POWER( __iqForEmp - __iqForTeam, 2 )
			return
				__result
			
		)
	)
var __denominator = 
	calculate(
		[IllnessQuota],
		__team,
		all( Team ),
		all( Employee )
	)
var __varCoeff =
	DIVIDE( __numerator, __denominator )
return
	if( __oneTeamVisible, __varCoeff )

Once you've implemented the correct model, please let me know how it goes. Thanks 🙂

Best

D

Schmidtmayer · ‎03-15-2020

@Anonymous Thanks a bunch.
SUMMARIZE was all the thing I actually needed. Was able to put it all into one smaller measure:

[coefficient of variation] =

var __tablepersonalnumber =

SUMMARIZE(

FactsTable

FactsTable[PersonalNumber];

"Total"; [Total];

"Illness"; [Illness];

"Quota"; [IllnessQuota]
)

var __tablecomplete =
SUMMARIZE (
FactTable;

"TotalGroup" ;[Total];

"IllnessGroup"; [Illness];

"QuotaGroup"; [IllnessQuota]

)

var __tablejoin =

CROSSJOIN(__tablecomplete; __tablepersonalnumber)

RETURN SQRT(SUMX(__tablejoin; (Quota - QuotaGroup)*(Quota-QuotaGroup)*Total/TotalGroup))/AVERAGEX(__tablecomplete; QuotaGroup)

Illness, Total and IllnessQuota defined as in your post.

@Greg_Deckler : Your measure was not that useful, as the covariance describes the dependence/independence of 2 random variables. Here, I do only have one random variable, being the quotas of illness of the employees. Maybe I will need it later. Thanks for that 😃

Greg_Deckler · ‎03-15-2020

Yeah, sorry about that @Schmidtmayer , read that one too quickly!

Follow on LinkedIn
@ me in replies or I'll lose your thread!!!
Instead of a Kudo, please vote for this idea
Become an expert!: Enterprise DNA
External Tools: MSHGQM
YouTube Channel!: Microsoft Hates Greg
Latest book!: Power BI Cookbook Third Edition (Color)

DAX is easy, CALCULATE makes DAX hard...

Anonymous · ‎03-15-2020

Hi there. Before you think you've got the solution right, have a look at this article:

https://www.sqlbi.com/articles/all-the-secrets-of-summarize/

Here's an excerpt from it:

UPDATE 2018-01-24 : The content of this article is obsolete as of January 2018. Recent versions of Excel 2016, Power BI, and Analysis Services have a SUMMARIZE behavior that is different from the one described in this article. As noted below, using SUMMARIZE should be deprecated for aggregations and you should use SUMMARIZECOLUMNS instead. Read more in Introducing SUMMARIZECOLUMNS.

Disregard the "obsolete" note. It's not relevant to your case. You should not use SUMMARIZE to do any calculations inside it. Instead, you should always use the SUMMARIZE/ADDCOLUMNS combination.

Please read the article if you want to be sure your DAX is always correct. But there's more to it. You should also have the proper dimensional design (as I outlined in my measure in the preamble) if you want to be safe in the knowledge that your measures always work correctly in any circumstances. If you want to know what can happen if you don't do this, then go and read this: https://www.sqlbi.com/articles/understanding-dax-auto-exist/

Be careful with how you structure your models because the correctness of figures DAX produces depends on it.

Best
D

Schmidtmayer · ‎03-16-2020

@Anonymous:

Thanks for pointing out. I was aware of this during testing of my measure. All the values were complete garbage the first time I tried it. That's why I used the second table for the values of the whole group I am looking at and crossjoining it. Til now, everything works fine. Also checked the values via pocket calculator.

Regaring the DataModell: Sorry for ignoring it. We are using this model as a standard within our departement, so I didn't answer this.

Greetings Tom

Anonymous · ‎03-16-2020

Hi there.

Just the last word of warning: If you have a model that consists of one big fact table, then you'll never know when the problem with auto-exist will bite you because it's almost random (it's not really but whether it happens or not depends on data distribution in your table).

So, in a word, you are risking producing wrong numbers without even knowing it. If you and your department can live with that... then all is good.

Best
D

Calculating Coefficient of Variation

Helpful resources

Join our Fabric User Panel

Power BI Monthly Update - June 2025

Fabric Community Update - June 2025

Join the #PBI10 DataViz contest

Calculating Coefficient of Variation

Helpful resources

Join our Fabric User Panel

Power BI Monthly Update - June 2025

Fabric Community Update - June 2025