freswood
Helper I

Decision Trees in Power BI

Hi all, has anybody tried to use the Decision Tree custom visual? Whenever I try to use it (even with < 150k rows as input), the output is completely different from what I get in RStudio. Even when using exactly the same input and the same settings, the percentages in the top node are completely different. E.g. in one scenario I had a true/false predictive variable, and in RStudio it was 98/2 and in Power BI it was 52/48.

 

I'm curious to know whether anyone else has had this experience. Thanks for your help!

1 ACCEPTED SOLUTION

Hey @freswood @Regulate,

 

I came across this same issue today.

In case you're still looking for a solution, here's what I have figured out so far:

 

Every time you associate a target variable and select a few input variables, Power BI internally creates a data-table from your original set that contains only the fields you have selected.

It then goes and deletes all duplicate rows.

 

Assume your original input data is a 5x5 table like:

Row | Distinct Var A | Input B | Input C | Input D | Target
1 | a | 1 | 2 | 1 | TRUE
2 | b | 1 | 2 | 2 | TRUE
3 | c | 2 | 1 | 1 | FALSE
4 | d | 2 | 3 | 1 | TRUE
5 | e | 0 | 1 | 1 | FALSE



At this point, you have 60% True, 40% False.

Let's say you set your target and pick only Input D as the input variable.

Your internal data-table being used for the decision tree looks like:

Input D | Target
1 | TRUE
2 | TRUE
1 | FALSE

Now, you have 66.67% True, 33.33% False.

 

Let's say you now use Input C as an additional input variable.
Your new internal data-table for the decision tree looks like:

Input C | Input D | Target
2 | 1 | TRUE
2 | 2 | TRUE
1 | 1 | FALSE
3 | 1 | TRUE

Following along, you're now at 75% True, 25% False.

 

This explains why the values in your root node keep changing every time you modify the input variables.
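
The arithmetic above can be reproduced with a small sketch. It only simulates the behavior described in this post (keep the selected fields, then drop duplicate rows); it is not the visual's actual code, and the names `project_and_dedupe` and `true_share` are made up for illustration.

```python
# Toy data from the post: (Distinct Var A, Input B, Input C, Input D, Target).
COLUMNS = ("A", "B", "C", "D", "Target")
ROWS = [
    ("a", 1, 2, 1, True),
    ("b", 1, 2, 2, True),
    ("c", 2, 1, 1, False),
    ("d", 2, 3, 1, True),
    ("e", 0, 1, 1, False),
]


def project_and_dedupe(rows, columns, keep):
    """Keep only the selected columns, then drop duplicate rows.

    Assumption: this mimics what the post says Power BI does internally.
    The first occurrence of each distinct row is kept.
    """
    idx = [columns.index(name) for name in keep]
    projected = [tuple(row[i] for i in idx) for row in rows]
    return list(dict.fromkeys(projected))


def true_share(rows):
    """Percentage of rows whose last field (the target) is True."""
    return 100.0 * sum(1 for row in rows if row[-1]) / len(rows)


full = true_share(ROWS)
only_d = true_share(project_and_dedupe(ROWS, COLUMNS, ["D", "Target"]))
c_and_d = true_share(project_and_dedupe(ROWS, COLUMNS, ["C", "D", "Target"]))

print(f"{full:.2f} {only_d:.2f} {c_and_d:.2f}")  # 60.00 66.67 75.00
```

The three numbers match the 60%, 66.67% and 75% worked out by hand, so the drift in the root node is consistent with duplicates being removed after the fields are selected.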

I haven't yet been able to figure out a way to make this funny (maybe as designed, but I don't like it) behavior stop.

As a workaround, I'm considering including a variable that is unique for each record. Ideally, the tree should never split on that variable, and it ensures that all records in your data are considered, since there won't be any duplicates.
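
A minimal sketch of why the unique-variable workaround should help, again assuming the "select fields, then drop duplicates" behavior described in this post. The helper name `distinct_rows` is hypothetical, and this has not been checked against the visual itself.

```python
# Same toy data as in the post: (Distinct Var A, Input B, Input C, Input D, Target).
COLUMNS = ("A", "B", "C", "D", "Target")
ROWS = [
    ("a", 1, 2, 1, True),
    ("b", 1, 2, 2, True),
    ("c", 2, 1, 1, False),
    ("d", 2, 3, 1, True),
    ("e", 0, 1, 1, False),
]


def distinct_rows(rows, columns, keep):
    """Project onto the selected columns and drop duplicate rows (assumed behavior)."""
    idx = [columns.index(name) for name in keep]
    return list(dict.fromkeys(tuple(row[i] for i in idx) for row in rows))


# Without the unique column, two of the five records collapse into others.
without_id = distinct_rows(ROWS, COLUMNS, ["D", "Target"])

# With the unique column A included, every row is distinct, so nothing is dropped.
with_id = distinct_rows(ROWS, COLUMNS, ["A", "D", "Target"])

share_with_id = 100.0 * sum(1 for row in with_id if row[-1]) / len(with_id)

print(len(without_id), len(with_id))  # 3 5
print(share_with_id)  # 60.0
```

With the unique column included, the root share goes back to the 60% of the original data. One thing worth checking in practice is that the fitted tree does not actually split on the unique variable, since it carries no real signal.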

 

Let me know if this makes sense/works. Cheers!

 

KK


7 REPLIES
Regulate
Regular Visitor

I'm having the same problem. In fact, the percentages in the top node change drastically when I try different input variables. But shouldn't the top node be the same regardless of the input variables?

 

I have a True/False target variable, and the split in the data is obviously always the same (about 50/50), yet the percentages in the top node change when I add or delete input variables. The percentages in the top node are sometimes 2%/98%, sometimes 40%/60%, or anything in between. They are never the same as in the data itself, or if they are, it is by chance.

 

Any help? How can I trust that the decision tree involves all the data if the percentage in the top node changes all the time?

Thanks @thekkm13 for figuring this out! As you can see this post has been open for a long time, so it's great to finally have an answer. Please do let us know if you end up trying out the workaround 🙂 Thanks again.

v-yuezhe-msft
Microsoft Employee

@freswood,

Could you please share sample data of your table?

Regards,
Lydia


Hello,

 

Relating to the last question, we had the same issue with our data: the percentage is wrong compared to what we find in the source. In our example, the share of defects is around 24% in the raw file, but in Power BI, after loading the data, it is about 47%. Can you explain where the problem might be?

I tried to attach the file, but that is not possible on this platform.

Can you give me your professional e-mail address so I can send the file?

 

Sincerely,

 

Nabil

Hello,

 

Relating to the last question, we had the same issue with our data: the percentage is wrong compared to what we find in the source. In our example, the share of defects is around 24% in the raw file, but in Power BI, after loading the data, it is about 47%. Can you explain where the problem might be?

I tried to attach the file, but that is not possible on this platform.

Can you give me your professional e-mail address so I can send the file?

 

Sincerely,

 

O.N.

Hi Lydia, unfortunately not, because the data is commercially sensitive. However, I'm hoping that perhaps other people have had similar experiences.
