Hi all, has anybody tried to use the Decision Tree custom visual? Whenever I use it (even with fewer than 150k rows as input), the result is completely different from the output I get in RStudio. Even with exactly the same input and the same settings, the percentages in the top node are completely different. E.g. in one scenario I had a true/false target variable, and in RStudio the split was 98/02 while in Power BI it was 52/48.
I'm curious to know whether anyone else has had this experience. Thanks for your help!
I'm having the same problem. In fact, the percentages in the top node change drastically when I try different input variables. But shouldn't the top node be the same regardless of the input variables?
I have a True/False target variable, and the division in the data is obviously always the same (about 50/50), yet the percentages in the top node change when I add or delete input variables. The top-node percentages are sometimes 2%/98%, sometimes 40%/60%, or anything in between. They're never the same as in the data itself, or if they are, it's only by chance.
Any help? How can I trust that the decision tree involves all the data if the percentage in the top node changes all the time?
I came across this same issue today.
In case you're still looking for a solution, here's what I have figured out so far:
Every time you associate a target variable and select a few input variables, Power BI internally creates a data-table from your original set that contains only the fields you have selected.
It then goes and deletes all duplicate rows.
Assume your original input data is a 5x5 table like:
| Distinct Var A | Input B | Input C | Input D | Target |
|---|---|---|---|---|
| a | 1 | 2 | 1 | TRUE |
| b | 1 | 2 | 2 | TRUE |
| c | 2 | 1 | 1 | FALSE |
| d | 2 | 3 | 1 | TRUE |
| e | 0 | 1 | 1 | FALSE |
At this point, you have 60% True, 40% False.
Let's say you set your target and pick only Input D as the input variable.
Your internal data-table being used for the decision tree looks like:
| Input D | Target |
|---|---|
| 1 | TRUE |
| 2 | TRUE |
| 1 | FALSE |
Now, you have 66.67% True, 33.33% False.
Let's say you now use Input C as an additional input variable.
Your new internal data-table for the decision tree looks like:
| Input C | Input D | Target |
|---|---|---|
| 2 | 1 | TRUE |
| 2 | 2 | TRUE |
| 1 | 1 | FALSE |
| 3 | 1 | TRUE |
Following along, you're now at 75% True, 25% False.
This explains why the values in your root node keep changing every time you modify the input variables.
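If you want to check this behaviour outside of Power BI, here's a minimal sketch in Python/pandas that mimics what the visual appears to do: keep only the chosen fields, drop duplicate rows, and then look at the Target split. It just reproduces the percentages from the example above; it isn't the visual's actual internal code, and the helper name `root_node_split` is made up for illustration.

```python
import pandas as pd

# The 5x5 example table from above.
df = pd.DataFrame({
    "Distinct Var A": ["a", "b", "c", "d", "e"],
    "Input B": [1, 1, 2, 2, 0],
    "Input C": [2, 2, 1, 3, 1],
    "Input D": [1, 2, 1, 1, 1],
    "Target": [True, True, False, True, False],
})

def root_node_split(data, input_cols, target="Target"):
    """Keep only the selected fields, drop duplicate rows, and return
    the Target proportions the root node would then show."""
    deduped = data[input_cols + [target]].drop_duplicates()
    return deduped[target].value_counts(normalize=True)

print(df["Target"].value_counts(normalize=True))       # raw data: 60% True / 40% False
print(root_node_split(df, ["Input D"]))                 # 66.67% / 33.33%
print(root_node_split(df, ["Input C", "Input D"]))      # 75% / 25%
```

The raw data gives the 60/40 split you'd expect to see at the top of the tree, while the deduplicated versions give the shifting root-node percentages the visual shows.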
I haven't yet been able to figure out a way to stop this odd behavior (it may be by design, but I don't like it).
As a workaround, I'm considering including a variable that is unique for each record. Ideally, the tree should never split on that variable, and it ensures that all records of your data are considered, since there won't be any duplicate rows to remove. There's a quick sketch of the idea below.
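Here's that workaround idea in the same pandas sketch style as above (the `RowID` column name is hypothetical): once every row carries a unique ID, dropping duplicates removes nothing, so the root split should match the raw data again.

```python
import pandas as pd

# Same 5-row example, plus a RowID column that is unique per record (hypothetical name).
df = pd.DataFrame({
    "Input D": [1, 2, 1, 1, 1],
    "Target": [True, True, False, True, False],
})
df["RowID"] = range(len(df))

# Because RowID never repeats, no two rows are duplicates, so dropping
# duplicates over (RowID, Input D, Target) keeps all five rows and the
# root-node split stays at the raw 60% True / 40% False.
deduped = df[["RowID", "Input D", "Target"]].drop_duplicates()
print(deduped["Target"].value_counts(normalize=True))
```

Whether the tree inside the visual then actually ignores that ID column when splitting is a separate question, of course.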
Let me know if this makes sense/works. Cheers!
KK
Thanks @thekkm13 for figuring this out! As you can see this post has been open for a long time, so it's great to finally have an answer. Please do let us know if you end up trying out the workaround 🙂 Thanks again.
@freswood,
Could you please share sample data of your table?
Regards,
Lydia
Hello,
Related to the last question, we had the same issue with our data: the percentage is wrong compared to what we find in the source. In our example, the share of defects is around 24% in the raw file, but after loading the data into Power BI it is closer to 47%. Can you explain where the problem might be?
I tried to attach the file, but that isn't possible on this platform.
Could you give me a professional e-mail address so I can send you the file?
Sincerely,
Nabil
Hi Lydia, unfortunately not, because the data is commercially sensitive. However, I'm hoping that other people have had similar experiences.