The ultimate Fabric, Power BI, SQL, and AI community-led learning event. Save €200 with code FABCOMM.
Get registeredCompete to become Power BI Data Viz World Champion! First round ends August 18th. Get started.
Something very odd going on.
I am connected to a Databricks source in Power BI. In import mode.
I have not yet joined the table to any others, either with merges in Power Query or relationships with other tables so this is not an issue of incorrect joins or poorly established relationships.
Despite there being no duplications in my source data I am getting duplications at the Power BI end in a relatively small number of cases.
I have refreshed the import multiple times in case there was an import error, but no resolution so far.
Power query steps currently include the removal of unnecessary column, removal of rows based on a condition, and I have also applied a remove duplicates step.
Any suggestions greatly appreciated. Thanks.
Solved! Go to Solution.
I found the answer! As it turned out it was minor differences in upper/lower case that weren't immediately visible from the way in which I was inspecting the data at both ends. In Power BI visuals the differences in capitalisation weren't showing (like in Excel when you have a filter "today" and "Today" will be treated the same) meaning that the count for the value was duplicated, even though it would have been a count of one against each variation. Then of course at the Databricks end, the querying was case sensitive so I was only returning the specific version of the data with that capitalisation when I tried to find the relevant rows.
Eventually found it all once I moved to pulling the row counts for everything from both Databricks and PBI and comparing in Excel.
What a ride!
Thanks for your offers of possible things to test!
How many columns are you highlighting with the remove duplciates step?
If more than one it will be looking at the unique combinations.
Assuming you are selecting only one is there a chance of whitespace being the difference between the two rows?
Try filtering in the QE to two of the rows that are duplicating. See if there are any differences in the QE rows compared to the table view versions.
If you are happy with this answer please mark as a solution for others to find !
Kudos are always appreciated! Check out our free Power BI video courses.
I found the answer! As it turned out it was minor differences in upper/lower case that weren't immediately visible from the way in which I was inspecting the data at both ends. In Power BI visuals the differences in capitalisation weren't showing (like in Excel when you have a filter "today" and "Today" will be treated the same) meaning that the count for the value was duplicated, even though it would have been a count of one against each variation. Then of course at the Databricks end, the querying was case sensitive so I was only returning the specific version of the data with that capitalisation when I tried to find the relevant rows.
Eventually found it all once I moved to pulling the row counts for everything from both Databricks and PBI and comparing in Excel.
What a ride!
Thanks for your offers of possible things to test!
Hi @amy9
Does the table have much activity on it?
There are three times I have experienced similar things in SQL:
1) Dynamic SQL running that changed the column order leading to weird matchups and duplication
2) Tables with high levels of activity featuring a lock/no lock situation: https://www.sqlshack.com/understanding-impact-clr-strict-security-configuration-setting-sql-server-2...
3) The SQL script itself was causing the duplication due to a column having returns in the data that Power BI read as two seperate rows.
If you are happy with this answer please mark as a solution for others to find !
Kudos are always appreciated! Check out our free Power BI video courses.
Thanks for the reply. It isn't a hugely active dataset (it tracks the organisation data on a weekly basis, so should only be changing theoretically once a week), but it is a view in the serving layer so perhaps something going on in there.
Odd that the remove duplicates action in power query doesn't resolve the issue either, as I would have thought that that would have dealt with it regardless.