Hi,
I'd like to understand best practice for building this dataset, as I've not worked with data at this scale before.
I'm not the owner of the data I'm using; it comes from SQL views created by our finance team. Our business is quite new to Power BI, so I don't have much help internally in terms of best practices and optimisation. Your feedback would be great!
I have a star schema like this, all with numerical index keys:
Customer_Channel 1:1 Customer
Customer 1:* Transactions
Location 1:* Transactions
Invoice Channel 1:* Transactions
Product 1:* Transactions
Calendar 1:* Transactions
In Power Query I've dropped unnecessary columns and filtered the transaction table on a date value down to the period I wanted.
In this state there are still >24 million rows to load from the transaction table.
There are many transactions I don't need in the dataset, but the attribute I would filter on lives in the related tables, e.g. keep only transactions where the related Customer_Channel[Sales office] IN ('name1','name2'), etc.
From what I know, an INNER JOIN across the tables would leave me with just the rows I want.
Is there a method I've not thought of? What's the recommended method?
Another consideration is that this data is already in the Power BI service, serving reports my colleagues have built. I understand this is where dataflows come into play? Should I try to piggy-back off their dataset somehow? If so, an overview would be really helpful.
Thanks for your advice.
Hi @Anonymous ,
In brief:
You should be able to filter your Transactions table based on the customer codes that relate back up to Customer_Channel, but I'd need to understand the values in both Customer_Channel and Customer, how they relate to one another, and how Customer relates to Transactions.
The ideal method, in my opinion, if there's no direct lineage from Transactions to Customer_Channel, would be to do this on the SQL Server side by creating a view filtered to your precise needs.
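As a rough sketch of what that server-side view could look like — table, schema, and key names here are assumptions based on the model described above, not your actual schema (Customer_Channel is 1:1 with Customer, so I'm guessing they share a customer key):

```sql
-- Hypothetical filtered view; adjust names/keys to your real schema.
CREATE VIEW dbo.vw_Transactions_Filtered AS
SELECT t.*
FROM dbo.Transactions AS t
INNER JOIN dbo.Customer AS c
    ON c.CustomerKey = t.CustomerKey
INNER JOIN dbo.Customer_Channel AS cc
    ON cc.CustomerKey = c.CustomerKey          -- assumed 1:1 key
WHERE cc.[Sales office] IN ('name1', 'name2'); -- your filter values
```

You'd then point your Power Query Transactions source at this view instead of the raw table, so only the rows you need ever leave the server.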
If you can't do the above, then the second-best method would be to connect to your Transactions table using DirectQuery. This avoids loading 24m rows into the model; Power BI will just query the server for exactly what it needs for each visual. There can be performance downsides to this depending on how complicated you make your report, and also restrictions on the transformations you can perform, but it's likely faster at this volume of rows than an import model.
Merges in Power Query are generally very expensive. You may be able to do an inner join between your tables if they're from the same server/db and get this to fold back to the server, but I've not tested it.
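As an untested sketch of that merge in Power Query M — server, database, query, and column names are all placeholders/assumptions, and both tables must come from the same SQL source for folding to be possible:

```powerquery-m
// Untested sketch; names and keys are assumptions, not your real schema.
// Keeping both sources on one Sql.Database connection gives folding a chance.
let
    Source = Sql.Database("YourServer", "YourDatabase"),
    Transactions = Source{[Schema = "dbo", Item = "Transactions"]}[Data],
    CustomerChannel = Source{[Schema = "dbo", Item = "Customer_Channel"]}[Data],
    // Filter the small dimension table first
    FilteredChannel = Table.SelectRows(
        CustomerChannel,
        each List.Contains({"name1", "name2"}, [Sales office])
    ),
    // JoinKind.Inner drops any transaction without a matching channel row
    Merged = Table.NestedJoin(
        Transactions, {"CustomerKey"},
        FilteredChannel, {"CustomerKey"},
        "Channel", JoinKind.Inner
    ),
    // The joined column itself isn't needed; we only wanted the filter effect
    Result = Table.RemoveColumns(Merged, {"Channel"})
in
    Result
```

After applying this, check View Native Query (right-click the step) to confirm the join actually folded; if it's greyed out, the 24m rows are being pulled locally and you're back to the expensive case.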
Dataflows - these are just online versions of Power Query queries. If someone has already made a dataflow with the correct filters applied to the data then, by all means, use it. However, it will not give you a quicker or easier route to your desired outcome if it hasn't already been made.
Pete
Proud to be a Datanaut!