Hi,
I'd like to understand best practice for building this dataset, as I've not worked with data at this scale before.
I'm not the owner of the data I'm using; it comes from SQL views created by our finance team. Our business is quite new to Power BI, so I don't have much help internally in terms of best practices and optimisation. Your feedback would be great!
I have a star schema like this, all with numerical index keys:
Customer_Channel 1:1 Customer
Customer 1:* Transactions
Location 1:* Transactions
Invoice Channel 1:* Transactions
Product 1:* Transactions
Calendar 1:* Transactions
In Power Query I've dropped unnecessary columns and filtered the transaction table on a date value down to the period I wanted.
In this state there are still >24 million rows to load from the transaction table.
There are many transactions I don't need in the dataset, but the attribute I would filter on lives in the related tables, e.g. keep only transactions where the related Customer_Channel[Sales office] IN ('name1','name2'), etc.
From what I know, an INNER JOIN across the tables would leave me with just the rows I want.
Is there a method I've not thought of? What's the recommended method?
Another consideration is that this data is already in the Power BI service, serving reports my colleagues have built. I understand this is where dataflows come into play? Should I try to piggy-back off their dataset somehow? If so, an overview would be really helpful.
Thanks for your advice.
Hi @Anonymous ,
In brief:
You should be able to filter your Transactions table based on the customer codes that relate back up to Customer_Channel, but I'd need to understand the values in both Customer_Channel and Customer, how they relate to one another, and how Customer relates to Transactions.
The ideal method, in my opinion, if there's no direct lineage from Transactions to Customer_Channel, would be to do this on the SQL Server side by creating a view filtered to your precise needs.
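As a rough sketch of what that server-side view could look like — table, schema, and key names here are assumptions based on the model described above, not your actual schema (Customer_Channel is 1:1 with Customer, so I'm guessing they share a customer key):

```sql
-- Hypothetical filtered view; adjust names/keys to your real schema.
CREATE VIEW dbo.vw_Transactions_Filtered AS
SELECT t.*
FROM dbo.Transactions AS t
INNER JOIN dbo.Customer AS c
    ON c.CustomerKey = t.CustomerKey
INNER JOIN dbo.Customer_Channel AS cc
    ON cc.CustomerKey = c.CustomerKey          -- assumed 1:1 key
WHERE cc.[Sales office] IN ('name1', 'name2'); -- your filter values
```

You'd then point your Power Query Transactions source at this view instead of the raw table, so only the rows you need ever leave the server.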
If you can't do the above, then the second-best method would be to connect to your Transactions table using DirectQuery. This avoids loading 24m rows into the model; Power BI will just query the server for exactly what it needs for each visual. There can be performance downsides to this depending on how complicated you make your report, and also restrictions on the transformations you can perform, but it's likely faster at this volume of rows than an import model.
Merges in Power Query are generally very expensive. You may be able to do an inner join between your tables if they're from the same server/db and get this to fold back to the server, but I've not tested it.
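As an untested sketch of that merge in Power Query M — server, database, query, and column names are all placeholders/assumptions, and both tables must come from the same SQL source for folding to be possible:

```powerquery-m
// Untested sketch; names and keys are assumptions, not your real schema.
// Keeping both sources on one Sql.Database connection gives folding a chance.
let
    Source = Sql.Database("YourServer", "YourDatabase"),
    Transactions = Source{[Schema = "dbo", Item = "Transactions"]}[Data],
    CustomerChannel = Source{[Schema = "dbo", Item = "Customer_Channel"]}[Data],
    // Filter the small dimension table first
    FilteredChannel = Table.SelectRows(
        CustomerChannel,
        each List.Contains({"name1", "name2"}, [Sales office])
    ),
    // JoinKind.Inner drops any transaction without a matching channel row
    Merged = Table.NestedJoin(
        Transactions, {"CustomerKey"},
        FilteredChannel, {"CustomerKey"},
        "Channel", JoinKind.Inner
    ),
    // The joined column itself isn't needed; we only wanted the filter effect
    Result = Table.RemoveColumns(Merged, {"Channel"})
in
    Result
```

After applying this, check View Native Query (right-click the step) to confirm the join actually folded; if it's greyed out, the 24m rows are being pulled locally and you're back to the expensive case.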
Dataflows - these are just online versions of Power Query queries. If someone has already made a dataflow with the correct filters applied to the data then, by all means, use it. However, it will not give you a quicker or easier route to your desired outcome if it hasn't already been made.
Pete
Proud to be a Datanaut!