The ultimate Microsoft Fabric, Power BI, Azure AI, and SQL learning event: Join us in Stockholm, September 24-27, 2024.
Save €200 with code MSCUST on top of early bird pricing!
Find everything you need to get certified on Fabric—skills challenges, live sessions, exam prep, role guidance, and more. Get started
Long time listener, first time caller.
Power Query RE: building predictive models. Context, I have a dataset (c.400k rows, 140 columns) which is curated elsewhere and ingested into PQ as a flat file.
I'm building out 3x models for different events, and segmenting the dataset into say 4x customer segments = 12x models I'm looking to train (on the basis of sufficient data volume to do so and vastly improved AUC as the segments and events are distinct from one another). So it looks sort of like this:
What I want to do here is (working around the limitations of the ML solution in calibrating specifics) force a specific block of data as "training" data (majority of data), create a custom hold out for forward testing (c.25000 rows), and then a third block of data for applying the model against just the latest month (i.e. <=3000 rows).
So in reality my set of data cuts looks like this
I've created the "All", which all reference the Base data, and each Seg is a subset of that
So my question is - is it more process efficient
Benefits of sub referencing is it would reduce the amount of filters I need to change each month as they will cascade through, so this is the most likely answer. But specifically my issue here is I'm not sure if it is processing the queries in a manner then which stacking would make them inefficient given the size of the raw dataset.... my hope / assumption here is that when a sub query is called it essentially just "folds" the whole end to end chain, so it isn't in effect buffering the intermediary layer unnecessarily. I've had performance and run time issues so far on other queries using this dataset due to its size, hence optimising is relevant. Answers on a postcard
Thanks in advance
Tom
Incidentally, I found out the hard way that trying to house all of this is (models + data) within a single dataflow was a disaster as essentially I was unable to alter the cut off for segments or events down the track and just had to live with how it had been set up to begin with, hence now breaking data and models into seperate dataflows. Similarly, one area I'm finding the PowerBI ML solution not ideal is I can't set up a dozen models to run overnight, I have to set them up one at a time and wait for it to stop refreshing - there's no way to stop the refresh to let me queue them all up. Lastly, given it's all originated from the same dataset, there's no clean and quick way to say "apply all models" to the final dataset so it's all in the one place - overall, lots of faff and duplication which could probably be avoided if the user had the ability to configure refresh settings individually for each component of a dataflow.
Solved! Go to Solution.
Hi @TomDeas ,
To answer your main question: There's no difference between Referencing and Duplicating in terms of internal processing. There are some edge cases where Power Query/Dataflows will cache results, but it's safest to just assume at least one source hit per query regardless of whether duplicated or referenced.
As you rightly alluded to, the savings will be in your adjustment time. If you can create a core query that contains all generic transformations, then reference this for each subset of unique transformations, you will limit the number of areas that you need to make adjustments in.
Pete
Proud to be a Datanaut!
Hi @TomDeas ,
To answer your main question: There's no difference between Referencing and Duplicating in terms of internal processing. There are some edge cases where Power Query/Dataflows will cache results, but it's safest to just assume at least one source hit per query regardless of whether duplicated or referenced.
As you rightly alluded to, the savings will be in your adjustment time. If you can create a core query that contains all generic transformations, then reference this for each subset of unique transformations, you will limit the number of areas that you need to make adjustments in.
Pete
Proud to be a Datanaut!
Perfect thanks Pete that's absolutely it. Was keen to take that route (stacked references) as the queries were all ultimately subsets of one another and hence helps ensure integrity of filtering later on, but wasn't sure if I was walking into a processing trap, particularly as I've had some stability issues already.
I ran overnight on basis of stacking references, and didn't encounter any issues. Perhaps my question was phrased "what's best practice here".
Thanks
Join the community in Stockholm for expert Microsoft Fabric learning including a very exciting keynote from Arun Ulag, Corporate Vice President, Azure Data.
Check out the August 2024 Power BI update to learn about new features.