Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Find everything you need to get certified on Fabric—skills challenges, live sessions, exam prep, role guidance, and more. Get started

Reply
TomDeas
New Member

Query Efficiency - Stack References or Duplicate?

Long time listener, first time caller.

 

Power Query RE: building predictive models. Context, I have a dataset (c.400k rows, 140 columns) which is curated elsewhere and ingested into PQ as a flat file.

 

I'm building out 3x models for different events, and segmenting the dataset into say 4x customer segments = 12x models I'm looking to train (on the basis of sufficient data volume to do so and vastly improved AUC as the segments and events are distinct from one another). So it looks sort of like this:

 

snip 1.PNG

 

What I want to do here is (working around the limitations of the ML solution in calibrating specifics) force a specific block of data as "training" data (majority of data), create a custom hold out for forward testing (c.25000 rows), and then a third block of data for applying the model against just the latest month (i.e. <=3000 rows). 

 

So in reality my set of data cuts looks like this

 

snip 2.PNG

 

I've created the "All", which all reference the Base data, and each Seg is a subset of that

 

So my question is - is it more process efficient

  1. Duplicating the "All" queries, or
  2. Sub referencing the "All" queries

 

Benefits of sub referencing is it would reduce the amount of filters I need to change each month as they will cascade through, so this is the most likely answer. But specifically my issue here is I'm not sure if it is processing the queries in a manner then which stacking would make them inefficient given the size of the raw dataset.... my hope / assumption here is that when a sub query is called it essentially just "folds" the whole end to end chain, so it isn't in effect buffering the intermediary layer unnecessarily. I've had performance and run time issues so far on other queries using this dataset due to its size, hence optimising is relevant. Answers on a postcard

 

Thanks in advance

Tom

 

Incidentally, I found out the hard way that trying to house all of this is (models + data) within a single dataflow was a disaster as essentially I was unable to alter the cut off for segments or events down the track and just had to live with how it had been set up to begin with, hence now breaking data and models into seperate dataflows. Similarly, one area I'm finding the PowerBI ML solution not ideal is I can't set up a dozen models to run overnight, I have to set them up one at a time and wait for it to stop refreshing - there's no way to stop the refresh to let me queue them all up. Lastly, given it's all originated from the same dataset, there's no clean and quick way to say "apply all models" to the final dataset so it's all in the one place - overall, lots of faff and duplication which could probably be avoided if the user had the ability to configure refresh settings individually for each component of a dataflow.

1 ACCEPTED SOLUTION
BA_Pete
Super User
Super User

Hi @TomDeas ,

 

To answer your main question: There's no difference between Referencing and Duplicating in terms of internal processing. There are some edge cases where Power Query/Dataflows will cache results, but it's safest to just assume at least one source hit per query regardless of whether duplicated or referenced.

 

As you rightly alluded to, the savings will be in your adjustment time. If you can create a core query that contains all generic transformations, then reference this for each subset of unique transformations, you will limit the number of areas that you need to make adjustments in.

 

Pete



Now accepting Kudos! If my post helped you, why not give it a thumbs-up?

Proud to be a Datanaut!




View solution in original post

2 REPLIES 2
BA_Pete
Super User
Super User

Hi @TomDeas ,

 

To answer your main question: There's no difference between Referencing and Duplicating in terms of internal processing. There are some edge cases where Power Query/Dataflows will cache results, but it's safest to just assume at least one source hit per query regardless of whether duplicated or referenced.

 

As you rightly alluded to, the savings will be in your adjustment time. If you can create a core query that contains all generic transformations, then reference this for each subset of unique transformations, you will limit the number of areas that you need to make adjustments in.

 

Pete



Now accepting Kudos! If my post helped you, why not give it a thumbs-up?

Proud to be a Datanaut!




Perfect thanks Pete that's absolutely it. Was keen to take that route (stacked references) as the queries were all ultimately subsets of one another and hence helps ensure integrity of filtering later on, but wasn't sure if I was walking into a processing trap, particularly as I've had some stability issues already.

 

I ran overnight on basis of stacking references, and didn't encounter any issues. Perhaps my question was phrased "what's best practice here".

 

Thanks

Helpful resources

Announcements
Europe Fabric Conference

Europe’s largest Microsoft Fabric Community Conference

Join the community in Stockholm for expert Microsoft Fabric learning including a very exciting keynote from Arun Ulag, Corporate Vice President, Azure Data.

AugPowerBI_Carousel

Power BI Monthly Update - August 2024

Check out the August 2024 Power BI update to learn about new features.

August Carousel

Fabric Community Update - August 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors
Top Kudoed Authors