Ben-Dev
Helper II

Data Protection Firewall - Why 'TOP 1000'?

Hello,

 

The "Data privacy analysis" section of "Why does my query run multiple times?" says:

Data privacy does its own evaluations of each query to determine whether the queries are safe to run together. This evaluation can sometimes cause multiple requests to a data source. A telltale sign that a given request is coming from data privacy analysis is that it will have a “TOP 1000” condition (although not all data sources support such a condition). [...]

However, "Behind the scenes of the Data Privacy Firewall" doesn't seem to mention the firewall needing to fetch actual data rows (e.g. to do a "TOP 1000"). Rather, from the description given, it sounds like the firewall can determine whether two data sources can be folded together from data source-level (not data set-level) information, principally the privacy levels configured for the sources, which doesn't require fetching any row data.

 

Can anyone shed light on why the firewall might need to do a "TOP 1000" fetch?

 

Thank you,
Ben

1 ACCEPTED SOLUTION

The mashup engine doesn't have any static info about what data sources are being accessed by a given query or step. That info is surfaced at runtime, when a given query is being evaluated. So in order for the firewall to perform the logic described in the "Behind the scenes" article, it pulls the first 1k rows for any partition it's interested in. Doing so results in the partition's data source(s) being surfaced.
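
To make the reasoning above concrete, here is a toy Python model. It is not the mashup engine's actual implementation: `discover_sources`, `catalog`, and `sales_partition` are hypothetical names made up for illustration, and the only detail taken from this thread is the 1,000-row cap. The point it tries to show is that the sources a partition touches become visible only when the partition is evaluated, and that capping the evaluation keeps that discovery cheap.

```python
from itertools import islice

PREVIEW_ROWS = 1000  # row cap mentioned in the thread; everything else is hypothetical


def discover_sources(partition, catalog):
    """Evaluate a partition on at most PREVIEW_ROWS rows and report the sources it touched."""
    accessed = set()

    def open_source(name):
        # A source becomes known only at the moment the partition reads from it.
        accessed.add(name)
        return iter(catalog[name])

    # islice stops pulling rows after PREVIEW_ROWS, so the full source is never read.
    preview = list(islice(partition(open_source), PREVIEW_ROWS))
    return accessed, len(preview)


def sales_partition(open_source):
    # Which source gets read is decided only when this code actually runs.
    for row in open_source("sql_sales"):
        yield row


# Hypothetical catalog: one large source and one small one.
catalog = {"sql_sales": range(5000), "web_rates": range(10)}
sources, fetched = discover_sources(sales_partition, catalog)
```

Here `sources` ends up containing only `sql_sales` and `fetched` is 1000, even though that source holds 5,000 rows.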


5 REPLIES
Ben-Dev
Helper II

@Ehren, any chance you could weigh in on this to confirm/shed light? Thanks!

The mashup engine doesn't have any static info about what data sources are being accessed by a given query or step. That info is surfaced at runtime, when a given query is being evaluated. So in order for the firewall to perform the logic described in the "Behind the scenes" article, it pulls the first 1k rows for any partition it's interested in. Doing so results in the partition's data source(s) being surfaced.

Thanks, @Ehren!

Ben-Dev
Helper II

Thanks for digging into this, @Watsky.

Watsky
Solution Sage

Hey @Ben-Dev ,

I'll be honest: I don't know the actual answer, but I'll take a stab at what might be occurring. After you posted this, I did what you probably did and scoured the web, which turned up nothing for me. So I ran some tests and came up with a hypothesis about what is happening.

The section that follows the Data privacy analysis section is the Background analysis section, which reads:

Similar to the evaluations performed for data privacy, the Power Query editor by default will download a preview of the first 1000 rows of each query step. 

This tells me that this is a default action. So, what is it being used for? If you run the diagnostics within Power Query and open the Diagnostics Partition, you will find a couple of columns that give us some clues. The Firewall Group column is:

Categorization that explains why this partition has to be evaluated separately, including details on the privacy level of the partition. 

This is our reason for the evaluation being done on the partition. 

The Expression column is:

The expression that gets evaluated on top of the partition's query/step. In several cases, it coincides with the query/step.

This is where you will see the top 1000 being used:

[Screenshot: Watsky_1-1643169004362.png]

For the few tests I ran against different data sources, I found that all of them appear to pull the top 1,000 values of metadata. It must be validating each step by looking at the expression results. Interestingly, each step being evaluated has the exact same expression (essentially top 1,000 metadata for all fields). I wonder whether this is also what contributes to slower refreshes with data privacy turned on. To sum up: it's used to evaluate partition validity (step validity). At least, that's my theory.


