Solved: Re: Privacy levels

frithjof_v · 07-22-2024

Hi,

I have some questions regarding Data Privacy levels.

If I understand correctly, the data privacy levels work like this:

Technically it is up to me (i.e. the semantic model author), as the owner of my data source connections (be it in Power BI Desktop or in Power BI Service), to set the privacy level of each of my data source connections.

When I create a new data source connection, I can define it as Private, Organizational or Public; nothing technically prevents me from choosing any of them. The choice doesn't technically depend on whether my organization actually "owns" the data source, on the sensitivity of the data, or on the type or nature of the data source.

Technically, I can choose freely which data Privacy level setting to choose for any data source.
However, I should of course understand the (potential) consequences of choosing each privacy level.
And, in reality, the business / data protection / legal perspective should of course be considered when choosing the appropriate data Privacy level for a data source.

However, from a purely technical perspective, nothing stops me from choosing illogical settings, like setting my company's top-secret on-prem SQL server to Public, or setting a fully open web source (e.g. w3schools.com/python/pandas/data.csv.txt) to Private. Neither setting would make sense from a business risk point of view, but technically I could set either of these example data sources to Private, Organizational or Public.

What are the implications (and risks) of setting a Data Privacy level?

The only implication of setting the data Privacy level that I know of, from a security perspective, concerns query folding. For example, when using values from one data source to filter another data source (and perhaps also when merging data from two different data sources), Power Query can improve performance by sending data from one source to the other source as part of the folded query.

I believe the Power Query engine's rules are like this:

The Power Query engine can use data from a Public data source as input (e.g. in a WHERE clause) in query folding to other Public, Organizational or Private data sources.
The Power Query engine can use data from an Organizational data source as input in query folding to other Organizational or Private data sources.
The Power Query engine cannot use data from Private data sources as inputs in query folding to any other data sources.
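To make the three rules above easier to check, here is a minimal Python sketch of them as a lookup. It only encodes the rules as I have described them in this post; it is not how the Power Query engine is actually implemented, and the names `Level` and `can_fold_into` are made up for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    """Privacy levels ordered from least to most restrictive."""
    PUBLIC = 0
    ORGANIZATIONAL = 1
    PRIVATE = 2


def can_fold_into(source: Level, target: Level) -> bool:
    """Return True if data from `source` may be sent to `target` in a folded query.

    Hypothetical helper that encodes the rules described in this post;
    it is not a Power Query API.
    """
    if source is Level.PRIVATE:
        # Private data is never sent to another source, not even a Private one.
        return False
    # Public and Organizational data may flow to an equally or more
    # restrictive target.
    return target >= source


if __name__ == "__main__":
    for s in Level:
        allowed = [t.name for t in Level if can_fold_into(s, t)]
        print(f"{s.name:15} -> {allowed}")
```

Read this way, the privacy level of a source says two things at once: where its own data may go, and which other sources are allowed to send data to it.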

This means, data from a data source I have set as Public may leak (as part of a folded query) to other data sources which have been set as Public, Organizational or Private. And data from a data source which I have set as Organizational, may leak to other data sources which have been set as Organizational or Private.
Are there any other risks which are affected by data Privacy levels, other than the risk of data leakage through query folding? Not that I know of.

In the documentation (Understand Power BI Desktop privacy levels - Power BI | Microsoft Learn) there are some statements I am wondering about:

Private - "Visibility can be restricted to authorized users."

Organizational - "Visibility is set to a trusted group."

Public - "Visibility is available to everyone."

(Also: "Files, internet data sources, and workbook data can be set to Public." I think this should depend on a risk assessment of the data content, i.e. its sensitivity, and not on the source's storage format or location. If a workbook contains sensitive data, it should definitely not be set as Public. I agree sensitive data should not be stored in a workbook, but that is another question.)

What do these statements mean in practical terms? I am wondering about the word "visibility".
How does the chosen Data Privacy level affect which data other users can see (have visibility into)?
In my understanding of Data Privacy level, visibility has nothing to do with Data Privacy levels. In my understanding, data visibility is subject to workspace access, sharing of semantic models / reports, RLS, OLS, and so on. So I am struggling to understand these statements which talk about visibility being affected by data Privacy level. In my understanding, data Privacy level is only concerned with the potential for data leakage through query folding.

This documentation (Privacy levels in Power Query - Power Query | Microsoft Learn) says that «Power Query analyzes each data source and classifies it by the defined level of privacy: Public, Organizational, and Private. This analysis ensures data isn't combined if there's undesirable data transfer.

This process of data protection can also occur when a query uses query folding.»

Why does it say «can also occur when a query uses query folding»?
Isn’t query folding the only way Power Query can leak data from one data source to another?
So the word «also» seems misleading, and could be replaced by «only» (or the entire section in the documentation could be rewritten). Or am I missing something?

I think the documentation pages which I mentioned (Privacy levels in Power Query - Power Query | Microsoft Learn and Understand Power BI Desktop privacy levels - Power BI | Microsoft Learn) create confusion about what the privacy levels are really used for.

If it is true that the privacy levels are only used to restrict data leakage through query folding (which I believe is true), then I think this should be clearly stated, instead of confusing statements like "Privacy Levels might prevent you from inadvertently combining data from multiple data sources" and "Privacy levels are critical to configure correctly so that only authorized users can view the sensitive data", which make it sound like the data Privacy levels affect things other than what they really affect.

So either I am still missing something here, or the documentation seems to be quite confusing?

What do you think?

I found these resources which I think do a great job at explaining this topic:

Power BI Data Privacy Settings Deep Dive - Chris Webb (youtube.com)

Behind the scenes of the Data Privacy Firewall - Power Query | Microsoft Learn

Power Query M Primer (Part 13): Tables—Table Think II | Ben Gribaudo (about Firewall)

I would like to learn more about this, and/or get some clarifications regarding this. Thanks!

AmiraBedh · 07-22-2024

Is there any technical restriction on setting the privacy level for any data source?

Technically, you can set the privacy level of any data source to Private, Organizational, or Public, regardless of the nature or ownership of the data. This flexibility means that even illogical settings can be applied, such as defining a top-secret company server as Public or an open web source as Private. So you need to choose the appropriate level, considering the sensitivity and potential business risks.

What are the implications (or risks) of setting a Data Privacy level?

In my opinion, the primary implication of setting privacy levels in Power BI is related to data leakage through query folding. Let me go through the levels one by one:
- Public Data Sources: Data can be used as input in query folding to other Public, Organizational, or Private data sources.
- Organizational Data Sources: Data can be used as input in query folding to other Organizational or Private data sources.
- Private Data Sources: Data cannot be used as input in query folding to any other data sources.

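The main risk here is data leakage: data from a source labelled Public or Organizational can be embedded in a folded query that is sent to another source, so a source labelled with a less restrictive level than it deserves can have its data exposed to other systems.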

Are there any other risks affected by data Privacy levels, besides the risk of data leakage through query folding?

The primary concern does indeed remain data leakage through query folding. The documentation mentions preventing undesirable data transfers, which essentially translates to controlling how data from different sources interacts within queries. Other security measures like workspace access, sharing permissions, Row-Level Security (RLS), and Object-Level Security (OLS) manage visibility and access control outside the context of query folding.

What does "visibility" mean in practical terms regarding data privacy levels?

The term "visibility" in the context of data privacy levels might be misleading. It seems to refer to the control over data leakage during query folding rather than direct user access. Data visibility to users is managed through workspace access, sharing settings, RLS, and OLS. Thus, the term "visibility" in the documentation should ideally be interpreted as the potential exposure of data during query processing rather than user access control.

Why does the documentation mention that data protection occurs "also" when a query uses query folding?

I totally agree that the documentation's use of "also" might indeed be misleading. Query folding is the primary mechanism through which data privacy levels protect against data leakage. The phrase "also occur when a query uses query folding" should be clarified to indicate that query folding is the primary, if not the only, context in which privacy levels play a protective role. This section of the documentation could benefit from clearer wording to avoid confusion.


frithjof_v · 07-22-2024

I also think that for an external data source (i.e. a source which doesn't contain your organization's data and to which you don't want to leak your organizational data), you should NOT set the Organizational or Private level. If you set an external source as Organizational or Private, you might leak data from your other (actual) Organizational data sources into the external source which has been wrongfully labelled as Private or Organizational.

So external sources should probably always be set to Public, if that makes sense?

I think we need to remember two things:

- using a setting that is too "relaxed" can lead to unwanted data exposure.

If a source which should be labelled as Private gets wrongfully labelled as Organizational, it can cause unwanted exposure of that sensitive data to other organizational data systems.

If a source which should be labelled as Private or Organizational gets wrongfully labelled as Public, it can also cause unwanted exposure of that sensitive or internal data to external systems.

- using a setting that is too "strict" can also lead to unwanted data exposure.

If an (external) source which should be labelled as Public gets wrongfully labelled as Organizational or Private, then data from other sources which are rightfully labelled as Organizational can be leaked to that external system. This is because when we label a data source as Organizational or Private, we say that it's okay to send data to this data source from our other Organizational and Public sources (through query folding).

If a source which should be labelled as Organizational gets wrongfully labelled as Private, I don't think this introduces any added risk of data leakage. But it can lead to bad performance in Power Query.
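The two mislabelling cases can be illustrated with a small Python sketch. It assumes the folding rules as described in this thread and is not the actual Power Query engine logic; the source names (`internal_sql`, `external_api`) and the helpers `may_send` and `unwanted_flows` are hypothetical.

```python
# Levels ordered from least to most restrictive (illustrative encoding).
PUBLIC, ORGANIZATIONAL, PRIVATE = 0, 1, 2


def may_send(source_level: int, target_level: int) -> bool:
    """Folding rule as described in this thread: Private data never flows out,
    other data may flow to an equally or more restrictive target."""
    return source_level != PRIVATE and target_level >= source_level


def unwanted_flows(sources: dict[str, tuple[int, int]]) -> list[tuple[str, str]]:
    """`sources` maps a name to (correct_level, configured_level).

    Returns the (from, to) pairs that the configured labels allow
    but the correct labels would have blocked.
    """
    leaks = []
    for src, (src_true, src_set) in sources.items():
        for dst, (dst_true, dst_set) in sources.items():
            if src == dst:
                continue
            if may_send(src_set, dst_set) and not may_send(src_true, dst_true):
                leaks.append((src, dst))
    return leaks


# Hypothetical sources: an internal SQL database and an external web API.
too_strict = {
    "internal_sql": (ORGANIZATIONAL, ORGANIZATIONAL),
    "external_api": (PUBLIC, PRIVATE),  # external source wrongly set to Private
}
print(unwanted_flows(too_strict))  # [('internal_sql', 'external_api')]
```

With an Organizational source wrongly set to Private next to a correctly labelled Organizational source, the same function returns an empty list, which matches the point that this kind of mislabelling costs performance rather than adding leakage risk.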

Another aspect of it: before combining data from different sources in Power Query (i.e. merging data, using data from one source to filter another source, etc.), remember to check the privacy level setting of each data source.