Direct query using count distinct hitting database...

Anonymous · ‎05-19-2022

Hi I have the following issue:

We have a table which we connec to using DirectQuery (onsite web events table), which is connect to an imported table with user information.

We show a table of users for particular types of onsite events, and in this table, a column giving a total of the number of page views the user has performed for those types of events.

We needed to change the column that we use to calculate the "total number of page views" from a sum of the number of records in a column where a value was = 1, to be a COUNT DISTINCT of a "unique_id' field. Both of these fields are on the DirectQuery table.

However, when we switch the logic to be COUNT DISTINCT the performance slows right down and it turns out it's querying it for every row, rather than overall for

select "USER_AND_SHORT_CONTRACT_KEY",

sum("IS_PAGE_VIEW") as "C1"
from
(
select "UNIQUE_ID",
"DATE_TIME",
"DATE_WEBEVENT",
"SESSION_START_TIMESTAMP",
"VISIT_START_TIME_GMT_TS",
"HIT_TIME_GMT_TS",
"VISIT_START_TIME_GMT",
"HIT_TIME_GMT",
"MONTH_YEAR",
"IS_PAGE_VIEW",
"IS_RESEARCH_VIEW",
"IS_PAYWALL_HIT",
"IS_LINK_TRACK",
"SESSION_ID",
"LOGIN_ID",
"CONTRACT_ID_SHORT",
"SESSION_USER_KEY",
"PAGE_NAME",
"PAGE_URL",
"BROWSER_LANGUAGE",
"OS",
"USER_REGION",
"WEBSITE_AREA",
"LINK_TYPE",
"LINK_ID",
"LINK_NAME",
"IS_SEARCH",
"SEARCH_TERM",
"SEARCH_LINK_SECTION",
"SEARCH_NUM_OF_RESULTS",
"SEARCH_TYPE",
"SEARCH_LOCATION",
"VENDOR",
"VENDOR2",
"CONTENT_GROUP",
"CLICKED_FROM",
"WIDGET_NAME",
"WIDGET_ESG",
"GEO_COUNTRY",
"VISIT_REF_TYPE",
"VISIT_PAGE_NUM",
"VISIT_REF_DOMAIN",
"REF_DOMAIN",
"CLICK_WIDGET_ESG",
"DOCUMENT_ID",
"DOC_TYPE_HITDATA",
"DOC_TYPE_MONGO",
"DOC_CONTENT_TYPE",
"DOC_TITLE",
"DOC_TITLE_PW_HIT",
"DOC_PUBLICATION_DATE",
"IS_ISSUER",
"ISSUER_ID",
"ISSUER_ORG_NAME",
"ISSUER_ORG_PEER_INDUSTRY",
"ISSUER_ORG_BUSINESS",
"ISSUER_ORG_TYPE",
"ISSUER_ORG_PEER_GROUP",
"IS_TOPIC",
"TOPIC_SITE_LEVEL_1",
"TOPIC_SITE_LEVEL_2",
"IS_SECTOR",
"SECTOR_SITE_LEVEL_1",
"SECTOR_SITE_LEVEL_2",
"PV_PRIMARY_MARKET_SEGMENT",
"PV_SECONDARY_MARKET_SEGMENT",
"PV_TERTIARY_MARKET_SEGMENT",
"PV_QUATERNARY_MARKET_SEGMENT",
"PV_QUINARY_MARKET_SEGMENT",
"CLICKSTREAM_KEY",
"DATE_ID",
"USER_AND_SHORT_CONTRACT_KEY",
"ISSUER_NAME",
"ISSUER_ORG_IS_RATED",
"ISSUER_ORG_IS_CAP_STRUCTURE",
"REGISTERED_USER_HIT"
from "PROD_MACI"."MACI_SEMANTIC"."FACT_REGISTERED_USER_PAGE_VIEWS"
where ("USER_AND_SHORT_CONTRACT_KEY" in (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?))
) as "ITBL"
group by "USER_AND_SHORT_CONTRACT_KEY"
LIMIT 1000001 OFFSET 0

whereas when I do COUNT DISTINCT, we get a query like the following per user

select { fn convert(count(distinct("UNIQUE_ID")), SQL_DOUBLE) } + { fn convert(max("C1"), SQL_DOUBLE) } as "C1"
from
(
select "UNIQUE_ID"
case
when "UNIQUE_ID" is null
then CAST(1 as INTEGER)
else CAST(0 as INTEGER)
end as "C1"
from "PROD_MACI"."MACI_SEMANTIC"."FACT_REGISTERED_USER_PAGE_VIEWS"
where (("USER_AND_SHORT_CONTRACT_KEY" in (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)) and "USER_AND_SHORT_CONTRACT_KEY" = ?
) as "ITBL"

Does anyone have any ideas how I can get the COUNT DISTINCT version to hit the database only once (logically it seems like this should be possible, but I'm guessing it's something to with the NULL handling above?

Anonymous · ‎05-22-2022

Unfortunately it isn't, it's the unique_id of an event that is duplicated in some cases in this particular table, which is why we need the distinct. To me - it seems that the underlying query should just be able to switch from doing the SUM(IS_PAGE_VIEW) to COUNT(DISTINCT UNIQUE_ID) with it querying once for all the values (like in the first query). It's not clear to me why it's switching from one type of query to another. The table it's querying is a large one with around 300 million rows, and it's just not going to handle running this query against it for every row.

lbendlin · ‎05-22-2022

One thing you can consider is indexing the living daylights out of your source table, using the Direct Query queries as guidance. Of course this comes with a large cost in terms of storage and maintenance, but it may buy you the required performance.

Anonymous · ‎05-24-2022

We use Snowflake where you don't do your own indexing, so this isn't gong to help.

What I really want is the query that hits the database to look like the below, what I don't understand is why PowerBI doesn't do this - and if there's anything I can do to suggest that it constructs the query like this instead. Each query is against the 300 million row table, which is fine when it runs once - it takes about a second, but clearly when it does this once for each of 100+ users this is very inefficient and slow.

select "USER_AND_SHORT_CONTRACT_KEY",

count(distinct unique_id) as "C1"
from
(
select "UNIQUE_ID",
"DATE_TIME",
... lots of columns
from "PROD_MACI"."MACI_SEMANTIC"."FACT_REGISTERED_USER_PAGE_VIEWS"
where ("USER_AND_SHORT_CONTRACT_KEY" in (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?))
) as "ITBL"
group by "USER_AND_SHORT_CONTRACT_KEY"
LIMIT 1000001 OFFSET 0

lbendlin · ‎05-24-2022

Raise it as an idea and/or issue against the Power Query team to ask for options to include query hints during folding. Not sure if the Snowflake connector supports custom queries but if it does then handcrafting your queries might be another option. Of couse this will likely break query folding.

lbendlin · ‎05-21-2022

Is the field name ( NIQUE_ID ) any indication of the column type? Is this the primary key in the table? If yes then changing COUNT(DISTINCT) to COUNT() should yield the same result and is likely a lot faster.

If you use Direct Query then your data source needs to be ready and able to handle gazillions of queries. If the data source cannot support that then you should consider switching to Import Mode.

Direct query using count distinct hitting database for every row

Helpful resources

Join our Fabric User Panel

Power BI Monthly Update - June 2025

Fabric Community Update - June 2025

How to Get Your Question Answered Quickly

Join the #PBI10 DataViz contest