Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
clairetaylor
Frequent Visitor

Direct query using count distinct hitting database for every row

Hi I have the following issue:

 

We have a table which we connec to using DirectQuery (onsite web events table), which is connect to an imported table with user information.

 

We show a table of users for particular types of onsite events, and in this table, a column giving a total of the number of page views the user has performed for those types of events.

 

We needed to change the column that we use to calculate the "total number of page views" from a sum of the number of records in a column where a value was = 1, to be a COUNT DISTINCT of a "unique_id' field. Both of these fields are on the DirectQuery table.

 

However, when we switch the logic to be COUNT DISTINCT the performance slows right down and it turns out it's querying it for every row, rather than overall for 

 

 

select "USER_AND_SHORT_CONTRACT_KEY",

sum("IS_PAGE_VIEW") as "C1"
from
(
select "UNIQUE_ID",
"DATE_TIME",
"DATE_WEBEVENT",
"SESSION_START_TIMESTAMP",
"VISIT_START_TIME_GMT_TS",
"HIT_TIME_GMT_TS",
"VISIT_START_TIME_GMT",
"HIT_TIME_GMT",
"MONTH_YEAR",
"IS_PAGE_VIEW",
"IS_RESEARCH_VIEW",
"IS_PAYWALL_HIT",
"IS_LINK_TRACK",
"SESSION_ID",
"LOGIN_ID",
"CONTRACT_ID_SHORT",
"SESSION_USER_KEY",
"PAGE_NAME",
"PAGE_URL",
"BROWSER_LANGUAGE",
"OS",
"USER_REGION",
"WEBSITE_AREA",
"LINK_TYPE",
"LINK_ID",
"LINK_NAME",
"IS_SEARCH",
"SEARCH_TERM",
"SEARCH_LINK_SECTION",
"SEARCH_NUM_OF_RESULTS",
"SEARCH_TYPE",
"SEARCH_LOCATION",
"VENDOR",
"VENDOR2",
"CONTENT_GROUP",
"CLICKED_FROM",
"WIDGET_NAME",
"WIDGET_ESG",
"GEO_COUNTRY",
"VISIT_REF_TYPE",
"VISIT_PAGE_NUM",
"VISIT_REF_DOMAIN",
"REF_DOMAIN",
"CLICK_WIDGET_ESG",
"DOCUMENT_ID",
"DOC_TYPE_HITDATA",
"DOC_TYPE_MONGO",
"DOC_CONTENT_TYPE",
"DOC_TITLE",
"DOC_TITLE_PW_HIT",
"DOC_PUBLICATION_DATE",
"IS_ISSUER",
"ISSUER_ID",
"ISSUER_ORG_NAME",
"ISSUER_ORG_PEER_INDUSTRY",
"ISSUER_ORG_BUSINESS",
"ISSUER_ORG_TYPE",
"ISSUER_ORG_PEER_GROUP",
"IS_TOPIC",
"TOPIC_SITE_LEVEL_1",
"TOPIC_SITE_LEVEL_2",
"IS_SECTOR",
"SECTOR_SITE_LEVEL_1",
"SECTOR_SITE_LEVEL_2",
"PV_PRIMARY_MARKET_SEGMENT",
"PV_SECONDARY_MARKET_SEGMENT",
"PV_TERTIARY_MARKET_SEGMENT",
"PV_QUATERNARY_MARKET_SEGMENT",
"PV_QUINARY_MARKET_SEGMENT",
"CLICKSTREAM_KEY",
"DATE_ID",
"USER_AND_SHORT_CONTRACT_KEY",
"ISSUER_NAME",
"ISSUER_ORG_IS_RATED",
"ISSUER_ORG_IS_CAP_STRUCTURE",
"REGISTERED_USER_HIT"
from "PROD_MACI"."MACI_SEMANTIC"."FACT_REGISTERED_USER_PAGE_VIEWS"
where ("USER_AND_SHORT_CONTRACT_KEY" in (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?))
) as "ITBL"
group by "USER_AND_SHORT_CONTRACT_KEY"
LIMIT 1000001 OFFSET 0

 

whereas when I do COUNT DISTINCT, we get a query like the following per user

 

select { fn convert(count(distinct("UNIQUE_ID")), SQL_DOUBLE) } + { fn convert(max("C1"), SQL_DOUBLE) } as "C1"
from
(
select "UNIQUE_ID"
case
when "UNIQUE_ID" is null
then CAST(1 as INTEGER)
else CAST(0 as INTEGER)
end as "C1"
from "PROD_MACI"."MACI_SEMANTIC"."FACT_REGISTERED_USER_PAGE_VIEWS"
where (("USER_AND_SHORT_CONTRACT_KEY" in (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?))  and "USER_AND_SHORT_CONTRACT_KEY" = ?
) as "ITBL"

 

Does anyone have any ideas how I can get the COUNT DISTINCT version to hit the database only once (logically it seems like this should be possible, but I'm guessing it's something to with the NULL handling above?

5 REPLIES 5
clairetaylor
Frequent Visitor

Unfortunately it isn't, it's the unique_id of an event that is duplicated in some cases in this particular table, which is why we need the distinct. To me - it seems that the underlying query should just be able to switch from doing the SUM(IS_PAGE_VIEW) to COUNT(DISTINCT UNIQUE_ID) with it querying once for all the values (like in the first query). It's not clear to me why it's switching from one type of query to another. The table it's querying is a large one with around 300 million rows, and it's just not going to handle running this query against it for every row. 

One thing you can consider is indexing the living daylights out of your source table, using the Direct Query queries as guidance.  Of course this comes with a large cost in terms of storage and maintenance, but it may buy you the required performance.

We use Snowflake where you don't do your own indexing, so this isn't gong to help. 

 

What I really want is the query that hits the database to look like the below, what I don't understand is why PowerBI doesn't do this - and if there's anything I can do to suggest that it constructs the query like this instead. Each query is against the 300 million row table, which is fine when it runs once - it takes about a second, but clearly when it does this once for each of 100+ users this is very inefficient and slow.

select "USER_AND_SHORT_CONTRACT_KEY",

count(distinct unique_id) as "C1"
from
(
select "UNIQUE_ID",
"DATE_TIME",
... lots of columns
from "PROD_MACI"."MACI_SEMANTIC"."FACT_REGISTERED_USER_PAGE_VIEWS"
where ("USER_AND_SHORT_CONTRACT_KEY" in (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?))
) as "ITBL"
group by "USER_AND_SHORT_CONTRACT_KEY"
LIMIT 1000001 OFFSET 0

Raise it as an idea and/or issue against the Power Query team to ask for options to include query hints during folding.  Not sure if the Snowflake connector supports custom queries but if it does then handcrafting your queries might be another option. Of couse this will likely break query folding.

lbendlin
Super User
Super User

Is the field name ( NIQUE_ID ) any indication of the column type?  Is this the primary key in the table?  If yes then changing COUNT(DISTINCT)  to COUNT() should yield the same result and is likely a lot faster.

 

If you use Direct Query then your data source needs to be ready and able to handle gazillions of queries. If the data source cannot support that then you should consider switching to Import Mode.

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.