March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount! Early bird discount ends December 31.
In the article "querying-one-trillion-rows-of-data-with-powerbi-and-azure-databricks", Kyle Hale successfully tested combining Azure Databricks with Power BI.
We followed the same steps for a table with around 1 trillion rows:
from delta.tables import DeltaTable

# Build the fact table, partitioned by the generated DATE and HOUR columns
# so that time-based filters can prune partitions
DeltaTable.create(spark) \
    .tableName("FACT_TABLE") \
    .addColumn("ID", "STRING") \
    .addColumn("TIC", "STRING") \
    .addColumn("UPDATE_TIMESTAMP", "TIMESTAMP") \
    .addColumn("SRC", "STRING") \
    .addColumn("PRICE", "DOUBLE") \
    .addColumn("QUANTITY", "DOUBLE") \
    .addColumn("TYPE", "STRING") \
    .addColumn("DATE", "DATE", generatedAlwaysAs="CAST(UPDATE_TIMESTAMP AS DATE)") \
    .addColumn("HOUR", "INT", generatedAlwaysAs="HOUR(UPDATE_TIMESTAMP)") \
    .addColumn("MINUTE", "INT", generatedAlwaysAs="MINUTE(UPDATE_TIMESTAMP)") \
    .addColumn("SECOND", "INT", generatedAlwaysAs="SECOND(UPDATE_TIMESTAMP)") \
    .partitionedBy("DATE", "HOUR") \
    .execute()
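For clarity, the four generated columns simply derive date and time parts from UPDATE_TIMESTAMP. A minimal standard-library sketch of what Delta computes at write time (so inserts never need to supply these columns explicitly):

```python
# What the generated columns compute, expressed with the standard library.
# Delta derives these from UPDATE_TIMESTAMP on write; the sample timestamp
# below is made up for illustration.
from datetime import datetime

ts = datetime(2023, 6, 10, 11, 12, 45)
row = {
    "DATE": ts.date().isoformat(),   # CAST(UPDATE_TIMESTAMP AS DATE)
    "HOUR": ts.hour,                 # HOUR(UPDATE_TIMESTAMP)
    "MINUTE": ts.minute,             # MINUTE(UPDATE_TIMESTAMP)
    "SECOND": ts.second,             # SECOND(UPDATE_TIMESTAMP)
}
print(row)  # {'DATE': '2023-06-10', 'HOUR': 11, 'MINUTE': 12, 'SECOND': 45}
```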
In Power BI we used DirectQuery with dynamic M parameters (Date, Hour, and Minute) to try to limit the number of records fetched by a query. We have around 30 million rows per hour, or around 0.5 million per minute, when counting with:
df = spark.sql("SELECT * FROM LH.FACT_TABLE WHERE DATE = '2023-06-10' AND HOUR = 11 AND MINUTE = 12")
df.count()
from a notebook. The result comes back fairly quickly (in under 4 minutes).
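As a rough sanity check on these figures (assuming a uniform event rate, which real data certainly is not), the per-hour and per-minute counts are consistent, and they imply the table spans a few years of data:

```python
# Rough arithmetic check of the row-rate figures quoted above.
# Assumes a uniform rate, which is an idealization.
rows_per_hour = 30_000_000
rows_per_minute = rows_per_hour // 60
print(rows_per_minute)  # 500000, matching the ~0.5 million quoted

total_rows = 1_000_000_000_000  # ~1 trillion
hours_of_data = total_rows / rows_per_hour
print(round(hours_of_data / 24 / 365, 1))  # ~3.8 years of data at that rate
```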
When we try to run the same SQL query from the SQL endpoint of the Lakehouse, the result is never returned, even after waiting several hours, as if the SQL query is not doing any partition pruning.
Since we chose to connect Power BI to the SQL endpoint, we are unable to report on that table.
Is there anyone out there dealing with 1 trillion row tables in Power BI? If yes, were you able to report on the table, or are you facing the same problem as me?
Thanks
Hi @mkj1213 ,
Thanks for reaching out to our community.
First, check partition pruning. Please ensure that the SQL endpoint is correctly configured and supports partition pruning, and explicitly filter on the partition columns in your SQL query so it can leverage the partitions.
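To make the advice concrete, here is a minimal pure-Python sketch of what partition pruning means (an illustration only; Delta and the Fabric SQL endpoint do this internally when the filter predicate matches the partition columns). The file and partition names are made up:

```python
# Toy model of a partitioned table: data files grouped by their
# (DATE, HOUR) partition key. A query that filters on the partition
# columns only needs to read the files in the matching partition.
partitions = {
    ("2023-06-10", 10): ["file_a.parquet", "file_b.parquet"],
    ("2023-06-10", 11): ["file_c.parquet"],
    ("2023-06-11", 11): ["file_d.parquet"],
}

def files_to_scan(date, hour):
    # With pruning: look up only the partition matching the predicate,
    # instead of scanning every file in the table.
    return partitions.get((date, hour), [])

print(files_to_scan("2023-06-10", 11))  # ['file_c.parquet']
```

If the endpoint were scanning every partition regardless of the filter (the behavior the question describes), it would be the equivalent of reading all four files here for every query.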
Also consider using aggregate queries or precomputed views to reduce the amount of data being queried. For example, create daily or hourly aggregate tables and query those aggregates in Power BI.
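The pre-aggregation idea can be sketched in a few lines: collapsing per-minute rows into one row per (date, hour) bucket cuts the row count by up to 60x before Power BI ever sees the data. The sample rows below are made up, with columns mirroring the fact table:

```python
# Toy illustration of pre-aggregation: per-event rows rolled up into one
# bucket per (date, hour). The data is fabricated for illustration.
from collections import defaultdict

fact_rows = [
    # (DATE, HOUR, MINUTE, QUANTITY)
    ("2023-06-10", 11, 12, 100.0),
    ("2023-06-10", 11, 13, 250.0),
    ("2023-06-10", 12, 0, 75.0),
]

hourly = defaultdict(float)
for date, hour, minute, qty in fact_rows:
    hourly[(date, hour)] += qty  # one aggregate row per (date, hour)

print(dict(hourly))
# {('2023-06-10', 11): 350.0, ('2023-06-10', 12): 75.0}
```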
Hope it helps.
Best Regards,
Stephen Tao
If this post helps, then please consider Accepting it as the solution to help other members find it more quickly.
Thanks for your reply Stephen,
1. How am I supposed to check whether the SQL endpoint is correctly configured and supports partition pruning? We are using the SQL endpoint that Fabric provides when creating a Lakehouse.
2. We had already tried querying using only the partition columns, and I am not getting any data at all.
3. We had already created daily and hourly aggregate tables.
Thanks in advance