In the article "Querying One Trillion Rows of Data with Power BI and Azure Databricks," Kyle Hale successfully tested combining Azure Databricks with Power BI.
We followed the same steps for a table with around one trillion rows:
from delta.tables import DeltaTable
DeltaTable.create(spark)\
.tableName("FACT_TABLE")\
.addColumn("ID", "STRING")\
.addColumn("TIC", "STRING")\
.addColumn("UPDATE_TIMESTAMP", "TIMESTAMP")\
.addColumn("SRC", "STRING")\
.addColumn("PRICE", "DOUBLE")\
.addColumn("QUANTITY", "DOUBLE")\
.addColumn("TYPE", "STRING")\
.addColumn("DATE", "DATE", generatedAlwaysAs="CAST(UPDATE_TIMESTAMP AS DATE)")\
.addColumn("HOUR", "INT", generatedAlwaysAs="HOUR(UPDATE_TIMESTAMP)")\
.addColumn("MINUTE", "INT", generatedAlwaysAs="MINUTE(UPDATE_TIMESTAMP)")\
.addColumn("SECOND", "INT", generatedAlwaysAs="SECOND(UPDATE_TIMESTAMP)")\
.partitionedBy("DATE", "HOUR")\
.execute()

In Power BI we used DirectQuery with Dynamic M parameters (Date, Hour, and Minute) to limit the number of records fetched per query. One hour holds around 30 million rows, or roughly 0.5 million rows per minute, which we verified by counting with:
df = spark.sql("SELECT * FROM LH.FACT_TABLE WHERE DATE = '2023-06-10' AND HOUR = 11 AND MINUTE = 12")
df.count()

from a notebook. The result is returned quickly (in less than 4 minutes).
When we run the same SQL query against the SQL endpoint of the lakehouse, the result is never returned, even after waiting several hours, as if the SQL query is not doing any partition pruning.
Since we chose to connect Power BI to the SQL endpoint, we are unable to report on that table.
Is anyone out there dealing with one-trillion-row tables in Power BI? If so, were you able to report on the table, or are you facing the same problem as we are?
Thanks
Hi @mkj1213 ,
Thanks for reaching out to our community.
You could check partition pruning. Please ensure that the SQL endpoint is correctly configured and supports partition pruning. Try explicitly specifying partition columns in your SQL query to ensure the query can leverage partitions.
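One way to check pruning from the Spark side is to inspect the query plan in a notebook before blaming the endpoint. A minimal sketch (table and column names are taken from the post above; the date literal is illustrative):

```sql
-- Run in a Spark notebook: the physical plan should show PartitionFilters
-- on DATE and HOUR if pruning is happening.
EXPLAIN
SELECT COUNT(*)
FROM LH.FACT_TABLE
WHERE DATE = '2023-06-10' AND HOUR = 11;
```

If the plan shows partition filters on DATE and HOUR, pruning works on the Spark side, and the slowness would be specific to the SQL endpoint.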
Also consider using aggregate queries or precomputed views to reduce the amount of data being queried. For example, create daily or hourly aggregate tables and query these aggregates in Power BI.
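For example, a precomputed hourly aggregate could look like the following sketch (the aggregate table name and the chosen measures are assumptions; adjust them to your model):

```sql
-- Hypothetical hourly aggregate: one row per ticker per hour
-- instead of ~30 million raw rows per hour.
CREATE TABLE LH.FACT_TABLE_HOURLY AS
SELECT
    DATE,
    HOUR,
    TIC,
    SUM(QUANTITY) AS TOTAL_QUANTITY,
    AVG(PRICE)    AS AVG_PRICE,
    COUNT(*)      AS ROW_COUNT
FROM LH.FACT_TABLE
GROUP BY DATE, HOUR, TIC;
```

Pointing Power BI at this aggregate keeps visuals fast while the detail table remains available for drill-through.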
Hope it helps.
Best Regards,
Stephen Tao
If this post helps, then please consider accepting it as the solution to help other members find it more quickly.
Thanks for your reply Stephen,
1. How am I supposed to check whether the SQL endpoint is correctly configured and supports partition pruning? We are using the SQL endpoint that Fabric provides when creating a lakehouse.
2. We had already tried querying using only the partition columns, and we are not getting any data back at all.
3. We had already created daily and hourly aggregate tables.
Thanks in Advance