March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount! Early bird discount ends December 31.
In the article "querying-one-trillion-rows-of-data-with-powerbi-and-azure-databricks", Kyle Hale successfully tested combining Azure Databricks with Power BI.
We followed the same steps for a table with around 1 trillion rows:
from delta.tables import DeltaTable

# Build the fact table, partitioned by the generated DATE and HOUR columns
# so that time-based filters can prune partitions
DeltaTable.create(spark) \
    .tableName("FACT_TABLE") \
    .addColumn("ID", "STRING") \
    .addColumn("TIC", "STRING") \
    .addColumn("UPDATE_TIMESTAMP", "TIMESTAMP") \
    .addColumn("SRC", "STRING") \
    .addColumn("PRICE", "DOUBLE") \
    .addColumn("QUANTITY", "DOUBLE") \
    .addColumn("TYPE", "STRING") \
    .addColumn("DATE", "DATE", generatedAlwaysAs="CAST(UPDATE_TIMESTAMP AS DATE)") \
    .addColumn("HOUR", "INT", generatedAlwaysAs="HOUR(UPDATE_TIMESTAMP)") \
    .addColumn("MINUTE", "INT", generatedAlwaysAs="MINUTE(UPDATE_TIMESTAMP)") \
    .addColumn("SECOND", "INT", generatedAlwaysAs="SECOND(UPDATE_TIMESTAMP)") \
    .partitionedBy("DATE", "HOUR") \
    .execute()
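For clarity, the four generated columns simply derive date and time parts from UPDATE_TIMESTAMP. A minimal standard-library sketch of what Delta computes at write time (so inserts never need to supply these columns explicitly):

```python
# What the generated columns compute, expressed with the standard library.
# Delta derives these from UPDATE_TIMESTAMP on write; the sample timestamp
# below is made up for illustration.
from datetime import datetime

ts = datetime(2023, 6, 10, 11, 12, 45)
row = {
    "DATE": ts.date().isoformat(),   # CAST(UPDATE_TIMESTAMP AS DATE)
    "HOUR": ts.hour,                 # HOUR(UPDATE_TIMESTAMP)
    "MINUTE": ts.minute,             # MINUTE(UPDATE_TIMESTAMP)
    "SECOND": ts.second,             # SECOND(UPDATE_TIMESTAMP)
}
print(row)  # {'DATE': '2023-06-10', 'HOUR': 11, 'MINUTE': 12, 'SECOND': 45}
```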
In Power BI we used DirectQuery with dynamic M parameters (Date, Hour, and Minute) to try to limit the number of records fetched by a query. We have around 30 million rows per hour, or around 0.5 million per minute, when counting with:
df = spark.sql("SELECT * FROM LH.FACT_TABLE WHERE DATE = '2023-06-10' AND HOUR = 11 AND MINUTE = 12")
df.count()
from a notebook. The result comes back fairly quickly (in under 4 minutes).
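As a rough sanity check on these figures (assuming a uniform event rate, which real data certainly is not), the per-hour and per-minute counts are consistent, and they imply the table spans a few years of data:

```python
# Rough arithmetic check of the row-rate figures quoted above.
# Assumes a uniform rate, which is an idealization.
rows_per_hour = 30_000_000
rows_per_minute = rows_per_hour // 60
print(rows_per_minute)  # 500000, matching the ~0.5 million quoted

total_rows = 1_000_000_000_000  # ~1 trillion
hours_of_data = total_rows / rows_per_hour
print(round(hours_of_data / 24 / 365, 1))  # ~3.8 years of data at that rate
```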
When we try to run the same SQL query from the SQL endpoint of the Lakehouse, the result is never returned, even after waiting several hours, as if the SQL query is not doing any partition pruning.
Since we chose to connect Power BI to the SQL endpoint, we are unable to report on that table.
Is there anyone out there dealing with 1 trillion row tables in Power BI? If yes, were you able to report on the table, or are you facing the same problem as me?
Thanks
Hi @mkj1213 ,
Thanks for reaching out to our community.
First, check partition pruning. Please ensure that the SQL endpoint is correctly configured and supports partition pruning, and explicitly filter on the partition columns in your SQL query so it can leverage the partitions.
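To make the advice concrete, here is a minimal pure-Python sketch of what partition pruning means (an illustration only; Delta and the Fabric SQL endpoint do this internally when the filter predicate matches the partition columns). The file and partition names are made up:

```python
# Toy model of a partitioned table: data files grouped by their
# (DATE, HOUR) partition key. A query that filters on the partition
# columns only needs to read the files in the matching partition.
partitions = {
    ("2023-06-10", 10): ["file_a.parquet", "file_b.parquet"],
    ("2023-06-10", 11): ["file_c.parquet"],
    ("2023-06-11", 11): ["file_d.parquet"],
}

def files_to_scan(date, hour):
    # With pruning: look up only the partition matching the predicate,
    # instead of scanning every file in the table.
    return partitions.get((date, hour), [])

print(files_to_scan("2023-06-10", 11))  # ['file_c.parquet']
```

If the endpoint were scanning every partition regardless of the filter (the behavior the question describes), it would be the equivalent of reading all four files here for every query.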
Also consider using aggregate queries or precomputed views to reduce the amount of data being queried. For example, create daily or hourly aggregate tables and query those aggregates in Power BI.
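The pre-aggregation idea can be sketched in a few lines: collapsing per-minute rows into one row per (date, hour) bucket cuts the row count by up to 60x before Power BI ever sees the data. The sample rows below are made up, with columns mirroring the fact table:

```python
# Toy illustration of pre-aggregation: per-event rows rolled up into one
# bucket per (date, hour). The data is fabricated for illustration.
from collections import defaultdict

fact_rows = [
    # (DATE, HOUR, MINUTE, QUANTITY)
    ("2023-06-10", 11, 12, 100.0),
    ("2023-06-10", 11, 13, 250.0),
    ("2023-06-10", 12, 0, 75.0),
]

hourly = defaultdict(float)
for date, hour, minute, qty in fact_rows:
    hourly[(date, hour)] += qty  # one aggregate row per (date, hour)

print(dict(hourly))
# {('2023-06-10', 11): 350.0, ('2023-06-10', 12): 75.0}
```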
Hope it helps.
Best Regards,
Stephen Tao
If this post helps, then please consider Accepting it as the solution to help other members find it more quickly.
Thanks for your reply Stephen,
1. How am I supposed to check whether the SQL endpoint is correctly configured and supports partition pruning? We are using the SQL endpoint that Fabric provides when creating a Lakehouse.
2. We had already tried querying using only the partition columns, and I am not getting any data at all.
3. We had already created daily and hourly aggregate tables.
Thanks in advance