Good day everyone,
I need to import a big table (billions of rows).
It lives in a Cloudera Impala database and I'm using the native Cloudera ODBC driver to access the data.
The data is "partitioned" daily. Consists of telecoms data for 5 years. The data remains fairly static except for the last 15 days or so.
I have created a view on this big table to join in enrichment information.
For performance and re-use I read it's best to create a dataflow which is then incrementally refreshed every x days. On top of this, one or more semantic models can then be created and incrementally refreshed as needed, and on top of those, thin reports can be created.
This all makes sense.
To test whether the concept would work, I limited the view to 5 days of data on the database side.
My problem now is that when I change the view to include more days and set up incremental refresh for, say, 1 year / 15 days with "detect data changes" on a last-updated column, my refresh fails with out-of-memory errors.
I played around with fewer days and more days; it fails every time.
I checked the job log and it seems it's trying to pull the whole year's data in one single query, not day by day like I would have expected.
Any ideas on how I can pull my big dataset into my dataflow one day at a time for the past 5 years and then only do daily incremental refreshes for the last 15 days thereafter?
Thanks a lot
> I read it's best to create a dataflow which is then incrementally refreshed every x days. On top of this one or more semantic models can then be created and incrementally refreshed as needed.
Where did you read that?
Dataflow incremental refresh is a black box. You cannot bootstrap anything and you cannot refresh individual partitions.
Having both the dataflow and the semantic model do incremental refresh is a waste of time and storage.
Drop the dataflow and load the data directly into the semantic model. Use bootstrapping and XMLA to fill the partitions. Each partition needs to be less than 10 GB and needs to refresh within 5 hours, so aim for a safe maximum of about 5 GB per partition or a 2.5 hour refresh, whichever limit you hit first.
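To make the "fill the partitions" step concrete, below is a minimal sketch of one way to do it with the Power BI enhanced refresh REST API, assuming the model sits on a capacity that exposes the XMLA endpoint and the first (cheap) refresh has already created the partition layout; your 5-day test filter on the Impala view is essentially that bootstrap. After widening the view back to the full history, you can then fill the historical partitions one at a time. The workspace ID, dataset ID, table name, partition names and token below are placeholders, and the real partition names should be read from the model first (for example with SSMS or Tabular Editor over the XMLA endpoint).

```python
import time
import requests

# --- placeholders: replace with your own values --------------------------
ACCESS_TOKEN = "<Azure AD bearer token with Dataset.ReadWrite.All scope>"
WORKSPACE_ID = "<workspace (group) id>"
DATASET_ID   = "<semantic model (dataset) id>"
TABLE_NAME   = "Telecoms"                        # hypothetical table name
PARTITIONS   = ["2021", "2022", "2023", "2024"]  # read the real names from the model

BASE = f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}/datasets/{DATASET_ID}"
HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}"}


def refresh_partition(partition: str) -> str:
    """Start an enhanced refresh for a single partition and return its refresh id."""
    body = {
        "type": "full",
        "commitMode": "transactional",
        # Don't re-apply the incremental refresh policy while manually filling history,
        # otherwise the service may add/drop partitions according to the policy window.
        "applyRefreshPolicy": False,
        "objects": [{"table": TABLE_NAME, "partition": partition}],
    }
    resp = requests.post(f"{BASE}/refreshes", headers=HEADERS, json=body)
    resp.raise_for_status()
    # The asynchronous refresh id is returned in the response headers.
    return (resp.headers.get("x-ms-request-id")
            or resp.headers["Location"].rstrip("/").split("/")[-1])


def wait_for_refresh(refresh_id: str, poll_seconds: int = 60) -> str:
    """Poll the refresh until it reaches a terminal state and return that state."""
    while True:
        resp = requests.get(f"{BASE}/refreshes/{refresh_id}", headers=HEADERS)
        resp.raise_for_status()
        status = resp.json().get("status", "Unknown")
        if status in ("Unknown", "NotStarted", "InProgress"):
            time.sleep(poll_seconds)
            continue
        # Terminal states include Completed, Failed, Disabled and Cancelled.
        print(f"  refresh {refresh_id}: {status}")
        return status


# Fill the historical partitions one at a time so each operation stays well
# under the per-refresh memory, size and duration limits mentioned above.
for partition in PARTITIONS:
    print(f"Refreshing partition {partition} ...")
    rid = refresh_partition(partition)
    wait_for_refresh(rid)
```

Once the history is filled this way, the normal scheduled refresh only has to process the 15-day incremental window, which is where the refresh policy earns its keep.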