Raw Dataflow-Computed Dataflow-Dataset Best Practi...

LasseMr · ‎11-04-2021

Dear Community,

After having read countless pages on the subject and been in dialogue with MS Support I still fail to get a good answer on what is the "best" setup.

We are running;

Embedded premium capacity Gen2, A2
Source: Navision 2018 (ERP system) SQL database via gateway
Workspace is updated to latest version
Workspace set to large dataset storage format
Storage connected to Azure Data Lake Gen2 account

Current solution;

Raw Dataflow with incremental refresh on transactional data:
1. Example case: Table Value Entries with incremental refresh controlled on Posting DateTime (DateTime is standard Navision field type, and a prerequisite for incremental refresh, actual data has no "time" part, only dates)
2. Purpose: getting data continuously stored in Datalake for analytical purposes as an modern and cost effective alternative to regular SQL DW
3. Amount of data: Value Entries: approx. 18.000.000 rows with approx. 175.000 new rows per month
4. Refresh time: 1-2 minutes (very satisfactory considering the amount of data)
5. Content: Many other transactional data than Value Entries and various dimensional data with full load.
Computed Dataflow with incremental refresh on transactional data:
1. Example case: Inventory Transactions Monthly Buckets
2. Purpose: Aggregating Value Entry data to monthly buckets to ease dataset data amount and PQ+DAX workload
3. Approach: End of Month on Date and Group By by Item, End of Month and Sum of Quantity
4. Incremental Refresh on Date with 45 days, ensuring that last month is always updated when/if inventory adjustments are performed for previous period
5. Refresh time: initial full load 2 hours, continuous daily refresh 20 minutes (considering how simple an aggreation it is, the high refresh time is concerning)
6. Content: Currently only this one case
7. Extra challenge:
  1. CPU load is spinning to high level (100+%)
  2. Initial load tends to leave the CPU load stuck at higher than 100% after completion, and Gen2 does not allow restart of capacity which leaves to only resolution to activate another capacity and move workspaces to the other capacity (see example further down where we have needed to move the workspaces various times to normalize operations - the original capacities doesn't go down from 100% before after a few days, MS support can't explain why).
Datasets consuming data from Raw Dataflow and Computed Dataflow
1. Example is a Inventory Analysis dataset, calculating the inventory at end of each month
2. Won't go in detail how this is setup as main challenge is on the dataflows right now, but here we are also struggling on long refresh time (approx. 1,5 hours) and high memory consumption
3. Convinced that if we manage to prepare the data better at dataflow level and with some optimization on the dataset PQ and DAX then this will not be an issue.

Considering that the amount of data is really not that large, I am concerned that we are facing these issues already.

Any ideas on what could be improved from an architectural and technical point of view?

lbendlin · ‎11-07-2021

Dataflows are good at raw storage, similar to Parquet blobls or Hadoop. They suck at anything requires computation, including incremental refresh. If your data source has an ok spooling performance there is no need for dataflows at all. (And don't get me started with Direct Query against dataflows).

If you must, feed your dataset from the raw dataflow and do the computations on the dataset side. Implement incremental refresh in your dataset, and do the partition management yourself.