Hello, I am currently building a pipeline for a dashboard. I'm new to dataflows and just had a few questions regarding incremental refreshes and pipelines in general.
For context, I am working with three tables; the first two get about 1 million and 8 million new rows every day. I want to merge these two tables, unpivot the result, then merge that with the third table (about 10,000 new rows daily). The dashboard should have historical data dating back to January 1st this year.
At the moment, I am connecting to the data warehouse with a SQL statement that filters on WHERE date >= '2023-01-01', but is there a better way to configure this within the dataflow itself?
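To illustrate, this is roughly what I have today, rewritten in the RangeStart/RangeEnd style I have seen used for incremental refresh (the server, table, and column names are placeholders, and I am not sure whether dataflows expose these parameters the same way semantic models do, or whether they apply the window automatically from the incremental refresh settings):

let
    Source = Sql.Database("myserver.database.windows.net", "MyWarehouse"),
    FactSales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
    // RangeStart/RangeEnd would be the DateTime parameters supplied by incremental refresh;
    // filtering on them instead of a hard-coded date should still fold back to the warehouse
    Filtered = Table.SelectRows(FactSales, each [LoadDate] >= RangeStart and [LoadDate] < RangeEnd)
in
    Filtered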
Because the first two data sources are quite large, I think it would be best to transform them one day at a time. However, for the initial refresh, how can I avoid having the dataflow try to process the 40 or so days' worth of data at once?
Also, in terms of the staging dataflows, should they hold all of the historical data or just the new daily data? Intuitively, I think the dataflows should just hold and process the daily data, with the historical data stored in the final dataset that the dashboard uses. In that case, would configuring the incremental refresh to store 1 day and refresh 1 day do the job?
Finally, a question regarding the transformation dataflows: I know that transformation and staging dataflows should be separate, but how granular should the transformations be? For example, would it be advisable to do the merge and the unpivot separately? A rough sketch of what I mean is below.
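To make that concrete, here is a sketch in Power Query M with the merge and the unpivot as separate named steps (the #table literals just stand in for the staging entities, and the column names are made up):

let
    // Placeholder stand-ins for the two staging entities
    StagingA = #table({"Key", "LoadDate", "Amount"}, {{1, #datetime(2023, 2, 1, 0, 0, 0), 10}}),
    StagingB = #table({"Key", "Metric1", "Metric2"}, {{1, 100, 200}}),
    // Step 1: merge the two large tables on their shared key
    Merged = Table.NestedJoin(StagingA, {"Key"}, StagingB, {"Key"}, "B", JoinKind.Inner),
    Expanded = Table.ExpandTableColumn(Merged, "B", {"Metric1", "Metric2"}),
    // Step 2: unpivot the metric columns into attribute/value pairs
    Unpivoted = Table.UnpivotOtherColumns(Expanded, {"Key", "LoadDate", "Amount"}, "Metric", "Value")
in
    Unpivoted

My question is whether these two steps belong in one transformation dataflow or in separate ones.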
Sorry for all the questions, any advice would be much appreciated.
Read about bootstrapping the initial refresh:
Troubleshoot incremental refresh and real-time data in Power BI - Power BI | Microsoft Learn