Solved: Understanding the bucket size setting for the incr...

pmscorca · ‎10-01-2024

Hi,

I'm trying the new incremental refresh feature for a dataflow gen2.

I'd like to understand better about how using the bucket size setting.

As a first thing, is it related to the "DateTime column to filter by" or to the "Only extract new data when the maximum value in this column changes"?

Choosing a bucket size equals to a month, which is the resulting bucket range and respect to which date?

Thanks

Anonymous · ‎10-20-2024

Hi @pmscorca ,

Thanks for sharing your incremental refresh settings. Let's break down your points to clarify the concepts further.

Monthly Bucket and Date Column to Filter By: You're correct that a monthly bucket means the data will be partitioned by month based on the "orderdate" column. If you have orders only in August, the partition will be for August. However, if you have orders in August, September, and October, there will be three partitions, one for each month. The "Extract data from the past" option is indeed relevant for the initial load, ensuring that historical data is included based on the specified range.
Extract Data from the Past: This setting is primarily for the initial load. For subsequent loads, the data will be loaded based on the current month and moving forward. For example, in October, it will load October's orders, in November, it will load November's orders, and so on. The order date will determine which monthly bucket the data falls into, such as an order with the date 10/20/2024 falling into the October bucket.
Maximum Value in Modified Date Column: This setting ensures that if any data within a monthly bucket changes, the entire bucket is refreshed. So, if an order within a monthly bucket is modified, the entire bucket for that month will be refreshed, potentially replacing unmodified orders if any changes are detected within that bucket.

Best Regards,
Adamk Kong

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly

View solution in original post

Hdgzfujg · ‎10-02-2024

Jcxhbb the other one is dgghfgjj the same time as well as

Anonymous · ‎10-01-2024

Hi @pmscorca ,

Depending on your settings:
Setting the bucket size to months means that the data will be partitioned by month based on the orderdate column. Since you are extracting data from the last 2 months, the dataflow will create buckets for each month within those 2 months.

For example, if today is October 2, 2024, the dataflow will create buckets for the following months:

August 2024: start date: August 1, 2024, end date: August 31, 2024
September 2024: start date: September 1, 2024, end date: September 30, 2024

These buckets will be used for incremental refreshes. The dataflow will check the modifieddate column to determine if the data within these buckets has changed. If the maximum value in the modifieddate column changes, the dataflow will refresh the entire bucket for that month.

Best Regards,
Adamk Kong

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

pmscorca · ‎10-02-2024

Hi, thanks for your reply. Below I've reported my incremental refresh settings.

The subject isn't very clear for me, yet.

A monthly bucket means that the data will be partitioned by month based on the orderdate column, namely the "date column to filter by". So, considering the two rows to import, I expected that the monthly bucket partitions were 3, August, September and October starting from the first order date (08/10/2024) and comprising the actual calendar month or only 1 partition because the orders are only at August; but it occurs to consider also what specified for the "Extract data from the past" options (e.g. choosing months should be coherent with the bucket size).
Extract data from the past" options should be valid only for the first load and not for the next ones: for October, November, December and so on it should be loaded the orders of October, then the orders of November and next the orders of December, evaluating the order date respect to the range of a monthly bucket partition (e.g. the order with the order date 10/20/2024 falls into the October bucket partition), isn't it?

Moreover, it is very interesting this concept "If the maximum value in the modifieddate column changes, the dataflow will refresh the entire bucket for that month": so an unmodified order in the destination could be deleted and replaced if at least another order is changed in the same monthly bucket partition, namely the order dates of the two orders fall into this partition.

Ultimately:

- the "date column to filter by" allows to establish which data to consider for each load respect to a date,

- the "bucket size" allows to partition the data for each load respect to the previous date following a specified size. Each load consists of getting a next bucket partition from the source, not present in the destination,

- the "only extract new data when the maximum value in this column changes" allows to check the modified date for the data in bucket partition of the source with the corresponding modified date for the existing data in the same bucket partition of the destination; if the modified date in the source bucket partition is changed then this source partition must replace the destination bucket partition.
The first or initial bucket partition to get from the source should be the first one deriving from what specified for "extract data from the past" settings.

Thanks

Anonymous · ‎10-20-2024

Hi @pmscorca ,

Thanks for sharing your incremental refresh settings. Let's break down your points to clarify the concepts further.

Monthly Bucket and Date Column to Filter By: You're correct that a monthly bucket means the data will be partitioned by month based on the "orderdate" column. If you have orders only in August, the partition will be for August. However, if you have orders in August, September, and October, there will be three partitions, one for each month. The "Extract data from the past" option is indeed relevant for the initial load, ensuring that historical data is included based on the specified range.
Extract Data from the Past: This setting is primarily for the initial load. For subsequent loads, the data will be loaded based on the current month and moving forward. For example, in October, it will load October's orders, in November, it will load November's orders, and so on. The order date will determine which monthly bucket the data falls into, such as an order with the date 10/20/2024 falling into the October bucket.
Maximum Value in Modified Date Column: This setting ensures that if any data within a monthly bucket changes, the entire bucket is refreshed. So, if an order within a monthly bucket is modified, the entire bucket for that month will be refreshed, potentially replacing unmodified orders if any changes are detected within that bucket.

Best Regards,
Adamk Kong

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly

LuitwielerMSFT · ‎10-01-2024

Hey pmscorca!

I see you have some questions about buckets and how to use them.

For incremental refresh you need to configure a couple of settings to get going.

First setting is to set the column you want to use to split the data into buckets with.("DateTime column to filter by") So in a case you have a orders table, you a probably want to use the order date.

With the buckets settings you can now define the range of data you want in scope. After the time range you decide how to split it into multiple buckets and chose either day, week, month, quarter or year.

For "Only extract new data when the maximum value in this column changes" is the column we use to detect if there is new data available for a specific bucket and line it up for refresh.

To learn more we have a detailed explaination here:

Incremental refresh in Dataflow Gen2 - Microsoft Fabric | Microsoft Learn

pmscorca · ‎10-01-2024

Hi, thanks for your reply. I've opened this post after reading the related documentation and doing some proofs.

I'd like to understand better this feature.

I need to import some excel files as a source in order to write the related data in a Fabric warehouse.

As a first load, I've these data:

two orders at August 10th.

Inside the dataflow gen 2 I have these settings:
- Date to filter by = orderdate,

- Extract data from the past = 2 months,

- Bucket size = month,

- Only extract new data when the maximum value in this column changes = modifieddate.

Which are the splitted bucket ranges respect to the order date, in terms of start date and end date?

Thanks

lbendlin · ‎10-01-2024

Excel files are not a foldable source. Incremental refresh should only be used for large data, buckets should have at least 2 Million rows for this to make sense.

pmscorca · ‎10-01-2024

Hi, I'm trying with few rows to understand better the mechanism and however I've no issues using excel files.

Understanding the bucket size setting for the incremental refresh in dataflow gen2

Helpful resources

Fabric Monthly Update - September 2025

FabCon is coming to Atlanta

Understanding the bucket size setting for the incremental refresh in dataflow gen2

Helpful resources

Fabric Monthly Update - September 2025