
Power BI Gen1 Dataflow Fails with 3M+ Web API Rows – How to Structure for Success?

I'm trying to load large datasets (over 3 million rows) from a Web API into Power BI.

  1. I'm currently testing with Dataflow Gen1.
  2. The workspace is Premium, with Fabric capacity (SKU: F64) and Small semantic model storage format.

Current Setup:

For entities with less than 1 million rows:

  • Each Dataflow Gen1 contains one query.
  • The M code includes:
    1. Token refresh
    2. Pagination using $filter and $skip (10,000 rows per page)
    3. List combine
    4. Conversion to table
  • Each Dataflow refreshes once daily, with different refresh times.
  • For queries close to 1 million rows, refresh occasionally fails. To mitigate this, I increased the page size to 100,000 rows and am monitoring the results.

For entities with more than 3 million rows:

  • The structure is the same, but refresh fails consistently.
  • The error message is: "We reached the end of the buffer", which seems to be a memory or capacity issue.
  • As a workaround, I'm splitting the query to keep each under ~1 million rows and attempting incremental refresh in Dataflow Gen1.

I'm looking for guidance on how to structure and implement large-scale Web API ingestion (3M+ rows) using OData in Power BI, especially with Gen1. Any best practices, architectural suggestions, or technical tips would be greatly appreciated.

Example of the M code for a Dataflow Gen1 query:

let
 
    // 1. Post to get Access Token
    TokenResponse = Json.Document(Web.Contents("https://..oauth2",
        [
            Headers = [
                #"Content-Type" = "application/x-www-form-urlencoded",
                VariableHeader = Text.From(Time.Hour(DateTime.LocalNow())) // to avoid token caching; I think it's not necessary.
            ],
            Content = Text.ToBinary("grant_type=...&client_id=...&client_secret=...&scope=...")
        ]
    )),
  AccessToken = TokenResponse[access_token],
 
  // 2. API get call
  BaseUrl = "https://...",
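  // Helper: requests one page of up to $top rows starting at the given $skip offset;
  // returns an empty list once the API has no more rows.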
  GetPagedData = (skip as number) =>
        let          
            RelativePath = "...",
            Query = [
                #"APIKey" = "...",
                #"$top" = "100000",
                #"$skip" = Number.ToText(skip),
                #"$filter" = "...",
                #"$select" = "..."
            ],
            Response = Json.Document(Web.Contents(BaseUrl,
                [
                    RelativePath = RelativePath,
                    Query = Query,
                    Headers = [
                        #"Authorization" = "Bearer " & AccessToken,
                        ...
                    ]
                ]
            )),
            Page = if Record.HasFields(Response, "value") and List.Count(Response[value]) > 0 then
                Response[value]
            else
                {}
        in
            Page,
  // 3. Fetch pages
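  // List.Generate starts at skip = 0 and keeps requesting the next page
  // (skip = PageSize, 2 * PageSize, ...) until a page comes back empty.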
  PageSize = 100000,

  FetchPages =
      List.Generate(
          () => [i = 0, result = GetPagedData(0)],
          each List.Count([result]) > 0,
          each [i = [i] + 1, result = GetPagedData(([i] + 1) * PageSize)],
          each [result]
      ),

  Combined = List.Combine(FetchPages),
  PageCount = List.Count(FetchPages), // for checking the page count
  RowCount = List.Count(Combined),    // for checking the row count
 
  // 4. Convert to table
  #"Converted to Table" = Table.FromList(Combined, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
  #"Expanded Column1" = Table.ExpandRecordColumn(#"Converted to Table", "Column1", {"..."}),
  #"Changed column type" = Table.TransformColumnTypes(#"Expanded Column1", {{"", type datetime}})
in
  #"Changed column type"
 

4 REPLIES
v-ssriganesh
Community Support

Hi @CrouchingTiger,

Thank you for posting your query in the Microsoft Fabric Community Forum, and thanks to @ibarrau for sharing valuable insights.

 

Could you please confirm whether your query has been resolved by the provided solutions? This would be helpful for other members who encounter similar issues.

 

Thank you for being part of the Microsoft Fabric Community.

ibarrau
Super User

Hi. I don't think you can scale with Dataflows Gen1. I would strongly suggest doing it in a Fabric Notebook or a Data Pipeline to handle API sources (that's the best practice). You could even try Dataflow Gen2 just in case (copy and paste the code). For any of those you will need storage (a lakehouse or warehouse) to build and store the table.

The M code looks good. I don't think you can improve it much more. If incremental refresh is not helping, then you need to accept that this architecture won't scale. That's why you need to use another tool.

I hope that makes sense.


If this post helps, then please consider accepting it as the solution to help other members find it more quickly.

Happy to help!

LaDataWeb Blog

Hi, @ibarrau 

Thank you for your response.


I’ve confirmed that the data loads correctly in Dataflow Gen2, but I still need to test whether the refresh and incremental settings are functioning properly.

 

Moving forward, I'll need to ingest data not only from APIs but also from sources like SAP HANA. I'd appreciate your guidance on the most appropriate architectural approach to handle these diverse data sources efficiently.

I'm glad it worked. The most conventional tool for moving data is a data pipeline inside Fabric. Dataflow Gen2 can be used for sources that are not listed among the pipeline connectors, or for sources that support Fast Copy. The idea is to move the data to a lakehouse without any transformation. Later you can transform the data with notebooks (Python or SQL) or with Dataflows Gen2.

Cloud API sources can also be gathered directly with notebooks and Python code. Some APIs can be tricky, and running code in the cloud is the best way to control them. Of course, store the result in a lakehouse.

You can read about the medallion architecture to get a better understanding of the flow of data from the source to the report.

I hope that makes sense.


If this post helps, then please consider accepting it as the solution to help other members find it more quickly.

Happy to help!

LaDataWeb Blog
