
Power BI Gen1 Dataflow Fails with 3M+ Web API Rows – How to Structure for Success?

I'm trying to load large datasets (over 3 million rows) from a Web API into Power BI.

  1. I'm currently testing with Dataflow Gen1.
  2. The workspace is Premium, with Fabric capacity (SKU: F64) and Small semantic model storage format.

Current Setup:

For entities with less than 1 million rows:

  • Each Dataflow Gen1 contains one query.
  • The M code includes:
    1. Token refresh
    2. Pagination using $filter and $skip (10,000 rows per page)
    3. List combine
    4. Conversion to table
  • Each Dataflow refreshes once daily, with different refresh times.
  • For queries close to 1 million rows, refresh occasionally fails. To mitigate this, I increased the page size to 100,000 rows and am monitoring the results.

For entities with more than 3 million rows:

  • The structure is the same, but refresh fails consistently.
  • The error message is: "We reached the end of the buffer", which seems to be a memory or capacity issue.
  • As a workaround, I'm splitting the query to keep each under ~1 million rows and attempting incremental refresh in Dataflow Gen1.

I'm looking for guidance on how to structure and implement large-scale Web API ingestion (3M+ rows) using OData in Power BI, especially with Gen1. Any best practices, architectural suggestions, or technical tips would be greatly appreciated.

Example of the M code for a Dataflow Gen1 query:

let
 
    // 1. Post to get Access Token
    TokenResponse = Json.Document(Web.Contents("https://..oauth2",
        [
            Headers = [
                #"Content-Type" = "application/x-www-form-urlencoded",
                VariableHeader = Text.From(Time.Hour(DateTime.LocalNow())) // to avoid token caching; I think it's not necessary.
            ],
            Content = Text.ToBinary("grant_type=...&client_id=...&client_secret=...&scope=...")
        ]
    )),
  AccessToken = TokenResponse[access_token],
 
  // 2. API get call
  BaseUrl = "https://...",
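  // Helper: requests one page of up to $top rows starting at the given $skip offset;
  // returns an empty list once the API has no more rows.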
  GetPagedData = (skip as number) =>
        let          
            RelativePath = "...",
            Query = [
                #"APIKey" = "...",
                #"$top" = "100000",
                #"$skip" = Number.ToText(skip),
                #"$filter" = "...",
                #"$select" = "..."
            ],
            Response = Json.Document(Web.Contents(BaseUrl,
                [
                    RelativePath = RelativePath,
                    Query = Query,
                    Headers = [
                        #"Authorization" = "Bearer " & AccessToken,
                        ...
                    ]
                ]
            )),
            Page = if Record.HasFields(Response, "value") and List.Count(Response[value]) > 0 then
                Response[value]
            else
                {}
        in
            Page,
  // 3. Fetch pages
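  // List.Generate starts at skip = 0 and keeps requesting the next page
  // (skip = PageSize, 2 * PageSize, ...) until a page comes back empty.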
  PageSize = 100000,

  FetchPages =
      List.Generate(
          () => [i = 0, result = GetPagedData(0)],
          each List.Count([result]) > 0,
          each [i = [i] + 1, result = GetPagedData(([i] + 1) * PageSize)],
          each [result]
      ),

  Combined = List.Combine(FetchPages),
  PageCount = List.Count(FetchPages), // for checking the page count
  RowCount = List.Count(Combined),    // for checking the row count
 
  // 4. Convert to table
  #"Converted to Table" = Table.FromList(Combined, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
  #"Expanded Column1" = Table.ExpandRecordColumn(#"Converted to Table", "Column1", {"..."}),
  #"Changed column type" = Table.TransformColumnTypes(#"Expanded Column1", {{"", type datetime}})
in
  #"Changed column type"
 

4 REPLIES
v-ssriganesh
Community Support

Hi @CrouchingTiger,

Thank you for posting your query in the Microsoft Fabric Community Forum, and thanks to @ibarrau for sharing valuable insights.

 

Could you please confirm whether your query has been resolved by the provided solutions? This would be helpful for other members who encounter similar issues.

 

Thank you for being part of the Microsoft Fabric Community.

ibarrau
Super User

Hi. I don't think you can scale with Dataflows Gen1. I would strongly suggest doing it in a Fabric Notebook or a Data Pipeline to handle API sources (that's the best practice). You could even try Dataflow Gen2 just in case (copy and paste the code). For any of those you will need storage (a lakehouse or warehouse) to build and store the table.

The M code looks good. I don't think you can improve it much more. If incremental refresh is not helping, then you need to accept that this architecture won't scale. That's why you need to use another tool.

I hope that makes sense.


If this post helps, then please consider accepting it as the solution to help other members find it more quickly.

Happy to help!

LaDataWeb Blog

Hi, @ibarrau 

Thank you for your response.


I’ve confirmed that the data loads correctly in Dataflow Gen2, but I still need to test whether the refresh and incremental settings are functioning properly.

 

Moving forward, I'll need to ingest data not only from APIs but also from sources like SAP HANA. I'd appreciate your guidance on the most appropriate architectural approach to handle these diverse data sources efficiently.

I'm glad it worked. The most conventional tool for moving data is a data pipeline inside Fabric. Dataflow Gen2 can be used for sources that are not listed among the pipeline connectors, or for sources that support Fast Copy. The idea is to move the data to a lakehouse without any transformation. Later you can transform the data with notebooks (Python or SQL) or with Dataflows Gen2.

Cloud API sources can also be gathered directly with notebooks and Python code. Some APIs can be tricky, and running code in the cloud is the best way to control them. Of course, store the result in a lakehouse.

You can read about the medallion architecture to get a better understanding of the flow of data from the source to the report.

I hope that makes sense.


If this post helps, then please consider accepting it as the solution to help other members find it more quickly.

Happy to help!

LaDataWeb Blog
