Hi
I have to decide between two routes.
I have a Python script that paginates through and pulls a JSON API response, then uses that response to filter another downloaded CSV file; both are saved to a file on my computer.
Sometimes, if I load a year's worth of data from both the API response and the CSV file, the combined size exceeds 2 GB and I get an error message.
I was wondering if anyone had advice on how to deal with this.
------------------------------
I also had another idea of running the Python script inside Power Query. Since I can't use some features (such as GUIs) in that version of the script, it runs slower, but it does output a merged dataframe for me. However, it runs into the same issue where the data size can get too large:
let
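// The renamed bindings below avoid a cyclic reference to the StartDate/EndDate parameters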
StartDateText = Text.From(StartDate),
EndDateText = Text.From(EndDate),
pyCode = Text.Combine({
"import requests",
"import aiohttp",
"import asyncio",
"import pandas as pd",
"import zipfile",
"import tempfile",
"import os",
"from datetime import datetime, timedelta",
"",
"start_date_str = '" & StartDateText & "'",
"end_date_str = '" & EndDateText & "'",
"start_date = datetime.strptime(start_date_str, '%Y-%m-%d')",
"end_date = datetime.strptime(end_date_str, '%Y-%m-%d')",
"",
"PATIENTPROBLEMDATA_ZIP_URL = 'https://www.accessdata.fda.gov/MAUDE/ftparea/patientproblemcode.zip'",
"",
"async def fetch_day_data(session, date_str, retries=3):",
" url = f'https://api.fda.gov/device/event.json?search=date_received:[{date_str}+TO+{date_str}]&limit=1000'",
" for attempt in range(retries):",
" try:",
" results_df = pd.DataFrame()",
" async with session.get(url, headers={'Accept-Encoding': 'gzip'}) as response:",
" if response.status == 200:",
" data = await response.json()",
" if 'results' in data:",
" results_df = pd.DataFrame(data['results'])",
" next_link = response.headers.get('Link')",
" while next_link:",
" next_link = next_link.split(',')[0].split(';')[0].strip('<>')",
" async with session.get(next_link, headers={'Accept-Encoding': 'gzip'}) as next_response:",
" if next_response.status != 200:",
" break",
" next_data = await next_response.json()",
" if 'results' in next_data:",
" new_df = pd.DataFrame(next_data['results'])",
" results_df = pd.concat([results_df, new_df], ignore_index=True)",
" next_link = next_response.headers.get('Link')",
" api_columns_to_keep = ['mdr_report_key', 'event_type', 'date_received', 'product_code', 'mdr_text']",
" results_df = results_df[[col for col in api_columns_to_keep if col in results_df.columns]]",
" return results_df",
" except Exception as e:",
" if attempt < retries - 1:",
" await asyncio.sleep(2 ** attempt)",
" else:",
" return pd.DataFrame({'error': [f'{date_str} failed: {e}']})",
" return pd.DataFrame()",
"",
"async def fetch_all_dates(start_date, end_date):",
" date_list = [(start_date + timedelta(days=i)).strftime('%Y%m%d') for i in range((end_date - start_date).days + 1)]",
" results = []",
" async with aiohttp.ClientSession() as session:",
" tasks = [fetch_day_data(session, date) for date in date_list]",
" for task in asyncio.as_completed(tasks):",
" df = await task",
" if not df.empty:",
" results.append(df)",
" return pd.concat(results, ignore_index=True) if results else pd.DataFrame()",
"",
"def merge_zip_with_api(api_df):",
" expected_column = 'MDR_REPORT_KEY'",
" if 'mdr_report_key' not in api_df.columns:",
" return pd.DataFrame({'error': ['Missing mdr_report_key in API data']})",
" try:",
" valid_keys = set(api_df['mdr_report_key'])",
" chunks = []",
" with tempfile.NamedTemporaryFile(delete=False, suffix='.zip') as tmp_zip:",
" response = requests.get(PATIENTPROBLEMDATA_ZIP_URL, stream=True, headers={'User-Agent': 'Mozilla/5.0'}, timeout=30)",
" tmp_zip.write(response.content)",
" zip_path = tmp_zip.name",
" with zipfile.ZipFile(zip_path, 'r') as z:",
" for fname in z.namelist():",
" if fname.endswith('.txt'):",
" with z.open(fname) as f:",
" for chunk in pd.read_csv(f, delimiter='|', dtype=str, chunksize=50000):",
" if expected_column in chunk.columns:",
" matched = chunk[chunk[expected_column].isin(valid_keys)]",
" if not matched.empty:",
" zip_columns_to_keep = ['MDR_REPORT_KEY', 'PATIENT_PROBLEM_CODE', 'PATIENT_PROBLEM_DESC', 'product_problems', 'mdr_text']",
" matched = matched[[col for col in zip_columns_to_keep if col in matched.columns]]",
" chunks.append(matched)",
" os.remove(zip_path)",
" if chunks:",
" zip_df = pd.concat(chunks, ignore_index=True)",
" return pd.merge(api_df, zip_df, how='inner', left_on='mdr_report_key', right_on='MDR_REPORT_KEY')",
" return pd.DataFrame()",
" except Exception as e:",
" return pd.DataFrame({'error': [f'ZIP merge error: {e}']})",
"",
"try:",
" api_df = asyncio.run(fetch_all_dates(start_date, end_date))",
" if not api_df.empty and 'error' not in api_df.columns:",
" output_df = merge_zip_with_api(api_df)",
" else:",
" output_df = api_df if not api_df.empty else pd.DataFrame({'error': ['No API data retrieved']})",
"except Exception as e:",
" output_df = pd.DataFrame({'error': [f'Execution error: {e}']})",
"",
"output_df"
}, "#(lf)"),
Source = Python.Execute(pyCode, null)
in
Source{[Name="output_df"]}[Value]
Hello @Bobster326
The core issue here is hitting Power BI’s 2GB data/model limit when dealing with large API and CSV merges. Here's a short solution recommendation:
Use your Python script outside of Power BI to:
1. Fetch and merge the API and ZIP data.
2. Save the final output to a Parquet file (much smaller and optimized for Power BI).
3. Import the Parquet file into Power BI.
This avoids memory issues, speeds up refreshes, and keeps your transformations efficient. Avoid running large async or ZIP-handling scripts inside Power Query; they're not optimized for that scale.
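Here is a minimal sketch of that external flow, assuming the fetch_all_dates() and merge_zip_with_api() functions from the script above live in a standalone .py file (the date range and output path are placeholders, not values from the original post):

# Minimal sketch: run the existing fetch/merge outside Power BI, then write Parquet.
# Assumes fetch_all_dates() and merge_zip_with_api() from the script above are defined here.
import asyncio
from datetime import datetime

start_date = datetime.strptime('2024-01-01', '%Y-%m-%d')   # placeholder date range
end_date = datetime.strptime('2024-12-31', '%Y-%m-%d')

api_df = asyncio.run(fetch_all_dates(start_date, end_date))
output_df = merge_zip_with_api(api_df)

# Parquet is columnar and compressed, so the file on disk is far smaller than the
# equivalent CSV. Writing it requires pyarrow or fastparquet to be installed.
output_df.to_parquet(r'C:\data\maude_merged.parquet', index=False)

Power BI can then import that file with the Parquet connector (Parquet.Document in Power Query), so the heavy fetch/merge never runs during a report refresh.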
Thanks
If this solution helps, please accept it and give kudos (Like); it would be greatly appreciated.
"The core issue here is hitting Power BI's 2GB data/model limit"
Do you have a citation/link for that?
Hi @lbendlin, what do you mean by parsing the JSON response in Power Query? Do you mean having the data frames load into Power Query and then conducting these transformations?
I think it's just because I am looking at filtering the results of the other downloaded CSV file by matching the MDR report key.
Hi @Bobster326 ,
Thank you for reaching out to Microsoft Fabric Community.
Thank you @lbendlin for the prompt response.
In addition to @lbendlin's points:
You are running into Power BI's file and memory limitations, especially when dealing with large datasets (>2GB). There are a couple of options, as suggested by @lbendlin:
Parsing the JSON directly in Power Query or storing the data in a SQL database are both solid options.
However, given the 2GB+ data size, I would suggest:
1. Using Python externally to fetch/merge the data.
2. Saving it to a SQL database or a Parquet file.
3. Loading the result into Power BI - this avoids memory and refresh limits while keeping performance smooth.
Parsing JSON in Power Query works but may be slower and hit memory issues if the dataset is large or heavily paginated.
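For the SQL-database option in step 2, a minimal sketch using a local SQLite file and pandas' to_sql could look like this (the database path, table name, and filter value are placeholders):

# Minimal sketch: persist the merged result in a local SQLite database so that
# Power BI only has to pull the subset it needs. Assumes output_df is the merged
# DataFrame produced by the script above.
import sqlite3
import pandas as pd

con = sqlite3.connect(r'C:\data\maude.db')

# Write in chunks so the whole frame is never serialized in one go.
output_df.to_sql('maude_events', con, if_exists='replace', index=False, chunksize=50000)

# Example downstream query: only the columns/rows that are actually needed.
subset = pd.read_sql_query(
    "SELECT mdr_report_key, event_type, date_received, PATIENT_PROBLEM_CODE "
    "FROM maude_events WHERE event_type = 'Injury'",
    con,
)
con.close()

Power BI would then connect to that database (for SQLite, typically through an ODBC driver) and import only the filtered subset instead of the full 2GB+ merge.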
If this post helps, then please consider accepting it as the solution to help other members find it more quickly.
Thank you.
Why not parse the JSON directly in Power Query? Or stuff it into a SQL database and then run SQL queries against that?