Bobster326
New Member

Need help with loading large data volume

Hi 

 

I have to decide between two routes.

 

I have a Python script that paginates through an API and pulls a JSON response. It then uses that JSON response to filter another downloaded CSV file, and both are saved to a file on my computer.

 

Sometimes, when I load a year's worth of data (the API response plus the CSV file), the file size exceeds 2 GB and I get an error message.

 

I was wondering if anyone had advice on how to deal with this.

 

------------------------------

 

I also had another idea: running the Python script inside Power Query. Since I can't use certain features (such as GUIs) in the script there, it runs more slowly, but it does output a merged dataframe for me. However, it runs into the same issue where the data size can get too large.

 

let
    // StartDate and EndDate are date/text query parameters defined outside this query
    StartDateText = Text.From(StartDate),
    EndDateText = Text.From(EndDate),
    pyCode = Text.Combine({
        "import requests",
        "import aiohttp",
        "import asyncio",
        "import pandas as pd",
        "import zipfile",
        "import tempfile",
        "import os",
        "from datetime import datetime, timedelta",
        "",
        "start_date_str = '" & StartDateText & "'",
        "end_date_str = '" & EndDateText & "'",
        "start_date = datetime.strptime(start_date_str, '%Y-%m-%d')",
        "end_date = datetime.strptime(end_date_str, '%Y-%m-%d')",
        "",
        "PATIENTPROBLEMDATA_ZIP_URL = 'https://www.accessdata.fda.gov/MAUDE/ftparea/patientproblemcode.zip'",
        "",
        "async def fetch_day_data(session, date_str, retries=3):",
        "    url = f'https://api.fda.gov/device/event.json?search=date_received:[{date_str}+TO+{date_str}]&limit=1000'",
        "    for attempt in range(retries):",
        "        try:",
        "            results_df = pd.DataFrame()",
        "            async with session.get(url, headers={'Accept-Encoding': 'gzip'}) as response:",
        "                if response.status == 200:",
        "                    data = await response.json()",
        "                    if 'results' in data:",
        "                        results_df = pd.DataFrame(data['results'])",
        "                    next_link = response.headers.get('Link')",
        "                    while next_link:",
        "                        next_link = next_link.split(',')[0].split(';')[0].strip('<>')",
        "                        async with session.get(next_link, headers={'Accept-Encoding': 'gzip'}) as next_response:",
        "                            if next_response.status != 200:",
        "                                break",
        "                            next_data = await next_response.json()",
        "                            if 'results' in next_data:",
        "                                new_df = pd.DataFrame(next_data['results'])",
        "                                results_df = pd.concat([results_df, new_df], ignore_index=True)",
        "                            next_link = next_response.headers.get('Link')",
        "            api_columns_to_keep = ['mdr_report_key', 'event_type', 'date_received', 'product_code', 'mdr_text']",
        "            results_df = results_df[[col for col in api_columns_to_keep if col in results_df.columns]]",
        "            return results_df",
        "        except Exception as e:",
        "            if attempt < retries - 1:",
        "                await asyncio.sleep(2 ** attempt)",
        "            else:",
        "                return pd.DataFrame({'error': [f'{date_str} failed: {e}']})",
        "    return pd.DataFrame()",
        "",
        "async def fetch_all_dates(start_date, end_date):",
        "    date_list = [(start_date + timedelta(days=i)).strftime('%Y%m%d') for i in range((end_date - start_date).days + 1)]",
        "    results = []",
        "    async with aiohttp.ClientSession() as session:",
        "        tasks = [fetch_day_data(session, date) for date in date_list]",
        "        for task in asyncio.as_completed(tasks):",
        "            df = await task",
        "            if not df.empty:",
        "                results.append(df)",
        "    return pd.concat(results, ignore_index=True) if results else pd.DataFrame()",
        "",
        "def merge_zip_with_api(api_df):",
        "    expected_column = 'MDR_REPORT_KEY'",
        "    if 'mdr_report_key' not in api_df.columns:",
        "        return pd.DataFrame({'error': ['Missing mdr_report_key in API data']})",
        "    try:",
        "        valid_keys = set(api_df['mdr_report_key'])",
        "        chunks = []",
        "        with tempfile.NamedTemporaryFile(delete=False, suffix='.zip') as tmp_zip:",
        "            response = requests.get(PATIENTPROBLEMDATA_ZIP_URL, stream=True, headers={'User-Agent': 'Mozilla/5.0'}, timeout=30)",
        "            tmp_zip.write(response.content)",
        "            zip_path = tmp_zip.name",
        "        with zipfile.ZipFile(zip_path, 'r') as z:",
        "            for fname in z.namelist():",
        "                if fname.endswith('.txt'):",
        "                    with z.open(fname) as f:",
        "                        for chunk in pd.read_csv(f, delimiter='|', dtype=str, chunksize=50000):",
        "                            if expected_column in chunk.columns:",
        "                                matched = chunk[chunk[expected_column].isin(valid_keys)]",
        "                                if not matched.empty:",
        "                                    zip_columns_to_keep = ['MDR_REPORT_KEY', 'PATIENT_PROBLEM_CODE', 'PATIENT_PROBLEM_DESC', 'product_problems', 'mdr_text']",
        "                                    matched = matched[[col for col in zip_columns_to_keep if col in matched.columns]]",
        "                                    chunks.append(matched)",
        "        os.remove(zip_path)",
        "        if chunks:",
        "            zip_df = pd.concat(chunks, ignore_index=True)",
        "            return pd.merge(api_df, zip_df, how='inner', left_on='mdr_report_key', right_on='MDR_REPORT_KEY')",
        "        return pd.DataFrame()",
        "    except Exception as e:",
        "        return pd.DataFrame({'error': [f'ZIP merge error: {e}']})",
        "",
        "try:",
        "    api_df = asyncio.run(fetch_all_dates(start_date, end_date))",
        "    if not api_df.empty and 'error' not in api_df.columns:",
        "        output_df = merge_zip_with_api(api_df)",
        "    else:",
        "        output_df = api_df if not api_df.empty else pd.DataFrame({'error': ['No API data retrieved']})",
        "except Exception as e:",
        "    output_df = pd.DataFrame({'error': [f'Execution error: {e}']})",
        "",
        "output_df"
    }, "#(lf)"),
    Source = Python.Execute(pyCode, null)
in
    Source{[Name="output_df"]}[Value]

 

5 REPLIES
pankajnamekar25
Memorable Member

Hello @Bobster326 

The core issue here is hitting Power BI’s 2GB data/model limit when dealing with large API and CSV merges. Here's a short solution recommendation:

Use your Python script outside of Power BI to:

Fetch and merge the API and ZIP data.

Save the final output to a Parquet file (much smaller and optimized for Power BI).

Import the Parquet into Power BI.

This avoids memory issues, speeds up refreshes, and keeps your transformations efficient. Avoid running large async or ZIP-handling scripts inside Power Query; it is not optimized for that scale.
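For illustration, here is a minimal sketch of that external flow, reusing the fetch_all_dates and merge_zip_with_api functions from the script above (the module name, date range, and output path are placeholders, and to_parquet assumes pyarrow or fastparquet is installed):

# run_refresh.py - run outside Power BI, e.g. on a schedule
import asyncio
from datetime import datetime

import pandas as pd

# fetch_all_dates and merge_zip_with_api are the functions from the original
# script, assumed here to be saved in a module named maude_pipeline.py
from maude_pipeline import fetch_all_dates, merge_zip_with_api

START = datetime(2024, 1, 1)                   # placeholder date range
END = datetime(2024, 12, 31)
OUTPUT_PATH = r"C:\data\maude_merged.parquet"  # placeholder path

def main():
    api_df = asyncio.run(fetch_all_dates(START, END))
    merged = merge_zip_with_api(api_df) if not api_df.empty else pd.DataFrame()
    # Parquet is columnar and compressed, so the file on disk is much smaller
    # than the equivalent CSV/JSON dump.
    merged.to_parquet(OUTPUT_PATH, index=False)

if __name__ == "__main__":
    main()

Power BI can then import the file via Get Data > Parquet, or in Power Query with Parquet.Document(File.Contents("C:\data\maude_merged.parquet")).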

Thanks

 Pankaj Namekar | LinkedIn

If this solution helps, please accept it and give a kudos (Like), it would be greatly appreciated.

The core issue here is hitting Power BI’s 2GB data/model limit 

Do you have a citation/link for that?

Bobster326
New Member

Hi @lbendlin, what do you mean by parsing the JSON response in Power Query? Do you mean loading the data frames into Power Query and then conducting these transformations there?

I think the issue is just that I'm filtering the results of the other downloaded CSV file by matching the MDR report key.

v-venuppu
Community Support

Hi @Bobster326 ,

Thank you for reaching out to Microsoft Fabric Community.

Thank you @lbendlin for the prompt response.

In addition to @lbendlin's points:

You are running into Power BI's file and memory limitations, especially when dealing with large datasets (over 2 GB). There are a couple of options, as lbendlin suggested:

Parsing JSON directly in Power Query or storing the data in a SQL database are both solid options.

However, given the 2GB+ data size, I would suggest:

1. Use Python externally to fetch and merge the data.

2. Save the result to a SQL database or a Parquet file.

3. Load that into Power BI. This avoids memory and refresh limits while keeping performance smooth.

Parsing JSON in Power Query works but may be slower and hit memory issues if the dataset is large or heavily paginated.
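If memory during the fetch itself is also a concern, one hedged variation (reusing the fetch_day_data function from the script above; the module name and output folder are placeholders) is to write each day's results to its own Parquet file instead of concatenating everything in memory, and then combine the files with Power BI's Folder connector:

# incremental_fetch.py - sketch: one Parquet file per day keeps memory flat
import asyncio
import os
from datetime import datetime, timedelta

import aiohttp

from maude_pipeline import fetch_day_data  # function from the original script (assumed module name)

OUTPUT_DIR = r"C:\data\maude_daily"  # placeholder folder

async def fetch_incrementally(start_date, end_date):
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    async with aiohttp.ClientSession() as session:
        for i in range((end_date - start_date).days + 1):
            date_str = (start_date + timedelta(days=i)).strftime('%Y%m%d')
            df = await fetch_day_data(session, date_str)
            if not df.empty:
                # Flush each day to disk immediately so only one day's rows
                # are ever held in memory at a time.
                df.to_parquet(os.path.join(OUTPUT_DIR, f'{date_str}.parquet'), index=False)

if __name__ == "__main__":
    asyncio.run(fetch_incrementally(datetime(2024, 1, 1), datetime(2024, 12, 31)))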

 

If this post helps, then please consider Accepting as solution to help the other members find it more quickly.

Thank you.

lbendlin
Super User

Why not parse the JSON directly in Power Query?  Or stuff it into a SQL database and then run SQL queries against that?
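To illustrate the SQL route, here is a minimal sketch using pandas.to_sql (SQLite is used only as a stand-in; a SQL Server or other server database works the same way through a SQLAlchemy engine, and the file and table names are placeholders):

# load_to_sql.py - sketch of the "stuff it into a SQL database" route
import sqlite3

import pandas as pd

DB_PATH = r"C:\data\maude.db"  # placeholder database file
# Either read a previously saved Parquet file or reuse the dataframe built by the script above.
merged = pd.read_parquet(r"C:\data\maude_merged.parquet")

conn = sqlite3.connect(DB_PATH)
# Writing in chunks keeps memory usage flat for large frames.
merged.to_sql("maude_events", conn, if_exists="replace", index=False, chunksize=50000)
conn.close()

Power BI (or any other client) can then query just the columns and date range it needs instead of loading the full raw JSON/CSV.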
