Hi
I have to decide between two routes.
I have a Python script that paginates through and pulls a JSON API response, then uses that response to filter another downloaded CSV file; both are saved to a file on my computer.
Sometimes, if I load a year's worth of data from both the API response and the CSV file, the combined size exceeds 2 GB and I get an error message.
I was wondering if anyone had advice on how to deal with this.
------------------------------
I also had another idea of running the Python script inside Power Query. Since I can't use some features (such as GUIs) in that version of the script, it runs slower, but it does output a merged dataframe for me. However, it runs into the same issue where the data size can get too large:
let
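// The renamed bindings below avoid a cyclic reference to the StartDate/EndDate parameters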
StartDateText = Text.From(StartDate),
EndDateText = Text.From(EndDate),
pyCode = Text.Combine({
"import requests",
"import aiohttp",
"import asyncio",
"import pandas as pd",
"import zipfile",
"import tempfile",
"import os",
"from datetime import datetime, timedelta",
"",
"start_date_str = '" & StartDateText & "'",
"end_date_str = '" & EndDateText & "'",
"start_date = datetime.strptime(start_date_str, '%Y-%m-%d')",
"end_date = datetime.strptime(end_date_str, '%Y-%m-%d')",
"",
"PATIENTPROBLEMDATA_ZIP_URL = 'https://www.accessdata.fda.gov/MAUDE/ftparea/patientproblemcode.zip'",
"",
"async def fetch_day_data(session, date_str, retries=3):",
" url = f'https://api.fda.gov/device/event.json?search=date_received:[{date_str}+TO+{date_str}]&limit=1000'",
" for attempt in range(retries):",
" try:",
" results_df = pd.DataFrame()",
" async with session.get(url, headers={'Accept-Encoding': 'gzip'}) as response:",
" if response.status == 200:",
" data = await response.json()",
" if 'results' in data:",
" results_df = pd.DataFrame(data['results'])",
" next_link = response.headers.get('Link')",
" while next_link:",
" next_link = next_link.split(',')[0].split(';')[0].strip('<>')",
" async with session.get(next_link, headers={'Accept-Encoding': 'gzip'}) as next_response:",
" if next_response.status != 200:",
" break",
" next_data = await next_response.json()",
" if 'results' in next_data:",
" new_df = pd.DataFrame(next_data['results'])",
" results_df = pd.concat([results_df, new_df], ignore_index=True)",
" next_link = next_response.headers.get('Link')",
" api_columns_to_keep = ['mdr_report_key', 'event_type', 'date_received', 'product_code', 'mdr_text']",
" results_df = results_df[[col for col in api_columns_to_keep if col in results_df.columns]]",
" return results_df",
" except Exception as e:",
" if attempt < retries - 1:",
" await asyncio.sleep(2 ** attempt)",
" else:",
" return pd.DataFrame({'error': [f'{date_str} failed: {e}']})",
" return pd.DataFrame()",
"",
"async def fetch_all_dates(start_date, end_date):",
" date_list = [(start_date + timedelta(days=i)).strftime('%Y%m%d') for i in range((end_date - start_date).days + 1)]",
" results = []",
" async with aiohttp.ClientSession() as session:",
" tasks = [fetch_day_data(session, date) for date in date_list]",
" for task in asyncio.as_completed(tasks):",
" df = await task",
" if not df.empty:",
" results.append(df)",
" return pd.concat(results, ignore_index=True) if results else pd.DataFrame()",
"",
"def merge_zip_with_api(api_df):",
" expected_column = 'MDR_REPORT_KEY'",
" if 'mdr_report_key' not in api_df.columns:",
" return pd.DataFrame({'error': ['Missing mdr_report_key in API data']})",
" try:",
" valid_keys = set(api_df['mdr_report_key'])",
" chunks = []",
" with tempfile.NamedTemporaryFile(delete=False, suffix='.zip') as tmp_zip:",
" response = requests.get(PATIENTPROBLEMDATA_ZIP_URL, stream=True, headers={'User-Agent': 'Mozilla/5.0'}, timeout=30)",
" tmp_zip.write(response.content)",
" zip_path = tmp_zip.name",
" with zipfile.ZipFile(zip_path, 'r') as z:",
" for fname in z.namelist():",
" if fname.endswith('.txt'):",
" with z.open(fname) as f:",
" for chunk in pd.read_csv(f, delimiter='|', dtype=str, chunksize=50000):",
" if expected_column in chunk.columns:",
" matched = chunk[chunk[expected_column].isin(valid_keys)]",
" if not matched.empty:",
" zip_columns_to_keep = ['MDR_REPORT_KEY', 'PATIENT_PROBLEM_CODE', 'PATIENT_PROBLEM_DESC', 'product_problems', 'mdr_text']",
" matched = matched[[col for col in zip_columns_to_keep if col in matched.columns]]",
" chunks.append(matched)",
" os.remove(zip_path)",
" if chunks:",
" zip_df = pd.concat(chunks, ignore_index=True)",
" return pd.merge(api_df, zip_df, how='inner', left_on='mdr_report_key', right_on='MDR_REPORT_KEY')",
" return pd.DataFrame()",
" except Exception as e:",
" return pd.DataFrame({'error': [f'ZIP merge error: {e}']})",
"",
"try:",
" api_df = asyncio.run(fetch_all_dates(start_date, end_date))",
" if not api_df.empty and 'error' not in api_df.columns:",
" output_df = merge_zip_with_api(api_df)",
" else:",
" output_df = api_df if not api_df.empty else pd.DataFrame({'error': ['No API data retrieved']})",
"except Exception as e:",
" output_df = pd.DataFrame({'error': [f'Execution error: {e}']})",
"",
"output_df"
}, "#(lf)"),
Source = Python.Execute(pyCode, null)
in
Source{[Name="output_df"]}[Value]
Hello @Bobster326
The core issue here is hitting Power BI’s 2GB data/model limit when dealing with large API and CSV merges. Here's a short solution recommendation:
Use your Python script outside of Power BI to:
1. Fetch and merge the API and ZIP data.
2. Save the final output to a Parquet file (much smaller and optimized for Power BI).
3. Import the Parquet file into Power BI.
This avoids memory issues, speeds up refreshes, and keeps your transformations efficient. Avoid running large async or ZIP-handling scripts inside Power Query; they're not optimized for that scale.
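Here is a minimal sketch of that external flow, assuming the fetch_all_dates() and merge_zip_with_api() functions from the script above live in a standalone .py file (the date range and output path are placeholders, not values from the original post):

# Minimal sketch: run the existing fetch/merge outside Power BI, then write Parquet.
# Assumes fetch_all_dates() and merge_zip_with_api() from the script above are defined here.
import asyncio
from datetime import datetime

start_date = datetime.strptime('2024-01-01', '%Y-%m-%d')   # placeholder date range
end_date = datetime.strptime('2024-12-31', '%Y-%m-%d')

api_df = asyncio.run(fetch_all_dates(start_date, end_date))
output_df = merge_zip_with_api(api_df)

# Parquet is columnar and compressed, so the file on disk is far smaller than the
# equivalent CSV. Writing it requires pyarrow or fastparquet to be installed.
output_df.to_parquet(r'C:\data\maude_merged.parquet', index=False)

Power BI can then import that file with the Parquet connector (Parquet.Document in Power Query), so the heavy fetch/merge never runs during a report refresh.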
Thanks
If this solution helps, please accept it and give kudos (Like); it would be greatly appreciated.
"The core issue here is hitting Power BI's 2GB data/model limit"
Do you have a citation/link for that?
Hi @lbendlin, what do you mean by parsing the JSON response in Power Query? Do you mean having the data frames load into Power Query and then conducting these transformations?
I think it's just because I am looking at filtering the results of the other downloaded CSV file by matching the MDR report key.
Hi @Bobster326 ,
Thank you for reaching out to Microsoft Fabric Community.
Thank you @lbendlin for the prompt response.
In addition to @lbendlin's points:
You are running into Power BI's file and memory limitations, especially when dealing with large datasets (>2GB). There are a couple of options, as suggested by @lbendlin:
Parsing the JSON directly in Power Query or storing the data in a SQL database are both solid options.
However, given the 2GB+ data size, I would suggest:
1. Using Python externally to fetch/merge the data.
2. Saving it to a SQL database or a Parquet file.
3. Loading the result into Power BI - this avoids memory and refresh limits while keeping performance smooth.
Parsing JSON in Power Query works but may be slower and hit memory issues if the dataset is large or heavily paginated.
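For the SQL-database option in step 2, a minimal sketch using a local SQLite file and pandas' to_sql could look like this (the database path, table name, and filter value are placeholders):

# Minimal sketch: persist the merged result in a local SQLite database so that
# Power BI only has to pull the subset it needs. Assumes output_df is the merged
# DataFrame produced by the script above.
import sqlite3
import pandas as pd

con = sqlite3.connect(r'C:\data\maude.db')

# Write in chunks so the whole frame is never serialized in one go.
output_df.to_sql('maude_events', con, if_exists='replace', index=False, chunksize=50000)

# Example downstream query: only the columns/rows that are actually needed.
subset = pd.read_sql_query(
    "SELECT mdr_report_key, event_type, date_received, PATIENT_PROBLEM_CODE "
    "FROM maude_events WHERE event_type = 'Injury'",
    con,
)
con.close()

Power BI would then connect to that database (for SQLite, typically through an ODBC driver) and import only the filtered subset instead of the full 2GB+ merge.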
If this post helps, then please consider accepting it as the solution to help other members find it more quickly.
Thank you.
Why not parse the JSON directly in Power Query? Or stuff it into a SQL database and then run SQL queries against that?