Re: Unable to Read Nested Parquet Files in Azure B...

rashidanwar · ‎05-30-2024

Hi everyone,

I'm seeking assistance with reading nested parquet files stored in Azure Blob Storage using a PySpark Notebook in Microsoft Fabric. Here's a breakdown of the situation:

Challenge:

Power BI's Parquet.Document() function is unable to handle nested data structures in my parquet files.
PySpark throws an authentication error when attempting to access the files with the same SAS token used in Power BI.

Steps Taken:

Power BI (Attempted):
- Used an Azure Blob Storage SAS token for authentication.
- Encountered limitations with Power BI for nested data structures.
PySpark Notebook (Current Issue):
- Configured the notebook with the same SAS token but received an authentication error: "No credentials found...".
- Enabled anonymous access on the container (Security Risk) to bypass the error, but then encountered a "Path does not exist" error.

Code Snippet:

blob_account_name = "account_name"
blob_container_name = "container"
blob_relative_path = "folder1/folder2/folder3/folder4/parquet/20240527/000001.parquet"
blob_sas_token = "sp=r&st=2024-05-19T01:51:21Z&se=2024-07-01T09:51:21Z&spr=https&sv=2022-11-02&sr=c&sig=%2CkbLTU8aY7maCu3ak15hjtVHr1jdhHgR2ZghfTTYBF%3D"

# Construct the path for connection
wasbs_path = f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/{blob_relative_path}?{blob_sas_token}'

# Read parquet data from Azure Blob Storage path
blob_df = spark.read.parquet(wasbs_path)

# Show the Azure Blob DataFrame
blob_df.show()
spark.stop()

Request:

I'd appreciate any insights on how to access and read the nested parquet files securely using PySpark without anonymous container access.

Additional Information:

The provided code snippet includes hypothetical credentials. Thanks in advance for your assistance!

Regards,

Rashid Anwar

Regards
Rashid Anwar

rashidanwar · ‎06-11-2024

@Anonymous,

This issue is still unresolved, and I have opened a support ticket for it. Below is the tracking ID:

Tracking ID: 2405230050000012

I have successfully accessed the list of blobs stored in Azure Blob Storage and have posted the solution in the relevant thread, which can now be closed. Let's keep this thread open for further discussion.
Thank you!

Anonymous · ‎06-11-2024

Hi @rashidanwar ,

Thanks for sharing the support ticket.

Please allow some time, so team can check and provide a resolution.

In case if you got a resolution please do share with the community as it can be helpful to others .

rashidanwar · ‎06-02-2024

Now I want to have a list of all the blobs stored in the container. But list should contain the complete path to the blob.
For example I have Heirarchial File Structure in the Container.
There is a main folder called Entity and within Entity I have another folder called Apps and within Apps I have two folders 20240521 and 20240525 and within each folder I have a parquet file named 00001.parquet.

Now I want to have a list of the parquet files with comlete path.

How to get the list of files as follows using PySPark Notebook in Fabric
Entity/App/20240521/00001.parquet
Entity/App/20240525/00001.parquet

Anonymous · ‎06-03-2024

Hi @rashidanwar ,

Can you please check below doc ?
Introduction to Microsoft Spark utilities - Azure Synapse Analytics | Microsoft Learn

Anonymous · ‎06-04-2024

Hello @rashidanwar ,

We haven’t heard from you on the last response and was just checking back to see if you have a resolution yet .
In case if you have any resolution please do share that same with the community as it can be helpful to others .
Otherwise, will respond back with the more details and we will try to help .

Anonymous · ‎06-06-2024

Hi @rashidanwar ,

This thread is duplicate of this thread - How to get a list of the blob with full path store... - Microsoft Fabric Community

I am closing this thread.

rashidanwar · ‎06-02-2024

Hi I made some research and I have been able to get the content of all the blobs in the Azure Blob Storage Container using the following code. You can also filter ther results.

from pyspark.sql import SparkSession

from pyspark import SparkContext

blob_account_name = "parquetfiles1"

blob_container_name = "container1"

blob_sas_token = "sp=rl...."

# Initialize Spark Session

spark = SparkSession.builder.appName("azure").getOrCreate()

# Set the configuration

spark.conf.set(

f'spark.hadoop.fs.azure.sas.{blob_container_name}.{blob_account_name}.blob.core.windows.net',

blob_sas_token)

# Build the base path for the container (without specific file path)

base_path = f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net'

df = (

spark.read.format("parquet")

.option("recursiveFileLookup", "true")

.load(base_path)

)

# df.show()

display(df)

Anonymous · ‎05-31-2024

Hi @rashidanwar ,

Thanks for using Fabric Community.
Can you please refer this similar thread - Solved: read Azure Data Lake from notebook fabric - Microsoft Fabric Community

Additional resources to refer -
How to read Parquet files in PySpark Azure Databricks? (azurelib.com)

Hope this might bring some insights. Please do let me know incase of further queries.

Unable to Read Nested Parquet Files in Azure Blob Storage with PySpark

Helpful resources

Fabric Monthly Update - May 2025

Fabric Community Update - June 2025

Become a Certified Power BI Data Analyst!

Unable to Read Nested Parquet Files in Azure Blob Storage with PySpark

Helpful resources

Fabric Monthly Update - May 2025

Fabric Community Update - June 2025