
rashidanwar
Helper III

How to get a list of blobs with full paths stored in an Azure Blob Storage container

Hi Everyone 

Any help with the following would be highly appreciated.

I am using the following code in a Fabric PySpark notebook to read the contents of the Parquet files stored in an Azure Blob Storage container. The code below gets the data from all the files successfully.

Now I have two issues that need to be resolved.

1. I am only able to get the data when the access level of the container is set to "anonymous". I want the code to work when the access level of the container is set to "private".

2. I want to get a list of all the blobs stored in the container, and the list should contain the complete path to each blob. For example, if I have 2 files stored in the container, their complete paths should be as follows:
Entity/App/20240521/00001.parquet
Entity/App/20240525/00001.parquet

Code:

from pyspark.sql import SparkSession

blob_account_name = "parquetfiles1"
blob_container_name = "container1"
blob_sas_token = "sp=rl...."

# Initialize the Spark session
spark = SparkSession.builder.appName("azure").getOrCreate()

# Register the SAS token for the container so Spark can authenticate
spark.conf.set(
    f'spark.hadoop.fs.azure.sas.{blob_container_name}.{blob_account_name}.blob.core.windows.net',
    blob_sas_token)

# Build the base path for the container (without a specific file path)
base_path = f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net'

# Read every Parquet file in the container, recursing into subfolders
df = (
    spark.read.format("parquet")
    .option("recursiveFileLookup", "true")
    .load(base_path)
)

display(df)
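
For question 2, one option worth noting (a sketch, not verified against this particular container) is Spark's built-in input_file_name() function, which tags every row with the full path of the file it was read from:

from pyspark.sql.functions import input_file_name

# Tag each row with the full path of the Parquet file it came from
df_with_path = (
    spark.read.format("parquet")
    .option("recursiveFileLookup", "true")
    .load(base_path)
    .withColumn("source_file", input_file_name())
)

# The distinct values of source_file are the blobs that were actually read
df_with_path.select("source_file").distinct().show(truncate=False)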

Regards
Rashid Anwar
1 ACCEPTED SOLUTION
rashidanwar
Helper III

@jwinchell40, thank you for your message. I used the following code, and I am now able to get the list of the Parquet files in my blob storage container.

from pyspark.sql import SparkSession
from notebookutils.mssparkutils.fs import ls

blob_account_name = "account_name"
blob_container_name = "container"
blob_sas_token = "sp=rl&....."

# Initialize the Spark session
spark = SparkSession.builder.appName("azure").getOrCreate()

# Register the SAS token for the container so Spark can authenticate
spark.conf.set(
    f'spark.hadoop.fs.azure.sas.{blob_container_name}.{blob_account_name}.blob.core.windows.net',
    blob_sas_token)

# Build the base path for the container (without a specific file path)
base_path = f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net'

# Recursively list all files under a directory
def list_all_files(directory):
    all_files = []
    items = ls(directory)
    for item in items:
        if item.isDir:
            all_files.extend(list_all_files(item.path))
        else:
            all_files.append(item.path)
    return all_files

main_directories = ls(base_path)

# Collect the full paths of all Parquet files
parquet_files = []

# Iterate through each top-level directory and list all files
for main_dir in main_directories:
    if main_dir.isDir:
        all_files = list_all_files(main_dir.path)
        # Keep only the Parquet files
        parquet_files.extend([file for file in all_files if file.endswith('.parquet')])

# Print all Parquet file paths
for parquet_file in parquet_files:
    print(parquet_file)
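
Note that the paths returned by ls are full wasbs:// URLs. If you want them relative to the container root, as in the example paths in the question (e.g. Entity/App/20240521/00001.parquet), one option is to strip the base path prefix; a small sketch reusing base_path from the code above:

# Strip the container URL prefix to get container-relative blob paths,
# e.g. 'Entity/App/20240521/00001.parquet'
relative_paths = [p.replace(base_path + '/', '', 1) for p in parquet_files]
for p in relative_paths:
    print(p)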


11 REPLIES
Anonymous
Not applicable

Hi @rashidanwar ,

Glad to know your issue got resolved. Please continue using the Fabric Community for your further queries.

rashidanwar
Helper III

Thank you @Anonymous for the information.
Below is the support ticket/tracking ID:
TrackingID#2405230050000012

Anonymous
Not applicable

Hi @rashidanwar ,

Thanks for sharing the support ticket.

Please allow some time so the team can check and provide a resolution.

If you get a resolution, please do share it with the community, as it can be helpful to others.

rashidanwar
Helper III

@Anonymous,
Regarding query 1: Creating managed private endpoints in Fabric is only available for workspaces assigned to Fabric capacities with SKUs F64 or larger. Unfortunately, this does not help in my case.

I have tried every suggestion but without success. I've had multiple meetings with the Microsoft team, and they haven't been able to resolve the issue yet. I'm at a loss for what to do next.

I have Parquet files stored in Azure Blob Storage, some of which have nested data structures. Initially, I tried to consume the files directly in Power BI, but the Parquet.Document() function in Power Query cannot read the nested data structure and throws an error. Interestingly, I have been able to access my Azure Blob Storage data in Power BI with the container access level set to "Private". I then decided to use a Fabric PySpark notebook to retrieve the data, but I haven't had any success with that either. In Fabric I am able to get the data when the container allows anonymous access, but I can't get it when the access level is set to "Private".
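
For illustration, nested Parquet columns that Parquet.Document() cannot unpack can usually be read and flattened in PySpark; a minimal sketch, where properties and properties.id are hypothetical field names standing in for the actual nested structure:

from pyspark.sql.functions import col

# Read the nested Parquet data (reusing base_path and the SAS config above)
df = spark.read.format("parquet").load(base_path)

# Inspect the nested schema to find the struct fields
df.printSchema()

# Flatten a nested struct field with dot notation
# ('properties' and 'properties.id' are hypothetical field names)
flat_df = df.select("*", col("properties.id").alias("properties_id"))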

Anonymous
Not applicable

Hi @rashidanwar ,

At present, the way to access the data with the access level set to 'Private' is by using a Managed Private Endpoint.

In your case, the best course of action is to open a support ticket and have our support team take a closer look at it.

 

Please reach out to our support team so they can do a more thorough investigation and can guide you better: Link 

 

After creating a support ticket, please provide the ticket number, as it will help us track the issue.

 

Hope this helps. Please let us know if you have any other queries.

jwinchell40
Super User

@rashidanwar - Have you tried using the _metadata columns as part of your df, if you want the file path alongside the data?

df = spark.read.format("parquet").load(base_path).select("*", "_metadata.file_path", "_metadata.file_name")
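
A slightly fuller sketch of that idea, assuming a Spark runtime that exposes the hidden _metadata column (Spark 3.2 and later) and reusing base_path from the original post:

# Read the files and surface the hidden _metadata column
df = (
    spark.read.format("parquet")
    .option("recursiveFileLookup", "true")
    .load(base_path)
    .select("*", "_metadata.file_path", "_metadata.file_name")
)

# The distinct file paths are the full paths of the blobs that were read
df.select("file_path").distinct().show(truncate=False)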

 

If you just want to list out the files, have you tried this? I do not fully understand your use case, so this may not be what you are trying to achieve.

https://learn.microsoft.com/en-us/fabric/data-engineering/microsoft-spark-utilities#list-files

rashidanwar
Helper III
Helper III

Thank you @Anonymous!
Let me try your solution and I'll get back to you.

Anonymous
Not applicable

Hi @rashidanwar ,

We haven’t heard from you since the last response and were just checking back to see if you have a resolution yet.
If you have a resolution, please do share it with the community, as it can be helpful to others.
Otherwise, we will respond back with more details and try to help.
Thanks

Anonymous
Not applicable

Hi @rashidanwar ,

Thanks for using Fabric Community.
Regarding query 1, can you please refer to the following - click here

Additional Docs -
Loading data from ADLS behind firewalls to Fabric Lakehouse (youtube.com)
Overview of managed private endpoints for Microsoft Fabric - Microsoft Fabric | Microsoft Learn

Regarding query 2, can you please check the docs below -
Introduction to Microsoft Spark utilities - Azure Synapse Analytics | Microsoft Learn

Hope this is helpful. Please do let me know in case of further queries.

Anonymous
Not applicable

Hi @rashidanwar ,

We haven’t heard from you since the last response and were just checking back to see if your query was answered.
Otherwise, we will respond back with more details and try to help.

Thanks
