
rashidanwar
Helper III

How to get a list of blobs with full paths stored in an Azure Blob Storage container

Hi Everyone 

Any help with the following would be highly appreciated.

I am using the following code in a Fabric PySpark notebook to read the contents of the Parquet files stored in an Azure Blob Storage container. The code below gets the data from all the files successfully.

Now I have two issues that need to be resolved.

1. I am only able to get the data when the access level of the container is set to "anonymous". I want the code to work when the access level of the container is set to "private".

2. I want to get a list of all the blobs stored in the container, and the list should contain the complete path to each blob. For example, if I have 2 files stored in the container, their complete paths should be as follows:
Entity/App/20240521/00001.parquet
Entity/App/20240525/00001.parquet

Code:

from pyspark.sql import SparkSession

blob_account_name = "parquetfiles1"
blob_container_name = "container1"
blob_sas_token = "sp=rl...."

# Initialize the Spark session
spark = SparkSession.builder.appName("azure").getOrCreate()

# Register the SAS token for the container so Spark can authenticate
spark.conf.set(
    f'spark.hadoop.fs.azure.sas.{blob_container_name}.{blob_account_name}.blob.core.windows.net',
    blob_sas_token)

# Build the base path for the container (without a specific file path)
base_path = f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net'

# Read every Parquet file in the container, recursing into subfolders
df = (
    spark.read.format("parquet")
    .option("recursiveFileLookup", "true")
    .load(base_path)
)

display(df)
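
For question 2, one option worth noting (a sketch, not verified against this particular container) is Spark's built-in input_file_name() function, which tags every row with the full path of the file it was read from:

from pyspark.sql.functions import input_file_name

# Tag each row with the full path of the Parquet file it came from
df_with_path = (
    spark.read.format("parquet")
    .option("recursiveFileLookup", "true")
    .load(base_path)
    .withColumn("source_file", input_file_name())
)

# The distinct values of source_file are the blobs that were actually read
df_with_path.select("source_file").distinct().show(truncate=False)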

Regards
Rashid Anwar
1 ACCEPTED SOLUTION
rashidanwar
Helper III

@jwinchell40, thank you for your message. I used the following code, and I am now able to get the list of the Parquet files in my blob storage container.

from pyspark.sql import SparkSession
from notebookutils.mssparkutils.fs import ls

blob_account_name = "account_name"
blob_container_name = "container"
blob_sas_token = "sp=rl&....."

# Initialize the Spark session
spark = SparkSession.builder.appName("azure").getOrCreate()

# Register the SAS token for the container so Spark can authenticate
spark.conf.set(
    f'spark.hadoop.fs.azure.sas.{blob_container_name}.{blob_account_name}.blob.core.windows.net',
    blob_sas_token)

# Build the base path for the container (without a specific file path)
base_path = f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net'

# Recursively list all files under a directory
def list_all_files(directory):
    all_files = []
    items = ls(directory)
    for item in items:
        if item.isDir:
            all_files.extend(list_all_files(item.path))
        else:
            all_files.append(item.path)
    return all_files

main_directories = ls(base_path)

# Collect the full paths of all Parquet files
parquet_files = []

# Iterate through each top-level directory and list all files
for main_dir in main_directories:
    if main_dir.isDir:
        all_files = list_all_files(main_dir.path)
        # Keep only the Parquet files
        parquet_files.extend([file for file in all_files if file.endswith('.parquet')])

# Print all Parquet file paths
for parquet_file in parquet_files:
    print(parquet_file)
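
Note that the paths returned by ls are full wasbs:// URLs. If you want them relative to the container root, as in the example paths in the question (e.g. Entity/App/20240521/00001.parquet), one option is to strip the base path prefix; a small sketch reusing base_path from the code above:

# Strip the container URL prefix to get container-relative blob paths,
# e.g. 'Entity/App/20240521/00001.parquet'
relative_paths = [p.replace(base_path + '/', '', 1) for p in parquet_files]
for p in relative_paths:
    print(p)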


11 REPLIES
Anonymous
Not applicable

Hi @rashidanwar ,

Glad to know your issue got resolved. Please continue using the Fabric Community for your further queries.

rashidanwar
Helper III

Thank you @Anonymous for the information.
Below is the support ticket/tracking ID:
TrackingID#2405230050000012

Anonymous
Not applicable

Hi @rashidanwar ,

Thanks for sharing the support ticket.

Please allow some time so the team can check and provide a resolution.

If you get a resolution, please do share it with the community, as it can be helpful to others.

rashidanwar
Helper III

@Anonymous,
Regarding query 1: Creating managed private endpoints in Fabric is only available for workspaces assigned to Fabric capacities with SKUs F64 or larger. Unfortunately, this does not help in my case.

I have tried every suggestion but without success. I've had multiple meetings with the Microsoft team, and they haven't been able to resolve the issue yet. I'm at a loss for what to do next.

I have Parquet files stored in Azure Blob Storage, some of which have nested data structures. Initially, I tried to consume the files directly in Power BI, but the Parquet.Document() function in Power Query cannot read the nested data structure and throws an error. Interestingly, I have been able to access my Azure Blob Storage data in Power BI with the container access level set to "Private". I then decided to use a Fabric PySpark notebook to retrieve the data, but I haven't had any success with that either. In Fabric I am able to get the data when the container allows anonymous access, but I can't get it when the access level is set to "Private".
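
For illustration, nested Parquet columns that Parquet.Document() cannot unpack can usually be read and flattened in PySpark; a minimal sketch, where properties and properties.id are hypothetical field names standing in for the actual nested structure:

from pyspark.sql.functions import col

# Read the nested Parquet data (reusing base_path and the SAS config above)
df = spark.read.format("parquet").load(base_path)

# Inspect the nested schema to find the struct fields
df.printSchema()

# Flatten a nested struct field with dot notation
# ('properties' and 'properties.id' are hypothetical field names)
flat_df = df.select("*", col("properties.id").alias("properties_id"))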

Anonymous
Not applicable

Hi @rashidanwar ,

At present, the way to access the data with the access level set to 'Private' is by using a Managed Private Endpoint.

In your case, the best course of action is to open a support ticket and have our support team take a closer look at it.

 

Please reach out to our support team so they can do a more thorough investigation and can guide you better: Link 

 

After creating a support ticket, please provide the ticket number, as it will help us track the issue.

 

Hope this helps. Please let us know if you have any other queries.

jwinchell40
Super User

@rashidanwar - Have you tried using the _metadata columns as part of your df, if you want the file path alongside the data?

df = spark.read.format("parquet").load(base_path).select("*", "_metadata.file_path", "_metadata.file_name")
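
A slightly fuller sketch of that idea, assuming a Spark runtime that exposes the hidden _metadata column (Spark 3.2 and later) and reusing base_path from the original post:

# Read the files and surface the hidden _metadata column
df = (
    spark.read.format("parquet")
    .option("recursiveFileLookup", "true")
    .load(base_path)
    .select("*", "_metadata.file_path", "_metadata.file_name")
)

# The distinct file paths are the full paths of the blobs that were read
df.select("file_path").distinct().show(truncate=False)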

 

If you just want to list out the files, have you tried this? I do not fully understand your use case, so this may not be what you are trying to achieve.

https://learn.microsoft.com/en-us/fabric/data-engineering/microsoft-spark-utilities#list-files

rashidanwar
Helper III
Helper III

Thank you @Anonymous!
Let me try your solution and I'll get back to you.

Anonymous
Not applicable

Hi @rashidanwar ,

We haven’t heard from you since the last response and were just checking back to see if you have a resolution yet.
If you have a resolution, please do share it with the community, as it can be helpful to others.
Otherwise, we will respond back with more details and try to help.
Thanks

Anonymous
Not applicable

Hi @rashidanwar ,

Thanks for using Fabric Community.
Regarding query 1, can you please refer to the following - click here

Additional Docs -
Loading data from ADLS behind firewalls to Fabric Lakehouse (youtube.com)
Overview of managed private endpoints for Microsoft Fabric - Microsoft Fabric | Microsoft Learn

Regarding query 2, can you please check the docs below -
Introduction to Microsoft Spark utilities - Azure Synapse Analytics | Microsoft Learn

Hope this is helpful. Please do let me know in case of further queries.

Anonymous
Not applicable

Hi @rashidanwar ,

We haven’t heard from you since the last response and were just checking back to see if your query was answered.
Otherwise, we will respond back with more details and try to help.

Thanks
