
ritikesh
Microsoft Employee

Optimise reading a large number of CSV files from a Lakehouse using spark.read

Hi,

I've uploaded ~40k CSV files to an Azure Lakehouse and need to read them into a DataFrame in a Python notebook in Fabric for further processing. I'm currently using spark.read; please suggest methods to optimise this read and reduce its runtime.
Some constraints:
- There is an ID column in each row
- All the CSV files have the same schema
- There are around 10 rows in each CSV
- The aim is to optimise the code for 10 million records

Right now, I'm using:

df = spark.read.option("header", "true").csv(path_to_csv_files + "/*/*.csv")

Also, these CSV files were exported from a Delta table; I used Z-ordering to optimise that query.
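
For reference, one common optimisation for this pattern is to supply an explicit schema so Spark doesn't sample the ~40k files to infer column types, and to repartition once after the read, since every tiny file otherwise becomes its own task. A minimal sketch, with hypothetical column names (only ID is mentioned in the post) and an assumed partition count:

from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema -- only the ID column is named in the question.
schema = StructType([
    StructField("ID", StringType(), False),
    StructField("value", StringType(), True),
])

df = (
    spark.read
    .option("header", "true")
    .schema(schema)                      # skip per-file schema inference
    .csv(path_to_csv_files + "/*/*.csv")
)

# ~40k tiny files means ~40k tiny partitions; coalesce into fewer,
# larger ones before further processing. 64 is an assumed count.
df = df.repartition(64)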
1 ACCEPTED SOLUTION
Anonymous
Not applicable

Hi @ritikesh,

You can right-click one of the CSV files stored in your folder of thousands of files, then choose 'Load data' -> 'Spark' to generate Spark code that loads data from that specific file path.

 


df = spark.read.format("csv").option("header","true").load("Files/churn/raw/churn.csv")
display(df)

After the above steps, you can modify the generated code to remove the file name from the path so it loads data from all the files in the folder instead of a specific file.

df = spark.read.format("csv").option("header","true").load("Files/churn/raw/")
display(df)


Note: the sample above contains two files with 1k rows.
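
One caveat, as a sketch: the folder load above reads only files directly under Files/churn/raw/. If the CSVs sit in nested subfolders (as the "/*/*.csv" pattern in the question suggests), Spark's recursiveFileLookup option can descend into them. The path here reuses the sample layout and is an assumption:

df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("recursiveFileLookup", "true")  # also read files in subfolders
    .load("Files/churn/raw/")
)
display(df)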

GitHub - databricks/spark-csv: CSV Data Source for Apache Spark 1.x

Regards,

Xiaoxin Sheng


