ritikesh
Microsoft Employee

Optimise reading a large number of CSV files from Lakehouse using spark.read

Hi,

I've uploaded ~40k CSV files to an Azure Lakehouse and need to read them into a DataFrame in a Python notebook in Fabric for further processing. I'm using spark.read for this; please suggest some methods to optimise the read and reduce the time it takes.

Some constraints:

- There is an ID column in each row
- All the CSV files have the same schema
- There are around 10 rows in each CSV
- The aim is to optimise the code for 10 million records

Right now, I'm using:

df = spark.read.option("header", "true").csv(path_to_csv_files + "/*/*.csv")

Also, these CSV files were created from a Delta table; I used Z-Ordering to optimise that query.
1 ACCEPTED SOLUTION
Anonymous
Not applicable

Hi @ritikesh,

You can right-click one of the CSV files stored in your folder of thousands of files, then choose 'Load data' -> 'Spark' to generate Spark code that loads from that specific file path.

 


df = spark.read.format("csv").option("header","true").load("Files/churn/raw/churn.csv")
display(df)

After the above steps, you can modify the generated code to remove the file name from the path, so it loads data from all the files in the folder instead of a specific file.

df = spark.read.format("csv").option("header","true").load("Files/churn/raw/")
display(df)


Note: this sample has two files with 1k rows.
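At the scale you describe (~40k files, aiming at 10 million rows), schema inference alone forces Spark to open files just to work out the column types. Since all your CSVs share the same schema, supplying it explicitly skips that pass. A minimal sketch, assuming hypothetical columns id and value that you would replace with your actual ones:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema: replace field names and types with your CSVs' actual columns.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("value", DoubleType(), True),
])

# With an explicit schema, Spark skips the inference scan over every file.
df = spark.read.format("csv").option("header", "true").schema(schema).load("Files/churn/raw/")
display(df)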

GitHub - databricks/spark-csv: CSV Data Source for Apache Spark 1.x
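If the data will be read more than once, it may also be worth compacting the many small CSVs into a Delta table one time and reading that afterwards, since a few large files read far faster than ~40k tiny ones. A sketch, reusing the schema object from above and assuming a hypothetical Tables/churn_all destination:

# One-time compaction of the small CSVs into a handful of larger Delta files.
df = spark.read.format("csv").option("header", "true").schema(schema).load("Files/churn/raw/")
df.repartition(8).write.format("delta").mode("overwrite").save("Tables/churn_all")

# Subsequent reads go to the compacted Delta table instead of the raw CSVs.
df = spark.read.format("delta").load("Tables/churn_all")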

Regards,

Xiaoxin Sheng

