ritikesh
Microsoft Employee

Optimise reading a large number of CSV files from Lakehouse using spark.read

Hi,

I've uploaded ~40k CSV files to an Azure Lakehouse and need to read them into a DataFrame in a Python notebook in Fabric for further processing. I'm using spark.read for this; please suggest some methods to optimise the read and reduce the time it takes.

Some constraints:

- There is an ID column in each row
- All the CSV files have the same schema
- There are around 10 rows in each CSV
- The aim is to optimise the code for 10 million records

Right now, I'm using:

df = spark.read.option("header", "true").csv(path_to_csv_files + "/*/*.csv")

Also, these CSV files were created from a Delta table; I used Z-Ordering to optimise that query.
1 ACCEPTED SOLUTION
Anonymous
Not applicable

Hi @ritikesh,

You can right-click one of the CSV files stored in your folder of thousands of files, then choose 'Load data' -> 'Spark' to generate Spark code that loads from that specific file path.

 


df = spark.read.format("csv").option("header","true").load("Files/churn/raw/churn.csv")
display(df)

After the above steps, you can modify the generated code to remove the file name from the path, so it loads data from all the files in the folder instead of a specific file.

df = spark.read.format("csv").option("header","true").load("Files/churn/raw/")
display(df)


Note: this sample has two files with 1k rows.
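At the scale you describe (~40k files, aiming at 10 million rows), schema inference alone forces Spark to open files just to work out the column types. Since all your CSVs share the same schema, supplying it explicitly skips that pass. A minimal sketch, assuming hypothetical columns id and value that you would replace with your actual ones:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema: replace field names and types with your CSVs' actual columns.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("value", DoubleType(), True),
])

# With an explicit schema, Spark skips the inference scan over every file.
df = spark.read.format("csv").option("header", "true").schema(schema).load("Files/churn/raw/")
display(df)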

GitHub - databricks/spark-csv: CSV Data Source for Apache Spark 1.x
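If the data will be read more than once, it may also be worth compacting the many small CSVs into a Delta table one time and reading that afterwards, since a few large files read far faster than ~40k tiny ones. A sketch, reusing the schema object from above and assuming a hypothetical Tables/churn_all destination:

# One-time compaction of the small CSVs into a handful of larger Delta files.
df = spark.read.format("csv").option("header", "true").schema(schema).load("Files/churn/raw/")
df.repartition(8).write.format("delta").mode("overwrite").save("Tables/churn_all")

# Subsequent reads go to the compacted Delta table instead of the raw CSVs.
df = spark.read.format("delta").load("Tables/churn_all")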

Regards,

Xiaoxin Sheng

