Hi,
I've uploaded ~40k csv files to an Azure Lakehouse and need to read them into a dataframe in a Python notebook in Fabric for further processing. I'm using spark.read for this; please suggest some methods to optimise this read and reduce the overall run time.
Some Constraints:
- Each row has an ID column
- All the csv files have the same schema
- There are around 10 rows in each csv
- The aim is to optimise the code for 10 million records
Right now, I'm using:
Hi @ritikesh,
You can right-click one of the csv files stored in your folder of thousands of files, then choose 'Load data' -> 'Spark' to generate Spark code that loads data from that specific file path.
df = spark.read.format("csv").option("header","true").load("Files/churn/raw/churn.csv")
display(df)
After the steps above, you can modify the generated code to drop the file name from the path, so it loads data from all the files in the folder instead of a specific file.
df = spark.read.format("csv").option("header","true").load("Files/churn/raw/")
display(df)
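One optimisation worth noting at this file count: by default spark.read treats every csv column as a string, and turning on inferSchema would add an extra sampling pass over all ~40k files. Since all your files share the same schema, you can declare it explicitly once instead. A minimal sketch, using hypothetical columns id and value (replace with your actual columns and types):
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema -- swap in your real column names and types.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("value", StringType(), True),
])

# An explicit schema avoids any schema-inference pass over the ~40k files.
df = (spark.read.format("csv")
      .option("header", "true")
      .schema(schema)
      .load("Files/churn/raw/"))
display(df)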
Note: the sample folder here contains two files with 1k rows each.
GitHub - databricks/spark-csv: CSV Data Source for Apache Spark 1.x
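For the 10-million-record goal, another common pattern is to consolidate the many small csv files into a single Delta table once, so later processing reads one compact table instead of ~40k files. A hedged sketch, assuming your Lakehouse has a Tables area and using a hypothetical table name churn_raw:
# One-time consolidation: read all the csv files, then write them as Delta.
df = spark.read.format("csv").option("header", "true").load("Files/churn/raw/")

# Many tiny input files tend to produce many small partitions; coalesce first.
df.coalesce(8).write.mode("overwrite").format("delta").saveAsTable("churn_raw")

# Subsequent processing can read the compact table instead of the raw csvs.
df_fast = spark.read.table("churn_raw")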
Regards,
Xiaoxin Sheng