Greetings, community. I have a scenario where I need to skip the first few rows of a CSV file and then save the result back to a lakehouse. The lakehouse needs to be dynamic, since I'll be deploying the notebook across multiple environments. I'm trying the following PySpark code, but without success: it doesn't skip any rows as far as I can tell:
df = spark.read.format("csv").option("skipRows",25).option("header","true").load(ABFSPath)
Anyone have ideas on how I can achieve this?
Hi @arpost ,
Thanks to @frithjof_v for the reply.
Your requirement is that you want to skip the first few lines of a CSV file when loading it into a PySpark DataFrame, am I understanding this correctly?
Here's my csv data used for testing, 5 rows in total:
It is true that the first rows are not skipped correctly when using the syntax you posted, so I understand your frustration.
Another method can be tried:
Read the CSV file into an RDD, skipping the first two lines as well as the header:
# Define the file path
file_path = "Files/products.csv"
# Read the CSV file into an RDD, skipping the first three lines
# (the two junk lines plus the header row at index 2)
rdd = sc.textFile(file_path).zipWithIndex().filter(lambda x: x[1] > 2).map(lambda x: x[0])
# Convert the RDD to a DataFrame without a header
df = spark.read.csv(rdd, header=False)
# Display the DataFrame
display(df)
The display looks like below:
For the time being, I have not found a way to preserve the original header, so I have to define it manually:
# Define the file path
file_path = "Files/products.csv"
# Read the CSV file into an RDD, skipping the first three lines
# (the two junk lines plus the header row)
rdd = sc.textFile(file_path).zipWithIndex().filter(lambda x: x[1] > 2).map(lambda x: x[0])
# Define the header manually
header = ["ProductID", "ProductName", "Category", "ListPrice"]
# Convert the RDD to a DataFrame and apply the header
df = spark.read.csv(rdd).toDF(*header)
# Display the DataFrame
display(df)
The display will look as shown below:
Adjust the number of lines to skip in the code according to your needs.
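Since the lakehouse also needs to be dynamic across environments, the file path itself can be assembled from parameters instead of being hard-coded. A minimal sketch follows; the workspace and lakehouse names are hypothetical placeholders, and the exact OneLake URI format should be verified against the Fabric documentation:

```python
# Hypothetical values; in practice these might come from notebook
# parameters or a deployment pipeline's environment settings.
workspace_name = "my-workspace"
lakehouse_name = "my-lakehouse"

# Assemble an ABFS path for OneLake from the parameters (format
# assumed here; confirm against the Fabric docs for your tenant).
abfs_path = (
    f"abfss://{workspace_name}@onelake.dfs.fabric.microsoft.com/"
    f"{lakehouse_name}.Lakehouse/Files/products.csv"
)
```

The same path string can then be passed to `sc.textFile` or `spark.read` in place of the fixed `Files/products.csv`.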
If you have any other questions please feel free to contact me.
Best Regards,
Yang
Community Support Team
If any post helps, please consider accepting it as the solution to help other members find it more quickly.
If I have misunderstood your needs or you still have problems, please feel free to let us know. Thanks a lot!
@Anonymous, thanks for sharing that. This is definitely promising. The one "blocker" for me would be the static header, as I need this solution to dynamically use the first row remaining after the skipped rows as the header.
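The static header need not be a blocker: after skipping the junk rows, the next remaining row can be captured as the header and the rest treated as data. Below is a minimal sketch of that logic in plain Python with hypothetical sample data; in the RDD approach above, the same split could be done by filtering `zipWithIndex()` for the index equal to the skip count (header row) and greater than it (data rows):

```python
import csv
import io

def skip_and_split_header(text, skip_rows):
    """Skip the first skip_rows lines, then treat the next
    line as the header and the remainder as data rows."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[skip_rows]       # first row after the skipped block
    data = rows[skip_rows + 1:]    # everything after the header
    return header, data

# Hypothetical sample: two junk lines, then a header, then data.
sample = "junk 1\njunk 2\nProductID,ProductName\n1,Bike\n2,Helmet\n"
header, data = skip_and_split_header(sample, skip_rows=2)
# header -> ['ProductID', 'ProductName']
# data   -> [['1', 'Bike'], ['2', 'Helmet']]
```

With the header and data separated this way, a DataFrame could then be built along the lines of `spark.read.csv(data_rdd).toDF(*header)`, as in the earlier reply.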
Some methods are mentioned in this thread: https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/td-p...
Some thoughts / suggestions to try:
Does the order of the options matter in PySpark? I don't know.
Does it make a difference if you rearrange the expression like this?
df = spark.read.format("csv").option("header","true").option("skipRows",25).load(ABFSPath)
or remove the header option like this
df = spark.read.format("csv").option("skipRows",25).load(ABFSPath)