Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Enhance your career with this limited time 50% discount on Fabric and Power BI exams. Ends August 31st. Request your voucher.

Reply
arpost
Post Prodigy
Post Prodigy

How do you remove top N rows from a CSV when loading it into a notebook?

Greetings, community. I have a scenario where I need to skip the first few rows of a CSV file and then save that back into a lakehouse. I need the lakehouse to be dynamic since I'll be deploying the notebook across multiple environments. I am trying to use the following PySpark code as follows but without success as it doesn't skip any rows as far as I can tell:

 

 

df = spark.read.format("csv").option("skipRows",25).option("header","true").load(ABFSPath)

 

 

Anyone have ideas on how I can achieve this?

3 REPLIES 3
Anonymous
Not applicable

Hi @arpost ,

 

Thanks for the reply from @frithjof_v .

 

Your requirement is that you want to skip the first few lines of a CSV file when loading it into a PySpark DataFrame, am I understanding this correctly?

 

Here's my csv data used for testing, 5 rows in total:

vhuijieymsft_0-1721958892861.png

 

It is true that the first two lines are not skipped correctly when using the following syntax, so I understand your anxiety.

vhuijieymsft_1-1721958892866.png

 

Another method can be tried:

 

Reads the CSV file into the RDD and skips the first two lines while removing the header:

# Define the file path
file_path = “Files/products.csv”

# Read the CSV file into the RDD and skip the first two lines
rdd = sc.textFile(file_path).zipWithIndex().filter(lambda x: x[1] > 2).map(lambda x: x[0])

# Convert the RDD to a DataFrame without the headers
df = spark.read.csv(rdd, header=False)

# Display the DataFrame
display(df)

 

The display looks like below:

vhuijieymsft_2-1721958982911.png

 

For the time being, I have not found a way to preserve the original header, so I have to define it manually:

# Define the file path
file_path = “Files/products.csv”

# Read the CSV file into the RDD and skip the first two lines
rdd = sc.textFile(file_path).zipWithIndex().filter(lambda x: x[1] > 2).map(lambda x: x[0])

# Define the header
header = [“ProductID”, “ProductName”, “Category”, “ListPrice”]

# Convert the RDD to a DataFrame and add a header
df = spark.read.csv(rdd).toDF(*header)

# Display the DataFrame
display(df)

 

The display will look as shown below:

vhuijieymsft_3-1721958982917.png

 

Replace the lines that need to be skipped inside the code according to your needs.

 

If you have any other questions please feel free to contact me.

 

Best Regards,
Yang
Community Support Team

 

If there is any post helps, then please consider Accept it as the solution  to help the other members find it more quickly.
If I misunderstand your needs or you still have problems on it, please feel free to let us know. Thanks a lot!

@Anonymous, thanks for sharing that. This is definitely promising. The one "blocker" for me would be the static header as I need this solution to be able to dynamically use the first row after skipping the previous rows.

frithjof_v
Super User
Super User

Some methods are mentioned in this thread: https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/td-p...

 

Some thoughts / suggestions to try:

Does the order of the options matter in PySpark? I don't know.

Does it make a difference if you rearrange the expression like this?

 

df = spark.read.format("csv").option("header","true").option("skipRows",25).load(ABFSPath)

 

or remove the header option like this

 

df = spark.read.format("csv").option("skipRows",25).load(ABFSPath)

Helpful resources

Announcements
Join our Fabric User Panel

Join our Fabric User Panel

This is your chance to engage directly with the engineering team behind Fabric and Power BI. Share your experiences and shape the future.

June FBC25 Carousel

Fabric Monthly Update - June 2025

Check out the June 2025 Fabric update to learn about new features.

June 2025 community update carousel

Fabric Community Update - June 2025

Find out what's new and trending in the Fabric community.

Top Solution Authors