Greetings, community. I have a scenario where I need to skip the first few rows of a CSV file and then save the result back to a lakehouse. The lakehouse needs to be dynamic, since I'll be deploying the notebook across multiple environments. I'm trying the following PySpark code, but without success: it doesn't skip any rows as far as I can tell:
df = spark.read.format("csv").option("skipRows",25).option("header","true").load(ABFSPath)
Anyone have ideas on how I can achieve this?
Hi @arpost ,
Thanks to @frithjof_v for the reply.
Your requirement is that you want to skip the first few lines of a CSV file when loading it into a PySpark DataFrame, am I understanding this correctly?
Here's my csv data used for testing, 5 rows in total:
It is true that the first rows are not skipped correctly when using the syntax you posted, so I understand your frustration.
Another method can be tried:
Read the CSV file into an RDD, skipping the first two lines as well as the header:
# Define the file path
file_path = "Files/products.csv"
# Read the CSV file into an RDD, skipping the first three lines
# (the two junk lines plus the header row at index 2)
rdd = sc.textFile(file_path).zipWithIndex().filter(lambda x: x[1] > 2).map(lambda x: x[0])
# Convert the RDD to a DataFrame without a header
df = spark.read.csv(rdd, header=False)
# Display the DataFrame
display(df)
The display looks like below:
For the time being, I have not found a way to preserve the original header, so I have to define it manually:
# Define the file path
file_path = "Files/products.csv"
# Read the CSV file into an RDD, skipping the first three lines
# (the two junk lines plus the header row)
rdd = sc.textFile(file_path).zipWithIndex().filter(lambda x: x[1] > 2).map(lambda x: x[0])
# Define the header manually
header = ["ProductID", "ProductName", "Category", "ListPrice"]
# Convert the RDD to a DataFrame and apply the header
df = spark.read.csv(rdd).toDF(*header)
# Display the DataFrame
display(df)
The display will look as shown below:
Adjust the number of lines to skip in the code according to your needs.
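Since the lakehouse also needs to be dynamic across environments, the file path itself can be assembled from parameters instead of being hard-coded. A minimal sketch follows; the workspace and lakehouse names are hypothetical placeholders, and the exact OneLake URI format should be verified against the Fabric documentation:

```python
# Hypothetical values; in practice these might come from notebook
# parameters or a deployment pipeline's environment settings.
workspace_name = "my-workspace"
lakehouse_name = "my-lakehouse"

# Assemble an ABFS path for OneLake from the parameters (format
# assumed here; confirm against the Fabric docs for your tenant).
abfs_path = (
    f"abfss://{workspace_name}@onelake.dfs.fabric.microsoft.com/"
    f"{lakehouse_name}.Lakehouse/Files/products.csv"
)
```

The same path string can then be passed to `sc.textFile` or `spark.read` in place of the fixed `Files/products.csv`.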
If you have any other questions please feel free to contact me.
Best Regards,
Yang
Community Support Team
If any post helps, please consider accepting it as the solution to help other members find it more quickly.
If I have misunderstood your needs or you still have problems, please feel free to let us know. Thanks a lot!
@Anonymous, thanks for sharing that. This is definitely promising. The one "blocker" for me would be the static header, as I need this solution to dynamically use the first row remaining after the skipped rows as the header.
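The static header need not be a blocker: after skipping the junk rows, the next remaining row can be captured as the header and the rest treated as data. Below is a minimal sketch of that logic in plain Python with hypothetical sample data; in the RDD approach above, the same split could be done by filtering `zipWithIndex()` for the index equal to the skip count (header row) and greater than it (data rows):

```python
import csv
import io

def skip_and_split_header(text, skip_rows):
    """Skip the first skip_rows lines, then treat the next
    line as the header and the remainder as data rows."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[skip_rows]       # first row after the skipped block
    data = rows[skip_rows + 1:]    # everything after the header
    return header, data

# Hypothetical sample: two junk lines, then a header, then data.
sample = "junk 1\njunk 2\nProductID,ProductName\n1,Bike\n2,Helmet\n"
header, data = skip_and_split_header(sample, skip_rows=2)
# header -> ['ProductID', 'ProductName']
# data   -> [['1', 'Bike'], ['2', 'Helmet']]
```

With the header and data separated this way, a DataFrame could then be built along the lines of `spark.read.csv(data_rdd).toDF(*header)`, as in the earlier reply.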
Some methods are mentioned in this thread: https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/td-p...
Some thoughts / suggestions to try:
Does the order of the options matter in PySpark? I don't know.
Does it make a difference if you rearrange the expression like this?
df = spark.read.format("csv").option("header","true").option("skipRows",25).load(ABFSPath)
or remove the header option like this
df = spark.read.format("csv").option("skipRows",25).load(ABFSPath)