
How do you remove top N rows from a CSV when loading it into a notebook?

arpost — Thu, 25 Jul 2024 16:06:52 GMT

Greetings, community. I have a scenario where I need to skip the first few rows of a CSV file and then save the result back into a lakehouse. I need the lakehouse to be dynamic since I'll be deploying the notebook across multiple environments. I am trying the following PySpark code, but as far as I can tell it doesn't skip any rows:

df = spark.read.format("csv").option("skipRows",25).option("header","true").load(ABFSPath)

Anyone have ideas on how I can achieve this?

Re: How do you remove top N rows from a CSV when loading it into a notebook?

frithjof_v — Thu, 25 Jul 2024 20:27:16 GMT

Some methods are mentioned in this thread: https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/td-p/28059

Some thoughts / suggestions to try:

Does the order of the options matter in PySpark? I don't know.

Does it make a difference if you rearrange the expression like this?

df = spark.read.format("csv").option("header","true").option("skipRows",25).load(ABFSPath)

or remove the header option like this

df = spark.read.format("csv").option("skipRows",25).load(ABFSPath)

Re: How do you remove top N rows from a CSV when loading it into a notebook?

Anonymous — Fri, 26 Jul 2024 01:58:18 GMT

Hi @arpost ,

Thanks for the reply from @frithjof_v .

If I understand correctly, you want to skip the first few lines of a CSV file when loading it into a PySpark DataFrame. Is that right?

Here's my csv data used for testing, 5 rows in total:

I can confirm that in my test the first two lines are not skipped with this syntax, so I understand your concern.

You can try another approach:

Read the CSV file into an RDD, then drop the header and the first two data lines:

# Define the file path
file_path = "Files/products.csv"

# Read the CSV file into an RDD and keep only lines whose index is greater than 2
# (index 0 is the header, indexes 1 and 2 are the first two data lines)
rdd = sc.textFile(file_path).zipWithIndex().filter(lambda x: x[1] > 2).map(lambda x: x[0])

# Convert the RDD to a DataFrame without headers
df = spark.read.csv(rdd, header=False)

# Display the DataFrame
display(df)

The display looks like below:

For the time being, I have not found a way to preserve the original header, so I have to define it manually:

# Define the file path
file_path = "Files/products.csv"

# Read the CSV file into the RDD and skip the first two lines
rdd = sc.textFile(file_path).zipWithIndex().filter(lambda x: x[1] > 2).map(lambda x: x[0])

# Define the header
header = ["ProductID", "ProductName", "Category", "ListPrice"]

# Convert the RDD to a DataFrame and add a header
df = spark.read.csv(rdd).toDF(*header)

# Display the DataFrame
display(df)

The display will look as shown below:

Adjust the index threshold in the filter to match the number of lines you need to skip.
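To make the index arithmetic easier to check, here is a small plain-Python sketch that mimics what zipWithIndex followed by the filter does. It needs no Spark session. The sample rows and the helper name surviving_lines are made up for illustration; only the column names come from the code above.

```python
# Hypothetical sample: one header line followed by five made-up data lines.
lines = [
    "ProductID,ProductName,Category,ListPrice",
    "1,Item A,Category X,10.00",
    "2,Item B,Category X,20.00",
    "3,Item C,Category Y,30.00",
    "4,Item D,Category Y,40.00",
    "5,Item E,Category Z,50.00",
]


def surviving_lines(lines, threshold):
    """Mimic rdd.zipWithIndex().filter(lambda x: x[1] > threshold).map(lambda x: x[0])."""
    # zipWithIndex yields (value, index) pairs, with the index starting at 0.
    indexed = [(line, idx) for idx, line in enumerate(lines)]
    return [line for line, idx in indexed if idx > threshold]


# With a threshold of 2, the lines at indexes 0, 1 and 2 are dropped.
kept = surviving_lines(lines, 2)
for line in kept:
    print(line)
```

Because the index starts at 0, a filter of `> k` drops k + 1 lines in total, which is worth keeping in mind when counting whether the header is among the skipped lines.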

If you have any other questions please feel free to contact me.

Best Regards,
Yang
Community Support Team


Re: How do you remove top N rows from a CSV when loading it into a notebook?

arpost — Fri, 26 Jul 2024 14:36:14 GMT

@Anonymous, thanks for sharing that. This is definitely promising. The one "blocker" for me would be the static header as I need this solution to be able to dynamically use the first row after skipping the previous rows.
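
One possible way around the static header, offered as an untested sketch rather than a confirmed fix: keep the first line that survives the skip and pass header=True when converting the RDD, so that line supplies the column names. The function names keep_from and read_csv_skipping are hypothetical, and the sketch assumes a single file in which no field contains an embedded line break, since textFile splits on physical lines.

```python
def keep_from(n_skip):
    """Build a predicate for (line, index) pairs that keeps index n_skip onward."""
    return lambda pair: pair[1] >= n_skip


def read_csv_skipping(spark, path, n_skip):
    """Drop the first n_skip physical lines, then treat the next line as the header.

    `spark` is the notebook's SparkSession; `path` is whatever path you already
    pass to load(), for example the ABFSPath variable from the first post.
    """
    lines = (
        spark.sparkContext.textFile(path)
        .zipWithIndex()                 # (line, index) pairs, index starts at 0
        .filter(keep_from(n_skip))      # discard the first n_skip lines
        .map(lambda pair: pair[0])      # back to plain strings
    )
    # With header=True, the first remaining line becomes the column names.
    return spark.read.csv(lines, header=True)


# Example call inside a notebook (not executed here):
# df = read_csv_skipping(spark, ABFSPath, 25)
```

Note the filter uses `>=` here, so n_skip is exactly the number of lines removed before the header row. Since the path is a parameter, the same function should work in each environment as long as the notebook builds the path for the current lakehouse.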