<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How do you remove top N rows from a CSV when loading it into a notebook? in Data Engineering</title>
    <link>https://community.fabric.microsoft.com/t5/Data-Engineering/How-do-you-remove-top-N-rows-from-a-CSV-when-loading-it-into-a/m-p/4063979#M3307</link>
    <description>&lt;P&gt;@Anonymous, thanks for sharing that. This is definitely promising. The one "blocker" for me is the static header: I need the solution to dynamically use the first row that remains after the skipped rows as the header.&lt;/P&gt;</description>
    <pubDate>Fri, 26 Jul 2024 14:36:14 GMT</pubDate>
    <dc:creator>arpost</dc:creator>
    <dc:date>2024-07-26T14:36:14Z</dc:date>
    <item>
      <title>How do you remove top N rows from a CSV when loading it into a notebook?</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/How-do-you-remove-top-N-rows-from-a-CSV-when-loading-it-into-a/m-p/4061899#M3283</link>
      <description>&lt;P&gt;Greetings, community. I have a scenario where I need to skip the first few rows of a CSV file and then save the result back into a lakehouse. The lakehouse reference needs to be dynamic since I'll be deploying the notebook across multiple environments. I am trying to use the following PySpark code, but without success: as far as I can tell, it doesn't skip any rows.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = spark.read.format("csv").option("skipRows",25).option("header","true").load(ABFSPath)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Does anyone have ideas on how I can achieve this?&lt;/P&gt;</description>
      <pubDate>Thu, 25 Jul 2024 16:06:52 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/How-do-you-remove-top-N-rows-from-a-CSV-when-loading-it-into-a/m-p/4061899#M3283</guid>
      <dc:creator>arpost</dc:creator>
      <dc:date>2024-07-25T16:06:52Z</dc:date>
    </item>
    <item>
      <title>Re: How do you remove top N rows from a CSV when loading it into a notebook?</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/How-do-you-remove-top-N-rows-from-a-CSV-when-loading-it-into-a/m-p/4062176#M3286</link>
      <description>&lt;P&gt;Some methods are mentioned in this thread: &lt;A href="https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/td-p/28059" target="_blank" rel="noopener"&gt;https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/td-p/28059&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Some thoughts / suggestions to try:&lt;/P&gt;&lt;P&gt;I don't know whether the order of the options matters in PySpark. Does it make a difference if you rearrange the expression like this?&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = spark.read.format("csv").option("header","true").option("skipRows",25).load(ABFSPath)&lt;/LI-CODE&gt;&lt;P&gt;Or if you remove the header option, like this?&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = spark.read.format("csv").option("skipRows",25).load(ABFSPath)&lt;/LI-CODE&gt;</description>
      <pubDate>Thu, 25 Jul 2024 20:27:16 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/How-do-you-remove-top-N-rows-from-a-CSV-when-loading-it-into-a/m-p/4062176#M3286</guid>
      <dc:creator>frithjof_v</dc:creator>
      <dc:date>2024-07-25T20:27:16Z</dc:date>
    </item>
    <item>
      <title>Re: How do you remove top N rows from a CSV when loading it into a notebook?</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/How-do-you-remove-top-N-rows-from-a-CSV-when-loading-it-into-a/m-p/4062470#M3290</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/307993"&gt;@arpost&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks for the reply from&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/437984"&gt;@frithjof_v&lt;/a&gt;&amp;nbsp;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Your requirement is that you want to skip the first few lines of a CSV file when loading it into a PySpark DataFrame, am I understanding this correctly?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here is the CSV data I used for testing, 5 rows in total:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vhuijieymsft_0-1721958892861.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1139195i61F92BC35DCFDE6A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vhuijieymsft_0-1721958892861.png" alt="vhuijieymsft_0-1721958892861.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The first two lines are indeed not skipped when using the following syntax, so I can reproduce the behavior you describe:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vhuijieymsft_1-1721958892866.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1139196iA9D9BFB81BD90F35/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vhuijieymsft_1-1721958892866.png" alt="vhuijieymsft_1-1721958892866.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Another method you can try:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Read the CSV file into an RDD, dropping the first two lines along with the header:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;# Define the file path
file_path = "Files/products.csv"

# Read the CSV file into an RDD and keep only the lines with index 3 or higher,
# which drops the first two lines along with the header
rdd = sc.textFile(file_path).zipWithIndex().filter(lambda x: x[1] &amp;gt; 2).map(lambda x: x[0])

# Convert the RDD to a DataFrame without the headers
df = spark.read.csv(rdd, header=False)

# Display the DataFrame
display(df)&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The display looks like below:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vhuijieymsft_2-1721958982911.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1139198i277156A4978E8C5D/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vhuijieymsft_2-1721958982911.png" alt="vhuijieymsft_2-1721958982911.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For the time being, I have not found a way to preserve the original header, so I have to define it manually:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;# Define the file path
file_path = "Files/products.csv"

# Read the CSV file into an RDD and keep only the lines with index 3 or higher,
# which drops the first two lines along with the header
rdd = sc.textFile(file_path).zipWithIndex().filter(lambda x: x[1] &amp;gt; 2).map(lambda x: x[0])

# Define the header
header = ["ProductID", "ProductName", "Category", "ListPrice"]

# Convert the RDD to a DataFrame and add a header
df = spark.read.csv(rdd).toDF(*header)

# Display the DataFrame
display(df)&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The display will look as shown below:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vhuijieymsft_3-1721958982917.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1139199iCDF52C6381609DDC/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vhuijieymsft_3-1721958982917.png" alt="vhuijieymsft_3-1721958982917.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Adjust the index threshold in the filter to match the number of lines you need to skip.&lt;/P&gt;
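&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If the header has to come from the file itself, one possible variation on the RDD approach above is to read the line at the skip index and use it for the column names. This is a minimal sketch, not a tested Fabric solution: the helper names parse_header and load_after_skip and the n_skip parameter are hypothetical, and it assumes that no CSV record spans multiple lines and that the header row has as many fields as the data rows.&lt;/P&gt;

```python
import csv


def parse_header(line, sep=","):
    """Split one raw CSV line into column names, honoring quoted fields."""
    return next(csv.reader([line], delimiter=sep))


def load_after_skip(spark, path, n_skip, sep=","):
    """Hypothetical helper: drop the first n_skip lines and use the next line as the header."""
    # Pair every line with its 0-based position in the file.
    indexed = spark.sparkContext.textFile(path).zipWithIndex()
    indexed.cache()  # the RDD is scanned twice: once for the header, once for the data

    # The line at index n_skip is the first one that survives the skip: the header.
    header_line = indexed.filter(lambda x: x[1] == n_skip).map(lambda x: x[0]).first()

    # Every later line is data.
    data = indexed.filter(lambda x: x[1] > n_skip).map(lambda x: x[0])

    # Parse the data lines as CSV, then rename the default _c0, _c1, ... columns.
    return spark.read.csv(data, header=False, sep=sep).toDF(*parse_header(header_line, sep))
```

&lt;P&gt;With the path from the original question, a call might look like df = load_after_skip(spark, ABFSPath, 25). Parsing the header with the csv module rather than a plain split keeps quoted column names that contain the separator intact.&lt;/P&gt;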
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you have any other questions please feel free to contact me.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Best Regards,&lt;BR /&gt;Yang&lt;BR /&gt;Community Support Team&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If any post&amp;nbsp;&lt;STRONG&gt;&lt;EM&gt;helps&lt;/EM&gt;&lt;/STRONG&gt;, please consider&amp;nbsp;&lt;STRONG&gt;&lt;EM&gt;accepting it as the solution&lt;/EM&gt;&lt;/STRONG&gt;&amp;nbsp;so that other members can find it more quickly.&lt;BR /&gt;If I have misunderstood your needs or you still have problems, please feel free to let us know.&amp;nbsp;&lt;STRONG&gt;&lt;EM&gt;Thanks a lot!&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 26 Jul 2024 01:58:18 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/How-do-you-remove-top-N-rows-from-a-CSV-when-loading-it-into-a/m-p/4062470#M3290</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-07-26T01:58:18Z</dc:date>
    </item>
    <item>
      <title>Re: How do you remove top N rows from a CSV when loading it into a notebook?</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/How-do-you-remove-top-N-rows-from-a-CSV-when-loading-it-into-a/m-p/4063979#M3307</link>
      <description>&lt;P&gt;@Anonymous, thanks for sharing that. This is definitely promising. The one "blocker" for me is the static header: I need the solution to dynamically use the first row that remains after the skipped rows as the header.&lt;/P&gt;</description>
      <pubDate>Fri, 26 Jul 2024 14:36:14 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/How-do-you-remove-top-N-rows-from-a-CSV-when-loading-it-into-a/m-p/4063979#M3307</guid>
      <dc:creator>arpost</dc:creator>
      <dc:date>2024-07-26T14:36:14Z</dc:date>
    </item>
  </channel>
</rss>

