<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Duplicated rows between notebook and SQL Endpoint in Data Engineering</title>
    <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3707454#M1838</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/686932"&gt;@amaaiia&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;I ran the same check on my Lakehouse tables, but the counts match in both places. I checked two tables.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vnikhilanmsft_0-1708333813868.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1044792i87EDDCAD246CB14D/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vnikhilanmsft_0-1708333813868.png" alt="vnikhilanmsft_0-1708333813868.png" /&gt;&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vnikhilanmsft_1-1708333859160.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1044793iD61D57A46B05C81F/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vnikhilanmsft_1-1708333859160.png" alt="vnikhilanmsft_1-1708333859160.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;The discrepancy in row counts between SQL Endpoint and notebook could be due to several reasons:&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Data Duplication&lt;/STRONG&gt;: There might be duplicate rows in your data. When you read the data into a DataFrame and use df.count(), it counts all rows, including duplicates.&amp;nbsp;If this is the case, you can remove duplicates using the &lt;STRONG&gt;df.dropDuplicates()&lt;/STRONG&gt; function.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Data Inconsistency&lt;/STRONG&gt;: There might be inconsistencies between the data in your SQL Endpoint and the data in your notebook. This could be due to issues with data ingestion, data updates, or data synchronization. You can verify this by comparing a subset of your data in both SQL Endpoint and notebook.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Caching Issues:&lt;/STRONG&gt; Sometimes, Spark might cache the DataFrame, and if the underlying data changes, the cached DataFrame might not reflect these changes. You can try to clear the cache using the &lt;STRONG&gt;spark.catalog.clearCache()&lt;/STRONG&gt; function in your notebook.&lt;BR /&gt;&lt;BR /&gt;Hope this helps. Please let me know if you have any further questions.&lt;/P&gt;</description>
    <pubDate>Mon, 19 Feb 2024 09:19:23 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2024-02-19T09:19:23Z</dc:date>
    <item>
      <title>Duplicated rows between notebook and SQL Endpoint</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3707317#M1835</link>
      <description>&lt;P&gt;I'm having trouble counting the rows in a table in my Lakehouse.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If I count the rows from the SQL Endpoint, I get&amp;nbsp;221555 rows. If I read the table from a notebook and count the rows with df.count(), I get exactly twice that number:&amp;nbsp;443110.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;How is this possible?&lt;/P&gt;</description>
      <pubDate>Mon, 19 Feb 2024 08:37:59 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3707317#M1835</guid>
      <dc:creator>amaaiia</dc:creator>
      <dc:date>2024-02-19T08:37:59Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicated rows between notebook and SQL Endpoint</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3707342#M1836</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/686932"&gt;@amaaiia&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;Thanks for using Fabric Community.&lt;BR /&gt;Can you please provide the screenshots for the SQL code and the notebook code? This would help me to understand the question better.&amp;nbsp;&lt;BR /&gt;Thanks&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Feb 2024 08:45:16 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3707342#M1836</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-02-19T08:45:16Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicated rows between notebook and SQL Endpoint</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3707399#M1837</link>
      <description>&lt;P&gt;Sure, when I get the number of rows from SQL endpoint:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="amaaiia_3-1708332935251.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1044780i79E215419428AC97/image-size/medium?v=v2&amp;amp;px=400" role="button" title="amaaiia_3-1708332935251.png" alt="amaaiia_3-1708332935251.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;And when I count them through notebook:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="amaaiia_0-1708334633234.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1044803i9B62B9D6E40CE042/image-size/medium?v=v2&amp;amp;px=400" role="button" title="amaaiia_0-1708334633234.png" alt="amaaiia_0-1708334633234.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Feb 2024 09:24:05 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3707399#M1837</guid>
      <dc:creator>amaaiia</dc:creator>
      <dc:date>2024-02-19T09:24:05Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicated rows between notebook and SQL Endpoint</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3707454#M1838</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/686932"&gt;@amaaiia&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;I ran the same check on my Lakehouse tables, but the counts match in both places. I checked two tables.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vnikhilanmsft_0-1708333813868.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1044792i87EDDCAD246CB14D/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vnikhilanmsft_0-1708333813868.png" alt="vnikhilanmsft_0-1708333813868.png" /&gt;&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vnikhilanmsft_1-1708333859160.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1044793iD61D57A46B05C81F/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vnikhilanmsft_1-1708333859160.png" alt="vnikhilanmsft_1-1708333859160.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;The discrepancy in row counts between SQL Endpoint and notebook could be due to several reasons:&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Data Duplication&lt;/STRONG&gt;: There might be duplicate rows in your data. When you read the data into a DataFrame and use df.count(), it counts all rows, including duplicates.&amp;nbsp;If this is the case, you can remove duplicates using the &lt;STRONG&gt;df.dropDuplicates()&lt;/STRONG&gt; function.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Data Inconsistency&lt;/STRONG&gt;: There might be inconsistencies between the data in your SQL Endpoint and the data in your notebook. This could be due to issues with data ingestion, data updates, or data synchronization. You can verify this by comparing a subset of your data in both SQL Endpoint and notebook.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Caching Issues:&lt;/STRONG&gt; Sometimes, Spark might cache the DataFrame, and if the underlying data changes, the cached DataFrame might not reflect these changes. You can try to clear the cache using the &lt;STRONG&gt;spark.catalog.clearCache()&lt;/STRONG&gt; function in your notebook.&lt;BR /&gt;&lt;BR /&gt;Hope this helps. Please let me know if you have any further questions.&lt;/P&gt;</description>
      <pubDate>Mon, 19 Feb 2024 09:19:23 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3707454#M1838</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-02-19T09:19:23Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicated rows between notebook and SQL Endpoint</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3711275#M1839</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/686932"&gt;@amaaiia&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;&lt;SPAN&gt;We haven’t heard from you since the last response and are checking back to see whether your query was resolved. If not, please reply with more details and we will try to help.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Thanks&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 20 Feb 2024 14:12:56 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3711275#M1839</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-02-20T14:12:56Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicated rows between notebook and SQL Endpoint</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3712054#M1840</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I didn't find a solution. As I was in a dev environment trying a demo, I just deleted the data and loaded it again. I'm not getting duplicates now. I hope this won't happen again.&lt;/P&gt;</description>
      <pubDate>Tue, 20 Feb 2024 18:55:02 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3712054#M1840</guid>
      <dc:creator>amaaiia</dc:creator>
      <dc:date>2024-02-20T18:55:02Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicated rows between notebook and SQL Endpoint</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3712317#M1841</link>
      <description>&lt;P&gt;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/686932"&gt;@amaaiia&lt;/a&gt;&amp;nbsp;Thanks for sharing!&lt;BR /&gt;&lt;BR /&gt;Did you use Dataflow Gen2 to ingest data into your Lakehouse?&lt;BR /&gt;&lt;BR /&gt;Here is a similar issue:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://community.fabric.microsoft.com/t5/General-Discussion/Duplicated-Rows-In-Tables-Built-By-Notebook/m-p/3680801#M3995" target="_blank"&gt;https://community.fabric.microsoft.com/t5/General-Discussion/Duplicated-Rows-In-Tables-Built-By-Notebook/m-p/3680801#M3995&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 20 Feb 2024 21:26:05 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3712317#M1841</guid>
      <dc:creator>frithjof_v</dc:creator>
      <dc:date>2024-02-20T21:26:05Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicated rows between notebook and SQL Endpoint</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3719925#M1842</link>
      <description>&lt;P&gt;I'm experiencing a similar issue:&lt;BR /&gt;My Dataflow Gen2 stores data in a Lakehouse. Once in a while the 'replace' table setting in the Dataflow Gen2 doesn't seem to work, and the same data ends up copied twice. It only seems to affect smaller tables. If I delete the table, it works for a few days, but then suddenly there are duplicates again.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Feb 2024 13:36:49 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3719925#M1842</guid>
      <dc:creator>MysticSapphire</dc:creator>
      <dc:date>2024-02-23T13:36:49Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicated rows between notebook and SQL Endpoint</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3811281#M1843</link>
      <description>&lt;P&gt;I created an idea:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://ideas.fabric.microsoft.com/ideas/idea/?ideaid=cea60fc6-93f2-ee11-a73e-6045bd7cb2b6" target="_blank" rel="noopener nofollow noreferrer"&gt;https://ideas.fabric.microsoft.com/ideas/idea/?ideaid=cea60fc6-93f2-ee11-a73e-6045bd7cb2b6&lt;/A&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Please vote if you want this issue to be fixed&lt;/P&gt;</description>
      <pubDate>Thu, 04 Apr 2024 15:02:09 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3811281#M1843</guid>
      <dc:creator>frithjof_v</dc:creator>
      <dc:date>2024-04-04T15:02:09Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicated rows between notebook and SQL Endpoint</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3811418#M1844</link>
      <description>&lt;P&gt;Hi all, there's apparently a bug where the metadata tracking which Parquet files are current can get out of sync between the Lakehouse SQL endpoint and the notebooks. I worked a ticket with Microsoft and they had me run the following code. After running it, wait 30 minutes or so before rerunning the notebook. In my case this fixed the issue completely.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;&lt;SPAN&gt;mssparkutils.fs.unmount(&lt;/SPAN&gt;&lt;SPAN&gt;"/default"&lt;/SPAN&gt;&lt;SPAN&gt;, {&lt;/SPAN&gt;&lt;SPAN&gt;"scope"&lt;/SPAN&gt;&lt;SPAN&gt;:&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;"default_lh"&lt;/SPAN&gt;&lt;SPAN&gt;})&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;sc._jvm.com.microsoft.spark.notebook.common.trident.TridentRuntimeContext.reset()&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;sc._jvm.com.microsoft.spark.notebook.common.trident.TridentRuntimeContext.personalizeSession()&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Hope this helps,&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Scott&lt;/SPAN&gt;&lt;/DIV&gt;</description>
      <pubDate>Thu, 04 Apr 2024 15:55:16 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3811418#M1844</guid>
      <dc:creator>Scott_Powell</dc:creator>
      <dc:date>2024-04-04T15:55:16Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicated rows between notebook and SQL Endpoint</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3811462#M1845</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/535913"&gt;@Scott_Powell&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;Thanks for sharing the solution here. Please continue using Fabric Community for any help regarding your queries.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Apr 2024 16:15:17 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/3811462#M1845</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-04-04T16:15:17Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicated rows between notebook and SQL Endpoint</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/4597319#M7722</link>
      <description>&lt;P&gt;Does this mean that when reading Lakehouse tables from a notebook, I need to run these 3 lines of code for every table I read?&lt;/P&gt;</description>
      <pubDate>Wed, 05 Mar 2025 21:49:05 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Duplicated-rows-between-notebook-and-SQL-Endpoint/m-p/4597319#M7722</guid>
      <dc:creator>DCELL</dc:creator>
      <dc:date>2025-03-05T21:49:05Z</dc:date>
    </item>
  </channel>
</rss>

