nikhil0511
Advocate I

Duplicate rows when Full load notebook runs

Hello Forum, 

 

I have a table named OINV, ingested from SAP into a Bronze layer lakehouse via a dataflow. I would like to ingest this table from the Bronze to the Silver lakehouse with the following code.

 

Cell 1

%%sql
DROP TABLE IF EXISTS OINV
 
Cell 2
# Initial full load of OINV from Bronze to Silver
from pyspark.sql.types import *

source_table_name = 'OINV'
target_table_name = 'OINV_new'

OINV_df = spark.read.parquet('Files/' + source_table_name)
OINV_df\
    .write\
    .mode('overwrite')\
    .format('delta')\
    .save('Tables/' + target_table_name)
 
At the initial run, when the dataflow refresh completes, the table is created in the Bronze layer. I had created a shortcut in the Silver lakehouse before running the notebook. On the first run, the number of rows in both tables is the same. When I run the code a second time, the row count multiplies by 2 or sometimes 3. Interestingly, when I run the same code again after that, the row count stays the same; the multiplication happens only on the second run.
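For reference, the comparison can be reproduced with a quick count check. This is only a minimal sketch reusing the variable names from Cell 2 above; the Delta read of the target is an assumption about where the duplicates accumulate.

# Sketch: compare source and target row counts after the load
src_count = spark.read.parquet('Files/' + source_table_name).count()
tgt_count = spark.read.format('delta').load('Tables/' + target_table_name).count()
print(f"source: {src_count}, target: {tgt_count}")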
 
This is happening not only in the same workspace but across all workspaces in our tenant.
 
Please help us fix this.
Thank you
Nikhil
1 ACCEPTED SOLUTION

Hi @nikhil0511 

 

Apologies for the inconvenience.

Please reach out to our support team to gain deeper insights and explore potential solutions. Their expertise will be invaluable in suggesting the most appropriate approach.

Please go ahead and raise a support ticket to reach our support team:

https://support.fabric.microsoft.com/support

After creating a support ticket, please share the ticket number with us, as it will help us track the request and follow up with more information.

 

Thank you.


6 REPLIES
nikhil0511
Advocate I

Hello V-cboorla-msft,

 

Following your advice, today I ran the following code to see whether there is a chance the number of rows doesn't double.

spark.catalog.clearCache()
 
%%sql
DROP TABLE IF EXISTS OINV
 
# Initial full load of OINV from Bronze to Silver
from pyspark.sql.types import *

source_table_name = 'OINV'
target_table_name = 'OINV_new'

OINV_df = spark.read.parquet('Files/' + source_table_name)
OINV_df\
    .write\
    .mode('overwrite')\
    .format('delta')\
    .save('Tables/' + target_table_name)
 
df = spark.sql("SELECT * FROM DE_LH_200_SILVER_MarketingDocuments.OINV")
df.count()
I deleted the tables in the Bronze lakehouse that were created by the dataflow, re-ran the dataflow, and executed the above notebook cells. Initially I have 657966 rows. After refreshing the dataflow a second time and executing the notebook, the result is 1315948.
 
It seems the caching step and the table deletion step are not helping.
Please note I am connecting the tables from the Bronze layer as a shortcut in Files/OINV.
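If it helps to narrow this down, the files behind the shortcut can be listed to see whether old Parquet files pile up between dataflow refreshes. This is a sketch using Fabric's built-in mssparkutils; note that if a _delta_log folder shows up, a plain spark.read.parquet over the folder may also pick up superseded Parquet files from earlier refreshes (an assumption worth verifying, not a confirmed cause).

# Sketch: list what is actually stored under the Files/OINV shortcut
for f in mssparkutils.fs.ls("Files/OINV"):
    print(f.name, f.size)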

 


I have created the ticket with support as suggested.

2406170050002527

Hi @nikhil0511 

 

We haven’t heard from you since the last response and were just checking back to see if you've had a chance to submit a support ticket. If you have, a reference to the ticket number would be greatly appreciated. This will allow us to track the progress of your request and ensure you receive the most efficient support possible.

 

Thank you.

nikhil0511
Advocate I

Hello V-cboorla-msft, 

Thank you for the response. 

We are indeed using the cache-clearing code for one of our tables from Bronze to Silver.

# Step 1: Unpersist the DataFrame if it is cached

if 'epages_web_stores_df' in locals() or 'epages_web_stores_df' in globals():
    epages_web_stores_df.unpersist()
# Step 2: Set the DataFrame variable to None
epages_web_stores_df = None
Even after using this, together with the same code that overwrites the table data, we are still seeing duplicate rows across the entire table. The number of rows simply doubles, even on the first run.
v-cboorla-msft
Community Support

Hi @nikhil0511 

 

Thanks for using Microsoft Fabric Community.

It seems like you are experiencing an issue where the number of rows in your table multiplies after the second run of your code. Here are a few possible causes that might help you.

Data Duplication:

  • If your data source (in this case, the ‘OINV’ table in the Bronze layer) is being updated between your runs, and the new data includes rows that were already present in the previous data, this could lead to duplication when you overwrite the Silver layer table.
  • Ensure that the ‘OINV’ table in the Bronze layer isn’t being updated with duplicate rows between your runs.

Caching Issues:

  • Spark, the processing engine you are using, caches data for performance optimization. Sometimes, this can lead to unexpected results if the cache isn’t invalidated properly.
  • Before running your code, you can clear the Spark cache to ensure that you are working with the most recent data. You can do this by calling spark.catalog.clearCache(). A combined sketch is shown after this list.
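For illustration, a defensive version of the load that combines both suggestions might look like the sketch below. This is only a sketch: the path and table names mirror the ones in this thread, and dropDuplicates() as written assumes fully duplicated rows, so adapt the key columns to your data.

# Sketch: clear the cache, then deduplicate defensively before overwriting
spark.catalog.clearCache()

src_df = spark.read.parquet('Files/OINV')
src_df = src_df.dropDuplicates()  # e.g. dropDuplicates(['DocEntry']) if DocEntry is the key (illustrative)

src_df.write \
    .mode('overwrite') \
    .format('delta') \
    .save('Tables/OINV_new')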

If the issue still persists, please do let us know. Glad to help.

 

I hope this information helps.

 

Thank you.
