nikhil0511
Advocate I

Duplicate rows when Full load notebook runs

Hello Forum, 

 

I have a table named OINV, ingested from SAP into a Bronze layer lakehouse via a dataflow. I would like to ingest this table from the Bronze to the Silver lakehouse with the following code.

 

Cell 1

%%sql
DROP TABLE IF EXISTS OINV
 
Cell 2
# Initial full load of OINV from Bronze to Silver
from pyspark.sql.types import *

source_table_name = 'OINV'
target_table_name = 'OINV_new'

OINV_df = spark.read.parquet('Files/' + source_table_name)
OINV_df\
    .write\
    .mode('overwrite')\
    .format('delta')\
    .save('Tables/' + target_table_name)
 
At the initial run, when the dataflow refresh completes, the table is created in the Bronze layer. I had created a shortcut in the Silver lakehouse before running the notebook. On the first run, the number of rows in both tables is the same. When I run the code a second time, the row count multiplies by 2 or sometimes 3. Interestingly, when I run the same code again after that, the row count stays the same; the multiplication happens only on the second run.
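For reference, the comparison can be reproduced with a quick count check. This is only a minimal sketch reusing the variable names from Cell 2 above; the Delta read of the target is an assumption about where the duplicates accumulate.

# Sketch: compare source and target row counts after the load
src_count = spark.read.parquet('Files/' + source_table_name).count()
tgt_count = spark.read.format('delta').load('Tables/' + target_table_name).count()
print(f"source: {src_count}, target: {tgt_count}")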
 
This is happening not only in the same workspace but across all workspaces in our tenant.
 
Please help us fix this.
Thank you
Nikhil
1 ACCEPTED SOLUTION

Hi @nikhil0511 

 

Apologies for the inconvenience.

Please reach out to our support team to gain deeper insights and explore potential solutions. Their expertise will be invaluable in suggesting the most appropriate approach.

Please go ahead and raise a support ticket to reach our support team:

https://support.fabric.microsoft.com/support

After creating a support ticket, please share the ticket number with us, as it will help us track the request and follow up with more information.

 

Thank you.


6 REPLIES
nikhil0511
Advocate I

Hello V-cboorla-msft,

 

Following your advice, today I ran the following code to see whether there is a chance the number of rows doesn't double.

spark.catalog.clearCache()
 
%%sql
DROP TABLE IF EXISTS OINV
 
# Initial full load of OINV from Bronze to Silver
from pyspark.sql.types import *

source_table_name = 'OINV'
target_table_name = 'OINV_new'

OINV_df = spark.read.parquet('Files/' + source_table_name)
OINV_df\
    .write\
    .mode('overwrite')\
    .format('delta')\
    .save('Tables/' + target_table_name)
 
df = spark.sql("SELECT * FROM DE_LH_200_SILVER_MarketingDocuments.OINV")
df.count()
I deleted the tables in the Bronze lakehouse that were created by the dataflow, re-ran the dataflow, and executed the above notebook cells. Initially I have 657966 rows. After refreshing the dataflow a second time and executing the notebook, the result is 1315948.
 
It seems the caching step and the table deletion step are not helping.
Please note I am connecting the tables from the Bronze layer as a shortcut in Files/OINV.
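If it helps to narrow this down, the files behind the shortcut can be listed to see whether old Parquet files pile up between dataflow refreshes. This is a sketch using Fabric's built-in mssparkutils; note that if a _delta_log folder shows up, a plain spark.read.parquet over the folder may also pick up superseded Parquet files from earlier refreshes (an assumption worth verifying, not a confirmed cause).

# Sketch: list what is actually stored under the Files/OINV shortcut
for f in mssparkutils.fs.ls("Files/OINV"):
    print(f.name, f.size)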

 


I have created the ticket with support as suggested.

2406170050002527

Hi @nikhil0511 

 

We haven’t heard from you since the last response and were just checking back to see if you've had a chance to submit a support ticket. If you have, a reference to the ticket number would be greatly appreciated. This will allow us to track the progress of your request and ensure you receive the most efficient support possible.

 

Thank you.

nikhil0511
Advocate I

Hello V-cboorla-msft, 

Thank you for the response. 

We are indeed using the cache-clearing code for one of our tables from Bronze to Silver.

# Step 1: Unpersist the DataFrame if it is cached

if 'epages_web_stores_df' in locals() or 'epages_web_stores_df' in globals():
    epages_web_stores_df.unpersist()
# Step 2: Set the DataFrame variable to None
epages_web_stores_df = None
Even after using this, together with the same code that overwrites the table data, we are still seeing duplicate rows across the entire table. The number of rows simply doubles, even on the first run.
v-cboorla-msft
Community Support

Hi @nikhil0511 

 

Thanks for using Microsoft Fabric Community.

It seems like you are experiencing an issue where the number of rows in your table multiplies after the second run of your code. Here are a few possible causes that might help you.

Data Duplication:

  • If your data source (in this case, the ‘OINV’ table in the Bronze layer) is being updated between your runs, and the new data includes rows that were already present in the previous data, this could lead to duplication when you overwrite the Silver layer table.
  • Ensure that the ‘OINV’ table in the Bronze layer isn’t being updated with duplicate rows between your runs.

Caching Issues:

  • Spark, the processing engine you are using, caches data for performance optimization. Sometimes, this can lead to unexpected results if the cache isn’t invalidated properly.
  • Before running your code, you can clear the Spark cache to ensure that you are working with the most recent data. You can do this by calling spark.catalog.clearCache(). A combined sketch is shown after this list.
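For illustration, a defensive version of the load that combines both suggestions might look like the sketch below. This is only a sketch: the path and table names mirror the ones in this thread, and dropDuplicates() as written assumes fully duplicated rows, so adapt the key columns to your data.

# Sketch: clear the cache, then deduplicate defensively before overwriting
spark.catalog.clearCache()

src_df = spark.read.parquet('Files/OINV')
src_df = src_df.dropDuplicates()  # e.g. dropDuplicates(['DocEntry']) if DocEntry is the key (illustrative)

src_df.write \
    .mode('overwrite') \
    .format('delta') \
    .save('Tables/OINV_new')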

If the issue still persists, please do let us know. Glad to help.

 

I hope this information helps.

 

Thank you.
