Sethulakshmi
Frequent Visitor

Cache memory causing duplicates in Fabric notebook

Hello,

I'm encountering an issue with a merge operation in a notebook, where I'm accessing tables from a lakehouse. The merge command fails with a duplicate error. However, when I query the table using SQL Server Management Studio (SSMS) connected to the lakehouse, it shows zero duplicates. I suspected a caching problem and attempted to resolve it by disabling the cache using the following code:
# Disable the Fabric intelligent (Vegas) cache for this session
spark.conf.set("spark.synapse.vegas.useCache", "false")

# Cache and then immediately drop the Spark-level cached copy of the DataFrame
df.cache()
df.unpersist()

I also manually switched environments within the notebook and found no duplicates. The perplexing aspect is that the issue persists when the notebook is triggered via a pipeline, even though there are no duplicates when tested manually. What could be the potential reasons behind this discrepancy, and how can it be addressed?

2 REPLIES
v-yiruan-msft
Community Support

Hi @Sethulakshmi ,

Thanks for reaching out to us with your problem. The discrepancy between the manual testing and the pipeline-triggered execution could be due to a variety of factors.

  • Environment Differences: There might be differences between the environment in which you’re manually testing the notebook and the environment in which the pipeline runs. These differences could be in terms of software versions, configurations, or data states.
  • Data Timing Issues: If your data is being updated frequently, it’s possible that duplicates are introduced between the time you manually check for duplicates and the time the pipeline runs.
  • Caching Mechanism: As you suspected, the issue might be related to caching. Note, however, that the in-memory and disk caching in Fabric data warehousing is managed by Microsoft Fabric itself, and it does not offer users the capability to clear the cache manually. See Caching in Fabric data warehousing - Microsoft Fabric | Microsoft Learn.

To address this issue, you could try the following:

  • Debugging: Add logging statements in your notebook to capture the state of your data at various points in your pipeline. This could help you identify where and when the duplicates are introduced.
  • Data Snapshot: Create a snapshot of your data before running the merge operation. This could help you identify if the duplicates are present in the data at the time of the merge operation.
  • Environment Consistency: Ensure that the environment in which you’re manually testing the notebook is identical to the environment in which the pipeline runs.

 

Best Regards

Community Support Team _ Rena
If this post helps, then please consider Accepting it as the solution to help other members find it more quickly.

Thanks for the reply,

Regarding environment consistency: I assume that when we trigger a notebook from a pipeline there is no option to choose the environment, so how can I check for consistency? Please suggest if there is an option.
