Sethulakshmi
Frequent Visitor

Cache memory causing duplicates in Fabric notebook

Hello,

I'm encountering an issue with a merge operation in a notebook, where I'm accessing tables from a lakehouse. The merge command fails with a duplicate error. However, when I query the table using SQL Server Management Studio (SSMS) connected to the lakehouse, it shows zero duplicates. I suspected a caching problem and attempted to resolve it by disabling the cache using the following code:
spark.conf.set("spark.synapse.vegas.useCache", "false")  # disable the intelligent (Vegas) cache for the session
df.cache()      # re-cache the DataFrame...
df.unpersist()  # ...then explicitly drop it from Spark's cache

I also manually switched environments within the notebook and found no duplicates. What's perplexing is that the issue persists when the notebook is triggered via a pipeline, even though there are no duplicates when I test manually. What could cause this discrepancy, and how can it be addressed?
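One way to narrow this down is to count duplicates on the merge key from inside the same Spark session that runs the merge, rather than from SSMS, so both checks see the same data. A minimal pure-Python sketch of that check follows; the column name `id` is a placeholder, since the post doesn't name the actual merge key (in the notebook you would apply the same idea with `df.groupBy("id").count().filter("count > 1")`):

```python
from collections import Counter

def duplicate_keys(keys):
    """Return the key values that occur more than once, sorted."""
    return sorted(k for k, n in Counter(keys).items() if n > 1)

# In the notebook the keys would come from the merge-key column, e.g.
#   keys = [row["id"] for row in df.select("id").collect()]
# ("id" is a placeholder; the post doesn't name the real merge key)
keys = ["a", "b", "b", "c", "c", "c"]
print(duplicate_keys(keys))  # ['b', 'c']
```

If this check reports duplicates inside the pipeline run but SSMS shows none, the discrepancy is in what the Spark session sees, not in the stored table.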

2 REPLIES
v-yiruan-msft
Community Support

Hi @Sethulakshmi ,

Thanks for reaching out to us with your problem. The discrepancy between the manual testing and the pipeline-triggered execution could be due to a variety of factors.

  • Environment Differences: There might be differences between the environment in which you’re manually testing the notebook and the environment in which the pipeline runs. These differences could be in terms of software versions, configurations, or data states.
  • Data Timing Issues: If your data is being updated frequently, it’s possible that duplicates are introduced between the time you manually check for duplicates and the time the pipeline runs.
  • Caching Mechanism: As you suspected, the issue might be related to caching. Note, however, that the caching mechanism is managed by Microsoft Fabric itself, and it doesn't give users a way to manually clear the cache. See Caching in Fabric data warehousing - Microsoft Fabric | Microsoft Learn.

To address this issue, you could try the following:

  • Debugging: Add logging statements in your notebook to capture the state of your data at various points in your pipeline. This could help you identify where and when the duplicates are introduced.
  • Data Snapshot: Create a snapshot of your data before running the merge operation. This could help you identify if the duplicates are present in the data at the time of the merge operation.
  • Environment Consistency: Ensure that the environment in which you’re manually testing the notebook is identical to the environment in which the pipeline runs.
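The debugging suggestion above can be sketched as a small helper that records row and key counts at each pipeline stage, so the run's logs show where duplicates first appear. This is plain Python for illustration only; `id` is a placeholder column name, and in the notebook the rows would come from the lakehouse table:

```python
def key_stats(rows, key):
    """Summarize row count, distinct keys, and duplicate count for one stage."""
    keys = [r[key] for r in rows]
    distinct = len(set(keys))
    return {"rows": len(keys), "distinct": distinct,
            "duplicates": len(keys) - distinct}

# Call this right before the merge (and after each transform) so the
# pipeline run's output pinpoints the stage that introduces duplicates.
rows = [{"id": 1}, {"id": 2}, {"id": 2}]
print("before-merge:", key_stats(rows, "id"))
# before-merge: {'rows': 3, 'distinct': 2, 'duplicates': 1}
```

Logging the same stats in both the manual run and the pipeline run makes it easy to compare what each execution context actually saw.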

 

Best Regards

Community Support Team _ Rena
If this post helps, please consider accepting it as the solution to help other members find it more quickly.

Thanks for the reply.

Regarding environment consistency: I assume that when we trigger a notebook from a pipeline there is no option to choose the environment, so how can I check for consistency? Please suggest if there is an option.
