Sethulakshmi
Frequent Visitor

Cache memory causing duplicates in Fabric notebook

Hello,

I'm encountering an issue with a merge operation in a notebook, where I'm accessing tables from a lakehouse. The merge command fails with a duplicate error. However, when I query the table using SQL Server Management Studio (SSMS) connected to the lakehouse, it shows zero duplicates. I suspected a caching problem and attempted to resolve it by disabling the cache using the following code:
spark.conf.set("spark.synapse.vegas.useCache", "false")  # disable the intelligent (Vegas) cache for the session
df.cache()      # re-cache the DataFrame...
df.unpersist()  # ...then explicitly drop it from Spark's cache

I also manually switched environments within the notebook and found no duplicates. What's perplexing is that the issue persists when the notebook is triggered via a pipeline, even though there are no duplicates when I test manually. What could cause this discrepancy, and how can it be addressed?
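One way to narrow this down is to count duplicates on the merge key from inside the same Spark session that runs the merge, rather than from SSMS, so both checks see the same data. A minimal pure-Python sketch of that check follows; the column name `id` is a placeholder, since the post doesn't name the actual merge key (in the notebook you would apply the same idea with `df.groupBy("id").count().filter("count > 1")`):

```python
from collections import Counter

def duplicate_keys(keys):
    """Return the key values that occur more than once, sorted."""
    return sorted(k for k, n in Counter(keys).items() if n > 1)

# In the notebook the keys would come from the merge-key column, e.g.
#   keys = [row["id"] for row in df.select("id").collect()]
# ("id" is a placeholder; the post doesn't name the real merge key)
keys = ["a", "b", "b", "c", "c", "c"]
print(duplicate_keys(keys))  # ['b', 'c']
```

If this check reports duplicates inside the pipeline run but SSMS shows none, the discrepancy is in what the Spark session sees, not in the stored table.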

2 REPLIES
v-yiruan-msft
Community Support

Hi @Sethulakshmi ,

Thanks for reaching out to us with your problem. The discrepancy between the manual testing and the pipeline-triggered execution could be due to a variety of factors.

  • Environment Differences: There might be differences between the environment in which you’re manually testing the notebook and the environment in which the pipeline runs. These differences could be in terms of software versions, configurations, or data states.
  • Data Timing Issues: If your data is being updated frequently, it’s possible that duplicates are introduced between the time you manually check for duplicates and the time the pipeline runs.
  • Caching Mechanism: As you suspected, the issue might be related to caching. Note, however, that the caching mechanism is managed by Microsoft Fabric itself, and it doesn't give users a way to manually clear the cache. See Caching in Fabric data warehousing - Microsoft Fabric | Microsoft Learn.

To address this issue, you could try the following:

  • Debugging: Add logging statements in your notebook to capture the state of your data at various points in your pipeline. This could help you identify where and when the duplicates are introduced.
  • Data Snapshot: Create a snapshot of your data before running the merge operation. This could help you identify if the duplicates are present in the data at the time of the merge operation.
  • Environment Consistency: Ensure that the environment in which you’re manually testing the notebook is identical to the environment in which the pipeline runs.
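The debugging suggestion above can be sketched as a small helper that records row and key counts at each pipeline stage, so the run's logs show where duplicates first appear. This is plain Python for illustration only; `id` is a placeholder column name, and in the notebook the rows would come from the lakehouse table:

```python
def key_stats(rows, key):
    """Summarize row count, distinct keys, and duplicate count for one stage."""
    keys = [r[key] for r in rows]
    distinct = len(set(keys))
    return {"rows": len(keys), "distinct": distinct,
            "duplicates": len(keys) - distinct}

# Call this right before the merge (and after each transform) so the
# pipeline run's output pinpoints the stage that introduces duplicates.
rows = [{"id": 1}, {"id": 2}, {"id": 2}]
print("before-merge:", key_stats(rows, "id"))
# before-merge: {'rows': 3, 'distinct': 2, 'duplicates': 1}
```

Logging the same stats in both the manual run and the pipeline run makes it easy to compare what each execution context actually saw.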

 

Best Regards

Community Support Team _ Rena
If this post helps, please consider accepting it as the solution to help other members find it more quickly.

Thanks for the reply.

Regarding environment consistency: I assume that when we trigger a notebook from a pipeline there is no option to choose the environment, so how can I check for consistency? Please suggest if there is an option.
