from pyspark.sql.functions import col, date_format
from pyspark.sql.types import IntegerType, StringType

print("Debug: ORIGINAL_FROM_SCD")
# debug:
debug_info = final_dfs_versioned["issues"].filter(col('vorgangs_ID') == 1000)
debug_info.show()

# Creating a deep copy of each DataFrame in the dictionary
dfs_to_load_into_db = {}
for table_name, df in final_dfs_versioned.items():
    # Deep copy by selecting all columns and caching
    copied_df = df.select("*").cache()
    copied_df.count()  # Trigger an action to cache the DataFrame
    dfs_to_load_into_db[table_name] = copied_df

print("Debug: COPY!")
# debug:
debug_info = dfs_to_load_into_db["issues"].filter(col('vorgangs_ID') == 1000)
debug_info.show()

# Debug: empty all tables before the import!
confirmation = 1
if confirmation == 1:
    print("Everything will be deleted")
    for table_name, table_config in table_mappings.items():
        # Empty the table
        spark.sql(f"DELETE FROM SAMPLE_LH.{table_name}")

# DATAFRAMES HAVE CHANGED FROM HERE ON! (the debug output)
print("Debug: ORIGINAL_FROM_SCD")
# debug:
debug_info = final_dfs_versioned["issues"].filter(col('vorgangs_ID') == 1000)
debug_info.show()

print("Debug: COPY!")
# debug:
debug_info = dfs_to_load_into_db["issues"].filter(col('vorgangs_ID') == 1000)
debug_info.show()
Hi there,
I have a question about a behavior in Notebooks in Fabric.
Context: I am loading new data and using SCD2 to compare it with the old data in the lakehouse, so that I get a new DataFrame that I then want to load into the lakehouse, replacing all the old data there.
While deleting the old data I stumbled upon a strange behaviour:
What I do not understand is that deleting data in the lakehouse somehow affects my PySpark DataFrames: rows disappear or show up in a strange way (see the comment "DATAFRAMES HAVE CHANGED FROM HERE ON!").
Even if I copy one of the DataFrames into a new one and test with that copy, it shows the same behavior, which I did not expect.
I should mention that the table names in the lakehouse are the same as the keys of my DataFrame dictionary, but the behavior still seems very strange to me.
Do you have any idea what is causing this and how I could fix it so that I can manipulate the data in the lakehouse independently? Do I perhaps need to decouple the DataFrames somehow? (Is it because of the same table names?)
Thank you in advance and have a nice day.
Morris
Hi @Morris98 ,
Thanks for using Fabric Community.
As I understand it, you're facing a strange issue with your DataFrames after deleting data from the lakehouse. Let's break down what's happening and how to fix it.
The Issue:
When you create copies of your DataFrames using select("*").cache(), they still reference the same data in the Delta Lake tables. A DataFrame is a lazily evaluated query over its source – it doesn't actually hold the data itself, and cache() only keeps a best-effort in-memory copy that Spark can recompute from the table at any time.
So, when you delete data from the lakehouse tables using spark.sql(), both the original and the copied DataFrames are affected, because they still point to the same underlying tables.
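As a minimal illustration (the issues table name is taken from your code; the row counts are just assumed for the example):
# A DataFrame is a lazy query over the table, so the same object reflects
# whatever the table contains at the moment an action runs.
df = spark.read.table("SAMPLE_LH.issues")
print(df.count())                            # e.g. 1000 rows before the delete
spark.sql("DELETE FROM SAMPLE_LH.issues")
print(df.count())                            # now 0 - the same DataFrame sees the emptied table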
The Fix:
There are a couple of ways to achieve what you want. One is to simply read the table from the lakehouse again after the delete, whenever you need its current contents:
# Delete the old data
spark.sql(f"DELETE FROM SAMPLE_LH.{table_name}")

# Read the table again afterwards, e.g. for the "issues" table
debug_info = spark.read.table(f"SAMPLE_LH.{table_name}")
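For illustration, here is a minimal sketch of the other direction – fully materializing your copies before any DELETE runs, so that their lineage no longer points at the Delta tables. This assumes a Spark runtime where DataFrame.localCheckpoint is available; the dictionary and variable names are taken from your code above:
# Materialize each copy and cut its lineage before the tables are emptied.
dfs_to_load_into_db = {}
for table_name, df in final_dfs_versioned.items():
    # localCheckpoint(eager=True) computes the DataFrame now and stores the result
    # on the executors, so later changes to the SAMPLE_LH tables no longer flow into it.
    dfs_to_load_into_db[table_name] = df.localCheckpoint(eager=True)

# Only now empty the lakehouse tables.
for table_name in table_mappings:
    spark.sql(f"DELETE FROM SAMPLE_LH.{table_name}")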
By implementing one of these approaches, you can make sure your DataFrames are not influenced by deletions in the lakehouse and you can manipulate the data independently.
Hope this helps. Do let me know in case of further queries.
Thank you so much for your detailed answer, v-gchenna-msft.
Your explanation of the underlying issue was very helpful for a newbie like me.
I will definitely implement one of your solutions as soon as I get back to my laptop. The first option in particular seems to fit well with the rest of my code and I will give it a try.
Thank you again and have a nice day!
Best wishes,
Morris
Hi @Morris98 ,
Glad to know that you got some insights. Please continue to use the Fabric Community for any further queries.