Even though I have set up a Lakehouse as the default for my notebook, when I try to refer to tables directly in the Spark notebook, they are not recognized. Please find the attached screen print and suggest what the reason could be.
Note: In the attached screen print, using a SELECT on a lakehouse table to create a dataframe does work.
For your code above, @SamKum, the join function requires a dataframe to join with, not a Delta Table.
With the Default Lakehouse attached to your notebook, you can reference the Delta Table to build a dataframe either via the absolute path:
absolute_path = 'abfss://00000000-0000-0000-0000-000000000000@onelake.dfs.fabric.microsoft.com/00000000-0000-0000-0000-000000000000/Tables/DimCustomer'
# Load the Delta Table from the absolute OneLake path
absolute_df = spark.read.format("delta").load(absolute_path)
display(absolute_df)
Or via the relative path:
relative_path = 'Tables/DimCustomer'
relative_df = spark.read.format("delta").load(relative_path)
display(relative_df)
Or via a Spark SQL query against the default lakehouse:
# Default query generated when you drag and drop the table into the notebook
df = spark.sql("SELECT * FROM Databard_Demo.DimCustomer LIMIT 1000")
display(df)
You are correct, using the relative path or Spark SQL is simpler. However, it requires attaching your notebook to a default lakehouse, which must then be managed should you deploy your notebook to different environments (e.g. Dev/Test/Prod). Without a default lakehouse, the Spark SQL and relative-path options won't work, and you would have to reference the absolute path.
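For example, here is a minimal sketch of how you might parameterize the absolute path per environment; the environment names, placeholder IDs, and the way the environment value is set are assumptions for illustration only:
# Hypothetical per-environment OneLake roots (placeholders, not real IDs)
environment = "dev"  # could come from a notebook parameter cell or a pipeline setting
onelake_roots = {
    "dev": "abfss://<dev-workspace-id>@onelake.dfs.fabric.microsoft.com/<dev-lakehouse-id>",
    "test": "abfss://<test-workspace-id>@onelake.dfs.fabric.microsoft.com/<test-lakehouse-id>",
    "prod": "abfss://<prod-workspace-id>@onelake.dfs.fabric.microsoft.com/<prod-lakehouse-id>",
}
# Build the absolute path for the current environment and load the table
dim_customer_path = f"{onelake_roots[environment]}/Tables/DimCustomer"
dim_customer_df = spark.read.format("delta").load(dim_customer_path)
display(dim_customer_df)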
As far as I'm aware, there is no way to reference Delta Tables directly by name in a join like in your example, at least not without first loading each Delta Table into a PySpark dataframe. The closest I've seen is loading the Delta Tables into their own dataframes and then using Spark SQL to join them. Here's an example:
# These use absolute paths, but relative paths also work when a default lakehouse is attached
delta_table_path_1 = "abfss://<your-workspace-id>@onelake.dfs.fabric.microsoft.com/<your-lakehouse-id>/<your-delta-table-path-1>"
delta_table_path_2 = "abfss://<your-workspace-id>@onelake.dfs.fabric.microsoft.com/<your-lakehouse-id>/<your-delta-table-path-2>"
# Load the Delta Tables
delta_table_df_1 = spark.read.format("delta").load(delta_table_path_1)
delta_table_df_2 = spark.read.format("delta").load(delta_table_path_2)
# Create temporary views
delta_table_df_1.createOrReplaceTempView("delta_table_1")
delta_table_df_2.createOrReplaceTempView("delta_table_2")
# Execute a SQL query to join the tables
# Assumes Customer data for demo purposes, but could be any query
joined_df = spark.sql("""
SELECT
t1.CustomerKey,
t1.FirstName,
t1.LastName,
t2.OrderKey,
t2.OrderDate
FROM
delta_table_1 t1
JOIN
delta_table_2 t2
ON
t1.CustomerKey = t2.CustomerKey
""")
# Show the result
joined_df.show()
Hope this helps! If it does, be sure to accept an answer so others can learn from this conversation.
Hi @SamKum
I wanted to check if you had the opportunity to review the information. Please feel free to contact us if you have any further questions. If my response has addressed your query, please accept it as a solution and give a 'Kudos' so other members can easily find it.
Thank you.
Hi @SamKum ,
Thank you for reaching out to the Microsoft community forum.
I wanted to check if you had the opportunity to review the information provided by @DataBard . Please feel free to contact us if you have any further questions. If the response has addressed your query, please accept it as a solution and give a 'Kudos' so other members can easily find it.
Thank you.
Hi @SamKum,
The code should be:
joined_df = sales_example.join(product_example, product_example.product_ID == sales_example.product_ID, "inner")
You should refer to the dataframe name rather than the table name.
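For context, here is a minimal end-to-end sketch of that dataframe join; the table paths and the product_ID column are assumptions based on this thread, not your actual schema:
# Load each Delta table into a dataframe first (relative paths assume a default lakehouse)
sales_example = spark.read.format("delta").load("Tables/sales")
product_example = spark.read.format("delta").load("Tables/products")
# Join the dataframes (not the table names) on the assumed product_ID column
joined_df = sales_example.join(product_example, product_example.product_ID == sales_example.product_ID, "inner")
display(joined_df)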
Hi @SamKum ,
Adding a default lakehouse makes it easier to use the UI to create dataframes, and it also lets you refer to tables by their relative path instead of having to reference the absolute path. Either way, you typically have to refer to the path along with the table itself (or lakehouse.schema.table, as in line 1 of your code).
In your case, instead of referencing 'products' directly, either use the dataframe you are creating in the example above, or use a Spark function to reference the Delta table like this:
# Read the Delta table into a dataframe (replace the path with your table's location)
delta_table_1 = spark.read.format("delta").load("/path/to/delta_table_1")
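As a follow-up sketch, you would then join that dataframe rather than the table name; the sales dataframe, its path, and the product_ID column below are illustrative assumptions:
# Illustrative only: load a sales dataframe and join it to the loaded Delta table
sales_df = spark.read.format("delta").load("/path/to/sales_table")
joined_df = sales_df.join(delta_table_1, on="product_ID", how="inner")
display(joined_df)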
Hopefully this helps! If it does, be sure to accept the answer so others in the community can learn from your question.
@DataBard, if creating a dataframe is the only way to go, is the path to the Delta table in your solution an absolute path? If yes, my attempt of creating a dataframe with a SELECT query is much simpler.
I saw a Udemy video where they demonstrated directly referring to a Delta table by name from the lakehouse attached to the notebook. Do you know if this was allowed in an earlier version, or was it never a feature?