holgergubbels
Frequent Visitor

PySpark Update Warehouse

Hi community,

I am really struggling to implement a medallion architecture in Fabric - I always get stuck at some point on (what I thought were) the easiest things... It is really frustrating - and Copilot/ChatGPT/Claude keep giving wrong answers...

 

I have a landing zone in a Lakehouse, which I fill using a Pipeline/Copy activity. Now I want to load everything into bronze using PySpark. I decided to use a separate Warehouse to store all my metadata (log/run config/watermarks etc.), especially because Pipelines have easy access to SQL tables in a Warehouse.

 

Now I am inside a PySpark notebook. For example, for logging I want to insert a log entry and update that same entry after the job finishes. Watermarks have to be updated as well.

First I thought I would use stored procedures, but I could not find a way to call them. Same problem when it comes to UPDATE statements. I do not want to use the verbose JDBC connector, because I would need to supply driver, user, password etc.

 

Is there any way to update data in the Warehouse where my metadata is stored, or to call stored procedures there? Or is it probably a bad idea to do something like that? Am I missing something? It is really hard to get an architecture up and running when there is so little documentation...

Thanks

Holger

 


6 REPLIES
chetanhiwale
Resolver I

Hi @holgergubbels , 

We had the same issue during our implementation. These are the things we implemented:
1. Split the logs into two types: data logs/audits and Fabric item logs/audits.
Data logs/audits were written using JDBC.
Fabric item logs/audits were done using a stored proc in a pipeline.

2. For data logs/audits we used a JDBC connection. This was easy to use and was able to run our stored proc. We were fetching our secrets from Key Vault.

 

Honestly, I don't recommend writing logs/audits (or any other data) with the Synapse SQL connector, although it is best for reading data.

For your case, the best option will be:
1. Create your stored proc in the Warehouse.

2. Add the secrets for your Warehouse to Azure Key Vault.

3. Fetch those secrets at runtime in the notebook.

4. Use those credentials to call the stored proc via the pyodbc module (a sketch of this follows below).
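A minimal sketch of steps 3 and 4, assuming a service principal that has access to the Warehouse and whose client ID/secret are stored in Key Vault, and assuming pyodbc and the Microsoft ODBC driver are available in the Spark runtime. The Key Vault URL, secret names, server, database and stored procedure names are placeholders:

import pyodbc
from notebookutils import mssparkutils  # available in Fabric notebooks

# Fetch the service principal credentials from Key Vault at runtime (names are placeholders)
kv_url = "https://<your-key-vault>.vault.azure.net/"
client_id = mssparkutils.credentials.getSecret(kv_url, "warehouse-sp-client-id")
client_secret = mssparkutils.credentials.getSecret(kv_url, "warehouse-sp-client-secret")

# Connect to the Warehouse SQL endpoint with service principal authentication
conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<warehouse-sql-endpoint>.datawarehouse.fabric.microsoft.com;"
    "Database=<metadata_warehouse>;"
    "Authentication=ActiveDirectoryServicePrincipal;"
    f"UID={client_id};PWD={client_secret};Encrypt=yes;"
)

with pyodbc.connect(conn_str, autocommit=True) as conn:
    cursor = conn.cursor()
    # Example call: update a watermark via a stored procedure (names are illustrative)
    cursor.execute("EXEC meta.usp_update_watermark ?, ?", "SalesOrders", "2025-01-01")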

 

Hope this helps. Let me know if you have any doubts.

Hi @chetanhiwale ,

thanks for your answer! I decided to use another approach:

- Load to landing: pipeline using copy activity, logs/watermarks via SP in Warehouse

- Raw to bronze: Pipeline for logging and watermarks, calling a notebook for the load.

- Same with bronze to silver.

 

I don't think it is the best approach, because the Spark session is started and terminated for every job. But it is the best I could find...

 

Holger

v-pgoloju
Community Support

Hi @holgergubbels,

 

Thank you for reaching out to the Microsoft Fabric Forum Community, and special thanks to @apturlov, @tayloramy and @deborshi_nag for their prompt and helpful responses.

Just following up to see if the responses provided by community members were helpful in addressing the issue. If the issue still persists, feel free to reach out for any further clarification or assistance.

 

Best regards,
Prasanna Kumar

 

apturlov
Responsive Resident

Hi @holgergubbels, as @deborshi_nag and @tayloramy mentioned above, the confusion in this situation comes from a misunderstanding of the Fabric resources and how to use them. Here is a decision guide that may help you make the right choice between Warehouse and Lakehouse: Microsoft Fabric Decision Guide: Choose between Warehouse and Lakehouse - Microsoft Fabric | Microso....

My preference, when implementing a medallion architecture in Fabric, is to use a Lakehouse for the data landing, raw (bronze), and curated (silver) medallion tiers. I would use a Warehouse for the consumption (gold) tier only, because dimensional modelling is better implemented in a Warehouse and consumption is usually better aligned with using SQL. From this point of view, the Fabric tools map naturally: Fabric pipelines can orchestrate all batch operations, including but not limited to data ingestion, landing, validations and transformations. Complex data transformations are implemented with Notebooks, and data loading into a Warehouse is done using a Script activity or a Stored Procedure activity. The metadata could be stored in a Lakehouse (files or delta tables) or in a SQL database. I would advise against using a Warehouse as metadata storage.

 

If you found this helpful, consider giving some Kudos. If I answered your question or solved your problem, mark this post as the solution.

tayloramy
Super User

Hi @holgergubbels

 

The way that I would typically approach this is to just work with the lakehouses natively. I store my metadata in a Lakehouse as it makes it nice and easy to interact with in notebooks. 

 

Then when updating records, I will filter on records where completed (or another similar flag) is null and update the most recent record (which in my case corresponds with the current run) for whatever table/process is currently running. 
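A rough sketch of that pattern, assuming the metadata lives in a Delta table in the Lakehouse; the table and column names (run_log, process_name, completed, run_end) and the process name are illustrative:

from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Metadata table registered in the Lakehouse (name is illustrative)
run_log = DeltaTable.forName(spark, "run_log")

# Close out the open record for the process that is currently running
run_log.update(
    condition=(F.col("process_name") == "bronze_sales_load") & F.col("completed").isNull(),
    set={
        "completed": F.lit(True),
        "run_end": F.current_timestamp(),
    },
)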

 

 


If you found this helpful, consider giving some Kudos. If I answered your question or solved your problem, mark this post as the solution.
deborshi_nag
Impactful Individual

Hello @holgergubbels 

 

In Microsoft Fabric, Spark and SQL are two different engines. While they share the same storage (OneLake), Spark cannot directly execute an UPDATE statement or a stored procedure inside a Fabric Data Warehouse using standard Spark commands.

 

The recommended approach -

 

1. To Write/Append Metadata

You can use the synapsesql Spark connector to write/append metadata.

import com.microsoft.spark.fabric  # makes the synapsesql method available on the DataFrame reader/writer
df.write.mode("overwrite").synapsesql("<warehouse>.<schema>.<table>")
 
2. To Update or call Stored Procedures
To perform a specific UPDATE or run a stored procedure for watermarking, you have two main choices:
 
a) The Pipeline Activity
Use a Fabric Pipeline as the orchestrator:
i) Run your Spark Notebook.
ii) Add a Stored Procedure Activity or a Script Activity immediately after the Notebook in the pipeline.
iii) Pass parameters from the Notebook back to the Pipeline using mssparkutils.notebook.exit() (a sketch follows below).
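For step iii), a minimal sketch of handing values back to the pipeline; the payload keys are illustrative, and the pipeline would read them from the notebook activity's exit value and pass them into the Stored Procedure/Script activity as parameters:

import json
from notebookutils import mssparkutils

# Values produced by the load (illustrative)
result = {
    "rows_loaded": 12345,
    "new_watermark": "2025-12-01T00:00:00",
    "status": "succeeded",
}

# Returned as the notebook activity's exit value for the pipeline to consume
mssparkutils.notebook.exit(json.dumps(result))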
 
b) Python pyodbc
If you absolutely need to trigger a stored proc during the Spark job, you can use a Python library like pyodbc. Since Fabric Warehouses support Azure AD authentication, you can get a token via mssparkutils to avoid hardcoding credentials. 
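A hedged sketch of that option, assuming pyodbc and the Microsoft ODBC driver are present in the Spark runtime; the server, database and procedure names are placeholders, and the exact token audience to request is an assumption that may differ in your environment:

import struct
import pyodbc
from notebookutils import mssparkutils

# Get an Entra ID (Azure AD) token for the SQL endpoint; the audience value is an assumption
token = mssparkutils.credentials.getToken("https://database.windows.net/")

# pyodbc expects the token packed into a binary structure (SQL_COPT_SS_ACCESS_TOKEN = 1256)
token_bytes = token.encode("utf-16-le")
token_struct = struct.pack(f"<I{len(token_bytes)}s", len(token_bytes), token_bytes)

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<warehouse-sql-endpoint>.datawarehouse.fabric.microsoft.com;"
    "Database=<metadata_warehouse>;Encrypt=yes;",
    attrs_before={1256: token_struct},
)
cursor = conn.cursor()
# Example: update a watermark via a stored procedure (names are illustrative)
cursor.execute("EXEC meta.usp_update_watermark ?, ?", "SalesOrders", "2025-12-01")
conn.commit()
conn.close()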
 
Hope this helps - please appreciate by leaving a Kudos or accepting as a Solution!
 
