ToddChitt
Super User

Save PySpark DataFrame to Fabric Warehouse via T-SQL-type MERGE operation

I am building out a solution that will need to process lots of JSON files. I initially set it up to save the JSON file to a known location and known file name in the Lakehouse, then use a Dataflow Gen2 to extract it into several tables in the Lakehouse, and finally some SQL stored procedures to MERGE that data into the Warehouse tables. But since I cannot parameterize the name and location of the source file for the Dataflow, this is a choke point, forcing me to process only one file at a time. Unacceptable. 

 

Next solution was to use a PySpark Notebook, whereby I CAN parameterize the JSON file location and name. But again, I am writing to the same Lakehouse staging tables (using an OVERWRITE operation) and calling the same Warehouse stored procedures. So now the Lakehouse tables become the choke point, as I want to avoid the possibility of a second JSON file getting processed while the first file is not yet finished with the MERGE into the Warehouse.
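For illustration, here is a minimal sketch of the kind of parameterized notebook cell described above; the parameter values, file path, and table name are hypothetical placeholders, and the built-in spark session of a Fabric notebook is assumed:

# Minimal sketch of the parameterized staging approach described above.
# All names below (json_path, staging_table) are hypothetical placeholders;
# the spark session provided automatically by a Fabric notebook is assumed.

# Parameters cell: values would be supplied per run (e.g. by a pipeline).
json_path = "Files/incoming/client_0001/payload.json"
staging_table = "staging_json"

# Read the parameterized JSON file into a DataFrame.
df = spark.read.option("multiline", "true").json(json_path)

# Overwrite the shared Lakehouse staging table. Because every run overwrites
# the same table, two files cannot safely be processed at the same time;
# this is the choke point described above.
df.write.mode("overwrite").format("delta").saveAsTable(staging_table)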

 

So now I am looking for a setup of a parameterized Notebook, with associated DataFrames, to write the data directly to the Warehouse tables. I suspect that some %%sql code in the Notebook would be required. I know that the T-SQL MERGE statement is not supported in Fabric Warehouse (yet), but I am good with SQL stored procedures and know how to write two independent UPDATE and INSERT statements.
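For reference, the two-statement substitute for MERGE that this refers to looks roughly like the sketch below. The schema, table, and column names are purely hypothetical, and the T-SQL is held in a Python string only for documentation; in practice this logic would live inside a Warehouse stored procedure.

# The classic UPDATE-then-INSERT substitute for MERGE, kept as a T-SQL string
# purely for reference. All object and column names are hypothetical; in
# practice this logic would live in a Warehouse stored procedure.
upsert_sql = """
-- Update rows that already exist in the target
UPDATE tgt
SET    tgt.ClientName = src.ClientName,
       tgt.Amount     = src.Amount
FROM   dbo.FactSales    AS tgt
JOIN   dbo.StagingSales AS src
       ON src.ClientId = tgt.ClientId
      AND src.SaleId   = tgt.SaleId;

-- Insert rows that do not yet exist in the target
INSERT INTO dbo.FactSales (ClientId, SaleId, ClientName, Amount)
SELECT src.ClientId, src.SaleId, src.ClientName, src.Amount
FROM   dbo.StagingSales AS src
WHERE  NOT EXISTS (
    SELECT 1
    FROM   dbo.FactSales AS tgt
    WHERE  tgt.ClientId = src.ClientId
      AND  tgt.SaleId   = src.SaleId
);
"""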

When I search online for solutions on writing to a WAREHOUSE, I get samples of writing to a LAKEHOUSE. Not the same.

 

Does anyone have any suggestions, samples, or other advice? I'm fairly new to PySpark so learning a lot quickly.

Thanks in advance.




8 REPLIES
Anonymous
Not applicable

Hi @ToddChitt ,

Thanks for using Fabric Community.
Unfortunately, you cannot write data directly to a Warehouse table from a Fabric notebook.

But you can write the data to a Lakehouse and then reference it from the Warehouse using a three-part name.

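As a rough sketch of that pattern (the Lakehouse, table, and column names are hypothetical, and the built-in spark session of a Fabric notebook is assumed): write the table from Spark, then reference it from the Warehouse with a three-part name.

# Sketch of the "write to the Lakehouse, then reference it from the Warehouse
# via a three-part name" pattern. All names are hypothetical placeholders.
df = spark.createDataFrame([(1, 10.0)], ["ClientId", "Amount"])

# 1) In the notebook: land the data as a Lakehouse (Delta) table.
df.write.mode("overwrite").format("delta").saveAsTable("staging_sales")

# 2) In the Warehouse (T-SQL, e.g. inside a stored procedure), the Lakehouse
#    table can then be referenced through its SQL analytics endpoint with a
#    three-part name, for example:
#
#    INSERT INTO dbo.FactSales (ClientId, Amount)
#    SELECT ClientId, Amount
#    FROM   MyLakehouse.dbo.staging_sales;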


Here is the session where Pawel showed this in our community session; skip to 1:02:05:
https://www.youtube.com/watch?v=u7peyrNhYgw

Helpful docs to refer to:
Load data to MS Fabric Warehouse from notebook - Stack Overflow


Updating Fabric warehouse tables through a notebook - Stack Overflow



I hope this gives you some ideas for your use case.

Anonymous
Not applicable

Hi @ToddChitt ,

We have good news for you:

[screenshot]

After the release:

[screenshot]

Click Here

Docs to refer to:
What's new and planned for Synapse Data Engineering in Microsoft Fabric - Microsoft Fabric | Microso...
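If the roadmap item shown in the screenshots is the Fabric Spark connector gaining the ability to write to Warehouse tables (an assumption here, since the screenshots are not preserved), usage after the release would look roughly like the sketch below. The warehouse, schema, and table names are hypothetical, and the exact method names and supported write modes should be verified against the current connector documentation.

# Hedged sketch of writing a DataFrame to a Fabric Warehouse table with the
# Spark connector for Fabric Data Warehouse. Names are hypothetical; verify
# the API against the current documentation before relying on it.
import com.microsoft.spark.fabric  # registers the synapsesql reader/writer

df = spark.createDataFrame([(1, "Contoso")], ["ClientId", "ClientName"])

# Read an existing Warehouse table into a DataFrame.
existing = spark.read.synapsesql("MyWarehouse.dbo.DimClient")

# Write the DataFrame to a Warehouse table (append mode shown here).
df.write.mode("append").synapsesql("MyWarehouse.dbo.DimClient")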

Hope this is helpful. Please let me know in case of further queries.

Anonymous
Not applicable

Hello @ToddChitt ,

We haven't heard from you since the last response and wanted to check back to see if your query was answered. Otherwise, we will respond with more details and try to help.

Anonymous
Not applicable

Hi @ToddChitt ,

We haven't heard from you since the last response and wanted to check back to see if your query was answered. Otherwise, we will respond with more details and try to help.

What are you saying? Quote: 'But you can write the data to a Lakehouse and then reference it from the Warehouse using a three-part name ...'
Are you saying that it is possible to write to a LH table from the LH SQL endpoint? I thought that CRUD operations on a LH were only possible from a Notebook using either %%sql or PySpark. Yes? No?

ToddChitt
Super User

Hello @Element115 and thanks for the quick response.

Not sure I understand what you are suggesting. I wish to AVOID the Lakehouse staging tables, as they get truncated and reloaded for each JSON file to be processed. That means the files must be processed sequentially. I want to be able to run them in parallel, like dozens of files at the same time (and each file will have its own Client Id as an additional data element, so I can't simply take all the files in a folder and treat them as one).

 

I was hoping for a way to connect directly to the Warehouse, but I don't see that option for a Notebook.









If the problem is to extract data from JSON files, and you are proficient in SQL and stored procedures, then why not do it inside your DB (SQL Server has all the functionality for that) and create the tables or views from that there? Then use a pipeline Copy data activity to get the data from those tables or views into a Fabric WH.

 

If it's a matter of figuring out where it is best to park the JSON files (not in a LH), then I guess on some filesystem or even a DB. I did persist a bunch of JSON from a web REST API once; SQL Server has all the functionality needed for that.

 

Let's say you persist your JSON in SQL Server. Then it's a piece of cake to use a pipeline to access the DB, retrieve the JSON with a SQL query (or, in the above scenario, retrieve the data from tables or views), and continue processing with the pipeline. The pipeline lets you process stuff concurrently. I don't know if you can read from a filesystem from a pipeline, though.

 

Or perhaps you could use Databricks, also available within a pipeline. Here are the activities available in a pipeline:

 

[screenshots of the pipeline activity list]

 

Not sure why you would necessarily need a Notebook. When you say MERGE, do you mean something like APPEND? Because if so, you can use Copy data from an on-prem SQL Server in append mode. But you need to implement incremental refresh manually (I just finished doing that, in fact, for a test project).

 

Element115
Power Participant

Here's a suggestion. Since you are using stored procedures, I'd create all the necessary procedures for the warehouse, then orchestrate everything from within one or more pipelines with the Stored procedure activity and any other activities required. Pipelines are parameterizable. Also, with a Script activity you can query the SQL endpoint of your LH, get all the metadata you need, and then use that with your stored procedures and push it out to the WH.
