Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Don't miss out! 2025 Microsoft Fabric Community Conference, March 31 - April 2, Las Vegas, Nevada. Use code MSCUST for a $150 discount. Prices go up February 11th. Register now.

Reply
lherbert501
Post Patron
Post Patron

ETL Data flow multiple servers

Hi,

 

I currently have a stored procedure as my etl process and it inserts data into two tables in SSMS.

 

It currently uses synonyms in SSMS using multiple servers,tables and joins.

 

My question is if I was to try to translate this over to a data flow in Fabric, would you look at it the same way using synomyms and joins, or connect directly to each server and source seperately?

 

I'm a bit confused on this as I need to make the transformations using multiple servers on multiple tables whilst joined together.

 

Thanks

 

Liam

 

 

1 ACCEPTED SOLUTION

To be honest, Fabric is new to me as well. And I have more of a Power BI background, so I know Power Query quite well, but I don't have so much real experience about SQL database, stored procedure, data warehouse admin, etc.

 

I have tested a lot of features in Fabric, but I haven't used them in production (I have only used Power BI in production).

 

If I understand correctly, you want to build an analytical data store in Fabric (either as a Fabric Lakehouse or Fabric Warehouse). 

 

I guess you have some operational systems which are your ultimate data sources. I think a common pattern is

 

Data source -> Pipeline (with or without a custom SQL query) -> Staging files or tables (Bronze Lakehouse) -> Notebook (ETL, upsert, etc.) -> Tables (Silver Lakehouse) -> Notebook (ETL, upsert, etc.) -> Tables (Gold Lakehouse)

 

or

 

Data source -> Pipeline (with or without a custom SQL query) -> Staging tables (Warehouse) -> Stored procedure (ETL, upsert, etc.) -> Gold Tables (Warehouse)

 

I think also Lakehouse (Bronze) -> Lakehouse (Silver) -> Warehouse (Gold) is a normal pattern. Here you will probably find a lot of information if you Google "Medallion architecture in Fabric".

But I think this is up to your organization to decide if you want to go all in on Lakehouse, Warehouse or a mix.

 

Fabric Lakehouse uses Spark language (you can choose between and mix PySpark, SparkSQL, Scala or SparkR).

 

Fabric Warehouse uses T-SQL language (but there are some limitations, because the underlying storage format is Delta Lake so not all T-SQL commands are available). I suppose you can Google to search for which workarounds people are using to adapt to the language limitations.

 

A general advice is to use the tools which work well with your current skillset (also bearing in mind which direction you want to go).

Using the Fabric Trial, you can test Fabric and its tools cost-free for a period.

 

So I don't think using Dataflows Gen2 is a must. Actually, my impression is that it is quite heavy on resource-consumption. My impression is that it is a tool for low-code, UI-based ETL (which I like, coming from a Power BI background).

 

Hopefully someone with more practical experience than me can guide you further.

 

I think this is a great blog (serverlesssql.com) by @AndyDDC.

He can probably correct any mistakes I made in the text above here 😁  

 

I also like this blog: Fabric: Lakehouse or Data Warehouse? - Sam Debruyn

Although I don't think it's necessary to use both Lakehouse and Data Warehouse. It's just an option.  

View solution in original post

10 REPLIES 10
frithjof_v
Super User
Super User

I'm not entirely sure what synonyms in SSMS / SQL are.

 

Are the server and table names dynamic? If yes, could you explain a bit more about how they are dynamic?

Or are they static (do not change)?

 

Joins can be done in Power Query M, and they can (in some circumstances) fold back to the source. Easiest way to check it, is by connecting to each source table in Power Query M, and then join the tables in Power Query M, and then check if query folding is taking place (either by looking for the query folding indicator in Power Query, or check the query logs in your source system).

 

Are you planning to use purely Power Query M in the Dataflow Gen2, or are you planning to write a native SQL query as the first step in Power Query?

 

Query folding on native queries - Power Query | Microsoft Learn

Hi @frithjof_v 

 

Thankyou so much for getting back to me.

 

So a Synonym is just essentially an alias for a database object. They can change a lot with what I deal with e.g. pointing it to test and dev environments with the same fields.

 

If e.g. you were joining to the destination table that you are inserting into (snapshot table) to check if the id exists or not, could this be done in power query?

 

I'm open to using both, but as we have quite an old fashioned current method of bulk stored procedures inserting for our ETL, I didn't know if it was better to change the whole thing to Dataflow Gen2, or still use the whole existing code in SSMS just into Power Query. 

 

I thought the latter defeated the object of migrating our data warehouse but I'm not sure.

 

Thanks

 

Liam

 

In Power Query M, you could join your new data with the current data in your destination table, in order to check if there are matches. (You can do inner joins, anti joins, etc.)

 

You could also find the max ID from your destination table, and filter your new data to only process rows which have a higher ID, as an example. 

 

But it made me think... 

Are your current stored procedures doing an upsert of the data in your destination table?

 

Dataflows Gen2 only support full overwrite ("flush and fill") or append method. So you either overwrite everything, or you can append new rows. But not upsert.

 

The upsert operation would need to be done in another tool, like Notebook (Fabric Lakehouse) or Stored procedure (Fabric Warehouse).

 

What is your data storage?

Is it in the cloud or on-prem?

Are you planning to use a Fabric Warehouse or a Fabric Lakehouse?

 

Dataflows Gen2 supports the following destinations:

frithjof_v_0-1722263835957.png

 

 

Do you have a sketch or a diagram which outlines what you need to achieve? (from source via ETL to destination)

Thanks @frithjof_v 

 

That's good to know re the existing data.

 

Yes there is a few that do insert & updates at the same time. We currently have an on prem Sql Server, with data coming out of SSMS.

 

I guess I wasn't sure yet on Warehouse or a Fabric Lakehouse. Do you recomend a way to go with this in the scenario of e.g. 10 Stored procedures that upsert via sql agent twice a day, translating to Fabric?

 

I'm open to changing a lot of the process to fit in with Fabric but I'm not sure what the best way to go is. Would it be Gen2 > Pipeline > Lakehouse/Warehouse? Is Stored procedure in Warehouse just the same as I'm doing now?

 

Apologies if any of this is a silly question - Fabric is very new to us. 

 

Thanks again for your help

 

Liam

 

 

 

 

 

 

To be honest, Fabric is new to me as well. And I have more of a Power BI background, so I know Power Query quite well, but I don't have so much real experience about SQL database, stored procedure, data warehouse admin, etc.

 

I have tested a lot of features in Fabric, but I haven't used them in production (I have only used Power BI in production).

 

If I understand correctly, you want to build an analytical data store in Fabric (either as a Fabric Lakehouse or Fabric Warehouse). 

 

I guess you have some operational systems which are your ultimate data sources. I think a common pattern is

 

Data source -> Pipeline (with or without a custom SQL query) -> Staging files or tables (Bronze Lakehouse) -> Notebook (ETL, upsert, etc.) -> Tables (Silver Lakehouse) -> Notebook (ETL, upsert, etc.) -> Tables (Gold Lakehouse)

 

or

 

Data source -> Pipeline (with or without a custom SQL query) -> Staging tables (Warehouse) -> Stored procedure (ETL, upsert, etc.) -> Gold Tables (Warehouse)

 

I think also Lakehouse (Bronze) -> Lakehouse (Silver) -> Warehouse (Gold) is a normal pattern. Here you will probably find a lot of information if you Google "Medallion architecture in Fabric".

But I think this is up to your organization to decide if you want to go all in on Lakehouse, Warehouse or a mix.

 

Fabric Lakehouse uses Spark language (you can choose between and mix PySpark, SparkSQL, Scala or SparkR).

 

Fabric Warehouse uses T-SQL language (but there are some limitations, because the underlying storage format is Delta Lake so not all T-SQL commands are available). I suppose you can Google to search for which workarounds people are using to adapt to the language limitations.

 

A general advice is to use the tools which work well with your current skillset (also bearing in mind which direction you want to go).

Using the Fabric Trial, you can test Fabric and its tools cost-free for a period.

 

So I don't think using Dataflows Gen2 is a must. Actually, my impression is that it is quite heavy on resource-consumption. My impression is that it is a tool for low-code, UI-based ETL (which I like, coming from a Power BI background).

 

Hopefully someone with more practical experience than me can guide you further.

 

I think this is a great blog (serverlesssql.com) by @AndyDDC.

He can probably correct any mistakes I made in the text above here 😁  

 

I also like this blog: Fabric: Lakehouse or Data Warehouse? - Sam Debruyn

Although I don't think it's necessary to use both Lakehouse and Data Warehouse. It's just an option.  

Thanks again @frithjof_v  for your help.

 

That's really useful 🙂

Glad it was useful!

 

Here is an interesting feature, which aims to bring data pipeline-like performance to Dataflows Gen2: Fast copy in Dataflows Gen2 - Microsoft Fabric | Microsoft Learn

I haven't tested it.

 

In addition, I guess there are other and newer ways of getting data from a data source, like streaming, events, CDC, webhooks, etc.

 

As far as I know, you could use Eventhouse or Notebooks in Fabric for these kind of approaches.

 

However I am on thin ice now... Just mentioning words I have heard about 😁

Thanks @frithjof_v 

 

The part I sometimes struggle to translate I think is the transformation part.

 

Also e.g if certain tables require them to be snapshots and bring in a data from a dynamic date period on every run. 

 

Plenty of questions but it's all new 🙂

Could you expand a bit on this part:

 

"Also e.g if certain tables require them to be snapshots and bring in a data from a dynamic date period on every run"

 

To bring in data from a dynamic date period on every run, I think you can use variables (or parameters) in Data pipeline, and use a SQL query in Data pipeline which incorporates these variables so the date range will be dynamic.

(You could probably also use Notebook to achieve the same. In some cases also Dataflows Gen2.)

 

For creating snapshot tables:

Do you wish to add a snapshot timestamp column and append the snapshots inside a single table, or do you wish to create separate tables for each snapshot?

 

 

If you want, please provide some more details about the tables you want to create. E.g. are they dimension tables or fact tables?

 

You can also stage your data as files (Lakehouse) or tables (Lakehouse or Warehouse) if required.

@frithjof_v 

Apologies, I missed your last reply on this!

 

To bring in data from a dynamic date period on every run, I think you can use variables (or parameters) in Data pipeline, and use a SQL query in Data pipeline which incorporates these variables so the date range will be dynamic. - This is really good to know

 

So the snapshot would contain a timestamp in a table, with a dynamic date running on every days run. E.g today will run previous x days. They would be predominantly fact tables. 

 

Thanks for all your help with this

 

Helpful resources

Announcements
Las Vegas 2025

Join us at the Microsoft Fabric Community Conference

March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount!

Jan NL Carousel

Fabric Community Update - January 2025

Find out what's new and trending in the Fabric community.

Top Solution Authors