Hi,
I currently have a stored procedure as my ETL process, and it inserts data into two tables in SQL Server (which I work with through SSMS).
It currently uses synonyms, spanning multiple servers, tables, and joins.
My question is: if I were to translate this over to a dataflow in Fabric, would you approach it the same way, using synonyms and joins, or would you connect directly to each server and source separately?
I'm a bit confused about this, as I need to do the transformations across multiple tables on multiple servers while they are joined together.
Thanks
Liam
I'm not entirely sure what synonyms in SSMS / SQL are.
Are the server and table names dynamic? If yes, could you explain a bit more about how they are dynamic?
Or are they static (do not change)?
Joins can be done in Power Query M, and they can (in some circumstances) fold back to the source. The easiest way to check is to connect to each source table in Power Query M, join the tables, and then see whether query folding is taking place (either by looking for the query folding indicator in Power Query, or by checking the query logs in your source system).
Are you planning to use purely Power Query M in the Dataflow Gen2, or are you planning to write a native SQL query as the first step in Power Query?
Query folding on native queries - Power Query | Microsoft Learn
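For illustration only, here is a minimal sketch of what a native SQL query as the first step might look like (server, synonym, and column names are hypothetical, and it assumes the synonyms exist on the SQL Server instance the dataflow connects to):

```sql
-- Hypothetical native query used as the first Power Query step in a Dataflow Gen2.
-- dbo.syn_Orders and dbo.syn_Customers stand in for synonyms that already exist
-- on the SQL Server instance you connect to.
SELECT  o.OrderId,
        o.OrderDate,
        o.Amount,
        c.CustomerName
FROM    dbo.syn_Orders AS o
JOIN    dbo.syn_Customers AS c
        ON c.CustomerId = o.CustomerId
WHERE   o.OrderDate >= DATEADD(day, -30, CAST(GETDATE() AS date));
```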
Hi @frithjof_v
Thank you so much for getting back to me.
So a synonym is essentially just an alias for a database object. The ones I deal with change a lot, e.g. pointing to test and dev environments with the same fields.
If, for example, you were joining to the destination table that you are inserting into (a snapshot table) to check whether an ID already exists, could this be done in Power Query?
I'm open to using both, but as our current method is quite old-fashioned (bulk stored procedure inserts for our ETL), I didn't know whether it would be better to change the whole thing to Dataflow Gen2, or to keep the existing SSMS code and just drop it into Power Query.
I thought the latter defeated the object of migrating our data warehouse, but I'm not sure.
Thanks
Liam
In Power Query M, you could join your new data with the current data in your destination table, in order to check if there are matches. (You can do inner joins, anti joins, etc.)
You could also find the max ID from your destination table, and filter your new data to only process rows which have a higher ID, as an example.
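If you instead push that logic into a native SQL query or a Warehouse stored procedure, the same two ideas might look roughly like this in T-SQL (a sketch only; table and column names are hypothetical):

```sql
-- Sketch 1: anti-join - keep only rows whose OrderId is not already in the destination.
SELECT  s.OrderId, s.OrderDate, s.Amount
FROM    staging.Orders AS s
WHERE   NOT EXISTS (SELECT 1
                    FROM   dbo.OrdersSnapshot AS d
                    WHERE  d.OrderId = s.OrderId);

-- Sketch 2: high-watermark filter - keep only rows above the current max OrderId.
-- ISNULL handles the first run, when the destination table is still empty.
SELECT  s.OrderId, s.OrderDate, s.Amount
FROM    staging.Orders AS s
WHERE   s.OrderId > ISNULL((SELECT MAX(OrderId) FROM dbo.OrdersSnapshot), 0);
```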
But it made me think...
Are your current stored procedures doing an upsert of the data in your destination table?
Dataflows Gen2 only supports full overwrite ("flush and fill") or append as the write method. So you either overwrite everything, or you append new rows, but you cannot upsert.
The upsert operation would need to be done in another tool, like Notebook (Fabric Lakehouse) or Stored procedure (Fabric Warehouse).
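As a rough sketch of what that upsert could look like in a Fabric Warehouse stored procedure (written as UPDATE + INSERT rather than MERGE, in case MERGE isn't available in your environment; table and column names are made up for illustration):

```sql
-- 1) Update rows that already exist in the destination.
UPDATE  d
SET     d.OrderDate = s.OrderDate,
        d.Amount    = s.Amount
FROM    dbo.OrdersSnapshot AS d
JOIN    staging.Orders     AS s
        ON s.OrderId = d.OrderId;

-- 2) Insert rows that do not exist in the destination yet.
INSERT INTO dbo.OrdersSnapshot (OrderId, OrderDate, Amount)
SELECT  s.OrderId, s.OrderDate, s.Amount
FROM    staging.Orders AS s
WHERE   NOT EXISTS (SELECT 1
                    FROM   dbo.OrdersSnapshot AS d
                    WHERE  d.OrderId = s.OrderId);
```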
What is your data storage?
Is it in the cloud or on-prem?
Are you planning to use a Fabric Warehouse or a Fabric Lakehouse?
Also note that Dataflows Gen2 only supports a specific list of destinations.
Do you have a sketch or a diagram which outlines what you need to achieve? (from source via ETL to destination)
Thanks @frithjof_v
That's good to know re the existing data.
Yes, there are a few that do inserts and updates at the same time. We currently have an on-prem SQL Server, with data coming out of SSMS.
I guess I'm not sure yet on a Warehouse or a Fabric Lakehouse. Do you recommend a way to go in the scenario of, say, 10 stored procedures that upsert via SQL Agent twice a day, translated to Fabric?
I'm open to changing a lot of the process to fit in with Fabric, but I'm not sure what the best way to go is. Would it be Gen2 > Pipeline > Lakehouse/Warehouse? Is a stored procedure in the Warehouse just the same as what I'm doing now?
Apologies if any of this is a silly question - Fabric is very new to us.
Thanks again for your help
Liam
To be honest, Fabric is new to me as well. I have more of a Power BI background, so I know Power Query quite well, but I don't have much real experience with SQL databases, stored procedures, data warehouse administration, etc.
I have tested a lot of features in Fabric, but I haven't used them in production (I have only used Power BI in production).
If I understand correctly, you want to build an analytical data store in Fabric (either as a Fabric Lakehouse or Fabric Warehouse).
I guess you have some operational systems which are your ultimate data sources. I think a common pattern is
Data source -> Pipeline (with or without a custom SQL query) -> Staging files or tables (Bronze Lakehouse) -> Notebook (ETL, upsert, etc.) -> Tables (Silver Lakehouse) -> Notebook (ETL, upsert, etc.) -> Tables (Gold Lakehouse)
or
Data source -> Pipeline (with or without a custom SQL query) -> Staging tables (Warehouse) -> Stored procedure (ETL, upsert, etc.) -> Gold Tables (Warehouse)
I think Lakehouse (Bronze) -> Lakehouse (Silver) -> Warehouse (Gold) is also a normal pattern. You will probably find a lot of information if you Google "Medallion architecture in Fabric".
But I think this is up to your organization to decide if you want to go all in on Lakehouse, Warehouse or a mix.
Fabric Lakehouse uses Spark (you can choose between, and mix, PySpark, Spark SQL, Scala, and SparkR).
Fabric Warehouse uses T-SQL (with some limitations: because the underlying storage format is Delta Lake, not all T-SQL commands are available). I suppose you can Google which workarounds people are using to adapt to those limitations.
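As one small illustration (hypothetical names, and worth checking against the current limitations), a simple staging-to-gold step in Warehouse T-SQL could use CREATE TABLE AS SELECT:

```sql
-- Hypothetical staging-to-gold step inside a Fabric Warehouse stored procedure.
-- CTAS rebuilds the gold table from the staging table on each run.
DROP TABLE IF EXISTS gold.DimCustomer;

CREATE TABLE gold.DimCustomer
AS
SELECT  CustomerId,
        CustomerName,
        Country
FROM    staging.Customer;
```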
A general piece of advice is to use the tools that work well with your current skill set (also bearing in mind which direction you want to go).
Using the Fabric Trial, you can test Fabric and its tools cost-free for a period.
So I don't think using Dataflows Gen2 is a must. My impression is that it is quite heavy on resource consumption, and that it is mainly a tool for low-code, UI-based ETL (which I like, coming from a Power BI background).
Hopefully someone with more practical experience than me can guide you further.
I think this is a great blog (serverlesssql.com) by @AndyDDC.
He can probably correct any mistakes I made in the text above here 😁
I also like this blog: Fabric: Lakehouse or Data Warehouse? - Sam Debruyn
Although I don't think it's necessary to use both Lakehouse and Data Warehouse. It's just an option.
Glad it was useful!
Here is an interesting feature, which aims to bring data pipeline-like performance to Dataflows Gen2: Fast copy in Dataflows Gen2 - Microsoft Fabric | Microsoft Learn
I haven't tested it.
In addition, I guess there are other, newer ways of getting data from a data source, like streaming, events, CDC, webhooks, etc.
As far as I know, you could use Eventhouse or Notebooks in Fabric for these kinds of approaches.
However I am on thin ice now... Just mentioning words I have heard about 😁
Thanks @frithjof_v
The part I sometimes struggle to translate, I think, is the transformation part.
Also, for example, if certain tables need to be snapshots and bring in data from a dynamic date period on every run.
Plenty of questions but it's all new 🙂
Could you expand a bit on this part:
"Also e.g if certain tables require them to be snapshots and bring in a data from a dynamic date period on every run"
To bring in data from a dynamic date period on every run, I think you can use variables (or parameters) in Data pipeline, and use a SQL query in Data pipeline which incorporates these variables so the date range will be dynamic.
(You could probably also use Notebook to achieve the same. In some cases also Dataflows Gen2.)
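For example, a minimal sketch of a source query for a pipeline copy activity (hypothetical names; in practice the @DaysBack value would come from a pipeline parameter or variable via dynamic content rather than being declared in the query):

```sql
-- Hypothetical dynamic date-range query for a pipeline copy activity.
DECLARE @DaysBack int = 7;  -- would normally be injected from a pipeline parameter

SELECT  OrderId,
        OrderDate,
        Amount
FROM    dbo.Orders
WHERE   OrderDate >= DATEADD(day, -@DaysBack, CAST(GETDATE() AS date));
```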
For creating snapshot tables:
Do you wish to add a snapshot timestamp column and append the snapshots inside a single table, or do you wish to create separate tables for each snapshot?
If you want, please provide some more details about the tables you want to create. E.g. are they dimension tables or fact tables?
You can also stage your data as files (Lakehouse) or tables (Lakehouse or Warehouse) if required.
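If you go with a single table and a snapshot timestamp column, the append could look roughly like this sketch (hypothetical names, with the date window again driven by a parameter):

```sql
-- Hypothetical daily snapshot append: each run stamps the rows it loads
-- and appends them to the same snapshot table.
DECLARE @DaysBack int = 7;  -- would normally come from a pipeline parameter

INSERT INTO dbo.OrdersSnapshot (SnapshotAt, OrderId, OrderDate, Amount)
SELECT  SYSUTCDATETIME() AS SnapshotAt,
        OrderId,
        OrderDate,
        Amount
FROM    staging.Orders
WHERE   OrderDate >= DATEADD(day, -@DaysBack, CAST(GETDATE() AS date));
```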
Apologies, I missed your last reply on this!
"To bring in data from a dynamic date period on every run, I think you can use variables (or parameters) in Data pipeline, and use a SQL query in Data pipeline which incorporates these variables so the date range will be dynamic." - This is really good to know.
So the snapshot would contain a timestamp column in a table, with a dynamic date range on every day's run, e.g. today's run would bring in the previous x days. They would be predominantly fact tables.
Thanks for all your help with this