Medallion Architecture in Fabric

NicholasJackson · ‎05-24-2023

What would be the best practice for establishing the medallion architecture inside of Fabric?

My initial thoughts would be multiple workspaces inside of OneLake, but I am unsure if this is recommended or if others have thoughts.

cmaneu · ‎05-24-2023

Hello @NicholasJackson,
That is an excellent question! I'm not sure there is a definitive answer or recommendation for that yet. Before answering your question, I would like to highlight a few Fabric features, so we all have the relevant context:

OneLake is at the tenant level: As soon as you create the first Fabric artifact, a OneLake is "created" at the tenant level. While there is only one, there are still some considerations for designing your data estate (see the next two features).
OneLake can be multi-region: In Premium, you can have workspaces attached to different regions (if you're not in Premium, there is still a "pinned" region). You would design your data estate to limit data movement and querying between regions as much as you can.
OneLake can be multi-cloud: That means that some of your bronze data may be on another cloud, and you don't necessarily want to move all of it.
Data (Files/Tables) can be shared between workspaces: We're still in preview, so not all the access management features are here yet. For now, you can grant read access to a lakehouse.
Cross-database and cross-warehouse/lake queries: You can easily execute cross-database queries. Data warehouse data can be accessed from the Data Engineering/Data Science tools and, to some extent, data stored in the lakehouse can be accessed through SQL or Power BI.

Having said that, there are a lot of possibilities depending on your requirements and the size/footprint of your organization. For now, I would apply the following criteria to decide how to structure your data estate. This is my opinion, and I would love to see other answers to your question 🙂 :

Do you have data residency requirements that span multiple regions?
If you can have all your data in one single region, you can start with one lakehouse, at least for your bronze data.
Do you have data protection / internal data boundary requirements?
For example, in listed companies, some detailed financial data may only be accessed by a very short list of identified people. If so, your bronze data will need to span several lakehouses to create this boundary. If all your data engineers can have access to all the data, then you can have one lakehouse for the bronze (and possibly silver) data.
Do you have internal cost management requirements?
If you need to invoice your own departments according to the Fabric resources they consume, the easiest way today is to isolate data and jobs (pipelines, flows) in different workspaces.
How much autonomy do you want to give to teams, and at what level (bronze/silver/gold)?
Who is responsible for each layer, and how much autonomy does each team have at each of these levels?
Are you already using "Power BI workspaces"?
If so, you should probably integrate your current architecture into your design.

With these criteria, we can imagine that a "small" organization can have one lakehouse (in one workspace) to store both the bronze and silver layers, and one or more workspaces for the gold layer and the corresponding Power BI reports. For an organization with more scale, the layers below gold may be split across several lakehouses/workspaces.
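As a rough illustration, the criteria above can be written down as a small decision helper. This is only a sketch of the reasoning in this post, not official guidance: the `Requirements` class, the `suggest_layout` function, and the wording of the returned values are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical summary of the questions listed above."""
    multi_region_residency: bool      # data must stay in several regions
    internal_data_boundaries: bool    # some data restricted to a short list of people
    per_department_chargeback: bool   # Fabric costs invoiced per department


def suggest_layout(req: Requirements) -> dict:
    """Return a rough starting layout; a heuristic, not a recommendation."""
    # Residency or internal boundaries both push bronze/silver into several lakehouses.
    split_lower_layers = req.multi_region_residency or req.internal_data_boundaries
    return {
        "bronze_silver": "several lakehouses" if split_lower_layers else "one lakehouse",
        # Chargeback is easiest when data and jobs are isolated per workspace.
        "workspaces": "one per department" if req.per_department_chargeback else "shared",
        # Gold is assumed to live in business-specific workspaces in every case.
        "gold": "business-specific workspaces",
    }


# A "small" organization with no special constraints.
small_org = suggest_layout(Requirements(False, False, False))
print(small_org)
```

The autonomy and existing Power BI workspace questions are left out on purpose, since they are harder to reduce to a yes/no flag.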

The gold data would likely "always" be stored in business-specific workspaces (modulo some "core" aggregates that you might want to share). With such an organization, you might end up with a lot of workspaces. Don't forget about the new Domains feature (preview). It helps improve the discoverability of data for a specific business area or field within the organization (learn more in the docs).

Hope this first answer gives you more clarity!

HimanshuS-msft · ‎05-24-2023

Hi @NicholasJackson,

I am in total agreement with what @cmaneu called out above. Being a big fan of the medallion architecture, I think we can start by creating subfolders inside the lakehouse; maybe you can explore that option as well.

In the past, I have used different containers in a storage account to set the permissions correctly.
Since Azure Data Factory and Synapse are part of Fabric, we can use them to move data across folders.
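A minimal sketch of the subfolder idea, assuming one folder per layer under the lakehouse `Files` area. The helper name `layer_path`, the folder names, and the example workspace/lakehouse names are hypothetical; the `abfss://...onelake.dfs.fabric.microsoft.com` URI shape is an assumption to check against the OneLake documentation before relying on it.

```python
LAYERS = ("bronze", "silver", "gold")


def layer_path(workspace: str, lakehouse: str, layer: str, *parts: str) -> str:
    """Build a path to a layer subfolder inside a lakehouse's Files area."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer {layer!r}, expected one of {LAYERS}")
    # Assumed OneLake URI shape: workspace as the container, lakehouse as the item.
    root = f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/{lakehouse}.Lakehouse/Files"
    return "/".join([root, layer, *parts])


# Hypothetical workspace and lakehouse names.
print(layer_path("analytics_ws", "main_lh", "bronze", "sales", "2023"))
```

A pipeline or notebook could then read from the bronze path and write to the silver path without hard-coding the folder layout in several places.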
Let me know if you have any other questions/thoughts around this.

Thanks
Himanshu

Noeleke1301 · ‎06-21-2023

Lots of questions still on how to put theory into practice regarding this medallion structure.

- Would you use the medallion layer in the name (for example bronze_sales, silver_sales, gold_sales)?

- Would you load bronze tables as tables at all, or would you just leave them as files?

- How do you know which source system the data comes from? Do you incorporate that into the table/file name?
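To make the naming questions concrete, here is one possible convention sketched as a helper: layer prefix first, then an optional source system, then the entity. This is only an illustration of the `bronze_sales` style mentioned above, not a recommendation from the thread; the function name `table_name` and the `erp` source are hypothetical.

```python
import re
from typing import Optional


def table_name(layer: str, entity: str, source: Optional[str] = None) -> str:
    """Build a name such as bronze_sales or bronze_erp_sales."""
    parts = [layer, source, entity] if source else [layer, entity]
    # Lowercase each part and collapse anything that is not a-z or 0-9 into "_".
    cleaned = [re.sub(r"[^a-z0-9]+", "_", p.lower()).strip("_") for p in parts]
    return "_".join(cleaned)


print(table_name("bronze", "sales"))                      # bronze_sales
print(table_name("bronze", "Sales Orders", source="ERP")) # bronze_erp_sales_orders
```

Putting the source system in the name keeps lineage visible at a glance, at the cost of longer names; whether that trade-off is worth it remains an open question here.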

Noeleke1301 · ‎06-09-2023

That's a great idea. Is it possible to manage access to those files? So engineer A has access to bronze+silver and engineer B has access to bronze+gold for example. Or would I need to create separate lakehouses to achieve that?

MawashiKid2 · ‎06-06-2023

Makes sense. That's indeed what I would have had in mind first. TBH I never defined ADLS Gen2 + the default 'root' container (file system) through Azure Synapse Studio, but on a previously created, separate Azure storage account. Then I could easily create hierarchical directories for consistency, which would be reflected and accessed under the Data Hub > Files tab. Bronze was mainly 'raw' Parquet data, while Silver was mostly converted to Delta format (log, partitions). Microsoft Fabric OneLake is a new ball game... well, sort of... The prevailing question is probably what basically remains (and may still fit) vs. what has changed and may no longer be applicable in the Fabric context.🫡

NicholasJackson · ‎05-24-2023

This is an incredible response, thank you so much!

I work mostly in the SMB area, so keeping it all in one lakehouse sounds like a good approach, at least for now.

Thanks again!