This time we’re going bigger than ever. Fabric, Power BI, SQL, AI and more. We're covering it all. You won't want to miss it.
Learn moreDid you hear? There's a new SQL AI Developer certification (DP-800). Start preparing now and be one of the first to get certified. Register now
With the general availability of Microsoft Fabric this past Ignite, there are a lot of questions centered around the functionality of each component but more importantly, what architecture designs and solutions are best for analytics in Fabric. Specifically, how your data estate for analytics data warehousing/reporting will change or differ from existing designs and how to choose the right path moving forward. This article will be focused on helping you understand the differences between the Data Warehouse, Data Lakehouse, and a KQL Database, Fabric solution designs, warehouse/lakehouse/real-time analytics use cases, and to get the best of the Data Warehouse, Data Lakehouse, and Real-Time Analytics/KQL Database.
Topics covered in this article:
First, we will cover what Microsoft Fabric is from a high level to have a baseline understanding of the architecture/relevant components for this article as well as the relevant components for discussing the Data Lakehouse, Data Warehouse, and KQL Database in further detail.
Microsoft Fabric is an all-in-one analytics solution for enterprises that covers everything from data movement to data science, Real-Time Analytics, and business intelligence. It offers a comprehensive suite of services, including data lake, data engineering, and data integration, all in one place.
Diagram_of_the_software_as_a_service_foundation_beneath_the_different_experience
Microsoft Fabric is built on a SaaS foundation as it brings together new and existing components from Power BI, Azure Synapse, and Azure Data Factory into a single integrated environment. You can interact/build on these components through the experiences such as Data Engineering, Data Factory, Data Warehouse, Data Science, Real-Time Analytics, and Power BI. The advantages of having these components/experiences on one shared platform are:
Although the components of Fabric use ‘Synapse’ as a prefix for Data Engineering, Data Warehousing, Data Science, and Real-Time Analytics -> this is NOT the same as Azure Synapse Analytics. This can lead to confusion because of naming. The technology is built on top of the existing technology of Synapse but there are a lot of significant changes/improvements to the architecture/functionality that makes it a revolutionized version. Those experienced with Azure Synapse Analytics (Dedicated Pools, Serverless Pools, Spark Pools) will find the concepts of the Microsoft Fabric components familiar, but that the technology/functionality is significantly improved and optimized.
OneLake is the data lake that is the foundation on which all Fabric services are built. OneLake is built on top of ADLS (Azure Data Lake Storage) Gen2 and services as the tenant-wide store for data and can support any type of file, structured or unstructured. Every Fabric tenant automatically provisions OneLake, with no extra resources to set up or manage. All Fabric data items like data warehouses and lakehouses store their data automatically in OneLake in Delta Parquet format.
It is a single unified storage system for all developers (professional and citizen) where all compliance, security, and policies are enforced centrally and uniformly. OneLake eliminates the data silos across organizations and eliminating the need to physically copy or move data for different teams/engines to use whether the data is in OneLake or if it is stored in a shortcut compatible location. You can think of OneLake as the OneDrive for you for your organization’s data.
Diagram_showing_the_function_and_structure_of_OneLake
Microsoft Fabric Lakehouse is a data architecture platform for storing, managing, and analyzing structured and unstructured data in a single location, in open format. The default file format is delta parquet. You can store data in 2 physical locations that are provisioned automatically, files (unmanaged) or tables (managed delta tables).
The Lakehouse allows you to perform transformations via Spark computes or use SQL endpoints for analysis/exploration of your data. The Data Lakehouse’s default file format is delta parquet which is optimal for analytic workload performance.
Data Lakehouses are not a new concept or proprietary to Microsoft Fabric, as they are very common in analytic workload architectures using the recommended medallion architecture (more on this later). This same concept of a Data Lakehouse is present in Microsoft Fabric with some new functionality and seamless integration with the components inside of Microsoft Fabric and a few other services outside of Microsoft Fabric. Including the ability to provide one copy of your data and share amongst other teams with zero data duplication.
Microsoft Fabric Data Warehouse is a lake centric data warehouse built on an enterprise grade distributed processing engine. One of the major advantages of Fabric Data Warehouse compared to other data warehousing solutions is that there is no need to copy the data from a Data Warehouse for other compute engines or teams to consume since the warehouse uses OneLake as its storage. One single copy of the data stored in Microsoft OneLake in a delta parquet format. Because of this, you can have cross-database querying to leverage different sources (OneLake or other sources through shortcuts) of data seamlessly and with zero data duplication.
At its core, the Data Warehouse is a SQL MPP Engine (massively parallel processing engine) with Delta Tables and TDS endpoints, which will provide you with full T-SQL DDL and DML support (check availability of specific functions/features). The compute is a serverless infrastructure to allow for infinite scaling with dynamic resource allocation, instant scaling up or down with no physical provisioning involved, and physical compute resources are assigned within milliseconds to jobs.
Real-Time Analytics is a fully managed big data analytics platform optimized for streaming, and time-series data. It utilizes a query language and engine with exceptional performance for searching structured, semi-structured, and unstructured data. Real-Time Analytics is fully integrated with the entire suite of Fabric products, for both data loading, data transformation, and advanced visualization scenarios. Just like the rest of the Microsoft Fabric components, Real-Time Analytics is a SaaS experience.
With Real-Time Analytics in Microsoft Fabric, you enable your organization to focus and scale up their analytics solution while democratizing data for the needs of both the citizen data scientist all the way to the advanced data engineer. Real-time analytics have become essential in many scenarios in the enterprise world, such as cybersecurity, asset tracking and management, predictive maintenance, supply chain optimization, customer experience, energy management, inventory management, quality control, environmental monitoring, fleet management, and health and safety. This is achieved through reducing the complexity of data integration. You are able to gain quick access to data insights within a few seconds of provisioning, automatic data streaming, indexing, and partitioning for any data source or format, and on-demand query generation and visualizations. Real-Time Analytics lets you focus on your analytics solutions by scaling up seamlessly with the service as your data and query needs grow.
Read more here What makes Real-Time Analytics unique? To discover more about the advantages of Real-Time Analytics.
Real-Time Analytics uses a KQL database as a data store. We will cover the architecture/components of a Real-Time Analytics solution later in this article, but for now I will give a brief overview of the KQL database. KQL stands for Kusto Query Language, and it is a query language used to interact with data located in a KQL database. KQL database can be used as a general term to apply to both Fabric Real-Time Analytics and Azure Data Explorer since it is referring to a database/store that is interacted with using KQL.
At its core, Real-Time Analytics shares the same core engine with identical core capabilities as Azure Data Explorer, just with different management behavior. (Azure Data Explorer being a PaaS offering and Real-Time Analytics a SaaS offering). So, if you are familiar with Azure Data Explorer then that knowledge is translatable to Fabric Real-Time Analytics, with differences in management.
Azure Data Explorer provides unparalleled performance for ingesting and querying telemetry, logs, events, traces, and time series data. It features optimized storage formats, indexes, and uses advanced data statistics for efficient query planning and just-in-time compiled query execution. Azure Data Explorer offers data caching, text indexing, row store, column compression, distributed data query, separation of storage and compute.
See this documentation to view the differences in functionality/management between the Real-Time Analytics and Azure Data Explorer: Differences between Real-Time Analytics and Azure Data Explorer - Microsoft Fabric | Microsoft Learn
Now that we have a basic high-level understanding of what Microsoft Fabric is, and specifically OneLake, Data Lakehouse, Data Warehouse, and and Real-Time Analytics/KQL Database. It is time to break down the differences between the Data Lakehouse, Data Warehouse, and and Real-Time Analytics/KQL Database so we can better understand the use cases for each and help with designing our solutions.
First, we will focus on the comparison between the Data Lakehouse and Data Warehouse before comparing the and Real-Time Analytics/KQL Database.
Both the Data Lakehouse and Data Warehouse are open data format – delta parquet, lake centric approaches and have some cross functionality, but there are differences that I have broken down into categories.
There are 3 different endpoints between each the Lakehouse and Data Warehouse.
To interact with Lakehouse files/tables for analysis, transformations, or processing using Spark, you will connect to the endpoint for your Lakehouse that is separate from the SQL Analytics Endpoint. Just like standard methods outside of Fabric for interacting with files/delta tables, you will connect using either the URL, ABFS path, or mounting the Lakehouse directly in your explorer. Using Spark allows you to perform write operations with your choice of Scala, PySpark, Spark SQL, or R. However, if you wish to use T-SQL, you will need to use the SQL Analytics Endpoint where you can only perform read-only operations.
This endpoint provides a SQL-based experience for the Lakehouse delta tables. The SQL Analytics Endpoint offers the ability to interact, query, and serve data within the Lakehouse using SQL. This experience is read-only and applies only to delta tables. This would be the ‘Tables’ section of the Lakehouse, and the ‘Files’ section are not readable/discoverable through the SQL endpoint. The SQL Analytics Endpoint allows you to analyze the delta tables using T-SQL, save functions, generate views, and apply SQL object level security. It enables data engineers to build a relational layer on top of physical data in the Lakehouse and expose it to analysis and reporting tools using the SQL connection string.
A SQL Analytics Endpoint is automatically created when you create a Lakehouse, which points to the delta table storage. Each Lakehouse has only one SQL Endpoint, and each workspace can have more than one Lakehouse. This means that the number of SQL Endpoints in a workspace matches the number of Lakehouses.
It is important to note that the creation/modification of delta tables (and the data within the delta tables) must be completed using Apache Spark, Delta Tables that are created through Spark within the Lakehouse are automatically discoverable through the endpoint. And if there are external delta tables that are created with Spark code then they will not be visible to the SQL analytics endpoint until you make a shortcut to the external delta table for it to be visible.
You can set object-level security for accessing data using SQL analytics endpoint. These security rules will only apply for accessing data via SQL analytics endpoint. This means that if you want to make sure your data is not accessible in other ways (through different endpoints or directly) then you must set workspace roles and permissions.
You can connect to this endpoint outside of Fabric as well with tools such as SSMS or Azure Data Studio by supplying your authentication and endpoint connection string just like any other SQL server connection.
Behind the scenes, the SQL analytics endpoint is using the same engine as the Warehouse endpoint to serve high performance, low latency SQL queries. Which means that it is also a TDS endpoint, just with restrictions on the DML/DDL functionality and T-SQL surface limitations when compared with the Data Warehouse endpoint.
It is important to note that MSFT documentation will describe the SQL Analytics Endpoint for the Lakehouse as a Data Warehouse. Although this can be true depending on your context, I will not refer to it as a data warehouse to avoid confusion when discussing the Synapse Data Warehouse. I will explicitly refer to it by the SQL Analytics Endpoint or SQL querying on the Lakehouse.
Within my Fabric workspace, I have Lakehouse called ‘BronzeLakehouse’. Within that Lakehouse, I have 2 items (different item types) associated with my Lakehouse. The SQL Analytics Endpoint (red) and the Lakehouse endpoint (green)
Lakehouse_vs_Data_Warehouse_vs_Real-Time_Analytics_KQL_Database_Deep_Dive_into_U
Description automatically generated" class="wp-image-6489" />
What can I see using the ‘Lakehouse’ endpoint? Files (red) and Delta Tables (green). You can check which endpoint you are currently viewing in the top right corner via a drop-down selection (shown expanded)
Lakehouse_vs_Data_Warehouse_vs_Real-Time_Analytics_KQL_Database_Deep_Dive_into_U
Description automatically generated" class="wp-image-6490" />
What can I see using the SQL Analytics Endpoint of the Lakehouse? Delta Tables only.
Lakehouse_vs_Data_Warehouse_vs_Real-Time_Analytics_KQL_Database_Deep_Dive_into_U
Description automatically generated" class="wp-image-6491" />
The Synapse Data Warehouse or Warehouse Endpoint functions in a ‘traditional’ SQL data warehouse fashion. This means that this endpoint supports the full T-SQL capabilities like your enterprise data warehouse. Unlike the SQL Analytics Endpoint, the Data Warehouse Endpoint offers you:
Pairing the full read/write capabilities and the cross-database ingestion capabilities, you are able to ingest from multiple Data Warehouses or Lakehouses seamlessly. When ingesting data into the Data Warehouse, it will create a Delta Table that is stored in OneLake.
Here is a diagram that can help explain the difference between the SQL Analytics Endpoint and the Data Warehouse Endpoint.
Diagram_of_a_SQL_analytics_endpoint_and_a_Warehouse_in_Microsoft_Fabric
Example of a cross database query to load data into the Data Warehouse from the Lakehouse.
Warehouse table ‘holiday.Warehouse_Holiday_Clean’ is created and loaded with a CTAS statement with the ‘SilverLakehouse.dbo.Holiday_Clean’ delta table as the source.
Lakehouse_vs_Data_Warehouse_vs_Real-Time_Analytics_KQL_Database_Deep_Dive_into_U
Description automatically generated" class="wp-image-6493" />
The KQL Database is not a new concept or as complex as the endpoints for the Lakehouse or Data Warehouse. We will discuss the use cases and other factors that differentiate Real-Time Analytics/KQL database from the other offerings, but this section is for the endpoint differences.
There are no limitations on read/write capabilities depending on your language or endpoint as there are not different endpoints for a KQL database. You will be able to read/write using KQL, Spark, connector ecosystem (no code), or T-SQL. Depending on exactly what you are trying to do, there may be a better language or process that you will have to evaluate in a case-by-case basis.
The limitations of read/write capabilities would be applied to the account permissions. KQL databases offer the ability to apply ONLY row level security, which is account specific and not compute engine specific like the Data Warehouse and Lakehouse.
Below are some of the main benefits of using a KQL Database that I want to highlight:
There is also a preview feature called ‘One Logical Copy’. This provides you the ability to create a one logical copy of KQL Database data by enabling data availability in OneLake. Enabling data availability of your KQL database in OneLake means that you can query data with high performance and low latency in your KQL database, and query the same data in Delta Lake format via other Fabric engines such as Direct Lake mode in Power BI, Warehouse, Lakehouse, Notebooks, and more. One logical copy (Preview) - Microsoft Fabric | Microsoft Learn
Another difference between the Data Lakehouse, Data Warehouse, and KQL Database is the type of data that is stored and how it is organized.
Within the Data Lakehouse:
Within the Data Warehouse:
Within the KQL Database:
In many architecture and design sessions, the right service/pattern may not be the best for your team due to skillsets of the team members. For instance, transforming data in Spark may be the best design choice for performance, cost, etc. However, no one on the team has experience or knowledge with Spark or any language besides T-SQL. That will affect your design to consider the availability/uses of your developer’s skillset.
Within the Data Lakehouse:
Within the Data Warehouse:
Within the KQL Database:
Here is a chart that provides a comparison between the Data Warehouse, Lakehouse, and KQL Database. You can view the full documentation here: Fabric decision guide - choose a data store - Microsoft Fabric | Microsoft Learn
| Data warehouse | Lakehouse | KQL Database | |
| Data volume | Unlimited | Unlimited | Unlimited |
| Type of data | Structured | Unstructured,semi-structured,structured | Unstructured, semi-structured, structured |
| Primary developer persona | Data warehouse developer, SQL engineer | Data engineer, data scientist | Citizen Data scientist, Data engineer, Data scientist, SQL engineer |
| Primary developer skill set | SQL | Spark(Scala, PySpark, Spark SQL, R) | No code, KQL, SQL |
| Data organized by | Databases, schemas, and tables | Folders and files, databases, and tables | Databases, schemas, and tables |
| Read operations | Spark,T-SQL | Spark,T-SQL | KQL, T-SQL, Spark, Power BI |
| Write operations | T-SQL | Spark(Scala, PySpark, Spark SQL, R) | KQL, Spark, connector ecosystem |
| Multi-table transactions | Yes | No | Yes, for multi-table ingestion. See update policy. |
| Primary development interface | SQL scripts | Spark notebooks,Spark job definitions | KQL Queryset, KQL Database |
| Security | Object level (table, view, function, stored procedure, etc.), column level, row level, DDL/DML, dynamic data masking | Row level, table level (when using T-SQL), none for Spark | Row-level Security |
| Access data via shortcuts | Yes (indirectly through the lakehouse) | Yes | Yes |
| Can be a source for shortcuts | Yes (tables) | Yes (files and tables) | Yes |
| Query across items | Yes, query across lakehouse and warehouse tables | Yes, query across lakehouse and warehouse tables;query across lakehouses (including shortcuts using Spark) | Yes, query across KQL Databases, lakehouses, and warehouses with shortcuts |
| Advanced analytics | Time Series native elements, Full geospatial storing and query capabilities | ||
| Advanced formatting support | Full indexing for free text and semi-structured data like JSON | ||
| Ingestion latency | Queued ingestion, Streaming ingestion has a couple of seconds latency |
Now that there is a deeper understanding of the differences between the Data Warehouse, Data Lakehouse, and Real-Time Analytics/KQL Database, it is time to review some of the use cases to determine which option to use. There are a lot of specific use cases out there and it would be impossible to cover every possible scenario, so I am going to cover some general scenarios.
To determine whether you should use the Lakehouse alone, Data Warehouse alone, KQL Database alone, or mixing them together, will generally depend on these requirements for your solution.
When designing a solution, it is important to understand how the data you are working with will be consumed. A Lakehouse, Data Warehouse, and KQL Database have a lot of flexibility/overlapping ability for serving data to users. However, the specific needs/nuances can vary from solution to solution so here are some common examples of different consumption methods for the data:
Each database, Data Warehouse, Lakehouse, KQL, SQL Server, Cosmos DB, etc., are all optimized for different read/write sizes and workloads. So, understanding these optimizations is key to determining which solution is best based on the requirements.
For different applications and ETL/ELTs, there can be solution needs that require specific functionality of either a Lakehouse, a Data Warehouse, or Real-Time Analytics/KQL Database. Again, the nuances of every solution make it impossible to cover all scenarios, but I will cover 4 common requirements for applications or ETL/ELT workloads.
ACID stands for:
Why ACID is important?
What does ACID compliance look like in the Lakehouse and Data Warehouse?
Multi-table transactions are a way to group changes to multiple tables into a single transaction. This allows you to control the commit or rollback of read and write queries, and modify data that is stored in tables using transactions to group changes together.
Ex: You change details about a purchase order that affects three tables (dbo.ItemStock, dbo.PurchaseOrderHistory, dbo.OrderShippingInfo). You can group the transaction of cancelling an order so that the changes will apply to all 3 tables or none at all. So, when a user queries any of the tables, they will see the consistent change across all tables, and not contradicting/incorrect data.
Multi-table transactions are ONLY supported in the Data Warehouse.
The Lakehouse currently does not support this functionality.
KQL Databases do not support transactions, it's a distributed system supporting eventual consistency. A KQL Database can achieve an update behavior similar to multi-table transaction via the ‘update policy’ but this is not ACID compliant nor guaranteed order of execution/durability.
Dynamic data masking limits sensitive data exposure by masking it to nonprivileged users. It helps prevent unauthorized viewing of sensitive data by enabling administrators to specify how much sensitive data to reveal, with minimal effect on the application layer.
Lakehouse Data Masking:
Data Warehouse Data Masking:
KQL Database Data Masking:
The general explanation of what Real-Time Analytics is being able to process, view, analyze, and measure data as soon as it has been collected. This means that data must be available for consumption with extremely low latency, such as within a few seconds. The solution must be able to handle large amounts of data while providing that few second latency from collection. The size you want to think of is going to be in petabytes and higher with the data format widely varying. Some of the general use cases for Real-Time Analytics involve security audit logs, inventory data, supply chain information from assets on a production floor to the delivery trucks and streaming smart device information in any industry.
Data Warehouse:
Lakehouse:
KQL Database:
Each database, Data Warehouse, Lakehouse, KQL, SQL Server, Cosmos DB, etc., are all optimized for different read/write sizes and workloads. So, understanding these optimizations is key to determining which solution is best based on the requirements.
These are just 4 examples of common requirements within analytic solutions that will dictate which component to use, Lakehouse, Data Warehouse, or KQL Database.
For ACID compatibility, you will need to review the requirements of the solution/other factors since utilizing the ACID capability in either the Data Warehouse and Lakehouse; you will need to create a Delta Table (all tables in the Lakehouse and Data Warehouse are Delta Tables). This means there is likely a need to either convert a file into a Delta Table or load directly into one for both components. You will not be able to use a KQL Database. So, the choice will depend on the other factors for the use cases such as skillset, other requirements, and consumption.
For multi-table transactions, you will need to use the Data Warehouse over the lakehouse and KQL Database to perform these transactions. Understanding this can save you a lot of headaches and time when designing your solution/functionality for your users.
For dynamic data masking, you can use either the Data Warehouse or the Lakehouse and KQL Database to achieve this functionality. IF using a SQL endpoint for the Lakehouse. If you have need for files to have Dynamic Data Masking or other engines like Spark, then you can certainly try to design your Lakehouse to remove sensitive columns and expose different versions of the data to the business users, however that can quickly become unmanageable depending on the scale. The data warehouse offers the ability to have a one stop shop for business users and protects sensitive information via dynamic data masking without worrying about the use of files or other engines.
For Real-Time Analytics, you will want to use KQL Database. You may be able to use the Lakehouse depending on the exact requirements, but KQL Databases are optimized/designed for this exact scenario. As always, you can certainly try to design a solution in the Lakehouse, but the KQL database is just more performant for this scenario and has many optimizations out of the box. Data Warehouse is not a good choice for Real-Time Analytics solutions.
The last major category of the use case decisions is the skillset of your developers. This includes citizen developers, professional developers, data engineers, data scientists, etc. An important piece of information when I advise customers on architecture decisions is understanding the skillset of the individuals working with the data. The most optimal service/design for performance and cost may not be the best solution for the team since no one would know how to develop it.
High-level breakdown of the skillsets used and the types of developers for the Lakehouse, Data Warehouse, and KQL Database.
After reviewing some general use cases combined with the different requirements that you may have, it should be clearer when to use a Lakehouse vs a Data Warehouse vs a KQL Database. Often these decisions are complicated and require a lot of factors to be considered. By starting with analyzing how the data is going to be consumed/used, the requirements for your application or ETL/ELT, and the skillset of developers/users; your design/architecture decisions can be made quicker. And if not, then at the very least, you will find out what other questions to ask.
Understanding the functionality, feature support, use cases, and differences between the Data Lakehouse, Data Warehouse, and a KQL Database will assist you in building an architecture for different types of solutions. We will combine everything we have learned so far and apply it to architectures for different workloads/scenarios.
These examples are not intended to be reference architectures or recommendations of specific best practices across all scenarios, as we covered just some of the complexities that need to be considered.
The goal is to demonstrate architectures using the Lakehouse exclusively, Data Warehouse exclusively, Real-Time Analytics/KQL Database exclusively, the Lakehouse and Data Warehouse together, and Real-Time Analytics/KQL Database and a Lakehouse or Data Warehouse together to provide a better understanding of different design patterns based on your criteria.
The solution architecture/design will be broken into these categories:
A medallion architecture is not unique to Fabric and has been best practice/very common for lakehouse/data warehousing architectures in the cloud. And it continues to be best practice in Fabric. So it is important that we review the basics of a medallion architecture before diving into specific architectures using it. The medallion architecture is the recommended approach in Fabric.
The medallion architecture describes a series of data layers that denote the quality of data stored in the lakehouse. A medallion architecture comprises of three distinct layers or zones: bronze (raw), silver (validated), and gold (enriched). Each layer indicates the quality of data stored in the lakehouse, with higher levels representing higher quality.
The names of the zones, like ‘validated’ for the silver zone, can vary across sources and designs but concepts are the same.
Bronze
Raw zone. First layer that stores the source data in its original format.
Typically, there is little to no user access at this layer.
Silver
Enriched zone. Raw data that has been cleansed and standardized. The data is now structured with rows and columns, and can be integrated with other sources across the enterprise.
Transformations at the silver level should be focused on data quality and consistency, not on data modeling.
Gold
Curated zone. Final layer sourced from the silver layer. This data is refined, “business ready”, and meets analytical/business requirements.
This can be the Lakehouse or a Data Warehouse depending on needs. More on this design below.
You can have multiple gold layers for different users or domains with specific optimizations/designs. For example, you might have separate gold layers for your finance and sales which is utilizing a STAR schema and optimized for analytics/reporting. And you might also have a gold layer for your data scientists that is optimized for machine learning.
Lakehouse_vs_Data_Warehouse_vs_Real-Time_Analytics_KQL_Database_Deep_Dive_into_U
Description automatically generated with medium confidence" class="wp-image-6494" />
Here is an example of a Lakehouse architecture utilizing a medallion architecture. Below are notes on different sections of the diagram.
General decision points for using this architecture pattern:
Diagram_of_an_example_of_OneLake_medallion_architecture_that_shows_data_sources
*It is recommended that each Lakehouse be separated into its own Fabric Workspace. This allows better control and governance at each zone level compared to having all the Lakehouses in a single workspace.
Users will consume and analyze the data through 2 methods:
Users/teams can pull data into different reporting tools and even cross database query between Lakehouses and Data Warehouses for different reporting need.
This is also the method for exploration or ad hoc analysis. Whether it is business users or SQL analysts/developers, the SQL Analytics endpoint offers the ability to have a full, read-only, SQL experience all from the Gold Lakehouse.
Through the SQL Analytics Endpoint, privileged users can apply object level and row level security to protect sensitive data. You can create views and functions to customize and control your end user’s experience an access just like a traditional SQL environment that many users are used to.
Users/teams will consume data through Power BI via reports, datasets, dashboards, apps, etc. That are sourced from the Gold Lakehouse.
Direct lake mode is a new semantic model capability for parquet formatted files to load directly from the data lake without having to query a Lakehouse endpoint OR import/duplicate data within the Power BI model.
Here is an example of a Data Warehouse architecture. Below are notes on different sections of the diagram.
General decision points for using this architecture pattern:
Lakehouse_vs_Data_Warehouse_vs_Real-Time_Analytics_KQL_Database_Deep_Dive_into_U
Description automatically generated" class="wp-image-6496" />
Although not shown in the diagram, it is still recommended to have clear separation of data zones within your Data Warehouse design. Just like a medallion architecture but using the Data Warehouse instead of a Lakehouse. Data Warehouses often have the same concept as a medallion architecture, with some concept of a landing zone, staging zone, and production zone which provide the same benefit as the medallion architecture.
It is recommended to implement this concept into your Fabric Data Warehouse as well. To accomplish this, there are 2 options depending on the amount of administrative oversight tolerable.
Here is an example of a Data Warehouse and Lakehouse combined architecture. Below are notes on different sections of the diagram.
General decision points for using this architecture pattern:
This architecture diagram is the same as the Lakehouse medallion architecture diagram except with 2 differences.
Lakehouse_vs_Data_Warehouse_vs_Real-Time_Analytics_KQL_Database_Deep_Dive_into_U
Description automatically generated" class="wp-image-6497" />
Combining both the Lakehouse and Data Warehouse into a single architecture can provide you with the best of both worlds (depending on your requirements).
With this architecture, you allow the consumption layer (through Power BI or warehouse endpoint) to take full advantage of the Data Warehouse like DDL/DML support for users/developers at the gold layer and Data Warehouse endpoint for other reporting tools/processes – while not having to sacrifice performance or duplicate data just for a different consumption method. All the data transformations within the ETL/ELT are performed with Spark to achieve the best performance and flexibility of your code while leveraging the primary skillset of the developers and still allowing the best end user experience/functionality.
Here is an example of a Real-Time Analytics/KQL Database architecture. Below are notes on different sections of the diagram.
General decision points for using this architecture pattern:
Lakehouse_vs_Data_Warehouse_vs_Real-Time_Analytics_KQL_Database_Deep_Dive_into_U
KQL Database is the data store for Real-Time Analytics. All the benefits and use cases are above. A medallion architecture would not be used in the KQL Database, as this is meant for raw telemetry data and modeling/cleansing occurs either midflight or at the consumption layer. It is still recommended to have some sort of organization or naming convention in your KQL objects.
One important concept for designing your Real-Time Analytics solution is archiving. You will want to have an archiving strategy set in place to prevent the data from growing/incurring costs if not needed. Why would you want to archive data? Well, most likely, the data in your Real-Time Analytics solution is valuable for only a short period of time, and for more historical analysis you have different assets. Each solution has different requirements and business needs so the amount of data to be retained can vary but very rarely is there a true requirement to keep all the data in a Real-Time Analytics solution.
Within Real-Time Analytics, you can define data retention policies and export the data in different ways to various destinations.
The reason for including the Lakehouse and Data Warehouse is for that deeper analysis or further use of analytics that fall outside of the scope of the Real-Time Analytics reporting. The benefits of this are a solution that provides you with Real-Time Analytics and the historical retention of the data to be used in other analytical reporting.
One Logical Copy (Preview) allows you to interact with the data from the KQL Database in OneLake, which allows you to work with the data without physically copying it for different compute engines. Provides the ability of the lakehouse or Data Warehouse to use a shortcut to connect to the data in KQL without physically copying it to a lakehouse.
Now if you want to load that data into different files or warehouses for historical retention or other reasons, you would be physically moving/copying the data. This can be accomplished via Pipelines, Dataflows, Spark Notebooks, and even cross database query functionality between the Lakehouse, Datawarehouse, and KQL database when enabling the one logical copy in KQL (allows you to create a shortcut).
If you are using the KQL database as a source of data for your data warehouse or lakehouse, review the above architectures for guidelines when implementing these solutions.
Users/teams will consume data through Power BI via reports, datasets, dashboards, apps, etc. That are sourced from the gold warehouse or tables.
When you enable ‘One Logical Copy’ then you will be able to use Direct lake mode.
Direct Lake Mode is a new semantic model capability for parquet formatted files to load directly from the data lake without having to query a warehouse endpoint OR import/duplicate data within the Power BI model. Learn about Direct Lake in Power BI and Microsoft Fabric - Power BI | Microsoft Learn
Otherwise, you will have to use import mode or direct query mode.
The KQL Queryset is the item used to run queries, view, and customize query results on data from a KQL database.
Uses KQL for creating queries, views, functions, control commands, customized results and many SQL functions.
There are tabs in a KQL Queryset that allow you to connect/associate with different KQL Databases and switch them out to run the same query in different states.
You can collaborate with others and save queries.
You can also explore data using Notebooks via Spark/KQL.
This allows you the ability to ingest, analyze, use in Machine Learning, or any of your processes via Notebooks directly from the KQL Database.
When thinking about creating new solutions, or migrating existing ones, within Fabric, understanding the capabilities of the Fabric components and your solution requirements are critical. By examining different use cases for the Data Warehouse, Lakehouse, and Real-Time Analytics/KQL Database, examples of different architecture patterns, and deep diving into the capabilities of each the Data Warehouse, Lakehouse, and Real-Time Analytics/KQL Database, there should be a clearer architecture design path when combined with the knowledge of your user/workload requirements.
What is OneLake? - Microsoft Fabric | Microsoft Learn
What is a lakehouse? - Microsoft Fabric | Microsoft Learn
What is data warehousing in Microsoft Fabric? - Microsoft Fabric | Microsoft Learn
Fabric decision guide - choose a data store - Microsoft Fabric | Microsoft Learn
Tutorial: Ingest data into a lakehouse - Microsoft Fabric | Microsoft Learn
Fabric decision guide - copy activity, dataflow, or Spark - Microsoft Fabric | Microsoft Learn
OneLake shortcuts - Microsoft Fabric | Microsoft Learn
Implement medallion lakehouse architecture in Microsoft Fabric - Microsoft Fabric | Microsoft Learn
Overview of Real-Time Analytics - Microsoft Fabric | Microsoft Learn
Differences between Real-Time Analytics and Azure Data Explorer - Microsoft Fabric | Microsoft Learn
Fabric Change the Game: Real – time Analytics | Microsoft Fabric Blog | Microsoft Fabric
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.