For nearly three years, Microsoft’s internal Azure Data team has been developing data engineering solutions using Microsoft Fabric. Throughout this journey, we’ve refined our Continuous Integration and Continuous Deployment (CI/CD) approach by experimenting with various branching models, workspace structures, and parameterization techniques. This article walks you through why we chose our strategy and how to implement it in a way that scales.
Key points you’ll learn:
Our data engineering workflows revolve around the Notebook/Lakehouse paradigm within Microsoft Fabric, including items such as:
Initially, we faced several common issues while deploying notebooks and pipelines:
Through workspace segmentation, branching discipline, and parameterizing environment references, we significantly improved development speed, reduced error rates, and made production deployments more predictable.
We maintain six core workspace categories, each typically existing in two (or more) deployment environments: Pre-Production (PPE) and Production (PROD). Workspace isolation enables cleaner workspaces and allows us to concentrate on the most critical tasks. Additionally, the isolation enables necessary deployment patterns like deploying a Notebook that creates a new Lakehouse table, prior to deploying a semantic model. The specific workspace categories you decide to use should align to your workflow, and to your deployment patterns.
Each category maps to a distinct workspace in each deployment environment. For instance, six categories across two environments translate to 12 workspaces. That is a lot to manage for a single project, so we enforce strict naming conventions to streamline navigation and assign a distinct color-coded icon to each workspace.
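As a rough illustration of how a strict naming convention keeps the category-by-environment grid manageable, the sketch below generates one workspace name per pair. The category names, the project prefix, and the `<project>-<category>-<environment>` pattern are hypothetical placeholders, not our actual conventions.

```python
from itertools import product

# Hypothetical placeholders: substitute the six categories that fit your workflow.
CATEGORIES = [f"Category{i}" for i in range(1, 7)]
ENVIRONMENTS = ["PPE", "PROD"]


def workspace_name(project: str, category: str, environment: str) -> str:
    """Build a name using an assumed '<project>-<category>-<environment>' pattern."""
    return f"{project}-{category}-{environment}"


def all_workspace_names(project: str) -> list[str]:
    """Return one workspace name per (category, environment) pair."""
    return [
        workspace_name(project, category, environment)
        for category, environment in product(CATEGORIES, ENVIRONMENTS)
    ]


names = all_workspace_names("Contoso")
print(len(names))  # 6 categories x 2 environments = 12
```

Generating names from one function, rather than typing them by hand, is one way to keep the convention consistent as categories or environments are added.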
We maintain one Git repository for the core code base, with directories for our YAML deployment pipelines, deployment scripts, and a workspace directory containing one subdirectory per workspace category. This structure lets us add workspaces as needed without the overhead of planning new repositories or branching strategies.
Unlike conventional repositories where main is the default branch, we opted to use ppe as the primary branch. This ensures that in-flight work doesn’t accidentally point to production resources. It encourages a safer, more deliberate process to move from PPE to PROD with explicit parameterization of production endpoints.
The first time setting up a new workspace (or workspace category defined above):
ppe branch.
ppe branch to initialize the main branch.
main branch.
Each engineer sets up feature workspaces attached to specific capacities and configurations. Initially, we tried using a single workspace per engineer, but switching branches for multiple in-progress items proved inefficient.
ppe branch.
feature/name branch and develop the required work.
ppe branch.
main branch and squash-merge.
Even with an ideal branching strategy and workspace structure, there are still important factors to consider during development. We've highlighted a few key considerations, though this list is not exhaustive and does not cover all item types. The main goal is to emphasize the importance of developers adopting a CI/CD mindset and thinking about how their code will be promoted from one deployment environment to another.
Sample connection dictionary in Util_Connection_Library Notebook
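# env is assumed to be defined earlier in the Notebook (for example "ppe" or "prod")
# Production lakehouse root, for sources that are always read from production
core_prod = "abfss://eng-prod-storage@onelake.dfs.fabric.microsoft.com/Core.Lakehouse"
# Environment-specific lakehouse root
core_default = f"abfss://eng-{env}-storage@onelake.dfs.fabric.microsoft.com/Core.Lakehouse"
# Keys ending in _default follow the current environment; keys ending in _prod always point to production
connection = {
    "dataprod_default": f"{core_default}/Tables/Dataprod/",
    "curate_default": f"{core_default}/Tables/Curate/",
    "temp_default": f"{core_default}/Files/Temp/",
    "intake_prod": f"{core_prod}/Files/Intake/",
    "hr_prod": "abfss://data@contosohr.dfs.core.windows.net/",
    "finance_prod": "abfss://data@contosofinance.dfs.core.windows.net/",
    "marketing_prod": "abfss://data@contosomarketing.dfs.core.windows.net/",
}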
In a Notebook, import the library, or %run the Notebook containing the library.
%run Util_Connection_Library
Once imported, refer to the connection dictionary directly in the reads and writes.
spark.read.format("delta").load(connection["dataprod_default"] + "DIM_Calendar")\
.createOrReplaceTempView("vwCalendar")
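To make the environment dependency explicit, here is a minimal sketch that wraps the same dictionary in a function. How `env` is actually resolved is not shown in the sample above, so this version takes it as an argument; `build_connections` and `table_path` are hypothetical helper names introduced only for illustration.

```python
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def build_connections(env: str) -> dict[str, str]:
    """Return connection paths for a deployment environment such as 'ppe' or 'prod'."""
    core_prod = f"abfss://eng-prod-storage@{ONELAKE_HOST}/Core.Lakehouse"
    core_default = f"abfss://eng-{env}-storage@{ONELAKE_HOST}/Core.Lakehouse"
    return {
        # *_default keys follow the environment passed in.
        "dataprod_default": f"{core_default}/Tables/Dataprod/",
        "curate_default": f"{core_default}/Tables/Curate/",
        "temp_default": f"{core_default}/Files/Temp/",
        # *_prod keys always point at production, regardless of env.
        "intake_prod": f"{core_prod}/Files/Intake/",
    }


def table_path(connections: dict[str, str], key: str, table: str) -> str:
    """Join a connection root and a table name into a full path."""
    return connections[key] + table


ppe = build_connections("ppe")
prod = build_connections("prod")
print(table_path(ppe, "dataprod_default", "DIM_Calendar"))
```

Because the Notebook code only ever refers to keys such as `dataprod_default`, the same Notebook can be promoted from PPE to PROD without editing any paths.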
Data pipelines, Lakehouse Shortcuts, Dataflow Gen2, and semantic models rely on Fabric connections (found in 'Manage connections and gateways').
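One general way to parameterize such connection references is a find-and-replace over item definitions at deployment time. The sketch below assumes the definitions stored in the repository contain PPE connection IDs that should be swapped for their PROD equivalents; the IDs, the `CONNECTION_MAP` structure, and the `swap_connection_ids` helper are all hypothetical and are not taken from any specific tool.

```python
# Hypothetical placeholder IDs. Real connection IDs are listed under
# 'Manage connections and gateways'.
CONNECTION_MAP = {
    "PROD": {
        "ppe-connection-id-0001": "prod-connection-id-0001",
    },
}


def swap_connection_ids(definition: str, environment: str) -> str:
    """Replace each known PPE connection ID with the ID for the target environment."""
    for ppe_id, target_id in CONNECTION_MAP.get(environment, {}).items():
        definition = definition.replace(ppe_id, target_id)
    return definition


pipeline_definition = '{"connection": "ppe-connection-id-0001"}'
print(swap_connection_ids(pipeline_definition, "PROD"))
```

Environments without an entry in the map (PPE here) leave the definition unchanged, so the repository content stays the source of truth for pre-production.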
We utilize the fabric-cicd Python library to automate deployments, providing a code-first solution for deploying Microsoft Fabric items from a repository into a workspace. Refer to this article, Introducing fabric-cicd Deployment Tool, for a more in-depth overview of the functionality.
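A minimal sketch of what a deployment script built on fabric-cicd might look like is shown below. The branch-to-environment mapping, the workspace IDs, the item types in scope, and the `resolve_target` and `deploy` helpers are assumptions for illustration; check the fabric-cicd documentation for the options supported by your installed version, including how authentication is resolved.

```python
# Hypothetical mapping from Git branch to target environment and workspace.
TARGETS = {
    "ppe": {"environment": "PPE", "workspace_id": "<ppe-workspace-id>"},
    "main": {"environment": "PROD", "workspace_id": "<prod-workspace-id>"},
}


def resolve_target(branch: str) -> dict[str, str]:
    """Return the deployment target for a branch, or fail for non-deployable branches."""
    if branch not in TARGETS:
        raise ValueError(f"Branch '{branch}' is not mapped to a deployment environment")
    return TARGETS[branch]


def deploy(branch: str, repository_directory: str) -> None:
    """Publish items from a repository directory into the workspace for the branch."""
    # Imported here so the mapping logic can be exercised without the package installed.
    from fabric_cicd import FabricWorkspace, publish_all_items, unpublish_all_orphan_items

    target = resolve_target(branch)
    workspace = FabricWorkspace(
        workspace_id=target["workspace_id"],
        environment=target["environment"],
        repository_directory=repository_directory,
        item_type_in_scope=["Notebook", "DataPipeline", "Environment"],
    )
    publish_all_items(workspace)  # create or update items found in the repository
    unpublish_all_orphan_items(workspace)  # remove workspace items no longer in the repository
```

A release pipeline could then call something like `deploy("ppe", "./workspace/<category>")` once per workspace category, where the directory path is a placeholder for your own repository layout.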
By adopting these principles, your team can establish a robust and repeatable approach to data engineering in Microsoft Fabric. With a safe branching model, dedicated workspaces per deployment environment, thorough parameterization, and automated deployments with fabric-cicd, you can navigate the complexities of modern data solutions with confidence. Feel free to adapt the specifics to your organization’s needs or constraints. Keep in mind that there is no single right answer for a CI/CD flow; this is simply one way to do it. Good luck, and happy deploying!
Contributors: Jacob Knightley, Joe Muziki, Kiefer Sheldon, Will Crayger (Lucid)