Introduction: Why the Spark Connector Matters
Imagine you're a data professional who is comfortable with Apache Spark but needs to tap into data stored in Microsoft Fabric's Data Warehouse or a Lakehouse's SQL analytics endpoint. The Spark connector for Microsoft Fabric is your solution: it lets you query and analyze that data with minimal configuration, simplifying integration and enhancing productivity. This tool is poised to revolutionize how Spark developers and data scientists interact with Microsoft's data platform.
What is the Spark Connector for Microsoft Fabric Data Warehouse?
The Spark Connector for Microsoft Fabric Data Warehouse is a native integration that enables Spark sessions in Fabric to directly work with data stored in Fabric’s relational warehouses and Lakehouse SQL endpoints.
Importantly, the Spark Connector works not only within a single Fabric workspace but also across multiple workspaces, making it flexible for enterprise scenarios. For example, you can run Spark code in one workspace and easily query a data warehouse that resides in another workspace, as long as you have the proper access.
Another key aspect of this connector is that it enforces Microsoft Fabric's security models. Any table or view you access through Spark will respect the security rules defined in the SQL engine, including object-level security (OLS), row-level security (RLS), and column-level security (CLS).
Key Features and Capabilities: What Sets It Apart
The Spark Connector comes with several powerful capabilities that make it easier to integrate Spark with the Microsoft Fabric Data Warehouse or a Lakehouse's SQL analytics endpoint. Below are its key features:
These features collectively make the Spark Connector a convenient bridge between Spark's ecosystem and Fabric's powerful SQL-based data warehouse. Next, let's look at how authentication and permissions work behind the scenes to enable this integration.
Authentication: Seamless Sign-In
The Spark Connector uses Microsoft Entra ID (formerly Azure Active Directory) for authenticating to the Fabric Data Warehouse. When you are logged into a Fabric workspace and running a Spark notebook or job, your user credentials are automatically passed through to the SQL engine for any connector operations, requiring no additional configuration.
Permissions: Ensure Access
You need at least Read permission on the warehouse or SQL analytics endpoint, similar to CONNECT permission in SQL Server, plus object-level permissions for specific tables or views.
Reading Data with Spark from a Fabric Data Warehouse
One of the primary uses of the Spark Connector is to read data from a Fabric Data Warehouse or Lakehouse into a Spark DataFrame for analysis. Begin your Spark code with these imports:
import com.microsoft.spark.fabric
from com.microsoft.spark.fabric.Constants import Constants
Reading Data Within the Same Workspace
If the Spark notebook's workspace is the same as the target data's workspace, you only need to reference the three-part name of the table or view you want to read. The format is:
WarehouseOrLakehouseName.SchemaName.TableOrViewName
For example, if you have a warehouse called SalesDW, and inside it a schema dbo with a table Customers, you can load that table into Spark as follows:
df = spark.read.synapsesql("SalesDW.dbo.Customers")
This single line will connect to the SalesDW warehouse or lakehouse's SQL endpoint, fetch the data from dbo.Customers, and return a Spark DataFrame df containing that data.
You can treat df like any other DataFrame; for instance, you can apply Spark operations or simply display the data:
df_limited = spark.read.synapsesql("SalesDW.dbo.Customers").limit(10)
df_filtered = spark.read.synapsesql("SalesDW.dbo.Customers").filter("Region = 'EMEA'")
df_selected = spark.read.synapsesql("SalesDW.dbo.Customers").select("CustomerID", "Name", "Region")
These will push down the limit, filter, or column selection to the SQL query in most cases, making the retrieval efficient.
The example above reads all rows from a table, showcasing the connector's simplicity.
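If you want to confirm how a limit, filter, or column selection is executed for a given read, you can inspect the query plan with standard Spark tooling; this is generic Spark, not connector-specific:
# Inspect the physical plan to see how the read is executed.
df_emea = (spark.read.synapsesql("SalesDW.dbo.Customers")
    .filter("Region = 'EMEA'")
    .select("CustomerID", "Name"))
df_emea.explain()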
Reading Data Across Different Workspaces
A standout feature of the Spark Connector is the ability to read data from a warehouse or Lakehouse that resides in another Fabric workspace. This is useful when, for example, your Spark notebook is in a development workspace but you need to query a production data warehouse in a separate workspace. To do this, you provide the workspace ID and the ID of the warehouse or Lakehouse item as options before calling synapsesql().
In practice, you use Spark's read.option() to specify these IDs using constants from the connector's Constants class. For example:
import com.microsoft.spark.fabric
from com.microsoft.spark.fabric.Constants import Constants
ws_id = "WSID"
wh_id = "GUIDWHID"
# The commented options below are optional.
# Uncomment and modify them if you need to specify a particular database or schema.
# db_name = "SalesDW"
# sch_name = "dbo"
df_remote = (spark.read
    .option(Constants.WorkspaceId, ws_id)
    .option(Constants.DatawarehouseId, wh_id)
    # .option(Constants.DatabaseName, db_name)
    # .option(Constants.SchemaName, sch_name)
    .synapsesql("SalesDW.dbo.Customers"))
In the above PySpark code, we set the WorkspaceId to the other workspace's ID and the DatawarehouseId to the specific Data Warehouse item's ID (for a Lakehouse, you would use LakehouseId instead). Then we call synapsesql("SalesDW.dbo.Customers") as before.
Note: If you don't know the workspace or warehouse IDs, the images below show how to find the necessary information. Also, for cross-workspace reads, performance might be slightly impacted.
How to Get Your Workspace ID Directly
If you are a Fabric admin user, you can use this method.
Showing the direct method to locate your Workspace ID.
How to Get Your Workspace ID (Workaround)
Navigate to the relevant workspace; the area highlighted in grey indicates your workspace ID.
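If you prefer to look up the current workspace ID in code rather than from the portal, the Semantic Link (sempy) library available in Fabric notebooks can help; this is a minimal sketch and assumes sempy is present in your runtime:
# Assumes the Semantic Link (sempy) library is available in the Fabric notebook runtime.
import sempy.fabric as fabric
current_ws_id = fabric.get_workspace_id()  # ID of the workspace this notebook is attached to
print(current_ws_id)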
The example below reads all rows from another workspace's table, showcasing the same simplicity.
Using Custom SQL Queries (Passthrough)
Beyond reading entire tables or views, the Spark Connector also supports running a custom SQL query against the warehouse or Lakehouse endpoint and retrieving the result into a Spark DataFrame.
To do this, you can use the option(Constants.DatabaseName, "WarehouseName") along with synapsesql("Your T-SQL query"). For example:
query = "SELECT COUNT(*) as CustomerCount FROM dbo.Customers"
df_query = (spark.read
    .option(Constants.DatabaseName, "SalesDW")
    .synapsesql(query))
Here we specify the database (warehouse) name for context and pass a T-SQL query string to synapsesql() from the same Fabric workspace. The connector will send that exact query to the Fabric Data Warehouse.
The example below reads all rows from a table in the same workspace's warehouse using a SQL query, showcasing its simplicity.
The next example demonstrates how to read all rows from a table in another workspace's warehouse using a SQL query, highlighting the same simplicity.
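For reference, here is a minimal sketch of what such a cross-workspace query might look like, combining the WorkspaceId and DatawarehouseId options shown earlier with a query string; the IDs are placeholders, and this assumes the same option combination applies to query passthrough:
# Placeholder IDs; replace with the target workspace and warehouse item IDs.
ws_id = "WSID"
wh_id = "GUIDWHID"
remote_query = "SELECT Region, COUNT(*) AS CustomerCount FROM dbo.Customers GROUP BY Region"
df_remote_query = (spark.read
    .option(Constants.WorkspaceId, ws_id)
    .option(Constants.DatawarehouseId, wh_id)
    .option(Constants.DatabaseName, "SalesDW")
    .synapsesql(remote_query))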
Writing Data to the Fabric Data Warehouse from Spark
In addition to reading data, the Spark Connector allows Spark users to write DataFrame data into a Fabric Data Warehouse table. This means you can take the results of Spark transformations or machine learning predictions and insert them into the governed, central warehouse for reporting or further analysis.
When writing, it's important to note that the Fabric Data Warehouse and the Lakehouse SQL endpoint have some constraints: currently, only warehouse tables are writable via this connector. The Lakehouse's SQL endpoint is read-only, so you cannot write back to a Lakehouse through SQL.
Example: Writing a Spark DataFrame to a Warehouse Table
Suppose you have a Spark DataFrame df that you want to save into a Fabric warehouse called SalesDW, under schema dbo as a new table TopCustomers. The code would look like:
df.write.synapsesql("SalesDW.dbo.TopCustomers")
By default, this will attempt to create a new table TopCustomers in SalesDW.dbo (since it doesn't exist yet) and insert all data from the DataFrame. The connector's write mechanism uses a two-phase approach behind the scenes: first, it stages the Spark data in an intermediate storage location, and then it issues a T-SQL COPY INTO command to efficiently load the staged data into the warehouse table. This approach is designed for scalability, so even large DataFrames can be ingested in a robust way.
If the table TopCustomers already exists in the warehouse, the default behavior (called ErrorIfExists mode) is to throw an error and avoid accidentally overwriting data. However, you can specify a different save mode depending on your scenario. Make sure to choose the mode that fits your use case; you can specify it like so:
df.write.mode("append").synapsesql("SalesDW.dbo.TopCustomers")
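Putting the pieces together, a minimal end-to-end sketch might read from the warehouse, aggregate in Spark, and append the result to a summary table; the table names here are illustrative, and the target table is assumed to already exist:
# Read, transform, and write back using the connector (table names are illustrative).
customers = spark.read.synapsesql("SalesDW.dbo.Customers")
region_counts = (customers
    .groupBy("Region")
    .count()
    .withColumnRenamed("count", "CustomerCount"))
# Append to the assumed existing summary table; omit mode() to create a new table instead.
region_counts.write.mode("append").synapsesql("SalesDW.dbo.CustomerCountsByRegion")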
Note: One important consideration is that the write capability is relatively new and is only available in the latest Fabric Spark runtime (1.3 or later). If you are using an older Spark runtime in Fabric, the write.synapsesql() might not yet be supported.
Example: Writing a Spark DataFrame to a SQL Endpoint Lakehouse Table
If you need to create a new table in a Lakehouse, one approach is to write to a Warehouse and then move or copy the data into the Lakehouse if necessary. Alternatively, use the Delta Lake capabilities in Spark to write directly to the Lakehouse's files.
Suppose you want to store your DataFrame in a Lakehouse in Delta format, rather than writing directly to a Fabric warehouse table. The code looks like this:
df.write.format("delta") \
    .mode("overwrite") \
    .save("abfss://path.Lakehouse/Tables/TopCustomers")
This will create (or overwrite) a Delta-based data folder at the specified location (abfss://path.Lakehouse/Tables/TopCustomers) inside your Lakehouse storage. Once written, all the data is stored as Delta files in that path.
The example above demonstrates how to read all rows from a table in another workspace's SQL endpoint Lakehouse using a SQL query and how to write the data into a Lakehouse, highlighting its simplicity.
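To verify the Delta write above, you can read the same path back into Spark with standard Delta Lake APIs, using the placeholder path from the example:
# Read the Delta data back from the Lakehouse path written above.
df_check = spark.read.format("delta").load("abfss://path.Lakehouse/Tables/TopCustomers")
df_check.show(5)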
Best Practices for Using the Spark Connector
To make the most of the Spark Connector for Fabric Data Warehouse, consider the following best practices and tips:
By following these practices, you'll ensure smoother use of the connector and avoid common pitfalls.
Conclusion
The Spark Connector for Microsoft Fabric Data Warehouse opens up a new realm of possibilities for data professionals by combining the power of Apache Spark with the robust SQL-based storage of Fabric's cloud data platform. It empowers data developers to use Spark's rich libraries on data where it lives. By following the guidelines and tips provided, developers can harness this tool to streamline their data pipelines and glean insights faster. Happy data exploring with Spark and Microsoft Fabric's Data Warehouse!