Introduction: Why the Spark Connector Matters
Imagine you're a data professional who is comfortable with Apache Spark but needs to tap into data stored in Microsoft Fabric's Data Warehouse or a Lakehouse's SQL analytics endpoint. The Spark connector for Microsoft Fabric is your solution: it lets you query and analyze that data with minimal configuration, simplifying integration and enhancing productivity. This tool is poised to revolutionize how Spark developers and data scientists interact with Microsoft's data platform.
What is the Spark Connector for Microsoft Fabric Data Warehouse?
The Spark Connector for Microsoft Fabric Data Warehouse is a native integration that enables Spark sessions in Fabric to directly work with data stored in Fabric’s relational warehouses and Lakehouse SQL endpoints.
Importantly, the Spark Connector works not only within a single Fabric workspace but also across multiple workspaces, making it flexible for enterprise scenarios. For example, you can run Spark code in one workspace and easily query a data warehouse that resides in another workspace, as long as you have the proper access.
Another key aspect of this connector is that it enforces Microsoft Fabric's security models. Any table or view you access through Spark will respect the security rules defined in the SQL engine, including object-level security (OLS), row-level security (RLS), and column-level security (CLS).
Key Features and Capabilities: What Sets It Apart
The Spark Connector comes with several powerful capabilities that make it easier to integrate Spark with the Microsoft Fabric Data Warehouse or a Lakehouse's SQL analytics endpoint. Below are its key features:
These features collectively make the Spark Connector a convenient bridge between Spark's ecosystem and Fabric's powerful SQL-based data warehouse. Next, let's look at how authentication and permissions work behind the scenes to enable this integration.
Authentication: Seamless Sign-In
The Spark Connector uses Microsoft Entra ID (formerly Azure Active Directory) for authenticating to the Fabric Data Warehouse. When you are logged into a Fabric workspace and running a Spark notebook or job, your user credentials are automatically passed through to the SQL engine for any connector operations, requiring no additional configuration.
Permissions: Ensure Access
You need at least Read permission on the warehouse or SQL analytics endpoint, similar to CONNECT permission in SQL Server, plus object-level permissions for specific tables or views.
Reading Data with Spark from a Fabric Data Warehouse
One of the primary uses of the Spark Connector is to read data from a Fabric Data Warehouse or Lakehouse into a Spark DataFrame for analysis. Begin your Spark code with these imports:
import com.microsoft.spark.fabric
from com.microsoft.spark.fabric.Constants import Constants
Reading Data Within the Same Workspace
If the Spark notebook's workspace is the same as the target data's workspace, you only need to reference the three-part name of the table or view you want to read. The format is:
WarehouseOrLakehouseName.SchemaName.TableOrViewName
For example, if you have a warehouse called SalesDW, and inside it a schema dbo with a table Customers, you can load that table into Spark as follows:
df = spark.read.synapsesql("SalesDW.dbo.Customers")
This single line will connect to the SalesDW warehouse or lakehouse's SQL endpoint, fetch the data from dbo.Customers, and return a Spark DataFrame df containing that data.
You can treat df like any other DataFrame; for instance, you can apply Spark operations or simply display the data:
df_limited = spark.read.synapsesql("SalesDW.dbo.Customers").limit(10)
df_filtered = spark.read.synapsesql("SalesDW.dbo.Customers").filter("Region = 'EMEA'")
df_selected = spark.read.synapsesql("SalesDW.dbo.Customers").select("CustomerID", "Name", "Region")
These will push down the limit, filter, or column selection to the SQL query in most cases, making the retrieval efficient.
The example above reads all rows from a table, showcasing the connector's simplicity.
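If you want to confirm how a limit, filter, or column selection is executed for a given read, you can inspect the query plan with standard Spark tooling; this is generic Spark, not connector-specific:
# Inspect the physical plan to see how the read is executed.
df_emea = (spark.read.synapsesql("SalesDW.dbo.Customers")
    .filter("Region = 'EMEA'")
    .select("CustomerID", "Name"))
df_emea.explain()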
Reading Data Across Different Workspaces
A standout feature of the Spark Connector is the ability to read data from a warehouse or Lakehouse that resides in another Fabric workspace. This is useful when, for example, your Spark notebook is in a development workspace but you need to query a production data warehouse in a separate workspace. To do this, you provide the workspace ID and the ID of the warehouse or Lakehouse item as options before calling synapsesql().
In practice, you use Spark's read.option() to specify these IDs using constants from the connector's Constants class. For example:
import com.microsoft.spark.fabric
from com.microsoft.spark.fabric.Constants import Constants
ws_id = "WSID"
wh_id = "GUIDWHID"
# The commented options below are optional.
# Uncomment and modify them if you need to specify a particular database or schema.
# db_name = "SalesDW"
# sch_name = "dbo"
df_remote = (spark.read
    .option(Constants.WorkspaceId, ws_id)
    .option(Constants.DatawarehouseId, wh_id)
    # .option(Constants.DatabaseName, db_name)
    # .option(Constants.SchemaName, sch_name)
    .synapsesql("SalesDW.dbo.Customers"))
In the above PySpark code, we set the WorkspaceId to the other workspace's ID and the DatawarehouseId to the specific Data Warehouse item's ID (for a Lakehouse, you would use LakehouseId instead). Then we call synapsesql("SalesDW.dbo.Customers") as before.
Note: If you don't know the workspace or warehouse IDs, the images below show how to find the necessary information. Also, for cross-workspace reads, performance might be slightly impacted.
How to Get Your Workspace ID Directly
If you are a Fabric admin user, you can use this method.
Showing the direct method to locate your Workspace ID.
How to Get Your Workspace ID (Workaround)
Navigate to the relevant workspace; the area highlighted in grey indicates your workspace ID.
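If you prefer to look up the current workspace ID in code rather than from the portal, the Semantic Link (sempy) library available in Fabric notebooks can help; this is a minimal sketch and assumes sempy is present in your runtime:
# Assumes the Semantic Link (sempy) library is available in the Fabric notebook runtime.
import sempy.fabric as fabric
current_ws_id = fabric.get_workspace_id()  # ID of the workspace this notebook is attached to
print(current_ws_id)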
The example below reads all rows from another workspace's table, showcasing the same simplicity.
Using Custom SQL Queries (Passthrough)
Beyond reading entire tables or views, the Spark Connector also supports running a custom SQL query against the warehouse or Lakehouse endpoint and retrieving the result into a Spark DataFrame.
To do this, you can use the option(Constants.DatabaseName, "WarehouseName") along with synapsesql("Your T-SQL query"). For example:
query = "SELECT COUNT(*) as CustomerCount FROM dbo.Customers"
df_query = (spark.read
    .option(Constants.DatabaseName, "SalesDW")
    .synapsesql(query))
Here we specify the database (warehouse) name for context and pass a T-SQL query string to synapsesql() from the same Fabric workspace. The connector will send that exact query to the Fabric Data Warehouse.
The example below reads all rows from a table in the same workspace's warehouse using a SQL query, showcasing its simplicity.
The next example demonstrates how to read all rows from a table in another workspace's warehouse using a SQL query, highlighting the same simplicity.
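For reference, here is a minimal sketch of what such a cross-workspace query might look like, combining the WorkspaceId and DatawarehouseId options shown earlier with a query string; the IDs are placeholders, and this assumes the same option combination applies to query passthrough:
# Placeholder IDs; replace with the target workspace and warehouse item IDs.
ws_id = "WSID"
wh_id = "GUIDWHID"
remote_query = "SELECT Region, COUNT(*) AS CustomerCount FROM dbo.Customers GROUP BY Region"
df_remote_query = (spark.read
    .option(Constants.WorkspaceId, ws_id)
    .option(Constants.DatawarehouseId, wh_id)
    .option(Constants.DatabaseName, "SalesDW")
    .synapsesql(remote_query))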
Writing Data to the Fabric Data Warehouse from Spark
In addition to reading data, the Spark Connector allows Spark users to write DataFrame data into a Fabric Data Warehouse table. This means you can take the results of Spark transformations or machine learning predictions and insert them into the governed, central warehouse for reporting or further analysis.
When writing, it's important to note that the Fabric Data Warehouse and the Lakehouse SQL endpoint have some constraints: currently, only warehouse tables are writable via this connector. The Lakehouse's SQL endpoint is read-only, so you cannot write back to a Lakehouse through SQL.
Example: Writing a Spark DataFrame to a Warehouse Table
Suppose you have a Spark DataFrame df that you want to save into a Fabric warehouse called SalesDW, under schema dbo as a new table TopCustomers. The code would look like:
df.write.synapsesql("SalesDW.dbo.TopCustomers")
By default, this will attempt to create a new table TopCustomers in SalesDW.dbo (since it doesn't exist yet) and insert all data from the DataFrame. The connector's write mechanism uses a two-phase approach behind the scenes: first, it stages the Spark data in an intermediate storage location, and then it issues a T-SQL COPY INTO command to efficiently load the staged data into the warehouse table. This approach is designed for scalability, so even large DataFrames can be ingested in a robust way.
If the table TopCustomers already exists in the warehouse, the default behavior (called ErrorIfExists mode) is to throw an error and avoid accidentally overwriting data. However, you can specify a different save mode depending on your scenario. Make sure to choose the mode that fits your use case; you can specify it like so:
df.write.mode("append").synapsesql("SalesDW.dbo.TopCustomers")
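Putting the pieces together, a minimal end-to-end sketch might read from the warehouse, aggregate in Spark, and append the result to a summary table; the table names here are illustrative, and the target table is assumed to already exist:
# Read, transform, and write back using the connector (table names are illustrative).
customers = spark.read.synapsesql("SalesDW.dbo.Customers")
region_counts = (customers
    .groupBy("Region")
    .count()
    .withColumnRenamed("count", "CustomerCount"))
# Append to the assumed existing summary table; omit mode() to create a new table instead.
region_counts.write.mode("append").synapsesql("SalesDW.dbo.CustomerCountsByRegion")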
Note: One important consideration is that the write capability is relatively new and is only available in the latest Fabric Spark runtime (1.3 or later). If you are using an older Spark runtime in Fabric, the write.synapsesql() might not yet be supported.
Example: Writing a Spark DataFrame to a SQL Endpoint Lakehouse Table
If you need to create a new table in a Lakehouse, one approach is to write to a Warehouse and then move or copy the data into the Lakehouse if necessary. Alternatively, use the Delta Lake capabilities in Spark to write directly to the Lakehouse's files.
Suppose you want to store your DataFrame in a Lakehouse in Delta format, rather than writing directly to a Fabric warehouse table. The code looks like this:
df.write.format("delta") \
    .mode("overwrite") \
    .save("abfss://path.Lakehouse/Tables/TopCustomers")
This will create (or overwrite) a Delta-based data folder at the specified location (abfss://path.Lakehouse/Tables/TopCustomers) inside your Lakehouse storage. Once written, all the data is stored as Delta files in that path.
The example above demonstrates how to read all rows from a table in another workspace's SQL endpoint Lakehouse using a SQL query and how to write the data into a Lakehouse, highlighting its simplicity.
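To verify the Delta write above, you can read the same path back into Spark with standard Delta Lake APIs, using the placeholder path from the example:
# Read the Delta data back from the Lakehouse path written above.
df_check = spark.read.format("delta").load("abfss://path.Lakehouse/Tables/TopCustomers")
df_check.show(5)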
Best Practices for Using the Spark Connector
To make the most of the Spark Connector for Fabric Data Warehouse, consider the following best practices and tips:
By following these practices, you'll ensure smoother use of the connector and avoid common pitfalls.
Conclusion
The Spark Connector for Microsoft Fabric Data Warehouse opens up a new realm of possibilities for data professionals by combining the power of Apache Spark with the robust SQL-based storage of Fabric's cloud data platform. It empowers data developers to use Spark's rich libraries on data where it lives. By following the guidelines and tips provided, developers can harness this tool to streamline their data pipelines and glean insights faster. Happy data exploring with Spark and Microsoft Fabric's Data Warehouse!