What is an elegant, fast, and easy way with PySpark to iterate over all existing workspaces, list all lakehouses, and finally get the metadata of all lakehouse tables, such as size, with the aim of displaying the sizes of all tables in all lakehouses from all workspaces in a Power BI report?
Hi @Element115
I couldn't find an out-of-the-box solution for this. Usually we would use the Power BI REST APIs or the Fabric REST APIs to get the data. Since you want to use PySpark, you can use Python libraries to call these REST APIs. Here are some of my ideas (a rough code sketch of steps 1-4 follows the list):
1. List all capacities and get their capacity IDs. Filter the result by capacity type (SKU) to keep only capacities that may contain Fabric items. Capacities - List Capacities - REST API (Core) | Microsoft Learn
2. Iterate over the above capacities to get the list of all workspaces in those capacities. Workspaces - List Workspaces - REST API (Admin) | Microsoft Learn
3. Iterate over the above workspaces to get the list of all lakehouses in those workspaces. Items - List Lakehouses - REST API (Lakehouse) | Microsoft Learn
4. Iterate over all lakehouses to get the tables in each of them. Tables - List Tables - REST API (Lakehouse) | Microsoft Learn
5. To get the size of a delta table, there is currently no such API. You might refer to the solution in this thread: QUESTION::LAKEHOUSE::SQL ENDPOINT::SYSTEM VIEWS - Microsoft Fabric Community
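Here is a rough sketch of steps 1-4, calling these REST APIs with the requests library from inside a Fabric notebook. The "pbi" token audience, the capacityId query parameter, the SKU filter, and the exact response shapes are assumptions based on the linked docs, and pagination (continuation tokens) is omitted for brevity:

import requests
from notebookutils import mssparkutils

# Token for the Fabric/Power BI REST APIs, using the notebook's identity
# (the "pbi" audience is an assumption; adjust to your tenant's requirements)
token = mssparkutils.credentials.getToken("pbi")
headers = {"Authorization": f"Bearer {token}"}
base = "https://api.fabric.microsoft.com/v1"

def get_json(url, params=None):
    # Thin helper; continuation tokens / paging are ignored in this sketch
    resp = requests.get(url, headers=headers, params=params)
    resp.raise_for_status()
    return resp.json()

# 1. List capacities and keep only SKUs that can host Fabric items
#    (the "not PP*" check is a placeholder filter; adjust for your tenant)
capacities = get_json(f"{base}/capacities").get("value", [])
fabric_capacities = [c for c in capacities if not str(c.get("sku", "")).startswith("PP")]

all_tables = []
for cap in fabric_capacities:
    # 2. List workspaces on this capacity (admin API; capacityId filter assumed)
    workspaces = get_json(f"{base}/admin/workspaces",
                          params={"capacityId": cap["id"]}).get("workspaces", [])
    for ws in workspaces:
        # 3. List lakehouses in the workspace
        lakehouses = get_json(f"{base}/workspaces/{ws['id']}/lakehouses").get("value", [])
        for lh in lakehouses:
            # 4. List tables in the lakehouse
            lh_tables = get_json(
                f"{base}/workspaces/{ws['id']}/lakehouses/{lh['id']}/tables"
            ).get("data", [])
            for t in lh_tables:
                all_tables.append((ws["id"], lh["displayName"], t["name"]))

display(spark.createDataFrame(all_tables, ["workspace_id", "lakehouse", "table"]))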
Hope this helps.
Best Regards,
Jing
If this post helps, please Accept it as Solution to help other members find it. Appreciate your Kudos!
Another way to get all your lakehouse IDs across all your workspaces is to use mssparkutils like so:
from pyspark.sql import SparkSession
from notebookutils import mssparkutils

# Initialize Spark Session (already available as `spark` in a Fabric notebook)
spark = SparkSession.builder.appName("MultiLakehouseAccess").getOrCreate()

# Workspaces (and their IDs) whose lakehouses we want to list
workspaces = ['ECOCOUNT', 'IVDS', 'RWS']
workspace_IDs = {
    'ECOCOUNT': '47dc360e-87b5-46de-9415-c5c54e81687a',
    'IVDS': '7e3d0ca3-a8f6-4483-afe9-45978ffa3c6d',
    'RWS': '142e0ce2-c040-436d-8eb8-58aea822ed45'
}

# List the lakehouses in each workspace
for key in workspace_IDs:
    display(mssparkutils.lakehouse.list(workspace_IDs[key]))
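To feed the later steps, the same listing can be collected into a plain Python structure instead of just being displayed. A sketch, assuming each entry returned by mssparkutils.lakehouse.list() exposes id and displayName attributes (check the actual fields in your runtime):

# Build a flat inventory of (workspace name, workspace ID, lakehouse name, lakehouse ID)
lakehouse_inventory = []
for ws_name, ws_id in workspace_IDs.items():
    for lh in mssparkutils.lakehouse.list(ws_id):
        # `id` and `displayName` are assumed attribute names on the returned items
        lakehouse_inventory.append((ws_name, ws_id, lh.displayName, lh.id))

inventory_df = spark.createDataFrame(
    lakehouse_inventory,
    ["workspace_name", "workspace_id", "lakehouse_name", "lakehouse_id"]
)
display(inventory_df)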
But the problem then remains: how do you get DeltaTable.detail() to work? The problem is that in

from delta.tables import DeltaTable

deltaTable = DeltaTable.forName(spark, tableName)
# Get the information of the table as a DataFrame
table_size_df = deltaTable.detail()

the spark parameter is a SparkSession object that is only aware of the current workspace context, meaning it only sees the artifacts inside the workspace in which the Notebook is saved. So even though you can list all your lakehouse IDs with the previous script, it doesn't help you, because you can't get the SparkSession object to see the metadata of those tables with the DeltaTable.detail() method.
I have also updated the Spark runtime to use Spark 3.5 because the 3.5 API has this method:

addArtifacts(*path[, pyfile, archive, file]): Add artifact(s) to the client session.

But the runtime engine complains when I use it, saying this method can only be used with Spark Connect, and at the moment I have no idea what Spark Connect is or how to use it. Still, this method looks like it could add lakehouses that live in a different workspace to the current SparkSession object, and if so, the metadata of those lakehouses should become available.
@Anonymous Would you know how to use Spark Connect from a Fabric Notebook to be able to use SparkSession.addArtifacts() and add all the paths to other workspaces and lakehouses to the SparkSession?
Regarding all points but #5: I'd prefer not to use the REST APIs. I find them clunky and not as elegant as making the calls directly from PySpark or DeltaTable. Besides, the Notebook lives inside Fabric, as do the various lakehouses, so having to go through REST HTTPS calls to get this information seems suboptimal compared to staying within the Fabric compute environment (Jupyter, PySpark, abfss, etc.).
In any case, one solution inspired by point #5 (which I'll end up posting in that thread) is to duplicate a Notebook in every workspace that has a lakehouse, run the Notebook in each workspace, populate a size-monitoring lakehouse table, and use that table as the data source for a new Power BI report with a page that monitors the size of all tables across all workspaces as they grow over time.
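A minimal sketch of what such a per-workspace Notebook could look like: it enumerates the tables of the Notebook's default lakehouse through the Spark catalog and appends one snapshot row per table to a monitoring table. The table_size_history name and its schema are made up for illustration:

from datetime import datetime, timezone
from delta.tables import DeltaTable

# Snapshot timestamp for this run
run_ts = datetime.now(timezone.utc)

# Enumerate the tables of the default lakehouse via the Spark catalog
rows = []
for t in spark.catalog.listTables():
    if t.isTemporary:
        continue  # skip temp views, keep only lakehouse tables
    detail = DeltaTable.forName(spark, t.name).detail().collect()[0]
    rows.append((run_ts, t.name, detail["sizeInBytes"], detail["numFiles"]))

snapshot_df = spark.createDataFrame(
    rows, ["captured_at", "table_name", "size_in_bytes", "num_files"]
)

# Append the snapshot to the (hypothetical) size-monitoring table
snapshot_df.write.mode("append").saveAsTable("table_size_history")

The Power BI report then just points at table_size_history and plots size_in_bytes over captured_at per table.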
I have tried multiple different angles to do this with PySpark and DeltaTable, and every single time there's no way around the fact that the SparkSession object is only aware of the current workspace and what's in it, EVEN THOUGH I can do this:
from delta.tables import DeltaTable

# Fully qualified OneLake path to a table in a lakehouse in another workspace
dtable = DeltaTable.forPath(spark, "abfss://workspace_id@onelake.dfs.fabric.microsoft.com/lakehouse_id/Tables/DimDate")
dtable_meta = dtable.detail()                             # one-row DataFrame of table metadata
dtable_col = dtable_meta.select("sizeInBytes").collect()  # list of Row objects
display(dtable_col[0].sizeInBytes)
So the Notebook running this code lives in, say, workspace ws_0, and the abfss URI is able to access the table data from a lakehouse in another workspace, say, ws_1. The problem with this approach is that when you have this many tables, you don't want to hard-code all their names. As soon as a table is renamed or deleted, you have to go into your Notebook and change it accordingly, which makes for a brittle solution.
What I can't fathom is why we can get to the data of a known table across workspaces using this method, but we can't enumerate the tables and their metadata the same way. Instinctively, you would think that a wildcard approach for the URI would do the trick, like so:
abfss://workspace_id@onelake.dfs.fabric.microsoft.com/lakehouse_id/Tables/*
but this does not work. So you are now forced to duplicate Notebooks.
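For what it's worth, one possible way around the hard-coded table names is to list the folders under Tables/ and feed each one to DeltaTable.forPath(). This is an untested sketch; it assumes mssparkutils.fs.ls accepts the same cross-workspace abfss URIs that DeltaTable.forPath does, and that every folder under Tables/ is a Delta table:

from delta.tables import DeltaTable
from notebookutils import mssparkutils

# OneLake root of the Tables folder in a lakehouse from another workspace
tables_root = "abfss://workspace_id@onelake.dfs.fabric.microsoft.com/lakehouse_id/Tables/"

sizes = []
for entry in mssparkutils.fs.ls(tables_root):
    if not entry.isDir:
        continue  # every Delta table lives in its own folder under Tables/
    detail = DeltaTable.forPath(spark, entry.path).detail().collect()[0]
    sizes.append((entry.name.rstrip("/"), detail["sizeInBytes"]))

display(spark.createDataFrame(sizes, ["table_name", "size_in_bytes"]))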