
DennesTorres
Impactful Individual

Spark Job Definition vs Notebooks

Hi,

Is there a reason why a Spark Job Definition can't accept a notebook directly from the portal? Instead, it requires us to download the notebook as a .py file and upload it again to the job definition, forcing a manual synchronization whenever we need to change the script.

Kind Regards,

 

Dennes


17 REPLIES
Anonymous
Not applicable

Hi @DennesTorres  - Thanks for using Fabric Community,

As I understand it, you are trying to run/schedule notebook code using a Spark Job Definition in MS Fabric.
Accessing a notebook directly from a Spark Job Definition in the portal is not supported as of now.

I would like to understand why you are looking for this feature and what you are trying to do with a Spark Job Definition, when we can actually run/schedule the code with the help of notebooks in Fabric.


I have attached screenshots for your reference showing how to schedule a notebook run.

[Screenshot: opening the notebook's schedule settings]

[Screenshot: configuring the notebook schedule]

Please do let us know if you have further queries.

Hi,

I confess I missed this feature of scheduling a notebook directly, without using a Spark Job Definition. I'm trying to schedule a maintenance job that would run over multiple lakehouses; since the Spark Job Definition allows linking multiple lakehouses, I was hoping this could work.

Kind Regards,

 

Dennes

 

However, if I understand correctly, this feature only allows the notebook to be executed over one lakehouse.

Anonymous
Not applicable

Hi @DennesTorres ,

We can use multiple lakehouses in a single notebook.
I have attached a screenshot showing how to add multiple lakehouses to a notebook.

[Screenshot: adding multiple lakehouses to a notebook]


Hope this helps.

Hi,

Yes, but how do we reference them in the notebook code? How do we iterate over them?

I just tested this option. Even with multiple lakehouses attached to a notebook, the folder /lakehouse contains only one subfolder, /default, for the default lakehouse. I don't know how to access the other ones in the PySpark code and iterate through them.
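
For reference, this is roughly how I checked it (a small sketch; the /lakehouse folder is the local mount point Fabric exposes inside the notebook session):

import os

# Only the default lakehouse shows up under the local mount point.
print(os.listdir("/lakehouse"))          # expected: ['default']
print(os.listdir("/lakehouse/default"))  # expected: ['Files', 'Tables']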

Kind Regards,

Dennes

Anonymous
Not applicable

Hi @DennesTorres ,

Try using this code:

# Get the list of lakehouses to read tables from.
lakehouses = ["gopi_lake_house", "gopi_lakehouse_2"]

# Loop through the lakehouses and read all tables from each lakehouse.
for lakehouse in lakehouses:
    tables = spark.sql(f"SHOW TABLES IN {lakehouse}")
    display(tables)

[Screenshot: SHOW TABLES output for multiple lakehouses in a notebook]

Hi,

I was trying the idea of the array as well, but I also need to recover the list of tables from each lakehouse.

The "Show Tables In ..." in your example only works for the default lakehouse. If the lakehouse is not the default, it doesn't work.

Kind Regards,

 

Dennes

Hi,

Additional attempts I made:

lakehouses = ["demolake", "MaltaLake", "Sales"]

for lake in lakehouses:
    spark.catalog.setCurrentDatabase(lake)
    spark.sql('show tables').show()

The setCurrentDatabase call fails on the second lakehouse, because it doesn't work with a database located in a different workspace than the default one.
 
lakehouses = ["demolake", "MaltaLake", "Sales"]

for lake in lakehouses:
    spark.sql(f'USE {lake}')
    spark.sql('show tables').show()

Same problem: USE doesn't work for a database in a different workspace than the default one.
 
Am I missing something?

Kind Regards,
 
Dennes
Anonymous
Not applicable

Hi @DennesTorres ,

Yes, you are correct: we need to use "spark.catalog". It will list all the lakehouses present inside the workspace, even those not linked to the notebook.
Code:

# List every lakehouse (database) visible in the workspace.
lakehouses = spark.catalog.listDatabases()

lakehouse_list = []

# Collect just the lakehouse names.
for lakehouse in lakehouses:
    lakehouse_list.append(lakehouse.name)

print(lakehouse_list)



[Screenshot: list of lakehouse names printed by the code above]

To get the list of tables present inside a particular lakehouse, you can refer to the code below.
Code:

 

# Get the list of lakehouses to read tables from.
lakehouses = ["gopi_lake_house", "gopi_lakehouse_2"]

# Loop through the lakehouses and read all tables from each lakehouse.
for lakehouse in lakehouses:
    tables = spark.sql(f"SHOW TABLES IN {lakehouse}")
    tables.show()

 


Note: SHOW TABLES IN works even if the lakehouse is not the default. In my case only gopi_lakehouse_2 is selected as the default, but I am able to see the tables present inside both gopi_lake_house and gopi_lakehouse_2.

For example:

Executed in a Fabric notebook:
[Screenshot: SHOW TABLES output in the notebook]

Executed in a Spark job application:
[Screenshot: SHOW TABLES output in the Spark job run]

The above code works fine both in a notebook and in a Spark job application.
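
If you prefer not to hard-code the lakehouse names, the two snippets above can also be combined. A rough sketch, assuming every lakehouse of interest is in the notebook's workspace:

# Discover all lakehouses (databases) visible in the workspace, then list
# the tables in each one without hard-coding any names.
for db in spark.catalog.listDatabases():
    print(f"Tables in {db.name}:")
    spark.sql(f"SHOW TABLES IN {db.name}").show()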

Hope this was helpful. 

Hi,

Thank you, this part I got.

However, this is limited to lakehouses in the same workspace and is not related to the lakehouses linked to the notebook (if I can't loop through the lakehouses linked to the notebook, why link them at all? Is this a bug?).

I'm working on a solution using spark jobs and mssparkutils to mount the different lakehouses in folders, looping through the configurations.

In summary: In a notebook schedule we are limited to a single workspace and the attachment to lakehouses doesn't work very well. 

In a spark job we can break the workspace limitation by using mssparkutils and mount (still testing).

But there are many mismatches which seem to be missing features or even bugs:

  • We can't loop through the workspaces linked to a notebook.
  • The Spark job uses different syntaxes than a notebook (for example, the session needs to be manually established).
  • We can't build a notebook and schedule it as a Spark job; the development process needs to be different.

By the way, when we try to schedule a notebook, there is a huge limitation on the times we can choose for the schedule.

The image below illustrates this. Of course, we can't schedule something in the past, but the day is not taken into consideration, the choice of hours is very limited, and it doesn't allow editing.

[Screenshot: notebook schedule dialog with a limited choice of times]

 




Kind Regards,

Dennes

Anonymous
Not applicable

Hi @DennesTorres - Thanks for using Fabric Community,

I understand that you are having difficulty achieving your task.

Let's try to understand the difference between the default lakehouse and other linked/attached lakehouses in a notebook:

The default lakehouse is automatically mounted to the Spark cluster. This means that you can use relative paths to read and write data in the default lakehouse from your Spark notebooks (see the short example after the list below).

This can be beneficial for a few reasons:

  • It makes it easier to switch between development and production environments. For example, if you are developing a Spark notebook that reads data from your default lakehouse, you can simply switch the default lakehouse in your workspace to point to your production lakehouse without having to change any code.
  • It makes your code more readable and reusable. 
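
For example, with a default lakehouse attached, a relative path like the one in this sketch resolves against that lakehouse (the file name is only an illustration):

# Relative paths resolve against the default lakehouse mounted to the session.
df = spark.read.format("csv").option("header", "true").load("Files/sales.csv")

# A fully qualified OneLake path, by contrast, pins the code to one specific
# workspace and lakehouse:
# df = spark.read.format("csv").load("abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>.Lakehouse/Files/sales.csv")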

The linked/added lakehouses (if not the default) are just there for the notebook user to easily browse the data.

Can Fabric list the lakehouses linked to the notebook?
Currently Fabric does not support this. It can only show all the lakehouses in a workspace, not the ones specific to the notebook. Also, we cannot loop through the workspaces linked to a notebook.

Before I answer your further queries, let's try to understand the difference between a notebook and a Spark Job Definition (SJD).

Notebooks are interactive development environments that allow you to write code, run it, and see the results immediately. Notebooks are a good choice for data exploration, where you need to iterate on your code quickly.

A Spark Job Definition (SJD) doesn't support authoring; you write the code somewhere else, upload your main file and dependencies to it, and schedule the run.

So if you prefer to do data exploration on the Fabric platform, a notebook is recommended; if you prefer to use an external IDE to develop your application and run it on Fabric, an SJD is the way to go.

Why does an SJD use different syntaxes than a notebook?
For a notebook, certain variables are created/initialized by the system, such as the Spark context and Spark session; when you write code in a cell, these variables are ready to use. But an SJD is a standard Spark application, so it is the job of the user's code to create all the needed Spark objects, such as the Spark context and session.
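
A minimal SJD main file therefore typically starts by creating the session itself. A rough sketch (the app name is just a placeholder):

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # In an SJD main file the session is not pre-created, so build it here.
    spark = SparkSession.builder.appName("LakehouseMaintenance").getOrCreate()

    # From here on, the same Spark SQL calls used in the notebook work.
    for db in spark.catalog.listDatabases():
        spark.sql(f"SHOW TABLES IN {db.name}").show()

    spark.stop()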

Why can't we build a notebook and schedule it as an SJD directly?
As discussed earlier, the purpose of an SJD is different from that of a notebook. If you prefer to use an external IDE to develop your application and run it on Fabric, an SJD is the way to go.

In order to address the query related to the limitations of the schedule, I would request you to start a new post, as the original ask at the beginning of this thread is different from the present query.

Thank you.

Hi,

Using the information provided up to this point, I was able to write code to perform the maintenance of all lakehouses in the same workspace.

The Spark Job Definition, on the other hand, can be linked to multiple workspaces. One of the workspaces becomes the default workspace, while the other workspaces become a configuration.

We can loop through the configurations and use mssparkutils to mount the lakehouse addresses as local folders.

Once mounted, we loop through the mounts, discovering the tables of each lakehouse and executing the maintenance (a rough sketch of the idea follows).
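
Something along these lines (a simplified sketch; the OneLake paths and the OPTIMIZE step are placeholders for the values and maintenance commands taken from the job configuration):

import os
from notebookutils import mssparkutils

# OneLake paths for the lakehouses, taken from the job configuration
# (placeholder names here).
lakehouse_paths = [
    "abfss://WorkspaceA@onelake.dfs.fabric.microsoft.com/LakehouseA.Lakehouse",
    "abfss://WorkspaceB@onelake.dfs.fabric.microsoft.com/LakehouseB.Lakehouse",
]

for i, abfss_path in enumerate(lakehouse_paths):
    mount_point = f"/maintenance_{i}"
    mssparkutils.fs.mount(abfss_path, mount_point)

    # The mount exposes the lakehouse as a local folder; Tables/ holds one
    # subfolder per Delta table.
    local_tables = os.path.join(mssparkutils.fs.getMountPath(mount_point), "Tables")
    for table_name in os.listdir(local_tables):
        # Maintenance step (OPTIMIZE as an example), addressed via the OneLake path.
        # `spark` is the session created at the start of the job's main file.
        spark.sql(f"OPTIMIZE delta.`{abfss_path}/Tables/{table_name}`")

    mssparkutils.fs.unmount(mount_point)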

It worked like a charm; I will write an article about it.

Thank you for all the help!

Kind Regards,

Dennes

Anonymous
Not applicable

Hi @DennesTorres ,

Glad to know that your query got resolved.

Please continue to use the Fabric Community for help with your issues.

Hi,

Your example doesn't mention any import, and "catalog" doesn't work directly.

I tried to use "spark.catalog", but it only lists the default lakehouse and other lakehouses located in the same workspace, even those not linked to the notebook. It fails to list lakehouses linked to the notebook that are not the default one.

Is this to be used with the notebook schedule, linking multiple lakehouses, or is this intended to be used with a spark job?

Or did I make the wrong import?

Kind Regards,

Dennes

Hi,

The array is fixed in this example, and I'm trying to achieve something more dynamic. But this may be an option.

Kind Regards,

 

Dennes

Anonymous
Not applicable

Hi @DennesTorres ,

Can you please check this code:

 

 

lakehouses = catalog.listDatabases()

lakehouse_list = []

for lakehouse in lakehouses:
    lakehouse_list.append(lakehouse.name)

print(lakehouse_list)

[Screenshot: output of the code above in a notebook]

 

Anonymous
Not applicable

Hello @DennesTorres ,
We haven't heard from you since the last response and were just checking back to see if you have a resolution yet. Otherwise, please respond with more details and we will try to help.

Anonymous
Not applicable

Hi @DennesTorres,
We haven't heard from you since the last response and were just checking back to see if you have a resolution yet.
If you do have a resolution, please share it with the community, as it can be helpful to others.
If you have any question relating to the current thread, please let us know and we will try our best to help you.
If you have a question about a different issue, we request you to open a new thread.
