
Spark startup time with Python package install is too slow

Creating an environment gives you a nice UI to set a list of Python packages to install by default.


The user thinks: “Oh cool, this will be faster than doing %pip install every time I start a notebook.”


The reality is that it now takes 3 minutes to start Spark, instead of the 10 seconds it takes to run "%pip install".
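
For reference, the in-notebook alternative being compared against looks like this (a minimal sketch; the package name is purely illustrative):

# Notebook cell: the session starts from the starter pool in seconds,
# then the install runs in-session and only costs a normal pip install.
%pip install great-expectations   # illustrative package; any PyPI package works the same way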


Either fix the UI to warn users, or fix the startup time.


Status: Planned
Comments
vasu_n
New Member

We had the same problem. The moment we customize the node with PyPI packages, we lose the starter pool's instant-start feature and go back to the old 3-4 minute start time.


For some time, we used %pip to install the packages from inside the notebook.


And one fine day, all ETLs failed, stating that %pip magic commands are disabled inside pipelines.



Finally I managed to install the pip packages into the Lakehouse Files space like below:


!pip install googleads -t "/lakehouse/default/Files/PyPi Packages/"


and then in the notebook I add the lines below to put the Lakehouse folder on sys.path, so the package can be imported like a locally installed package:


import sys

sys.path.append('/lakehouse/default/Files/PyPi Packages/')



This is a workaround to avoid the high start time and still work with custom packages.
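
Put together, the pattern looks roughly like this; a minimal sketch assuming a default Lakehouse is attached to the notebook and the folder name matches the one used above:

# One-time cell (rerun only when dependencies change): install into the Lakehouse Files area.
# Quote the target path because it contains a space.
!pip install googleads -t "/lakehouse/default/Files/PyPi Packages/"

# At the start of every notebook run: make the Lakehouse folder importable.
import sys

pkg_dir = '/lakehouse/default/Files/PyPi Packages/'
if pkg_dir not in sys.path:
    sys.path.append(pkg_dir)

import googleads  # now imports as if it were installed locally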


Hope there will be a native fix for this in future Fabric runtimes.




Jonathan_Boarma
New Member

Databricks' newish serverless compute feature seems to directly compete with Fabric's temporary differentiator, which was fast compute session launches. Now Fabric is again behind Databricks in terms of offering clean options for maintaining a pool of properly configured compute; at least there, when you reserve compute, you get pools that are configured correctly. This is embarrassing, Microsoft!


If you are launching pipelines and need Python dependencies, get ready for seriously slow compute. 😕

nishalit
New Member
Thank you for sharing this idea. This feature is planned. Stay tuned.
fbcideas_migusr
New Member
Status changed to: Planned
 
Andrew_Hill_TRS
New Member
Forget 3 minutes - if you have a custom network (because for some reason your production is not open to the entire internet) and you add a library, you can expect a 10 to 15 minute start-up, which support says is "expected". It is currently documented at 'https://learn.microsoft.com/en-us/fabric/data-engineering/spark-compute' that you can expect 2 to 5 minutes extra for adding a library, but I find that I pay the full 5 minutes even for a little helper library that I can pip install in less than 30 seconds.
nielsvdc
Skilled Sharer

Currently, in our setup, startup time for a custom environment on a starter pool can take up to 18 minutes. This has been the case since the September release after FabCon Europe. 18 minutes is a lot of coffee drinking and not much productivity. I'm getting questions about when we are moving to Synapse or Databricks, which is not what we want from an architecture point of view.


Update: there has been a known issue since August: https://support.fabric.microsoft.com/known-issues/?active=true&fixed=true&sort=published&product=Dat...

Andrew_Hill_TRS
New Member
Anything more than 15 minutes is out of spec, and that would be for a custom-size, VNet-linked environment with a custom library added, each documented as *max* 5 minutes extra, with a baseline of 30 seconds to 2 minutes.