
Provide an opinionated and tuned Spark Single Node runtime in Fabric

Although we all agree that Polars/Pandas/DuckDB are faster on a single node, the Spark Engine has a strong, nearly bug-free API with many efficient implementations (e.g. one of the best Delta/Iceberg/Hudi writers). It also offers stream processing across many mature data sources.

 

There's nearly 15 years' worth of rich tutorials on Spark patterns. Copilot/AI is extremely proficient at producing high-quality Spark code. Data Engineers can become highly productive in Spark after a very short ramp-up. It is also a great API to learn, since every hyperscaler offers a managed Spark runtime, so it's easy to find a Data Engineering role (job security, etc.).

 

Spark Declarative Pipelines will also significantly reduce the barrier to entry for writing ETL code: Spark Declarative Pipelines Programming Guide - Spark 4.1.0-preview1 Documentation

 

But due to the lack of a cheap/fast single-node runtime, many Data Engineers with smaller data volumes end up writing entire heterogeneous, API-centric codebases instead of taking advantage of these awesome investments in the Spark Engine. They could just use the Spark API and its innovations, so the same code scales out with simple config changes when the job needs it, and runs in local mode when it doesn't (see the sketch below).
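
As a rough illustration of that point, here's a minimal sketch (table paths and names are hypothetical) of how the same PySpark pipeline code can run single-node or scale out, with the difference confined to session configuration:

```python
from pyspark.sql import SparkSession

def get_session(single_node: bool = True) -> SparkSession:
    """Build a SparkSession; only the configuration differs between modes."""
    builder = SparkSession.builder.appName("small-batch-etl")
    if single_node:
        builder = (
            builder.master("local[*]")                    # use all cores on one VM
            .config("spark.sql.shuffle.partitions", "1")  # small data: no shuffle fan-out
        )
    # In a scale-out deployment, master/cores/memory come from the cluster
    # runtime instead; the pipeline code below stays the same.
    return builder.getOrCreate()

spark = get_session(single_node=True)

# Hypothetical example pipeline: identical in local and cluster modes.
orders = spark.read.format("delta").load("Tables/orders")
daily = orders.groupBy("order_date").count()
daily.write.format("delta").mode("overwrite").save("Tables/orders_daily")
```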

 

Ideally, one day Fabric could offer a serverless Spark runtime, but until then, it might be fairly easy to stand up a single-node Spark runtime that is opinionated, with Spark configs that do not force shuffles. Perhaps NEE (the Native Execution Engine) could back this runtime to reduce the memory footprint as well.

 

OSS Apache Spark can run noticeably faster on a single node if you tweak a few settings.

The problem is, you need to be a Spark expert to know which settings matter and how to tune them.

It'd be good if Fabric could do this for non-experts.

 

How to Speed Up Spark Jobs on Small Test Datasets

How to cut the run time of a Spark SBT test suite by 40% | by Matthew Powers | Medium

Setting Shuffle Partitions in order to limit the number of Tasks by qnob · Pull Request #291 · holde...

Spark Standalone Mode - Spark 4.0.1 Documentation

 

For non-distributed datasets on a single VM, Fabric could provision a Spark runtime that is executor-heavy. On a 6-core VM, for example, it could do something like this:

 

Here's an example of an opinionated Spark config:


```yaml
spark:
  local: true              # Run Spark in local[*] mode
  ui: false                # Turn off the Spark UI to save resources
  offHeapEnabled: false    # Disable off-heap memory
  driverCore: 1            # The driver doesn't do much; reduce it further if needed
  driverMemory: "512mb"    # Keep the driver small for the same reason
  executorCore: 5          # The executor does most of the work (I'm not sure if odd numbers are allowed 🙂)
  executorMemory: "11.5g"  # Give the executor the bulk of the VM's memory
  shufflePartitions: 1     # Shuffles aren't needed at small data sizes
```
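
For a rough sense of how that opinionated profile might map onto standard Spark settings, here's a sketch using real conf keys (memory figures are illustrative, not a recommendation). One caveat such a runtime would need to bake in: in local mode the executor runs inside the driver JVM, so the driver memory setting is the one that actually governs the heap, and it has to be applied at launch time rather than on an already-running session.

```python
from pyspark.sql import SparkSession

# Sketch only: an approximate translation of the YAML above into standard
# Spark conf keys for a 6-core VM. Exact values depend on the VM and any
# runtime overhead (e.g. NEE).
spark = (
    SparkSession.builder
    .appName("fabric-single-node-sketch")
    .master("local[5]")                               # local mode, 5 worker threads
    .config("spark.ui.enabled", "false")              # no Spark UI
    .config("spark.memory.offHeap.enabled", "false")  # no off-heap memory
    .config("spark.driver.memory", "11g")             # in local mode the executor lives in the
                                                      # driver JVM, so this is the heap that
                                                      # matters; set it at JVM launch
                                                      # (spark-submit / spark-defaults), not
                                                      # on a live session
    .config("spark.sql.shuffle.partitions", "1")      # small data: a single shuffle partition
    .getOrCreate()
)
```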
Status: New
Comments
SJCuthbertson
Advocate I
I'd definitely try this out if it existed. I think the 'cheap' aspect here is not emphasised enough: I use plain-Python notebooks with polars/duckdb not just because my datasets are mostly pretty small, but because I'm running our entire Fabric estate (the whole prod env) on an F4 (originally an F2). I want to be really miserly with my CU consumption. I don't actually care about the 'fast' aspect quite as much; I'd generally keep using plain Python notebooks + polars even if it's slower (unless we're talking whole orders of magnitude slower), if the total CU(s) is lower than the alternative (even if it's only _slightly_ lower). Given this example suggests 6 cores (vs 2 cores for the default Python notebook), I naively reckon I'd need a single-node Spark implementation to run in 1/3 of the time for the total CU(s) to be the same? I've no idea if that's achievable or not, just making the observation that this is what it'd take to get me to adopt this new feature, if it were implemented.
mdrakiburrahman
Microsoft Employee

100% agreed with @SJCuthbertson's point. Most of these are batch pipelines that are triggered; it doesn't even matter if they run 10 minutes slower than DuckDB as long as the CU usage is predictable and low. We can totally just jam Spark onto a tiny little VM to achieve this low-CU setup, throw it against a DAG (a series of transformations), let it rip on its merry way, and keep the Executor threads running 100% hot.

 

Then, we can throw NEE at the problem to reduce the runtime, while running 100% hot.

When SDP is available on Fabric, this would be even easier to do by throwing everything into a YAML: Spark Declarative Pipelines Programming Guide - Spark 4.1.0-preview1 Documentation