Although we can all agree that Polars/Pandas/DuckDB are faster on a single node, the Spark engine has a strong, nearly bug-free API with many efficient implementations (e.g., one of the best Delta/Iceberg/Hudi writers). It also offers stream processing across many mature data sources.
There are nearly 15 years' worth of rich tutorials on Spark patterns. Copilot/AI is extremely proficient at producing high-quality Spark code, and Data Engineers can become highly productive in Spark after a short ramp-up. It is also a great API to learn: all hyperscalers offer a managed Spark runtime, so it's easy to find a Data Engineering role (job security, etc.).
Spark Declarative Pipelines will also significantly lower the barrier to entry for writing ETL code: Spark Declarative Pipelines Programming Guide - Spark 4.1.0-preview1 Documentation
But due to the lack of a cheap/fast single-node runtime, many Data Engineers with smaller data volumes never take advantage of these awesome API investments in the Spark engine. Instead, they end up writing entire heterogeneous, API-centric codebases, when they could just use the Spark API and its innovations, so that the same code scales out with simple config changes when the job needs it, or runs in local mode when it doesn't.
Ideally, one day Fabric could offer a serverless Spark runtime. Until then, it might be fairly easy to tune up a single-node Spark runtime that is opinionated, with Spark configs that do not force shuffles. Perhaps NEE (the Native Execution Engine) could back this runtime to reduce the memory footprint as well.
OSS Apache Spark can run noticeably faster on a single node if you tweak a few settings. The problem is that you need to be a Spark expert to know which settings to tweak. It would be good if Fabric could do this for laypeople:
- How to Speed Up Spark Jobs on Small Test Datasets
- How to cut the run time of a Spark SBT test suite by 40% | by Matthew Powers | Medium
- Spark Standalone Mode - Spark 4.0.1 Documentation
For non-distributed datasets on a single VM, Fabric could provision a Spark runtime that is executor-heavy. For example, on a 6-core VM, it could do this:
Here's a sketch of an opinionated Spark config (the values are illustrative, not benchmarked):

```yaml
# Single-node, shuffle-averse defaults for a 6-core VM (illustrative values)
spark.master: "local[6]"                # one JVM, all 6 cores
spark.sql.shuffle.partitions: 1         # small data: avoid 200 tiny shuffle tasks
spark.default.parallelism: 6
spark.sql.adaptive.enabled: true        # let AQE coalesce partitions
spark.dynamicAllocation.enabled: false  # no cluster to scale out on
spark.shuffle.compress: false           # in-process shuffle: skip compression CPU
spark.shuffle.spill.compress: false
spark.rdd.compress: false
spark.ui.enabled: false                 # headless run: drop UI overhead
spark.ui.showConsoleProgress: false
```