Apache Spark has become a go-to framework for big data analytics, machine learning, and real-time processing. With Microsoft Fabric, Spark is now deeply integrated into a powerful, unified platform that brings together data engineering, data science, and business intelligence workflows.
This blog post walks you through everything you need to know about Spark in Microsoft Fabric—from foundational concepts to configuring pools, environments, and jobs—so you can build and scale data workloads effectively.
- What is Apache Spark?
- Core concepts of Spark in Fabric
- How to Configure Spark Settings in Microsoft Fabric
- Conclusion
What is Apache Spark?
Apache Spark is an open-source distributed computing engine designed for large-scale data processing. It allows developers and analysts to process data in parallel across multiple machines (nodes), making it dramatically faster than traditional single-machine approaches.
How Does Spark Work?
Each Spark cluster consists of multiple nodes. The driver node coordinates task execution, while the worker (executor) nodes run the tasks and hold data in memory for processing. Each worker node is allocated its own set of resources, such as CPU and memory.
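To make this concrete, here is a minimal PySpark sketch. It assumes you are in a Fabric notebook, where a `spark` session is already provided; the `getOrCreate()` call is included only to keep the snippet self-contained. The driver builds the query plan lazily, and the actual aggregation is split into tasks that run in parallel on the executors.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a Fabric notebook a SparkSession named `spark` already exists;
# getOrCreate() simply returns it (or creates one when run elsewhere).
spark = SparkSession.builder.getOrCreate()

# The driver only defines this DataFrame; nothing is computed yet (lazy evaluation).
df = spark.range(0, 10_000_000)

# The aggregation is broken into tasks that the executors process in parallel.
total = df.select(F.sum("id").alias("total")).collect()[0]["total"]
print(total)
```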
Node sizes in Microsoft Fabric
There are different node sizes in Microsoft Fabric. Here’s how they differ:
| Size | vCores | Memory |
|-----------|--------|--------|
| Small | 4 | 32 GB |
| Medium | 8 | 64 GB |
| Large | 16 | 128 GB |
| X-Large | 32 | 256 GB |
| XX-Large | 64 | 512 GB |
Please note that, as of now, the X-Large and XX-Large node sizes are only available for non-trial Fabric SKUs.
Number of nodes
The number of nodes you can have in your Spark pool depends on the capacity you are using. One Capacity Unit (CU) corresponds to two Apache Spark vCores. To calculate the total number of Spark vCores available to you, use the following formula:
Total Spark vCores = CU * 2 * 3 (burst multiplier)
So an F32 capacity gives you a total of 32 * 2 * 3 = 192 vCores.
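If you want to play with the formula, here is a small, purely illustrative Python helper; the SKU numbers plugged in below simply follow the formula above.

```python
# Total Spark vCores = Capacity Units (CU) * 2 vCores per CU * 3 (burst multiplier)
BURST_MULTIPLIER = 3

def total_spark_vcores(capacity_units: int) -> int:
    """Maximum Spark vCores available for a given Fabric capacity size."""
    return capacity_units * 2 * BURST_MULTIPLIER

print(total_spark_vcores(32))   # F32 -> 192 vCores
print(total_spark_vcores(64))   # F64 -> 384 vCores
```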
Core concepts of Spark in Fabric
Before we get into configuring the Spark settings in Microsoft Fabric, let's first look at two important core concepts of Spark.
Autoscaling
Autoscale dynamically adjusts the number of nodes in your Spark pool based on workload demand.
- Minimum Nodes: Keeps a baseline of resources available.
- Maximum Nodes: Caps the number of nodes to control cost.
Benefits
- Cost efficiency: Avoid paying for idle capacity.
- Flexibility: Scale up for large jobs, scale down for light workloads.
If autoscaling is disabled, the number of nodes remains fixed.
Dynamic allocation
Dynamic Allocation adjusts the number of executors during a Spark job based on current needs.
- Executors scale up when there are pending tasks.
- Executors scale down when they’re idle.
This complements autoscaling at the infrastructure level and helps improve performance at runtime.
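Under the hood, dynamic allocation maps to standard Apache Spark properties. In Fabric you normally control it through the pool or Environment settings rather than in code, so treat the following as a sketch of what those settings translate to, with illustrative values.

```python
from pyspark.sql import SparkSession

# Standard Spark properties behind dynamic allocation. The values here are
# illustrative; in Fabric they are typically driven by the pool/Environment
# configuration instead of being set manually like this.
spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")   # floor when idle
    .config("spark.dynamicAllocation.maxExecutors", "8")   # cap when busy
    .getOrCreate()
)

# Verify what the running session reports (falls back to "false" if unset):
print(spark.conf.get("spark.dynamicAllocation.enabled", "false"))
```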
How to Configure Spark Settings in Microsoft Fabric
Microsoft Fabric abstracts much of the Spark infrastructure while still giving you fine-grained control via five core components: Pools, Environments, Jobs, High concurrency, and Automatic log. We will not take a closer look at the automatic log option, as it is fairly self-explanatory. To reach the Spark settings, navigate to the workspace settings, open the "Data Engineering/Science" section, and select "Spark settings".

Spark Pools: Your Compute Backbone
A Spark Pool is a collection of compute resources (virtual machines) dedicated to Spark processing. Pools determine your job’s execution power, scaling behavior, and concurrency.
Starter Pool vs Custom Pool
A so-called starter pool is provided automatically. Starter pools are pre-configured and can only be customized to a limited extent: the node size is fixed to Medium and cannot be changed, and the only settings you can adjust are autoscaling and dynamic allocation. You cannot disable these two options, but you can change the number of nodes and executors. Starter pools use Spark clusters that are always running, so you do not have to wait for nodes to be set up.
Custom pools, on the other hand, can (as their name suggests) be highly customized: you can choose the node size and completely disable autoscaling and dynamic allocation.
Starter Pools are perfect for exploration and development. Custom Pools are essential for optimized performance, enterprise-level concurrency, and cost control.
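A quick way to see what the attached pool actually gives you is to read the standard executor properties from a notebook cell. This is a minimal sketch assuming the pre-created `spark` session of a Fabric notebook; the fallback values are just placeholders for properties that are not explicitly set.

```python
# Inspect the resources of the pool attached to this notebook. `spark` is the
# session Fabric pre-creates for you; the second argument to conf.get() is a
# fallback used when the property is not explicitly set.
print("Executor cores:     ", spark.conf.get("spark.executor.cores", "not set"))
print("Executor memory:    ", spark.conf.get("spark.executor.memory", "not set"))
print("Dynamic allocation: ", spark.conf.get("spark.dynamicAllocation.enabled", "not set"))
```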
Environments: Templating Spark Settings
Before we look at the Environment workspace settings, let’s first explain what an Environment is.
In Microsoft Fabric, an Environment is a special item that holds a set of settings used to run Spark tasks. These settings can include things like the Spark pool you want to use, the version of the Spark runtime, any custom Spark properties, and extra libraries your code might need.
You can think of an Environment as a “package” of settings. Once you create it, this package is saved in your Fabric workspace as its own item. You can share it with other users, and reuse it in different Spark notebooks or jobs.
The main benefit of creating different environments is that it gives you more control. For example, imagine you have multiple notebooks that each need different settings. Instead of changing the settings every time, you can simply assign each notebook its own environment with the right configuration.
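For example, suppose one of your Environments defines a custom Spark property called `spark.myapp.stage` (a made-up name for illustration). Any notebook attached to that Environment can read the value back at runtime, so the same notebook behaves differently depending on which Environment it runs with.

```python
# Hypothetical custom property defined in the attached Environment.
# The property name is invented for this example; the fallback "dev" is used
# when the notebook runs without that Environment attached.
stage = spark.conf.get("spark.myapp.stage", "dev")
print(f"Running with stage: {stage}")
```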
Jobs: Scheduled Spark Workflows
In the “Jobs” section there are two main options you can configure. The first is the Spark session timeout: the duration after which an inactive Spark session is terminated.
The second option relates to job admission. By default, Microsoft Fabric uses a method called optimistic job admission to manage how Spark jobs are started and scaled. This method decides whether there are enough resources (such as CPU cores) available to start and run a job.
This applies to:
- Jobs run from notebooks
- Jobs in Lakehouses
- Scheduled Spark jobs
Here’s how it works:
- Each Spark pool has a minimum number of nodes (computing units) it needs to run a job. Starter pools use 1 node by default. Custom pools let you choose the minimum.
- When you submit a job, Fabric checks if there are enough free cores in your workspace’s capacity to meet the job’s minimum requirement.
- If yes, the job starts right away.
- If no, the job may either be blocked (for notebooks) or queued (for batch jobs).
- Once running, the job starts with the minimum number of nodes and can scale up to use more—up to the maximum node setting—depending on how busy it gets.
- Fabric will only let the job scale up if there are still free cores available within your assigned capacity.
- If all cores are in use (meaning the system is at its max capacity), the job can’t scale up any further until another job finishes or is canceled.
Example:
Let’s say you’re using the Fabric F32 capacity and all jobs use the default starter pool:
- Without optimistic job admission: only 3 jobs could run at once, because each job reserves the maximum number of cores allowed by its pool up front.
- With optimistic job admission: up to 24 jobs could start at the same time, each beginning with only its minimum required cores and scaling up later if resources allow.
This smarter approach helps you run more jobs at once by starting small and growing only if resources allow.
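The arithmetic behind those numbers can be reproduced with the earlier formula. The assumptions are hedged: a Medium node provides 8 vCores, a starter-pool job starts with 1 node, and the 64-vCore per-job reservation is the figure implied by the "3 jobs" outcome.

```python
capacity_units = 32
total_vcores = capacity_units * 2 * 3           # 192 vCores including burst

# With optimistic job admission: each job only needs its minimum to start.
min_vcores_per_job = 1 * 8                      # 1 Medium node (8 vCores)
print(total_vcores // min_vcores_per_job)       # -> 24 concurrent jobs

# Without optimistic job admission: each job reserves its pool maximum up front.
reserved_vcores_per_job = 64                    # implied by the "3 jobs" figure
print(total_vcores // reserved_vcores_per_job)  # -> 3 concurrent jobs
```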
High Concurrency Mode: Collaborative Spark Processing
High concurrency mode enables multiple users or jobs to execute on the same Spark session simultaneously, with optimal resource utilization and minimal interference.
Benefits:
- Improved resource isolation for each session
- Efficient multi-user support in shared environments
- Ideal for notebook collaboration, streaming, or batch jobs
✅ Pro Tip: Always enable high concurrency for custom pools in collaborative or production scenarios. It prevents one user’s job from monopolizing the entire session.
Conclusion
Microsoft Fabric brings the power of Apache Spark into a fully integrated, cloud-native experience. Whether you’re a seasoned data engineer or a newcomer exploring notebooks for the first time, understanding how to configure and optimize Spark settings is crucial.
With custom pools, autoscaling, dynamic allocation, and high concurrency, you can tailor your Spark environment to fit both experimental work and enterprise-scale pipelines—all while managing performance and cost effectively.
Start small. Tune carefully. Scale confidently.



