Understanding Linux Process Scheduling

Every running program on Linux is a process, and each logical CPU can execute only one task (a process or one of its threads) at a time. The kernel's scheduler decides which task gets the CPU, for how long, and, on multi-core systems, which core runs it.

1. The Completely Fair Scheduler (CFS)

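From kernel 2.6.23 until kernel 6.6 (which replaced it with the EEVDF scheduler), Linux used CFS for normal tasks. The core idea: give each runnable process a fair share of the CPU. CFS maintains a red-black tree of runnable processes keyed by virtual runtime (vruntime), a weighted measure of how much CPU time each process has already received. The scheduler always picks the process with the smallest vruntime, which is the leftmost node in the tree.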

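# CFS tuning parameters (kernels before 5.13; newer kernels expose them under /sys/kernel/debug/sched/)
# Defaults scale with CPU count; the values shown are typical for machines with 8 or more CPUs
cat /proc/sys/kernel/sched_latency_ns            # 24000000 (24 ms); unscaled base is 6 ms
cat /proc/sys/kernel/sched_min_granularity_ns    # 3000000 (3 ms); unscaled base is 0.75 ms
cat /proc/sys/kernel/sched_wakeup_granularity_ns # 4000000 (4 ms); unscaled base is 1 ms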
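
The pick-the-smallest-vruntime rule can be illustrated with a toy model. This is only a sketch, not kernel code: the function name simulate_cfs, the "name:weight" input format, and the one-unit tick are assumptions made for illustration, and a linear scan stands in for the red-black tree.

```shell
#!/bin/sh
# Toy model of CFS task selection (illustrative only, not kernel code).
# Usage: simulate_cfs <ticks> name:weight [name:weight ...]
# Each tick, the task with the smallest vruntime runs for one time unit
# and its vruntime advances by 1024 / weight, so heavier tasks
# accumulate vruntime more slowly and are picked more often.
simulate_cfs() {
    ticks=$1
    shift
    printf '%s\n' "$@" | awk -F: -v ticks="$ticks" '
        { name[NR] = $1; weight[NR] = $2; vr[NR] = 0; ran[NR] = 0 }
        END {
            for (t = 0; t < ticks; t++) {
                pick = 1
                for (i = 2; i <= NR; i++)
                    if (vr[i] < vr[pick]) pick = i   # stands in for the leftmost tree node
                vr[pick] += 1024 / weight[pick]
                ran[pick]++
            }
            for (i = 1; i <= NR; i++) printf "%s=%d\n", name[i], ran[i]
        }'
}

# Task b has twice the weight of task a, so it runs twice as often:
simulate_cfs 300 a:1024 b:2048
# a=100
# b=200
```

The real scheduler also accounts for sleeping tasks, wakeup placement, and preemption granularity, none of which this model attempts.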

2. Nice Values and Priority

Nice values range from -20 (highest priority) to +19 (lowest priority). The default is 0. Lowering a nice value below its current setting normally requires root or CAP_SYS_NICE.

# Start a command with nice value 10
nice -n 10 ./my-app
# Change the nice value of a running process to 10
renice -n 10 -p <pid>

# Check current nice values
ps -eo pid,nice,pri,cmd | head -20

How nice values affect CFS: each one-step increase in the nice value divides the task's weight by about 1.25, and CPU time under contention is split in proportion to weight:

Nice Value   Weight   CPU Share (vs nice 0)
-20          88761    ~87x more
-10          9548     ~9.3x more
0            1024     1x (baseline)
10           110      ~0.11x
19           15       ~0.015x
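
The relationship between nice values and weights can be approximated in a few lines. This is a sketch under stated assumptions: approx_weight and cpu_share_pct are hypothetical helper names, and the formula 1024 / 1.25^nice only approximates the kernel's precomputed weight table, so results for negative nice values differ slightly from the figures above.

```shell
#!/bin/sh
# Approximate CFS weight for a nice value: 1024 / 1.25^nice.
# The kernel uses a precomputed table, so this is close but not exact.
approx_weight() {
    awk -v n="$1" 'BEGIN { printf "%.0f\n", 1024 / (1.25 ^ n) }'
}

# Percentage of one CPU the first of two always-runnable tasks receives
# when both compete for the same core.
cpu_share_pct() {
    w1=$(approx_weight "$1")
    w2=$(approx_weight "$2")
    awk -v a="$w1" -v b="$w2" 'BEGIN { printf "%.1f\n", 100 * a / (a + b) }'
}

approx_weight 10      # 110
approx_weight 19      # 15
cpu_share_pct 0 10    # 90.3 (nice 0 competing with nice 10)
```

So a nice-10 task sharing a core with a busy nice-0 task still gets roughly a tenth of the CPU; nice lowers a task's share but never starves it.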

3. cgroups CPU Controller

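cgroups (control groups) apply resource limits to groups of processes; they are the mechanism Docker and Kubernetes use to enforce CPU limits. The paths below are for cgroup v1; cgroup v2 exposes the equivalent controls as cpu.weight and cpu.max.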

# CPU shares (relative weight)
echo 512 > /sys/fs/cgroup/cpu/mygroup/cpu.shares

# CPU quota (hard limit: 50 ms of CPU time per 100 ms period, i.e. at most 0.5 cores)
echo 50000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_period_us

cpu.shares vs cpu.cfs_quota_us:

Setting            Behavior                                             Use Case
cpu.shares         Proportional (soft); matters only under contention   Allocating CPU across services
cpu.cfs_quota_us   Absolute (hard) limit, enforced every period         Capping a service so it cannot starve its neighbors

In Docker:

docker run --cpu-shares=512 my-image      # relative weight (half the default of 1024)
docker run --cpus=0.5 my-image            # hard cap: CPU time equal to 50% of one core
docker run --cpus=2 my-image              # hard cap: CPU time equal to 2 full cores
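
The --cpus values map onto the quota and period settings shown earlier: Docker documents --cpus=1.5 as equivalent to a 100000 us period with a 150000 us quota. A minimal sketch of that arithmetic follows; cpus_to_quota is a hypothetical helper name, not a Docker or kernel tool.

```shell
#!/bin/sh
# Convert a CPU count (possibly fractional) into a CFS quota in microseconds.
# The second argument is the period in microseconds; it defaults to 100000 (100 ms).
cpus_to_quota() {
    awk -v cpus="$1" -v period="${2:-100000}" \
        'BEGIN { printf "%.0f\n", cpus * period }'
}

cpus_to_quota 0.5          # 50000
cpus_to_quota 2            # 200000
cpus_to_quota 0.5 50000    # 25000 (same cap, shorter period)
```

A quota larger than the period is valid and simply means the group may use more than one CPU's worth of time per period.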

Real example: Three containers with shares 1024, 1024, 2048. Under contention: A gets 25%, B gets 25%, C gets 50%. If C is idle, A and B each get up to 50%.
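
The percentages in this example are just each group's shares divided by the total shares of the groups that are currently busy. A small sketch makes that explicit; share_pct is a hypothetical helper, and it assumes every listed group is CPU-bound and competing for the same CPUs.

```shell
#!/bin/sh
# Print the CPU percentage each competing group receives, given its cpu.shares.
# Only groups that are actually busy should be listed; idle groups do not count.
share_pct() {
    printf '%s\n' "$@" | awk '
        { s[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++)
                printf "%s%.0f", (i > 1 ? " " : ""), 100 * s[i] / total
            print ""
        }'
}

share_pct 1024 1024 2048   # 25 25 50  (A, B, and C all busy)
share_pct 1024 1024        # 50 50     (C idle, so A and B split the CPU)
```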

4. Real-Time Scheduling Policies

# Check scheduling policy
chrt -p <pid>

# Set FIFO real-time priority 50
chrt -f -p 50 <pid>

# Set Round-Robin real-time priority 80
chrt -r -p 80 <pid>

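# Show the valid priority range for each policy
chrt -m

Policy        Flag   Behavior
SCHED_OTHER   -o     Default CFS policy; fair sharing
SCHED_BATCH   -b     Like OTHER, but assumed CPU-bound and mildly disfavored on wakeup
SCHED_IDLE    -i     Very low priority; runs only when nothing else wants the CPU
SCHED_FIFO    -f     First-in, first-out real-time; preempts all non-real-time tasks and runs until it blocks, yields, or a higher-priority real-time task preempts it
SCHED_RR      -r     Round-robin real-time (FIFO with time slices among tasks of equal priority)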

5. NUMA and CPU Pinning

# Check NUMA topology
numactl --hardware

# Run on specific NUMA node
numactl --cpunodebind=0 --membind=0 ./my-app

# Pin to specific CPUs
taskset -c 0-3 ./my-app
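
When checking a pinning setup it helps to know how many CPUs a list such as 0-3 really covers. The helper below is a sketch; count_cpus is a hypothetical name, and it assumes the comma-separated range syntax that taskset -c accepts, without stride suffixes.

```shell
#!/bin/sh
# Count the CPUs named by a CPU list such as "0-3,8" (ranges and single IDs).
count_cpus() {
    printf '%s\n' "$1" | awk -F, '{
        n = 0
        for (i = 1; i <= NF; i++) {
            if (split($i, r, "-") == 2)
                n += r[2] - r[1] + 1   # a range like 0-3 covers 4 CPUs
            else
                n += 1                 # a single CPU ID
        }
        print n
    }'
}

count_cpus "0-3"        # 4
count_cpus "0-3,8"      # 5
count_cpus "0,2,4-7"    # 6
```

On Linux, the list a process is currently allowed to run on appears in the Cpus_allowed_list field of /proc/<pid>/status, so its value can be passed straight to this function.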

6. Debugging Scheduling

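# Per-process context switches: cswch/s is voluntary, nvcswch/s is involuntary (5 samples, 1 s apart)
pidstat -w -p <pid> 1 5

# System-wide context switches (cs column) and run queue length (r column)
vmstat 1 5

# Load averages; the fourth field is runnable/total tasks
# (load also counts tasks in uninterruptible sleep, so it is not purely run queue length)
cat /proc/loadavg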

# CPU migration stats
cat /proc/<pid>/sched | grep -E 'nr_switches|nr_migrations'
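
The same counters are available without sysstat from /proc/<pid>/status, in the voluntary_ctxt_switches and nonvoluntary_ctxt_switches fields. The sketch below summarizes them; ctxt_summary is a hypothetical helper name, and it takes a file path so it can be pointed at any process's status file.

```shell
#!/bin/sh
# Summarize voluntary vs involuntary context switches from a /proc/<pid>/status file.
ctxt_summary() {
    awk '
        /^voluntary_ctxt_switches:/    { v = $2 }
        /^nonvoluntary_ctxt_switches:/ { nv = $2 }
        END {
            total = v + nv
            pct = (total > 0) ? 100 * nv / total : 0
            printf "voluntary=%d involuntary=%d involuntary_pct=%.0f\n", v, nv, pct
        }' "$1"
}

# Example: inspect the process reading the file (only where /proc is available)
if [ -r /proc/self/status ]; then
    ctxt_summary /proc/self/status
fi
```

As a rough guide, a high involuntary percentage suggests the task is being preempted and is competing for CPU, while mostly voluntary switches suggest it is blocking on I/O, locks, or sleeps.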

Quick Reference

chrt -p <pid>                     # check scheduling policy
taskset -cp 0-3 <pid>             # set CPU affinity
cat /sys/fs/cgroup/cpu/.../cpu.*  # check cgroup limits
mpstat -P ALL 1 5                 # per-CPU utilization

Understanding the scheduler isn't academic — it directly affects how your application behaves under load. When that "mysterious latency spike" occurs, knowing whether it's CPU contention, a scheduling policy issue, or a cgroup limit is the difference between hours of guesswork and minutes of diagnosis.