Every running program on Linux is a process, and every CPU core can run exactly one process at a time. The kernel's scheduler decides which process gets the CPU, for how long, and on multi-core systems which core runs it.
1. The Completely Fair Scheduler (CFS)
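From kernel 2.6.23 until kernel 6.6, when the EEVDF scheduler replaced it, Linux used CFS as its default scheduler. The core idea: give each runnable process a fair share of the CPU. CFS maintains a red-black tree of runnable processes keyed by virtual runtime (vruntime), which is the CPU time a process has consumed, scaled by its weight. The scheduler always picks the process with the smallest vruntime, which is the leftmost node of the tree.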
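The "smallest vruntime wins" rule can be illustrated with a toy simulation. This is a minimal sketch, not the kernel's implementation: the function name `simulate_cfs` is made up for illustration, a linear scan stands in for the red-black tree, every task is assumed to be runnable all the time, and each task's vruntime is assumed to advance by 1024 / weight per tick.

```shell
# simulate_cfs TICKS WEIGHT...
# Toy model: each tick, run the task with the smallest vruntime,
# then advance its vruntime by 1024 / weight (heavier tasks age more slowly).
# Prints how many ticks each task ran, in argument order.
simulate_cfs() (
  ticks=$1
  shift
  echo "$@" | awk -v ticks="$ticks" '{
    n = NF
    for (i = 1; i <= n; i++) { weight[i] = $i; vruntime[i] = 0; ran[i] = 0 }
    for (t = 0; t < ticks; t++) {
      # Linear scan for the minimum; the lowest index wins ties.
      best = 1
      for (i = 2; i <= n; i++) if (vruntime[i] < vruntime[best]) best = i
      ran[best]++
      vruntime[best] += 1024 / weight[best]
    }
    for (i = 1; i <= n; i++) printf "%s%d", (i > 1 ? " " : ""), ran[i]
    printf "\n"
  }'
)

simulate_cfs 400 1024 1024 2048   # prints: 100 100 200
```

The task with double the weight ends up with double the CPU time, even though the picker never looks at weights directly; it only compares vruntime.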
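# CFS tuning parameters (defaults shown are for machines with 8 or more CPUs; they scale down on smaller systems)
# On kernel 5.13 and later, look under /sys/kernel/debug/sched/ instead (latency_ns, min_granularity_ns, wakeup_granularity_ns)
cat /proc/sys/kernel/sched_latency_ns # default: 24000000 (24ms)
cat /proc/sys/kernel/sched_min_granularity_ns # default: 3000000 (3ms)
cat /proc/sys/kernel/sched_wakeup_granularity_ns # default: 4000000 (4ms)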
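How the first two tunables interact can be sketched with a little arithmetic. The sketch assumes the classic CFS rule: every runnable task should run once per `sched_latency_ns` period, unless so many tasks are runnable that each slice would drop below `sched_min_granularity_ns`, in which case the period stretches. The function name `cfs_period_ns` is hypothetical, and the slice calculation assumes equal weights.

```shell
# cfs_period_ns LATENCY_NS MIN_GRANULARITY_NS NR_RUNNING
# Prints the scheduling period for NR_RUNNING runnable tasks.
cfs_period_ns() (
  latency=$1
  min_gran=$2
  nr=$3
  # Number of tasks that fit in one latency period (rounded up).
  nr_latency=$(( (latency + min_gran - 1) / min_gran ))
  if [ "$nr" -gt "$nr_latency" ]; then
    # Too many tasks: stretch the period so each still gets min_gran.
    echo $(( nr * min_gran ))
  else
    echo "$latency"
  fi
)

cfs_period_ns 24000000 3000000 4    # prints: 24000000 (6ms per task)
cfs_period_ns 24000000 3000000 16   # prints: 48000000 (3ms per task)
```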
2. Nice Values and Priority
nice values range from -20 (highest priority) to +19 (lowest priority). Default is 0.
# Set nice value
nice -n 10 ./my-app
renice -n 10 -p <pid>
# Check current nice values
ps -eo pid,nice,pri,cmd | head -20
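How nice values affect CFS: each nice level maps to a weight, and each one-step increase in nice divides the weight by ~1.25. Under contention, CPU time is split in proportion to weight: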
| Nice Value | Weight | CPU Time (vs a competing nice-0 task) |
|---|---|---|
| -20 | 88761 | ~87x more |
| -10 | 9548 | ~9.3x more |
| 0 | 1024 | 1x (baseline) |
| 10 | 110 | ~0.11x |
| 19 | 15 | ~0.015x |
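The weights in the table follow a geometric progression closely enough to approximate with a one-liner. This sketch assumes the relation weight ≈ 1024 / 1.25^nice; the function name `nice_to_weight_approx` is made up for illustration. The kernel uses a hand-tuned lookup table, so the formula is only an approximation, particularly for negative nice values.

```shell
# nice_to_weight_approx NICE
# Approximates the scheduler weight for a nice value as 1024 / 1.25^NICE.
nice_to_weight_approx() {
  awk -v n="$1" 'BEGIN { printf "%.0f\n", 1024 / (1.25 ^ n) }'
}

nice_to_weight_approx 0    # prints: 1024
nice_to_weight_approx 10   # prints: 110
nice_to_weight_approx 19   # prints: 15
```

For nice -10 the formula gives about 9537 where the table has 9548, which shows that the table is the source of truth and the formula is a rule of thumb.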
3. cgroups CPU Controller
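cgroups (control groups) limit and account for the resources used by groups of processes. They are the mechanism Docker and Kubernetes use to enforce CPU limits. The paths below are for cgroup v1; cgroup v2 exposes the equivalent controls as cpu.weight and cpu.max: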
# CPU shares (relative weight)
echo 512 > /sys/fs/cgroup/cpu/mygroup/cpu.shares
# CPU quota (hard limit: max 0.5 cores)
echo 50000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_period_us
cpu.shares vs cpu.cfs_quota_us:
| Setting | Behavior | Use Case |
|---|---|---|
| cpu.shares | Proportional (soft): matters only under contention | Allocating CPU across services |
| cpu.cfs_quota_us | Absolute (hard) limit | Capping a service so it cannot starve its neighbors |
In Docker:
docker run --cpu-shares=512 my-image # relative weight
docker run --cpus=0.5 my-image # 50% of one core
docker run --cpus=2 my-image # 2 full cores
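The `--cpus` flag is shorthand for a quota/period pair like the one written by hand in the cgroup example. A minimal sketch of the conversion, assuming the default period of 100000 microseconds (the function name `cpus_to_quota_us` is hypothetical):

```shell
# cpus_to_quota_us CPUS [PERIOD_US]
# Converts a fractional CPU count into a CFS quota in microseconds.
# PERIOD_US defaults to 100000 (100ms).
cpus_to_quota_us() {
  awk -v cpus="$1" -v period="${2:-100000}" \
    'BEGIN { printf "%d\n", cpus * period + 0.5 }'
}

cpus_to_quota_us 0.5   # prints: 50000
cpus_to_quota_us 2     # prints: 200000
```

A quota larger than the period is valid: it lets the group's threads use more than one core's worth of time in each period.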
Real example: Three containers with shares 1024, 1024, 2048. Under contention: A gets 25%, B gets 25%, C gets 50%. If C is idle, A and B each get up to 50%.
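The percentages in this example come from dividing one group's shares by the total shares of every group that currently wants CPU. A small sketch of that arithmetic (the function name `share_pct` is made up for illustration; results are truncated to whole percentages):

```shell
# share_pct MINE [OTHER...]
# The first argument is the group of interest. All arguments, including
# the first, are the shares of the groups currently competing for CPU.
share_pct() (
  mine=$1
  total=0
  for s in "$@"; do
    total=$(( total + s ))
  done
  echo $(( mine * 100 / total ))
)

share_pct 1024 1024 2048   # A while A, B, C are all busy: prints 25
share_pct 2048 1024 1024   # C while A, B, C are all busy: prints 50
share_pct 1024 1024        # A while C is idle: prints 50
```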
4. Real-Time Scheduling Policies
# Check scheduling policy
chrt -p <pid>
# Set FIFO real-time priority 50
chrt -f -p 50 <pid>
# Set Round-Robin real-time priority 80
chrt -r -p 80 <pid>
# List all policies
chrt -m
| Policy | Flag | Behavior |
|---|---|---|
| SCHED_OTHER | -o (default) | CFS: fair sharing |
| SCHED_BATCH | -b | Like OTHER, but treated as CPU-bound: it does not preempt the running task on wakeup |
| SCHED_IDLE | -i | Very low priority: runs only when nothing else wants the CPU |
| SCHED_FIFO | -f | First-in, first-out real-time (preempts all OTHER) |
| SCHED_RR | -r | Round-robin real-time (FIFO with time slices) |
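Some kernel interfaces report the policy as a number rather than a name. The helper below is a small sketch that maps those numbers back to the names in the table; the function name `policy_name` is hypothetical, and the numeric values are the ones defined in the kernel's scheduling headers.

```shell
# policy_name NUMBER
# Maps a numeric scheduling policy to its symbolic name.
policy_name() {
  case $1 in
    0) echo SCHED_OTHER ;;
    1) echo SCHED_FIFO ;;
    2) echo SCHED_RR ;;
    3) echo SCHED_BATCH ;;
    5) echo SCHED_IDLE ;;
    6) echo SCHED_DEADLINE ;;   # not covered in the table above
    *) echo "unknown($1)" ;;
  esac
}

policy_name 1   # prints: SCHED_FIFO
```

On kernels that expose scheduler debug information, the number can be read from the `policy` line of `/proc/<pid>/sched` and passed to the helper.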
5. NUMA and CPU Pinning
# Check NUMA topology
numactl --hardware
# Run on specific NUMA node
numactl --cpunodebind=0 --membind=0 ./my-app
# Pin to specific CPUs
taskset -c 0-3 ./my-app
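The `0-3` argument uses the kernel's CPU list notation, which also allows comma-separated entries such as `0-3,8`. When scripting around pinning, it can help to expand such a list into individual CPU numbers. A minimal sketch, assuming well-formed input without stride syntax (the function name `expand_cpu_list` is made up for illustration):

```shell
# expand_cpu_list LIST
# Expands a CPU list such as "0-3,8" into "0 1 2 3 8".
expand_cpu_list() (
  out=""
  IFS=,
  for part in $1; do
    case $part in
      *-*)
        # A range: walk from the start to the end, inclusive.
        i=${part%-*}
        end=${part#*-}
        while [ "$i" -le "$end" ]; do
          out="$out $i"
          i=$(( i + 1 ))
        done
        ;;
      *)
        out="$out $part"
        ;;
    esac
  done
  echo "${out# }"
)

expand_cpu_list "0-3,8"   # prints: 0 1 2 3 8
```

The same notation appears in `/sys/devices/system/cpu/online`, so `expand_cpu_list "$(cat /sys/devices/system/cpu/online)"` lists every online CPU.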
6. Debugging Scheduling
# Context switches
pidstat -w -p <pid> 1 5
# System-wide context switches
vmstat 1 5
# Load averages, plus runnable/total task counts in the fourth field
cat /proc/loadavg
# CPU migration stats
cat /proc/<pid>/sched | grep -E 'nr_switches|nr_migrations'
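The fourth field of `/proc/loadavg` has the form `runnable/total`, and the first number is a quick instantaneous view of run queue pressure. A small sketch that extracts it from a loadavg-style line on standard input (the function name `runnable_tasks` is hypothetical, and the sample line is made up for illustration):

```shell
# runnable_tasks
# Reads a /proc/loadavg-style line on stdin and prints the number of
# currently runnable scheduling entities (left half of the fourth field).
runnable_tasks() {
  awk '{ split($4, counts, "/"); print counts[1] }'
}

echo "0.52 0.58 0.59 3/812 12345" | runnable_tasks   # prints: 3
```

On a live system, `runnable_tasks < /proc/loadavg` gives the current value. A number that stays above the CPU count suggests tasks are waiting for a core.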
Quick Reference
chrt -p <pid> # check scheduling policy
taskset -cp 0-3 <pid> # set CPU affinity
cat /sys/fs/cgroup/cpu/.../cpu.* # check cgroup limits
mpstat -P ALL 1 5 # per-CPU utilization
Understanding the scheduler isn't academic — it directly affects how your application behaves under load. When that "mysterious latency spike" occurs, knowing whether it's CPU contention, a scheduling policy issue, or a cgroup limit is the difference between hours of guesswork and minutes of diagnosis.