Go Internals - Scheduler
Introduction
In Go Internals - Essentials we mapped the runtime layers and pointed at Scheduler as the first deep dive: how goroutines multiplex onto OS threads, what G (goroutine), M (machine), and P (processor) mean, and why concurrent Go programs behave the way they do under load.
This article is that dive. We stay on Go 1.23, read the real scheduler code under src/runtime/, and connect language-level concurrency (go, channels, sync) to the invariants the runtime enforces. Memory layout, stacks, and the allocator come next in Memory; here the focus is who runs what, on which thread, and when control switches.
The G/M/P (goroutine / machine / processor) model
Modern Go uses a work-stealing scheduler with three central abstractions (see runtime2.go and comments in proc.go):
- G (goroutine): user stack, registers, status, and a link to its M while running.
- M (machine): an OS thread that can execute Go (or runtime) code when bound to a P.
- P (processor): execution context required to run Go code; holds a local run queue and scheduler state.
flowchart TB
subgraph logical["Logical parallelism (GOMAXPROCS)"]
P0["P₀ local runq"]
P1["P₁ local runq"]
P2["P₂ local runq"]
end
subgraph threads["OS threads"]
M0["M₀"]
M1["M₁"]
M2["M₂"]
M3["M₃ blocking syscall"]
end
subgraph goroutines["Goroutines"]
G1["G"]
G2["G"]
G3["G"]
G4["G waiting"]
end
P0 --> M0
P1 --> M1
P2 --> M2
P0 --> G1
P0 --> G2
P1 --> G3
G4 -.->|"parked on channel/netpoll"| M3
Rule that surprises newcomers: Go user code runs only when an M (machine) holds a P (processor). If an M (machine) blocks in a syscall, it gives up its P (processor), either immediately or when sysmon retakes it. The blocked G (goroutine) stays with that M (machine), while another M (machine) is woken or created to run the remaining runnable work on the freed P (processor), up to limits the runtime manages.
GOMAXPROCS (default: number of logical CPUs) is the count of Ps (processors). That is the ceiling on goroutines executing Go code at the same time (Go compiles to native machine code, so there is no bytecode involved). More goroutines than GOMAXPROCS is fine; they time-slice.
package main
import (
	"fmt"
	"runtime"
)
func main() {
	fmt.Println(runtime.GOMAXPROCS(0)) // an argument of 0 queries the current setting without changing it
}
Setting GOMAXPROCS does not cap goroutine count; it caps parallel execution of Go code on distinct Ps (processors).
Goroutines are not OS threads
A goroutine is a lightweight, runtime-managed execution context. You start one with go f(). An OS thread (M (machine) in runtime jargon) is what the kernel schedules; creating one is comparatively expensive (megabytes of stack reserve, kernel bookkeeping).
Go's bet is many goroutines, few threads:
| | Goroutine (G) | OS thread (M) |
|---|---|---|
| Created by | go statement / runtime | runtime via clone / platform thread APIs |
| Typical cost | Small stack (grows on demand), metadata in runtime structures | Kernel scheduling, larger fixed stack mapping |
| Count in production | Thousands to millions is normal | Roughly GOMAXPROCS plus a few for blocking I/O and sysmon |
| Scheduling | Go scheduler (runtime) | OS kernel |
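package main
import "time"
func main() {
	for i := 0; i < 100_000; i++ {
		go func() {}() // each goroutine returns immediately
	}
	time.Sleep(time.Second) // crude wait so the process does not exit first
}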
Spawning 100,000 goroutines is boring in Go; spawning 100,000 threads is not. The runtime amortizes thread usage by parking goroutines when they block and reusing the same M (machine) for other runnable Gs (goroutines).
That does not mean goroutines are free. Each G (goroutine) has a stack, lives in scheduler queues, and may hold locks. The scheduler's job is to keep Ms (machines) busy without letting runnable work pile up unbounded while others starve.
Goroutine states (simplified)
The full state machine in runtime2.go has more substates; for reading the scheduler, this mental model is enough:
| State (concept) | Meaning |
|---|---|
| Runnable | Ready to run; sitting on a run queue or about to be stolen |
| Running | Executing on an M (machine) with a P (processor) |
| Waiting | Blocked: channel, lock, sleep, network, syscall handoff, etc. |
| Dead | Finished (or not yet started); the G struct is kept on a free list so a later go statement can reuse it |
Transitions happen in functions like gopark, goready, and the core schedule loop in proc.go. When you execute ch <- v and block because the buffer is full, your G (goroutine) moves from _Grunning to _Gwaiting and another G (goroutine) on that P (processor) runs.
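The waiting state is visible from user code, because the goroutine dump from runtime.Stack prints each goroutine's wait reason in square brackets. The sketch below polls that dump until a sender blocked on an unbuffered channel shows up as "chan send". The helper names waitForState and blockedSender are made up for this example, and the exact header text is a runtime detail that could change between releases.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"time"
)

// waitForState (hypothetical helper) polls the all-goroutine stack dump until
// some goroutine header starts with the given wait reason, or the timeout expires.
func waitForState(reason string, timeout time.Duration) bool {
	buf := make([]byte, 1<<20)
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		n := runtime.Stack(buf, true) // true: include every goroutine
		if strings.Contains(string(buf[:n]), "["+reason) {
			return true
		}
		time.Sleep(time.Millisecond)
	}
	return false
}

// blockedSender starts a goroutine that blocks on an unbuffered send, waits
// until the dump shows it parked, then receives the value to release it.
func blockedSender() (sawWaiting bool, received int) {
	ch := make(chan int) // unbuffered: the send parks until someone receives
	go func() { ch <- 42 }()
	sawWaiting = waitForState("chan send", 2*time.Second)
	received = <-ch // makes the sender runnable again
	return sawWaiting, received
}

func main() {
	saw, v := blockedSender()
	fmt.Println("saw parked sender:", saw, "value:", v)
}
```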
Bootstrapping: from runtime.main to your main
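Essentials mentioned that main.main is not the real entry. The process starts at an assembly entry point chosen by the linker (rt0_go in the architecture's asm_*.s file), which calls schedinit to set up the scheduler and memory allocator and then creates the main goroutine. That goroutine runs runtime.main (in proc.go), which starts sysmon, enables the GC, runs package init functions, and finally calls your main.main.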
Rough sequence:
sequenceDiagram
participant OS
participant RT as runtime
participant G0 as main goroutine
participant User as main.main
OS->>RT: process start
RT->>RT: schedinit, mallocinit, ...
RT->>G0: go main.main in goroutine
RT->>RT: schedule loop
G0->>User: your code
When you run:
go worker()
the compiler lowers that to a call to runtime.newproc, which allocates or reuses a G (goroutine), points it at the function to run, and enqueues it. It does not create a new M (machine) per go statement. The existing Ps (processors) and Ms (machines) pick up work from run queues.
Worth one search in $GOROOT/src/runtime/proc.go: func main() (that is, runtime.main, since the file belongs to package runtime), and newproc / newproc1 for how go enters the scheduler.
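You can see the bootstrap frames without opening the runtime source. The following sketch walks the current goroutine's call stack with runtime.Callers; the helper name callChain is made up for this example. I would expect the main goroutine's list to end with runtime.main and runtime.goexit, and a spawned goroutine's list to end with runtime.goexit, but the exact function names depend on the toolchain and on inlining.

```go
package main

import (
	"fmt"
	"runtime"
)

// callChain (hypothetical helper) returns the function names on the calling
// goroutine's stack, innermost first.
func callChain() []string {
	pcs := make([]uintptr, 32)
	n := runtime.Callers(2, pcs) // skip runtime.Callers and callChain itself
	frames := runtime.CallersFrames(pcs[:n])
	var names []string
	for {
		f, more := frames.Next()
		names = append(names, f.Function)
		if !more {
			break
		}
	}
	return names
}

func main() {
	fmt.Println("main goroutine:   ", callChain())

	done := make(chan []string)
	go func() { done <- callChain() }()
	fmt.Println("spawned goroutine:", <-done)
}
```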
Run queues and work stealing
Runnable Gs (goroutines) wait on run queues:
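- Per-P (processor) local queue: a lock-free, fixed-size ring (256 slots) plus a runnext slot. It is the hot path for goroutines spawned on that P (processor), and runnext lets the most recently readied G (goroutine) run next, which gives LIFO-like locality.
- Global queue: protected by a lock; overflow and fairness paths use it.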
When the running G (goroutine) blocks, exits, or is preempted, schedule() picks the next G (goroutine). The preferred order is roughly: local queue → global queue → a non-blocking netpoll check → work stealing from another P (processor)'s local queue → idle hooks. The exact order lives in findRunnable in proc.go and shifts between releases.
Work stealing keeps Ps (processors) busy: if Pᵢ has an empty local queue, it tries to take roughly half the runnable Gs (goroutines) from Pⱼ's queue (implementation details vary by Go version; the invariant is balance load without a single global lock on every schedule).
flowchart LR
subgraph Pbusy["Pⱼ busy"]
Qj["local runq: G G G G G"]
end
subgraph Pidle["Pᵢ idle"]
Qi["local runq: empty"]
end
Qi -->|"steal ~half"| Qj
Why stealing helps: goroutines created on one P (processor) tend to stay there briefly (producer/consumer on same CPU cache). When load is uneven, stealing spreads work without every schedule() hammering one global structure.
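The "steal about half" step can be illustrated with a toy model. This is not the runtime's code: the real local queue is a lock-free ring manipulated with atomics, while the sketch below uses a plain slice, and the names runq and stealHalf are made up for this example.

```go
package main

import "fmt"

// runq is a toy stand-in for a P's local run queue. Ints stand in for
// goroutines; the real queue is a fixed-size lock-free ring.
type runq struct{ gs []int }

// stealHalf moves half of victim's queued items (rounded up) to thief and
// returns how many were moved.
func stealHalf(thief, victim *runq) int {
	n := len(victim.gs) - len(victim.gs)/2
	thief.gs = append(thief.gs, victim.gs[:n]...)
	victim.gs = victim.gs[n:]
	return n
}

func main() {
	busy := &runq{gs: []int{1, 2, 3, 4, 5}}
	idle := &runq{}
	moved := stealHalf(idle, busy)
	fmt.Println(moved, idle.gs, busy.gs) // prints: 3 [1 2 3] [4 5]
}
```

Taking a batch instead of a single item means the idle P does not have to come back and steal again right away, which keeps cross-P traffic low.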
Fairness note: a purely LIFO local queue is great for cache locality, but two goroutines that keep readying each other could keep everything else on that P (processor) waiting. The runtime therefore checks the global queue periodically (every 61st scheduling tick in proc.go) and uses preemption (below) so long-running Gs (goroutines) do not monopolize a P (processor).
The scheduling loop in one picture
At a high level, each M (machine) bound to a P (processor) spins in a loop:
flowchart TD
A["G running on P"] --> B{"block or preempted?"}
B -->|yes| C["park G, enqueue wait reason"]
B -->|no| D["still running"]
C --> E["schedule()"]
E --> F{"local runq?"}
F -->|yes| G["run next G"]
F -->|no| H{"global / steal / netpoll?"}
H --> G
G --> A
You do not call schedule() from user code; the runtime inserts it at synchronization points and when the M (machine) must find new work.
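The closest user-facing handle on that loop is runtime.Gosched, which yields the P (processor) and lets the scheduler pick another G (goroutine). The sketch below runs two goroutines on a single P and has each yield after every step; the helper name yieldDemo is made up for this example. The resulting order is decided by the scheduler, so the code records it and makes no claim about what it will be.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// yieldDemo (hypothetical helper) runs two goroutines on one P. Each records
// its name and then calls runtime.Gosched to hand the P back to the scheduler.
func yieldDemo(rounds int) []string {
	prev := runtime.GOMAXPROCS(1) // restrict to one P for the demo
	defer runtime.GOMAXPROCS(prev)

	var (
		mu     sync.Mutex // guards events; preemption can still interleave the two
		events []string
		wg     sync.WaitGroup
	)
	for _, name := range []string{"a", "b"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			for i := 0; i < rounds; i++ {
				mu.Lock()
				events = append(events, name)
				mu.Unlock()
				runtime.Gosched() // yield: this G goes back to a run queue
			}
		}(name)
	}
	wg.Wait()
	return events
}

func main() {
	fmt.Println(yieldDemo(3)) // the order is up to the scheduler
}
```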
Blocking without wasting threads
Not all waiting is equal.
Syscalls
When G (goroutine) enters a blocking syscall (e.g. some read on a blocking fd, file I/O without poller integration), the runtime may decouple M (machine) from P (processor): the P (processor) is released so another M (machine) can run runnable Gs (goroutines) while the syscall M (machine) waits in the kernel. When the syscall returns, the G (goroutine) must reacquire a P (processor) to run Go code again.
That is why "goroutines are cheap" does not mean "syscalls are cheap": you can still tie up an M (machine) in the kernel; the runtime just avoids tying up the P (processor).
sync.Mutex and runtime locks
Blocking on a sync.Mutex parks the G (goroutine) in the runtime's wait structures; the P (processor) runs other Gs (goroutines). This is still scheduling-aware blocking — no extra OS thread per waiting goroutine.
Network I/O and the netpoller
Network waits are different. The runtime integrates with epoll (Linux), kqueue (BSD/macOS), IOCP (Windows), etc. Sockets registered with the poller can park Gs (goroutines) without blocking an M (machine) in a read: when data arrives, G (goroutine) becomes runnable again.
flowchart TB
subgraph net["netpoller"]
EP["epoll / kqueue / ..."]
end
Gread["G: conn.Read"] -->|"non-blocking + park"| EP
EP -->|"ready fd"| Gready["G runnable → runq"]
So a server with tens of thousands of idle connections is mostly waiting Gs (goroutines), not tens of thousands of blocked kernel threads. Concurrency (a later article in the series, covering channels) builds on these same parking primitives.
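A small loopback experiment makes the parked readers visible. The sketch opens a handful of TCP connections to itself, blocks one goroutine in Read per connection, and counts how many goroutines the stack dump reports as "IO wait". The helper name idleReaders is made up for this example, it assumes loopback networking is available, and the "IO wait" label is a runtime detail that may differ across versions.

```go
package main

import (
	"fmt"
	"net"
	"runtime"
	"strings"
	"time"
)

// idleReaders (hypothetical helper) opens n loopback TCP connections, parks one
// goroutine in Read on each accepted side, and returns how many goroutines the
// stack dump reports as waiting on I/O.
func idleReaders(n int) (int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()

	var conns []net.Conn
	defer func() {
		for _, c := range conns {
			c.Close() // closing unblocks the parked readers
		}
	}()
	for i := 0; i < n; i++ {
		client, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			return 0, err
		}
		server, err := ln.Accept()
		if err != nil {
			client.Close()
			return 0, err
		}
		conns = append(conns, client, server)
		go func() {
			buf := make([]byte, 1)
			server.Read(buf) // no data arrives, so the G parks in the netpoller
		}()
	}

	buf := make([]byte, 4<<20)
	waiting := 0
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		dump := string(buf[:runtime.Stack(buf, true)])
		waiting = strings.Count(dump, "[IO wait")
		if waiting >= n {
			break
		}
		time.Sleep(5 * time.Millisecond)
	}
	return waiting, nil
}

func main() {
	waiting, err := idleReaders(50)
	fmt.Println("goroutines in IO wait:", waiting, "err:", err)
}
```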
Preemption: why for {} used to be a problem
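Early Go was cooperatively scheduled: a G (goroutine) could only be descheduled at points where it entered the runtime, such as function calls (through the stack check in the function prologue), channel operations, and allocation. Loop back-edges carried no such check, so a tight loop with no calls could hog a P (processor) forever.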
Go 1.14 added asynchronous preemption: on Unix-like systems the runtime sends a signal (SIGURG) to the thread running the G (goroutine), and the signal handler checks whether the G (goroutine) stopped at a point where its stack and registers can be scanned safely before handing it back to the scheduler. That makes tight loops and long-running numeric code far less likely to starve other goroutines, at a cost (signal handling, safe-point coordination).
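A minimal way to see this from user code, assuming a Go release with asynchronous preemption enabled: run a call-free busy loop on a single P and check that another goroutine still gets to run. The helper name spinThenStop is made up for this example. With GODEBUG=asyncpreemptoff=1 I would expect this program to hang, because nothing could take the P away from the spinner.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// spinThenStop (hypothetical helper) runs a busy loop on a single P and reports
// how long the calling goroutine needed to get back on the P and stop it.
func spinThenStop() time.Duration {
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev)

	var stop atomic.Bool
	done := make(chan struct{})
	go func() {
		// The atomic load is expected to be inlined, leaving no function
		// call (and so no cooperative preemption point) in the loop.
		for !stop.Load() {
		}
		close(done)
	}()

	start := time.Now()
	time.Sleep(20 * time.Millisecond) // lets the spinner take the only P
	stop.Store(true)                  // reached only after the spinner lost the P
	<-done
	return time.Since(start)
}

func main() {
	fmt.Println("regained the P after", spinThenStop())
}
```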
Layers of preemption today (conceptual stack):
| Mechanism | When it applies |
|---|---|
| Cooperative safe points | Always: cheap checks in function prologues and at calls into the runtime |
| Async signal preemption | Running G (goroutine) on P (processor) without hitting safe points quickly enough |
| sysmon | Background thread: retake P (processor) from syscall-blocked M (machine), preempt long runners, kick netpoller, GC helpers |
sysmon runs on its own M (machine) without a P (processor), so it never waits in a run queue behind user goroutines; that is part of why the runtime can enforce global policies even when user goroutines never yield.
For Go 1.23, treat preemption as mostly automatic but not magic: cgo, certain runtime locks, and //go:nosplit corners still matter for advanced debugging — we touch those only where they clarify scheduler behavior.
Parallelism vs concurrency on the scheduler
- Concurrency — multiple goroutines making progress over time (interleaved).
- Parallelism — literally at the same instant on multiple CPUs.
With GOMAXPROCS=1, you can have huge concurrency and zero parallelism for Go code. With GOMAXPROCS=8 on an 8-core machine, up to eight Gs (goroutines) run Go code in parallel if enough runnable work exists.
package main
import (
	"runtime"
	"time"
)
// burn is a CPU-bound loop with no blocking operations.
func burn() {
	for i := 0; i < 1_000_000_000; i++ {
		_ = i * i
	}
}
func main() {
	runtime.GOMAXPROCS(1)
	go burn()
	go burn()
	time.Sleep(2 * time.Second) // both share one P; preemption lets them interleave
}
CPU-bound worker pools should align goroutine count and GOMAXPROCS with hardware; I/O-bound servers often run many more Gs (goroutines) than cores because most are parked in the poller or on channels.
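For the CPU-bound case, one common shape is a pool sized from runtime.GOMAXPROCS(0). The sketch below splits a slice into one chunk per P and sums the chunks concurrently; the function name parallelSum is made up for this example, and real workloads would need enough work per chunk to outweigh the goroutine and synchronization overhead.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parallelSum (hypothetical helper) splits nums into at most one chunk per P
// and sums the chunks concurrently.
func parallelSum(nums []int) int {
	workers := runtime.GOMAXPROCS(0) // 0 queries without changing the setting
	if workers > len(nums) {
		workers = len(nums)
	}
	if workers == 0 {
		return 0
	}
	chunk := (len(nums) + workers - 1) / workers // ceiling division
	partial := make([]int, workers)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo, hi := w*chunk, (w+1)*chunk
		if lo > len(nums) {
			lo = len(nums)
		}
		if hi > len(nums) {
			hi = len(nums)
		}
		wg.Add(1)
		go func(w int, part []int) {
			defer wg.Done()
			sum := 0
			for _, v := range part {
				sum += v
			}
			partial[w] = sum // each worker writes only its own slot
		}(w, nums[lo:hi])
	}
	wg.Wait()

	total := 0
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	nums := make([]int, 1000)
	for i := range nums {
		nums[i] = i + 1
	}
	fmt.Println(parallelSum(nums)) // prints: 500500
}
```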
Compiler and go statement: what actually gets scheduled
The compiler transforms:
go func(x int) { fmt.Println(x) }(42)
into a call to runtime.newproc. In current toolchains, including Go 1.23, newproc takes a single function value, so the compiler wraps the call and its already-evaluated arguments in a closure and passes that. The new G (goroutine) starts at the function entry with its own stack (initially small, grown on demand by copying to a larger one; see the Memory article).
Closure capture means the closure passed to newproc may be heap-allocated, along with any captured variables that escape; scheduling cost is separate from allocation cost, but churning go in a tight loop still pressures the allocator and run queues.
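One consequence that is easy to check: the operands of a go statement are evaluated in the calling goroutine when the statement executes, not when the new goroutine first runs. A minimal sketch, with the function name snapshot made up for this example:

```go
package main

import "fmt"

// snapshot (hypothetical helper) shows that the argument of a go statement is
// evaluated when the statement executes, in the calling goroutine.
func snapshot() int {
	x := 1
	out := make(chan int, 1)
	go func(v int) { out <- v }(x) // x is read here, before the new G runs
	x = 2                          // too late to affect v
	return <-out
}

func main() {
	fmt.Println(snapshot()) // prints: 1
}
```

A closure that referred to x directly instead of taking it as a parameter would share the variable with the caller, and the read would then race with the assignment.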
Observing the scheduler
GODEBUG
Useful flags (see runtime docs for the full list on 1.23):
GODEBUG=schedtrace=1000 ./myapp # scheduler line every 1000 ms
GODEBUG=scheddetail=1,schedtrace=1000 ./myapp
schedtrace prints gomaxprocs, idle Ps (processors), thread count, and run queue lengths — good sanity check under load.
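If you would rather read similar numbers from inside the process, the runtime/metrics package exposes scheduler metrics. This sketch is a supplement to schedtrace, not a replacement, and the helper name schedSnapshot is made up for this example. The two metric names are ones I expect to be present on Go 1.23; the Kind check guards against a toolchain that does not provide them.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// schedSnapshot (hypothetical helper) reads the live goroutine count and the
// GOMAXPROCS setting through runtime/metrics.
func schedSnapshot() (goroutines, gomaxprocs uint64) {
	samples := []metrics.Sample{
		{Name: "/sched/goroutines:goroutines"},
		{Name: "/sched/gomaxprocs:threads"},
	}
	metrics.Read(samples)
	// A metric unknown to this Go version comes back as KindBad, so check
	// the kind before calling Uint64 (which panics on a mismatch).
	if samples[0].Value.Kind() == metrics.KindUint64 {
		goroutines = samples[0].Value.Uint64()
	}
	if samples[1].Value.Kind() == metrics.KindUint64 {
		gomaxprocs = samples[1].Value.Uint64()
	}
	return goroutines, gomaxprocs
}

func main() {
	g, p := schedSnapshot()
	fmt.Printf("goroutines=%d gomaxprocs=%d\n", g, p)
}
```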
Execution trace
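package main
import (
	"log"
	"os"
	"runtime/trace"
)
func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop() // flushes the trace before f is closed
	// workload
}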
go tool trace trace.out
The trace viewer shows goroutine lifetimes, G (goroutine), M (machine), and P (processor) events, and blocking reasons; it is the best interactive picture of scheduling decisions.
Delve and pprof
dlv debug ./myapp
# goroutines, goroutine <id>, stack
go test -cpuprofile=cpu.prof -bench .
go tool pprof -http=:8080 cpu.prof
CPU profiles show where time went on-CPU; combine with trace when you suspect scheduling or lock contention rather than hot user code.
Patterns that interact badly with the scheduler
| Pattern | What happens |
|---|---|
| Unbounded go spam | Run queue growth, GC pressure, eventual thrashing; no magic parallelism beyond GOMAXPROCS |
| GOMAXPROCS >> cores | Extra context switching; rarely helps CPU-bound work |
| GOMAXPROCS=1 on multi-core CPU-bound app | Artificial bottleneck |
| Huge critical sections | Every G (goroutine) waiting on the lock stays parked for the whole section; if the holder is preempted mid-section, the waiters pay for that delay too |
| Cgo or blocking syscall storms | Many Ms (machines), P (processor) handoff churn; can hurt throughput |
| Assuming go order | Scheduler does not guarantee completion order |
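When to tune GOMAXPROCS: usually leave the default. Containers with CPU limits (cgroups) are a common exception: Go 1.23 derives the default from the CPUs the process is allowed to run on, not from a cgroup CPU quota, so a quota-limited container can end up with more Ps (processors) than its quota can feed. Set the GOMAXPROCS environment variable (or call runtime.GOMAXPROCS) to match the limit, and verify with runtime.GOMAXPROCS(0) inside the container.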
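For the unbounded go spam row in the patterns table, one common remedy is to cap the number of goroutines in flight with a buffered channel used as a counting semaphore. The sketch below is one way to do that, not the only one; the function name runBounded is made up for this example, and limit must be at least 1 or the first send would block forever.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runBounded (hypothetical helper) runs every task but keeps at most limit of
// them in flight. It returns the highest number observed running at once.
// limit must be at least 1.
func runBounded(limit int, tasks []func()) int {
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	var inFlight, peak atomic.Int64

	for _, task := range tasks {
		sem <- struct{}{} // blocks the spawner once limit goroutines are live
		wg.Add(1)
		go func(task func()) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot last
			n := inFlight.Add(1)
			for { // record a new peak if this is one
				p := peak.Load()
				if n <= p || peak.CompareAndSwap(p, n) {
					break
				}
			}
			task()
			inFlight.Add(-1)
		}(task)
	}
	wg.Wait() // also replaces any assumption about completion order
	return int(peak.Load())
}

func main() {
	tasks := make([]func(), 100)
	for i := range tasks {
		tasks[i] = func() {}
	}
	fmt.Println("peak in flight:", runBounded(4, tasks))
}
```

Because the spawner blocks on the semaphore, queue growth and memory use stay proportional to limit instead of to the number of tasks.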
Mental model checklist
Before opening Memory, you should be able to answer:
- What is the difference between a goroutine, a thread, and a P (processor)?
- Where do runnable goroutines wait, and what is work stealing for?
- Why does blocking network I/O scale differently from blocking file I/O in many setups?
- What does GOMAXPROCS actually limit?
- What tool would you use to prove a goroutine is runnable but not running?
If those are clear, stack growth and escape analysis will land on solid ground — stacks are per-G (goroutine), and the scheduler is what runs the code that uses them.
Conclusion
The Go scheduler multiplexes a large population of Gs (goroutines) onto a small pool of Ms (machines), using Ps (processors) as the unit of parallel Go execution. Local run queues plus work stealing spread load; the netpoller and parking logic avoid turning every wait into a thread; preemption and sysmon keep the system fair under CPU-heavy or syscall-heavy workloads.
Next in the series is Memory — how goroutine stacks grow and shrink, when values escape to the heap, and how that ties into the allocator and GC layers outlined in Essentials.