Go Internals - Scheduler

01/06/2026 · 12 min read

Lucas Lemos

Introduction

In Go Internals - Essentials we mapped the runtime layers and pointed at Scheduler as the first deep dive: how goroutines multiplex onto OS threads, what G (goroutine), M (machine), and P (processor) mean, and why concurrent Go programs behave the way they do under load.

This article is that dive. We stay on Go 1.26, read the real scheduler code under src/runtime/, and connect language-level concurrency (go, channels, sync) to the invariants the runtime enforces. Memory layout, stacks, and the allocator come next in Memory; here the focus is who runs what, on which thread, and when control switches.

The G/M/P (goroutine / machine / processor) model

Modern Go uses a work-stealing scheduler with three central abstractions (see runtime2.go and comments in proc.go):

G — goroutine: user stack, registers, status, links to M while running.
M — machine: an OS thread that can execute Go (or runtime) code when bound to a P.
P — processor: execution context required to run Go code; holds a local run queue and scheduler state.

flowchart TB
  subgraph logical["Logical parallelism (GOMAXPROCS)"]
    P0["P₀ local runq"]
    P1["P₁ local runq"]
    P2["P₂ local runq"]
  end
  subgraph threads["OS threads"]
    M0["M₀"]
    M1["M₁"]
    M2["M₂"]
    M3["M₃ blocking syscall"]
  end
  subgraph goroutines["Goroutines"]
    G1["G"]
    G2["G"]
    G3["G"]
    G4["G waiting"]
  end
  P0 --> M0
  P1 --> M1
  P2 --> M2
  P0 --> G1
  P0 --> G2
  P1 --> G3
  G4 -.->|"parked on channel/netpoll"| M3

Rule that surprises newcomers: Go user code runs only when an M holds a P. When an M blocks in a syscall, the runtime can take its P away; the blocked G stays with that M while another M is created or woken to continue runnable work, up to limits the runtime manages.

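GOMAXPROCS is the number of Ps. By default it matches the number of logical CPUs available to the process (on Linux, recent releases also take a cgroup CPU limit into account). That is the ceiling on goroutines executing Go code at the same instant. Having more goroutines than GOMAXPROCS is fine; they take turns on the available Ps.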

package main

import (
  "fmt"
  "runtime"
)

func main() {
  fmt.Println(runtime.GOMAXPROCS(0)) // 0 queries the current setting without changing it
}

Setting GOMAXPROCS does not cap goroutine count; it caps parallel execution of Go code on distinct Ps (processors).
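To make the distinction concrete, here is a minimal sketch that parks far more goroutines than there are Ps and then asks the runtime how many exist. The helper name parkedGoroutines is made up for this illustration; runtime.NumGoroutine and runtime.GOMAXPROCS are the real APIs.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parkedGoroutines starts n goroutines that block on a channel and reports
// how many goroutines exist while they are all parked.
// The name is illustrative, not a runtime API.
func parkedGoroutines(n int) int {
	release := make(chan struct{})
	var started, done sync.WaitGroup
	started.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			started.Done()
			<-release // park until the channel is closed
		}()
	}
	started.Wait()
	// Includes the caller and any goroutines the runtime started itself.
	count := runtime.NumGoroutine()
	close(release)
	done.Wait()
	return count
}

func main() {
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
	fmt.Println("goroutines alive:", parkedGoroutines(10_000))
}
```

The goroutine count is in the thousands while GOMAXPROCS stays at the number of Ps: the setting bounds how many goroutines run at once, not how many exist.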

Goroutines are not OS threads

A goroutine is a lightweight, runtime-managed execution context. You start one with go f(). An OS thread (an M in runtime jargon) is what the kernel schedules; creating one is comparatively expensive (a large stack reservation, often megabytes, plus kernel bookkeeping).

Go's bet is many goroutines, few threads:

| | Goroutine (`G`) | OS thread (`M`) |
|---|---|---|
| Created by | `go` statement / runtime | `runtime` via `clone` / platform thread APIs |
| Typical cost | Small stack (grows on demand), metadata in runtime structures | Kernel scheduling, larger fixed stack mapping |
| Count in production | Thousands to millions is normal | Roughly `GOMAXPROCS` plus a few for blocking I/O and `sysmon` |
| Scheduling | Go scheduler (`runtime`) | OS kernel |

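package main

import "time"

func main() {
  for i := 0; i < 100_000; i++ {
    go func() {}() // each goroutine exits immediately
  }
  time.Sleep(time.Second) // crude wait so the goroutines get a chance to run
}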

Spawning 100,000 goroutines is boring in Go; spawning 100,000 threads is not. The runtime amortizes thread usage by parking goroutines when they block and reusing the same M (machine) for other runnable Gs (goroutines).
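One way to check the "many goroutines, few threads" claim on your own machine is the built-in threadcreate profile, which counts the OS threads the runtime has created. The sketch below is only an illustration; the helper names threadsCreated and spawnAndCount are invented here, and the exact thread count depends on the machine and Go version.

```go
package main

import (
	"fmt"
	"runtime/pprof"
	"sync"
)

// threadsCreated reports how many OS threads the runtime has created so far,
// using the built-in "threadcreate" profile.
func threadsCreated() int {
	return pprof.Lookup("threadcreate").Count()
}

// spawnAndCount runs n short-lived goroutines to completion and returns the
// thread count afterwards. The name is illustrative.
func spawnAndCount(n int) int {
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
		}()
	}
	wg.Wait()
	return threadsCreated()
}

func main() {
	fmt.Println("threads before:", threadsCreated())
	fmt.Println("threads after 100k goroutines:", spawnAndCount(100_000))
}
```

On a typical machine the second number stays in the same range as GOMAXPROCS, which is the amortization described above.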

That does not mean goroutines are free. Each G has a stack, lives in scheduler queues, and may hold locks. The scheduler's job is to keep Ms busy without letting runnable work pile up on some Ps while others sit idle.

Goroutine states (simplified)

The full state machine in runtime2.go has more substates; for reading the scheduler, this mental model is enough:

| State (concept) | Meaning |
|---|---|
| Runnable | Ready to run; sitting on a run queue, possibly about to be stolen by another `P` |
| Running | Executing on an `M` that holds a `P` |
| Waiting | Blocked: channel, lock, sleep, network, syscall handoff, etc. |
| Dead | Finished; the `G` structure goes onto a free list so a later `go` statement can reuse it |

Transitions happen in functions like gopark, goready, and the core schedule loop in proc.go. When you execute ch <- v and block because the buffer is full, your G leaves _Grunning for _Gwaiting and another G on that P runs.

Bootstrapping: from `runtime.main` to your `main`

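Essentials mentioned that main.main is not the real entry point. The linker points the process entry at the runtime's platform-specific startup code, which initializes the scheduler and allocator (schedinit) and creates the main goroutine. That goroutine runs runtime.main (in proc.go), which starts sysmon, enables the GC, runs package initializers, and finally calls your main.main.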

Rough sequence:

sequenceDiagram
  participant OS
  participant RT as runtime
  participant G0 as main goroutine
  participant User as main.main
  OS->>RT: process start
  RT->>RT: schedinit, mallocinit, ...
  RT->>G0: go main.main in goroutine
  RT->>RT: schedule loop
  G0->>User: your code
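The part of this sequence you can observe from user code is the ordering guaranteed by the language spec: package-level variable initializers run first, then init functions, then main.main. A small sketch (the record helper and the order slice are invented for this example):

```go
package main

import "fmt"

// order records the sequence in which initialization steps run.
var order []string

// Package-level variable initializers run before any init function.
var _ = record("package-level var")

// record appends a step name to order; it exists only for this illustration.
func record(step string) bool {
	order = append(order, step)
	return true
}

func init() {
	record("init()")
}

func main() {
	record("main.main")
	fmt.Println(order) // [package-level var init() main.main]
}
```

By the time the first line of main.main executes, the scheduler, allocator, and GC are already up, which is why even initializers can start goroutines.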

When you run:

go worker()

the compiler lowers that to a runtime call (runtime.newproc) that allocates or reuses a G, records the function to run along with its arguments, and enqueues it. It does not create a new M per go statement. The existing Ps and Ms pick up work from run queues.

Worth one search in $GOROOT/src/runtime/proc.go: func main() (which is runtime.main, since the file belongs to package runtime), and newproc / newproc1 for how a go statement enters the scheduler.

Run queues and work stealing

Runnable Gs (goroutines) wait on run queues:

Per-P local queue: a fixed-size lock-free ring, the hot path for goroutines spawned on that P. The ring itself is FIFO; a separate runnext slot holds the most recently readied G, which gives LIFO-like behavior for locality.
Global queue: protected by a lock; overflow and fairness paths use it.

When a G blocks, exits, or is preempted at the end of its time slice, schedule() picks the next G. The preferred order is roughly: local queue → global queue → work stealing from another P's local queue → netpoller / idle hooks.

Work stealing keeps Ps busy: if Pᵢ has an empty local queue, it tries to take roughly half the runnable Gs from Pⱼ's queue (implementation details vary by Go version; the invariant is to balance load without taking a single global lock on every scheduling decision).

flowchart LR
  subgraph Pbusy["Pⱼ busy"]
    Qj["local runq: G G G G G"]
  end
  subgraph Pidle["Pᵢ idle"]
    Qi["local runq: empty"]
  end
  Qi -->|"steal ~half"| Qj

Why stealing helps: goroutines created on one P (processor) tend to stay there briefly (producer/consumer on same CPU cache). When load is uneven, stealing spreads work without every schedule() hammering one global structure.
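The steal-half idea fits in a few lines. The sketch below is a toy model, not the runtime's code: the runq type and stealHalf function are invented for illustration, ints stand in for Gs, and a plain slice replaces the lock-free ring the real scheduler uses.

```go
package main

import "fmt"

// runq is a toy stand-in for a P's local run queue; ints play the role of Gs.
// The real runtime uses a fixed-size lock-free ring, not a slice.
type runq struct {
	gs []int
}

// stealHalf moves about half of victim's queued items (rounded up) to thief
// and returns how many were moved.
func stealHalf(thief, victim *runq) int {
	n := len(victim.gs) - len(victim.gs)/2
	if n == 0 {
		return 0
	}
	thief.gs = append(thief.gs, victim.gs[:n]...)
	victim.gs = victim.gs[n:]
	return n
}

func main() {
	busy := &runq{gs: []int{1, 2, 3, 4, 5}}
	idle := &runq{}
	n := stealHalf(idle, busy)
	fmt.Println(n, idle.gs, busy.gs) // 3 [1 2 3] [4 5]
}
```

Taking half rather than one item means an idle P does not have to come back and steal again immediately, while the victim still keeps enough work to stay busy.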

Fairness note: LIFO-style local scheduling is great for cache locality but can starve older goroutines under certain patterns. The runtime periodically checks the global queue ahead of the local one and uses preemption (below) so long-running Gs do not monopolize a P.

The scheduling loop in one picture

At a high level, each M bound to a P runs a loop:

flowchart TD
  A["G running on P"] --> B{"block or preempted?"}
  B -->|yes| C["park G, enqueue wait reason"]
  B -->|no| D["still running"]
  C --> E["schedule()"]
  E --> F{"local runq?"}
  F -->|yes| G["run next G"]
  F -->|no| H{"global / steal / netpoll?"}
  H --> G
  G --> A

You do not call schedule() from user code; the runtime enters it whenever the running G parks, exits, or is preempted and the M must find new work.

Blocking without wasting threads

Not all waiting is equal.

Syscalls

When a G enters a blocking syscall (e.g. some read on a blocking fd, file I/O without poller integration), the runtime may decouple the M from its P: the P is released so another M can run runnable Gs while the original M waits in the kernel. When the syscall returns, the G must reacquire a P to run Go code again.

That is why "goroutines are cheap" does not mean "syscalls are cheap": you can still tie up an M (machine) in the kernel; the runtime just avoids tying up the P (processor).
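The thread cost of blocking syscalls can be observed directly. The sketch below is Unix-only and deliberately bypasses the os package: it parks goroutines in raw syscall.Read calls on empty pipes, then reads the threadcreate profile. The helper name threadsWhileBlocked is made up, and the exact count is an observation that varies by machine, not a guarantee.

```go
package main

import (
	"fmt"
	"runtime/pprof"
	"syscall"
	"time"
)

// threadsWhileBlocked parks n goroutines in a raw, blocking read(2) on empty
// pipes and returns how many OS threads the runtime has created once they
// are stuck in the kernel. Unix-only; the name is illustrative.
func threadsWhileBlocked(n int) (int, error) {
	writers := make([]int, 0, n)
	done := make(chan struct{}, n)
	defer func() {
		for _, w := range writers {
			syscall.Write(w, []byte{1}) // one byte unblocks the matching reader
			syscall.Close(w)
		}
		for range writers {
			<-done
		}
	}()

	for i := 0; i < n; i++ {
		var fds [2]int
		if err := syscall.Pipe(fds[:]); err != nil {
			return 0, err
		}
		writers = append(writers, fds[1])
		go func(r int) {
			buf := make([]byte, 1)
			syscall.Read(r, buf) // blocks this goroutine's M in the kernel
			syscall.Close(r)
			done <- struct{}{}
		}(fds[0])
	}

	// Poll until each blocked reader has its own thread, or give up.
	threads := pprof.Lookup("threadcreate").Count()
	deadline := time.Now().Add(2 * time.Second)
	for threads < n && time.Now().Before(deadline) {
		time.Sleep(10 * time.Millisecond)
		threads = pprof.Lookup("threadcreate").Count()
	}
	return threads, nil
}

func main() {
	threads, err := threadsWhileBlocked(8)
	if err != nil {
		fmt.Println("pipe:", err)
		return
	}
	fmt.Println("threads with 8 goroutines blocked in read(2):", threads)
}
```

Each blocked reader pins one M, so the thread count grows with the number of simultaneous blocking syscalls even though the Ps stay free for other work.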

`sync.Mutex` and runtime locks

Blocking on a sync.Mutex parks the G (goroutine) in the runtime's wait structures; the P (processor) runs other Gs (goroutines). This is still scheduling-aware blocking — no extra OS thread per waiting goroutine.

Network I/O and the netpoller

Network waits are different. The runtime integrates with epoll (Linux), kqueue (BSD/macOS), IOCP (Windows), etc. A G reading from a socket registered with the poller can park without blocking an M in a read: when data arrives, the G becomes runnable again.

flowchart TB
  subgraph net["netpoller"]
    EP["epoll / kqueue / ..."]
  end
  Gread["G: conn.Read"] -->|"non-blocking + park"| EP
  EP -->|"ready fd"| Gready["G runnable → runq"]

So a server with tens of thousands of idle connections is mostly waiting Gs, not tens of thousands of blocked kernel threads. The Concurrency article later in the series (which covers channels) builds on these same parking primitives.

Preemption: why `for {}` used to be a problem

Early Go was preempted only cooperatively, at safe points such as function calls (via the stack check in the function prologue) and blocking operations like channel ops. A tight loop with no calls could hog a P forever.

Go 1.14 added asynchronous preemption. On Unix-like systems the runtime sends a signal (SIGURG) to the thread running the G; the handler checks whether the goroutine is at a point where it is safe to stop and, if so, hands control back to the scheduler. Other platforms use their own mechanism where one is available. That makes tight loops and long-running numeric code far less likely to starve other goroutines, at a cost (signal handling, safe-point coordination).

Layers of preemption today (conceptual stack):

| Mechanism | When it applies |
|---|---|
| Cooperative safe points | Always: cheap checks in function prologues, piggybacking on the stack-growth check |
| Async signal preemption | A running `G` has not reached a cooperative safe point quickly enough |
| `sysmon` | Background thread: retakes `P`s from syscall-blocked `M`s, requests preemption of `G`s that have run for roughly 10 ms, polls the network if nobody else has recently, and forces periodic GC |

sysmon runs on its own M without a P, so it does not depend on any user goroutine yielding; that is part of why the runtime can enforce global policies even when user goroutines never yield.

For Go 1.26, treat preemption as mostly automatic but not magic: cgo, certain runtime locks, and //go:nosplit corners still matter for advanced debugging — we touch those only where they clarify scheduler behavior.

Parallelism vs concurrency on the scheduler

Concurrency — multiple goroutines making progress over time (interleaved).
Parallelism — literally at the same instant on multiple CPUs.

With GOMAXPROCS=1, you can have huge concurrency and zero parallelism for Go code. With GOMAXPROCS=8 on an 8-core machine, up to eight Gs run Go code in parallel if enough runnable work exists.

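package main

import (
  "runtime"
  "time"
)

// burn is pure CPU work with no blocking operations.
func burn() {
  for i := 0; i < 1_000_000_000; i++ {
    _ = i * i
  }
}

func main() {
  runtime.GOMAXPROCS(1) // one P: concurrency without parallelism
  go burn()
  go burn()
  time.Sleep(2 * time.Second) // both share one P; preemption lets them interleave
}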

CPU-bound worker pools should align goroutine count and GOMAXPROCS with hardware; I/O-bound servers often run many more Gs (goroutines) than cores because most are parked in the poller or on channels.
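For CPU-bound work, that usually means one worker per P. A minimal sketch, assuming the work splits cleanly into chunks (parallelSum is a made-up name for this example):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parallelSum splits nums into at most GOMAXPROCS chunks and sums the chunks
// concurrently, one goroutine per chunk.
func parallelSum(nums []int) int {
	workers := runtime.GOMAXPROCS(0)
	if workers > len(nums) {
		workers = len(nums)
	}
	if workers == 0 {
		return 0
	}
	chunk := (len(nums) + workers - 1) / workers // ceiling division
	partial := make([]int, workers)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo := w * chunk
		if lo >= len(nums) {
			break // rounding up can leave the last workers without a chunk
		}
		hi := lo + chunk
		if hi > len(nums) {
			hi = len(nums)
		}
		wg.Add(1)
		go func(w, lo, hi int) {
			defer wg.Done()
			sum := 0 // accumulate locally, write the shared slot once
			for _, v := range nums[lo:hi] {
				sum += v
			}
			partial[w] = sum
		}(w, lo, hi)
	}
	wg.Wait()

	total := 0
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	nums := make([]int, 100)
	for i := range nums {
		nums[i] = i + 1
	}
	fmt.Println(parallelSum(nums)) // 5050
}
```

More workers than Ps would add scheduling overhead without adding parallelism; fewer would leave cores idle.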

Compiler and `go` statement: what actually gets scheduled

The compiler transforms:

go func(x int) { fmt.Println(x) }(42)

into a call to runtime.newproc. In current toolchains the compiler first wraps the call and its arguments in a closure, so newproc receives a single function value. The new G starts at the function entry with its own stack (initially small, grown on demand by copying to a larger one; see the Memory article).

Closure capture means the closure passed to newproc may reference heap-allocated variables if they escape; scheduling cost is separate from allocation cost, but churning go in a tight loop still pressures the allocator and run queues.

Observing the scheduler

`GODEBUG`

Useful flags (see runtime docs for the full list on 1.26):

GODEBUG=schedtrace=1000 ./myapp   # scheduler line every 1000 ms
GODEBUG=scheddetail=1,schedtrace=1000 ./myapp

schedtrace prints gomaxprocs, idle Ps (processors), thread count, and run queue lengths — good sanity check under load.

Execution trace

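package main

import (
  "log"
  "os"
  "runtime/trace"
)

func main() {
  f, err := os.Create("trace.out")
  if err != nil {
    log.Fatal(err)
  }
  defer f.Close()
  if err := trace.Start(f); err != nil {
    log.Fatal(err)
  }
  defer trace.Stop() // runs before f.Close, so the trace is flushed first
  // workload
}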

go tool trace trace.out

The viewer shows goroutine lifetimes, per-P timelines, G and M events, and blocking reasons; it is the best interactive picture of scheduling decisions.

Delve and `pprof`

dlv debug ./myapp
# goroutines, goroutine <id>, stack

go test -cpuprofile=cpu.prof -bench .
go tool pprof -http=:8080 cpu.prof

CPU profiles show where time went on-CPU; combine with trace when you suspect scheduling or lock contention rather than hot user code.
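For lightweight monitoring inside a running service, the runtime/metrics package exposes a few scheduler numbers without starting a trace. The sketch below reads two of them; schedSnapshot is a made-up helper name, while the metric names come from the runtime/metrics documentation.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// schedSnapshot reads two scheduler-related metrics: the number of live
// goroutines and the current GOMAXPROCS value. The name is illustrative.
func schedSnapshot() (goroutines, gomaxprocs uint64) {
	samples := []metrics.Sample{
		{Name: "/sched/goroutines:goroutines"},
		{Name: "/sched/gomaxprocs:threads"},
	}
	metrics.Read(samples)

	// A metric unknown to this Go version comes back as KindBad, so check
	// the kind before reading the value.
	if samples[0].Value.Kind() == metrics.KindUint64 {
		goroutines = samples[0].Value.Uint64()
	}
	if samples[1].Value.Kind() == metrics.KindUint64 {
		gomaxprocs = samples[1].Value.Uint64()
	}
	return goroutines, gomaxprocs
}

func main() {
	g, p := schedSnapshot()
	fmt.Printf("goroutines=%d gomaxprocs=%d\n", g, p)
}
```

The same package also offers a /sched/latencies:seconds histogram, which is a cheap first signal that goroutines are waiting too long in the runnable state.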

Patterns that interact badly with the scheduler

| Pattern | What happens |
|---|---|
| Unbounded `go` spam | Run queue growth, GC pressure, eventual thrashing; no magic parallelism beyond `GOMAXPROCS` |
| `GOMAXPROCS` >> cores | Extra context switching; rarely helps CPU-bound work |
| `GOMAXPROCS=1` on multi-core CPU-bound app | Artificial bottleneck |
| Huge critical sections | Fewer preemption opportunities inside the section; delays other `G`s on that `P` |
| Cgo or blocking syscall storms | Many `M`s, `P` handoff churn; can hurt throughput |
| Assuming `go` order | Scheduler does not guarantee completion order |

When to tune GOMAXPROCS: usually leave the default. Containers with CPU limits (cgroups) used to be the common exception. Since Go 1.25 the default on Linux takes the cgroup CPU bandwidth limit into account, so on Go 1.26 the default is usually right there too; verify with runtime.GOMAXPROCS(0) inside the container.

Mental model checklist

Before opening Memory, you should be able to answer:

What is the difference between a goroutine, a thread, and a P (processor)?
Where do runnable goroutines wait, and what is work stealing for?
Why does blocking network I/O scale differently from blocking file I/O in many setups?
What does GOMAXPROCS actually limit?
What tool would you use to prove a goroutine is runnable but not running?

If those are clear, stack growth and escape analysis will land on solid ground — stacks are per-G (goroutine), and the scheduler is what runs the code that uses them.

Conclusion

The Go scheduler multiplexes a large population of Gs (goroutines) onto a small pool of Ms (machines), using Ps (processors) as the unit of parallel Go execution. Local run queues plus work stealing spread load; the netpoller and parking logic avoid turning every wait into a thread; preemption and sysmon keep the system fair under CPU-heavy or syscall-heavy workloads.

Next in the series is Memory — how goroutine stacks grow and shrink, when values escape to the heap, and how that ties into the allocator and GC layers outlined in Essentials.