atty — benchmarking
Status (2026-06): all three phases have landed: Phase 1 (the
zig build bench harness), Phase 2 (the three configurable eBPF enforcement-depth modes), and Phase 3 (the per-mode overhead measurement + decision matrix). The measured default is one_level.
Why
Two forces drive a benchmark suite:
- Defend the claims. atty advertises a zero-allocation hot path and microsecond keystroke dispatch. That should be a number we can reproduce and regression-gate, not a vibe.
- Decide perf/latency tradeoffs with data. The motivating case is kernel-side block enforcement depth (below): one-level vs a bounded ancestry walk vs propagate-on-fork trade coverage against per-execve / per-fork compute. The only honest way to pick a default — and to advise operators per use-case — is to measure all three on real kernels. We implement all three, make the depth configurable, and let the numbers pick the default.
The suite — zig build bench
Two tiers, because not everything can run in CI:
Tier A — in-process microbenchmarks (CI-safe)
A standalone ReleaseFast executable (bench/main.zig) that times
tight loops over the hot paths and prints ns/op + the per-op
allocation count (via a counting allocator, to assert the
zero-alloc claim). The first three rows are implemented today; the rest
are the planned target set:
| Benchmark | Status | Measures | Asserts |
|---|---|---|---|
| dispatch_input | ✅ | one keystroke through the module chain (Dispatcher(modules).dispatchInput) | ns/op, 0 allocs |
| line_state_apply | ✅ | LineState.applyInput for printable / CSI bytes | ns/op, 0 allocs |
| ghost_text | ✅ | gatherGhostText over a seeded history ring (prefix match + trailing copy) | ns/op, 0 allocs |
| keymap_match | planned | CSI-u + legacy binding scan | ns/op |
| output_throughput | planned | onOutput over a multi-KiB shell chunk (mouse-ring ingest, SGR strip) | MB/s |
| atom_scan (guard) | planned | Aho-Corasick scan of a command line vs the atom corpus | ns/op |
The alloc counter sees allocations made through ctx.allocator /
ctx.scratch — the path the hot loop is supposed to use. A module that
captured a different allocator at attach (e.g. atuin’s worker
thread) and allocated through that in onInput would be invisible to
the counter; the figure is “zero alloc on the dispatched path,” which is
the property that matters for the keystroke loop. The test named
“per-keystroke dispatch makes zero heap allocations” (in
bench/main.zig, run by zig build test) gates this so a regression
fails CI. (The Enter/commit branch, onLineCommit → history record,
is allowed to allocate and isn’t the hot path.)
Output is a stable table (and a --json mode) so CI can diff against a
committed baseline and fail a PR that regresses the hot path past a
threshold. Run:
zig build bench # human table
zig build bench -- --json # machine-readable, for CI baselines
zig build bench -- --filter dispatch
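The baseline comparison that CI would run against the --json output can be sketched as below. This is a minimal sketch, not the project's gate: the JSON shape (benchmark name mapped to ns_per_op and allocs_per_op), the 10 % threshold, and the function name find_regressions are all assumptions made for illustration, because the --json schema is not specified here. The numbers are placeholders, not measurements.

```python
import json


def find_regressions(baseline, current, threshold=0.10):
    """Return names of benchmarks that got slower past `threshold` or started allocating.

    Both arguments map benchmark name -> {"ns_per_op": float, "allocs_per_op": int}.
    This schema is hypothetical; adapt it to the real --json output.
    """
    failed = []
    for name, base in sorted(baseline.items()):
        cur = current.get(name)
        if cur is None:
            continue  # benchmark removed or filtered out: not counted as a regression
        too_slow = cur["ns_per_op"] > base["ns_per_op"] * (1.0 + threshold)
        allocates = cur["allocs_per_op"] > base["allocs_per_op"]
        if too_slow or allocates:
            failed.append(name)
    return failed


# Placeholder values for illustration only.
baseline = json.loads(
    '{"dispatch_input": {"ns_per_op": 100.0, "allocs_per_op": 0},'
    ' "ghost_text": {"ns_per_op": 200.0, "allocs_per_op": 0}}'
)
current = json.loads(
    '{"dispatch_input": {"ns_per_op": 125.0, "allocs_per_op": 0},'
    ' "ghost_text": {"ns_per_op": 205.0, "allocs_per_op": 0}}'
)
print(find_regressions(baseline, current))  # ['dispatch_input']
```

A relative threshold only makes sense when baseline and current run come from the same pinned machine; on mixed runners the gate would mostly report noise.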
Tier B — system / kernel benchmarks (sandbox, not CI)
eBPF and full PTY round-trips need a real kernel and a PTY, so they run under the existing eBPF sandbox, not GitHub CI:
| Benchmark | Status | Measures |
|---|---|---|
| 57-ebpf-overhead | ✅ | per-mode ns per BPF-program invocation (trace_fork / check_execve / trace_exit) via the kernel’s bpf_stats_enabled run-time accounting |
| map_pressure | planned | threat_map growth + lookup cost under a deep/wide descendant tree |
| pty_roundtrip | planned | end-to-end keystroke→echo latency through the proxy |
Run it: python3 tests/sandbox/runner.py --no-build 57-ebpf-overhead
(it’s a measurement, not a gate, so it’s out of the make sandbox-ebpf
pass/fail target).
Why kernel stats, not userspace timing. A fork+execve+wait is
hundreds of µs of process creation; the hook cost is sub-µs, so a
userspace before/after delta drowns in container noise. With
bpf_stats_enabled the kernel records each program’s cumulative
run_time_ns / run_cnt, so Δrun_time_ns / Δrun_cnt over a fixed
fork+exec workload is the precise mean ns per program invocation — the
actual quantity we care about.
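The Δrun_time_ns / Δrun_cnt calculation can be written out as a small helper. This is a sketch of the arithmetic only: the ProgStats type and the function name are hypothetical, and how the two snapshots are collected (for example from bpftool prog show output taken before and after the workload) is left out. The sample counters are invented so that the result matches the ~321 ns figure reported for check_execve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProgStats:
    """Cumulative kernel counters for one BPF program (hypothetical container)."""
    run_time_ns: int
    run_cnt: int


def mean_ns_per_invocation(before, after):
    """Mean cost of one program invocation between two counter snapshots."""
    d_cnt = after.run_cnt - before.run_cnt
    if d_cnt <= 0:
        raise ValueError("program did not run between the two snapshots")
    return (after.run_time_ns - before.run_time_ns) / d_cnt


# Illustrative counters: 4000 invocations costing 1_284_000 ns in total.
before = ProgStats(run_time_ns=1_000_000, run_cnt=10_000)
after = ProgStats(run_time_ns=2_284_000, run_cnt=14_000)
print(mean_ns_per_invocation(before, after))  # 321.0
```

Taking deltas rather than absolute counters keeps earlier activity on the host out of the mean, since the kernel counters are cumulative for the lifetime of the loaded program.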
Results (x86_64, fork+execve ×4000/mode, median of 3)
ns per BPF-program invocation (kernel run_time_ns/run_cnt). A
fork+execve+exit fires three programs once each, so sum/cmd is the eBPF
cost added per command:
| mode | trace_fork | check_execve | trace_exit | sum/cmd |
|---|---|---|---|---|
| one_level | 77 | 321 | 334 | 732 ns |
| ancestry(8) | 79 | 449 (+~120) | 324 | 852 ns |
| propagate_on_fork | 130 (+~50) | 298 | 272 | 700 ns |
Measured baseline: one Python fork+execve(/bin/true)+wait with eBPF
off ≈ 1.12 ms/op on this (loaded) box → sum/cmd is ≈ 0.07 %.
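The sum/cmd column and the percentage of the baseline follow directly from the per-program figures. The snippet below reproduces that arithmetic from the numbers reported above; the variable and function names are illustrative only.

```python
# Per-program ns from the results table: (trace_fork, check_execve, trace_exit).
RESULTS_NS = {
    "one_level": (77, 321, 334),
    "ancestry(8)": (79, 449, 324),
    "propagate_on_fork": (130, 298, 272),
}
BASELINE_NS = 1.12e6  # one Python fork+execve(/bin/true)+wait with eBPF off


def sum_per_cmd(mode):
    """eBPF ns added per command: each of the three programs fires once."""
    return sum(RESULTS_NS[mode])


def pct_of_baseline(ns):
    """Added cost as a percentage of the eBPF-off fork+exec baseline."""
    return 100.0 * ns / BASELINE_NS


for mode in RESULTS_NS:
    total = sum_per_cmd(mode)
    print(f"{mode}: {total} ns, {pct_of_baseline(total):.3f} % of baseline")
```

For one_level this gives 732 ns, about 0.065 % of the baseline, which rounds to the 0.07 % quoted above.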
Before/after the trace_execve removal. The first Phase-3 run found
a fourth program — trace_execve, an every-execve user-memory
arg-capture tracepoint — costing ~2600 ns/execve, ~78 % of sum/cmd
and identical across modes. Its per-execve events were going
unconsumed (the daemon classifies only proxy-delivered commands; the
kernel-side detection they were meant to feed was never wired), so it was
removed. That dropped sum/cmd from ~3331 ns to ~732 ns — the single
biggest per-execve eBPF win — with zero behavioral change (the non-proxy
detection gap it was supposed to address never worked anyway; pinned by
58-ebpf-detection-gap).
Security-profile (audit/session) watch scope (Phase 2). The
one_level+watch row SetWatch’s the bench parent, so every workload
execve is a watched descendant and check_execve emits a scoped
VERDICT_CLASSIFY event (the daemon’s classify runs on a worker thread,
off the BPF time). Same run:
| mode | trace_fork | check_execve | trace_exit | sum/cmd | % base |
|---|---|---|---|---|---|
| one_level | 142 | 354 | 319 | 815 ns | 0.07 % |
| one_level+watch | 427 (+285) | 2603 (+2249) | 587 (+268) | 3616 ns | 0.32 % |
The emit costs ~2.25 µs on check_execve (ringbuf reserve + bprm
filename read + submit), plus ~285 ns on trace_fork (watch propagation)
and ~268 ns on trace_exit (the per-task GC delete) — ~4.4× the eBPF
program time, still 0.32 % of the fork+exec baseline. Note the
symmetry with the removed trace_execve (~2600 ns): the emit cost is
comparable, but it’s now scoped to a watched subtree and actually
consumed (audit logs / session kills) rather than a system-wide
firehose of unconsumed events — paid only where a profile is watching,
and it buys real non-proxy detection.
What the numbers say (lead with the absolute ns: the percentage’s denominator is a Python fork on one loaded host; a C/shell exec is faster, so treat the % as order-of-magnitude and the per-program ns as the portable figure):
- The enforcement-depth choice is not a perf differentiator. The only depth-dependent program is check_execve: ~321 ns (one_level) → ~449 ns (ancestry(8), +~120 ns for the 8-hop walk) → ~298 ns (propagate, an own-PID lookup ≈ a parent lookup). propagate adds +~50 ns/fork in trace_fork. These are low hundreds of nanoseconds, negligible against any real command’s process creation. So the “cost of a full tree-shake” that motivated making this configurable is, empirically, nothing to worry about.
- propagate’s +~50 ns is the per-fork threat_map gating lookup paid by every fork in propagate mode, not the mark-copy, which only runs for descendants of a flagged command (rare, and strictly costlier; out of scope for this always-paid measurement).
- trace_exit (~300 ns, now the largest single per-command program) runs the GC map_delete on every process exit in every mode (the leak-safe, mode-independent reclaim from Phase 2). Aggregate ~0.03 % CPU even at thousands of exits/s; the leak-safety is the right trade.
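The aggregate-CPU estimate for trace_exit is a one-line calculation: cost per event times events per second, as a share of one CPU-second. The helper name is illustrative, and the 1000 exits/s rate is an assumed workload used only to show the order of magnitude.

```python
def cpu_share_percent(ns_per_event, events_per_second):
    """Percent of one CPU-second consumed by a program costing ns_per_event per call."""
    return 100.0 * ns_per_event * events_per_second / 1e9


# ~300 ns per exit at an assumed 1000 exits/s.
print(cpu_share_percent(300, 1000))  # about 0.03 (percent of one core)
```

The share scales linearly with the exit rate, so even ten times that rate stays well under one percent of a single core.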
Numbers are kernel/host-specific; a committed baseline + CI gate needs a pinned runner first (same caveat as Tier A).
eBPF enforcement depth — the configurable feature
Today the LSM block is one level: threat_map[real_parent->tgid]
is checked on each execve (atty-guard/ebpf/atty_guard.bpf.c). We add
two deeper modes and make the active mode runtime-selectable.
The three modes
| Mode | Mechanism | Closes | Cost |
|---|---|---|---|
| one_level (default) | check immediate parent in threat_map | direct children of a marked PID | check_execve ~328 ns/execve (measured) |
| ancestry(N) | walk real_parent up to N hops (bounded loop, runtime break at configured depth), block if any ancestor is Critical | deeper descendants while the process chain is intact | the walk adds ~+111 ns/execve at N=8 (measured). Does not catch double-fork/daemonize (reparent to PID 1 severs the chain) |
| propagate_on_fork | sched_process_fork copies the parent’s mark to the child; sched_process_exit GCs the entry; LSM checks the process’s own (inherited) mark | all descendants incl. double-fork/daemonize (mark is copied before any reparenting) | per-fork threat_map gating lookup adds ~+49 ns/fork (measured; the mark-copy itself runs only under an active mark); map-pressure under fork bombs |
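The coverage difference between the three modes is easiest to see on a toy process tree. The model below is a userspace simulation written for illustration, not the BPF program: ProcModel, its methods, and the PIDs are hypothetical, a plain set stands in for threat_map, and "Critical" is reduced to set membership. It marks the shell PID in every mode, which matches the current shell-marking model rather than a future command-PID policy.

```python
INIT_PID = 1
MAX_ANCESTRY = 16  # compile-time loop bound named in this document


class ProcModel:
    """Toy process table: real_parent links plus a set standing in for threat_map."""

    def __init__(self, propagate=False):
        self.parent = {INIT_PID: 0}
        self.threat = set()
        self.propagate = propagate

    def fork(self, parent, child):
        self.parent[child] = parent
        # propagate_on_fork: the child inherits the parent's mark at fork time.
        if self.propagate and parent in self.threat:
            self.threat.add(child)

    def exit(self, pid):
        # Orphans are reparented to init, which severs the ancestry chain.
        for child, par in self.parent.items():
            if par == pid:
                self.parent[child] = INIT_PID
        del self.parent[pid]
        self.threat.discard(pid)  # GC of the exiting task's entry

    def blocked(self, pid, mode, depth=8):
        if mode == "propagate_on_fork":
            return pid in self.threat  # own (inherited) mark
        if mode == "one_level":
            depth = 1
        cur = self.parent.get(pid, 0)
        for i in range(MAX_ANCESTRY):  # bounded loop
            if i >= depth or cur == 0:  # runtime break at the configured depth
                break
            if cur in self.threat:
                return True
            cur = self.parent.get(cur, 0)
        return False


def scenario(propagate):
    m = ProcModel(propagate)
    m.fork(INIT_PID, 100)  # 100 = the shell
    m.threat.add(100)      # the shell gets marked
    m.fork(100, 200)       # direct child
    m.fork(200, 300)       # grandchild
    m.fork(300, 350)       # intermediate process that will exit
    m.fork(350, 400)       # daemonized descendant
    m.exit(350)            # 400 is now reparented to init
    return m


plain = scenario(propagate=False)
print(plain.blocked(200, "one_level"), plain.blocked(300, "one_level"))  # True False
print(plain.blocked(300, "ancestry"), plain.blocked(400, "ancestry"))    # True False
print(scenario(propagate=True).blocked(400, "propagate_on_fork"))        # True
```

The last two lines show the trade in the table: the ancestry walk loses the reparented process because its chain now ends at init, while the inherited mark survives the reparenting.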
Mechanism — one .bpf.o, runtime-switchable
A new config map drives the branch so we ship a single program, not three builds:
struct { __uint(type, BPF_MAP_TYPE_ARRAY); __uint(max_entries, 1);
__type(key, __u32); __type(value, struct enforce_cfg);
} enforce_cfg_map SEC(".maps");
// struct enforce_cfg { __u8 mode; __u8 max_depth; }
- LSM hook reads enforce_cfg: one_level → single parent lookup; ancestry → bounded walk to MAX_ANCESTRY (16) with a runtime if (i >= depth) break; check; propagate → check own pid. (Implemented as a plain bounded loop: clang wouldn’t fully unroll the pointer-chasing walk, but each bpf_core_read is a safe probe, so the verifier accepts the bounded form.)
- sched_process_fork / sched_process_exit programs are always attached. trace_fork early-returns unless mode == propagate, so it costs ~nothing in the other modes (confirmed: ~80 ns vs ~129 ns in propagate; see the Phase-3 results). trace_exit runs its GC map_delete in every mode (the mode-independent reclaim noted in the results).
- Map type (decided by measurement, not up front). Phase 2 keeps threat_map as a plain BPF_MAP_TYPE_HASH. Under a fork bomb in propagate mode it fails open at the cap (new descendants past 16 384 entries simply aren’t marked, never a wrongful block). Switching to BPF_MAP_TYPE_LRU_HASH (evict-oldest) trades that for evicting the root mark instead; which is better is a Phase-3 map_pressure question, so we don’t commit blind, consistent with the whole point of this suite.
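The two failure modes at the map cap can be contrasted with a small simulation. This is a userspace model under stated assumptions, not kernel behaviour: the class names are hypothetical, the cap is shrunk to 4 for readability, and the LRU model treats only insertion as a "use" with strict oldest-first eviction, whereas the kernel's LRU hash map is approximate and lookups also influence eviction.

```python
from collections import OrderedDict


class FailOpenHash:
    """Plain hash with a hard cap: inserts past the cap are refused."""

    def __init__(self, cap):
        self.cap = cap
        self.entries = {}

    def mark(self, pid):
        if pid in self.entries or len(self.entries) < self.cap:
            self.entries[pid] = True
            return True
        return False  # map full: the new descendant simply is not marked


class EvictOldestLru:
    """Strict LRU by insertion order: a full map evicts its oldest entry."""

    def __init__(self, cap):
        self.cap = cap
        self.entries = OrderedDict()

    def mark(self, pid):
        if pid in self.entries:
            self.entries.move_to_end(pid)
        elif len(self.entries) >= self.cap:
            self.entries.popitem(last=False)  # drop the oldest mark
        self.entries[pid] = True
        return True


ROOT = 1
hash_map, lru_map = FailOpenHash(cap=4), EvictOldestLru(cap=4)
for pid in [ROOT, 2, 3, 4, 5, 6]:  # root mark first, then a burst of descendants
    hash_map.mark(pid)
    lru_map.mark(pid)

print(sorted(hash_map.entries))  # [1, 2, 3, 4]
print(sorted(lru_map.entries))   # [3, 4, 5, 6]
```

In this model the plain hash keeps the root mark and leaves late descendants unmarked, while the LRU keeps the newest descendants and loses the root; which loss matters more is the question the planned map_pressure bench is meant to answer.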
Marking-model note (propagate mode)
one_level / ancestry mark the long-lived shell PID and gate its
(direct / ancestral) children — correct, because the mark is sticky only
between the flagged command and the next clean line
(security_guard.zig). propagate ideally marks the command’s PID
instead — propagating from the shell tags its entire future subtree for
the sticky window. Phase 2 ships the kernel mechanism + config and
benchmarks all three depths; it does not yet change the Zig proxy’s
shell-marking. That’s deliberate: per-mode overhead (the Phase-3
measurement that picks the default) is independent of which PID is
marked, and propagate’s proxy-side semantics (mark the command, clear on
its exit) is a focused follow-up tracked separately. Until then,
propagate is opt-in and labelled experimental.
Config surface
Daemon-side (atty-guard):
# /etc/atty-guard/config.toml
[enforcement]
depth = "one_level" # "one_level" | "ancestry" | "propagate_on_fork"
ancestry_max_depth = 8
Plumbed as a Rust config ([enforcement] depth / --enforcement-depth)
→ written to enforce_cfg on startup. Default is one_level — the
benchmark-confirmed choice (Phase 3 showed the deeper modes are
near-free, so the default is picked on coverage/precision, not speed; see
the decision framework). atty doctor surfacing of the active mode is a
tracked follow-up.
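The path from the [enforcement] settings to the two-byte enforce_cfg value can be sketched as follows. The real daemon is Rust; this Python version only illustrates the validation and packing step. The numeric mode encoding (0/1/2), the function name, and the rejection of depths outside 1..MAX_ANCESTRY are assumptions, since only the field layout (a __u8 mode followed by a __u8 max_depth) and the MAX_ANCESTRY bound of 16 are given in this document.

```python
import struct

MAX_ANCESTRY = 16  # loop bound named in this document

# Hypothetical numeric encoding of the depth setting.
MODES = {"one_level": 0, "ancestry": 1, "propagate_on_fork": 2}


def encode_enforce_cfg(depth="one_level", ancestry_max_depth=8):
    """Validate the [enforcement] settings and pack them as two unsigned bytes."""
    if depth not in MODES:
        raise ValueError(f"unknown enforcement depth: {depth!r}")
    if not 1 <= ancestry_max_depth <= MAX_ANCESTRY:
        raise ValueError(f"ancestry_max_depth must be in 1..{MAX_ANCESTRY}")
    return struct.pack("BB", MODES[depth], ancestry_max_depth)


print(encode_enforce_cfg("ancestry", 8))  # b'\x01\x08'
```

Rejecting an out-of-range depth at load time, rather than clamping it silently, keeps the configured value and the depth the kernel actually walks from drifting apart.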
Decision framework
Filled from the Phase-3 measurements above. The headline: per-mode overhead is not a performance differentiator (every mode is <0.5% of a real fork+execve), so the choice is about coverage vs. blast-radius, not speed.
| Use-case | Suggested mode | Rationale (from measurements) |
|---|---|---|
| Latency-sensitive / default | one_level | floor (check_execve ~328 ns/execve); blocks the flagged command itself (a direct child of the marked shell); precise: won’t sweep in the shell’s unrelated descendants during the sticky window |
| Defense-in-depth dev box | ancestry(8) | +~111 ns/execve (negligible) to also catch the flagged command’s descendants while the chain is intact |
| High-security / CI runner | propagate_on_fork | +~49 ns/fork; full containment incl. double-fork/daemonize. Opt-in/experimental until the proxy marks the command PID (see the marking-model note) |
Measured default: one_level. Not because the deeper modes are
costly (they aren’t), but because it does the least work per execve
(one parent lookup, no walk, no per-fork gating), it’s
backward-compatible, and it’s the precise match for the current
shell-marking model (block exactly the flagged commands, not their whole
subtree while the mark is sticky). The data’s real contribution is
removing performance as a reason to avoid ancestry/propagate: they
are near-free coverage upgrades an operator can opt into. The operator
picks; atty ships one_level.
There’s a second, non-cost reason a shallow default is sound: detection
is proxy-only. atty marks the shell when the prompt flags a typed
command, and that command is the shell’s direct child — so one_level
blocks it at the root and its subtree never spawns. The kernel does
not autonomously classify execves (the every-execve trace_execve
program was unconsumed and was removed). The deeper modes therefore close
no current gap — they’re infrastructure for a future command-PID
policy. The genuine current gap is detection of non-proxy chains
(python → node → exploit from a compromised dep, which never touches
the prompt) — no enforcement depth catches it (nothing gets marked);
it’s pinned by tests/sandbox/scenarios/58-ebpf-detection-gap. See the
enforcement-depth bullet in
operator-workflow.md (Threat model) for the
full reasoning.
Phasing
- ✅ Phase 1: harness. bench/ + zig build bench (Tier A), human table + --json. CI-safe. A committed baseline + a CI regression gate follow once a stable perf runner is picked (numbers are machine-specific, so a naive committed baseline would be noise).
- ✅ Phase 2: eBPF modes + config. enforce_cfg map, the three-mode LSM branch, sched_process_fork / _exit hooks, Rust [enforcement] config + --enforcement-depth CLI → the map. All three modes are behavior-validated under make sandbox-ebpf (not CI): 51 (one_level blocks a direct child), 55 (one_level allows a grandchild, ancestry blocks it), 56 (ancestry(2) allows a 4-deep descendant, propagate_on_fork blocks it). Deferred to follow-ups: the Zig command-pid marking for propagate (see the marking-model note), the LRU-vs-HASH map decision (Phase-3 map_pressure), and atty doctor surfacing of the active depth.
- ✅ Phase 3: Tier-B benches + the matrix. 57-ebpf-overhead measures per-mode ns/invocation via the kernel’s BPF run-time stats; the decision matrix + default are filled from it (one_level, confirmed by data). Still planned: map_pressure (which drives the LRU-vs-HASH call) and pty_roundtrip.
See also
- docs/operator-workflow.md: the eBPF Threat model & limitations (what one-level does/doesn’t stop).
- docs/security-guard-design.md: the V2-* tier architecture.
- atty-guard/ebpf/atty_guard.bpf.c: the LSM hook this extends.