Stop paying for stalls.
The same eBPF engine that powers the Sentry, retargeted at the memory wall. We hook sys_exit_ioctl, measure the nanosecond gap between successive MEMCPY calls, classify each workload as compute-bound or memory-bound on a roofline, and tell your CFO how much VRAM you can reclaim. Kernel-grade FinOps.
The Memory Wall.
Most GPU dollars are wasted not on the wrong model, but on the wrong regime. Below the knee of the roofline, your H100s sit waiting on PCIe.
Roofline classification puts every workload on a single chart: operational intensity (FLOP per byte) on the x-axis, achievable GFLOP/s on the y-axis. Two ceilings apply: bandwidth (sloped) and compute (flat). They meet at the knee. Workloads under the bandwidth roof are memory-bound: more silicon won’t help, but quantizing or evicting will. Workloads against the compute roof are compute-bound: that’s real work being done.
// Example · KV-prefill is the canonical memory-bound regime; GEMM saturates the compute roof
// each tick is a MEMCPY ioctl · red shading is GPU-stalled time on PCIe · the auditor totals it per-PID, per-shard
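The roofline test described above reduces to a few lines of arithmetic. The sketch below is a minimal illustration, not the auditor's implementation: the `Roofline` class and its field names are hypothetical, and the peak figures in the usage lines are round placeholders rather than H100 specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Roofline:
    """Two ceilings: a flat compute roof and a sloped bandwidth roof."""
    peak_gflops: float          # compute ceiling, GFLOP/s (placeholder value)
    peak_bandwidth_gbs: float   # bandwidth ceiling slope, GB/s (placeholder value)

    @property
    def knee(self) -> float:
        """Operational intensity (FLOP/byte) where the two ceilings meet."""
        return self.peak_gflops / self.peak_bandwidth_gbs

    def attainable_gflops(self, intensity: float) -> float:
        """Best achievable throughput at a given FLOP/byte."""
        return min(self.peak_gflops, self.peak_bandwidth_gbs * intensity)

    def classify(self, intensity: float) -> str:
        """Left of the knee is memory-bound; at or right of it is compute-bound."""
        return "memory-bound" if intensity < self.knee else "compute-bound"


# Placeholder ceilings chosen for easy arithmetic, not real hardware numbers.
roof = Roofline(peak_gflops=1000.0, peak_bandwidth_gbs=100.0)
low_intensity = roof.classify(2.0)    # well below the knee
high_intensity = roof.classify(50.0)  # well above the knee
```

With these placeholder ceilings the knee sits at 10 FLOP/byte, so a workload at 2 FLOP/byte is capped by bandwidth no matter how much compute is added.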
Compute / stall ratio.
The single most truthful number on a GPU bill. Time spent doing math, divided by total time: math plus waiting on memory.
We hook sys_exit_ioctl at the driver boundary and record a kernel timestamp at every MEMCPY. The nanosecond gap between successive copies is GPU-blocked time, which is dollars walking out the door. We aggregate per-PID, per-GPU, per-shard and report it as a single ratio. Above 0.9: you’re compute-bound and the cluster is sized correctly. Below 0.7: there’s capital trapped in the memory wall.
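As a rough sketch of the aggregation step, the code below sums compute and stall nanoseconds per (PID, GPU, shard) key and applies the 0.9 and 0.7 thresholds quoted above. It assumes the ratio is compute time as a share of total time, which is what thresholds between 0 and 1 suggest. The function names, the input tuple shape, and the "between thresholds" label are hypothetical; the text does not say how the 0.7 to 0.9 band is reported.

```python
from collections import defaultdict

COMPUTE_BOUND_ABOVE = 0.9   # threshold quoted in the text
MEMORY_WALL_BELOW = 0.7     # threshold quoted in the text


def compute_share(samples):
    """Aggregate (key, compute_ns, stall_ns) samples into a ratio per key.

    key is assumed to be a (pid, gpu, shard) tuple; the ratio is
    compute_ns / (compute_ns + stall_ns), an assumption about the metric.
    """
    totals = defaultdict(lambda: [0, 0])
    for key, compute_ns, stall_ns in samples:
        totals[key][0] += compute_ns
        totals[key][1] += stall_ns
    return {k: c / (c + s) for k, (c, s) in totals.items() if c + s > 0}


def verdict(ratio):
    """Map a ratio onto the two regimes named in the text."""
    if ratio > COMPUTE_BOUND_ABOVE:
        return "compute-bound"
    if ratio < MEMORY_WALL_BELOW:
        return "memory-bound"
    return "between thresholds"  # hypothetical label; not defined in the text


# Made-up samples: (pid, gpu, shard), compute_ns, stall_ns
samples = [
    ((101, 0, 0), 950, 50),
    ((101, 0, 0), 950, 50),
    ((202, 1, 0), 600, 400),
]
ratios = compute_share(samples)
verdicts = {key: verdict(r) for key, r in ratios.items()}
```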
Console alerts.
Specific recommendations.
The auditor doesn't just say 'memory-bound.' It tells you what to quantize, to what dtype, and what you'll get back, based on observed VRAM pressure, model architecture, and the KV-cache profile we measured live.
Potential monthly savings.
$39,250.
Illustrative scenario · sample 64×H100 vLLM cluster observed for seven days. 11 PIDs classified memory-bound · stall ratio 0.69 · KV-cache misuse on 3 shards · INT4 advisory issued. The full report names the workloads, the dtypes, the eviction policies, and the reclaim plan. Reclaim varies by fleet. Your engagement returns numbers measured against your traffic.
Try the math on your fleet.
Reclaim varies by fleet. Engagements typically surface double-digit VRAM headroom against the customer's traffic. Plug in your fleet to model the upside; we report measured numbers in your engagement.
Annual reclaimed capital.
@ 20% illustrative reclaim · model your fleet, then audit
// illustrative · the 20% input is a modeling rate, not a guarantee · actual savings depend on roofline classification, quantization headroom, and existing scheduler hygiene · requires Arca Efficiency Auditor engagement
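For readers who want to reproduce the arithmetic behind the calculator, here is a minimal sketch under stated assumptions. The function name and parameters are hypothetical, the hourly GPU rate is an input you supply (none is given in the text), and the 20% default is the illustrative modeling rate from the caption, not a measured result.

```python
HOURS_PER_YEAR = 24 * 365


def annual_reclaimed_capital(gpu_count, hourly_rate_usd, reclaim_rate=0.20):
    """Annual GPU spend multiplied by an assumed reclaim rate.

    reclaim_rate defaults to the 20% illustrative figure; it is a modeling
    input, not a guarantee.
    """
    if not 0.0 <= reclaim_rate <= 1.0:
        raise ValueError("reclaim_rate must be between 0 and 1")
    if gpu_count < 0 or hourly_rate_usd < 0:
        raise ValueError("gpu_count and hourly_rate_usd must be non-negative")
    annual_spend = gpu_count * hourly_rate_usd * HOURS_PER_YEAR
    return annual_spend * reclaim_rate


# Hypothetical fleet: 64 GPUs at a made-up $2.00 per GPU-hour.
estimate = annual_reclaimed_capital(gpu_count=64, hourly_rate_usd=2.00)
```

Swap in your own GPU count and blended hourly rate; the measured reclaim rate from an engagement replaces the 20% default.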
Four things you can verify.
The auditor ships with a technical appendix that names every probe, every measurement window, and every assumption. Read the spec.
Kernel-level timing · zero observer effect
We hook sys_exit_ioctl to measure driver-blocked time at the kernel boundary. The agent runs in user space and never sees the probe. Each eBPF probe invocation costs less than a microsecond, and the ring buffer absorbs ≥1M events/s without back-pressure, so the auditor itself doesn't tax the GPU you're trying to measure. Customer-specific overhead is reported in the integration brief.
Sharding-aware · auto-detects TP
The auditor walks the per-PID NCCL collective fingerprint and infers Tensor Parallelism degree across multi-GPU nodes. Roofline classification is computed per shard; TP=8 looks nothing like TP=1, and the report reflects that.
vLLM-aware · KV-cache vs weight-load
A logic gate distinguishes between weight loading (one-time MEMCPY at model load) and KV-cache pre-allocation (recurring per-request). Without this, the roofline lies. With it, you see exactly which workload class is memory-bound and what to quantize.
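The text does not describe how the gate decides, so the following is only a sketch of one possible rule: treat copies seen before a known model-load completion time as weight loading, and everything after as KV-cache traffic. The function name, the event shape, and the `load_done_ns` cutoff are all hypothetical.

```python
def split_copy_bytes(events, load_done_ns):
    """Split MEMCPY bytes into one-time weight loading and recurring KV-cache traffic.

    events: iterable of (timestamp_ns, nbytes) tuples (assumed shape).
    load_done_ns: assumed timestamp at which model load finished.
    """
    totals = {"weight_load": 0, "kv_cache": 0}
    for ts_ns, nbytes in events:
        label = "weight_load" if ts_ns < load_done_ns else "kv_cache"
        totals[label] += nbytes
    return totals


# Made-up trace: two large copies during load, then small recurring copies.
events = [(10, 4_000), (20, 4_000), (150, 64), (160, 64), (170, 64)]
split = split_copy_bytes(events, load_done_ns=100)
```

Only the `kv_cache` bucket should feed the roofline in this sketch; counting the one-time weight copies as recurring traffic would overstate memory pressure.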
Fleet-wide VRAM topology
Each Sentry exports arca_vram_topology_bytes: VRAM broken down by precision (FP32 / FP16 / INT8 / INT4) and category (model weights, KV-cache, activations, workspace). When Arca Nexus is in front of the fleet, you see the same breakdown rolled across every host, so the CFO's reclaimable-VRAM number sources from the same byte-level signal the host computes.
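The fleet-wide rollup can be pictured as a sum over per-host breakdowns keyed by precision and category. This is a minimal sketch: only the metric name and the two dimensions come from the text, while the `roll_up` function, the host names, and the sample byte counts are invented for illustration.

```python
from collections import Counter


def roll_up(host_samples):
    """Sum per-host arca_vram_topology_bytes breakdowns into one fleet view.

    host_samples: {host: {(precision, category): bytes}} (assumed shape).
    """
    fleet = Counter()
    for breakdown in host_samples.values():
        fleet.update(breakdown)  # Counter.update adds values key by key
    return fleet


# Made-up samples from two hypothetical hosts.
host_samples = {
    "host-a": {("FP16", "model weights"): 40, ("FP16", "KV-cache"): 25},
    "host-b": {("FP16", "KV-cache"): 30, ("INT8", "model weights"): 20},
}
fleet = roll_up(host_samples)
kv_cache_total = sum(b for (_, category), b in fleet.items() if category == "KV-cache")
```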