ARCA.VISION
// FEATURE · THE EFFICIENCY AUDITOR · PHASE 5

Stop paying for stalls.

The same eBPF engine that powers the Sentry, retargeted at the memory wall. We hook sys_exit_ioctl, measure the nanosecond gap between successive MEMCPY calls, classify each workload as compute-bound or memory-bound on a roofline, and tell your CFO how much VRAM you can reclaim. Kernel-grade FinOps.

// SECTION 01 · ROOFLINE · COMPUTE-BOUND vs MEMORY-BOUND

The Memory Wall.

Most GPU dollars are wasted not on the wrong model, but on the wrong regime. Below the knee of the roofline, your H100s sit waiting on PCIe.

Roofline classification puts every workload on a single chart: operational intensity (FLOP per byte) on the x-axis, attainable throughput (GFLOP/s) on the y-axis. Two ceilings apply: bandwidth (sloped) and compute (flat). They meet at the knee. Workloads left of the knee, under the bandwidth roof, are memory-bound: more silicon won’t help, but quantizing or evicting will. Workloads right of the knee, against the compute roof, are compute-bound: that’s real work being done.
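The classification rule above can be sketched in a few lines. This is a minimal illustration, not the auditor's implementation: the `Roofline` class and its field names are hypothetical, and the ceilings are placeholder values chosen for round arithmetic rather than measured bandwidth or vendor peaks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Roofline:
    """Two ceilings: a flat compute roof and a sloped bandwidth roof."""
    peak_gflops: float    # compute roof, GFLOP/s (placeholder value below)
    bandwidth_gbs: float  # memory roof slope, GB/s (placeholder value below)

    @property
    def knee(self) -> float:
        """Operational intensity (FLOP/byte) where the two ceilings meet."""
        return self.peak_gflops / self.bandwidth_gbs

    def attainable_gflops(self, intensity: float) -> float:
        """Attainable throughput is capped by whichever roof is lower."""
        return min(self.peak_gflops, self.bandwidth_gbs * intensity)

    def classify(self, intensity: float) -> str:
        """Left of the knee is memory-bound, at or right of it compute-bound."""
        return "memory-bound" if intensity < self.knee else "compute-bound"


# Placeholder ceilings, NOT measured or vendor-published figures.
roof = Roofline(peak_gflops=1000.0, bandwidth_gbs=100.0)

for name, intensity in [("low-intensity workload", 2.0), ("high-intensity workload", 50.0)]:
    print(name, roof.classify(intensity), roof.attainable_gflops(intensity))
```

With these placeholder ceilings the knee sits at 10 FLOP/byte, so a workload at 2 FLOP/byte is capped by bandwidth at 200 GFLOP/s no matter how much compute is added.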

Classification · per-PID, per-GPU, per-shard
Update cadence · 1 s rolling window
Memory roof · PCIe + HBM bandwidth measured live
Compute roof · vendor-published peak per dtype
// Figure · roofline chart · x-axis: FLOP / byte · y-axis: GFLOPs · ceilings: MEMORY WALL, compute roof · plotted: KV-prefill, MEMCPY, attn, GEMM

// Example · KV-prefill is the canonical memory-bound regime; GEMM saturates the compute roof

// Figure · sys_exit_ioctl MEMCPY timeline · x-axis: t (ns) · legend: ▎ MEMCPY tick · ▮ stall = waste (STALLED)

// each tick is a MEMCPY ioctl · red shading is GPU-stalled time on PCIe · the auditor totals it per-PID, per-shard

// SECTION 02 · NANOSECOND TIMING · sys_exit_ioctl

Compute / stall ratio.

The single most truthful number on a GPU bill. Time spent doing math, divided by total time: math plus waiting on memory.

We hook sys_exit_ioctl at the driver boundary and record a kernel timestamp at every MEMCPY. The nanosecond gap between successive copies is GPU-blocked time, which is dollars walking out the door. We aggregate per-PID, per-GPU, per-shard and report it as a single ratio. Above 0.9: you’re compute-bound and the cluster is sized correctly. Below 0.7: there’s capital trapped in the memory wall.
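A minimal sketch of the aggregation step, under stated assumptions: the ratio is read here as the share of the window not spent stalled, which is what the 0.9 and 0.7 thresholds suggest. The function names, the `(pid, stall_ns)` event shape, and the "borderline" label for the band between the thresholds are all hypothetical.

```python
from collections import defaultdict


def efficiency_by_pid(stalls, window_ns):
    """stalls: iterable of (pid, stall_ns) gaps observed inside one window.

    Returns {pid: fraction of the window NOT spent stalled}.
    Assumption: the ratio is a compute fraction in the range 0.0 to 1.0.
    """
    stalled = defaultdict(int)
    for pid, stall_ns in stalls:
        stalled[pid] += stall_ns
    return {pid: max(0.0, 1.0 - total / window_ns) for pid, total in stalled.items()}


def verdict(ratio):
    """Apply the thresholds quoted above; the middle label is an assumption."""
    if ratio > 0.9:
        return "compute-bound"
    if ratio < 0.7:
        return "memory-bound"
    return "borderline"


# One-second window; stall gaps in nanoseconds (made-up sample values).
WINDOW_NS = 1_000_000_000
sample = [(101, 30_000_000), (101, 20_000_000), (202, 200_000_000), (202, 110_000_000)]

ratios = efficiency_by_pid(sample, WINDOW_NS)
for pid, ratio in sorted(ratios.items()):
    print(pid, round(ratio, 2), verdict(ratio))
```

In this made-up sample, PID 101 stalls for 50 ms of the second and lands above 0.9, while PID 202 stalls for 310 ms and lands below 0.7.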

$arca_gpu_compute_efficiency_ratio · 1m window · per-PID
// SECTION 03 · QUANTIZATION ADVISOR

Console alerts.
Specific recommendations.

The auditor doesn't just say 'memory-bound.' It tells you what to quantize, to what dtype, and what you'll get back, based on observed VRAM pressure, model architecture, and the KV-cache profile we measured live.

VRAM PRESSURE · 86% · CRITICAL
// RECOMMENDATION · FP16 → INT4 (GGUF Q4_K_M)
+38% headroom · −2.1% perplexity

VRAM PRESSURE · 71% · ADVISORY
// RECOMMENDATION · FP16 → INT8 (AWQ)
+22% headroom · −0.6% perplexity

VRAM PRESSURE · 92% · CRITICAL
// RECOMMENDATION · FP16 → INT4 + KV-cache eviction policy
+44% headroom · zero accuracy delta
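The headroom figures on these cards come from measured VRAM pressure and KV-cache profiles, which a sketch cannot reproduce. The back-of-envelope arithmetic behind a dtype change can still be illustrated. The function below is hypothetical and assumes nominal bit widths; real formats such as GGUF Q4_K_M store scale metadata alongside the weights, so their effective bits per weight are somewhat higher than 4.

```python
# Nominal bits per weight. Assumption: packing and scale overhead are ignored.
BITS = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def reclaimed_share(weight_share, src, dst):
    """Fraction of total VRAM freed when weights move from src to dst dtype.

    weight_share: fraction of VRAM currently held by model weights (0.0 to 1.0).
    """
    return weight_share * (1 - BITS[dst] / BITS[src])


# Hypothetical host where weights occupy half of VRAM.
to_int8 = reclaimed_share(0.5, "FP16", "INT8")
to_int4 = reclaimed_share(0.5, "FP16", "INT4")
print(f"FP16 -> INT8 frees {to_int8:.1%} of VRAM")
print(f"FP16 -> INT4 frees {to_int4:.1%} of VRAM")
```

The weight share is the lever: the same dtype change frees less on a host where KV-cache, not weights, dominates VRAM.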
// SECTION 04 · THE WASTE SCORE

Potential monthly savings.
$39,250.

Illustrative scenario · sample 64×H100 vLLM cluster observed for seven days. 11 PIDs classified memory-bound · compute/stall ratio 0.69 · KV-cache misuse on 3 shards · INT4 advisory issued. The full report names the workloads, the dtypes, the eviction policies, and the reclaim plan. Reclaim varies by fleet. Your engagement returns numbers measured against your traffic.

MEMORY-BOUND PIDs · 11
COMPUTE / STALL RATIO · p50 · 0.69
RECLAIMABLE VRAM · 31%
ANNUAL · ILLUSTRATIVE · $471k
// SECTION 05 · ROI CALCULATOR

Try the math on your fleet.

Reclaim varies by fleet. Engagements typically surface double-digit VRAM headroom against the customer's traffic. Plug in your fleet to model the upside; we report measured numbers in your engagement.

// SAVINGS CALCULATOR

Annual reclaimed capital.

@ 20% illustrative reclaim · model your fleet, then audit

// ESTIMATED ANNUAL RECLAIMED CAPITAL · $470,938 / yr · 64 × NVIDIA H100 (80 GB) · $4 / hr · 20% reclaimable

// illustrative · the 20% input is a modeling rate, not a guarantee · actual savings depend on roofline classification, quantization headroom, and existing scheduler hygiene · requires Arca Efficiency Auditor engagement
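The calculator's arithmetic can be approximated as GPU count × hourly rate × hours per year × reclaim rate. This is a sketch under assumptions: the function name is hypothetical and the year is taken as 8,760 hours. Under that assumption, 64 GPUs at $4 / hr and a 20% rate come to $448,512, while the $470,938 figure is reproduced by a 21% rate, so the calculator's exact inputs and rounding may differ from this model.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: 8,760 hours, no leap-year adjustment


def annual_reclaimed_capital(gpus, usd_per_gpu_hour, reclaim_rate):
    """Annual spend on the fleet multiplied by the modeled reclaim rate."""
    return gpus * usd_per_gpu_hour * HOURS_PER_YEAR * reclaim_rate


# Fleet inputs taken from the calculator line above; the rate is varied.
at_20 = round(annual_reclaimed_capital(64, 4.0, 0.20))
at_21 = round(annual_reclaimed_capital(64, 4.0, 0.21))
print(f"20% reclaim: ${at_20:,} / yr")
print(f"21% reclaim: ${at_21:,} / yr")
```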

// SECTION 06 · TECHNICAL PROOF

Four things you can verify.

The auditor ships with a technical appendix that names every probe, every measurement window, and every assumption. Read the spec.

PROOF 01

Kernel-level timing · zero observer effect

We hook sys_exit_ioctl to measure driver-blocked time at the kernel boundary. The agent runs in user space and never sees the probe. Each eBPF probe adds sub-microsecond overhead per event, and the ring buffer absorbs ≥1M events/s without back-pressure, so the auditor itself doesn't tax the GPU you're trying to measure. Customer-specific overhead is reported in the integration brief.

PROOF 02

Sharding-aware · auto-detects TP

The auditor walks the per-PID NCCL collective fingerprint and infers Tensor Parallelism degree across multi-GPU nodes. Roofline classification is computed per shard; TP=8 looks nothing like TP=1, and the report reflects that.

PROOF 03

vLLM-aware · KV-cache vs weight-load

A logic gate distinguishes between weight loading (one-time MEMCPY at model load) and KV-cache pre-allocation (recurring per-request). Without this, the roofline lies. With it, you see exactly which workload class is memory-bound and what to quantize.
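One way such a gate could work, as a sketch only: treat copies that happen during a startup window and whose size never recurs afterwards as weight loading, and everything else as KV-cache traffic. The heuristic, the function name, and the event shape are assumptions made for illustration, not the auditor's actual logic.

```python
def label_copies(events, load_window_ns):
    """Label each MEMCPY for one PID as 'weight-load' or 'kv-cache'.

    events: list of (t_ns, nbytes), with t_ns relative to process start.
    Hypothetical heuristic: a copy is a weight load if it falls inside the
    startup window AND no copy of the same size occurs after that window.
    """
    recurring_sizes = {nbytes for t_ns, nbytes in events if t_ns >= load_window_ns}
    labels = []
    for t_ns, nbytes in events:
        is_load = t_ns < load_window_ns and nbytes not in recurring_sizes
        labels.append("weight-load" if is_load else "kv-cache")
    return labels


# Made-up trace: two large one-time copies, then small recurring ones.
trace = [
    (0, 4_000_000_000),
    (1_000, 4_000_000_000),
    (5_000_000_000, 2_097_152),
    (6_000_000_000, 2_097_152),
]
labels = label_copies(trace, load_window_ns=2_000_000_000)
print(labels)
```

Separating the two classes matters because a one-time weight load would otherwise inflate the stall total for a workload that is not memory-bound in steady state.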

PROOF 04

Fleet-wide VRAM topology

Each Sentry exports arca_vram_topology_bytes: VRAM broken down by precision (FP32 / FP16 / INT8 / INT4) and category (model weights, KV-cache, activations, workspace). When Arca Nexus is in front of the fleet, you see the same breakdown rolled across every host, so the CFO's reclaimable-VRAM number sources from the same byte-level signal the host computes.
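The fleet roll-up can be pictured as a sum over per-host breakdowns keyed by precision and category. A minimal sketch, assuming each host's export has been parsed into a dict keyed by `(precision, category)`; that shape, the host names, and the byte counts are hypothetical.

```python
from collections import Counter


def roll_up(hosts):
    """Sum per-host VRAM breakdowns into one fleet-wide breakdown.

    hosts: {host_name: {(precision, category): bytes}}  (assumed shape)
    """
    total = Counter()
    for breakdown in hosts.values():
        total.update(breakdown)  # Counter.update adds values key by key
    return total


def by_precision(breakdown):
    """Collapse the category dimension to get bytes per precision."""
    out = Counter()
    for (precision, _category), nbytes in breakdown.items():
        out[precision] += nbytes
    return out


# Made-up sample values for two hosts.
fleet = roll_up({
    "host-a": {("FP16", "model weights"): 40_000, ("FP16", "KV-cache"): 20_000},
    "host-b": {("FP16", "model weights"): 40_000, ("INT8", "KV-cache"): 10_000},
})
print(fleet[("FP16", "model weights")])
print(dict(by_precision(fleet)))
```

Because the fleet view is a plain sum of the host-level byte counts, any fleet number can be traced back to the hosts that contributed to it.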

// SCOPE THE AUDIT

Reclaim your VRAM.
Shrink your cluster,
not your performance.