Benchmark Harness (v0.4)

Source checkout only. python -m examples.cli.benchmark_harness requires the repository examples/ package and docs/benchmarks/. It is not shipped in the PyPI package.

The v0.4 benchmark harness treats always-on monitoring as an operability budget that is benchmarked and enforced on every run, not just as a point-in-time measurement.

It still supports the same two gate modes:

budget: compare current metrics to absolute max thresholds
regression: compare current metrics to a checked-in baseline plus allowed deltas
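The difference between the two gate modes can be sketched as follows. This is a minimal illustration, not the harness's actual code: the function names are hypothetical, and it assumes that budgets, baselines, and tolerances are flat metric-name-to-number mappings and that tolerances are additive deltas. The real JSON assets may be structured differently.

```python
from typing import Dict, List


def budget_failures(metrics: Dict[str, float], budgets: Dict[str, float]) -> List[str]:
    """Budget mode: a metric fails when it exceeds its absolute maximum."""
    return [
        f"{name}: {metrics[name]} > max {limit}"
        for name, limit in budgets.items()
        if name in metrics and metrics[name] > limit
    ]


def regression_failures(
    metrics: Dict[str, float],
    baseline: Dict[str, float],
    tolerances: Dict[str, float],
) -> List[str]:
    """Regression mode: a metric fails when it exceeds baseline + allowed delta."""
    failures = []
    for name, base in baseline.items():
        if name not in metrics:
            continue
        delta = tolerances.get(name, 0.0)  # assumption: missing tolerance means no slack
        if metrics[name] > base + delta:
            failures.append(f"{name}: {metrics[name]} > baseline {base} + delta {delta}")
    return failures
```

A metric can pass one gate and fail the other: a value well under its absolute budget can still be a regression if it moved far above the checked-in baseline.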

On top of that, v0.4 extends the benchmark coverage to:

gpumemprof CPU fallback (gpumemprof_cpu)
tfmemprof --device /CPU:0 (tfmemprof_cpu)
accelerated soak qualification
rollover and retention validation
history truncation diagnostics
actionable failure reporting

Default operating assumptions

The harness models the always-on default runtime mode as:

track with append-only sink enabled
flush_every_seconds=2.0
rollover_max_bytes=64 MB
retention_max_files=8
retention_max_total_bytes=512 MB

Retention validation also runs a forced-churn subtest with tighter limits so rollover and pruning are exercised even in fast local runs.
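To make the retention limits concrete, here is a rough sketch of oldest-first pruning under a file-count cap and a total-byte cap. The function name and the oldest-first policy are assumptions for illustration, as is treating "MB" as 1024 * 1024 bytes. The runtime's real sink may prune differently.

```python
from typing import List, Tuple

MB = 1024 * 1024  # assumption: the doc's "MB" is interpreted as MiB here


def prune_segments(
    sizes_oldest_first: List[int],
    max_files: int = 8,
    max_total_bytes: int = 512 * MB,
) -> Tuple[List[int], int, int]:
    """Drop the oldest segments until both retention limits hold.

    Returns (retained_sizes, pruned_segment_count, pruned_bytes).
    """
    retained = list(sizes_oldest_first)
    pruned_count = 0
    pruned_bytes = 0
    while retained and (len(retained) > max_files or sum(retained) > max_total_bytes):
        pruned_bytes += retained.pop(0)  # oldest segment goes first
        pruned_count += 1
    return retained, pruned_count, pruned_bytes


# Ten full 64 MB segments under the default limits: the two oldest are pruned.
retained, pruned_count, pruned_bytes = prune_segments([64 * MB] * 10)
```

With the defaults, eight full 64 MB segments add up to exactly 512 MB, so the file limit and the byte limit are reached at the same point. The forced-churn subtest corresponds to calling the same logic with much smaller limits.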

Profiles

pr: accelerated 6h-equivalent soak plus default-interval overhead checks
nightly: accelerated 24h-equivalent soak plus the same overhead checks

“24h-equivalent” means the harness does not sleep between samples. Instead, it collects the same number of samples that a 24-hour run would emit at the runtime’s default interval:

gpumemprof_cpu: 864000 samples at 0.1s
tfmemprof_cpu: 86400 samples at 1.0s
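The sample counts above follow directly from the duration and the sampling interval. A small sketch of the arithmetic (the helper name is hypothetical, and applying the same rule to the 6h-equivalent pr profile is an assumption):

```python
def equivalent_samples(hours: float, interval_seconds: float) -> int:
    """Number of samples a run of `hours` would emit at `interval_seconds`."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # round() absorbs floating-point error from intervals such as 0.1
    return round(hours * 3600 / interval_seconds)


nightly_gpumemprof = equivalent_samples(24, 0.1)  # 864000
nightly_tfmemprof = equivalent_samples(24, 1.0)   # 86400
pr_gpumemprof = equivalent_samples(6, 0.1)        # 216000
pr_tfmemprof = equivalent_samples(6, 1.0)         # 21600
```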

Modes

overhead: run only the unprofiled vs tracked overhead comparison
soak: run only the accelerated soak and retention validation
all: run both

Run the harness

python -m examples.cli.benchmark_harness \
  --profile pr \
  --mode all \
  --output artifacts/benchmarks/latest_v0.4.json

Enforce the regression gate

python -m examples.cli.benchmark_harness \
  --check \
  --profile pr \
  --mode all \
  --gate-mode regression \
  --iterations 5000 \
  --baseline docs/benchmarks/v0.4_baseline.json \
  --tolerances docs/benchmarks/v0.4_tolerances.json \
  --output artifacts/benchmarks/latest_v0.4_regression.json

This is the policy used by the pull-request memory gate in CI. The checked-in regression assets intentionally cover only the default pr profile.

Enforce budgets

python -m examples.cli.benchmark_harness \
  --check \
  --profile pr \
  --mode all \
  --gate-mode budget \
  --iterations 5000 \
  --budgets docs/benchmarks/v0.4_operating_budget.json \
  --output artifacts/benchmarks/latest_v0.4_budget.json

Use budget mode when you want a short benchmark run checked against absolute operating thresholds rather than baseline deltas.

Nightly operating-budget gate

python -m examples.cli.benchmark_harness \
  --check \
  --profile nightly \
  --mode all \
  --gate-mode budget \
  --iterations 5000 \
  --budgets docs/benchmarks/v0.4_operating_budget.json \
  --output artifacts/benchmarks/latest_v0.4_nightly.json

This keeps budget enforcement in place for the longer nightly soak profile, so the short-run checks and long-run checks use the same benchmark policy model.

What it measures

runtime_overhead_pct: wall-clock overhead of the tracked default mode vs the unprofiled workload.
cpu_overhead_pct: CPU-time overhead of the tracked default mode vs the unprofiled workload.
artifact_growth_bytes: tracked-output size minus the unprofiled output size.
rss_growth_per_24h_equiv: RSS delta normalized to a 24-hour-equivalent run.
max_rss_delta_bytes: largest observed RSS increase above the soak baseline.
final_retained_files: retained append-only segment count after pruning.
final_retained_bytes: retained append-only bytes after pruning.
rollover_count, pruned_segment_count, pruned_bytes: sink churn under sustained load.
history_dropped_*: bounded-history eviction counts surfaced by the runtime.
collector_failure_event_count: degraded/recovered collector transitions seen during the run.
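As a rough guide to how the percentage and normalized metrics can be read, the sketch below shows one plausible way to compute them. The exact formulas are assumptions: the source only says that overhead compares the tracked mode to the unprofiled workload and that RSS growth is normalized to a 24-hour-equivalent run. The function names and the linear scaling are illustrative.

```python
def overhead_pct(tracked: float, unprofiled: float) -> float:
    """Relative overhead of the tracked run vs the unprofiled run, in percent.

    Works for wall-clock seconds (runtime_overhead_pct) or CPU seconds
    (cpu_overhead_pct), assuming a simple relative-difference definition.
    """
    if unprofiled <= 0:
        raise ValueError("unprofiled measurement must be positive")
    return (tracked - unprofiled) / unprofiled * 100.0


def rss_growth_24h_equiv(
    rss_start_bytes: int,
    rss_end_bytes: int,
    samples_collected: int,
    samples_per_24h: int,
) -> float:
    """Scale an observed RSS delta linearly to a 24h-equivalent sample count."""
    if samples_collected <= 0:
        raise ValueError("samples_collected must be positive")
    return (rss_end_bytes - rss_start_bytes) * samples_per_24h / samples_collected
```

For example, under these assumptions a 6h-equivalent soak of 216000 samples that grows RSS by 600 bytes would be reported as 2400 bytes per 24h-equivalent at 864000 samples.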

Output format

The v0.4 report includes:

profile, mode, gate_mode
config: comparison config plus runtime-specific sample counts
runtimes: per-runtime overhead, soak, retention-validation, and diagnostic data
metrics: flattened per-runtime metrics used for gating
budget_checks or regression_checks
failure_diagnostics: actionable failures with collector, sink, and history context
passed
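A report can be inspected with a few lines of Python. This sketch relies only on the top-level keys listed above; it assumes failure_diagnostics is a list of JSON-serializable entries and does not assume anything about their inner fields. The function name is hypothetical.

```python
import json
from typing import Any, Dict, List


def summarize_report(report: Dict[str, Any]) -> List[str]:
    """Build a short, human-readable summary from a harness report dict."""
    lines = [
        f"profile={report.get('profile')} mode={report.get('mode')} "
        f"gate_mode={report.get('gate_mode')}",
        f"passed={report.get('passed')}",
    ]
    # Assumption: each diagnostic entry is JSON-serializable; print it verbatim.
    for entry in report.get("failure_diagnostics") or []:
        lines.append("FAIL " + json.dumps(entry, sort_keys=True))
    return lines


# Usage against a report written by --output, for example:
#   with open("artifacts/benchmarks/latest_v0.4.json") as fh:
#       print("\n".join(summarize_report(json.load(fh))))
```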

Interpreting failures

Failure lines are intentionally verbose. A budget or regression failure includes:

the failing metric and threshold
the runtime name
collector health state
collector failure count
rollover and prune counts
retained file and byte totals
retained and dropped history counters

Typical examples:

overhead regression: runtime or CPU overhead jumped materially above baseline
retention failure: retained files or bytes exceeded the configured sink budget
collector failure: degraded-mode transitions occurred during the soak
history drift: dropped-event or dropped-sample counts grew beyond the expected envelope

Tuning order

When a run fails, adjust knobs in this order:

1. sampling interval
2. sink flush cadence
3. rollover size or rollover event count
4. retention file and byte limits
5. TensorFlow max_history if sample/event windows are too large for the deployment

Versioned assets

The v0.4 harness reads:

docs/benchmarks/v0.4_operating_budget.json
docs/benchmarks/v0.4_baseline.json
docs/benchmarks/v0.4_tolerances.json

Update these files only as part of an intentional benchmark refresh: run the harness with the same profile and config as CI, inspect the new metrics, then commit the asset update separately from unrelated code changes.