Always-on Tracking

Use this recipe when you want Stormlog to behave like an operational service: bounded history in memory, append-only sink artifacts on disk, and enough session metadata to reconstruct one run later.

Audience: operators, platform owners. Difficulty: intermediate.

Prerequisites

install the package first; see Installation
run pip install "stormlog[torch]" to get gpumemprof track
run pip install "stormlog[tf]" to get tfmemprof track
see the Command Line Guide if you need a per-flag reference
a writable artifact directory for sink files
enough runtime permissions to inspect the target device backend

Success signal:

a sink manifest is created
analyze can reload the sink
collector health and retention counters are visible in the output

When this is the right recipe

you want long-running track sessions
you need rollover and retention limits on artifacts
you want the run to stay alive when a collector becomes unhealthy
you want a stable path from live telemetry to later analysis or TUI loading

PyTorch always-on baseline

gpumemprof track \
  --interval 0.5 \
  --warning-threshold 75 \
  --critical-threshold 90 \
  --telemetry-sink-dir ./live_sink \
  --telemetry-flush-seconds 2.0 \
  --telemetry-rollover-mb 64 \
  --telemetry-retention-files 8 \
  --telemetry-retention-total-mb 512

What this gives you:

append-only JSONL sink segments plus a manifest
one session identity for the run
rollover and pruning under a bounded artifact budget
collector_degraded and collector_recovered events instead of synthetic zero samples
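
The rollover and pruning behavior can be pictured with a small simulation. This is a minimal sketch, not Stormlog's implementation: the `Segment` type, the `prune_segments` function, the segment file names, and the oldest-first pruning order are all assumptions made for illustration. Only the two limits mirror the --telemetry-retention-files and --telemetry-retention-total-mb flags above.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for one rolled-over sink file."""
    name: str
    size_bytes: int


def prune_segments(segments, max_files, max_total_bytes):
    """Drop the oldest segments until both retention limits hold.

    `segments` is ordered oldest first. Returns (retained, pruned).
    Assumption: the newest segment is always kept, even if it alone
    exceeds the byte budget.
    """
    retained = list(segments)
    pruned = []
    while len(retained) > 1 and (
        len(retained) > max_files
        or sum(s.size_bytes for s in retained) > max_total_bytes
    ):
        pruned.append(retained.pop(0))
    return retained, pruned


MB = 1024 * 1024
# Ten full 64 MB segments against limits of 8 files and 512 MB in total.
segments = [Segment(f"segment-{i:04d}.jsonl", 64 * MB) for i in range(10)]
retained, pruned = prune_segments(segments, max_files=8, max_total_bytes=512 * MB)
print(len(retained), len(pruned))
```

With the baseline numbers, eight full 64 MB segments add up to exactly 512 MB, so the file-count limit and the byte limit are reached at the same point.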

Add workload phases when timestamps are not enough

Use structured phases when you want long-running artifacts to answer what part of the workload was active when a hidden-memory anomaly appeared.

from stormlog import MemoryTracker

tracker = MemoryTracker(
    sampling_interval=0.5,
    # telemetry_sink_config=...,  # Optional: configure the append-only sink.
)

tracker.start_tracking()

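# Assumes num_epochs, loader (an iterator over batches), model, and
# optimizer are already defined by your training script.
for epoch in range(num_epochs):
    with tracker.phase("train", metadata={"epoch": epoch}):
        with tracker.phase("load_batch"):
            batch = next(loader)
        with tracker.phase("forward"):
            loss = model(batch).sum()
        with tracker.phase("backward"):
            # Clear stale gradients so they do not accumulate across steps.
            optimizer.zero_grad()
            loss.backward()
        with tracker.phase("optimizer_step"):
            optimizer.step()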

tracker.stop_tracking()

What changes when phases are present:

track writes companion phase_enter / phase_exit records with deterministic nested paths
gpumemprof analyze adds phase-aware summaries beside timestamps
the TUI Diagnostics tab shows the first anomaly phase path for each rank
when you omit instrumentation entirely, the same workflow stays valid and low-overhead
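
To make "deterministic nested paths" concrete, here is a minimal sketch of how nested context managers can produce enter and exit records. It does not use Stormlog: `PhaseRecorder`, the `/` path separator, and the record layout are hypothetical and only illustrate the idea.

```python
from contextlib import contextmanager


class PhaseRecorder:
    """Hypothetical recorder that mimics nested phase bookkeeping."""

    def __init__(self):
        self._stack = []   # names of currently open phases, outermost first
        self.records = []  # (event, path) tuples in emission order

    @contextmanager
    def phase(self, name):
        self._stack.append(name)
        path = "/".join(self._stack)
        self.records.append(("phase_enter", path))
        try:
            yield path
        finally:
            # Emit the exit record even if the body raises.
            self.records.append(("phase_exit", path))
            self._stack.pop()


recorder = PhaseRecorder()
with recorder.phase("train"):
    with recorder.phase("forward"):
        pass
    with recorder.phase("backward"):
        pass

for event, path in recorder.records:
    print(event, path)
```

Because the path is derived only from the stack of open phases, the same code structure yields the same paths on every run, which is what lets an analysis step group samples by phase.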

TensorFlow always-on baseline

tfmemprof track \
  --interval 1.0 \
  --threshold 4096 \
  --device /CPU:0 \
  --output ./tf_track.json \
  --telemetry-sink-dir ./tf_live_sink \
  --telemetry-flush-seconds 2.0 \
  --telemetry-rollover-mb 64 \
  --telemetry-retention-files 8 \
  --telemetry-retention-total-mb 512

Use /GPU:0 instead of /CPU:0 when the TensorFlow runtime has a GPU device available. Stop the command cleanly with Ctrl+C when you want it to flush the final output file and session summary.

Inspect the latest clean session

gpumemprof analyze ./live_sink --format txt --output ./live_analysis.txt
tfmemprof analyze --input ./tf_track.json --detect-leaks --optimize --report ./tf_report.txt

For PyTorch sink directories, default session selection prefers the newest clean completed session and falls back to interrupted or incomplete sessions only when needed.
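
The selection rule can be sketched as follows. The session dictionaries, the `started_at` field, and the status strings are assumptions for illustration; the actual manifest layout and status values are not shown in this recipe.

```python
def pick_default_session(sessions):
    """Prefer the newest completed session; otherwise the newest of any status."""
    if not sessions:
        return None
    clean = [s for s in sessions if s["status"] == "completed"]
    candidates = clean or sessions  # fall back only when nothing is clean
    return max(candidates, key=lambda s: s["started_at"])


# Hypothetical sessions discovered in one reused sink directory.
sessions = [
    {"session_id": "a", "status": "completed", "started_at": 100.0},
    {"session_id": "b", "status": "completed", "started_at": 200.0},
    {"session_id": "c", "status": "interrupted", "started_at": 300.0},
]
chosen = pick_default_session(sessions)
print(chosen["session_id"])
```

Under this rule the interrupted session "c" is skipped even though it is the newest, which is why a reused sink directory can surface an older run than you expect.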

What to watch during long runs

Treat these values as operational signals, not just debug trivia:

rollover_count
pruned_segment_count
pruned_bytes
final_retained_files
final_retained_bytes
history_retained_*
history_dropped_*
collector_failure_event_count
session_status
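
The history_retained_* and history_dropped_* counters describe a bounded in-memory buffer. A minimal sketch of that bookkeeping, assuming a simple drop-oldest policy (the `BoundedHistory` class and the policy itself are illustrative, not Stormlog internals):

```python
from collections import deque


class BoundedHistory:
    """Keep at most `max_samples` items and count what was evicted."""

    def __init__(self, max_samples):
        self._samples = deque(maxlen=max_samples)
        self.dropped = 0

    def append(self, sample):
        # A full deque silently evicts its oldest item, so count it first.
        if len(self._samples) == self._samples.maxlen:
            self.dropped += 1
        self._samples.append(sample)

    @property
    def retained(self):
        return len(self._samples)


history = BoundedHistory(max_samples=3)
for value in range(5):
    history.append(value)
print(history.retained, history.dropped)
```

A steadily rising dropped count on a long run is expected; it only tells you that older samples now live in the sink artifacts rather than in memory.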

How to interpret degraded mode

If the collector becomes unhealthy during track:

the process keeps running
new sample emission pauses until recovery
status events remain visible in the artifact stream
the final report should still show the collector-health transition history
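
The behavior above can be sketched as a sampling loop that emits status events instead of placeholder samples. This is an illustration under stated assumptions: `run_sampler`, `flaky_read`, and the record layout are hypothetical, and only the event names collector_degraded and collector_recovered come from this recipe.

```python
def run_sampler(read_fn, ticks):
    """Poll `read_fn` once per tick and return the emitted records.

    A failed read produces one collector_degraded event, then nothing
    until a read succeeds again, at which point collector_recovered
    is emitted and sampling resumes.
    """
    records = []
    healthy = True
    for tick in range(ticks):
        try:
            value = read_fn(tick)
        except RuntimeError as exc:
            if healthy:
                records.append(
                    {"event": "collector_degraded", "tick": tick, "reason": str(exc)}
                )
                healthy = False
            continue  # no synthetic zero sample while degraded
        if not healthy:
            records.append({"event": "collector_recovered", "tick": tick})
            healthy = True
        records.append({"event": "sample", "tick": tick, "value": value})
    return records


def flaky_read(tick):
    """Hypothetical collector that fails on ticks 2 and 3."""
    if tick in (2, 3):
        raise RuntimeError("backend unavailable")
    return 1000 + tick


records = run_sampler(flaky_read, ticks=6)
print([r["event"] for r in records])
```

The gap between the degraded and recovered events is visible as missing samples rather than as zeros, so later analysis cannot mistake an outage for an idle device.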

Treat either of these as actionable:

non-zero collector_failure_event_count
any final collector state other than healthy
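
A post-run gate for these two conditions might look like the sketch below. It assumes the summary is available as a flat dictionary with the keys `collector_failure_event_count` and `collector_health_status`; both names appear in this recipe, but how Stormlog exposes them (file, nesting, exact values) is not specified here, so treat the layout as hypothetical.

```python
def collector_findings(summary):
    """Return human-readable reasons the run needs attention (empty if none)."""
    findings = []
    failures = summary.get("collector_failure_event_count", 0)
    if failures:
        findings.append(f"collector_failure_event_count={failures}")
    state = summary.get("collector_health_status")
    # A missing status is treated as not healthy, on purpose.
    if state != "healthy":
        findings.append(f"final collector state is {state!r}")
    return findings


clean_run = {"collector_failure_event_count": 0, "collector_health_status": "healthy"}
bad_run = {"collector_failure_event_count": 3, "collector_health_status": "degraded"}
print(collector_findings(clean_run))
print(collector_findings(bad_run))
```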

Troubleshooting

Symptom: sink files grow too quickly

Likely cause: retention is too loose for the deployment budget.
Fix: tighten retention and rollover settings before lowering sample fidelity.
Verify: final_retained_*, pruned_*, and rollover_count stabilize.

Symptom: tracking stays alive but live telemetry looks partial

Likely cause: the collector entered degraded mode.
Fix: inspect collector_failure_event_count and the emitted status events.
Verify: collector_health_status returns to healthy and sampling resumes.

Symptom: analysis loads the wrong run from a reused sink directory

Likely cause: more than one session is present in the sink.
Fix: inspect the discovered sessions and target the session you want explicitly.
Verify: the selected session id matches the intended run metadata.

What to do next

If artifact growth exceeds the budget, tighten retention before you reduce sampling frequency (that is, before you raise --interval).
If collector_failure_event_count is non-zero, move to the Incident Playbooks degraded-collector checklist.
If the run needs to be qualified for CI or release use, move to the CI and Release Qualification harness workflow.
If the next question is rank-aware diagnosis, move to the Distributed Diagnostics Recipes.
