
Always-on Tracking

Use this recipe when you want Stormlog to behave like an operational service: bounded history in memory, append-only sink artifacts on disk, and enough session metadata to reconstruct one run later.

Audience: operators, platform owners. Difficulty: intermediate.

Prerequisites

  • install the package first; see Installation

  • use pip install "stormlog[torch]" for gpumemprof track

  • use pip install "stormlog[tf]" for tfmemprof track

  • see the Command Line Guide if you need a per-flag reference

  • a writable artifact directory for sink files

  • enough runtime permissions to inspect the target device backend
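
The writable-directory prerequisite can be checked before a long run starts. The sketch below is a generic preflight, not part of Stormlog; `is_writable_dir` is a hypothetical helper name.

```python
import os
import tempfile


def is_writable_dir(path: str) -> bool:
    """Return True if path exists (or can be created) and accepts new files."""
    try:
        os.makedirs(path, exist_ok=True)
        # Creating and deleting a temp file proves write access, not just existence.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        return True
    except OSError:
        return False
```

Calling `is_writable_dir("./live_sink")` before launching track surfaces permission problems early instead of partway through a session.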

Success signal:

  • a sink manifest is created

  • analyze can reload the sink

  • collector health and retention counters are visible in the output

When this is the right recipe

  • you want long-running track sessions

  • you need rollover and retention limits on artifacts

  • you want the run to stay alive when a collector becomes unhealthy

  • you want a stable path from live telemetry to later analysis or TUI loading

PyTorch always-on baseline

gpumemprof track \
  --interval 0.5 \
  --warning-threshold 75 \
  --critical-threshold 90 \
  --telemetry-sink-dir ./live_sink \
  --telemetry-flush-seconds 2.0 \
  --telemetry-rollover-mb 64 \
  --telemetry-retention-files 8 \
  --telemetry-retention-total-mb 512
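
If the baseline is launched from a supervisor script rather than typed by hand, the same flags can be assembled from one config object. This is a minimal sketch: the `TrackConfig` dataclass and `build_track_command` helper are hypothetical names for illustration, and only the flags themselves come from the command above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrackConfig:
    """Hypothetical container for the baseline flags shown above."""
    interval: float = 0.5
    warning_threshold: int = 75
    critical_threshold: int = 90
    sink_dir: str = "./live_sink"
    flush_seconds: float = 2.0
    rollover_mb: int = 64
    retention_files: int = 8
    retention_total_mb: int = 512


def build_track_command(cfg: TrackConfig) -> List[str]:
    """Return an argv list for subprocess.Popen (no shell quoting needed)."""
    return [
        "gpumemprof", "track",
        "--interval", str(cfg.interval),
        "--warning-threshold", str(cfg.warning_threshold),
        "--critical-threshold", str(cfg.critical_threshold),
        "--telemetry-sink-dir", cfg.sink_dir,
        "--telemetry-flush-seconds", str(cfg.flush_seconds),
        "--telemetry-rollover-mb", str(cfg.rollover_mb),
        "--telemetry-retention-files", str(cfg.retention_files),
        "--telemetry-retention-total-mb", str(cfg.retention_total_mb),
    ]


print(" ".join(build_track_command(TrackConfig())))
```

Keeping the flags in one frozen object makes it easier to review retention changes in version control than editing a shell one-liner.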

What this gives you:

  • append-only JSONL sink segments plus a manifest

  • one session identity for the run

  • rollover and pruning under a bounded artifact budget

  • collector_degraded and collector_recovered events instead of synthetic zero samples
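
The retention flags bound the sink by both file count and total size. The sketch below illustrates one way such a policy can behave. The oldest-first pruning order, the segment file names, and the `Segment` and `prune` names are assumptions for illustration, not a description of Stormlog internals.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    name: str        # hypothetical segment file name
    size_bytes: int


def prune(
    segments: List[Segment], max_files: int, max_total_bytes: int
) -> Tuple[List[Segment], List[Segment]]:
    """Split segments (ordered oldest first) into (retained, pruned).

    Oldest segments are dropped until both limits hold. The newest
    segment is always kept so the active file is never pruned.
    """
    retained = list(segments)
    pruned: List[Segment] = []
    while len(retained) > 1 and (
        len(retained) > max_files
        or sum(s.size_bytes for s in retained) > max_total_bytes
    ):
        pruned.append(retained.pop(0))
    return retained, pruned


MB = 1024 * 1024
# Ten full 64 MB segments against a limit of 8 files and 512 MB total.
segments = [Segment(f"segment-{i:04d}.jsonl", 64 * MB) for i in range(10)]
retained, pruned = prune(segments, max_files=8, max_total_bytes=512 * MB)
print(len(retained), len(pruned), sum(s.size_bytes for s in pruned) // MB)
```

With the baseline values, 8 files of 64 MB add up to exactly the 512 MB total cap, so the two limits are reached together.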

Add workload phases when timestamps are not enough

Use structured phases when long-running artifacts need to answer which part of the workload was active when a hidden-memory anomaly appeared.

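from stormlog import MemoryTracker

# Assumes model, optimizer, loader (an iterable of batches), and num_epochs
# are already defined by your training script.
tracker = MemoryTracker(
    sampling_interval=0.5,
    # telemetry_sink_config=...,  # Optional: configure the append-only sink.
)

tracker.start_tracking()
try:
    for epoch in range(num_epochs):
        batches = iter(loader)  # a DataLoader is iterable, not an iterator
        with tracker.phase("train", metadata={"epoch": epoch}):
            # One batch per epoch keeps the example short.
            with tracker.phase("load_batch"):
                batch = next(batches)
            with tracker.phase("forward"):
                loss = model(batch).sum()
            with tracker.phase("backward"):
                optimizer.zero_grad()  # clear gradients from the previous step
                loss.backward()
            with tracker.phase("optimizer_step"):
                optimizer.step()
finally:
    # Always stop tracking, even if training raises.
    tracker.stop_tracking()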

What changes when phases are present:

  • track writes companion phase_enter / phase_exit records with deterministic nested paths

  • gpumemprof analyze adds phase-aware summaries beside timestamps

  • the TUI Diagnostics tab shows the first anomaly phase path for each rank

  • when you omit instrumentation entirely, the same workflow stays valid and low-overhead
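
To make "deterministic nested paths" concrete, the sketch below mimics how nested context managers can yield paired enter and exit records. It is a standalone illustration, not Stormlog's implementation: the `PhaseRecorder` class, the `/` path separator, and the record field names are all assumptions.

```python
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List, Optional


class PhaseRecorder:
    """Toy recorder that emits phase_enter / phase_exit records."""

    def __init__(self) -> None:
        self._stack: List[str] = []
        self.records: List[Dict[str, Any]] = []

    @contextmanager
    def phase(
        self, name: str, metadata: Optional[Dict[str, Any]] = None
    ) -> Iterator[None]:
        self._stack.append(name)
        path = "/".join(self._stack)  # assumed separator
        self.records.append(
            {"event": "phase_enter", "path": path, "metadata": metadata or {}}
        )
        try:
            yield
        finally:
            # Recorded even if the body raises, so every enter has an exit.
            self.records.append({"event": "phase_exit", "path": path})
            self._stack.pop()


recorder = PhaseRecorder()
with recorder.phase("train", metadata={"epoch": 0}):
    with recorder.phase("forward"):
        pass
    with recorder.phase("backward"):
        pass

for record in recorder.records:
    print(record["event"], record["path"])
```

Because the path is derived only from the nesting order, the same code produces the same paths on every run, which is what makes them usable as grouping keys in later analysis.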

TensorFlow always-on baseline

tfmemprof track \
  --interval 1.0 \
  --threshold 4096 \
  --device /CPU:0 \
  --output ./tf_track.json \
  --telemetry-sink-dir ./tf_live_sink \
  --telemetry-flush-seconds 2.0 \
  --telemetry-rollover-mb 64 \
  --telemetry-retention-files 8 \
  --telemetry-retention-total-mb 512

Use /GPU:0 instead of /CPU:0 when the TensorFlow runtime has a GPU device available. Stop the command cleanly with Ctrl+C when you want it to flush the final output file and session summary.
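
When the tracker runs under a supervisor instead of an interactive terminal, the equivalent of Ctrl+C is SIGINT. This is a minimal POSIX sketch, assuming the tracker treats SIGINT the same way as a keyboard interrupt; `stop_cleanly` is a hypothetical helper and the grace period is an arbitrary choice.

```python
import signal
import subprocess


def stop_cleanly(proc: subprocess.Popen, grace_seconds: float = 30.0) -> int:
    """Send SIGINT (what Ctrl+C sends) and wait for the process to exit.

    Falls back to a hard kill only after the grace period, because a hard
    kill gives the process no chance to write its final output.
    """
    if proc.poll() is None:
        proc.send_signal(signal.SIGINT)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()


# Usage (not executed here):
# proc = subprocess.Popen(["tfmemprof", "track", "--output", "./tf_track.json"])
# exit_code = stop_cleanly(proc)
```

Sending SIGTERM or SIGKILL directly is not covered by the Ctrl+C behavior described above, so this sketch reserves the kill for the timeout path.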

Inspect the latest clean session

gpumemprof analyze ./live_sink --format txt --output ./live_analysis.txt
tfmemprof analyze --input ./tf_track.json --detect-leaks --optimize --report ./tf_report.txt

For PyTorch sink directories, default session selection prefers the newest clean completed session and falls back to interrupted or incomplete sessions only when needed.
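
The selection rule can be sketched as a two-tier choice. The `SessionInfo` record and the status strings (`completed`, `interrupted`, `incomplete`) are hypothetical stand-ins; the actual manifest fields may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SessionInfo:
    session_id: str
    status: str        # assumed values: "completed", "interrupted", "incomplete"
    started_at: float  # epoch seconds


def select_default_session(sessions: List[SessionInfo]) -> Optional[SessionInfo]:
    """Prefer the newest clean completed session; otherwise the newest of any status."""
    if not sessions:
        return None
    clean = [s for s in sessions if s.status == "completed"]
    pool = clean or sessions  # fall back only when no clean session exists
    return max(pool, key=lambda s: s.started_at)


sessions = [
    SessionInfo("a", "completed", 100.0),
    SessionInfo("b", "completed", 200.0),
    SessionInfo("c", "interrupted", 300.0),
]
chosen = select_default_session(sessions)
print(chosen.session_id)
```

Note that the interrupted session "c" is newer but is skipped, which is why a reused sink directory can surprise you if the run you care about did not finish cleanly.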

What to watch during long runs

Treat these values as operational signals, not just debug trivia:

  • rollover_count

  • pruned_segment_count

  • pruned_bytes

  • final_retained_files

  • final_retained_bytes

  • history_retained_*

  • history_dropped_*

  • collector_failure_event_count

  • session_status
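
One way to turn these counters into alerts is to pull them out of the session summary and compare them between runs. The sketch assumes the summary is available as a flat dict keyed by the names above; that shape, the `extract_signals` helper, and the sample values are assumptions.

```python
from typing import Any, Dict

EXACT_KEYS = (
    "rollover_count",
    "pruned_segment_count",
    "pruned_bytes",
    "final_retained_files",
    "final_retained_bytes",
    "collector_failure_event_count",
    "session_status",
)
PREFIXES = ("history_retained_", "history_dropped_")


def extract_signals(summary: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the operational counters, including the history_* families."""
    return {
        key: value
        for key, value in summary.items()
        if key in EXACT_KEYS or key.startswith(PREFIXES)
    }


# Illustrative values only; not output from a real run.
summary = {
    "rollover_count": 12,
    "pruned_segment_count": 4,
    "pruned_bytes": 268435456,
    "final_retained_files": 8,
    "history_dropped_samples": 0,    # hypothetical member of history_dropped_*
    "collector_failure_event_count": 0,
    "session_status": "completed",   # hypothetical status value
    "hostname": "trainer-01",        # unrelated field, filtered out
}
signals = extract_signals(summary)
print(sorted(signals))
```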

How to interpret degraded mode

If the collector becomes unhealthy during track:

  • the process keeps running

  • new sample emission pauses until recovery

  • status events remain visible in the artifact stream

  • the final report should still show the collector-health transition history
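
Because degraded periods appear as explicit events rather than zero-valued samples, sampling gaps can be reconstructed from the artifact stream. This sketch assumes each JSONL record carries an `event` name and a numeric `timestamp`; both field names are hypothetical.

```python
import json
from typing import Iterable, List, Optional, Tuple


def degraded_intervals(lines: Iterable[str]) -> List[Tuple[float, Optional[float]]]:
    """Pair each collector_degraded event with the next collector_recovered.

    An interval whose end is None means the collector never recovered.
    """
    intervals: List[Tuple[float, Optional[float]]] = []
    open_start: Optional[float] = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        event = record.get("event")
        if event == "collector_degraded" and open_start is None:
            open_start = record["timestamp"]
        elif event == "collector_recovered" and open_start is not None:
            intervals.append((open_start, record["timestamp"]))
            open_start = None
    if open_start is not None:
        intervals.append((open_start, None))
    return intervals


stream = [
    '{"event": "sample", "timestamp": 1.0}',
    '{"event": "collector_degraded", "timestamp": 2.0}',
    '{"event": "collector_recovered", "timestamp": 5.5}',
    '{"event": "collector_degraded", "timestamp": 9.0}',
]
print(degraded_intervals(stream))
```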

Treat either of these as actionable:

  • non-zero collector_failure_event_count

  • any final collector state other than healthy
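
The two conditions can be encoded as a small gate for dashboards or post-run scripts. The sketch assumes the final state is exposed as `collector_health_status` in a flat summary dict; the dict shape and the `actionable_findings` helper are assumptions.

```python
from typing import Any, Dict, List


def actionable_findings(summary: Dict[str, Any]) -> List[str]:
    """Return human-readable reasons to investigate; empty means no action needed."""
    findings: List[str] = []
    failures = summary.get("collector_failure_event_count", 0)
    if failures:
        findings.append(f"collector_failure_event_count={failures}")
    # A missing state is treated as not healthy, which errs on the side of alerting.
    state = summary.get("collector_health_status")
    if state != "healthy":
        findings.append(f"final collector state is {state!r}")
    return findings


print(actionable_findings(
    {"collector_failure_event_count": 2, "collector_health_status": "healthy"}
))
```

A run that recovered still reports a non-zero failure count, so it is flagged even though its final state is healthy.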

Troubleshooting

Symptom: sink files grow too quickly

Likely cause: retention is too loose for the deployment budget. Fix: tighten retention and rollover settings before lowering sample fidelity. Verify: final_retained_*, pruned_*, and rollover_count stabilize.
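
Before changing flags, it helps to estimate how long the retention budget actually lasts. The arithmetic below assumes an average record size of 400 bytes, which is a placeholder to replace with a measured value, and it ignores status events and manifest overhead.

```python
MB = 1024 * 1024


def seconds_per_segment(
    interval_s: float, avg_record_bytes: float, rollover_mb: float
) -> float:
    """Time for one segment to reach the rollover size at a steady sample rate."""
    bytes_per_second = avg_record_bytes / interval_s
    return rollover_mb * MB / bytes_per_second


def retention_window_hours(
    interval_s: float,
    avg_record_bytes: float,
    rollover_mb: float,
    retention_files: int,
    retention_total_mb: float,
) -> float:
    """Hours of history the sink can hold before the oldest data is pruned."""
    bytes_per_second = avg_record_bytes / interval_s
    # The tighter of the two limits (file count vs. total size) wins.
    budget_bytes = min(rollover_mb * retention_files, retention_total_mb) * MB
    return budget_bytes / bytes_per_second / 3600.0


# Baseline flags from this recipe with the assumed 400-byte record size.
print(round(seconds_per_segment(0.5, 400, 64) / 3600.0, 1), "hours per segment")
print(round(retention_window_hours(0.5, 400, 64, 8, 512), 1), "hours retained")
```

If the measured window is much shorter than this estimate, records are larger than assumed or extra events are being written, which points at content rather than retention settings.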

Symptom: tracking stays alive but live telemetry looks partial

Likely cause: the collector entered degraded mode. Fix: inspect collector_failure_event_count and the emitted status events. Verify: collector_health_status returns to healthy and sampling resumes.

Symptom: analysis loads the wrong run from a reused sink directory

Likely cause: more than one session is present in the sink. Fix: inspect the discovered sessions and target the session you want explicitly. Verify: the selected session id matches the intended run metadata.

What to do next

  • If the artifact footprint exceeds your budget, tighten retention before you reduce sample fidelity by raising the sampling interval.

  • If collector_failure_event_count is non-zero, move to the Incident Playbooks degraded-collector checklist.

  • If the run needs to be qualified for CI or release use, move to the CI and Release Qualification harness workflow.

  • If the next question is rank-aware diagnosis, move to the Distributed Diagnostics Recipes.

