Production Cookbook

This cookbook packages Stormlog’s profiling, tracking, diagnostic, and TUI flows into task-oriented recipes for production-facing work.

Use these pages when Stormlog is already installed and you want the shortest path to a reliable operational workflow.

Audience: operators, ML engineers, release owners. Difficulty: intermediate.

Before you choose a recipe

Read the Installation Guide first if the environment is not already set up.
Use the Command Line Guide if you need option-by-option reference instead of a task recipe.
If you installed from PyPI, use the pip-safe CLI commands on each page.
If you are working from a source checkout, you can also use the maintained examples/ and benchmark_harness flows for qualification.
If you need API signatures, go back to the Usage Guide or the generated API reference.

Choose the right recipe

Goal	Start here
keep long-running artifacts bounded	Always-on Tracking
respond to a PyTorch incident quickly	PyTorch Production Recipes
respond to a TensorFlow incident quickly	TensorFlow Production Recipes
compare ranks or rebuild distributed timelines	Distributed Diagnostics Recipes
triage OOM or hidden-memory-gap findings	Incident Playbooks
qualify operational behavior in CI or before release	CI and Release Qualification

Recipes

Always-on tracking and bounded artifact budgets

Use the Always-on Tracking recipe when you want a long-running tracking session with append-only sink files, retention limits, and explicit guidance for degraded collectors.

PyTorch production profiling and OOM capture

Use the PyTorch Production Recipes page when you need to move from a live PyTorch issue to a saved telemetry or OOM artifact quickly.

TensorFlow production profiling and diagnosis

Use the TensorFlow Production Recipes page when the workload runs on TensorFlow and you need guidance for the track, analyze, and diagnose steps that matches the current tfmemprof behavior.

Distributed and rank-aware diagnosis

Use the Distributed Diagnostics Recipes page when you need to track multiple ranks, preserve rank identity in artifacts, and rebuild rank-aware diagnostics later in the TUI.

Incident triage playbooks

Use the Incident Playbooks page when the main question is what to do next after an OOM, hidden-memory-gap result, degraded collector, or always-on retention issue.

CI and release qualification

Use the CI and Release Qualification page when you need one place for source-checkout smoke commands, benchmark harness gates, and artifact archival guidance.