Production Cookbook
This cookbook packages Stormlog’s profiling, tracking, diagnostic, and TUI flows into task-oriented recipes for production-facing work.
Use these pages when you already know the tool is installed and want the shortest path to a reliable operational workflow.
Audience: operators, ML engineers, release owners. Difficulty: intermediate.
Before you choose a recipe
Read the Installation Guide first if the environment is not already set up.
Use the Command Line Guide if you need option-by-option reference instead of a task recipe.
If you installed from PyPI, use the pip-safe CLI commands on each page.
If you are working from a source checkout, you can also use the maintained examples/ and benchmark_harness flows for qualification.
If you need API signatures or option-by-option reference, go back to the Usage Guide, Command Line Guide, or generated API reference.
Choose the right recipe
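| Goal | Start here |
|---|---|
| keep long-running artifacts bounded | Always-on Tracking |
| respond to a PyTorch incident quickly | PyTorch Production Recipes |
| respond to a TensorFlow incident quickly | TensorFlow Production Recipes |
| compare ranks or rebuild distributed timelines | Distributed Diagnostics Recipes |
| triage OOM or hidden-memory-gap findings | Incident Playbooks |
| qualify operational behavior in CI or before release | CI and Release Qualification |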
Recipes
Always-on tracking and bounded artifact budgets
Use the Always-on Tracking recipe when you want a long-running tracking session with append-only sink files, retention limits, and explicit guidance for degraded collectors.
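To make "append-only sink files with retention limits" concrete, the following is a minimal Python sketch of the general idea. It is not Stormlog's sink implementation: the class name `BoundedJsonlSink`, the `events-NNNNNN.jsonl` file naming, and the `max_bytes_per_file` and `max_files` parameters are all hypothetical and chosen only for illustration. The Always-on Tracking recipe remains the reference for the real options.

```python
import json
import tempfile
from pathlib import Path


class BoundedJsonlSink:
    """Hypothetical append-only JSONL sink with a bounded on-disk footprint.

    Illustrative only; names and parameters are assumptions, not Stormlog APIs.
    """

    def __init__(self, directory, max_bytes_per_file=1024, max_files=3):
        if max_files < 1:
            raise ValueError("max_files must be at least 1")
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.max_bytes_per_file = max_bytes_per_file
        self.max_files = max_files
        self._index = 0

    def _current_path(self):
        # Zero-padded index keeps lexical order equal to creation order.
        return self.directory / f"events-{self._index:06d}.jsonl"

    def append(self, event):
        line = json.dumps(event) + "\n"
        path = self._current_path()
        # Roll over to a new file when this line would exceed the per-file budget.
        if path.exists():
            projected = path.stat().st_size + len(line.encode("utf-8"))
            if projected > self.max_bytes_per_file:
                self._index += 1
                path = self._current_path()
        # Existing content is never rewritten; new events are only appended.
        with path.open("a", encoding="utf-8") as handle:
            handle.write(line)
        self._prune()

    def _prune(self):
        # Retention: delete the oldest files so at most max_files remain.
        files = sorted(self.directory.glob("events-*.jsonl"))
        for stale in files[:-self.max_files]:
            stale.unlink()


def demo_sink(event_count=20):
    """Write sample events into a temporary directory and report the footprint."""
    with tempfile.TemporaryDirectory() as tmp:
        sink = BoundedJsonlSink(tmp, max_bytes_per_file=64, max_files=2)
        for step in range(event_count):
            sink.append({"step": step, "allocated_mb": 100 + step})
        files = sorted(Path(tmp).glob("events-*.jsonl"))
        return len(files), sum(f.stat().st_size for f in files)
```

In this sketch the worst-case disk usage is roughly `max_files * max_bytes_per_file`, which is the property a retention limit is meant to guarantee for a session that never stops on its own.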
PyTorch production profiling and OOM capture
Use the PyTorch Production Recipes page when you need to move from a live PyTorch issue to a saved telemetry or OOM artifact quickly.
TensorFlow production profiling and diagnosis
Use the TensorFlow Production Recipes page when the workload is owned by TensorFlow and you need track, analyze, and diagnose guidance that matches the current tfmemprof behavior.
Distributed and rank-aware diagnosis
Use the Distributed Diagnostics Recipes page when you need to track multiple ranks, preserve rank identity in artifacts, and rebuild rank-aware diagnostics later in the TUI.
Incident triage playbooks
Use the Incident Playbooks page when the main question is what to do next after an OOM, hidden-memory-gap result, degraded collector, or always-on retention issue.
CI and release qualification
Use the CI and Release Qualification page when you need one place for source-checkout smoke commands, benchmark harness gates, and artifact archival guidance.
Suggested reading order
New production deployment
PyTorch incident response
Distributed Diagnostics Recipes if more than one rank is involved
TensorFlow incident response
Distributed Diagnostics Recipes if more than one rank is involved