TensorFlow Production Recipes
Use these recipes when the runtime is TensorFlow and you need production-safe
capture, analysis, and diagnosis flows that match the current tfmemprof
behavior.
Audience: ML engineers, release owners. Difficulty: intermediate.
Prerequisites
install the package first with Installation
use pip install "stormlog[tf]" for the TensorFlow CLI paths
use Command Line Guide if you need per-flag reference
pick /CPU:0 or /GPU:0 explicitly for the current runtime
for GPU recipes, run a small GPU op first instead of relying only on tf.config.list_physical_devices("GPU")
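The last prerequisite can be sketched as a small probe: try each candidate device with a real op and keep the first one that works. The helper names pick_device, tf_probe, and stub_probe are illustrative and not part of stormlog, and the device-name check in tf_probe assumes the usual eager tensor naming. The demo uses a stub probe so it runs without TensorFlow installed.

```python
from typing import Callable, Sequence


def pick_device(
    probe: Callable[[str], None],
    candidates: Sequence[str] = ("/GPU:0", "/CPU:0"),
) -> str:
    """Return the first candidate on which probe(device) does not raise."""
    errors = []
    for device in candidates:
        try:
            probe(device)
            return device
        except Exception as exc:  # a failed probe just means "try the next one"
            errors.append(f"{device}: {exc}")
    raise RuntimeError("no usable device; " + "; ".join(errors))


def tf_probe(device: str) -> None:
    """Run a small matmul on `device` and fail if it ran somewhere else."""
    import tensorflow as tf  # imported lazily so this module loads without TF

    with tf.device(device):
        x = tf.random.normal((256, 256))
        y = tf.matmul(x, x)
    _ = tf.reduce_sum(y).numpy()  # force the op to actually execute
    # Assumption: eager tensors report a full name ending in "device:GPU:0".
    if not y.device.endswith("device:" + device.lstrip("/")):
        raise RuntimeError(f"op ran on {y.device}, not {device}")


def stub_probe(device: str) -> None:
    """Stand-in probe for the demo: pretends only the CPU is usable."""
    if device != "/CPU:0":
        raise RuntimeError("device unavailable")


print(pick_device(stub_probe))
```

With TensorFlow installed, pick_device(tf_probe) returns a device string you can pass to --device or to TFMemoryProfiler(device=...).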
Success signal:
the first workload-backed recipe records non-zero GPU memory
a monitor, track, or diagnose artifact is written successfully
the analyzer returns a report with a clear next action
Choose the first TensorFlow recipe
| If the main goal is… | Start with… |
|---|---|
| check a small in-process workload first | profile a GPU matmul step |
| capture a bounded sample window | tfmemprof monitor |
| keep an event stream and session status | tfmemprof track |
| get a report with leak and optimization signals | tfmemprof analyze |
| save a portable bundle fast | tfmemprof diagnose |
Recipe: profile a GPU matmul step
import tensorflow as tf
from stormlog.tensorflow import TFMemoryProfiler

profiler = TFMemoryProfiler(device="/GPU:0", enable_tensor_tracking=True)

# Run the workload inside the profiling context.
with profiler.profile_context("matmul_step"):
    a = tf.random.normal((4096, 4096))
    b = tf.random.normal((4096, 4096))
    c = tf.matmul(a, b)
    # .numpy() forces the result to be computed and copied to the host.
    _ = tf.reduce_mean(c).numpy()

results = profiler.get_results()
print(f"Peak memory: {results.peak_memory_mb:.2f} MB")
print(f"Snapshots captured: {len(results.snapshots)}")
Use this when you want a small TensorFlow workload on /GPU:0 without
depending on the cuDNN training path.
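The success signal for this recipe can be turned into an explicit gate. The sketch below relies only on the two attributes the recipe prints, peak_memory_mb and snapshots. The function name check_capture is hypothetical, and the demo object stands in for real profiler results.

```python
from types import SimpleNamespace


def check_capture(results, min_peak_mb: float = 0.0) -> list:
    """Return a list of problems; an empty list means the capture looks usable."""
    problems = []
    if results.peak_memory_mb <= min_peak_mb:
        problems.append(
            f"peak memory is {results.peak_memory_mb:.2f} MB; "
            "the workload may not have run on the profiled device"
        )
    if len(results.snapshots) == 0:
        problems.append("no snapshots were captured")
    return problems


# Stand-in for profiler.get_results(); real values come from the profiler.
demo = SimpleNamespace(peak_memory_mb=0.0, snapshots=[])
for problem in check_capture(demo):
    print("WARN:", problem)
```

In a real run, pass the object returned by profiler.get_results() and treat a non-empty list as a reason to recheck the device before moving on.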
Recipe: capture a bounded CLI timeline
tfmemprof monitor --interval 0.5 --duration 30 --device /CPU:0 --output ./tf_monitor.json
Switch to /GPU:0 when the TensorFlow runtime exposes a GPU device.
Treat this as a CLI artifact-flow command. On an otherwise idle runtime it will often record zeros even when the tracker itself is functioning correctly.
Recipe: track TensorFlow memory over time
tfmemprof track \
--interval 0.5 \
--threshold 4096 \
--device /CPU:0 \
--output ./tf_track.json
Use track when you need retained vs dropped history counters, session status,
and an event stream you can reload later.
Stop the command cleanly with Ctrl+C so the output file is flushed before the
process exits.
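When track runs under a supervisor script rather than an interactive shell, the same clean stop can be sent programmatically. The sketch below builds the command from the flags shown in this recipe and defines a helper that delivers SIGINT, the signal Ctrl+C sends on POSIX systems. The function names and the 30-second timeout are illustrative assumptions, not part of tfmemprof, and the demo only prints the command.

```python
import shlex
import signal
import subprocess


def track_command(output, device="/CPU:0", interval=0.5, threshold=4096):
    """Build the argv for the track recipe, using only flags shown in this guide."""
    return [
        "tfmemprof", "track",
        "--interval", str(interval),
        "--threshold", str(threshold),
        "--device", device,
        "--output", output,
    ]


def stop_cleanly(proc: subprocess.Popen, timeout: float = 30.0) -> int:
    """Send SIGINT and wait for the process to exit on its own."""
    proc.send_signal(signal.SIGINT)
    # Raises subprocess.TimeoutExpired if the process does not exit in time.
    return proc.wait(timeout=timeout)


cmd = track_command("./tf_track.json")
print(shlex.join(cmd))
```

A supervisor would start the tracker with proc = subprocess.Popen(cmd), let it run for as long as needed, and then call stop_cleanly(proc) instead of killing the process.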
Recipe: run TensorFlow analysis
tfmemprof analyze --input ./tf_monitor.json --detect-leaks --optimize --report ./tf_report.txt
The current TensorFlow analyzer uses --input, not the positional-input style
from gpumemprof analyze.
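Because the artifact path appears once as --output and once as --input, a mismatch between the two is an easy mistake in scripted runs. The sketch below builds both commands from a single path so they cannot drift apart. The function name monitor_then_analyze is hypothetical; the flags are the ones shown in the monitor and analyze recipes.

```python
import shlex


def monitor_then_analyze(artifact, report, device="/CPU:0", interval=0.5, duration=30):
    """Return the two argv lists for a bounded capture followed by analysis."""
    monitor = [
        "tfmemprof", "monitor",
        "--interval", str(interval),
        "--duration", str(duration),
        "--device", device,
        "--output", artifact,
    ]
    analyze = [
        "tfmemprof", "analyze",
        "--input", artifact,  # named flag, not a positional path
        "--detect-leaks",
        "--optimize",
        "--report", report,
    ]
    return [monitor, analyze]


for argv in monitor_then_analyze("./tf_monitor.json", "./tf_report.txt"):
    print(shlex.join(argv))
```

To execute the steps, pass each list to subprocess.run(argv, check=True) so the sequence stops at the first failing command.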
Recipe: produce a diagnose bundle
tfmemprof diagnose --duration 0 --output ./tf_diag_bundle
Recipe: run the end-to-end TensorFlow flow from a source checkout
python -m examples.scenarios.tf_end_to_end_scenario
This is source-checkout only. Pip installs do not include examples/.
What to look for in the results
leak findings from --detect-leaks
optimization recommendations from --optimize
collector_failure_event_count
session_status
gap_analysis when telemetry events are available
collective_attribution when cross-rank communication likely explains hidden-memory spikes
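This guide does not document where these fields sit inside an artifact, so a schema-agnostic search is a cautious way to check for them. The sketch below walks any nested JSON structure and collects values stored under the four field names listed above. The helper names and the sample artifact, including the "completed" status value, are hypothetical and exist only to exercise the search.

```python
import json

SIGNAL_KEYS = (
    "collector_failure_event_count",
    "session_status",
    "gap_analysis",
    "collective_attribution",
)


def find_key(node, key):
    """Yield every value stored under `key` anywhere in a nested JSON structure."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def summarize_signals(artifact):
    """Map each signal name to the list of values found for it."""
    return {key: list(find_key(artifact, key)) for key in SIGNAL_KEYS}


# Hypothetical artifact shape, not the real tfmemprof schema.
sample = json.loads(
    '{"summary": {"session_status": "completed", "collector_failure_event_count": 0}}'
)
for key, values in summarize_signals(sample).items():
    print(key, "->", values if values else "not present")
```

For a real artifact, load the file with json.load and pass the result to summarize_signals.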
What to do next
If leak detection reports high-severity issues, move to the memory-growth path in Incident Playbooks.
If collector_failure_event_count is non-zero, use the degraded-collector checklist in Incident Playbooks.
If the run is part of a larger distributed job, move to the Distributed Diagnostics Recipes.
If the deployment is long-running, move to Always-on Tracking.
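The routing rules above can be written as a small decision function, which is convenient when a release script needs to print follow-up guidance. This is a sketch: the function name and its boolean inputs are hypothetical, and how you derive high_severity_leaks from a report depends on the report format, which is not specified here.

```python
def next_steps(
    high_severity_leaks=False,
    collector_failure_event_count=0,
    distributed=False,
    long_running=False,
):
    """Return the follow-up guides that apply, in the order listed in this section."""
    steps = []
    if high_severity_leaks:
        steps.append("Incident Playbooks: memory-growth path")
    if collector_failure_event_count > 0:
        steps.append("Incident Playbooks: degraded-collector checklist")
    if distributed:
        steps.append("Distributed Diagnostics Recipes")
    if long_running:
        steps.append("Always-on Tracking")
    return steps


for step in next_steps(collector_failure_event_count=3, long_running=True):
    print("Next:", step)
```

An empty list means none of the listed conditions applied to the run.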
Troubleshooting
Symptom: track stops without writing the output file
Likely cause: the process was interrupted before the tracker reached its normal shutdown path.
Fix: wait until tracking has started, then stop it cleanly with Ctrl+C.
Verify: the output file is written and session_status is present.
Symptom: TensorFlow sees /GPU:0 but a training step fails with DNN library initialization failed
Likely cause: the TensorFlow, CUDA, cuDNN, and driver stack is not aligned for training-backed ops.
Fix: rerun the minimal matmul snippet above first, then repair the TensorFlow runtime before moving to Keras or cuDNN-dependent workloads.
Verify: the matmul snippet records non-zero GPU memory and the training path no longer raises a DNN initialization error.
Symptom: monitor or track on /GPU:0 records only zeros
Likely cause: the CLI process is idle and not exercising a TensorFlow workload.
Fix: use the workload-backed TFMemoryProfiler snippet or the source-checkout
scenario before relying on the artifact for performance conclusions.
Verify: the workload-backed path records non-zero GPU memory.
Symptom: analyze reports no GPU
Likely cause: the TensorFlow runtime is CPU-only or the selected device is unavailable.
Fix: rerun with --device /CPU:0 or fix the runtime environment first.
Verify: tfmemprof info and the chosen capture command agree on the active device.