Distributed Diagnostics Recipes
Use this page when you need to preserve per-rank identity during capture and rebuild a rank-aware timeline later.
Audience: distributed-training owners, incident responders. Difficulty: advanced.
Prerequisites
install the package first with Installation
use pip install "stormlog[torch]" for PyTorch rank capture
use pip install "stormlog[tf]" for TensorFlow rank capture
use pip install "stormlog[tui,torch]" if you want the TUI Diagnostics workflow from a pip install
use TUI Guide and Command Line Guide if you need UI or CLI reference details
the artifact paths for each rank are writable and distinct
rank metadata is either inferred from the environment or passed explicitly
the TUI extra is installed if you want the interactive Diagnostics workflow
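The "writable and distinct" prerequisite can be checked before launching any rank. The following is a minimal sketch, not part of stormlog: check_rank_outputs is a hypothetical helper name, and it compares paths as plain strings, so ./rank0.json and rank0.json would not be detected as the same file.

```shell
# Hypothetical pre-flight helper (not a stormlog command).
# Succeeds only if every output path has a writable parent directory
# and no path is listed twice.
check_rank_outputs() {
  _seen=""
  for _p in "$@"; do
    _dir=$(dirname "$_p")
    if [ ! -d "$_dir" ] || [ ! -w "$_dir" ]; then
      echo "not writable: $_dir" >&2
      return 1
    fi
    case "$_seen" in
      *"|$_p|"*)
        echo "duplicate output path: $_p" >&2
        return 1
        ;;
    esac
    _seen="$_seen|$_p|"
  done
  return 0
}

# Example: the two output files used by the capture recipes on this page.
if check_rank_outputs ./rank0.json ./rank1.json; then
  echo "rank output paths look usable"
else
  echo "fix the output paths before starting capture"
fi
```

Running a check like this once per job, before the ranks start, catches the case where two ranks would overwrite the same artifact.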
Success signal:
each rank produces its own artifact with rank identity intact
the TUI Diagnostics tab can load multiple rank files without flattening them
Choose the distributed path
| If the job is… | Start with… |
|---|---|
| PyTorch rank capture | rank-aware |
| TensorFlow rank capture | rank-aware |
| artifact triage after capture | |
When this is the right recipe
you need one artifact per rank
you want job_id, rank, local_rank, and world_size recorded explicitly
you need TUI diagnostics to keep ranks separate instead of flattening them
you want hidden-memory-gap or collective-attribution analysis with more than one rank
Recipe: validate a reference torchrun DDP run on Jarvis
Use this when you want one real multi-GPU training run based on the official PyTorch DDP tutorial pattern, not a manually stitched set of rank-local captures.
This workflow was validated against:
PyTorch DDP tutorial: https://docs.pytorch.org/tutorials/intermediate/ddp_tutorial.html
PyTorch tutorial-series example repo: https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series
local adaptation: examples.scenarios.torchrun_ddp_reference
Validated environment:
Jarvis container instance
2x L4
PyTorch template
single node with torchrun
Create the instance with the command below. The later steps assume the source checkout is already present on the instance at /home/gpu-memory-profiler.
jl create \
--gpu L4 \
--num-gpus 2 \
--template pytorch \
--region IN2 \
--storage 80 \
--name stormlog-ddp-reference \
--yes \
--json
Prepare the project environment on the instance:
cd /home/gpu-memory-profiler
python3 -m venv --system-site-packages .venv
. .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .
Run the reference training job with torchrun:
cd /home/gpu-memory-profiler
. .venv/bin/activate
mkdir -p artifacts/jarvis_torchrun_reference
PYTHONPATH=/home/gpu-memory-profiler \
OMP_NUM_THREADS=1 \
python -m torch.distributed.run \
--nnodes=1 \
--nproc_per_node=2 \
--master_addr=127.0.0.1 \
--master_port=29501 \
-m examples.scenarios.torchrun_ddp_reference \
--epochs 2 \
--batch-size 128 \
--dataset-size 4096 \
--interval 0.1 \
--job-id jarvis-torchrun-reference \
--output-dir artifacts/jarvis_torchrun_reference \
2>&1 | tee artifacts/jarvis_torchrun_reference/run.log
Expected console output shape:
one loss line per rank per epoch
rank-local and global loss values
a final "Reference summary saved to .../ddp_reference_summary.json" line
Expected artifacts:
artifacts/jarvis_torchrun_reference/ddp_reference_summary.json
artifacts/jarvis_torchrun_reference/rank0/telemetry_sink/
artifacts/jarvis_torchrun_reference/rank1/telemetry_sink/
artifacts/jarvis_torchrun_reference/reference_checkpoint.pt
artifacts/jarvis_torchrun_reference/run.log
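The expected layout can be verified with a short script before downloading anything. This is a sketch under stated assumptions: check_reference_artifacts is a hypothetical helper name, the file and directory names are taken from the list above, and ranks are assumed to be numbered 0 through world_size - 1.

```shell
# Hypothetical layout check (not a stormlog command).
# Usage: check_reference_artifacts <artifact-root> <world-size>
check_reference_artifacts() {
  _root="$1"
  _ws="$2"
  _missing=0
  # Files written once per run.
  for _f in ddp_reference_summary.json reference_checkpoint.pt run.log; do
    if [ ! -f "$_root/$_f" ]; then
      echo "missing file: $_root/$_f" >&2
      _missing=$((_missing + 1))
    fi
  done
  # One telemetry sink directory per rank.
  _r=0
  while [ "$_r" -lt "$_ws" ]; do
    if [ ! -d "$_root/rank$_r/telemetry_sink" ]; then
      echo "missing sink: $_root/rank$_r/telemetry_sink" >&2
      _missing=$((_missing + 1))
    fi
    _r=$((_r + 1))
  done
  [ "$_missing" -eq 0 ]
}

if check_reference_artifacts artifacts/jarvis_torchrun_reference 2; then
  echo "artifact layout complete"
else
  echo "artifact layout incomplete"
fi
```

A missing rank directory at this stage usually means that rank never wrote its sink, which is cheaper to catch on the instance than after the download.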
The validated reference run produced:
world_size = 2
rank_summaries = 2
12 sample events per rank
133 phase_enter and 133 phase_exit events per rank
Download the artifact root locally, then load both rank sinks together in the TUI Diagnostics tab:
stormlog
Then:
Open Diagnostics.
Enter the two sink paths as a comma-separated list.
Click Load Artifacts.
Leave session selection on auto.
Confirm present_ranks shows 0,1.
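Typing the sink paths by hand gets error-prone as the rank count grows. The sketch below prints the comma-separated list for pasting into the Diagnostics input. sink_paths_csv is a hypothetical helper name, and the rankN/telemetry_sink layout is assumed to match the reference run above. No spaces are added after the commas, since it is not confirmed here whether the input tolerates them.

```shell
# Hypothetical helper (not a stormlog command).
# Usage: sink_paths_csv <artifact-root> <world-size>
sink_paths_csv() {
  _root="$1"
  _ws="$2"
  _csv=""
  _r=0
  while [ "$_r" -lt "$_ws" ]; do
    # Append with a leading comma only when the list is non-empty.
    _csv="${_csv:+$_csv,}$_root/rank$_r/telemetry_sink"
    _r=$((_r + 1))
  done
  printf '%s\n' "$_csv"
}

sink_paths_csv artifacts/jarvis_torchrun_reference 2
# prints artifacts/jarvis_torchrun_reference/rank0/telemetry_sink,artifacts/jarvis_torchrun_reference/rank1/telemetry_sink
```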
For this reference run, Diagnostics selected a merged synthetic session and reported:
present_ranks = [0, 1]
expected_ranks = [0, 1]
missing_ranks = []
one diagnostics row per rank
Recipe: capture rank-aware PyTorch artifacts
gpumemprof track \
--duration 30 \
--interval 0.5 \
--job-id train-42 \
--rank 0 \
--local-rank 0 \
--world-size 2 \
--output ./rank0.json \
--format json
gpumemprof track \
--duration 30 \
--interval 0.5 \
--job-id train-42 \
--rank 1 \
--local-rank 1 \
--world-size 2 \
--output ./rank1.json \
--format json
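The two invocations above differ only in the rank-specific values, so for larger world sizes it may be easier to generate them. The sketch below only prints the commands (a dry run) and uses only flags shown in this recipe. rank_capture_cmd is a hypothetical helper name, and the sketch assumes a single node, where local_rank equals rank.

```shell
# Hypothetical dry-run generator (not a stormlog command).
# Single-node assumption: local rank equals global rank.
# Usage: rank_capture_cmd <rank> <world-size> <job-id>
rank_capture_cmd() {
  _rank="$1"
  _ws="$2"
  _job="$3"
  printf 'gpumemprof track --duration 30 --interval 0.5 --job-id %s --rank %s --local-rank %s --world-size %s --output ./rank%s.json --format json\n' \
    "$_job" "$_rank" "$_rank" "$_ws" "$_rank"
}

# Print one command per rank for a two-rank job.
world_size=2
rank=0
while [ "$rank" -lt "$world_size" ]; do
  rank_capture_cmd "$rank" "$world_size" train-42
  rank=$((rank + 1))
done
```

Each printed line is meant to be run in the environment of the matching rank. The job ID stays fixed while the rank, local rank, and output file change together.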
Recipe: capture rank-aware TensorFlow artifacts
tfmemprof track \
--interval 0.5 \
--threshold 4096 \
--device /CPU:0 \
--job-id train-42 \
--rank 0 \
--local-rank 0 \
--world-size 2 \
--output ./tf_rank0.json
tfmemprof track \
--interval 0.5 \
--threshold 4096 \
--device /CPU:0 \
--job-id train-42 \
--rank 1 \
--local-rank 1 \
--world-size 2 \
--output ./tf_rank1.json
Stop each TensorFlow rank cleanly with Ctrl+C after tracking has started so
the per-rank output file is flushed before exit.
Keep the same job_id across every rank-local capture from one distributed
run. Diagnostics uses that shared job identity to auto-select a merged
cross-rank session when you load the artifacts together.
Recipe: load multiple rank artifacts in the TUI
stormlog
Then:
Open Diagnostics.
Enter the artifact paths as a comma-separated list.
Click Load Artifacts.
Leave session selection on auto or default first. With a shared job_id, Diagnostics selects the merged cross-rank session automatically.
Choose an individual session_id only when you want to isolate one raw rank-local artifact.
Apply a rank filter such as all or 0,1.
What to look for
cross_rank_analysis in PyTorch optimization reports when more than one rank is present
rank-aware timeline differences in the TUI diagnostics pane
collective_attribution when communication phases align with hidden-memory spikes
session separation by session_id when the same sink directory or host is reused
What to do next
If one rank is the first cause, isolate that rank’s artifact and analyze it independently.
If all ranks spike together and collective_attribution is populated, treat the issue as a communication or synchronization candidate before changing model code.
If the problem is operational rather than rank-local, move to Always-on Tracking.
Troubleshooting
Symptom: only one rank appears in diagnostics
Likely cause: the wrong artifact set or session was loaded.
Fix: load every rank artifact together, then target the intended session before refreshing.
Verify: present_ranks matches the expected rank set.
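The verification step amounts to a set difference between the expected and present ranks. The sketch below illustrates that comparison outside the TUI. missing_ranks is a hypothetical helper name, the expected set is assumed to be 0 through world_size - 1, and the present ranks must be given as a comma-separated list without spaces.

```shell
# Hypothetical helper (not a stormlog command).
# Usage: missing_ranks <world-size> <present-ranks-csv>
# Prints the ranks in 0..world_size-1 that are absent from the present list.
missing_ranks() {
  _ws="$1"
  _present=",$2,"   # pad with commas so every rank can be matched as ",N,"
  _missing=""
  _r=0
  while [ "$_r" -lt "$_ws" ]; do
    case "$_present" in
      *",$_r,"*) ;;                                   # rank is present
      *) _missing="${_missing:+$_missing,}$_r" ;;     # rank is missing
    esac
    _r=$((_r + 1))
  done
  printf '%s\n' "$_missing"
}

missing_ranks 2 "0,1"   # prints an empty line: nothing is missing
missing_ranks 4 "0,2"   # prints 1,3
```

An empty result corresponds to missing_ranks = [] in Diagnostics. Any other output names the rank artifacts that still need to be loaded.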
Symptom: ranks are present but the first cause is unclear
Likely cause: the issue is synchronized across ranks.
Fix: inspect cross_rank_analysis and collective_attribution before isolating one rank.
Verify: the next action comes from a rank-local or communication-attributed explanation, not guesswork.