stormlog.tracker

Real-time memory tracking and monitoring.

Classes

MemoryTracker([device, sampling_interval, ...])

Real-time memory tracker with alerts and monitoring.

MemoryWatchdog(tracker[, auto_cleanup, ...])

Memory watchdog for automated memory management.

TrackingEvent(timestamp, event_type, ...[, ...])

Represents a memory tracking event.

class stormlog.tracker.TrackingEvent(timestamp, event_type, memory_allocated, memory_reserved, memory_change, device_id, session_id=None, context=None, job_id=None, rank=0, local_rank=0, world_size=1, metadata=None, active_memory=None, inactive_memory=None, device_used=None, device_free=None, device_total=None, backend='cuda')

Bases: object

Represents a memory tracking event.

Parameters:
  • timestamp (float)

  • event_type (str)

  • memory_allocated (int)

  • memory_reserved (int)

  • memory_change (int)

  • device_id (int)

  • session_id (str | None)

  • context (str | None)

  • job_id (str | None)

  • rank (int)

  • local_rank (int)

  • world_size (int)

  • metadata (Dict[str, Any] | None)

  • active_memory (int | None)

  • inactive_memory (int | None)

  • device_used (int | None)

  • device_free (int | None)

  • device_total (int | None)

  • backend (str)

timestamp: float
event_type: str
memory_allocated: int
memory_reserved: int
memory_change: int
device_id: int
session_id: str | None = None
context: str | None = None
job_id: str | None = None
rank: int = 0
local_rank: int = 0
world_size: int = 1
metadata: Dict[str, Any] | None = None
active_memory: int | None = None
inactive_memory: int | None = None
device_used: int | None = None
device_free: int | None = None
device_total: int | None = None
backend: str = 'cuda'
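
Because the device-level fields default to None, code that reads them has to handle the missing case. The following is a minimal sketch of that pattern, not part of stormlog: the helper name device_utilization is hypothetical, and the sample object is a stand-in that only shares field names with TrackingEvent. It assumes device_used and device_total are expressed in the same unit.

```python
from types import SimpleNamespace
from typing import Optional


def device_utilization(event) -> Optional[float]:
    """Return device_used / device_total, or None when either value is unavailable."""
    used = getattr(event, "device_used", None)
    total = getattr(event, "device_total", None)
    if used is None or not total:
        # Covers None for either field and a zero total.
        return None
    return used / total


# Stand-in with TrackingEvent-like field names (not the real class).
sample = SimpleNamespace(device_used=6 * 1024**3, device_total=8 * 1024**3)
print(device_utilization(sample))
```

The same function should accept real TrackingEvent instances, since it only reads the two documented attributes.
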

class stormlog.tracker.MemoryTracker(device=None, sampling_interval=0.1, max_events=10000, enable_alerts=True, enable_oom_flight_recorder=False, oom_dump_dir='oom_dumps', oom_buffer_size=None, oom_max_dumps=5, oom_max_total_mb=256, job_id=None, rank=None, local_rank=None, world_size=None, enable_native_cuda_history=False, native_history_max_entries=100000, telemetry_sink_config=None)

Bases: object

Real-time memory tracker with alerts and monitoring.

Parameters:
  • device (str | int | torch.device | None)

  • sampling_interval (float)

  • max_events (int)

  • enable_alerts (bool)

  • enable_oom_flight_recorder (bool)

  • oom_dump_dir (str)

  • oom_buffer_size (int | None)

  • oom_max_dumps (int)

  • oom_max_total_mb (int)

  • job_id (str | None)

  • rank (int | None)

  • local_rank (int | None)

  • world_size (int | None)

  • enable_native_cuda_history (bool)

  • native_history_max_entries (int)

  • telemetry_sink_config (TelemetrySinkConfig | None)

get_session_summary()

Return the current or most recent tracking session summary.

Return type:

SessionSummary | None

property oom_buffer_size: int

Resolved OOM ring-buffer size.

start_tracking()

Start real-time memory tracking.

Return type:

None

stop_tracking()

Stop real-time memory tracking.

Return type:

None
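
Since start_tracking() and stop_tracking() are a matched pair, one way to guarantee the stop call is a small wrapper. This is a sketch under assumptions rather than a stormlog feature: the tracking helper is hypothetical and relies only on the two methods documented above.

```python
from contextlib import contextmanager


@contextmanager
def tracking(tracker):
    """Start tracking on entry and always stop on exit, even if the body raises."""
    tracker.start_tracking()
    try:
        yield tracker
    finally:
        tracker.stop_tracking()


# Intended use with a real tracker (not executed here):
#     from stormlog.tracker import MemoryTracker
#     with tracking(MemoryTracker(sampling_interval=0.1)) as t:
#         run_workload()   # hypothetical workload function
```

The try/finally keeps the sampling session from outliving a workload that fails midway.
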

enter_phase(name, *, metadata=None)

Enter one structured workload phase while tracking is active.

Parameters:
  • name (str)

  • metadata (Dict[str, Any] | None)

Return type:

PhaseHandle

phase(name, *, metadata=None)

Context manager that emits structured phase enter and exit records.

Parameters:
  • name (str)

  • metadata (Dict[str, Any] | None)

Return type:

Any
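
As an illustration of the context-manager form, the sketch below wraps two stages of a training step in named phases. The function run_training_step, the phase names, and the metadata key are all hypothetical. The tracker argument is assumed to be a MemoryTracker on which tracking has already been started.

```python
def run_training_step(tracker, forward, backward):
    """Run two callables, each inside its own named phase."""
    # Each `with` block emits one enter record and one exit record.
    with tracker.phase("forward", metadata={"step_kind": "train"}):
        loss = forward()
    with tracker.phase("backward"):
        backward(loss)
    return loss
```

metadata is keyword-only in the documented signature, so it is passed by name.
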

handle_exception(exc, context=None, metadata=None)

Capture OOM diagnostics for recognized OOM exceptions.

Parameters:
  • exc (BaseException)

  • context (str | None)

  • metadata (Dict[str, Any] | None)

Return type:

str | None
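
One plausible way to use handle_exception is from an except block that re-raises afterwards. The sketch below assumes exactly that. The run_guarded helper, the context value, and the metadata key are hypothetical. The reference above does not say what the returned string contains, so the code only checks it against None.

```python
def run_guarded(tracker, fn, context="train_step"):
    """Call fn(); on failure, let the tracker inspect the exception, then re-raise."""
    try:
        return fn()
    except Exception as exc:
        # Documented return type is str | None; None is taken to mean
        # that nothing was captured for this exception (an assumption).
        result = tracker.handle_exception(
            exc, context=context, metadata={"origin": context}
        )
        if result is not None:
            print(f"OOM diagnostics captured: {result}")
        raise
```

Re-raising keeps the caller's own error handling intact; the tracker only observes the failure.
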

capture_oom(context='runtime', metadata=None)

Capture an OOM diagnostic bundle if the tracked block raises an OOM error.

Parameters:
  • context (str)

  • metadata (Dict[str, Any] | None)

Return type:

Any

add_alert_callback(callback)

Add a callback function to be called on alerts.

Parameters:

callback (Callable[[TrackingEvent], None])

Return type:

None

remove_alert_callback(callback)

Remove an alert callback.

Parameters:

callback (Callable[[TrackingEvent], None])

Return type:

None
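
An alert callback is any callable that accepts one TrackingEvent and returns None. The following is a small sketch of such a callback: the logger name and the function log_alert are hypothetical, and the demo object is a stand-in with TrackingEvent-like fields whose event_type string is a placeholder, not a documented value.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("memory_alerts")  # hypothetical logger name


def log_alert(event) -> None:
    """Alert callback: takes one TrackingEvent-like object and returns None."""
    logger.warning(
        "alert %s on device %s: allocated=%d reserved=%d",
        event.event_type,
        event.device_id,
        event.memory_allocated,
        event.memory_reserved,
    )


# Stand-in event for a dry run; "warning" is a placeholder event type.
demo = SimpleNamespace(
    event_type="warning", device_id=0, memory_allocated=1024, memory_reserved=2048
)
log_alert(demo)

# Registration against a real tracker (not executed here):
#     tracker.add_alert_callback(log_alert)
#     tracker.remove_alert_callback(log_alert)
```

Keeping a reference to the same function object makes it possible to pass it to remove_alert_callback later.
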

get_events(event_type=None, last_n=None, since=None)

Get tracking events with optional filtering.

Parameters:
  • event_type (str | None) – Filter by event type

  • last_n (int | None) – Get last N events

  • since (float | None) – Get events since timestamp

Returns:

List of filtered events

Return type:

List[TrackingEvent]
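
The returned list can be post-processed with ordinary Python. As a sketch, the hypothetical helper below asks for the last N events and reports the largest memory_allocated among them. It relies only on the last_n parameter and the memory_allocated field documented above.

```python
def peak_allocated_last(tracker, n):
    """Return the largest memory_allocated among the last n events, or None if there are none."""
    events = tracker.get_events(last_n=n)
    if not events:
        return None
    return max(e.memory_allocated for e in events)
```

The event_type and since filters could be combined in the same call, but the accepted event type strings and the timestamp clock are not specified in this reference.
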

get_memory_timeline(interval=1.0)

Get memory usage timeline with specified interval.

Parameters:

interval (float) – Time interval in seconds for aggregation

Returns:

Dictionary with timeline data

Return type:

Dict[str, List]
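
The keys of the returned dictionary are not listed here. To illustrate what aggregation over a fixed interval can look like, the sketch below buckets events by timestamp and keeps the peak allocation per bucket. It is an illustration only, not stormlog's implementation: the function bucket_peaks, its output keys, and the stand-in events are all hypothetical.

```python
from collections import defaultdict
from types import SimpleNamespace


def bucket_peaks(events, interval=1.0):
    """Group events into fixed-width time buckets and keep the peak allocation in each."""
    if not events:
        return {"bucket_start": [], "peak_allocated": []}
    t0 = min(e.timestamp for e in events)
    peaks = defaultdict(int)
    for e in events:
        idx = int((e.timestamp - t0) // interval)  # bucket index relative to first event
        peaks[idx] = max(peaks[idx], e.memory_allocated)
    order = sorted(peaks)
    return {
        "bucket_start": [t0 + i * interval for i in order],
        "peak_allocated": [peaks[i] for i in order],
    }


# Stand-in events with TrackingEvent-like fields (not the real class).
events = [
    SimpleNamespace(timestamp=10.0, memory_allocated=100),
    SimpleNamespace(timestamp=10.4, memory_allocated=300),
    SimpleNamespace(timestamp=11.2, memory_allocated=200),
]
print(bucket_peaks(events, interval=1.0))
```

Buckets with no events are simply absent from the output, which is one of several reasonable choices.
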

get_statistics()

Get comprehensive tracking statistics.

Return type:

Dict[str, Any]

export_events(filename, format='csv')

Export tracking events to file.

Parameters:
  • filename (str) – Output filename

  • format (str) – Export format ('csv' or 'json')

Return type:

None
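
A short sketch of exporting in both documented formats follows. The helper export_all_formats, the output directory handling, and the default file stem are hypothetical. The only stormlog call used is export_events(filename, format=...).

```python
from pathlib import Path


def export_all_formats(tracker, out_dir, stem="events"):
    """Write the tracker's events as both CSV and JSON; return the target paths by format."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {}
    for fmt in ("csv", "json"):
        path = out / f"{stem}.{fmt}"
        # filename is documented as str, so convert the Path explicitly.
        tracker.export_events(str(path), format=fmt)
        paths[fmt] = path
    return paths
```
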

clear_events()

Clear all tracking events.

Return type:

None

set_threshold(threshold_name, value)

Set alert threshold.

Parameters:
  • threshold_name (str) – Name of the threshold

  • value (int | float) – Threshold value

Return type:

None

get_alerts(last_n=None)

Get all alert events (warning, critical, and error).

Parameters:

last_n (int | None)

Return type:

List[TrackingEvent]

class stormlog.tracker.MemoryWatchdog(tracker, auto_cleanup=True, cleanup_threshold=0.9, aggressive_cleanup_threshold=0.95)

Bases: object

Memory watchdog for automated memory management.

Parameters:
  • tracker (MemoryTracker)

  • auto_cleanup (bool)

  • cleanup_threshold (float)

  • aggressive_cleanup_threshold (float)

force_cleanup(aggressive=False)

Force immediate memory cleanup.

Parameters:

aggressive (bool)

Return type:

None

get_cleanup_stats()

Get cleanup statistics.

Return type:

Dict[str, Any]
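
The two thresholds suggest a two-tier policy. The sketch below shows one way such a policy could be driven manually through force_cleanup. It assumes, without confirmation from this reference, that the thresholds are fractions of device memory in use. The helper cleanup_if_needed and its return labels are hypothetical, and its defaults simply mirror the documented constructor defaults.

```python
def cleanup_if_needed(watchdog, used, total,
                      cleanup_threshold=0.9, aggressive_cleanup_threshold=0.95):
    """Pick a cleanup level from a utilization ratio and call force_cleanup accordingly."""
    ratio = used / total
    if ratio >= aggressive_cleanup_threshold:
        watchdog.force_cleanup(aggressive=True)
        return "aggressive"
    if ratio >= cleanup_threshold:
        watchdog.force_cleanup(aggressive=False)
        return "standard"
    return "none"


# Intended use with a real watchdog (not executed here):
#     from stormlog.tracker import MemoryTracker, MemoryWatchdog
#     watchdog = MemoryWatchdog(MemoryTracker(), auto_cleanup=False)
#     cleanup_if_needed(watchdog, used=event.device_used, total=event.device_total)
```

With auto_cleanup left at its default of True, the watchdog presumably applies its own policy, so a manual driver like this would mainly be useful when automatic cleanup is disabled.
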