Checkpointing and recovery

  • Save: take periodic or event-driven snapshots of graph node states, the context (inputs, variables, outputs, step_results), in-progress tasks, and the node currently in the WAITING state.
  • Restore: given a checkpoint id, rebuild the graph state and context, then continue execution (e.g., after a resume).
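
The save and restore steps above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the names Checkpoint, GraphRun, save_checkpoint, and restore_checkpoint are hypothetical, and a plain dict stands in for whatever storage backend is really used.

```python
import copy
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class GraphRun:
    """Hypothetical stand-in for the live execution state of one graph run."""
    node_states: Dict[str, str] = field(default_factory=dict)   # node id -> state name
    context: Dict[str, Any] = field(default_factory=dict)       # inputs, variables, outputs, step_results
    in_progress_tasks: List[str] = field(default_factory=list)
    waiting_node: Optional[str] = None                           # node currently in WAITING, if any


@dataclass
class Checkpoint:
    """Hypothetical snapshot record: an id plus a frozen copy of the run."""
    checkpoint_id: str
    run: GraphRun


def save_checkpoint(run: GraphRun, store: Dict[str, Checkpoint]) -> str:
    """Snapshot the run into the store and return the new checkpoint id."""
    checkpoint_id = uuid.uuid4().hex
    # Deep-copy so later mutations of the live run do not alter the snapshot.
    store[checkpoint_id] = Checkpoint(checkpoint_id, copy.deepcopy(run))
    return checkpoint_id


def restore_checkpoint(checkpoint_id: str, store: Dict[str, Checkpoint]) -> GraphRun:
    """Rebuild graph state and context from a checkpoint id."""
    # Deep-copy again so the stored snapshot can be restored more than once.
    return copy.deepcopy(store[checkpoint_id].run)
```

After restore_checkpoint returns, execution would continue from the restored run, for example by resuming the node recorded in waiting_node.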

Flow

  • Storage abstraction: pluggable save/load interface.
  • Use cases: resiliency, long-lived interactions, and operator-driven recovery.
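
One way a pluggable save/load interface could look is sketched below. The CheckpointStore protocol and the two backends (InMemoryStore, JsonFileStore) are assumptions made for illustration; the real interface may use different names and signatures. The sketch also assumes checkpoint data is JSON-serializable.

```python
import json
from pathlib import Path
from typing import Any, Dict, Protocol


class CheckpointStore(Protocol):
    """Hypothetical storage interface: any backend with save/load qualifies."""

    def save(self, checkpoint_id: str, data: Dict[str, Any]) -> None: ...

    def load(self, checkpoint_id: str) -> Dict[str, Any]: ...


class InMemoryStore:
    """Keeps checkpoints in a dict; useful for tests, lost on process exit."""

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}

    def save(self, checkpoint_id: str, data: Dict[str, Any]) -> None:
        # Store serialized JSON so callers cannot mutate the saved snapshot.
        self._items[checkpoint_id] = json.dumps(data)

    def load(self, checkpoint_id: str) -> Dict[str, Any]:
        return json.loads(self._items[checkpoint_id])


class JsonFileStore:
    """Writes one JSON file per checkpoint; survives process restarts."""

    def __init__(self, directory: Path) -> None:
        self._dir = Path(directory)
        self._dir.mkdir(parents=True, exist_ok=True)

    def save(self, checkpoint_id: str, data: Dict[str, Any]) -> None:
        (self._dir / f"{checkpoint_id}.json").write_text(json.dumps(data))

    def load(self, checkpoint_id: str) -> Dict[str, Any]:
        return json.loads((self._dir / f"{checkpoint_id}.json").read_text())


def round_trip(store: CheckpointStore, checkpoint_id: str, data: Dict[str, Any]) -> Dict[str, Any]:
    """Save then load through the interface, regardless of backend."""
    store.save(checkpoint_id, data)
    return store.load(checkpoint_id)
```

Because callers depend only on save and load, a backend can be swapped (for example, in-memory during tests and file-based in production) without touching the code that takes or restores checkpoints.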