
AI Agent Disaster Recovery: Backups, Failover & RTO/RPO

Tijo Gaucher

April 19, 2026 · 15 min read

Your AI agent isn’t a stateless web service. It carries conversation memory, a vector store, in-flight task queues, and dependencies on external LLM providers. When any of those fail, “just restart the container” won’t save you. This guide shows the DR architecture that will.

TL;DR

A production AI agent has four recovery surfaces: vector memory, structured state, in-flight tasks, and LLM provider dependencies. A working DR plan snapshots each one, replicates them cross-region, and practices failover quarterly. Target RTO under 15 minutes, RPO under 5 minutes. This post has production code for both OpenClaw and Hermes Agent. Rapid Claw includes automatic snapshots, cross-region replication, and multi-provider failover out of the box.



Why Agent DR Is Different From Web-App DR

Traditional web-app disaster recovery is well-understood: replicate the database, keep stateless workers behind a load balancer, and rebuild compute from images. Agents break two of those three assumptions. Workers are not stateless — they carry running conversations. And the “database” isn’t just a table of rows — it includes a vector store whose semantic contents can’t be rebuilt cheaply.

If you’ve read the guide on AI agent memory management, you already know that memory lives in three places: short-term context, structured state, and long-term vector memory. DR has to cover all three. If you’ve read the observability guide, you know that traces give you the chronological story of what the agent did — which is exactly what you need to replay in-flight work after a failover.

A second difference: agents depend on external LLM providers. When Anthropic, OpenAI, or your GPU cluster has an incident, your agent is down — even if your own infrastructure is healthy. DR for agents means DR for your provider stack, not just your compute. We’ll cover that in Layer 4.

The four recovery surfaces:

  • Vector memory: embeddings + metadata for long-term recall
  • Structured state: tasks, tool state, user preferences
  • In-flight queue: tasks mid-execution when disaster hits
  • Provider dependencies: LLM providers, GPU nodes, external APIs

Setting Your RTO and RPO Targets

Before you design recovery, decide how much downtime and data loss you can actually tolerate. Those two numbers drive every architectural decision that follows.

| Agent tier | RTO target | RPO target | Pattern |
| --- | --- | --- | --- |
| Customer-facing / revenue | < 2 min | < 30 sec | Active-active multi-region |
| Internal operations | < 15 min | < 5 min | Warm standby + 5-min snapshots |
| Batch / research | < 1 hour | < 1 hour | Cold standby + hourly snapshots |
| Experimental / sandbox | < 24 hr | < 24 hr | Daily snapshots |

Be honest about the tier. A “research agent” that becomes load-bearing for a customer’s weekly report has graduated to internal operations — upgrade its DR plan before that happens, not after.

Layer 1: Backing Up Vector Memory

The problem

Vector memory is the most expensive thing to rebuild. Re-embedding 100K documents can take hours and cost hundreds of dollars in API fees. If you lose your vector store and have to re-index from source, your agent is effectively useless for the duration. Back it up like a database, because that’s what it is.

Snapshotting a Qdrant or pgvector store

Most open agent stacks use Qdrant, Weaviate, or pgvector. All three support point-in-time snapshots that you can ship to object storage. Here’s the pattern for Qdrant:

qdrant_snapshot.py
import boto3
import requests
from datetime import datetime, timezone

QDRANT_URL = "http://qdrant:6333"
S3_BUCKET = "agent-backups"
COLLECTION = "agent_memory"

def snapshot_vector_store():
    """Create a Qdrant snapshot and ship it to S3."""
    # 1. Trigger an on-demand snapshot
    resp = requests.post(f"{QDRANT_URL}/collections/{COLLECTION}/snapshots")
    resp.raise_for_status()
    snapshot_name = resp.json()["result"]["name"]

    # 2. Download the snapshot
    snap = requests.get(
        f"{QDRANT_URL}/collections/{COLLECTION}/snapshots/{snapshot_name}",
        stream=True,
    )
    snap.raise_for_status()

    # 3. Ship to S3 with a timestamped key
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    key = f"vector/{COLLECTION}/{ts}_{snapshot_name}"

    s3 = boto3.client("s3")
    s3.upload_fileobj(snap.raw, S3_BUCKET, key, ExtraArgs={
        "StorageClass": "STANDARD_IA",
        "Metadata": {"collection": COLLECTION, "snapshot": snapshot_name},
    })

    # 4. Delete server-side snapshot after successful upload
    requests.delete(f"{QDRANT_URL}/collections/{COLLECTION}/snapshots/{snapshot_name}")

    return {"s3_key": key, "collection": COLLECTION, "timestamp": ts}

# Run every 5 minutes via cron or a sidecar container

Three details that matter in practice:

  • Cross-region replication. Use S3 cross-region replication (CRR) or its GCS / R2 equivalent. A snapshot in the same region as your agent isn’t a backup — it’s a file.
  • Retention ladder. Keep 24 hourly, 14 daily, 12 monthly. Most recoveries use a snapshot less than 24 hours old; the rest cover rare but painful cases like silent memory corruption discovered days later.
  • Restore rehearsal. An untested backup is not a backup. Restore into a staging environment at least monthly and verify agent behavior afterward.
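The retention ladder can be expressed as a pure function over the timestamped keys the snapshot script produces, which also makes it testable without touching S3 (the actual delete of everything not in the keep-set is a `boto3` loop left to you):

```python
from datetime import datetime

def retention_keep(keys: list[str]) -> set[str]:
    """Apply the 24-hourly / 14-daily / 12-monthly ladder to snapshot keys
    shaped like 'vector/<collection>/<YYYYmmddTHHMMSSZ>_<name>'."""
    def ts(key: str) -> datetime:
        stamp = key.rsplit("/", 1)[-1].split("_", 1)[0]
        return datetime.strptime(stamp, "%Y%m%dT%H%M%SZ")

    newest_first = sorted(keys, key=ts, reverse=True)
    keep: set[str] = set()

    def bucket(fmt: str, limit: int):
        # Keep the newest snapshot per hour/day/month label, up to `limit` labels
        seen: set[str] = set()
        for k in newest_first:
            label = ts(k).strftime(fmt)
            if label not in seen:
                seen.add(label)
                keep.add(k)
            if len(seen) >= limit:
                break

    bucket("%Y%m%d%H", 24)  # 24 hourly
    bucket("%Y%m%d", 14)    # 14 daily
    bucket("%Y%m", 12)      # 12 monthly
    return keep
```

Because a snapshot can satisfy more than one rung (the newest hourly is also the newest daily and monthly), the keep-set is usually smaller than 24 + 14 + 12.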

Layer 2: Structured State & the In-Flight Task Queue

The problem

When a host crashes mid-task, you have two failure modes. Lost progress: the agent forgets it was in the middle of something. Duplicate execution: the agent retries a step that had external side effects (charging a card, sending an email, calling a third-party API). Both are bad. A durable, idempotent task queue solves both.

A checkpoint-based task runner for OpenClaw

Each step of a multi-step task persists a checkpoint before executing. Restart reads the checkpoint, resumes from the next step, and skips any step marked completed. Idempotency keys on external side effects prevent double-execution.

openclaw_checkpoint.py
import json
import uuid
import psycopg2
from contextlib import contextmanager

DB = psycopg2.connect("postgres://agent:***@db/agent_state")

def start_task(payload: dict) -> str:
    task_id = str(uuid.uuid4())
    with DB, DB.cursor() as cur:
        cur.execute(
            """INSERT INTO tasks (id, status, payload, created_at)
               VALUES (%s, 'pending', %s, now())""",
            (task_id, json.dumps(payload)),
        )
    return task_id

@contextmanager
def step(task_id: str, step_name: str, idempotency_key: str):
    """Durable step that survives crashes. Skips if already committed."""
    with DB, DB.cursor() as cur:
        cur.execute(
            "SELECT status FROM task_steps WHERE task_id=%s AND name=%s",
            (task_id, step_name),
        )
        row = cur.fetchone()

        if row and row[0] == "completed":
            # Already done before the crash — skip
            yield None
            return

        cur.execute(
            """INSERT INTO task_steps (task_id, name, idempotency_key, status)
               VALUES (%s, %s, %s, 'running')
               ON CONFLICT (task_id, name) DO UPDATE SET status='running'""",
            (task_id, step_name, idempotency_key),
        )

    try:
        result = {}
        yield result
        with DB, DB.cursor() as cur:
            cur.execute(
                """UPDATE task_steps SET status='completed', result=%s, finished_at=now()
                   WHERE task_id=%s AND name=%s""",
                (json.dumps(result), task_id, step_name),
            )
    except Exception as e:
        with DB, DB.cursor() as cur:
            cur.execute(
                """UPDATE task_steps SET status='failed', error=%s, finished_at=now()
                   WHERE task_id=%s AND name=%s""",
                (str(e), task_id, step_name),
            )
        raise

# Usage
def load_step_result(task_id: str, step_name: str) -> dict:
    """Read a completed step's persisted result (survives restarts)."""
    with DB, DB.cursor() as cur:
        cur.execute(
            "SELECT result FROM task_steps WHERE task_id=%s AND name=%s",
            (task_id, step_name),
        )
        row = cur.fetchone()
        return row[0] if row and row[0] else {}

def run_research_task(task_id: str, query: str):
    with step(task_id, "search_web", idempotency_key=f"search:{query}") as r:
        if r is not None:
            r["results"] = web_search(query)

    with step(task_id, "summarize", idempotency_key=f"sum:{task_id}") as r:
        if r is not None:
            r["summary"] = llm_summarize(query)

    # Re-read the persisted summary so a resumed run still has it,
    # even when the summarize step was completed before the crash
    summary = load_step_result(task_id, "summarize").get("summary", "")

    with step(task_id, "email_report", idempotency_key=f"email:{task_id}") as r:
        if r is not None:
            # External side effect — idempotency key prevents double-send
            send_email(subject=query, body=summary)
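The checkpoint runner above assumes two tables. A minimal Postgres schema matching its queries (column names inferred from the code; types are a reasonable starting point, adjust to taste):

```sql
CREATE TABLE tasks (
    id          uuid PRIMARY KEY,
    status      text NOT NULL,
    payload     jsonb NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE task_steps (
    task_id          uuid NOT NULL REFERENCES tasks(id),
    name             text NOT NULL,
    idempotency_key  text NOT NULL,
    status           text NOT NULL,
    result           jsonb,
    error            text,
    finished_at      timestamptz,
    PRIMARY KEY (task_id, name)  -- backs the ON CONFLICT (task_id, name) clause
);
```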

Durable queues for Hermes Agent

Hermes Agent is a good fit for a durable queue like Redis Streams or NATS JetStream. The pattern is the same: persist intent before executing, ACK only after completion. Here’s the NATS JetStream version, which gives you replay for free:

hermes_jetstream.py
import asyncio
import json
import nats
from nats.js.api import StreamConfig, RetentionPolicy, StorageType

STREAM = "AGENT_TASKS"
SUBJECT = "agent.tasks.*"

async def setup_stream():
    nc = await nats.connect("nats://nats:4222")
    js = nc.jetstream()

    # File-backed, replicated 3x, retain until ACK
    await js.add_stream(StreamConfig(
        name=STREAM,
        subjects=[SUBJECT],
        storage=StorageType.FILE,
        num_replicas=3,
        retention=RetentionPolicy.WORK_QUEUE,
        max_age=86400 * 7,  # 7-day safety net
    ))
    return js

async def enqueue(js, task_id: str, payload: dict):
    """Publish task with msg-id for exactly-once semantics."""
    await js.publish(
        f"agent.tasks.{task_id}",
        json.dumps(payload).encode(),
        headers={"Nats-Msg-Id": task_id},  # dedupes within window
    )

async def worker(js, handler):
    sub = await js.pull_subscribe(SUBJECT, durable="agent_worker")
    while True:
        try:
            msgs = await sub.fetch(batch=1, timeout=30)
        except asyncio.TimeoutError:
            continue  # no work available — poll again
        for msg in msgs:
            task = json.loads(msg.data)
            try:
                await handler(task)
                await msg.ack()
            except Exception:
                await msg.nak(delay=5)  # retry with backoff

If you want a ready-made heartbeat monitor in front of this, the open-source agent-watchdog tool ships with a drop-in liveness probe for long-running agent loops, and agent-probe gives you a Go binary for health checks from your load balancer.

Layer 3: Region Failover

The problem

A single-AZ or single-region deployment can never be more available than the region it runs in. AWS, GCP, and Azure all have multi-hour regional incidents every couple of years. If your agent has to stay up through one, you need cross-region replication and a failover path that doesn’t require a human to click buttons at 3 AM.

Warm standby in a second region

The most cost-effective pattern for most teams: run the full agent stack in region A, replicate state to region B continuously, but run only a thin “pilot light” in region B. On failover, scale up region B from pilot-light to full capacity.

failover.tf (Terraform)
# Primary region: full deployment
module "agent_primary" {
  source    = "./modules/agent-stack"
  region    = "us-east-1"
  replicas  = 3
  role      = "primary"
}

# DR region: pilot light — 1 replica, 0 traffic
module "agent_dr" {
  source    = "./modules/agent-stack"
  region    = "us-west-2"
  replicas  = 1            # scales to 3 on failover
  role      = "standby"
}

# Cross-region S3 replication for vector backups
resource "aws_s3_bucket_replication_configuration" "vector_backups" {
  bucket = aws_s3_bucket.primary_vector_backups.id
  role   = aws_iam_role.replication.arn

  rule {
    id       = "cross-region-vector-backups"
    status   = "Enabled"
    priority = 1
    filter {}  # rules with a priority require a filter block
    delete_marker_replication { status = "Enabled" }
    destination {
      bucket        = aws_s3_bucket.dr_vector_backups.arn
      storage_class = "STANDARD_IA"
      replication_time {
        status = "Enabled"
        time { minutes = 15 }
      }
      metrics {  # required whenever replication_time is enabled
        status = "Enabled"
        event_threshold { minutes = 15 }
      }
    }
  }
}

# Route53 health check + failover record
resource "aws_route53_health_check" "primary" {
  fqdn              = module.agent_primary.health_endpoint
  type              = "HTTPS"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "agent_failover_primary" {
  zone_id = var.zone_id
  name    = "agents.rapidclaw.dev"
  type    = "A"
  set_identifier = "primary"
  failover_routing_policy { type = "PRIMARY" }
  health_check_id = aws_route53_health_check.primary.id
  alias {
    name                   = module.agent_primary.lb_dns
    zone_id                = module.agent_primary.lb_zone_id
    evaluate_target_health = true
  }
}

Route53 promotes the DR region automatically when health checks fail. The catch: the DR region needs a restored vector store before it serves traffic. Either hydrate from the replicated S3 backup on first boot, or keep a streaming replica running in the DR region (more expensive, faster RTO).
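A sketch of the first-boot hydration path, assuming the snapshot layout from Layer 1 and Qdrant’s snapshot-recover endpoint (the DR bucket name and Qdrant URL are illustrative, and the DR node must be able to reach the presigned URL — verify the recover endpoint against your Qdrant version’s docs):

```python
DR_QDRANT_URL = "http://qdrant-dr:6333"
DR_BUCKET = "agent-backups-dr"  # CRR destination bucket
COLLECTION = "agent_memory"

def newest_key(keys: list[str]) -> str:
    """Timestamped keys sort lexicographically, so max() is the newest."""
    return max(keys)

def hydrate_from_latest_snapshot() -> str:
    """Restore the newest replicated snapshot into the DR Qdrant node."""
    # Imported lazily so the pure helper above is usable without AWS deps
    import boto3
    import requests

    s3 = boto3.client("s3")
    pages = s3.get_paginator("list_objects_v2").paginate(
        Bucket=DR_BUCKET, Prefix=f"vector/{COLLECTION}/"
    )
    latest = newest_key(
        [obj["Key"] for page in pages for obj in page.get("Contents", [])]
    )

    # Presign so the DR Qdrant node can pull the snapshot directly
    url = s3.generate_presigned_url(
        "get_object", Params={"Bucket": DR_BUCKET, "Key": latest}, ExpiresIn=900
    )
    resp = requests.put(
        f"{DR_QDRANT_URL}/collections/{COLLECTION}/snapshots/recover",
        json={"location": url},
    )
    resp.raise_for_status()
    return latest
```

Run this in an init container or startup hook in the DR region, gated on a “collection missing or empty” check so a warm standby with a live replica skips it.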

Layer 4: LLM Provider Failover

The problem

Anthropic had a 3-hour outage in March 2026. OpenAI has had several multi-hour incidents in the last year. If your agent has a hard dependency on one provider, your agent’s availability equals the provider’s availability — which for most providers is under 99.9%. A multi-provider router is the answer.

A provider router with circuit breakers

Define your providers in priority order. Each call tries the top healthy provider. When a provider returns 5xx or times out three times in a 60-second window, the circuit breaker opens and traffic fails over to the next provider until the breaker half-opens again.

provider_router.py
import time
from dataclasses import dataclass, field

@dataclass
class Provider:
    name: str
    call: callable
    failures: list[float] = field(default_factory=list)
    opened_at: float | None = None

    def is_open(self, now: float) -> bool:
        if self.opened_at is None:
            return False
        # Reset after 60 seconds and let traffic retry (a simplified
        # stand-in for a true half-open state with a single trial call)
        if now - self.opened_at > 60:
            self.opened_at = None
            self.failures = []
            return False
        return True

    def record_failure(self, now: float):
        # Keep last 60 seconds of failures
        self.failures = [t for t in self.failures if t > now - 60] + [now]
        if len(self.failures) >= 3:
            self.opened_at = now

class ProviderRouter:
    def __init__(self, providers: list[Provider]):
        self.providers = providers

    def call(self, prompt: str, **kwargs):
        now = time.time()
        last_error = None
        for p in self.providers:
            if p.is_open(now):
                continue
            try:
                return p.call(prompt, **kwargs)
            except Exception as e:
                p.record_failure(now)
                last_error = e
                continue
        raise RuntimeError(f"All providers unavailable: {last_error}")

# Configure priority order: primary, secondary, self-hosted fallback
router = ProviderRouter([
    Provider("anthropic", call=anthropic_call),
    Provider("openai",    call=openai_call),
    Provider("local_llama", call=local_llama_call),  # degraded quality, always up
])

response = router.call("Summarize this document: ...")

Important caveats:

  • Quality degradation is expected. A local Llama fallback is not a drop-in replacement for Claude or GPT-4. Your agent should tolerate lower quality during failover — think of it like serving a simpler error page instead of a full app.
  • Prompt differences. Providers respond differently to the same prompt. Maintain provider-specific prompt templates, or test your prompts against each provider before relying on failover.
  • Tool-call format. Anthropic and OpenAI use different tool-call schemas. Use a wrapper (like the agent-router library) that normalizes tool calls across providers.
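If you’d rather hand-roll that normalization, the core of it is a small mapping. A sketch (the input shapes follow Anthropic’s `tool_use` content blocks and OpenAI’s `tool_calls` entries as of this writing; treat the field names as a point-in-time snapshot and verify against each provider’s current docs):

```python
import json

def normalize_tool_calls(provider: str, message: dict) -> list[dict]:
    """Map provider-specific tool calls to one internal shape:
    {"id": ..., "name": ..., "arguments": dict}."""
    if provider == "anthropic":
        # Anthropic: tool calls are content blocks with type == "tool_use",
        # arguments already arrive as a parsed dict in "input"
        return [
            {"id": b["id"], "name": b["name"], "arguments": b["input"]}
            for b in message.get("content", [])
            if isinstance(b, dict) and b.get("type") == "tool_use"
        ]
    if provider == "openai":
        # OpenAI: arguments arrive as a JSON-encoded string
        return [
            {
                "id": c["id"],
                "name": c["function"]["name"],
                "arguments": json.loads(c["function"]["arguments"]),
            }
            for c in message.get("tool_calls", [])
        ]
    raise ValueError(f"unknown provider: {provider}")
```

With the internal shape fixed, the rest of your agent loop never needs to know which provider the router picked.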

The DR Runbook

When something actually goes wrong, nobody has time to read architecture diagrams. Keep a short runbook in the repo and treat it like code. Here’s a template:

RUNBOOK.md
# Agent DR Runbook

## Scenario 1 — Primary region unhealthy (Route53 health check failing)
1. Confirm the outage: check /health on each primary pod + CloudWatch.
2. If > 3 min of failed health checks, Route53 will fail over automatically.
3. Scale the DR region from 1 -> 3 replicas:
   $ kubectl scale -n agents deploy/agent --replicas=3 --context=dr-cluster
4. Verify vector store is hydrated:
   $ curl https://qdrant-dr/collections/agent_memory | jq .result.points_count
5. Smoke test: $ ./scripts/dr-smoke-test.sh
6. Post to #incidents with timeline.

## Scenario 2 — Vector memory corruption detected
1. Put agent in read-only mode:
   $ kubectl set env deploy/agent READ_ONLY=true
2. Restore latest healthy snapshot from S3:
   $ ./scripts/restore-vector-snapshot.sh --snapshot=20260419T020000Z
3. Verify restored collection count matches expected baseline.
4. Remove read-only flag, monitor error rate for 30 min.

## Scenario 3 — Primary LLM provider down
1. Router handles this automatically (see provider_router.py).
2. If persistent (> 30 min), page on-call to review quality metrics.
3. Consider pausing non-critical agent workloads until primary recovers.

## Scenario 4 — In-flight task queue lost
1. NATS JetStream is 3x replicated — this should not happen.
2. If it does: tasks are replayable from agent_tasks table.
3. $ ./scripts/replay-tasks.sh --since=<last-known-good-timestamp>

Testing Your DR Plan

A DR plan you’ve never executed is a theory. Run real game days. The testing-in-production guide covers canary and shadow-traffic patterns; the DR equivalent is chaos engineering.

  • Monthly: restore the latest vector snapshot into staging, run the regression suite, confirm parity.
  • Quarterly: fail over to the DR region in off-hours, leave it running for 24 hours, and measure RTO/RPO actuals against targets.
  • Quarterly: kill the primary LLM provider (block its egress). Confirm the router fails over within 30 seconds and agent tasks complete on the fallback provider.
  • Annually: run a full black-box DR exercise: one team destroys a random component, another team executes the runbook. Update the runbook with what they actually needed vs what was documented.
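Measure RTO actuals during the drill rather than eyeballing them. A self-contained sketch that polls the agent’s health endpoint and reports time-to-recovery (the endpoint URL and polling interval are placeholders to adapt to your stack):

```python
import time
import urllib.request

def measure_rto(health_url: str, interval: float = 5.0,
                timeout: float = 1800.0) -> float:
    """Poll until the health endpoint answers 200 again; return seconds elapsed."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            with urllib.request.urlopen(health_url, timeout=2) as resp:
                if resp.status == 200:
                    return time.monotonic() - start
        except OSError:
            pass  # connection refused / DNS failover in progress — keep polling
        time.sleep(interval)
    raise TimeoutError(f"no recovery within {timeout}s")
```

Start it the moment you pull the plug on the primary; the returned number is the RTO actual to record against the target from your tier table.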

The managed alternative

Building all of this yourself is a multi-quarter project. Rapid Claw includes 5-minute vector snapshots, cross-region replication, a multi-provider LLM router, and a one-click restore flow. If you’d rather skip the undifferentiated heavy lifting, see the complete hosting guide for how it fits with the rest of your stack.

