
AI Agent Versioning & Rollback: Safe Rollouts for Production

Tijo Gaucher

April 19, 2026·15 min read

A one-line prompt change can break an agent that passed every eval. A “minor” model upgrade can silently change tool-call formatting. You need the same release-engineering rigor you’d use for a database migration — versioned artifacts, canary rollouts, and rollback that works in under a minute.

TL;DR

Version five things together as one immutable artifact: prompts, tool definitions, model pin, memory schema, configuration. Roll out with a canary (1% → 10% → 50% → 100%). Auto-rollback on metric regression. Keep memory migrations backward-compatible. Never use “latest” model IDs. This guide has production code for OpenClaw and Hermes Agent; Rapid Claw ships all of it out of the box.



Why Versioning AI Agents Is Harder Than Versioning Web Apps

You push a bug fix to a REST API. The tests pass. You deploy. Errors stay flat. You move on. Agents don’t work that way. You tweak a single sentence in a system prompt, all your evals pass, you deploy — and two days later a user reports that the agent stopped offering to summarize attachments. Nothing errored. The new version just behaves differently.

That’s the core problem. Agent behavior is an emergent property of prompt + model + tools + memory + configuration. Change any one of those and the behavior shifts in ways you can’t fully predict. The failure modes of AI agents in production are dominated by silent regressions, not crashes. Your release engineering has to be built around that fact.

Two principles carry most of the weight: everything is versioned together as one immutable artifact, and rollbacks are cheap and routine. The rest of this guide shows how to implement both for OpenClaw and Hermes Agent.

The Five Things to Version Together

Every agent release pins these five components. If any one changes, the version number changes, and the whole artifact is rebuilt and re-released.

Prompts

System prompt, task templates, few-shot examples. Stored in git, not in a database. Readable as text.

Tool definitions

Function signatures, descriptions, parameter schemas. A changed description changes agent behavior just as much as a changed signature.

Model pin

Exact model version (e.g. claude-sonnet-4-6, gpt-4-turbo-2024-04-09). Never "latest". Never just "claude-sonnet".

Memory schema

Vector metadata shape, structured-state columns, migration version. Backward-compatible migrations only.

Configuration

Rate limits, permissions, routing policies. The runtime knobs that change agent behavior.

An agent manifest file

Capture all five in a single manifest that ships with every build. This is the deployable artifact — not a container image on its own, and not a commit SHA on its own.

agent.manifest.yaml
# Agent manifest — the source of truth for a single release
apiVersion: agents/v1
kind: AgentRelease
metadata:
  name: research-agent
  version: 2026.04.19-a3f2c1          # date + short git sha
  image: ghcr.io/acme/research-agent:2026.04.19-a3f2c1

spec:
  # 1. Prompts — content-hashed, stored as files in the image
  prompts:
    system: prompts/system.md          # sha256: a3f2c1...
    task_templates: prompts/tasks/     # sha256 per file
    few_shot_examples: prompts/examples/

  # 2. Tool definitions — semver per tool
  tools:
    - name: web_search
      version: 1.4.0
      schema: schemas/web_search.json
    - name: send_email
      version: 2.1.0
      schema: schemas/send_email.json

  # 3. Model pin — exact version, never "latest"
  model:
    primary: claude-sonnet-4-6
    fallback: gpt-4-turbo-2024-04-09
    local: meta-llama-3-70b-instruct:2026-02-01

  # 4. Memory schema
  memory:
    schema_version: 7
    migration_from: 6                  # backward-compatible

  # 5. Runtime configuration
  config:
    rate_limits:
      web_search: "20/60s"
      send_email: "5/300s"
    max_tokens_per_task: 100000
    provider_routing: priority_failover

# Rollout plan for this release
rollout:
  strategy: canary
  stages:
    - { traffic: 1,  duration: "30m" }
    - { traffic: 10, duration: "2h" }
    - { traffic: 50, duration: "6h" }
    - { traffic: 100 }
  guardrails:
    task_success_rate: { min: 0.97 }   # vs previous baseline
    p95_latency_ms:    { max: 8000 }
    token_spend_delta: { max: 0.15 }   # +15% allowed vs baseline
    eval_score:        { min: 0.90 }
  auto_rollback_on_failure: true

A few things this buys you: the manifest is small and diffable in a pull request; the content hashes make tamper detection trivial; the rollout plan lives alongside the thing it rolls out, so it ages with the release. Every production deploy has a corresponding manifest, and every manifest has a corresponding commit.
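Because the hashes are recorded at build time, drift detection takes only a few lines. Here is a minimal sketch: the manifest above records hashes only in comments, so this assumes the build step also emits a path-to-sha256 mapping, and `verify_hashes` is an illustrative name, not part of any framework.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_hashes(root: Path, recorded: dict[str, str]) -> list[str]:
    """Return the relative paths whose current hash no longer matches
    the hash recorded in the manifest at build time."""
    return [
        rel for rel, expected in recorded.items()
        if sha256_of(root / rel) != expected
    ]
```

Run this at container startup and fail fast on a non-empty result; a prompt file edited after the build is a release that never actually existed.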

Canary Rollouts With Metric Guardrails

Why not blue-green?

Blue-green flips 100% of traffic in one cut. For deterministic services that’s usually fine. For agents it’s dangerous: a quality regression that evals missed will hit every user at once, and the damage may take hours to detect. Canary gives you a real-traffic early-warning window with a small blast radius.

Traffic splitting for OpenClaw

Route traffic between old and new versions based on a stable hash of the user or session ID. Consistent hashing means a given user always lands on the same version during the canary — that’s important because agents have memory, and flip-flopping a user between versions would confuse conversation context.

canary_router.py
import hashlib
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Version:
    name: str                       # e.g. "2026.04.19-a3f2c1"
    handler: Callable[[dict], Any]  # function that runs the agent
    traffic_pct: float              # 0-100

class CanaryRouter:
    """Sticky per-user version routing."""

    def __init__(self, stable: Version, canary: Version):
        self.stable = stable
        self.canary = canary

    def route(self, user_id: str) -> Version:
        # Stable hash → deterministic bucketing 0-99
        digest = hashlib.sha256(user_id.encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        if bucket < self.canary.traffic_pct:
            return self.canary
        return self.stable

    def handle(self, user_id: str, task: dict):
        version = self.route(user_id)
        # Tag every span/log with the version — essential for metric
        # comparison. `trace` stands in for your tracing client here,
        # e.g. setting an attribute on the current OpenTelemetry span.
        with trace.attribute("agent.version", version.name):
            return version.handler(task)

# Bump canary traffic as stages progress
router = CanaryRouter(
    stable=Version("2026.04.12-b1e4f0", handler=old_agent, traffic_pct=0),
    canary=Version("2026.04.19-a3f2c1", handler=new_agent, traffic_pct=1),
)

Metric-based guardrails

At each canary stage, compare metrics from the canary cohort to the stable cohort over the same window. If any guardrail trips, the rollout halts and traffic shifts back to the stable version. The metrics come directly from the observability stack covered in the observability guide.

guardrails.py
from dataclasses import dataclass

@dataclass
class GuardrailResult:
    metric: str
    stable: float
    canary: float
    passed: bool
    reason: str = ""

def check_guardrails(stable_metrics: dict, canary_metrics: dict) -> list[GuardrailResult]:
    results = []

    # 1. Task success rate — canary must not drop more than 2pp below stable
    delta = stable_metrics["success_rate"] - canary_metrics["success_rate"]
    results.append(GuardrailResult(
        metric="task_success_rate",
        stable=stable_metrics["success_rate"],
        canary=canary_metrics["success_rate"],
        passed=delta < 0.02,
        reason=f"Canary success rate {delta*100:.1f}pp below stable",
    ))

    # 2. p95 latency — canary must not exceed stable by more than 20%
    ratio = canary_metrics["p95_latency_ms"] / stable_metrics["p95_latency_ms"]
    results.append(GuardrailResult(
        metric="p95_latency",
        stable=stable_metrics["p95_latency_ms"],
        canary=canary_metrics["p95_latency_ms"],
        passed=ratio < 1.20,
        reason=f"Canary p95 latency {(ratio-1)*100:.0f}% higher than stable",
    ))

    # 3. Token spend per task — canary must not exceed stable by more than 15%
    ratio = canary_metrics["avg_tokens"] / stable_metrics["avg_tokens"]
    results.append(GuardrailResult(
        metric="token_spend",
        stable=stable_metrics["avg_tokens"],
        canary=canary_metrics["avg_tokens"],
        passed=ratio < 1.15,
        reason=f"Canary token spend {(ratio-1)*100:.0f}% higher than stable",
    ))

    # 4. Held-out eval score — canary must score >= 0.90
    results.append(GuardrailResult(
        metric="eval_score",
        stable=stable_metrics["eval_score"],
        canary=canary_metrics["eval_score"],
        passed=canary_metrics["eval_score"] >= 0.90,
        reason=f"Canary eval score {canary_metrics['eval_score']:.2f} below 0.90 threshold",
    ))

    return results

def should_rollback(results: list[GuardrailResult]) -> bool:
    return any(not r.passed for r in results)

Tooling-wise, if you don’t want to hand-roll the eval harness, the open-source agent-bench project benchmarks LLM agents across providers on speed, cost, and quality and produces a report card you can diff between versions. Wire it into your CI and you get regression detection before the canary even starts.

Rollouts With Hermes Agent & Kubernetes

If you’re running Hermes Agent on Kubernetes, Argo Rollouts or Flagger automates canary progression and guardrails at the ingress layer. Here’s the Argo Rollouts manifest that implements the rollout plan from our agent manifest:

hermes-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: hermes-agent
  namespace: agents
spec:
  replicas: 6
  selector:
    matchLabels: { app: hermes-agent }
  strategy:
    canary:
      canaryService: hermes-agent-canary
      stableService: hermes-agent-stable
      trafficRouting:
        istio:
          virtualService:
            name: hermes-agent
      steps:
        - setWeight: 1
        - pause: { duration: 30m }
        - analysis:
            templates: [{ templateName: agent-guardrails }]
        - setWeight: 10
        - pause: { duration: 2h }
        - analysis:
            templates: [{ templateName: agent-guardrails }]
        - setWeight: 50
        - pause: { duration: 6h }
        - analysis:
            templates: [{ templateName: agent-guardrails }]
        - setWeight: 100
  template:
    spec:
      containers:
        - name: hermes
          image: ghcr.io/acme/hermes-agent:2026.04.19-a3f2c1
          env:
            - { name: AGENT_VERSION, value: "2026.04.19-a3f2c1" }
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: agent-guardrails
spec:
  metrics:
    - name: task-success-rate
      interval: 5m
      successCondition: result[0] >= 0.97
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(agent_task_success_total{version="2026.04.19-a3f2c1"}[5m]))
            /
            sum(rate(agent_task_total{version="2026.04.19-a3f2c1"}[5m]))
    - name: p95-latency
      interval: 5m
      successCondition: result[0] < 8000
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.95, sum(rate(
              agent_task_duration_ms_bucket{version="2026.04.19-a3f2c1"}[5m]
            )) by (le))

Backward-Compatible Memory Migrations

The problem with canary + shared memory

Canary rollouts have two versions serving traffic simultaneously. Both versions read and write the same memory store. If the new version writes data in a shape the old version can’t read, rollback will fail — the old version will crash on data the new version wrote.

The solution: migrations are additive only during a rollout. Add new fields; don’t remove or rename old ones. The old version ignores new fields; the new version reads both old and new. Deprecation happens two releases later, after the new version has been stable at 100% for long enough to trust.

migration_7.sql
-- Migration v6 → v7
-- Goal: replace user_preferences.timezone (string) with structured location
-- Strategy: additive — both old and new versions keep working

BEGIN;

-- Add new columns with sensible defaults. Old code ignores them.
ALTER TABLE user_preferences
  ADD COLUMN location_timezone TEXT,
  ADD COLUMN location_country  TEXT,
  ADD COLUMN location_region   TEXT;

-- Backfill from the old column so the new code has data immediately
UPDATE user_preferences
   SET location_timezone = timezone
 WHERE location_timezone IS NULL AND timezone IS NOT NULL;

-- Schema version bump — the running code checks this at startup
UPDATE meta SET schema_version = 7 WHERE key = 'memory';

COMMIT;

-- IMPORTANT: do NOT drop the old "timezone" column in this migration.
-- Schedule the drop for migration v9 after v7 has been stable at 100% for 30 days.
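The "running code checks this at startup" comment deserves its own few lines. A sketch of that check, assuming each build declares the set of schema versions it can read (the names here are illustrative):

```python
# Versions this build can read: v6 (old shape) and v7 (additive columns).
SUPPORTED_SCHEMA_VERSIONS = {6, 7}

def assert_schema_compatible(current_version: int) -> None:
    """Fail fast at startup if the memory store's schema is one this
    build cannot read, instead of crashing mid-task later."""
    if current_version not in SUPPORTED_SCHEMA_VERSIONS:
        raise RuntimeError(
            f"memory schema v{current_version} not supported by this build "
            f"(supports {sorted(SUPPORTED_SCHEMA_VERSIONS)})"
        )
```

Because the stable and canary builds both declare v6 and v7 as readable, either one can start against the migrated store, which is exactly the property a mid-canary rollback depends on.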

The same pattern applies to vector metadata: if you’re changing how chunks are tagged, write both the old tag and the new tag during the rollout window. Re-index in a background job. Cut over reads to the new tag only after the write-side migration is complete everywhere.
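The dual-write window can be a thin wrapper over your metadata writes. Here is a sketch assuming a plain metadata dict per chunk and hypothetical tag names (a single-string `topic` being replaced by a `topics` list); adapt the field names to your own schema.

```python
def dual_write_metadata(chunk_meta: dict) -> dict:
    """During the rollout window, emit both the old and new tag shapes.

    Old readers look at "topic" (single string); new readers look at
    "topics" (list of strings). Writing both keeps rollback safe.
    """
    meta = dict(chunk_meta)  # copy; never mutate the caller's dict
    if "topics" in meta and "topic" not in meta:
        meta["topic"] = meta["topics"][0] if meta["topics"] else None
    elif "topic" in meta and "topics" not in meta:
        meta["topics"] = [meta["topic"]]
    return meta
```

Delete the wrapper, and the old field, only after every writer in the fleet is on the new shape and the background re-index has finished.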

Rolling Back

A rollback is just “deploy the previous manifest.” Nothing special should be required. The goal is a sub-60-second mean time to recover.

rollback.sh
#!/bin/bash
# Roll back to the previous stable agent release.
set -euo pipefail

CURRENT=$(kubectl get rollout hermes-agent -n agents -o jsonpath='{.spec.template.spec.containers[0].image}')
PREVIOUS=$(git log --format="%H" -- agent.manifest.yaml | sed -n 2p)

echo "Current: $CURRENT"
echo "Rolling back to manifest at commit: $PREVIOUS"

# Abort any in-progress rollout first
kubectl argo rollouts abort hermes-agent -n agents

# Restore the previous manifest and apply
git show "$PREVIOUS:agent.manifest.yaml" > /tmp/rollback-manifest.yaml
./scripts/apply-manifest.sh /tmp/rollback-manifest.yaml

# Wait for stable and verify
kubectl argo rollouts status hermes-agent -n agents --timeout=2m
./scripts/smoke-test.sh --version=previous

echo "Rollback complete. Verify metrics at https://grafana/d/agent-overview"

Rollback must not require a database migration

Because additive schema migrations leave the old columns intact, the previous version Just Works.

Drain in-flight work on the new version before rolling back

If you have a task queue, let the canary finish its current tasks. Stop enqueuing to it first.

Invalidate any caches that were written by the new version

Otherwise the old code will read values it does not understand.

Post-rollback, lock the release and write the postmortem

A rollback without a postmortem is a rollback you will repeat.
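One way to make the cache-invalidation step unnecessary altogether is to namespace cache keys by agent version. This is a design sketch, not how any particular framework behaves; the key format is an assumption.

```python
def cache_key(agent_version: str, task_fingerprint: str) -> str:
    """Namespace cached results by the release that produced them.

    After a rollback, the old version only ever reads keys it wrote
    itself; entries from the rolled-back release are simply never hit
    again and can expire via TTL instead of explicit invalidation.
    """
    return f"agent:{agent_version}:result:{task_fingerprint}"
```

The trade-off is a cold cache for whichever version takes over, which is usually a better failure mode than the old code reading values it does not understand.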

Shadow Deployments for Risky Changes

For changes that are particularly risky — a new model, a major prompt rewrite, a tool protocol change — run a shadow deployment before the canary. Shadow means the new version receives a copy of real traffic but its responses never reach the user. You compare its behavior to the stable version offline.
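The mechanics are simple enough to sketch: serve the user from stable, mirror a copy of the task to the shadow version off the request path, and record the shadow's output for offline comparison. The function and callback names below are illustrative; a production version would use your task queue rather than a raw thread.

```python
import threading

def handle_with_shadow(task: dict, stable_handler, shadow_handler, record_diff):
    """Serve the user from stable; mirror the task to the shadow version.

    The shadow runs in a background thread and its output is only
    recorded for offline comparison — it never reaches the user.
    """
    def _shadow():
        try:
            shadow_out = shadow_handler(dict(task))  # copy: no shared mutation
            record_diff(task, shadow_out)
        except Exception:
            pass  # a shadow failure must never affect the live path

    threading.Thread(target=_shadow, daemon=True).start()
    return stable_handler(task)
```

Two caveats worth enforcing: the shadow must run with side-effecting tools stubbed out (it must not actually send the email twice), and its token spend counts against your budget, so sample traffic rather than mirroring all of it.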

This is the pattern covered in more depth in the testing-in-production guide. Combined with the debugging playbook, shadow deployments catch a surprising share of regressions before any user is exposed.

The managed alternative

Every Rapid Claw deploy is an immutable versioned artifact with canary rollout, metric-based guardrails, shadow traffic support, and one-click rollback under 10 seconds. Memory migrations are backward-compatible by default. If you’d rather not build the release-engineering stack from scratch, see the complete hosting guide for how it fits with the rest of your infrastructure.
