AI Agent Versioning & Rollback: Safe Rollouts for Production
April 19, 2026 · 15 min read
A one-line prompt change can break an agent that passed every eval. A “minor” model upgrade can silently change tool-call formatting. You need the same release-engineering rigor you’d use for a database migration — versioned artifacts, canary rollouts, and rollback that works in under a minute.
TL;DR
Version five things together as one immutable artifact: prompts, tool definitions, model pin, memory schema, configuration. Roll out with a canary (1% → 10% → 50% → 100%). Auto-rollback on metric regression. Keep memory migrations backward-compatible. Never use “latest” model IDs. This guide has production code for OpenClaw and Hermes Agent; Rapid Claw ships all of it out of the box.

Want versioning & rollback built in?
Try Rapid Claw

Why Versioning AI Agents Is Harder Than Versioning Web Apps
You push a bug fix to a REST API. The tests pass. You deploy. Errors stay flat. You move on. Agents don’t work that way. You tweak a single sentence in a system prompt, all your evals pass, you deploy — and two days later a user reports that the agent stopped offering to summarize attachments. Nothing errored. The new version just behaves differently.
That’s the core problem. Agent behavior is an emergent property of prompt + model + tools + memory + configuration. Change any one of those and the behavior shifts in ways you can’t fully predict. The failure modes of AI agents in production are dominated by silent regressions, not crashes. Your release engineering has to be built around that fact.
Two principles carry most of the weight: everything is versioned together as one immutable artifact, and rollbacks are cheap and routine. The rest of this guide shows how to implement both for OpenClaw and Hermes Agent.
The Five Things to Version Together
Every agent release pins these five components. If any one changes, the version number changes, and the whole artifact is rebuilt and re-released.
Prompts
System prompt, task templates, few-shot examples. Stored in git, not in a database. Readable as text.
Tool definitions
Function signatures, descriptions, parameter schemas. A changed description changes agent behavior just as much as a changed signature.
Model pin
Exact model version (e.g. claude-sonnet-4-6, gpt-4-turbo-2024-04-09). Never "latest". Never just "claude-sonnet".
Memory schema
Vector metadata shape, structured-state columns, migration version. Backward-compatible migrations only.
Configuration
Rate limits, permissions, routing policies. The runtime knobs that change agent behavior.
An agent manifest file
Capture all five in a single manifest that ships with every build. This is the deployable artifact — not a container image on its own, and not a commit SHA on its own.
# Agent manifest — the source of truth for a single release
apiVersion: agents/v1
kind: AgentRelease
metadata:
  name: research-agent
  version: 2026.04.19-a3f2c1  # date + short git sha
  image: ghcr.io/acme/research-agent:2026.04.19-a3f2c1
spec:
  # 1. Prompts — content-hashed, stored as files in the image
  prompts:
    system: prompts/system.md        # sha256: a3f2c1...
    task_templates: prompts/tasks/   # sha256 per file
    few_shot_examples: prompts/examples/
  # 2. Tool definitions — semver per tool
  tools:
    - name: web_search
      version: 1.4.0
      schema: schemas/web_search.json
    - name: send_email
      version: 2.1.0
      schema: schemas/send_email.json
  # 3. Model pin — exact version, never "latest"
  model:
    primary: claude-sonnet-4-6
    fallback: gpt-4-turbo-2024-04-09
    local: meta-llama-3-70b-instruct:2026-02-01
  # 4. Memory schema
  memory:
    schema_version: 7
    migration_from: 6  # backward-compatible
  # 5. Runtime configuration
  config:
    rate_limits:
      web_search: "20/60s"
      send_email: "5/300s"
    max_tokens_per_task: 100000
    provider_routing: priority_failover
  # Rollout plan for this release
  rollout:
    strategy: canary
    stages:
      - { traffic: 1, duration: "30m" }
      - { traffic: 10, duration: "2h" }
      - { traffic: 50, duration: "6h" }
      - { traffic: 100 }
    guardrails:
      task_success_rate: { min: 0.97 }  # vs previous baseline
      p95_latency_ms: { max: 8000 }
      token_spend_delta: { max: 0.15 }  # +15% allowed vs baseline
      eval_score: { min: 0.90 }
    auto_rollback_on_failure: true

A few things this buys you: the manifest is small and diffable in a pull request; the content hashes make tamper detection trivial; and the rollout plan lives alongside the thing it rolls out, so it ages with the release. Every production deploy has a corresponding manifest, and every manifest has a corresponding commit.
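The tamper-detection claim is easy to enforce at startup: hash the prompt files in the running image and compare against the manifest before serving traffic. A minimal sketch — `verify_prompts` and the manifest-dict shape are illustrative, not part of any framework:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Content hash of a single prompt file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_prompts(manifest_hashes: dict[str, str], root: Path) -> list[str]:
    """Return relative paths whose on-disk content no longer matches the manifest.

    manifest_hashes maps a relative path (e.g. "system.md") to its expected
    sha256 hex digest, as recorded in the release manifest.
    """
    return [
        rel for rel, expected in manifest_hashes.items()
        if sha256_file(root / rel) != expected
    ]
```

Refuse to start (or page) if the returned list is non-empty — a mismatch means the artifact serving traffic is not the artifact that was reviewed.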
Canary Rollouts With Metric Guardrails
Why not blue-green?
Blue-green flips 100% of traffic in one cut. For deterministic services that’s usually fine. For agents it’s dangerous: a quality regression that evals missed will hit every user at once, and the damage may take hours to detect. Canary gives you a real-traffic early-warning window with a small blast radius.
Traffic splitting for OpenClaw
Route traffic between old and new versions based on a stable hash of the user or session ID. Consistent hashing means a given user always lands on the same version during the canary — that’s important because agents have memory, and flip-flopping a user between versions would confuse conversation context.
import hashlib
from dataclasses import dataclass

@dataclass
class Version:
    name: str            # e.g. "2026.04.19-a3f2c1"
    handler: callable    # function that runs the agent
    traffic_pct: float   # 0-100

class CanaryRouter:
    """Sticky per-user version routing."""

    def __init__(self, stable: Version, canary: Version):
        self.stable = stable
        self.canary = canary

    def route(self, user_id: str) -> Version:
        # Stable hash → deterministic bucketing 0-99
        digest = hashlib.sha256(user_id.encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        if bucket < self.canary.traffic_pct:
            return self.canary
        return self.stable

    def handle(self, user_id: str, task: dict):
        version = self.route(user_id)
        # Tag every span/log with the version — essential for metric comparison.
        # `trace` here is a thin wrapper over your tracing library (e.g. OpenTelemetry).
        with trace.attribute("agent.version", version.name):
            return version.handler(task)

# Bump canary traffic as stages progress
router = CanaryRouter(
    stable=Version("2026.04.12-b1e4f0", handler=old_agent, traffic_pct=0),
    canary=Version("2026.04.19-a3f2c1", handler=new_agent, traffic_pct=1),
)

Metric-based guardrails
At each canary stage, compare metrics from the canary cohort to the stable cohort over the same window. If any guardrail trips, the rollout halts and traffic shifts back to the stable version. The metrics come directly from the observability stack covered in the observability guide.
from dataclasses import dataclass

@dataclass
class GuardrailResult:
    metric: str
    stable: float
    canary: float
    passed: bool
    reason: str = ""

def check_guardrails(stable_metrics: dict, canary_metrics: dict) -> list[GuardrailResult]:
    results = []

    # 1. Task success rate — canary must not drop more than 2pp below stable
    delta = stable_metrics["success_rate"] - canary_metrics["success_rate"]
    results.append(GuardrailResult(
        metric="task_success_rate",
        stable=stable_metrics["success_rate"],
        canary=canary_metrics["success_rate"],
        passed=delta < 0.02,
        reason=f"Canary success rate {delta*100:.1f}pp below stable",
    ))

    # 2. p95 latency — canary must not exceed stable by more than 20%
    ratio = canary_metrics["p95_latency_ms"] / stable_metrics["p95_latency_ms"]
    results.append(GuardrailResult(
        metric="p95_latency",
        stable=stable_metrics["p95_latency_ms"],
        canary=canary_metrics["p95_latency_ms"],
        passed=ratio < 1.20,
        reason=f"Canary p95 latency {(ratio-1)*100:.0f}% higher than stable",
    ))

    # 3. Token spend per task — canary must not exceed stable by more than 15%
    ratio = canary_metrics["avg_tokens"] / stable_metrics["avg_tokens"]
    results.append(GuardrailResult(
        metric="token_spend",
        stable=stable_metrics["avg_tokens"],
        canary=canary_metrics["avg_tokens"],
        passed=ratio < 1.15,
        reason=f"Canary token spend {(ratio-1)*100:.0f}% higher than stable",
    ))

    # 4. Held-out eval score — canary must score >= 0.90
    results.append(GuardrailResult(
        metric="eval_score",
        stable=stable_metrics["eval_score"],
        canary=canary_metrics["eval_score"],
        passed=canary_metrics["eval_score"] >= 0.90,
        reason=f"Canary eval score {canary_metrics['eval_score']:.2f} below 0.90 threshold",
    ))

    return results

def should_rollback(results: list[GuardrailResult]) -> bool:
    return any(not r.passed for r in results)

Tooling-wise, if you don't want to hand-roll the eval harness, the open-source agent-bench project benchmarks LLM agents across providers on speed, cost, and quality, and produces a report card you can diff between versions. Wire it into your CI and you get regression detection before the canary even starts.
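A CI regression gate over such report cards needs only a few lines. This sketch assumes a flat JSON shape ({"eval_score": ..., "avg_tokens": ...}); the keys and tolerances are illustrative, not agent-bench's actual output format:

```python
def regressions(baseline: dict, candidate: dict,
                max_drops: dict[str, float]) -> list[str]:
    """Return human-readable regressions where candidate worsened past tolerance.

    max_drops maps a metric name to a relative tolerance (0.05 = 5%).
    For "*score" metrics lower is worse; for cost/latency metrics higher is worse.
    """
    problems = []
    for metric, tolerance in max_drops.items():
        base, cand = baseline[metric], candidate[metric]
        worse = base - cand if metric.endswith("score") else cand - base
        if worse > tolerance * abs(base):
            problems.append(f"{metric}: {base} -> {cand}")
    return problems
```

Fail the CI job when the returned list is non-empty, and print the list in the job output so the regression is visible in the pull request.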
Rollouts With Hermes Agent & Kubernetes
If you’re running Hermes Agent on Kubernetes, Argo Rollouts or Flagger automates canary progression and guardrails at the ingress layer. Here’s the Argo Rollouts manifest that implements the rollout plan from our agent manifest:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: hermes-agent
  namespace: agents
spec:
  replicas: 6
  selector:
    matchLabels: { app: hermes-agent }
  strategy:
    canary:
      canaryService: hermes-agent-canary
      stableService: hermes-agent-stable
      trafficRouting:
        istio:
          virtualService:
            name: hermes-agent
      steps:
        - setWeight: 1
        - pause: { duration: 30m }
        - analysis:
            templates: [{ templateName: agent-guardrails }]
        - setWeight: 10
        - pause: { duration: 2h }
        - analysis:
            templates: [{ templateName: agent-guardrails }]
        - setWeight: 50
        - pause: { duration: 6h }
        - analysis:
            templates: [{ templateName: agent-guardrails }]
        - setWeight: 100
  template:
    metadata:
      labels: { app: hermes-agent }  # must match the selector above
    spec:
      containers:
        - name: hermes
          image: ghcr.io/acme/hermes-agent:2026.04.19-a3f2c1
          env:
            - { name: AGENT_VERSION, value: "2026.04.19-a3f2c1" }
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: agent-guardrails
spec:
  metrics:
    - name: task-success-rate
      interval: 5m
      successCondition: result[0] >= 0.97
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(agent_task_success_total{version="2026.04.19-a3f2c1"}[5m]))
            /
            sum(rate(agent_task_total{version="2026.04.19-a3f2c1"}[5m]))
    - name: p95-latency
      interval: 5m
      successCondition: result[0] < 8000
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.95, sum(rate(
              agent_task_duration_ms_bucket{version="2026.04.19-a3f2c1"}[5m]
            )) by (le))

Backward-Compatible Memory Migrations
The problem with canary + shared memory
Canary rollouts have two versions serving traffic simultaneously. Both versions read and write the same memory store. If the new version writes data in a shape the old version can’t read, rollback will fail — the old version will crash on data the new version wrote.
The solution: migrations are additive only during a rollout. Add new fields; don’t remove or rename old ones. The old version ignores new fields; the new version reads both old and new. Deprecation happens two releases later, after the new version has been stable at 100% for long enough to trust.
-- Migration v6 → v7
-- Goal: replace user_preferences.timezone (string) with structured location
-- Strategy: additive — both old and new versions keep working

BEGIN;

-- Add new columns with sensible defaults. Old code ignores them.
ALTER TABLE user_preferences
  ADD COLUMN location_timezone TEXT,
  ADD COLUMN location_country TEXT,
  ADD COLUMN location_region TEXT;

-- Backfill from the old column so the new code has data immediately
UPDATE user_preferences
SET location_timezone = timezone
WHERE location_timezone IS NULL AND timezone IS NOT NULL;

-- Schema version bump — the running code checks this at startup
UPDATE meta SET schema_version = 7 WHERE key = 'memory';

COMMIT;

-- IMPORTANT: do NOT drop the old "timezone" column in this migration.
-- Schedule the drop for migration v9 after v7 has been stable at 100% for 30 days.

The same pattern applies to vector metadata: if you're changing how chunks are tagged, write both the old tag and the new tag during the rollout window. Re-index in a background job. Cut over reads to the new tag only after the write-side migration is complete everywhere.
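At the write path, dual-tagging is a one-line habit. A sketch — the field names (`category`, `category_v2`) are illustrative, not a standard:

```python
def chunk_metadata(doc_id: str, old_tag: str, new_tag: str) -> dict:
    """Metadata for a vector chunk during a tag-migration rollout window.

    Both tags are present, so the stable version filters on the old field
    and the canary filters on the new one without either crashing.
    """
    return {
        "doc_id": doc_id,
        "category": old_tag,       # read by the stable version
        "category_v2": new_tag,    # read by the canary; becomes canonical later
        "schema_version": 7,
    }
```

Once the new version has held 100% for the deprecation window, a background re-index drops the old field and reads cut over.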
Rolling Back
A rollback is just “deploy the previous manifest.” Nothing special should be required. The goal is a sub-60-second mean time to recover.
#!/bin/bash
# Roll back to the previous stable agent release.
set -euo pipefail
CURRENT=$(kubectl get rollout hermes-agent -n agents -o jsonpath='{.spec.template.spec.containers[0].image}')
PREVIOUS=$(git log --format="%H" -- agent.manifest.yaml | sed -n 2p)
echo "Current: $CURRENT"
echo "Rolling back to manifest at commit: $PREVIOUS"
# Abort any in-progress rollout first
kubectl argo rollouts abort hermes-agent -n agents
# Restore the previous manifest and apply
git show "$PREVIOUS:agent.manifest.yaml" > /tmp/rollback-manifest.yaml
./scripts/apply-manifest.sh /tmp/rollback-manifest.yaml
# Wait for stable and verify
kubectl argo rollouts status hermes-agent -n agents --timeout=2m
./scripts/smoke-test.sh --version=previous
echo "Rollback complete. Verify metrics at https://grafana/d/agent-overview"

Rollback must not require a database migration
Because additive schema migrations leave the old columns intact, the previous version Just Works.
Drain in-flight work on the new version before rolling back
If you have a task queue, let the canary finish its current tasks. Stop enqueuing to it first.
Invalidate any caches that were written by the new version
Otherwise the old code will read values it does not understand.
Post-rollback, lock the release and write the postmortem
A rollback without a postmortem is a rollback you will repeat.
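The drain step can be sketched as a small helper. `pause_consumer` and `in_flight` are hypothetical queue-client primitives — map them onto whatever your queue actually exposes:

```python
import time

def drain(queue, version: str, timeout_s: float = 120.0) -> bool:
    """Stop feeding the canary, then wait for its in-flight tasks to finish.

    Returns True once the canary is quiescent, False on timeout (escalate:
    force-cancel the stragglers or extend the window).
    """
    queue.pause_consumer(version)          # step 1: stop enqueuing to the canary
    deadline = time.monotonic() + timeout_s
    while True:
        if queue.in_flight(version) == 0:  # step 2: wait for quiescence
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(1.0)
```

Only after `drain` returns True (or you've deliberately accepted cancelled tasks) should the rollback script shift traffic.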
Shadow Deployments for Risky Changes
For changes that are particularly risky — a new model, a major prompt rewrite, a tool protocol change — run a shadow deployment before the canary. Shadow means the new version receives a copy of real traffic but its responses never reach the user. You compare its behavior to the stable version offline.
This is the pattern covered in more depth in the testing-in-production guide. Combined with the debugging playbook, shadow deployments catch a surprising share of regressions before any user is exposed.
The managed alternative
Every Rapid Claw deploy is an immutable versioned artifact with canary rollout, metric-based guardrails, shadow traffic support, and one-click rollback in under 10 seconds. Memory migrations are backward-compatible by default. If you'd rather not build the release-engineering stack from scratch, see the complete hosting guide for how it fits with the rest of your infrastructure.
Release engineering, included
Rapid Claw deploys OpenClaw and Hermes Agent with canary rollouts, metric-based guardrails, and one-click rollback pre-configured. Ship agents like you ship software.
Deploy with rollback built in