Deploy CrewAI to Production: A Step-by-Step Tutorial

April 20, 2026 · 18 min read
CrewAI is the easiest way to write a multi-agent workflow. It is also one of the easiest frameworks to blow up in production. This tutorial walks through a complete deployment — containerization, secrets, task queues, horizontal scaling, monitoring, and the pitfalls that burn real budgets — then shows the shortcut if you would rather skip the plumbing.
TL;DR
To deploy CrewAI to production: containerize the crew, push state into Redis + Postgres, run crew.kickoff() in Celery workers behind a task queue, inject secrets at startup, add structured logging with per-crew correlation IDs, and cap tool calls + token spend per run. Production-ready skeleton below — or skip the plumbing with Rapid Claw, which ships queueing, shared memory, and budget kill-switches out of the box.
Prefer the managed path?
Deploy on Rapid Claw

1. CrewAI Production Requirements
Before you touch a Dockerfile, get clear on what CrewAI actually needs. The quickstart docs show crew.kickoff() running in a Python REPL. Production is a different animal. Here is the minimum stack:
Python 3.11+
CrewAI 0.80+ requires 3.11 or newer. Pin the minor version in your base image.
Redis
Short-term memory, tool result cache, and Celery broker. Managed (Elasticache, Upstash) saves ops time.
Postgres + pgvector
Long-term memory and semantic search over past crew runs. Needed once you enable CrewAI memory.
Task queue
Celery, RQ, or Arq. Crew runs are long — they do not belong in request handlers.
Secrets manager
AWS Secrets Manager, Vault, or Doppler. Never ship .env files to production.
LLM provider credits
Separate keys per environment. Budget alerts on daily spend are non-optional.
What most tutorials skip
The CrewAI docs show a working crew in 20 lines of Python. They do not show you how to run that crew when 50 users hit it simultaneously, what to do when Anthropic rate-limits you mid-task, or how to prevent a hallucinating agent from spending $800 in ten minutes. This tutorial fills those gaps.
2. Containerize Your Crew
A reproducible Docker image is the foundation. Pin dependencies, run as a non-root user, and keep the image small. Here is a production-ready Dockerfile:
```dockerfile
# syntax=docker/dockerfile:1.7
FROM python:3.11-slim-bookworm AS base

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1

# Non-root user
RUN groupadd --system crew && useradd --system --gid crew --create-home crew

WORKDIR /app

# Build layer — cached unless requirements change
COPY --chown=crew:crew requirements.txt .
RUN pip install --upgrade pip && \
    pip install -r requirements.txt

# App code
COPY --chown=crew:crew . .
USER crew

# Separate entrypoints for API vs worker — pick at runtime
#   docker run ... python -m app.api                            (FastAPI server)
#   docker run ... celery -A app.tasks worker --loglevel=info   (crew worker)
CMD ["python", "-m", "app.api"]
```

And the matching requirements.txt — every version pinned, no floating ranges:
```text
crewai==0.86.0
crewai-tools==0.17.0
anthropic==0.42.0
openai==1.59.0
celery[redis]==5.4.0
redis==5.2.1
fastapi==0.115.6
uvicorn[standard]==0.34.0
sqlalchemy==2.0.36
psycopg[binary]==3.2.3
pgvector==0.3.6
pydantic-settings==2.7.0
structlog==24.4.0
prometheus-client==0.21.1
tenacity==9.0.0
```

Why pin everything?
CrewAI moves fast. A minor bump in crewai-tools has broken agent signatures twice in the last six months. Pin the exact version that passed your integration tests. Upgrade on purpose, not by accident.
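Pinning only helps if the running container actually matches the pins. As a small boot-time guard (the `check_pins` helper below is my own sketch, not part of CrewAI or pip), compare installed versions against the pinned ones and refuse to start on drift:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, Dict, List


def check_pins(pins: Dict[str, str],
               get_version: Callable[[str], str] = version) -> List[str]:
    """Compare installed package versions to pins; return drift messages."""
    errors: List[str] = []
    for pkg, expected in pins.items():
        try:
            installed = get_version(pkg)
        except PackageNotFoundError:
            errors.append(f"{pkg}: not installed (expected {expected})")
            continue
        if installed != expected:
            errors.append(f"{pkg}: installed {installed}, pinned {expected}")
    return errors


# At boot: raise SystemExit("dependency drift: ...") if check_pins({...}) is non-empty
```

Call it from your entrypoint before the worker starts taking jobs; exiting non-zero keeps a drifted image from ever running a crew.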
3. Environment & Configuration Management
Split configuration into three layers. Mixing them is the single most common source of "works on my laptop" incidents:
Non-secret runtime config
Model names, temperature, crew process type, memory toggles. Lives in environment variables, checked into infrastructure code.
Secrets
LLM API keys, database passwords, tool credentials. Lives in a secrets manager; injected at container startup only.
Per-invocation inputs
The actual task payload — user question, document, context. Passed through the task queue, never baked into the container.
A typed settings module
Use pydantic-settings so missing config fails fast at boot, not on the first crew run:
```python
from pydantic import Field, SecretStr
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        case_sensitive=False,
    )

    # --- Non-secret runtime config ---
    env: str = Field(default="production")
    log_level: str = Field(default="INFO")
    default_llm_model: str = Field(default="claude-sonnet-4-6")
    default_temperature: float = Field(default=0.2)
    crew_memory_enabled: bool = Field(default=True)
    max_crew_runtime_sec: int = Field(default=600)     # hard cap
    max_tool_calls_per_crew: int = Field(default=50)   # budget guard

    # --- Secrets (from secrets manager, injected as env) ---
    anthropic_api_key: SecretStr
    openai_api_key: SecretStr
    database_url: SecretStr
    redis_url: SecretStr

    # --- Budget controls ---
    daily_token_budget_usd: int = Field(default=100)
    per_crew_token_budget_usd: int = Field(default=5)


settings = Settings()
```

Pulling secrets at startup
For AWS, a short entrypoint script fetches secrets and exports them before the app starts — no secrets on disk:
```bash
#!/usr/bin/env bash
set -euo pipefail

# Fetch secret JSON from AWS Secrets Manager
SECRET_JSON=$(aws secretsmanager get-secret-value \
  --secret-id "prod/crewai/$APP_NAME" \
  --query SecretString \
  --output text)

# Export each key as an env var
while IFS="=" read -r key value; do
  export "$key"="$value"
done < <(echo "$SECRET_JSON" | jq -r 'to_entries[] | "\(.key)=\(.value)"')

# Hand off to the real command (API or worker)
exec "$@"
```

4. Move Crew Execution to a Task Queue
The anti-pattern
Do not call crew.kickoff() inside a FastAPI or Flask route. A moderately complex crew takes 30–300 seconds. HTTP timeouts, proxy buffers, and load-balancer idle limits will bite you. Worse, a stuck crew blocks a worker thread for the full runtime.
API tier: enqueue and return
The API receives the request, validates it, enqueues a Celery job, and returns a task ID immediately. The client polls or subscribes for results:
```python
from celery.result import AsyncResult
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from app.tasks import run_crew_task

app = FastAPI(title="CrewAI API")


class CrewRequest(BaseModel):
    topic: str
    tenant_id: str


class CrewResponse(BaseModel):
    task_id: str
    status: str


@app.post("/v1/crew/run", response_model=CrewResponse, status_code=202)
def start_crew(req: CrewRequest) -> CrewResponse:
    # Enqueue — do NOT call crew.kickoff() here
    result = run_crew_task.delay(topic=req.topic, tenant_id=req.tenant_id)
    return CrewResponse(task_id=result.id, status="queued")


@app.get("/v1/crew/{task_id}")
def get_status(task_id: str):
    result = AsyncResult(task_id)
    if result.state == "PENDING":
        return {"task_id": task_id, "status": "queued"}
    if result.state == "STARTED":
        return {"task_id": task_id, "status": "running"}
    if result.state == "SUCCESS":
        return {"task_id": task_id, "status": "done", "result": result.result}
    if result.state == "FAILURE":
        raise HTTPException(500, detail=str(result.info))
    return {"task_id": task_id, "status": result.state.lower()}
```

Worker tier: run the crew
The worker builds the crew, runs it with a hard timeout, and returns structured output. The per-task budget guard caps cost even if the LLM goes off the rails:
```python
import uuid

from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded
from crewai import Agent, Crew, Process, Task

from app.config import settings
from app.observability import log, record_crew_metrics

celery_app = Celery(
    "crewai-workers",
    broker=settings.redis_url.get_secret_value(),
    backend=settings.redis_url.get_secret_value(),
)
celery_app.conf.update(
    task_acks_late=True,              # re-queue on worker crash
    task_reject_on_worker_lost=True,
    task_soft_time_limit=settings.max_crew_runtime_sec,
    task_time_limit=settings.max_crew_runtime_sec + 30,
    worker_prefetch_multiplier=1,     # one crew per worker slot
)


def build_research_crew(topic: str) -> Crew:
    researcher = Agent(
        role="Senior Researcher",
        goal=f"Gather facts about {topic}",
        backstory="You are meticulous and cite sources.",
        llm=settings.default_llm_model,
        max_iter=10,              # cap reasoning loops
        max_execution_time=180,   # per-agent timeout (seconds)
    )
    writer = Agent(
        role="Technical Writer",
        goal="Produce a concise briefing",
        backstory="You write clearly and never bury the lede.",
        llm=settings.default_llm_model,
        max_iter=8,
    )
    research = Task(
        description=f"Research {topic}. Return 5 key facts with sources.",
        agent=researcher,
        expected_output="Bulleted list with sources",
    )
    write = Task(
        description="Write a 300-word briefing from the research.",
        agent=writer,
        context=[research],
        expected_output="300-word briefing",
    )
    return Crew(
        agents=[researcher, writer],
        tasks=[research, write],
        process=Process.sequential,
        memory=settings.crew_memory_enabled,
        verbose=False,
    )


@celery_app.task(bind=True, name="run_crew")
def run_crew_task(self, topic: str, tenant_id: str) -> dict:
    correlation_id = str(uuid.uuid4())
    log.info("crew.start", task_id=self.request.id, correlation_id=correlation_id,
             tenant_id=tenant_id, topic=topic)
    try:
        crew = build_research_crew(topic)
        result = crew.kickoff(inputs={"topic": topic})
        record_crew_metrics(correlation_id, crew, status="success")
        return {
            "correlation_id": correlation_id,
            "output": str(result),
            "token_usage": result.token_usage.dict() if hasattr(result, "token_usage") else None,
        }
    except SoftTimeLimitExceeded:
        log.error("crew.timeout", correlation_id=correlation_id)
        record_crew_metrics(correlation_id, None, status="timeout")
        raise
    except Exception as e:
        log.exception("crew.failed", correlation_id=correlation_id, error=str(e))
        record_crew_metrics(correlation_id, None, status="error")
        raise
```

5. Scaling Crews Horizontally
Once crew.kickoff() is in a worker, scaling is about two things: stateless workers and the right concurrency model for your workload.
Keep workers stateless
Every piece of mutable state must live outside the worker process. CrewAI memory goes to Postgres + pgvector. Tool result caches go to Redis. Scratch files go to S3 with a task-id prefix. If a worker disappears, another one picks up without drift.
```yaml
version: "3.9"
services:
  api:
    build: .
    command: uvicorn app.api:app --host 0.0.0.0 --port 8000
    ports: ["8000:8000"]
    depends_on: [redis, postgres]
    environment:
      - REDIS_URL=redis://redis:6379/0
      - DATABASE_URL=postgresql+psycopg://crew:crew@postgres:5432/crew

  worker:
    build: .
    command: celery -A app.tasks.celery_app worker --loglevel=info --concurrency=4
    depends_on: [redis, postgres]
    environment:
      - REDIS_URL=redis://redis:6379/0
      - DATABASE_URL=postgresql+psycopg://crew:crew@postgres:5432/crew
    deploy:
      replicas: 3   # horizontal scaling — stateless workers

  redis:
    image: redis:7-alpine
    volumes: ["redis-data:/data"]

  postgres:
    image: pgvector/pgvector:pg16
    environment:
      - POSTGRES_USER=crew
      - POSTGRES_PASSWORD=crew
      - POSTGRES_DB=crew
    volumes: ["pg-data:/var/lib/postgresql/data"]

volumes:
  redis-data:
  pg-data:
```

Picking concurrency per worker
Celery’s --concurrency flag controls how many crews a single worker process runs in parallel. The right number depends on what the crew is doing:
| Crew profile | Concurrency | Why |
|---|---|---|
| LLM-only, mostly I/O wait | 20–50 | Workers are idle waiting on API responses. Pack them in. |
| Mixed LLM + light tools | 8–15 | Some CPU for parsing, file I/O, vector lookups. |
| Heavy tools (code exec, scraping) | 2–6 | Tools dominate. Too much parallelism = OOM kills. |
| CPU-bound (local inference) | 1 per core | Use --pool=prefork. GIL will block otherwise. |
Autoscaling on queue depth
Scale workers based on Celery queue depth, not CPU. A queue with 200 pending jobs and low CPU is the exact case where you need more workers. On Kubernetes, use KEDA with the Redis scaler:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: crewai-worker-scaler
  namespace: crewai
spec:
  scaleTargetRef:
    name: crewai-worker
  minReplicaCount: 2
  maxReplicaCount: 30
  pollingInterval: 15
  cooldownPeriod: 120
  triggers:
    - type: redis
      metadata:
        address: redis.crewai.svc.cluster.local:6379
        listName: celery
        listLength: "10"   # scale up when > 10 jobs queued per pod
```

6. Monitoring and Logging
A CrewAI deployment without observability is a liability. You need visibility into three layers: the crew itself, the infrastructure it runs on, and the cost it generates.
Structured logs with correlation IDs
Every crew run gets a correlation ID that threads through every log line, every tool call, every LLM invocation. When something breaks, you grep for one ID and see the entire story:
```python
import structlog
from prometheus_client import Counter, Gauge, Histogram

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
)
log = structlog.get_logger("crewai")

# Prometheus metrics
crew_runs_total = Counter(
    "crewai_runs_total", "Crew runs by status", ["status"]
)
crew_duration = Histogram(
    "crewai_run_duration_seconds", "Crew runtime",
    buckets=(5, 15, 30, 60, 120, 300, 600, 1200)
)
crew_tokens_used = Counter(
    "crewai_tokens_total", "Tokens consumed", ["kind"]
)
crew_cost_usd = Counter(
    "crewai_cost_usd_total", "Estimated spend in USD"
)
queue_depth = Gauge(
    "crewai_queue_depth", "Pending crew jobs in the Celery queue"
)


def record_crew_metrics(correlation_id: str, crew, status: str) -> None:
    crew_runs_total.labels(status=status).inc()
    if crew and hasattr(crew, "usage_metrics") and crew.usage_metrics:
        u = crew.usage_metrics
        crew_tokens_used.labels(kind="input").inc(u.get("prompt_tokens", 0))
        crew_tokens_used.labels(kind="output").inc(u.get("completion_tokens", 0))
        # Rough spend estimate — plug real pricing here
        cost = (u.get("prompt_tokens", 0) * 3 + u.get("completion_tokens", 0) * 15) / 1_000_000
        crew_cost_usd.inc(cost)
        log.info("crew.usage", correlation_id=correlation_id,
                 tokens=u, estimated_cost_usd=round(cost, 4))
```

Agent-level tracing with Langfuse
Prometheus gives you infrastructure metrics. For agent-level tracing — which task called which tool with which prompt, and what the LLM responded — wire in Langfuse or LangSmith. Both support CrewAI via OpenTelemetry:
```python
from langfuse.decorators import observe, langfuse_context

from app.config import settings

# Set LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST via settings


@observe(name="crew_run")
def traced_crew_run(topic: str, correlation_id: str):
    langfuse_context.update_current_trace(
        user_id=correlation_id,
        tags=["crewai", settings.env],
        metadata={"topic": topic},
    )
    crew = build_research_crew(topic)
    return crew.kickoff(inputs={"topic": topic})
```

The dashboard you actually need
Skip the vanity charts. The five panels that matter on day one map directly to the metrics exported above: crew runs by status, run duration distribution, queue depth, tokens consumed by kind, and estimated spend in USD.
For a deeper look at what to monitor and why, see the AI Agent Observability guide — it covers traces, metrics, and alerting patterns that generalize across CrewAI, OpenClaw, and Hermes Agent.
7. Common Deployment Pitfalls
After shipping and debugging more CrewAI deployments than I care to count, five failure modes come up again and again. Dodge these and you avoid most of the pain.
Pitfall 1: No tool-call rate limiting
A hallucinating agent in a retry loop can issue hundreds of tool calls per minute, each triggering LLM requests. Wrap every tool in a per-crew call counter and hard-stop at a threshold. See the tool rate-limit patterns in the AI Agent Firewall guide.
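How the cap gets wired in depends on how your tools are registered; as a framework-agnostic sketch (the class and exception names below are mine, not CrewAI APIs), a shared counter object wraps every tool callable and hard-stops the run at the threshold:

```python
import functools
from typing import Callable


class ToolCallBudgetExceeded(RuntimeError):
    """Raised when a crew run exceeds its tool-call cap."""


class ToolCallBudget:
    """One shared counter per crew run; every wrapped tool increments it."""

    def __init__(self, max_calls: int = 50):
        self.max_calls = max_calls
        self.calls = 0

    def guard(self, tool_fn: Callable) -> Callable:
        @functools.wraps(tool_fn)
        def wrapped(*args, **kwargs):
            self.calls += 1
            if self.calls > self.max_calls:
                raise ToolCallBudgetExceeded(
                    f"{self.calls} tool calls exceeds cap of {self.max_calls}"
                )
            return tool_fn(*args, **kwargs)
        return wrapped
```

Instantiate one ToolCallBudget per crew run (settings.max_tool_calls_per_crew is a natural source for the cap) and wrap each tool's function before handing it to the agents; a retry-looping agent then fails loudly instead of burning budget.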
Pitfall 2: Synchronous crew execution
Running crew.kickoff() in an HTTP handler is the #1 cause of production incidents. Use Celery (shown above). Clients poll or subscribe to task-id updates — they never wait on the wire.
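On the client side, polling is a short loop against the status endpoint. A minimal sketch (the helper name is mine; `fetch_status` stands in for whatever performs the GET to /v1/crew/{task_id}):

```python
import time
from typing import Callable


def poll_crew_result(fetch_status: Callable[[], dict],
                     interval_sec: float = 2.0,
                     timeout_sec: float = 600.0,
                     sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the crew task reaches a terminal state or the deadline passes."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        status = fetch_status()
        if status.get("status") in ("done", "failed"):
            return status
        sleep(interval_sec)
    raise TimeoutError("crew task did not finish before the client deadline")
```

Injecting `fetch_status` and `sleep` keeps the loop testable without a running server; in production, pass a function that calls the API with the task ID.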
Pitfall 3: Memory on local disk
CrewAI memory defaults to a local SQLite file. This breaks the moment you add a second worker. Configure the external memory backend (Postgres + pgvector) from day one.
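The storage interface CrewAI expects varies between versions, so treat the following as an illustrative sketch of the pgvector side only — the class name and method shapes are hypothetical, written against a psycopg3-style connection, not a drop-in CrewAI adapter:

```python
import json
from typing import Any, Sequence


class PgVectorMemoryStore:
    """Hypothetical pgvector-backed memory store; adapt to your
    CrewAI version's actual storage interface."""

    def __init__(self, conn, table: str = "crew_memory"):
        self.conn = conn      # psycopg3-style connection
        self.table = table

    def save(self, text: str, embedding: Sequence[float], metadata: dict[str, Any]) -> None:
        self.conn.execute(
            f"INSERT INTO {self.table} (content, embedding, metadata) "
            "VALUES (%s, %s, %s)",
            (text, list(embedding), json.dumps(metadata)),
        )

    def search(self, embedding: Sequence[float], k: int = 5):
        # <-> is pgvector's distance operator: nearest rows first
        return self.conn.execute(
            f"SELECT content, metadata FROM {self.table} "
            "ORDER BY embedding <-> %s::vector LIMIT %s",
            (str(list(embedding)), k),
        ).fetchall()
```

The point is the shape: reads and writes go to a table every worker shares, so adding a second replica changes nothing about memory behavior.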
Pitfall 4: No retry on transient LLM errors
Anthropic and OpenAI return 429s and 529s under load. Without a retry strategy, every transient error surfaces as a failed crew. Use tenacity with exponential backoff on the LLM call, NOT on the full crew (retrying a 5-minute crew for a 2-second blip wastes tokens).
Pitfall 5: No budget kill-switch
Track spend per crew and per tenant. Hard-stop when the daily budget is hit. One misbehaving prompt template has caused $4,000 overnight charges on production deployments. A kill-switch turns that into a $50 incident.
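A minimal shape for the kill-switch, sketched in-process for clarity (the names are mine; a real deployment would keep the counter in Redis, e.g. via INCRBYFLOAT, so every worker sees the same total):

```python
import datetime as dt
from collections import defaultdict


class BudgetExceeded(RuntimeError):
    """Raised when a run or tenant hits a spend cap."""


class DailyBudget:
    """Per-tenant daily spend cap plus a per-run cap. In-process sketch;
    production would back the counter with shared storage."""

    def __init__(self, daily_cap_usd: float, per_run_cap_usd: float):
        self.daily_cap = daily_cap_usd
        self.per_run_cap = per_run_cap_usd
        self._spend = defaultdict(float)   # (tenant_id, date) -> USD spent

    def charge(self, tenant_id: str, run_cost_usd: float) -> None:
        if run_cost_usd > self.per_run_cap:
            raise BudgetExceeded(f"run cost ${run_cost_usd:.2f} over per-run cap")
        key = (tenant_id, dt.date.today())
        if self._spend[key] + run_cost_usd > self.daily_cap:
            raise BudgetExceeded(f"tenant {tenant_id} hit daily cap ${self.daily_cap}")
        self._spend[key] += run_cost_usd
```

Charge it from the worker with the per-run cost estimate; once a tenant trips the cap, every subsequent run fails fast instead of spending.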
Retry helper for transient LLM errors
Wrap your LLM calls — not your crews — in a retry decorator scoped to rate-limit and transient errors only:
```python
from anthropic import APIStatusError, RateLimitError
from tenacity import (
    retry, retry_if_exception_type, stop_after_attempt,
    wait_exponential_jitter,
)

TRANSIENT = (RateLimitError, APIStatusError)


@retry(
    retry=retry_if_exception_type(TRANSIENT),
    wait=wait_exponential_jitter(initial=1, max=30),
    stop=stop_after_attempt(4),
    reraise=True,
)
def call_llm(client, **kwargs):
    return client.messages.create(**kwargs)
```

8. Deploying CrewAI on Rapid Claw (The Easy Path)
Everything above is doable. It is also maybe two weeks of infrastructure work before your first production crew ships — and an ongoing maintenance tax. If your goal is shipping a product, not running Celery clusters, the managed path is faster.
Rapid Claw runs CrewAI-style multi-agent workloads with queueing, shared memory, budget controls, and monitoring built in. You upload a crew definition, wire up your tools, and ship. No Dockerfile to tune, no Celery concurrency to pick, no KEDA scalers to maintain.
Managed queue + workers
Autoscaling crew runners with per-tenant isolation. No Celery config.
Shared memory + pgvector
Long-term memory across runs. No Postgres to provision.
Budget kill-switches
Per-crew, per-tenant, and daily caps baked in. Overspend is impossible.
Observability built in
Per-crew traces, token spend, tool-call latency — no Grafana to wire.
Secrets management
API keys and credentials stored encrypted, injected at runtime.
Zero-downtime deploys
Rolling updates for crew definitions without draining workers.
Self-host or managed — how to decide
The self-hosted path is the right call if you have a platform team, strict data-residency requirements, or a workload that genuinely benefits from custom tuning. For most teams, managed wins on time-to-first-production-crew — often weeks faster.
For a full head-to-head on CrewAI alternatives that handle deployment differently, see 5 CrewAI Alternatives That Actually Handle Deployment.
Production Readiness Checklist
Before you flip the switch, walk this list. If anything is missing, you have homework:

- Docker image pinned, non-root, and passing integration tests
- Secrets injected at startup from a secrets manager, never baked into the image
- crew.kickoff() running in Celery workers, never in HTTP handlers
- CrewAI memory on Postgres + pgvector, not local SQLite
- Workers stateless; autoscaling driven by queue depth, not CPU
- Retries with exponential backoff on LLM calls, not whole crews
- Per-crew tool-call caps and token budgets, plus a daily kill-switch
- Structured logs with correlation IDs, and agent-level traces wired up
Ship your first crew in minutes, not weeks
Rapid Claw runs CrewAI-style multi-agent workloads with queueing, memory, budgets, and monitoring built in. Skip the Celery config and the KEDA scalers — bring a crew definition and go.
Deploy on Rapid Claw

Related reading
- Comparison of managed + self-hosted alternatives
- AI Agent Framework Comparison (2026): CrewAI vs LangGraph vs Google ADK vs Autogen
- AI Agent Hosting Complete Guide: self-host vs managed, cost breakdowns, architectures
- AI Agent Observability Guide: logs, metrics, and traces that actually help
- AI Agent Firewall Setup: rate limits, scoped keys, network isolation
- Google ADK vs LangChain vs CrewAI: framework tradeoffs for multi-agent workloads