Deploy CrewAI to Production: A Step-by-Step Tutorial

April 20, 2026 · 18 min read
CrewAI is the easiest way to write a multi-agent workflow. It is also one of the easiest frameworks to blow up in production. This tutorial walks through a complete deployment — containerization, secrets, task queues, horizontal scaling, monitoring, and the pitfalls that burn real budgets — then shows the shortcut if you would rather skip the plumbing.
TL;DR
To deploy CrewAI to production: containerize the crew, push state into Redis + Postgres, run crew.kickoff() in Celery workers behind a task queue, inject secrets at startup, add structured logging with per-crew correlation IDs, and cap tool calls + token spend per run. Production-ready skeleton below — or skip the plumbing with Rapid Claw, which ships queueing, shared memory, and budget kill-switches out of the box.
Prefer the managed path?
Deploy on Rapid Claw

1. CrewAI Production Requirements
Before you touch a Dockerfile, get clear on what CrewAI actually needs. The quickstart docs show crew.kickoff() running in a Python REPL. Production is a different animal. Here is the minimum stack:
Python 3.11+
CrewAI 0.80+ requires 3.11 or newer. Pin the minor version in your base image.
Redis
Short-term memory, tool result cache, and Celery broker. Managed (Elasticache, Upstash) saves ops time.
Postgres + pgvector
Long-term memory and semantic search over past crew runs. Needed once you enable CrewAI memory.
Task queue
Celery, RQ, or Arq. Crew runs are long — they do not belong in request handlers.
Secrets manager
AWS Secrets Manager, Vault, or Doppler. Never ship .env files to production.
LLM provider credits
Separate keys per environment. Budget alerts on daily spend are non-optional.
What most tutorials skip
The CrewAI docs show a working crew in 20 lines of Python. They do not show you how to run that crew when 50 users hit it simultaneously, what to do when Anthropic rate-limits you mid-task, or how to prevent a hallucinating agent from spending $800 in ten minutes. This tutorial fills those gaps.
2. Containerize Your Crew
A reproducible Docker image is the foundation. Pin dependencies, run as a non-root user, and keep the image small. Here is a production-ready Dockerfile:
```dockerfile
# syntax=docker/dockerfile:1.7
FROM python:3.11-slim-bookworm AS base

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1

# Non-root user
RUN groupadd --system crew && useradd --system --gid crew --create-home crew

WORKDIR /app

# Build layer — cached unless requirements change
COPY --chown=crew:crew requirements.txt .
RUN pip install --upgrade pip && \
    pip install -r requirements.txt

# App code
COPY --chown=crew:crew . .
USER crew

# Separate entrypoints for API vs worker — pick at runtime
#   docker run ... python -m app.api                            (FastAPI server)
#   docker run ... celery -A app.tasks worker --loglevel=info   (crew worker)
CMD ["python", "-m", "app.api"]
```

And the matching requirements.txt — every version pinned, no floating ranges:
```text
crewai==0.86.0
crewai-tools==0.17.0
anthropic==0.42.0
openai==1.59.0
celery[redis]==5.4.0
redis==5.2.1
fastapi==0.115.6
uvicorn[standard]==0.34.0
sqlalchemy==2.0.36
psycopg[binary]==3.2.3
pgvector==0.3.6
pydantic-settings==2.7.0
structlog==24.4.0
prometheus-client==0.21.1
tenacity==9.0.0
```

Why pin everything?
CrewAI moves fast. A minor bump in crewai-tools has broken agent signatures twice in the last six months. Pin the exact version that passed your integration tests. Upgrade on purpose, not by accident.
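Pinning only helps if the running container actually matches the pins. As a small boot-time guard (the `check_pins` helper below is my own sketch, not part of CrewAI or pip), compare installed versions against the pinned ones and refuse to start on drift:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, Dict, List


def check_pins(pins: Dict[str, str],
               get_version: Callable[[str], str] = version) -> List[str]:
    """Compare installed package versions to pins; return drift messages."""
    errors: List[str] = []
    for pkg, expected in pins.items():
        try:
            installed = get_version(pkg)
        except PackageNotFoundError:
            errors.append(f"{pkg}: not installed (expected {expected})")
            continue
        if installed != expected:
            errors.append(f"{pkg}: installed {installed}, pinned {expected}")
    return errors


# At boot: raise SystemExit("dependency drift: ...") if check_pins({...}) is non-empty
```

Call it from your entrypoint before the worker starts taking jobs; exiting non-zero keeps a drifted image from ever running a crew.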
3. Environment & Configuration Management
Split configuration into three layers. Mixing them is the single most common source of "works on my laptop" incidents:
Non-secret runtime config
Model names, temperature, crew process type, memory toggles. Lives in environment variables, checked into infrastructure code.
Secrets
LLM API keys, database passwords, tool credentials. Lives in a secrets manager; injected at container startup only.
Per-invocation inputs
The actual task payload — user question, document, context. Passed through the task queue, never baked into the container.
A typed settings module
Use pydantic-settings so missing config fails fast at boot, not on the first crew run:
```python
from pydantic import Field, SecretStr
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        case_sensitive=False,
    )

    # --- Non-secret runtime config ---
    env: str = Field(default="production")
    log_level: str = Field(default="INFO")
    default_llm_model: str = Field(default="claude-sonnet-4-6")
    default_temperature: float = Field(default=0.2)
    crew_memory_enabled: bool = Field(default=True)
    max_crew_runtime_sec: int = Field(default=600)     # hard cap
    max_tool_calls_per_crew: int = Field(default=50)   # budget guard

    # --- Secrets (from secrets manager, injected as env) ---
    anthropic_api_key: SecretStr
    openai_api_key: SecretStr
    database_url: SecretStr
    redis_url: SecretStr

    # --- Budget controls ---
    daily_token_budget_usd: int = Field(default=100)
    per_crew_token_budget_usd: int = Field(default=5)


settings = Settings()
```

Pulling secrets at startup
For AWS, a short entrypoint script fetches secrets and exports them before the app starts — no secrets on disk:
```bash
#!/usr/bin/env bash
set -euo pipefail

# Fetch secret JSON from AWS Secrets Manager
SECRET_JSON=$(aws secretsmanager get-secret-value \
  --secret-id "prod/crewai/$APP_NAME" \
  --query SecretString \
  --output text)

# Export each key as an env var
while IFS="=" read -r key value; do
  export "$key"="$value"
done < <(echo "$SECRET_JSON" | jq -r 'to_entries[] | "\(.key)=\(.value)"')

# Hand off to the real command (API or worker)
exec "$@"
```

4. Move Crew Execution to a Task Queue
The anti-pattern
Do not call crew.kickoff() inside a FastAPI or Flask route. A moderately complex crew takes 30–300 seconds. HTTP timeouts, proxy buffers, and load-balancer idle limits will bite you. Worse, a stuck crew blocks a worker thread for the full runtime.
API tier: enqueue and return
The API receives the request, validates it, enqueues a Celery job, and returns a task ID immediately. The client polls or subscribes for results:
```python
from celery.result import AsyncResult
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from app.tasks import run_crew_task

app = FastAPI(title="CrewAI API")


class CrewRequest(BaseModel):
    topic: str
    tenant_id: str


class CrewResponse(BaseModel):
    task_id: str
    status: str


@app.post("/v1/crew/run", response_model=CrewResponse, status_code=202)
def start_crew(req: CrewRequest) -> CrewResponse:
    # Enqueue — do NOT call crew.kickoff() here
    result = run_crew_task.delay(topic=req.topic, tenant_id=req.tenant_id)
    return CrewResponse(task_id=result.id, status="queued")


@app.get("/v1/crew/{task_id}")
def get_status(task_id: str):
    result = AsyncResult(task_id)
    if result.state == "PENDING":
        return {"task_id": task_id, "status": "queued"}
    if result.state == "STARTED":
        return {"task_id": task_id, "status": "running"}
    if result.state == "SUCCESS":
        return {"task_id": task_id, "status": "done", "result": result.result}
    if result.state == "FAILURE":
        raise HTTPException(500, detail=str(result.info))
    return {"task_id": task_id, "status": result.state.lower()}
```

Worker tier: run the crew
The worker builds the crew, runs it with a hard timeout, and returns structured output. The per-task budget guard caps cost even if the LLM goes off the rails:
```python
import uuid

from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded
from crewai import Agent, Crew, Process, Task

from app.config import settings
from app.observability import log, record_crew_metrics

celery_app = Celery(
    "crewai-workers",
    broker=settings.redis_url.get_secret_value(),
    backend=settings.redis_url.get_secret_value(),
)
celery_app.conf.update(
    task_acks_late=True,              # re-queue on worker crash
    task_reject_on_worker_lost=True,
    task_soft_time_limit=settings.max_crew_runtime_sec,
    task_time_limit=settings.max_crew_runtime_sec + 30,
    worker_prefetch_multiplier=1,     # one crew per worker slot
)


def build_research_crew(topic: str) -> Crew:
    researcher = Agent(
        role="Senior Researcher",
        goal=f"Gather facts about {topic}",
        backstory="You are meticulous and cite sources.",
        llm=settings.default_llm_model,
        max_iter=10,              # cap reasoning loops
        max_execution_time=180,   # per-agent timeout (seconds)
    )
    writer = Agent(
        role="Technical Writer",
        goal="Produce a concise briefing",
        backstory="You write clearly and never bury the lede.",
        llm=settings.default_llm_model,
        max_iter=8,
    )
    research = Task(
        description=f"Research {topic}. Return 5 key facts with sources.",
        agent=researcher,
        expected_output="Bulleted list with sources",
    )
    write = Task(
        description="Write a 300-word briefing from the research.",
        agent=writer,
        context=[research],
        expected_output="300-word briefing",
    )
    return Crew(
        agents=[researcher, writer],
        tasks=[research, write],
        process=Process.sequential,
        memory=settings.crew_memory_enabled,
        verbose=False,
    )


@celery_app.task(bind=True, name="run_crew")
def run_crew_task(self, topic: str, tenant_id: str) -> dict:
    correlation_id = str(uuid.uuid4())
    log.info("crew.start", task_id=self.request.id, correlation_id=correlation_id,
             tenant_id=tenant_id, topic=topic)
    try:
        crew = build_research_crew(topic)
        result = crew.kickoff(inputs={"topic": topic})
        record_crew_metrics(correlation_id, crew, status="success")
        return {
            "correlation_id": correlation_id,
            "output": str(result),
            "token_usage": result.token_usage.dict() if hasattr(result, "token_usage") else None,
        }
    except SoftTimeLimitExceeded:
        log.error("crew.timeout", correlation_id=correlation_id)
        record_crew_metrics(correlation_id, None, status="timeout")
        raise
    except Exception as e:
        log.exception("crew.failed", correlation_id=correlation_id, error=str(e))
        record_crew_metrics(correlation_id, None, status="error")
        raise
```

5. Scaling Crews Horizontally
Once crew.kickoff() is in a worker, scaling is about two things: stateless workers and the right concurrency model for your workload.
Keep workers stateless
Every piece of mutable state must live outside the worker process. CrewAI memory goes to Postgres + pgvector. Tool result caches go to Redis. Scratch files go to S3 with a task-id prefix. If a worker disappears, another one picks up without drift.
```yaml
version: "3.9"
services:
  api:
    build: .
    command: uvicorn app.api:app --host 0.0.0.0 --port 8000
    ports: ["8000:8000"]
    depends_on: [redis, postgres]
    environment:
      - REDIS_URL=redis://redis:6379/0
      - DATABASE_URL=postgresql+psycopg://crew:crew@postgres:5432/crew

  worker:
    build: .
    command: celery -A app.tasks.celery_app worker --loglevel=info --concurrency=4
    depends_on: [redis, postgres]
    environment:
      - REDIS_URL=redis://redis:6379/0
      - DATABASE_URL=postgresql+psycopg://crew:crew@postgres:5432/crew
    deploy:
      replicas: 3   # horizontal scaling — stateless workers

  redis:
    image: redis:7-alpine
    volumes: ["redis-data:/data"]

  postgres:
    image: pgvector/pgvector:pg16
    environment:
      - POSTGRES_USER=crew
      - POSTGRES_PASSWORD=crew
      - POSTGRES_DB=crew
    volumes: ["pg-data:/var/lib/postgresql/data"]

volumes:
  redis-data:
  pg-data:
```

Picking concurrency per worker
Celery’s --concurrency flag controls how many crews a single worker process runs in parallel. The right number depends on what the crew is doing:
| Crew profile | Concurrency | Why |
|---|---|---|
| LLM-only, mostly I/O wait | 20–50 | Workers are idle waiting on API responses. Pack them in. |
| Mixed LLM + light tools | 8–15 | Some CPU for parsing, file I/O, vector lookups. |
| Heavy tools (code exec, scraping) | 2–6 | Tools dominate. Too much parallelism = OOM kills. |
| CPU-bound (local inference) | 1 per core | Use --pool=prefork. GIL will block otherwise. |
Autoscaling on queue depth
Scale workers based on Celery queue depth, not CPU. A queue with 200 pending jobs and low CPU is the exact case where you need more workers. On Kubernetes, use KEDA with the Redis scaler:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: crewai-worker-scaler
  namespace: crewai
spec:
  scaleTargetRef:
    name: crewai-worker
  minReplicaCount: 2
  maxReplicaCount: 30
  pollingInterval: 15
  cooldownPeriod: 120
  triggers:
    - type: redis
      metadata:
        address: redis.crewai.svc.cluster.local:6379
        listName: celery
        listLength: "10"   # scale up when > 10 jobs queued per pod
```

6. Monitoring and Logging
A CrewAI deployment without observability is a liability. You need visibility into three layers: the crew itself, the infrastructure it runs on, and the cost it generates.
Structured logs with correlation IDs
Every crew run gets a correlation ID that threads through every log line, every tool call, every LLM invocation. When something breaks, you grep for one ID and see the entire story:
```python
import structlog
from prometheus_client import Counter, Gauge, Histogram

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
)
log = structlog.get_logger("crewai")

# Prometheus metrics
crew_runs_total = Counter(
    "crewai_runs_total", "Crew runs by status", ["status"]
)
crew_duration = Histogram(
    "crewai_run_duration_seconds", "Crew runtime",
    buckets=(5, 15, 30, 60, 120, 300, 600, 1200)
)
crew_tokens_used = Counter(
    "crewai_tokens_total", "Tokens consumed", ["kind"]
)
crew_cost_usd = Counter(
    "crewai_cost_usd_total", "Estimated spend in USD"
)
queue_depth = Gauge(
    "crewai_queue_depth", "Pending crew jobs in the Celery queue"
)


def record_crew_metrics(correlation_id: str, crew, status: str) -> None:
    crew_runs_total.labels(status=status).inc()
    if crew and hasattr(crew, "usage_metrics") and crew.usage_metrics:
        u = crew.usage_metrics
        crew_tokens_used.labels(kind="input").inc(u.get("prompt_tokens", 0))
        crew_tokens_used.labels(kind="output").inc(u.get("completion_tokens", 0))
        # Rough spend estimate — plug real pricing here
        cost = (u.get("prompt_tokens", 0) * 3 + u.get("completion_tokens", 0) * 15) / 1_000_000
        crew_cost_usd.inc(cost)
        log.info("crew.usage", correlation_id=correlation_id,
                 tokens=u, estimated_cost_usd=round(cost, 4))
```

Agent-level tracing with Langfuse
Prometheus gives you infrastructure metrics. For agent-level tracing — which task called which tool with which prompt, and what the LLM responded — wire in Langfuse or LangSmith. Both support CrewAI via OpenTelemetry:
```python
from langfuse.decorators import observe, langfuse_context

from app.config import settings

# Set LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST via settings


@observe(name="crew_run")
def traced_crew_run(topic: str, correlation_id: str):
    langfuse_context.update_current_trace(
        user_id=correlation_id,
        tags=["crewai", settings.env],
        metadata={"topic": topic},
    )
    crew = build_research_crew(topic)
    return crew.kickoff(inputs={"topic": topic})
```

The dashboard you actually need
Skip the vanity charts. The five panels that matter on day one map directly to the metrics exported above: crew runs by status, run duration distribution, queue depth, tokens consumed by kind, and estimated spend in USD.
For a deeper look at what to monitor and why, see the AI Agent Observability guide — it covers traces, metrics, and alerting patterns that generalize across CrewAI, OpenClaw, and Hermes Agent.
7. Common Deployment Pitfalls
After shipping and debugging more CrewAI deployments than I care to count, five failure modes come up again and again. Dodge these and you avoid most of the pain.
Pitfall 1: No tool-call rate limiting
A hallucinating agent in a retry loop can issue hundreds of tool calls per minute, each triggering LLM requests. Wrap every tool in a per-crew call counter and hard-stop at a threshold. See the tool rate-limit patterns in the AI Agent Firewall guide.
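How the cap gets wired in depends on how your tools are registered; as a framework-agnostic sketch (the class and exception names below are mine, not CrewAI APIs), a shared counter object wraps every tool callable and hard-stops the run at the threshold:

```python
import functools
from typing import Callable


class ToolCallBudgetExceeded(RuntimeError):
    """Raised when a crew run exceeds its tool-call cap."""


class ToolCallBudget:
    """One shared counter per crew run; every wrapped tool increments it."""

    def __init__(self, max_calls: int = 50):
        self.max_calls = max_calls
        self.calls = 0

    def guard(self, tool_fn: Callable) -> Callable:
        @functools.wraps(tool_fn)
        def wrapped(*args, **kwargs):
            self.calls += 1
            if self.calls > self.max_calls:
                raise ToolCallBudgetExceeded(
                    f"{self.calls} tool calls exceeds cap of {self.max_calls}"
                )
            return tool_fn(*args, **kwargs)
        return wrapped
```

Instantiate one ToolCallBudget per crew run (settings.max_tool_calls_per_crew is a natural source for the cap) and wrap each tool's function before handing it to the agents; a retry-looping agent then fails loudly instead of burning budget.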
Pitfall 2: Synchronous crew execution
Running crew.kickoff() in an HTTP handler is the #1 cause of production incidents. Use Celery (shown above). Clients poll or subscribe to task-id updates — they never wait on the wire.
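On the client side, polling is a short loop against the status endpoint. A minimal sketch (the helper name is mine; `fetch_status` stands in for whatever performs the GET to /v1/crew/{task_id}):

```python
import time
from typing import Callable


def poll_crew_result(fetch_status: Callable[[], dict],
                     interval_sec: float = 2.0,
                     timeout_sec: float = 600.0,
                     sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the crew task reaches a terminal state or the deadline passes."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        status = fetch_status()
        if status.get("status") in ("done", "failed"):
            return status
        sleep(interval_sec)
    raise TimeoutError("crew task did not finish before the client deadline")
```

Injecting `fetch_status` and `sleep` keeps the loop testable without a running server; in production, pass a function that calls the API with the task ID.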
Pitfall 3: Memory on local disk
CrewAI memory defaults to a local SQLite file. This breaks the moment you add a second worker. Configure the external memory backend (Postgres + pgvector) from day one.
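The storage interface CrewAI expects varies between versions, so treat the following as an illustrative sketch of the pgvector side only — the class name and method shapes are hypothetical, written against a psycopg3-style connection, not a drop-in CrewAI adapter:

```python
import json
from typing import Any, Sequence


class PgVectorMemoryStore:
    """Hypothetical pgvector-backed memory store; adapt to your
    CrewAI version's actual storage interface."""

    def __init__(self, conn, table: str = "crew_memory"):
        self.conn = conn      # psycopg3-style connection
        self.table = table

    def save(self, text: str, embedding: Sequence[float], metadata: dict[str, Any]) -> None:
        self.conn.execute(
            f"INSERT INTO {self.table} (content, embedding, metadata) "
            "VALUES (%s, %s, %s)",
            (text, list(embedding), json.dumps(metadata)),
        )

    def search(self, embedding: Sequence[float], k: int = 5):
        # <-> is pgvector's distance operator: nearest rows first
        return self.conn.execute(
            f"SELECT content, metadata FROM {self.table} "
            "ORDER BY embedding <-> %s::vector LIMIT %s",
            (str(list(embedding)), k),
        ).fetchall()
```

The point is the shape: reads and writes go to a table every worker shares, so adding a second replica changes nothing about memory behavior.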
Pitfall 4: No retry on transient LLM errors
Anthropic and OpenAI return 429s and 529s under load. Without a retry strategy, every transient error surfaces as a failed crew. Use tenacity with exponential backoff on the LLM call, NOT on the full crew (retrying a 5-minute crew for a 2-second blip wastes tokens).
Pitfall 5: No budget kill-switch
Track spend per crew and per tenant. Hard-stop when the daily budget is hit. One misbehaving prompt template has caused $4,000 overnight charges on production deployments. A kill-switch turns that into a $50 incident.
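A minimal shape for the kill-switch, sketched in-process for clarity (the names are mine; a real deployment would keep the counter in Redis, e.g. via INCRBYFLOAT, so every worker sees the same total):

```python
import datetime as dt
from collections import defaultdict


class BudgetExceeded(RuntimeError):
    """Raised when a run or tenant hits a spend cap."""


class DailyBudget:
    """Per-tenant daily spend cap plus a per-run cap. In-process sketch;
    production would back the counter with shared storage."""

    def __init__(self, daily_cap_usd: float, per_run_cap_usd: float):
        self.daily_cap = daily_cap_usd
        self.per_run_cap = per_run_cap_usd
        self._spend = defaultdict(float)   # (tenant_id, date) -> USD spent

    def charge(self, tenant_id: str, run_cost_usd: float) -> None:
        if run_cost_usd > self.per_run_cap:
            raise BudgetExceeded(f"run cost ${run_cost_usd:.2f} over per-run cap")
        key = (tenant_id, dt.date.today())
        if self._spend[key] + run_cost_usd > self.daily_cap:
            raise BudgetExceeded(f"tenant {tenant_id} hit daily cap ${self.daily_cap}")
        self._spend[key] += run_cost_usd
```

Charge it from the worker with the per-run cost estimate; once a tenant trips the cap, every subsequent run fails fast instead of spending.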
Retry helper for transient LLM errors
Wrap your LLM calls — not your crews — in a retry decorator scoped to rate-limit and transient errors only:
```python
from anthropic import APIStatusError, RateLimitError
from tenacity import (
    retry, retry_if_exception_type, stop_after_attempt,
    wait_exponential_jitter,
)

TRANSIENT = (RateLimitError, APIStatusError)


@retry(
    retry=retry_if_exception_type(TRANSIENT),
    wait=wait_exponential_jitter(initial=1, max=30),
    stop=stop_after_attempt(4),
    reraise=True,
)
def call_llm(client, **kwargs):
    return client.messages.create(**kwargs)
```

8. Deploying CrewAI on Rapid Claw (The Easy Path)
Everything above is doable. It is also maybe two weeks of infrastructure work before your first production crew ships — and an ongoing maintenance tax. If your goal is shipping a product, not running Celery clusters, the managed path is faster.
Rapid Claw runs CrewAI-style multi-agent workloads with queueing, shared memory, budget controls, and monitoring built in. You upload a crew definition, wire up your tools, and ship. No Dockerfile to tune, no Celery concurrency to pick, no KEDA scalers to maintain.
Managed queue + workers
Autoscaling crew runners with per-tenant isolation. No Celery config.
Shared memory + pgvector
Long-term memory across runs. No Postgres to provision.
Budget kill-switches
Per-crew, per-tenant, and daily caps baked in. Overspend is impossible.
Observability built in
Per-crew traces, token spend, tool-call latency — no Grafana to wire.
Secrets management
API keys and credentials stored encrypted, injected at runtime.
Zero-downtime deploys
Rolling updates for crew definitions without draining workers.
Self-host or managed — how to decide
The self-hosted path is the right call if you have a platform team, strict data-residency requirements, or a workload that genuinely benefits from custom tuning. For most teams, managed wins on time-to-first-production-crew — often weeks faster.
For a full head-to-head on CrewAI alternatives that handle deployment differently, see 5 CrewAI Alternatives That Actually Handle Deployment.
Production Readiness Checklist
Before you flip the switch, walk this list. If anything is missing, you have homework:

- Docker image pinned, non-root, and passing integration tests
- Secrets injected at startup from a secrets manager, never baked into the image
- crew.kickoff() running in Celery workers, never in HTTP handlers
- CrewAI memory on Postgres + pgvector, not local SQLite
- Workers stateless; autoscaling driven by queue depth, not CPU
- Retries with exponential backoff on LLM calls, not whole crews
- Per-crew tool-call caps and token budgets, plus a daily kill-switch
- Structured logs with correlation IDs, and agent-level traces wired up
Ship your first crew in minutes, not weeks
Rapid Claw runs CrewAI-style multi-agent workloads with queueing, memory, budgets, and monitoring built in. Skip the Celery config and the KEDA scalers — bring a crew definition and go.
Deploy on Rapid Claw

Related reading
- Comparison of managed + self-hosted alternatives
- AI Agent Framework Comparison (2026): CrewAI vs LangGraph vs Google ADK vs Autogen
- AI Agent Hosting Complete Guide: self-host vs managed, cost breakdowns, architectures
- AI Agent Observability Guide: logs, metrics, and traces that actually help
- AI Agent Firewall Setup: rate limits, scoped keys, network isolation
- Google ADK vs LangChain vs CrewAI: framework tradeoffs for multi-agent workloads