# Worker Setup and Scaling Guide
This guide covers configuring, deploying, and scaling distributed workers for orbiter-distributed.
## Prerequisites
- Redis 7+ — Required for Redis Streams with consumer groups
- Python 3.11+ — Required by orbiter packages
- orbiter-distributed — `pip install orbiter-distributed`
- orbiter-cli (optional) — `pip install orbiter-cli` for the `orbiter` command
## Starting a Worker

### Via CLI

```bash
orbiter start worker --redis-url redis://localhost:6379
```

Options:
| Flag | Default | Description |
|---|---|---|
| `--redis-url` | `ORBITER_REDIS_URL` env var | Redis connection URL |
| `--concurrency` | 1 | Number of concurrent task executions per worker process |
| `--queue` | `orbiter:tasks` | Redis Streams queue name |
| `--worker-id` | Auto-generated | Unique worker identifier (`{hostname}-{pid}-{random}`) |
The startup banner displays the worker ID, masked Redis URL, queue name, and concurrency level.
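Auto-generated IDs follow the `{hostname}-{pid}-{random}` pattern. A minimal sketch of deriving one (illustrative, not the library's exact code):

```python
import os
import secrets
import socket

def make_worker_id() -> str:
    """Build a '{hostname}-{pid}-{random}' identifier, unique per process."""
    return f"{socket.gethostname()}-{os.getpid()}-{secrets.token_hex(4)}"

print(make_worker_id())  # e.g. web01-1234-a1b2c3d4
```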
### Via Python

```python
import asyncio

from orbiter.distributed.worker import Worker

worker = Worker(
    "redis://localhost:6379",
    concurrency=4,
    queue_name="orbiter:tasks",
    heartbeat_ttl=30,
)

asyncio.run(worker.start())
```

### Worker Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `redis_url` | `str` | (required) | Redis connection URL |
| `worker_id` | `str \| None` | Auto-generated | Unique worker identifier |
| `concurrency` | `int` | 1 | Concurrent task execution slots |
| `queue_name` | `str` | `"orbiter:tasks"` | Redis Streams queue name |
| `heartbeat_ttl` | `int` | 30 | Heartbeat key TTL in seconds |
| `executor` | `Literal["local", "temporal"]` | `"local"` | Execution backend |
## Environment Variables
| Variable | Default | Description |
|---|---|---|
| `ORBITER_REDIS_URL` | None | Default Redis URL when `--redis-url` is not provided |
| `TEMPORAL_HOST` | `localhost:7233` | Temporal server address (only for `executor="temporal"`) |
| `TEMPORAL_NAMESPACE` | `default` | Temporal namespace (only for `executor="temporal"`) |
## Concurrency Tuning

The `--concurrency` flag controls how many tasks a single worker process executes in parallel. Each concurrent slot runs an independent `_claim_loop` coroutine using `asyncio.create_task()`.
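A minimal sketch of that pattern (names here are illustrative, not the library's internals):

```python
import asyncio

# Each concurrency slot is an independent claim-loop coroutine. Slots
# overlap on I/O waits but share one Python process (and one GIL).

async def claim_loop(slot: int, shutdown: asyncio.Event) -> None:
    while not shutdown.is_set():
        # claim a task from the queue, execute it, ack it ...
        await asyncio.sleep(0.1)  # placeholder for real work

async def run_worker(concurrency: int) -> None:
    shutdown = asyncio.Event()  # set by a signal handler (see Graceful Shutdown)
    slots = [asyncio.create_task(claim_loop(i, shutdown)) for i in range(concurrency)]
    await asyncio.gather(*slots)

# asyncio.run(run_worker(concurrency=4))
```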
Guidelines:
- CPU-bound agents (heavy tool execution): Keep concurrency low (1-2). Python’s GIL limits true parallelism.
- I/O-bound agents (LLM API calls, network tools): Increase concurrency (4-16). Async I/O benefits from overlapping waits.
- Memory considerations: Each concurrent task reconstructs an agent from its config and streams events. Monitor memory usage per process.
```bash
# Low concurrency for CPU-heavy agents
orbiter start worker --concurrency 1

# Higher concurrency for I/O-bound LLM agents
orbiter start worker --concurrency 8
```

Horizontal vs. vertical scaling:

- Increase `--concurrency` to use more of a single machine's resources (vertical)
- Run multiple worker processes across machines for true horizontal scaling (see Multi-Worker Deployment)
- All workers in the same consumer group share the queue — Redis distributes tasks evenly
## Multi-Worker Deployment
Multiple worker processes can connect to the same Redis queue. Redis Streams consumer groups ensure each task is claimed by exactly one worker.
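For intuition, here is a sketch of those consumer-group semantics using redis-py. The group name `orbiter-workers` is an assumption, and the real broker adds status tracking and retries on top of this:

```python
import redis.asyncio as redis

async def claim_one(worker_id: str) -> None:
    r = redis.Redis.from_url("redis://localhost:6379")
    # XREADGROUP delivers each stream entry to exactly one consumer in the
    # group (the group itself is assumed to have been created with XGROUP).
    entries = await r.xreadgroup(
        groupname="orbiter-workers",
        consumername=worker_id,
        streams={"orbiter:tasks": ">"},  # ">" = only never-delivered entries
        count=1,
        block=5000,  # milliseconds
    )
    for _stream, messages in entries:
        for msg_id, fields in messages:
            ...  # execute the task, then acknowledge it:
            await r.xack("orbiter:tasks", "orbiter-workers", msg_id)
    await r.aclose()
```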
### Multiple processes on one machine

```bash
# Terminal 1
orbiter start worker --redis-url redis://redis:6379 --concurrency 4

# Terminal 2
orbiter start worker --redis-url redis://redis:6379 --concurrency 4
```

Each process auto-generates a unique worker ID (`{hostname}-{pid}-{random}`).
### Process manager (systemd)

```ini
# /etc/systemd/system/[email protected]
[Unit]
Description=Orbiter Distributed Worker %i
After=network.target redis.service

[Service]
Type=simple
User=orbiter
Environment=ORBITER_REDIS_URL=redis://localhost:6379
ExecStart=/usr/local/bin/orbiter start worker --concurrency 4
Restart=always
RestartSec=5
KillSignal=SIGTERM
TimeoutStopSec=30

[Install]
WantedBy=multi-user.target
```

Start multiple instances:

```bash
sudo systemctl enable --now orbiter-worker@1
sudo systemctl enable --now orbiter-worker@2
```

## Graceful Shutdown
Workers handle `SIGINT` and `SIGTERM` for graceful shutdown:

1. The shutdown event is set, stopping the claim loop from accepting new tasks
2. Currently executing tasks run to completion
3. The heartbeat loop stops
4. Redis connections (broker, store, publisher) are closed

Press Ctrl+C or send `SIGTERM` to stop a worker cleanly. Avoid `SIGKILL` — it skips cleanup and may leave tasks in a `RUNNING` state without acknowledgment.
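A minimal sketch of this wiring with asyncio (illustrative; Unix-only, since `add_signal_handler` is not available on Windows):

```python
import asyncio
import signal

async def main() -> None:
    shutdown = asyncio.Event()
    loop = asyncio.get_running_loop()
    # SIGINT / SIGTERM set the event; claim loops observe it and stop
    # accepting new work.
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, shutdown.set)
    await shutdown.wait()
    # ... wait for in-flight tasks, stop heartbeat, close Redis connections

asyncio.run(main())
```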
## Heartbeat and Health Monitoring

Workers publish health data to Redis every `heartbeat_ttl / 3` seconds (default: every 10 seconds).

Health data is published to the `orbiter:workers:{worker_id}` Redis hash:
| Field | Description |
|---|---|
| `status` | Worker status (`running`) |
| `tasks_processed` | Total successfully completed tasks |
| `tasks_failed` | Total failed tasks |
| `current_task_id` | Currently executing task ID (empty if idle) |
| `started_at` | Worker start timestamp |
| `last_heartbeat` | Last heartbeat timestamp |
| `concurrency` | Configured concurrency |
| `hostname` | Machine hostname |
The heartbeat key has a TTL equal to `heartbeat_ttl` (default 30s). If a worker crashes without graceful shutdown, the key expires, and fleet status reports the worker as dead once its last heartbeat is more than 60 seconds old.
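A sketch of that cycle (illustrative; the real worker publishes the full field set from the table above):

```python
import asyncio
import time

import redis.asyncio as redis

async def heartbeat_loop(r: redis.Redis, worker_id: str, heartbeat_ttl: int = 30) -> None:
    """Refresh the hash and its TTL every heartbeat_ttl / 3 seconds,
    so a crashed worker's key simply expires. Runs until cancelled."""
    key = f"orbiter:workers:{worker_id}"
    while True:
        await r.hset(key, mapping={"status": "running", "last_heartbeat": str(time.time())})
        await r.expire(key, heartbeat_ttl)
        await asyncio.sleep(heartbeat_ttl / 3)
```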
### Checking worker health
Via CLI:
```bash
# List all active workers
orbiter worker list

# Output:
# ┌────────────────────────┬─────────┬──────────┬───────┬────────┬──────────────┬─────────────┬─────────────────────────┐
# │ Worker ID              │ Status  │ Hostname │ Tasks │ Failed │ Current Task │ Concurrency │ Last Heartbeat          │
# ├────────────────────────┼─────────┼──────────┼───────┼────────┼──────────────┼─────────────┼─────────────────────────┤
# │ web01-1234-a1b2c3d4    │ running │ web01    │ 42    │ 2      │ -            │ 4           │ 2026-02-17 10:30:15 UTC │
# │ web02-5678-e5f6g7h8    │ running │ web02    │ 38    │ 1      │ abc123def    │ 4           │ 2026-02-17 10:30:12 UTC │
# └────────────────────────┴─────────┴──────────┴───────┴────────┴──────────────┴─────────────┴─────────────────────────┘
```

Via Python:
```python
from orbiter.distributed.health import get_worker_fleet_status, WorkerHealthCheck

# Fleet-wide status
workers = await get_worker_fleet_status("redis://localhost:6379")
for w in workers:
    print(f"{w.worker_id}: {w.status} (alive={w.alive})")

# Individual worker health check (sync, implements HealthCheck protocol)
check = WorkerHealthCheck("redis://localhost:6379", "web01-1234-a1b2c3d4")
result = check.check()
print(f"{result.status}: {result.message}")
```

Workers with a heartbeat older than 60 seconds are marked `alive=False` (dead) in fleet status.
## Task Retry Behavior

The `TaskBroker` supports automatic retries, with `max_retries=3` by default.
Retry flow:
1. Task execution fails with an exception
2. The worker sets the task status to `FAILED` and records the error
3. The worker checks the current retry count against `max_retries`
4. If retries remain: the status is set to `RETRYING` and the task is nacked (re-enqueued for any worker)
5. If retries are exhausted: the task is acknowledged and remains in the `FAILED` state (see the sketch below)
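A sketch of that ack/nack decision. The broker method names (`set_status`, `nack`, `ack`) are assumptions for illustration; the statuses follow the flow above:

```python
async def handle_failure(broker, task, error: Exception, max_retries: int = 3) -> None:
    # Hypothetical broker API — illustrates the decision, not the real methods.
    await broker.set_status(task.id, "FAILED", error=str(error))
    if task.retry_count < max_retries:
        await broker.set_status(task.id, "RETRYING")
        await broker.nack(task.id)   # re-enqueue for any worker
    else:
        await broker.ack(task.id)    # retries exhausted: task stays FAILED
```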
Monitoring retries:
```bash
# Check retry count for a specific task
orbiter task status <task_id>

# List all retrying tasks
orbiter task list --status retrying
```

## Execution Backends

### Local Execution (default)

The worker reconstructs the agent from its serialized config and runs `run.stream()` directly. Suitable for most use cases.

```bash
orbiter start worker --redis-url redis://localhost:6379
```

### Temporal Execution (durable)
For tasks that must survive worker crashes, use the Temporal execution backend. Temporal wraps agent execution in durable workflows with heartbeating activities.
Requirements:
- Temporal server running
- `temporalio` installed: `pip install orbiter-distributed[temporal]`
```bash
TEMPORAL_HOST=localhost:7233 \
TEMPORAL_NAMESPACE=default \
orbiter start worker --redis-url redis://localhost:6379
```

Or via Python:

```python
import asyncio

from orbiter.distributed.worker import Worker

worker = Worker(
    "redis://localhost:6379",
    executor="temporal",
    concurrency=4,
)

asyncio.run(worker.start())
```

How it works:
1. The worker claims a task from the Redis queue (same as local mode)
2. Instead of executing directly, it submits an `AgentExecutionWorkflow` to Temporal
3. A Temporal activity (`execute_agent_activity`) reconstructs the agent and runs `run.stream()`
4. The activity sends heartbeats every 10 events for liveness detection (see the sketch below)
5. If the worker crashes, Temporal retries the activity on another worker
6. `timeout_seconds` from `TaskPayload` sets the Temporal `start_to_close_timeout`
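A sketch of a heartbeating activity in the shape described above, using the `temporalio` SDK. The helper `build_agent` and the `run_stream()` call are hypothetical stand-ins; the real `execute_agent_activity` differs, but the heartbeat pattern is the same:

```python
from temporalio import activity

@activity.defn
async def execute_agent_sketch(config: dict) -> None:
    agent = build_agent(config)             # hypothetical reconstruction helper
    count = 0
    async for event in agent.run_stream():  # hypothetical streaming call
        count += 1
        if count % 10 == 0:
            # Liveness signal: lets Temporal detect a dead worker and
            # retry the activity elsewhere.
            activity.heartbeat(f"{count} events")
```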
## Redis Configuration

### Recommended settings
```conf
# redis.conf

# Memory — tune based on queue depth and event retention
maxmemory 1gb
maxmemory-policy noeviction

# Persistence — enable AOF for durability
appendonly yes
appendfsync everysec

# Streams — tune consumer group settings
# Default stream consumer group lag is fine for most workloads

# Connections — increase if running many workers
maxclients 10000

# Timeout — keep connections alive for workers
timeout 0
tcp-keepalive 300
```

### Memory considerations
| Redis Key | Estimated Size | Retention |
|---|---|---|
| Task queue (`orbiter:tasks`) | ~1KB per pending task | Until consumed |
| Task hashes (`orbiter:task:{id}`) | ~500B per task | TTL: 24 hours |
| Event streams (`orbiter:stream:{id}`) | ~200B per event | TTL: 1 hour |
| Worker heartbeats (`orbiter:workers:{id}`) | ~200B per worker | TTL: 30 seconds |
| Task index (`orbiter:task:index`) | ~40B per task ID | Persistent |
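A back-of-envelope estimate from the figures above (the per-item sizes are the documented approximations; the workload numbers are made up for illustration):

```python
# Hypothetical workload
pending_tasks = 10_000
tasks_per_day = 100_000        # retained 24h as task hashes
events_per_task = 50           # retained 1h in event streams
tasks_per_hour = tasks_per_day / 24

queue = pending_tasks * 1_000                      # ~1KB per pending task
hashes = tasks_per_day * 500                       # ~500B per task, 24h TTL
streams = tasks_per_hour * events_per_task * 200   # ~200B per event, 1h TTL

print(f"~{(queue + hashes + streams) / 1e6:.0f} MB")  # roughly 102 MB
```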
### Redis Sentinel / Cluster
For high availability, use Redis Sentinel or Cluster. Pass the appropriate URL:
```bash
# Sentinel
orbiter start worker --redis-url redis+sentinel://sentinel1:26379,sentinel2:26379/mymaster

# Standard HA
orbiter start worker --redis-url redis://redis-primary:6379
```

## Docker Deployment

### Dockerfile
```dockerfile
FROM python:3.12-slim

WORKDIR /app

# Install orbiter packages
RUN pip install orbiter-distributed orbiter-cli

# Copy application code (for tool resolution via importable paths)
COPY . .
RUN pip install -e .

# Default command: start a worker
CMD ["orbiter", "start", "worker", "--concurrency", "4"]
```

### docker-compose.yml
version: "3.8"
services:
redis:
image: redis:7-alpine
ports:
- "6379:6379"
volumes:
- redis-data:/data
command: redis-server --appendonly yes
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 5s
retries: 3
worker:
build: .
environment:
- ORBITER_REDIS_URL=redis://redis:6379
depends_on:
redis:
condition: service_healthy
deploy:
replicas: 2
restart: unless-stopped
stop_grace_period: 30s
stop_signal: SIGTERM
# Optional: Temporal server for durable execution
temporal:
image: temporalio/auto-setup:latest
ports:
- "7233:7233"
environment:
- DB=sqlite
profiles:
- temporal
temporal-worker:
build: .
environment:
- ORBITER_REDIS_URL=redis://redis:6379
- TEMPORAL_HOST=temporal:7233
- TEMPORAL_NAMESPACE=default
command: >
orbiter start worker
--concurrency 4
depends_on:
redis:
condition: service_healthy
temporal:
condition: service_started
profiles:
- temporal
restart: unless-stopped
volumes:
redis-data:Usage:
```bash
# Redis-only mode (2 worker replicas)
docker compose up -d

# With Temporal for durable execution
docker compose --profile temporal up -d

# Scale workers
docker compose up -d --scale worker=4
```

## Monitoring and Alerting
### Built-in alert rules
Register pre-defined alert rules for distributed system health:
```python
from orbiter.distributed.alerts import register_distributed_alerts

register_distributed_alerts()
```

Alert rules:
| Rule | Threshold | Severity |
|---|---|---|
| Queue depth high | > 100 tasks | WARNING |
| Queue depth critical | > 500 tasks | CRITICAL |
| Task failure rate | > 10% | WARNING |
| Worker count zero | = 0 workers | CRITICAL |
| Task wait time high | > 60 seconds | WARNING |
### Metrics
Workers automatically record metrics using the orbiter-observability infrastructure (OpenTelemetry when available, in-memory fallback otherwise):
| Metric | Type | Description |
|---|---|---|
| `orbiter.distributed.tasks.submitted` | Counter | Tasks submitted to the queue |
| `orbiter.distributed.tasks.completed` | Counter | Successfully completed tasks |
| `orbiter.distributed.tasks.failed` | Counter | Failed tasks |
| `orbiter.distributed.tasks.cancelled` | Counter | Cancelled tasks |
| `orbiter.distributed.queue.depth` | Gauge | Current queue depth |
| `orbiter.distributed.task.duration` | Histogram | Task execution duration (seconds) |
| `orbiter.distributed.task.wait_time` | Histogram | Time from submission to execution start |
### Task management CLI
```bash
# List all tasks
orbiter task list

# Filter by status
orbiter task list --status running
orbiter task list --status failed

# Get detailed status for a specific task
orbiter task status <task_id>

# Cancel a running task
orbiter task cancel <task_id>
```

## Scaling Recommendations
| Workload | Workers | Concurrency | Notes |
|---|---|---|---|
| Development | 1 | 1 | Single process for debugging |
| Small production | 2 | 4 | Basic HA with moderate throughput |
| Medium production | 4-8 | 4-8 | Balance across machines |
| High throughput | 10+ | 8-16 | Monitor Redis memory and connections |
Key scaling factors:

- Queue depth — If queue depth grows consistently, add more workers
- Task wait time — If tasks wait >60s before execution, add capacity (see the sizing sketch below)
- Worker failure rate — High failure rates may indicate resource constraints, not a need for more workers
- Redis connections — Each worker uses 3+ Redis connections (broker, store, publisher, heartbeat). At scale, tune `maxclients` in Redis
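To turn those signals into a rough capacity estimate, Little's law gives a starting point: the concurrent capacity needed is arrival rate times average task duration. The numbers here are illustrative:

```python
# Hypothetical workload
arrival_rate = 5.0        # tasks per second across the fleet
avg_duration = 20.0       # seconds per task
per_worker_slots = 8      # --concurrency per worker process

needed_slots = arrival_rate * avg_duration        # 100 concurrent tasks
workers = -(-needed_slots // per_worker_slots)    # ceiling division -> 13
print(f"~{int(workers)} workers at concurrency {per_worker_slots}")
```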
## Troubleshooting

### Worker not claiming tasks

- Verify Redis is reachable: `redis-cli -u $ORBITER_REDIS_URL ping`
- Check that the queue name matches between client and worker (`--queue` flag)
- Verify the consumer group exists: `redis-cli XINFO GROUPS orbiter:tasks`
### Tasks stuck in RUNNING

- Check whether the worker crashed: `orbiter worker list` — dead workers show in red
- Tasks from dead workers remain in the pending entries list (PEL). They can be reclaimed via `XCLAIM` (see the sketch below) or will expire based on the task hash TTL (24h)
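A sketch of reclaiming stale pending entries with `XAUTOCLAIM` (Redis 6.2+), the convenience form of `XCLAIM`, via redis-py. The group name is the same assumption as earlier:

```python
import redis.asyncio as redis

async def reclaim_stale(worker_id: str, min_idle_ms: int = 60_000) -> None:
    r = redis.Redis.from_url("redis://localhost:6379")
    # Transfer entries pending longer than min_idle_ms to this consumer.
    response = await r.xautoclaim(
        name="orbiter:tasks",
        groupname="orbiter-workers",
        consumername=worker_id,
        min_idle_time=min_idle_ms,
        start_id="0-0",
    )
    next_id, messages = response[0], response[1]
    for msg_id, fields in messages:
        ...  # re-execute or re-enqueue, then XACK
    await r.aclose()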
### High memory usage
- Reduce event stream TTL (default 1 hour)
- Reduce task hash TTL (default 24 hours)
- Monitor queue depth — a growing queue means workers can’t keep up
### Worker heartbeat expired
- The worker may be blocked on a long-running synchronous operation
- Increase `heartbeat_ttl` if tasks legitimately take a long time
- Ensure agents don't have blocking synchronous code in tools