This guide covers configuring, deploying, and scaling distributed workers for orbiter-distributed.

Prerequisites

  • Redis 7+ — Required for Redis Streams with consumer groups
  • Python 3.11+ — Required by orbiter packages
  • orbiter-distributed — pip install orbiter-distributed
  • orbiter-cli (optional) — pip install orbiter-cli for the orbiter command

Starting a Worker

Via CLI

bash
orbiter start worker --redis-url redis://localhost:6379

Options:

| Flag | Default | Description |
| --- | --- | --- |
| --redis-url | ORBITER_REDIS_URL env var | Redis connection URL |
| --concurrency | 1 | Number of concurrent task executions per worker process |
| --queue | orbiter:tasks | Redis Streams queue name |
| --worker-id | Auto-generated | Unique worker identifier ({hostname}-{pid}-{random}) |

The startup banner displays the worker ID, masked Redis URL, queue name, and concurrency level.

Via Python

python
import asyncio
from orbiter.distributed.worker import Worker

worker = Worker(
    "redis://localhost:6379",
    concurrency=4,
    queue_name="orbiter:tasks",
    heartbeat_ttl=30,
)

asyncio.run(worker.start())

Worker Constructor Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| redis_url | str | (required) | Redis connection URL |
| worker_id | str \| None | Auto-generated | Unique worker identifier |
| concurrency | int | 1 | Concurrent task execution slots |
| queue_name | str | "orbiter:tasks" | Redis Streams queue name |
| heartbeat_ttl | int | 30 | Heartbeat key TTL in seconds |
| executor | Literal["local", "temporal"] | "local" | Execution backend |

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| ORBITER_REDIS_URL | None | Default Redis URL when --redis-url is not provided |
| TEMPORAL_HOST | localhost:7233 | Temporal server address (only for executor="temporal") |
| TEMPORAL_NAMESPACE | default | Temporal namespace (only for executor="temporal") |

Concurrency Tuning

The --concurrency flag controls how many tasks a single worker process executes in parallel. Each concurrent slot runs an independent _claim_loop coroutine using asyncio.create_task().
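
Conceptually, the slot model looks like the following minimal sketch. The _claim_loop body here is a placeholder, not the actual Worker internals:

python
import asyncio

async def _claim_loop(slot: int, shutdown: asyncio.Event) -> None:
    # Placeholder for the worker's claim loop: each slot independently
    # claims, executes, and acknowledges tasks until shutdown
    while not shutdown.is_set():
        await asyncio.sleep(1)  # stands in for claim + execute

async def start(concurrency: int) -> None:
    shutdown = asyncio.Event()
    # One independent coroutine per concurrency slot
    slots = [asyncio.create_task(_claim_loop(i, shutdown)) for i in range(concurrency)]
    await asyncio.gather(*slots)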

Guidelines:

  • CPU-bound agents (heavy tool execution): Keep concurrency low (1-2). Python’s GIL limits true parallelism.
  • I/O-bound agents (LLM API calls, network tools): Increase concurrency (4-16). Async I/O benefits from overlapping waits.
  • Memory considerations: Each concurrent task reconstructs an agent from its config and streams events. Monitor memory usage per process.

bash
# Low concurrency for CPU-heavy agents
orbiter start worker --concurrency 1

# Higher concurrency for I/O-bound LLM agents
orbiter start worker --concurrency 8

Horizontal vs. vertical scaling:

  • Increase --concurrency to use more of a single machine’s resources (vertical)
  • Run multiple worker processes across machines for true horizontal scaling (see Multi-Worker Deployment)
  • All workers in the same consumer group share the queue — Redis distributes tasks evenly

Multi-Worker Deployment

Multiple worker processes can connect to the same Redis queue. Redis Streams consumer groups ensure each task is claimed by exactly one worker.

Multiple processes on one machine

bash
# Terminal 1
orbiter start worker --redis-url redis://redis:6379 --concurrency 4

# Terminal 2
orbiter start worker --redis-url redis://redis:6379 --concurrency 4

Each process auto-generates a unique worker ID ({hostname}-{pid}-{random}).
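
Exactly-once claiming comes from standard Redis Streams consumer-group semantics. The sketch below uses the redis-py asyncio client to show the general XREADGROUP/XACK pattern; the group and consumer names are illustrative, not orbiter's internals:

python
import redis.asyncio as redis
from redis.exceptions import ResponseError

async def claim_one(url: str, queue: str, group: str, consumer: str) -> None:
    r = redis.from_url(url)
    try:
        # Create the consumer group if it doesn't exist yet
        try:
            await r.xgroup_create(queue, group, id="0", mkstream=True)
        except ResponseError:
            pass  # BUSYGROUP: group already exists
        # XREADGROUP delivers each entry to exactly one consumer in the group
        entries = await r.xreadgroup(group, consumer, {queue: ">"}, count=1, block=5000)
        for _stream, messages in entries:
            for msg_id, fields in messages:
                ...  # execute the task described by `fields`
                await r.xack(queue, group, msg_id)  # acknowledge on success
    finally:
        await r.aclose()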

Process manager (systemd)

ini
# /etc/systemd/system/[email protected]
[Unit]
Description=Orbiter Distributed Worker %i
After=network.target redis.service

[Service]
Type=simple
User=orbiter
Environment=ORBITER_REDIS_URL=redis://localhost:6379
ExecStart=/usr/local/bin/orbiter start worker --concurrency 4
Restart=always
RestartSec=5
KillSignal=SIGTERM
TimeoutStopSec=30

[Install]
WantedBy=multi-user.target

Start multiple instances:

bash
sudo systemctl enable --now orbiter-worker@1
sudo systemctl enable --now orbiter-worker@2

Graceful Shutdown

Workers handle SIGINT and SIGTERM for graceful shutdown:

  1. The shutdown event is set, stopping the claim loop from accepting new tasks
  2. Currently executing tasks run to completion
  3. The heartbeat loop stops
  4. Redis connections (broker, store, publisher) are closed

Press Ctrl+C or send SIGTERM to stop a worker cleanly. Avoid SIGKILL — it skips cleanup and may leave tasks in a RUNNING state without acknowledgment.
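
The underlying asyncio pattern looks like this minimal sketch (illustrative, not the Worker's actual code):

python
import asyncio
import signal

async def main() -> None:
    shutdown = asyncio.Event()
    loop = asyncio.get_running_loop()
    # Translate SIGINT/SIGTERM into the shutdown event so claim loops
    # stop accepting work while in-flight tasks run to completion
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, shutdown.set)
    await shutdown.wait()
    # ...drain in-flight tasks, stop the heartbeat, close Redis connections...

asyncio.run(main())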

Heartbeat and Health Monitoring

Workers publish health data to Redis every heartbeat_ttl / 3 seconds (default: every 10 seconds).

Health data published to orbiter:workers:{worker_id} Redis hash:

| Field | Description |
| --- | --- |
| status | Worker status (running) |
| tasks_processed | Total successfully completed tasks |
| tasks_failed | Total failed tasks |
| current_task_id | Currently executing task ID (empty if idle) |
| started_at | Worker start timestamp |
| last_heartbeat | Last heartbeat timestamp |
| concurrency | Configured concurrency |
| hostname | Machine hostname |

The heartbeat key has a TTL equal to heartbeat_ttl (default 30s). If a worker crashes without a graceful shutdown, the key simply expires; fleet status then marks the worker dead once its last heartbeat is more than 60 seconds old.
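
As a sketch of the mechanism (illustrative, not the Worker's actual code), the publishing side amounts to a loop like this:

python
import asyncio
import time

import redis.asyncio as redis

async def heartbeat_loop(r: redis.Redis, worker_id: str, ttl: int = 30) -> None:
    key = f"orbiter:workers:{worker_id}"
    while True:
        # Refresh health fields and reset the key's TTL; if the process
        # dies, the key simply expires after `ttl` seconds
        await r.hset(key, mapping={"status": "running", "last_heartbeat": str(time.time())})
        await r.expire(key, ttl)
        await asyncio.sleep(ttl / 3)  # default ttl=30 → publish every 10s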

Checking worker health

Via CLI:

bash
# List all active workers
orbiter worker list

# Output:
# ┌────────────────────────┬────────┬──────────┬───────┬────────┬──────────────┬─────────────┬─────────────────────────┐
# │ Worker ID              │ Status │ Hostname │ Tasks │ Failed │ Current Task │ Concurrency │ Last Heartbeat          │
# ├────────────────────────┼────────┼──────────┼───────┼────────┼──────────────┼─────────────┼─────────────────────────┤
# │ web01-1234-a1b2c3d4    │ running│ web01    │ 42    │ 2      │ -            │ 4           │ 2026-02-17 10:30:15 UTC │
# │ web02-5678-e5f6g7h8    │ running│ web02    │ 38    │ 1      │ abc123def    │ 4           │ 2026-02-17 10:30:12 UTC │
# └────────────────────────┴────────┴──────────┴───────┴────────┴──────────────┴─────────────┴─────────────────────────┘

Via Python:

python
from orbiter.distributed.health import get_worker_fleet_status, WorkerHealthCheck

# Fleet-wide status
workers = await get_worker_fleet_status("redis://localhost:6379")
for w in workers:
    print(f"{w.worker_id}: {w.status} (alive={w.alive})")

# Individual worker health check (sync, implements HealthCheck protocol)
check = WorkerHealthCheck("redis://localhost:6379", "web01-1234-a1b2c3d4")
result = check.check()
print(f"{result.status}: {result.message}")

Workers with heartbeat older than 60 seconds are marked as alive=False (dead) in fleet status.

Task Retry Behavior

The TaskBroker supports automatic retries with max_retries=3 by default.

Retry flow:

  1. Task execution fails with an exception
  2. Worker sets task status to FAILED and records the error
  3. Worker checks current retry count against max_retries
  4. If retries remain: status set to RETRYING, task is nacked (re-enqueued for any worker)
  5. If retries exhausted: task is acknowledged and remains in FAILED state
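
A minimal sketch of the decision in steps 3-5, with illustrative field names and hypothetical ack/nack helpers:

python
from dataclasses import dataclass

MAX_RETRIES = 3  # TaskBroker default

@dataclass
class Task:  # illustrative shape, not the real task model
    id: str
    status: str = "RUNNING"
    error: str = ""
    retry_count: int = 0

async def handle_failure(task: Task, error: Exception) -> None:
    task.status = "FAILED"
    task.error = str(error)
    if task.retry_count < MAX_RETRIES:
        task.retry_count += 1
        task.status = "RETRYING"
        await nack(task)  # hypothetical: re-enqueue for any worker to claim
    else:
        await ack(task)   # hypothetical: acknowledge; task stays FAILED

async def nack(task: Task) -> None: ...  # hypothetical helper
async def ack(task: Task) -> None: ...   # hypothetical helper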

Monitoring retries:

bash
# Check retry count for a specific task
orbiter task status <task_id>

# List all retrying tasks
orbiter task list --status retrying

Execution Backends

Local Execution (default)

The worker reconstructs the agent from its serialized config and runs run.stream() directly. Suitable for most use cases.

bash
orbiter start worker --redis-url redis://localhost:6379

Temporal Execution (durable)

For tasks that must survive worker crashes, use the Temporal execution backend. Temporal wraps agent execution in durable workflows with heartbeating activities.

Requirements:

  • Temporal server running
  • temporalio installed: pip install orbiter-distributed[temporal]

bash
TEMPORAL_HOST=localhost:7233 \
TEMPORAL_NAMESPACE=default \
orbiter start worker --redis-url redis://localhost:6379

Or via Python:

python
worker = Worker(
    "redis://localhost:6379",
    executor="temporal",
    concurrency=4,
)
asyncio.run(worker.start())

How it works:

  1. Worker claims task from Redis queue (same as local mode)
  2. Instead of executing directly, submits an AgentExecutionWorkflow to Temporal
  3. Temporal activity (execute_agent_activity) reconstructs agent and runs run.stream()
  4. Activity sends heartbeats every 10 events for liveness detection
  5. If the worker crashes, Temporal retries the activity on another worker
  6. timeout_seconds from TaskPayload sets the Temporal start_to_close_timeout
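
For orientation, here is a heavily simplified sketch of that shape using the temporalio SDK. The workflow and activity names match those mentioned above, but the signatures, the stream_events helper, and the timeout values are assumptions, not orbiter's actual code:

python
from datetime import timedelta
from temporalio import activity, workflow

def stream_events(payload: dict):
    """Hypothetical stand-in for reconstructing the agent and yielding events."""
    yield from ()

@activity.defn
async def execute_agent_activity(payload: dict) -> None:
    # Reconstruct the agent from its serialized config and stream events
    # (elided); heartbeat every 10 events so Temporal can detect dead workers
    for i, event in enumerate(stream_events(payload)):
        if i % 10 == 0:
            activity.heartbeat(i)

@workflow.defn
class AgentExecutionWorkflow:
    @workflow.run
    async def run(self, payload: dict) -> None:
        # timeout_seconds from the task payload becomes start_to_close_timeout
        await workflow.execute_activity(
            execute_agent_activity,
            payload,
            start_to_close_timeout=timedelta(seconds=payload.get("timeout_seconds", 300)),
            heartbeat_timeout=timedelta(seconds=60),
        )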

Redis Configuration

code
# redis.conf

# Memory — tune based on queue depth and event retention
maxmemory 1gb
maxmemory-policy noeviction

# Persistence — enable AOF for durability
appendonly yes
appendfsync everysec

# Streams — tune consumer group settings
# Default stream consumer group lag is fine for most workloads

# Connections — increase if running many workers
maxclients 10000

# Timeout — keep connections alive for workers
timeout 0
tcp-keepalive 300

Memory considerations

| Redis Key | Estimated Size | Retention |
| --- | --- | --- |
| Task queue (orbiter:tasks) | ~1KB per pending task | Until consumed |
| Task hashes (orbiter:task:{id}) | ~500B per task | TTL: 24 hours |
| Event streams (orbiter:stream:{id}) | ~200B per event | TTL: 1 hour |
| Worker heartbeats (orbiter:workers:{id}) | ~200B per worker | TTL: 30 seconds |
| Task index (orbiter:task:index) | ~40B per task ID | Persistent |
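
As a back-of-the-envelope example using these estimates — say 10,000 pending tasks, 100,000 retained task hashes, and 500,000 retained events (illustrative numbers):

python
# Rough Redis memory estimate from the table above
pending_tasks = 10_000  # ~1 KB each, until consumed
task_hashes = 100_000   # ~500 B each, 24 h TTL
events = 500_000        # ~200 B each, 1 h TTL

total_bytes = pending_tasks * 1024 + task_hashes * 500 + events * 200
print(f"~{total_bytes / 1024**2:.0f} MB")  # ~153 MB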

Redis Sentinel / Cluster

For high availability, use Redis Sentinel or Cluster. Pass the appropriate URL:

bash
# Sentinel
orbiter start worker --redis-url redis+sentinel://sentinel1:26379,sentinel2:26379/mymaster

# Standard HA
orbiter start worker --redis-url redis://redis-primary:6379

Docker Deployment

Dockerfile

dockerfile
FROM python:3.12-slim

WORKDIR /app

# Install orbiter packages
RUN pip install orbiter-distributed orbiter-cli

# Copy application code (for tool resolution via importable paths)
COPY . .
RUN pip install -e .

# Default command: start a worker
CMD ["orbiter", "start", "worker", "--concurrency", "4"]

docker-compose.yml

yaml
version: "3.8"

services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    command: redis-server --appendonly yes
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3

  worker:
    build: .
    environment:
      - ORBITER_REDIS_URL=redis://redis:6379
    depends_on:
      redis:
        condition: service_healthy
    deploy:
      replicas: 2
    restart: unless-stopped
    stop_grace_period: 30s
    stop_signal: SIGTERM

  # Optional: Temporal server for durable execution
  temporal:
    image: temporalio/auto-setup:latest
    ports:
      - "7233:7233"
    environment:
      - DB=sqlite
    profiles:
      - temporal

  temporal-worker:
    build: .
    environment:
      - ORBITER_REDIS_URL=redis://redis:6379
      - TEMPORAL_HOST=temporal:7233
      - TEMPORAL_NAMESPACE=default
    command: >
      orbiter start worker
      --concurrency 4
    depends_on:
      redis:
        condition: service_healthy
      temporal:
        condition: service_started
    profiles:
      - temporal
    restart: unless-stopped

volumes:
  redis-data:

Usage:

bash
# Redis-only mode (2 worker replicas)
docker compose up -d

# With Temporal for durable execution
docker compose --profile temporal up -d

# Scale workers
docker compose up -d --scale worker=4

Monitoring and Alerting

Built-in alert rules

Register pre-defined alert rules for distributed system health:

python
from orbiter.distributed.alerts import register_distributed_alerts

register_distributed_alerts()

Alert rules:

| Rule | Threshold | Severity |
| --- | --- | --- |
| Queue depth high | > 100 tasks | WARNING |
| Queue depth critical | > 500 tasks | CRITICAL |
| Task failure rate | > 10% | WARNING |
| Worker count zero | = 0 workers | CRITICAL |
| Task wait time high | > 60 seconds | WARNING |

Metrics

Workers automatically record metrics using the orbiter-observability infrastructure (OpenTelemetry when available, in-memory fallback otherwise):

| Metric | Type | Description |
| --- | --- | --- |
| orbiter.distributed.tasks.submitted | Counter | Tasks submitted to the queue |
| orbiter.distributed.tasks.completed | Counter | Successfully completed tasks |
| orbiter.distributed.tasks.failed | Counter | Failed tasks |
| orbiter.distributed.tasks.cancelled | Counter | Cancelled tasks |
| orbiter.distributed.queue.depth | Gauge | Current queue depth |
| orbiter.distributed.task.duration | Histogram | Task execution duration (seconds) |
| orbiter.distributed.task.wait_time | Histogram | Time from submission to execution start |
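
orbiter-observability's internals aren't shown here, but as a sketch of how such instruments map onto the standard opentelemetry-api package (an assumption about wiring, not the library's actual code):

python
from opentelemetry import metrics

meter = metrics.get_meter("orbiter.distributed")  # meter name is illustrative

# Counters increment on task lifecycle events
tasks_completed = meter.create_counter("orbiter.distributed.tasks.completed")
tasks_completed.add(1, {"worker_id": "web01-1234-a1b2c3d4"})

# Histograms record per-task timings in seconds
task_duration = meter.create_histogram("orbiter.distributed.task.duration")
task_duration.record(12.4, {"worker_id": "web01-1234-a1b2c3d4"})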

Task management CLI

bash
# List all tasks
orbiter task list

# Filter by status
orbiter task list --status running
orbiter task list --status failed

# Get detailed status for a specific task
orbiter task status <task_id>

# Cancel a running task
orbiter task cancel <task_id>

Scaling Recommendations

| Workload | Workers | Concurrency | Notes |
| --- | --- | --- | --- |
| Development | 1 | 1 | Single process for debugging |
| Small production | 2 | 4 | Basic HA with moderate throughput |
| Medium production | 4-8 | 4-8 | Balance across machines |
| High throughput | 10+ | 8-16 | Monitor Redis memory and connections |

Key scaling factors:

  1. Queue depth — If queue depth grows consistently, add more workers
  2. Task wait time — If tasks wait >60s before execution, add capacity
  3. Worker failure rate — High failure rates may indicate resource constraints, not a need for more workers
  4. Redis connections — Each worker uses 3+ Redis connections (broker, store, publisher, heartbeat). At scale, tune maxclients in Redis

Troubleshooting

Worker not claiming tasks

  • Verify Redis is reachable: redis-cli -u $ORBITER_REDIS_URL ping
  • Check the queue name matches between client and worker (--queue flag)
  • Verify the consumer group exists: redis-cli XINFO GROUPS orbiter:tasks

Tasks stuck in RUNNING

  • Check if the worker crashed: orbiter worker list — dead workers show in red
  • Tasks from dead workers remain in the pending entries list (PEL). They can be reclaimed via XCLAIM or will expire based on task hash TTL (24h)
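
One manual recovery path is XAUTOCLAIM (Redis 6.2+), which transfers stale pending entries to a live consumer. A redis-py sketch, with illustrative group/consumer names:

python
import redis.asyncio as redis

async def reclaim_stale(url: str, queue: str = "orbiter:tasks") -> None:
    r = redis.from_url(url)
    try:
        # Claim entries that have been pending for more than 60s
        # (on Redis 7+, XAUTOCLAIM also returns deleted entry IDs)
        next_id, messages, _deleted = await r.xautoclaim(
            queue, "orbiter-group", "rescue-worker", min_idle_time=60_000
        )
        for msg_id, fields in messages:
            print(msg_id, fields)  # inspect or re-execute, then XACK
    finally:
        await r.aclose()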

High memory usage

  • Reduce event stream TTL (default 1 hour)
  • Reduce task hash TTL (default 24 hours)
  • Monitor queue depth — a growing queue means workers can’t keep up

Worker heartbeat expired

  • The worker may be blocked on a long-running synchronous operation
  • Increase heartbeat_ttl if tasks legitimately take a long time
  • Ensure agents don’t have blocking synchronous code in tools