orbiter.eval.ralph
Ralph iterative refinement loop: Run, Analyze, Learn, Plan, Halt. A 5-phase execution engine that combines scoring, reflection, and re-prompting to iteratively improve agent outputs.
Ralph iterative refinement loop: Run, Analyze, Learn, Plan, Halt. A 5-phase execution engine that combines scoring, reflection, and re-prompting to iteratively improve agent outputs.
Module Paths
from orbiter.eval.ralph.runner import RalphRunner, RalphResult, ExecuteFn, RePlanFn
from orbiter.eval.ralph.config import (
StopType,
ValidationConfig,
ReflectionConfig,
StopConditionConfig,
RalphConfig,
LoopState,
)
from orbiter.eval.ralph.detectors import (
StopDecision,
StopDetector,
MaxIterationDetector,
TimeoutDetector,
CostLimitDetector,
ConsecutiveFailureDetector,
ScoreThresholdDetector,
CompositeDetector,
)StopType
Categorised exit reason for loop termination.
class StopType(StrEnum):
NONE = "none"
COMPLETION = "completion"
MAX_ITERATIONS = "max_iterations"
TIMEOUT = "timeout"
MAX_COST = "max_cost"
MAX_CONSECUTIVE_FAILURES = "max_consecutive_failures"
SCORE_THRESHOLD = "score_threshold"
USER_INTERRUPTED = "user_interrupted"
SYSTEM_ERROR = "system_error"| Value | Description |
|---|---|
NONE | No stop condition triggered |
COMPLETION | Task completed successfully |
MAX_ITERATIONS | Reached maximum iteration count |
TIMEOUT | Wall-clock time limit exceeded |
MAX_COST | Cumulative cost limit reached |
MAX_CONSECUTIVE_FAILURES | Too many failures in a row |
SCORE_THRESHOLD | Score meets or exceeds threshold |
USER_INTERRUPTED | User requested interruption |
SYSTEM_ERROR | Unrecoverable system error |
Methods
is_success()
def is_success(self) -> boolReturns True for COMPLETION and SCORE_THRESHOLD.
is_failure()
def is_failure(self) -> boolReturns True for MAX_CONSECUTIVE_FAILURES and SYSTEM_ERROR.
ValidationConfig
Configuration for the Analyze (scoring) phase.
Decorator: @dataclass(frozen=True, slots=True)
Constructor
ValidationConfig(
enabled: bool = True,
scorer_names: tuple[str, ...] = (),
min_score_threshold: float = 0.5,
parallel: int = 4,
timeout: float = 0.0,
)| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | True | Whether scoring is enabled |
scorer_names | tuple[str, ...] | () | Names of scorers to use (empty = use all) |
min_score_threshold | float | 0.5 | Below this mean score, reflection is triggered |
parallel | int | 4 | Max parallel scorer executions |
timeout | float | 0.0 | Scorer timeout (0 = no timeout) |
Raises: ValueError if min_score_threshold is not in [0.0, 1.0].
ReflectionConfig
Configuration for the Learn (reflection) phase.
Decorator: @dataclass(frozen=True, slots=True)
Constructor
ReflectionConfig(
enabled: bool = True,
level: str = "medium",
max_history: int = 50,
)| Field | Type | Default | Description |
|---|---|---|---|
enabled | bool | True | Whether reflection is enabled |
level | str | "medium" | Reflection depth level |
max_history | int | 50 | Maximum reflection history entries |
StopConditionConfig
Configuration for the Halt (stop detection) phase.
Decorator: @dataclass(frozen=True, slots=True)
Constructor
StopConditionConfig(
max_iterations: int = 10,
timeout: float = 0.0,
max_cost: float = 0.0,
max_consecutive_failures: int = 3,
score_threshold: float = 0.0,
enable_user_interrupt: bool = False,
)| Field | Type | Default | Description |
|---|---|---|---|
max_iterations | int | 10 | Stop after this many iterations |
timeout | float | 0.0 | Wall-clock timeout in seconds (0 = disabled) |
max_cost | float | 0.0 | Cumulative cost limit (0 = disabled) |
max_consecutive_failures | int | 3 | Stop after N consecutive failures |
score_threshold | float | 0.0 | Stop when mean score reaches this (0 = disabled) |
enable_user_interrupt | bool | False | Allow user-initiated interruption |
Raises: ValueError if max_iterations < 1.
RalphConfig
Unified configuration for the Ralph iterative refinement loop. Aggregates validation, reflection, and stop-condition settings.
Decorator: @dataclass(frozen=True, slots=True)
Constructor
RalphConfig(
validation: ValidationConfig = ValidationConfig(),
reflection: ReflectionConfig = ReflectionConfig(),
stop_condition: StopConditionConfig = StopConditionConfig(),
metadata: dict[str, Any] = {},
)| Field | Type | Default | Description |
|---|---|---|---|
validation | ValidationConfig | ValidationConfig() | Scoring phase configuration |
reflection | ReflectionConfig | ReflectionConfig() | Reflection phase configuration |
stop_condition | StopConditionConfig | StopConditionConfig() | Halt phase configuration |
metadata | dict[str, Any] | {} | Custom metadata |
LoopState
Mutable runtime state for a Ralph loop execution. Tracks iteration count, timing, cost, and aggregated score/reflection history.
Decorator: @dataclass(slots=True)
Constructor
LoopState()Attributes
| Attribute | Type | Default | Description |
|---|---|---|---|
iteration | int | 0 | Current iteration number |
start_time | float | time.monotonic() | Loop start timestamp |
cumulative_cost | float | 0.0 | Total cost across iterations |
consecutive_failures | int | 0 | Current consecutive failure streak |
successful_steps | int | 0 | Total successful steps |
failed_steps | int | 0 | Total failed steps |
total_tokens | int | 0 | Total tokens consumed |
score_history | list[dict[str, float]] | [] | Score snapshots per iteration |
reflection_history | list[dict[str, Any]] | [] | Reflection summaries per iteration |
metadata | dict[str, Any] | {} | Custom metadata |
Query Methods
elapsed()
def elapsed(self) -> floatSeconds since the loop started.
success_rate()
def success_rate(self) -> floatFraction of successful steps. Returns 0.0 when no steps have been executed.
latest_score()
def latest_score(self) -> dict[str, float]Return the most recent score snapshot, or empty dict.
best_score()
def best_score(self, metric: str) -> floatReturn the highest value seen for a metric across all iterations.
| Parameter | Type | Description |
|---|---|---|
metric | str | Scorer name to look up |
Mutation Methods
record_score()
def record_score(self, scores: dict[str, float]) -> NoneAppend a score snapshot for the current iteration.
record_reflection()
def record_reflection(self, reflection: dict[str, Any]) -> NoneAppend a reflection summary for the current iteration.
record_success()
def record_success(self, *, tokens: int = 0, cost: float = 0.0) -> NoneMark the current step as successful. Resets consecutive_failures to 0.
| Parameter | Type | Default | Description |
|---|---|---|---|
tokens | int | 0 | Tokens consumed |
cost | float | 0.0 | Cost incurred |
record_failure()
def record_failure(self, *, cost: float = 0.0) -> NoneMark the current step as failed. Increments consecutive_failures.
| Parameter | Type | Default | Description |
|---|---|---|---|
cost | float | 0.0 | Cost incurred |
Serialization
to_dict()
def to_dict(self) -> dict[str, Any]Serialise to a plain dict for checkpointing.
Dunder Methods
| Method | Description |
|---|---|
__repr__ | LoopState(iteration=5, success_rate=80.0%, elapsed=12.3s) |
StopDecision
Outcome of a single detector evaluation.
Decorator: @dataclass(frozen=True, slots=True)
Constructor
StopDecision(
should_stop: bool,
stop_type: StopType = StopType.NONE,
reason: str = "",
metadata: dict[str, Any] = {},
)| Field | Type | Default | Description |
|---|---|---|---|
should_stop | bool | (required) | Whether the loop should stop |
stop_type | StopType | NONE | The category of stop condition |
reason | str | "" | Human-readable explanation |
metadata | dict[str, Any] | {} | Additional decision context |
Dunder Methods
| Method | Description |
|---|---|
__bool__ | Returns should_stop |
StopDetector (ABC)
Base class for pluggable stop-condition detectors. Each detector examines the current LoopState and the static StopConditionConfig.
Abstract Methods
check()
async def check(
self,
state: LoopState,
config: StopConditionConfig,
) -> StopDecisionReturn a stop decision for the current loop state.
Built-in Detectors
MaxIterationDetector
Stops when the loop has reached config.max_iterations.
MaxIterationDetector()Triggers StopType.MAX_ITERATIONS.
TimeoutDetector
Stops when elapsed wall-clock time exceeds config.timeout. A timeout of 0.0 (the default) disables this detector.
TimeoutDetector()Triggers StopType.TIMEOUT.
CostLimitDetector
Stops when cumulative cost meets or exceeds config.max_cost. A max_cost of 0.0 (the default) disables this detector.
CostLimitDetector()Triggers StopType.MAX_COST.
ConsecutiveFailureDetector
Stops when consecutive failures reach config.max_consecutive_failures. A value of 0 disables this detector.
ConsecutiveFailureDetector()Triggers StopType.MAX_CONSECUTIVE_FAILURES.
ScoreThresholdDetector
Stops when the mean of the latest score snapshot meets or exceeds config.score_threshold. A score_threshold of 0.0 (the default) disables this detector.
ScoreThresholdDetector()Triggers StopType.SCORE_THRESHOLD.
CompositeDetector
Aggregates multiple detectors and returns the first triggered decision.
Constructor
CompositeDetector(detectors: list[StopDetector] | None = None)| Parameter | Type | Default | Description |
|---|---|---|---|
detectors | list[StopDetector] | None | None | List of detectors to run (empty if None) |
Methods
add()
def add(self, detector: StopDetector) -> CompositeDetectorAppend a detector. Returns self for chaining.
check()
async def check(
self,
state: LoopState,
config: StopConditionConfig,
) -> StopDecisionIterates through all detectors in order. Returns the first StopDecision where should_stop=True, or a “continue” decision if none trigger.
Dunder Methods
| Method | Description |
|---|---|
__len__ | Number of detectors |
__repr__ | CompositeDetector(detectors=5) |
Type Aliases
ExecuteFn = Callable[..., Any] # Async callable: (input: str) -> str
RePlanFn = Callable[..., Any] # Async callable: (prompt: str) -> strRalphResult
Final outcome of a Ralph loop execution.
Decorator: @dataclass(frozen=True, slots=True)
Constructor
RalphResult(
output: str,
stop_type: StopType,
reason: str,
iterations: int,
scores: dict[str, float],
state: dict[str, Any],
reflections: list[dict[str, Any]] = [],
)| Field | Type | Default | Description |
|---|---|---|---|
output | str | (required) | Final output from the last iteration |
stop_type | StopType | (required) | Why the loop stopped |
reason | str | (required) | Human-readable stop reason |
iterations | int | (required) | Total iterations executed |
scores | dict[str, float] | (required) | Final scores from the last iteration |
state | dict[str, Any] | (required) | Serialized LoopState |
reflections | list[dict[str, Any]] | [] | Reflection summaries per iteration |
RalphRunner
Implements the 5-phase Ralph iterative refinement loop.
Phases Per Iteration
- Run — Execute the agent/task via
execute_fn - Analyze — Score the output using configured scorers
- Learn — Reflect on failures to extract actionable insights
- Plan — Re-prompt by appending reflection suggestions to input
- Halt — Check stop conditions; break or continue
Constructor
RalphRunner(
execute_fn: ExecuteFn,
scorers: list[Scorer],
*,
config: RalphConfig | None = None,
reflector: Reflector | None = None,
replan_fn: RePlanFn | None = None,
)| Parameter | Type | Default | Description |
|---|---|---|---|
execute_fn | ExecuteFn | (required) | Async callable to run the task |
scorers | list[Scorer] | (required) | Scorers for the Analyze phase |
config | RalphConfig | None | None | Loop configuration (defaults to RalphConfig()) |
reflector | Reflector | None | None | Reflector for the Learn phase |
replan_fn | RePlanFn | None | None | Custom re-plan function (unused in current implementation) |
Methods
run()
async def run(self, input: str) -> RalphResultExecute the full Ralph loop on the given input and return the result.
| Parameter | Type | Description |
|---|---|---|
input | str | Initial input to the task |
Returns: RalphResult with the final output, stop type, scores, and reflections.
Behavior per iteration:
- Run: Calls
execute_fn(current_input). On success, records success. On exception, records failure. - Analyze: If validation is enabled and execution succeeded, runs all scorers and records scores.
- Learn: If reflection is enabled and either execution failed or mean score <
min_score_threshold, callsreflector.reflect(). - Plan: If reflection produced suggestions, appends them to the original input as
[Previous feedback]. - Halt: Runs all stop detectors via
CompositeDetector. Stops on the first triggered condition.
Built-in Detectors
The runner automatically creates a CompositeDetector with these detectors (in order):
MaxIterationDetectorTimeoutDetectorCostLimitDetectorConsecutiveFailureDetectorScoreThresholdDetector
Dunder Methods
| Method | Description |
|---|---|
__repr__ | RalphRunner(scorers=3, config=RalphConfig(...)) |
Example
import asyncio
from orbiter.eval import (
GeneralReflector,
OutputCorrectnessScorer,
OutputLengthScorer,
)
from orbiter.eval.ralph.runner import RalphRunner
from orbiter.eval.ralph.config import (
RalphConfig,
ValidationConfig,
ReflectionConfig,
StopConditionConfig,
)
iteration_count = 0
async def my_agent(input: str) -> str:
global iteration_count
iteration_count += 1
if iteration_count >= 3:
return "The capital of France is Paris."
return "I'm not sure about that."
async def my_judge(prompt: str) -> str:
return '''{
"summary": "The answer was incomplete.",
"key_findings": ["Missing specific answer"],
"root_causes": ["Insufficient confidence"],
"insights": ["Need to be more decisive"],
"suggestions": ["Provide a direct, specific answer"]
}'''
async def main():
config = RalphConfig(
validation=ValidationConfig(min_score_threshold=0.8),
reflection=ReflectionConfig(enabled=True),
stop_condition=StopConditionConfig(
max_iterations=5,
score_threshold=0.9,
),
)
runner = RalphRunner(
execute_fn=my_agent,
scorers=[
OutputCorrectnessScorer(keywords=["Paris", "capital"]),
OutputLengthScorer(min_length=10),
],
config=config,
reflector=GeneralReflector(judge=my_judge),
)
result = await runner.run("What is the capital of France?")
print(f"Output: {result.output}")
print(f"Stop type: {result.stop_type}")
print(f"Iterations: {result.iterations}")
print(f"Scores: {result.scores}")
print(f"Success: {result.stop_type.is_success()}")
# Check reflections
for ref in result.reflections:
print(f" Iteration {ref['iteration']}: {ref['summary']}")
asyncio.run(main())