Core evaluation framework: targets, scorers, criteria, result types, and the parallel evaluator.

Module Path

python
from orbiter.eval.base import (
    EvalError,
    EvalStatus,
    ScorerResult,
    EvalCaseResult,
    EvalResult,
    EvalCriteria,
    EvalTarget,
    Scorer,
    Evaluator,
)

EvalError

Raised when an evaluation fails. Inherits from OrbiterError.

python
class EvalError(OrbiterError): ...

EvalStatus

Outcome status for a single metric evaluation.

python
class EvalStatus(StrEnum):
    PASSED = "passed"
    FAILED = "failed"
    NOT_EVALUATED = "not_evaluated"
| Value | Description |
| --- | --- |
| `PASSED` | Score meets or exceeds the threshold |
| `FAILED` | Score is below the threshold |
| `NOT_EVALUATED` | No criteria applied (default) |
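Because EvalStatus is a StrEnum, its members are also plain strings, so comparing against raw values works directly:

python
from orbiter.eval.base import EvalStatus

# StrEnum members compare equal to their string values
assert EvalStatus.PASSED == "passed"
assert EvalStatus("failed") is EvalStatus.FAILED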

ScorerResult

Output from a single scorer applied to one case.

Decorator: @dataclass(frozen=True, slots=True)

Constructor

python
ScorerResult(
    scorer_name: str,
    score: float,
    status: EvalStatus = EvalStatus.NOT_EVALUATED,
    details: dict[str, Any] = {},
)
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `scorer_name` | `str` | (required) | Name identifying the scorer |
| `score` | `float` | (required) | Numeric score (typically 0.0-1.0) |
| `status` | `EvalStatus` | `NOT_EVALUATED` | Pass/fail status |
| `details` | `dict[str, Any]` | `{}` | Additional scorer-specific information |
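A ScorerResult is normally produced by a Scorer, but it can be constructed directly. The instance is frozen, so fields cannot be changed after creation; the scorer name "exact_match" below is only illustrative:

python
from orbiter.eval.base import EvalStatus, ScorerResult

result = ScorerResult(
    scorer_name="exact_match",  # illustrative name; match it in EvalCriteria
    score=1.0,
    status=EvalStatus.PASSED,
    details={"expected": "42", "got": "42"},
)
# result.score = 0.5  # would raise FrozenInstanceError (frozen dataclass)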

EvalCaseResult

Result for one evaluation case (one input/output pair).

Decorator: @dataclass(frozen=True, slots=True)

Constructor

python
EvalCaseResult(
    case_id: str,
    input: Any,
    output: Any,
    scores: dict[str, ScorerResult] = {},
)
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `case_id` | `str` | (required) | Unique case identifier |
| `input` | `Any` | (required) | The input fed to the target |
| `output` | `Any` | (required) | The output from the target |
| `scores` | `dict[str, ScorerResult]` | `{}` | Scorer results keyed by scorer name |

EvalResult

Aggregated result across all cases.

Decorator: @dataclass(slots=True)

Constructor

python
EvalResult(
    case_results: list[EvalCaseResult] = [],
    summary: dict[str, Any] = {},
    pass_at_k: dict[int, float] = {},
)
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `case_results` | `list[EvalCaseResult]` | `[]` | Per-case results |
| `summary` | `dict[str, Any]` | `{}` | Mean score per scorer across all cases |
| `pass_at_k` | `dict[int, float]` | `{}` | Pass@k metrics (populated when `repeat_times > 1` and criteria are set) |

EvalCriteria

Threshold-based pass/fail criteria for a metric.

Decorator: @dataclass(frozen=True, slots=True)

Constructor

python
EvalCriteria(
    metric_name: str,
    threshold: float = 0.5,
)
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `metric_name` | `str` | (required) | Name matching a scorer’s `scorer_name` |
| `threshold` | `float` | `0.5` | Minimum score to pass |

Methods

judge()

python
def judge(self, value: float) -> EvalStatus

Return PASSED if value >= threshold, else FAILED.
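The comparison is inclusive, so a score exactly at the threshold passes:

python
from orbiter.eval.base import EvalCriteria, EvalStatus

criteria = EvalCriteria("correctness", threshold=0.8)
assert criteria.judge(0.8) is EvalStatus.PASSED   # boundary value passes
assert criteria.judge(0.79) is EvalStatus.FAILED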


EvalTarget (ABC)

Abstract callable evaluation subject that wraps the system under test.

Abstract Methods

predict()

python
async def predict(self, case_id: str, input: Any) -> Any

Run the system under test and return its output.

| Parameter | Type | Description |
| --- | --- | --- |
| `case_id` | `str` | Unique identifier for this evaluation case |
| `input` | `Any` | Input data from the dataset |

Returns: The system’s output for scoring.
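A concrete target only needs to implement predict(). The sketch below echoes its input in place of a real system call; substituting a model, API, or pipeline invocation is the usual pattern:

python
from typing import Any

from orbiter.eval.base import EvalTarget

class EchoTarget(EvalTarget):
    """Illustrative target that echoes the input instead of calling a real system."""

    async def predict(self, case_id: str, input: Any) -> Any:
        # A real target would await a model or service call here.
        return f"[{case_id}] {input}"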


Scorer (ABC)

Abstract scorer that evaluates one (input, output) pair.

Abstract Methods

score()

python
async def score(self, case_id: str, input: Any, output: Any) -> ScorerResult

Score a single case and return a ScorerResult.

| Parameter | Type | Description |
| --- | --- | --- |
| `case_id` | `str` | Unique case identifier |
| `input` | `Any` | The input from the dataset |
| `output` | `Any` | The output from the evaluation target |

Returns: ScorerResult.
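A custom metric is a small Scorer subclass. The hypothetical exact-match scorer below leaves status at its NOT_EVALUATED default, since the Evaluator applies any matching EvalCriteria afterwards (see Behavior below):

python
from typing import Any

from orbiter.eval.base import Scorer, ScorerResult

class ExactMatchScorer(Scorer):
    """Hypothetical scorer: 1.0 when the output equals the case's expected value."""

    async def score(self, case_id: str, input: Any, output: Any) -> ScorerResult:
        expected = input.get("expected") if isinstance(input, dict) else input
        matched = output == expected
        return ScorerResult(
            scorer_name="exact_match",
            score=1.0 if matched else 0.0,
            details={"expected": expected, "matched": matched},
        )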


Evaluator

Runs an EvalTarget over a dataset and scores the results. Supports parallel execution via an asyncio semaphore, and can repeat each case (repeat_times) to compute pass@k metrics.

Constructor

python
Evaluator(
    scorers: list[Scorer],
    *,
    criteria: list[EvalCriteria] | None = None,
    parallel: int = 4,
    repeat_times: int = 1,
)
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `scorers` | `list[Scorer]` | (required) | List of scorers to apply |
| `criteria` | `list[EvalCriteria] \| None` | `None` | Optional pass/fail criteria per scorer |
| `parallel` | `int` | `4` | Max concurrent evaluations |
| `repeat_times` | `int` | `1` | Repeat each case N times (for pass@k) |

Raises: EvalError if parallel < 1 or repeat_times < 1.

Methods

evaluate()

python
async def evaluate(
    self,
    target: EvalTarget,
    dataset: list[dict[str, Any]],
) -> EvalResult

Run the target over every case in the dataset, score each case, and return aggregated results.

| Parameter | Type | Description |
| --- | --- | --- |
| `target` | `EvalTarget` | The system under test |
| `dataset` | `list[dict[str, Any]]` | List of cases, each with `"id"` and `"input"` keys |

Returns: EvalResult with per-case results, summary (mean scores), and pass@k metrics.

Behavior:

  1. Creates an asyncio.Semaphore(parallel) for concurrency control
  2. Runs each case repeat_times times
  3. Calls target.predict() to get the output, then each scorer’s score()
  4. If criteria exist for a scorer, applies judge() to set the result’s status
  5. Computes the mean score per scorer (summary) and pass@k

A simplified sketch of this flow appears below.
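This sketch of steps 1-3 is not the library's actual code, but it shows how a semaphore bounds the number of in-flight predictions while all case invocations are scheduled at once:

python
import asyncio
from typing import Any

async def run_all(target, scorers, dataset, parallel: int, repeat_times: int):
    semaphore = asyncio.Semaphore(parallel)  # step 1: bound concurrency

    async def run_once(case: dict[str, Any]):
        async with semaphore:  # at most `parallel` cases run concurrently
            output = await target.predict(case["id"], case["input"])
            return [await s.score(case["id"], case["input"], output) for s in scorers]

    # step 2: every case is scheduled repeat_times times
    tasks = [run_once(case) for case in dataset for _ in range(repeat_times)]
    return await asyncio.gather(*tasks)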

Dunder Methods

| Method | Description |
| --- | --- |
| `__repr__` | e.g. `Evaluator(scorers=3, parallel=4, repeat_times=1)` |

Example

python
import asyncio
from orbiter.eval import (
    Evaluator,
    EvalTarget,
    EvalCriteria,
    OutputCorrectnessScorer,
    OutputLengthScorer,
)

class MyAgent(EvalTarget):
    async def predict(self, case_id, input):
        return f"The answer is {input.get('expected', 'unknown')}"

async def main():
    evaluator = Evaluator(
        scorers=[
            OutputCorrectnessScorer(keywords=["answer"]),
            OutputLengthScorer(min_length=10, max_length=500),
        ],
        criteria=[
            EvalCriteria("correctness", threshold=0.8),
            EvalCriteria("length", threshold=1.0),
        ],
        parallel=8,
        repeat_times=3,  # For pass@k computation
    )

    dataset = [
        {"id": "q1", "input": {"expected": "42"}},
        {"id": "q2", "input": {"expected": "Paris"}},
    ]

    result = await evaluator.evaluate(MyAgent(), dataset)

    # Summary: mean scores per scorer
    print(result.summary)

    # Pass@k metrics
    print(result.pass_at_k)

    # Per-case results
    for cr in result.case_results:
        print(f"Case {cr.case_id}: {cr.scores}")

asyncio.run(main())