This module provides scorers for trajectory validation, time cost, answer accuracy, and label distribution. It also includes a scorer registry: the @scorer_register() decorator registers scorer classes for automatic discovery, and get_scorer() supports factory-based creation by name.

Module Path

python
from orbiter.eval.trajectory_scorers import (
    scorer_register,
    get_scorer,
    list_scorers,
    TrajectoryValidator,
    TimeCostScorer,
    AnswerAccuracyLLMScorer,
    LabelDistributionScorer,
)

Scorer Registry

A module-level registry for Scorer subclasses, enabling factory-based lookup by name.

scorer_register()

Decorator that registers a Scorer subclass under a given name.

python
def scorer_register(name: str) -> Callable

| Parameter | Type | Description |
|---|---|---|
| name | str | Registry key for the scorer class |

Returns: A decorator that registers the class and returns it unchanged.

Usage:

python
from orbiter.eval import Scorer, ScorerResult, scorer_register

@scorer_register("my_metric")
class MyScorer(Scorer):
    async def score(self, case_id, input, output):
        return ScorerResult(scorer_name="my_metric", score=1.0)

get_scorer()

Look up a registered scorer class by name.

python
def get_scorer(name: str) -> type[Scorer]

| Parameter | Type | Description |
|---|---|---|
| name | str | Registry key |

Returns: The scorer class.

Raises: KeyError if the name is not registered.

list_scorers()

Return all registered scorer names (sorted).

python
def list_scorers() -> list[str]

Returns: Sorted list of registered scorer names.
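
As a rough illustration of how a name-based registry like this can work, the sketch below re-creates the three functions in plain Python. It is not Orbiter's implementation: the `_REGISTRY` dict and the stand-in `Scorer` base class are assumptions made so that the example runs on its own.

```python
from typing import Callable


class Scorer:
    """Stand-in base class (an assumption; not orbiter's Scorer)."""


# Hypothetical module-level storage: registry key -> scorer class.
_REGISTRY: dict[str, type[Scorer]] = {}


def scorer_register(name: str) -> Callable[[type[Scorer]], type[Scorer]]:
    """Return a decorator that stores the decorated class under `name`."""
    def decorator(cls: type[Scorer]) -> type[Scorer]:
        _REGISTRY[name] = cls
        return cls  # the class itself is returned unchanged
    return decorator


def get_scorer(name: str) -> type[Scorer]:
    """Look up a class by key; the dict lookup raises KeyError if missing."""
    return _REGISTRY[name]


def list_scorers() -> list[str]:
    """Return the registered names in sorted order."""
    return sorted(_REGISTRY)


@scorer_register("my_metric")
class MyScorer(Scorer):
    pass


@scorer_register("another_metric")
class AnotherScorer(Scorer):
    pass


print(list_scorers())                       # ['another_metric', 'my_metric']
print(get_scorer("my_metric") is MyScorer)  # True
```

Because the decorator returns the class unchanged, a registered scorer can still be imported and instantiated directly; registration only adds the name-based lookup path.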

Built-in Registrations

| Name | Class |
|---|---|
| "trajectory" | TrajectoryValidator |
| "time_cost" | TimeCostScorer |
| "answer_accuracy" | AnswerAccuracyLLMScorer |
| "label_distribution" | LabelDistributionScorer |

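One use of name-based lookup is building scorers from configuration. The sketch below assumes a hypothetical config format (a list of dicts with `name` and `kwargs` entries) and a hypothetical `build_scorers` helper. It uses small stand-in classes and a stand-in `get_scorer` so that it runs without Orbiter installed; with the real package you would presumably import `get_scorer` instead.

```python
from typing import Any


class TrajectoryStandIn:
    """Stand-in for a registered scorer class (assumption for this sketch)."""
    def __init__(self, *, required_keys=("action",), name: str = "trajectory"):
        self.required_keys = tuple(required_keys)
        self.name = name


class TimeCostStandIn:
    """Stand-in for a registered scorer class (assumption for this sketch)."""
    def __init__(self, *, max_ms: float = 30_000.0, name: str = "time_cost"):
        self.max_ms = max_ms
        self.name = name


# Stand-in registry keyed by the built-in names listed above.
_STAND_IN_REGISTRY = {"trajectory": TrajectoryStandIn, "time_cost": TimeCostStandIn}


def get_scorer(name: str) -> type:
    return _STAND_IN_REGISTRY[name]  # KeyError for unknown names


def build_scorers(config: list[dict[str, Any]]) -> list[Any]:
    """Instantiate one scorer per config entry (hypothetical helper)."""
    return [get_scorer(entry["name"])(**entry.get("kwargs", {})) for entry in config]


config = [
    {"name": "trajectory", "kwargs": {"required_keys": ("action", "observation")}},
    {"name": "time_cost", "kwargs": {"max_ms": 10_000.0}},
]
scorers = build_scorers(config)
print([s.name for s in scorers])  # ['trajectory', 'time_cost']
```
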
TrajectoryValidator

Validates a trajectory (list of step dicts) for structural integrity. Checks each step for required keys and returns the fraction of valid steps.

Registry name: "trajectory"

Constructor

python
TrajectoryValidator(
    *,
    required_keys: Sequence[str] = ("action",),
    name: str = "trajectory",
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| required_keys | Sequence[str] | ("action",) | Keys required in each step dict |
| name | str | "trajectory" | Scorer name |

Methods

score()

python
async def score(self, case_id: str, input: Any, output: Any) -> ScorerResult

Output format: The output argument must be either:

  • A list[dict] of step dicts, OR
  • A dict with a "trajectory" key containing a list of step dicts

Per-step validation:

  • Must have a "step" or "id" key
  • Must have all keys in required_keys

Score: valid_steps / total_steps. Returns 0.0 for empty or invalid trajectories.

Details: {"valid": int, "total": int, "errors": [str]}.
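
The rules above can be sketched as a standalone function. This is only an approximation for illustration: the function name `validate_trajectory`, the error message wording, and the details returned for an empty or invalid trajectory are assumptions, not the library's actual behavior.

```python
from typing import Any, Sequence


def validate_trajectory(
    output: Any, required_keys: Sequence[str] = ("action",)
) -> tuple[float, dict[str, Any]]:
    """Return (score, details) following the per-step rules described above."""
    # Accept either a bare list of steps or {"trajectory": [...]}.
    steps = output.get("trajectory") if isinstance(output, dict) else output
    if not isinstance(steps, list) or not steps:
        # Assumed shape of the details for the 0.0 case.
        return 0.0, {"valid": 0, "total": 0, "errors": ["empty or invalid trajectory"]}

    valid = 0
    errors: list[str] = []
    for index, step in enumerate(steps):
        if not isinstance(step, dict):
            errors.append(f"step {index}: not a dict")
            continue
        problems = []
        if "step" not in step and "id" not in step:
            problems.append("missing 'step' or 'id'")
        missing = [key for key in required_keys if key not in step]
        if missing:
            problems.append(f"missing required keys {missing}")
        if problems:
            errors.append(f"step {index}: " + "; ".join(problems))
        else:
            valid += 1

    return valid / len(steps), {"valid": valid, "total": len(steps), "errors": errors}


score, details = validate_trajectory(
    {"trajectory": [{"step": 1, "action": "search"}, {"action": "click"}]}
)
print(score, details["valid"], details["total"])  # 0.5 1 2
```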

Example

python
import asyncio
from orbiter.eval import TrajectoryValidator

async def main():
    scorer = TrajectoryValidator(required_keys=("action", "observation"))

    trajectory = [
        {"step": 1, "action": "search", "observation": "found 3 results"},
        {"step": 2, "action": "click"},  # Missing observation
        {"id": "s3", "action": "submit", "observation": "success"},
    ]

    result = await scorer.score("c1", None, trajectory)
    print(f"Score: {result.score:.2f}")  # 0.67 (2 of 3 valid)
    print(result.details["errors"])

asyncio.run(main())

TimeCostScorer

Scores based on execution time relative to a maximum budget. Reads _time_cost_ms from the output dict.

Registry name: "time_cost"

Constructor

python
TimeCostScorer(*, max_ms: float = 30_000.0, name: str = "time_cost")

| Parameter | Type | Default | Description |
|---|---|---|---|
| max_ms | float | 30_000.0 | Maximum time budget in milliseconds |
| name | str | "time_cost" | Scorer name |

Methods

score()

python
async def score(self, case_id: str, input: Any, output: Any) -> ScorerResult

Score formula: clamp(1.0 - elapsed / max_ms, 0.0, 1.0)

Reads output["_time_cost_ms"] if output is a dict. When that value is unavailable, elapsed time falls back to 0.0, so the formula yields a score of 1.0.

Details: {"elapsed_ms": float, "max_ms": float}.
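
The formula and fallback can be worked through in a few lines. The helper name `time_cost_score` is hypothetical, and the sketch assumes `max_ms` is positive; it mirrors the documented formula rather than reproducing the library's code.

```python
from typing import Any


def time_cost_score(output: Any, max_ms: float = 30_000.0) -> float:
    """Compute clamp(1.0 - elapsed / max_ms, 0.0, 1.0); elapsed defaults to 0.0."""
    elapsed = 0.0
    if isinstance(output, dict):
        elapsed = float(output.get("_time_cost_ms", 0.0))
    return max(0.0, min(1.0, 1.0 - elapsed / max_ms))


print(time_cost_score({"_time_cost_ms": 2_500.0}, max_ms=10_000.0))   # 0.75
print(time_cost_score({"_time_cost_ms": 15_000.0}, max_ms=10_000.0))  # 0.0
print(time_cost_score("no timing info", max_ms=10_000.0))             # 1.0
```

Note the consequence of the fallback: an output that carries no timing information scores the same as an instantaneous run.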

Example

python
import asyncio
from orbiter.eval import TimeCostScorer

async def main():
    scorer = TimeCostScorer(max_ms=10_000.0)

    # Fast execution
    r1 = await scorer.score("c1", None, {"_time_cost_ms": 2000.0, "result": "ok"})
    print(f"Score: {r1.score:.2f}")  # 0.80

    # Slow execution
    r2 = await scorer.score("c2", None, {"_time_cost_ms": 15000.0, "result": "ok"})
    print(f"Score: {r2.score:.2f}")  # 0.00

asyncio.run(main())

AnswerAccuracyLLMScorer

LLM-as-Judge scorer comparing agent output to a reference answer. Extends LLMAsJudgeScorer.

Registry name: "answer_accuracy"

Constructor

python
AnswerAccuracyLLMScorer(
    judge: Any = None,
    *,
    question_key: str = "question",
    answer_key: str = "answer",
    name: str = "answer_accuracy",
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| judge | Any | None | Async callable (prompt: str) -> str |
| question_key | str | "question" | Key in input dict for the question |
| answer_key | str | "answer" | Key in input dict for the reference answer |
| name | str | "answer_accuracy" | Scorer name |

Overridden Methods

build_prompt()

Formats the prompt with three sections: [Question], [Correct Answer], and [Agent Response].
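
To make the three-section layout concrete, here is a minimal sketch of a prompt builder. Only the section labels come from the description above; the function name `build_accuracy_prompt`, the spacing between sections, and the handling of a non-dict input are assumptions, and the real template may include additional instructions for the judge.

```python
from typing import Any


def build_accuracy_prompt(
    input: Any,
    output: Any,
    question_key: str = "question",
    answer_key: str = "answer",
) -> str:
    """Assemble a three-section judge prompt (exact layout is an assumption)."""
    data = input if isinstance(input, dict) else {}
    question = data.get(question_key, "")
    reference = data.get(answer_key, "")
    return (
        f"[Question]\n{question}\n\n"
        f"[Correct Answer]\n{reference}\n\n"
        f"[Agent Response]\n{output}"
    )


prompt = build_accuracy_prompt(
    {"question": "What is 2+2?", "answer": "4"}, "The answer is 4."
)
print(prompt)
```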

Expected LLM Response

json
{"score": 0.85, "explanation": "The answer is mostly correct but missed..."}
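
A judge reply in this shape can be parsed with the standard `json` module. The sketch below is an illustration only: the helper name `parse_judge_response`, the fallback to 0.0 on malformed replies, and the clamping of the score to [0.0, 1.0] are assumptions rather than documented behavior.

```python
import json


def parse_judge_response(raw: str) -> tuple[float, str]:
    """Extract (score, explanation) from a JSON judge reply.

    Falls back to (0.0, <reason>) on malformed input; that fallback is an
    assumption of this sketch.
    """
    try:
        payload = json.loads(raw)
        score = float(payload["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0, "unparseable judge response"
    # Keep the score inside [0.0, 1.0] in case the judge goes out of range.
    score = max(0.0, min(1.0, score))
    return score, str(payload.get("explanation", ""))


print(parse_judge_response('{"score": 0.85, "explanation": "Mostly correct."}'))
# (0.85, 'Mostly correct.')
print(parse_judge_response("I think it is right"))
# (0.0, 'unparseable judge response')
```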

Example

python
import asyncio
from orbiter.eval import AnswerAccuracyLLMScorer

async def judge(prompt: str) -> str:
    return '{"score": 0.9, "explanation": "Correct with minor omissions."}'

async def main():
    scorer = AnswerAccuracyLLMScorer(judge=judge)

    result = await scorer.score(
        "c1",
        {"question": "What is 2+2?", "answer": "4"},
        "The answer is 4.",
    )
    print(f"Score: {result.score}")  # 0.9

asyncio.run(main())

LabelDistributionScorer

Evaluates label balance (distribution skew) across a dataset. The per-case score is always 0.0, a placeholder; the useful information is the label recorded in each result's details dict and the aggregate returned by summarize().

Registry name: "label_distribution"

Constructor

python
LabelDistributionScorer(*, label_key: str = "label", name: str = "label_distribution")

| Parameter | Type | Default | Description |
|---|---|---|---|
| label_key | str | "label" | Key in the input dict to extract the label from |
| name | str | "label_distribution" | Scorer name |

Methods

score()

python
async def score(self, case_id: str, input: Any, output: Any) -> ScorerResult

Returns score 0.0 with details={"label": <value>}. The label is extracted from input[label_key].

summarize()

python
def summarize(self, results: list[ScorerResult]) -> dict[str, Any]

Compute label distribution across all scored cases.

| Parameter | Type | Description |
|---|---|---|
| results | list[ScorerResult] | Scored results to aggregate |

Returns: Dict with:

| Key | Type | Description |
|---|---|---|
| labels | list | Sorted unique labels |
| fractions | list[float] | Fraction for each label |
| counts | dict | Raw counts per label |
| skew | float | max_fraction - min_fraction (0 = perfectly balanced) |
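
The aggregation can be sketched with `collections.Counter`. This is a standalone approximation: the helper `summarize_labels` is hypothetical and takes raw labels rather than ScorerResult objects, it assumes fractions are listed in the same order as the sorted labels, and returning a skew of 0.0 for an empty input is also an assumption.

```python
from collections import Counter
from typing import Any


def summarize_labels(labels: list[Any]) -> dict[str, Any]:
    """Compute counts, fractions, and skew for a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    ordered = sorted(counts)  # sorted unique labels
    fractions = [counts[label] / total for label in ordered] if total else []
    # Skew is the gap between the most and least common label fractions.
    skew = max(fractions) - min(fractions) if fractions else 0.0
    return {
        "labels": ordered,
        "fractions": fractions,
        "counts": {label: counts[label] for label in ordered},
        "skew": skew,
    }


summary = summarize_labels(["positive", "positive", "negative", "neutral"])
print(summary["labels"])     # ['negative', 'neutral', 'positive']
print(summary["fractions"])  # [0.25, 0.25, 0.5]
print(summary["skew"])       # 0.25
```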

Example

python
import asyncio
from orbiter.eval import LabelDistributionScorer

async def main():
    scorer = LabelDistributionScorer(label_key="category")

    cases = [
        {"category": "positive"},
        {"category": "positive"},
        {"category": "negative"},
        {"category": "neutral"},
    ]

    results = []
    for i, case in enumerate(cases):
        r = await scorer.score(f"c{i}", case, "output")
        results.append(r)

    summary = scorer.summarize(results)
    print(summary["counts"])    # {'negative': 1, 'neutral': 1, 'positive': 2}
    print(f"Skew: {summary['skew']:.2f}")  # 0.25

asyncio.run(main())