orbiter.eval.trajectory_scorers
Trajectory validation, time cost, accuracy, and label distribution scorers. Includes a scorer registry with an @scorer_register() decorator for automatic discovery and factory-based creation.
Module Path
```python
from orbiter.eval.trajectory_scorers import (
    scorer_register,
    get_scorer,
    list_scorers,
    TrajectoryValidator,
    TimeCostScorer,
    AnswerAccuracyLLMScorer,
    LabelDistributionScorer,
)
```
Scorer Registry
A module-level registry for Scorer subclasses, enabling factory-based lookup by name.
scorer_register()
Decorator that registers a Scorer subclass under a given name.
```python
def scorer_register(name: str) -> Callable
```
| Parameter | Type | Description |
|---|---|---|
| name | str | Registry key for the scorer class |
Returns: A decorator that registers the class and returns it unchanged.
Usage:
```python
from orbiter.eval import Scorer, ScorerResult, scorer_register

@scorer_register("my_metric")
class MyScorer(Scorer):
    async def score(self, case_id, input, output):
        return ScorerResult(scorer_name="my_metric", score=1.0)
```
get_scorer()
Lookup a registered scorer class by name.
```python
def get_scorer(name: str) -> type[Scorer]
```
| Parameter | Type | Description |
|---|---|---|
| name | str | Registry key |
Returns: The scorer class.
Raises: KeyError if the name is not registered.
list_scorers()
Return all registered scorer names (sorted).
```python
def list_scorers() -> list[str]
```
Returns: Sorted list of registered scorer names.
Built-in Registrations
| Name | Class |
|---|---|
"trajectory" | TrajectoryValidator |
"time_cost" | TimeCostScorer |
"answer_accuracy" | AnswerAccuracyLLMScorer |
"label_distribution" | LabelDistributionScorer |
TrajectoryValidator
Validates a trajectory (list of step dicts) for structural integrity. Checks each step for required keys and returns the fraction of valid steps.
Registry name: "trajectory"
Constructor
```python
TrajectoryValidator(
    *,
    required_keys: Sequence[str] = ("action",),
    name: str = "trajectory",
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| required_keys | Sequence[str] | ("action",) | Keys required in each step dict |
| name | str | "trajectory" | Scorer name |
Methods
score()
```python
async def score(self, case_id: str, input: Any, output: Any) -> ScorerResult
```
Output format: Expects output to be:
- A `list[dict]` of step dicts, OR
- A `dict` with a `"trajectory"` key containing a list of step dicts

Per-step validation:
- Must have a `"step"` or `"id"` key
- Must have all keys in `required_keys`
Score: valid_steps / total_steps. Returns 0.0 for empty or invalid trajectories.
Details: {"valid": int, "total": int, "errors": [str]}.
Example
```python
import asyncio
from orbiter.eval import TrajectoryValidator

async def main():
    scorer = TrajectoryValidator(required_keys=("action", "observation"))
    trajectory = [
        {"step": 1, "action": "search", "observation": "found 3 results"},
        {"step": 2, "action": "click"},  # Missing observation
        {"id": "s3", "action": "submit", "observation": "success"},
    ]
    result = await scorer.score("c1", None, trajectory)
    print(f"Score: {result.score:.2f}")  # 0.67 (2 of 3 valid)
    print(result.details["errors"])

asyncio.run(main())
```
TimeCostScorer
Scores based on execution time relative to a maximum budget. Reads _time_cost_ms from the output dict.
Registry name: "time_cost"
Constructor
```python
TimeCostScorer(*, max_ms: float = 30_000.0, name: str = "time_cost")
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_ms | float | 30_000.0 | Maximum time budget in milliseconds |
| name | str | "time_cost" | Scorer name |
Methods
score()
```python
async def score(self, case_id: str, input: Any, output: Any) -> ScorerResult
```
Score formula: clamp(1.0 - elapsed / max_ms, 0.0, 1.0)

Reads output["_time_cost_ms"] when output is a dict; otherwise the elapsed time falls back to 0.0, which yields a score of 1.0.
Details: {"elapsed_ms": float, "max_ms": float}.
Example
```python
import asyncio
from orbiter.eval import TimeCostScorer

async def main():
    scorer = TimeCostScorer(max_ms=10_000.0)

    # Fast execution: 1 - 2000/10000 = 0.80
    r1 = await scorer.score("c1", None, {"_time_cost_ms": 2000.0, "result": "ok"})
    print(f"Score: {r1.score:.2f}")  # 0.80

    # Slow execution: over budget, clamped to 0.0
    r2 = await scorer.score("c2", None, {"_time_cost_ms": 15000.0, "result": "ok"})
    print(f"Score: {r2.score:.2f}")  # 0.00

asyncio.run(main())
```
AnswerAccuracyLLMScorer
LLM-as-Judge scorer comparing agent output to a reference answer. Extends LLMAsJudgeScorer.
Registry name: "answer_accuracy"
Constructor
```python
AnswerAccuracyLLMScorer(
    judge: Any = None,
    *,
    question_key: str = "question",
    answer_key: str = "answer",
    name: str = "answer_accuracy",
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| judge | Any | None | Async callable (prompt: str) -> str |
| question_key | str | "question" | Key in input dict for the question |
| answer_key | str | "answer" | Key in input dict for the reference answer |
| name | str | "answer_accuracy" | Scorer name |
Overridden Methods
build_prompt()
Formats the prompt with three sections: [Question], [Correct Answer], and [Agent Response].
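The exact wording is internal to the class; for the input {"question": "What is 2+2?", "answer": "4"} and the agent output "The answer is 4.", the assembled prompt might look roughly like this (illustrative only):

```text
[Question]
What is 2+2?

[Correct Answer]
4

[Agent Response]
The answer is 4.
```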
Expected LLM Response
{"score": 0.85, "explanation": "The answer is mostly correct but missed..."}Example
```python
import asyncio
from orbiter.eval import AnswerAccuracyLLMScorer

# A stub judge that always returns a fixed verdict
async def judge(prompt: str) -> str:
    return '{"score": 0.9, "explanation": "Correct with minor omissions."}'

async def main():
    scorer = AnswerAccuracyLLMScorer(judge=judge)
    result = await scorer.score(
        "c1",
        {"question": "What is 2+2?", "answer": "4"},
        "The answer is 4.",
    )
    print(f"Score: {result.score}")  # 0.9

asyncio.run(main())
```
LabelDistributionScorer
Evaluates label balance / distribution skew across a dataset. The per-case score is a 0.0 placeholder; the useful signal lives in the details dict and the summarize() aggregation.
Registry name: "label_distribution"
Constructor
```python
LabelDistributionScorer(*, label_key: str = "label", name: str = "label_distribution")
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| label_key | str | "label" | Key in the input dict to extract the label from |
| name | str | "label_distribution" | Scorer name |
Methods
score()
```python
async def score(self, case_id: str, input: Any, output: Any) -> ScorerResult
```
Returns score 0.0 with details={"label": <value>}. The label is extracted from input[label_key].
summarize()
```python
def summarize(self, results: list[ScorerResult]) -> dict[str, Any]
```
Compute label distribution across all scored cases.
| Parameter | Type | Description |
|---|---|---|
| results | list[ScorerResult] | Scored results to aggregate |
Returns: Dict with:
| Key | Type | Description |
|---|---|---|
| labels | list | Sorted unique labels |
| fractions | list[float] | Fraction for each label |
| counts | dict | Raw counts per label |
| skew | float | max_fraction - min_fraction (0 = perfectly balanced) |
Example
```python
import asyncio
from orbiter.eval import LabelDistributionScorer

async def main():
    scorer = LabelDistributionScorer(label_key="category")
    cases = [
        {"category": "positive"},
        {"category": "positive"},
        {"category": "negative"},
        {"category": "neutral"},
    ]
    results = []
    for i, case in enumerate(cases):
        r = await scorer.score(f"c{i}", case, "output")
        results.append(r)

    summary = scorer.summarize(results)
    print(summary["counts"])  # {'negative': 1, 'neutral': 1, 'positive': 2}
    print(f"Skew: {summary['skew']:.2f}")  # 0.25 (max 0.50 - min 0.25)

asyncio.run(main())
```