Evaluation and scoring framework with rule-based scorers, LLM-as-Judge assessment, trajectory validation, reflection, and iterative refinement.

Installation

bash
pip install "orbiter-eval @ git+https://github.com/Midsphere-AI/orbiter-ai.git#subdirectory=packages/orbiter-eval"

Module Path

python
import orbiter.eval

Public Exports (33)

Export                          Source Module        Description
Evaluator                       base                 Parallel evaluation runner with pass@k
Scorer                          base                 Abstract base class for all scorers
ScorerResult                    base                 Output from a single scorer
EvalCaseResult                  base                 Result for one input/output pair
EvalResult                      base                 Aggregated result across all cases
EvalCriteria                    base                 Threshold-based pass/fail criteria
EvalTarget                      base                 Abstract evaluation subject
EvalStatus                      base                 Outcome status enum
EvalError                       base                 Evaluation error type
FormatValidationScorer          scorers              Format validation (json/xml/yaml/markdown/csv)
SchemaValidationScorer          scorers              JSON Schema validation
OutputCorrectnessScorer         scorers              Ground truth / keyword matching
OutputLengthScorer              scorers              Length constraint checking
OutputRelevanceScorer           scorers              Keyword-overlap relevance
OutputCompletenessScorer        scorers              Required sections checking
LLMAsJudgeScorer                llm_scorer           Base LLM-as-Judge scorer
OutputQualityScorer             llm_scorer           5-dimensional quality assessment
LogicConsistencyScorer          llm_scorer           Internal contradiction detection
ReasoningValidityScorer         llm_scorer           Argumentation logic validation
ConstraintSatisfactionScorer    llm_scorer           Binary constraint checking
TrajectoryValidator             trajectory_scorers   Trajectory structural integrity
TimeCostScorer                  trajectory_scorers   Execution time scoring
AnswerAccuracyLLMScorer         trajectory_scorers   Reference answer comparison
LabelDistributionScorer         trajectory_scorers   Label balance / skew analysis
scorer_register                 trajectory_scorers   Decorator to register scorers
get_scorer                      trajectory_scorers   Lookup registered scorer by name
list_scorers                    trajectory_scorers   List all registered scorer names
Reflector                       reflection           Abstract 3-step reflector
GeneralReflector                reflection           LLM-powered reflector
ReflectionHistory               reflection           Tracks reflections over time
ReflectionResult                reflection           Single reflection output
ReflectionType                  reflection           Reflection category enum
ReflectionLevel                 reflection           Reflection depth enum

Import Patterns

python
# Core evaluation
from orbiter.eval import Evaluator, Scorer, ScorerResult, EvalCriteria

# Rule-based scorers
from orbiter.eval import (
    FormatValidationScorer,
    SchemaValidationScorer,
    OutputCorrectnessScorer,
    OutputLengthScorer,
    OutputRelevanceScorer,
    OutputCompletenessScorer,
)

# LLM-as-Judge scorers
from orbiter.eval import (
    LLMAsJudgeScorer,
    OutputQualityScorer,
    LogicConsistencyScorer,
    ReasoningValidityScorer,
    ConstraintSatisfactionScorer,
)

# Trajectory scorers + registry
from orbiter.eval import (
    TrajectoryValidator,
    TimeCostScorer,
    AnswerAccuracyLLMScorer,
    LabelDistributionScorer,
    scorer_register,
    get_scorer,
    list_scorers,
)

# Reflection
from orbiter.eval import (
    Reflector,
    GeneralReflector,
    ReflectionHistory,
    ReflectionResult,
    ReflectionType,
    ReflectionLevel,
)

# Ralph iterative refinement (sub-package)
from orbiter.eval.ralph.runner import RalphRunner, RalphResult
from orbiter.eval.ralph.config import RalphConfig, LoopState, StopType
from orbiter.eval.ralph.detectors import StopDetector, CompositeDetector
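
The trajectory-scorer registry (scorer_register, get_scorer, list_scorers) lets custom scorers be registered and looked up by name. A minimal sketch follows; the decorator's calling convention, the score hook, and the ScorerResult fields are assumptions, since only the names above are documented here.

python
# Hypothetical sketch: only Scorer, ScorerResult, scorer_register, get_scorer,
# and list_scorers are documented names; the decorator argument, the score()
# signature, and the ScorerResult fields are assumptions.
from orbiter.eval import Scorer, ScorerResult, get_scorer, list_scorers, scorer_register

@scorer_register("answer_length")                   # assumed: registers under a name
class AnswerLengthScorer(Scorer):
    async def score(self, case_id, input, output):  # assumed hook signature
        # Score longer answers higher, capped at 1.0 (illustrative heuristic).
        return ScorerResult(score=min(len(output) / 100, 1.0))  # assumed fields

print(list_scorers())                     # should now include "answer_length"
scorer = get_scorer("answer_length")      # look up the registered scorer by name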

Architecture

code
orbiter.eval
  base.py                Evaluator, Scorer ABC, result types, criteria
  scorers.py             6 rule-based scorers
  llm_scorer.py          LLMAsJudgeScorer + 4 specialized subclasses
  trajectory_scorers.py  Trajectory/time/accuracy scorers + registry
  reflection.py          Reflector ABC, GeneralReflector, history
  ralph/
    config.py          RalphConfig, LoopState, StopType, sub-configs
    runner.py          RalphRunner (5-phase loop)
    detectors.py       StopDetector ABC + 5 built-in + composite
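
The ralph sub-package drives the Run-Analyze-Learn-Plan-Halt loop named above. Here is a minimal wiring sketch, assuming RalphConfig has usable defaults and RalphRunner exposes an async run() entry point returning a RalphResult; only the import paths and class names are documented.

python
# Hypothetical sketch of wiring the Run-Analyze-Learn-Plan-Halt loop.
# RalphConfig's fields, RalphRunner's constructor, and run() are assumptions;
# only the import paths and class names come from this page.
import asyncio

from orbiter.eval.ralph.config import RalphConfig
from orbiter.eval.ralph.runner import RalphRunner

async def refine():
    config = RalphConfig()                # assumed: defaults are valid
    runner = RalphRunner(config=config)   # assumed keyword constructor
    result = await runner.run()           # assumed async entry point -> RalphResult
    return result

asyncio.run(refine())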

Quick Example

python
import asyncio
from orbiter.eval import (
    Evaluator,
    EvalTarget,
    EvalCriteria,
    FormatValidationScorer,
    OutputCorrectnessScorer,
)

class MySystem(EvalTarget):
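    # Stub target: always returns a fixed JSON answer.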
    async def predict(self, case_id, input):
        return '{"answer": "Paris"}'

async def main():
    evaluator = Evaluator(
        scorers=[
            FormatValidationScorer("json"),
            OutputCorrectnessScorer(keywords=["Paris"]),
        ],
        criteria=[EvalCriteria("format_json", threshold=1.0)],
        parallel=4,
    )

    dataset = [
        {"id": "q1", "input": "What is the capital of France?"},
    ]

    result = await evaluator.evaluate(MySystem(), dataset)
    print(result.summary)
    # {'format_json': 1.0, 'correctness': 1.0}

asyncio.run(main())
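
Evaluation runs cases concurrently (parallel=4 above), and each EvalCriteria entry gates pass/fail on the named scorer's score against its threshold; per the export table, Evaluator also supports pass@k-style repeated attempts. Failed cases can then feed the reflection layer. A minimal sketch follows, assuming GeneralReflector constructs with defaults and exposes an async reflect() method and that ReflectionHistory has an add() method; only the class names come from the export list.

python
# Hypothetical sketch: only the class names are documented; GeneralReflector's
# constructor, reflect(), and ReflectionHistory.add() are assumptions.
import asyncio

from orbiter.eval import GeneralReflector, ReflectionHistory

async def reflect_on_failure():
    reflector = GeneralReflector()        # assumed: default LLM wiring
    history = ReflectionHistory()         # assumed no-arg construction
    result = await reflector.reflect(     # assumed method name
        "Output failed the format_json criterion",
    )
    history.add(result)                   # assumed method name

asyncio.run(reflect_on_failure())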

Submodule Reference

Page                 Description
base                 Evaluator, Scorer, result types, criteria, EvalTarget
scorers              Rule-based format, schema, correctness, length, relevance, completeness scorers
llm-scorer           LLM-as-Judge scorers for quality, logic, reasoning, constraints
trajectory-scorers   Trajectory validation, time cost, accuracy, label distribution, scorer registry
reflection           Reflector framework with 3-step pipeline and history tracking
ralph                Ralph iterative refinement loop (Run-Analyze-Learn-Plan-Halt)