Skip to content
Navigation

Rule-based scorers for format validation, schema validation, correctness, length, relevance, and completeness.

Module Path

python
from orbiter.eval.scorers import (
    FormatValidationScorer,
    SchemaValidationScorer,
    OutputCorrectnessScorer,
    OutputLengthScorer,
    OutputRelevanceScorer,
    OutputCompletenessScorer,
)

FormatValidationScorer

Validates that output conforms to a specified format. Supports five formats: json, xml, yaml, markdown, csv.

Constructor

python
FormatValidationScorer(fmt: str = "json", *, name: str | None = None)
ParameterTypeDefaultDescription
fmtstr"json"Target format to validate against
namestr | NoneNoneScorer name (defaults to "format_{fmt}")

Raises: ValueError if fmt is not one of the supported formats.

Supported Formats

FormatValidation Logic
"json"Parses with json.loads()
"xml"Parses with xml.etree.ElementTree.fromstring()
"yaml"Parses with yaml.safe_load(), must produce dict or list. Falls back to regex if pyyaml not installed
"markdown"Detects headings, lists, links, code fences, blockquotes, or bold markers
"csv"Requires header + data row with consistent delimiter counts. Checks ,, \t, ;, |

Methods

score()

python
async def score(self, case_id: str, input: Any, output: Any) -> ScorerResult

Returns score 1.0 if format is valid, 0.0 otherwise. Details include {"format": "<fmt>"} and any error information.

Example

python
import asyncio
from orbiter.eval import FormatValidationScorer

async def main():
    json_scorer = FormatValidationScorer("json")
    xml_scorer = FormatValidationScorer("xml")
    md_scorer = FormatValidationScorer("markdown")

    r1 = await json_scorer.score("c1", None, '{"key": "value"}')
    print(r1.score)  # 1.0

    r2 = await json_scorer.score("c2", None, "not json")
    print(r2.score)  # 0.0

    r3 = await xml_scorer.score("c3", None, "<root><item>text</item></root>")
    print(r3.score)  # 1.0

    r4 = await md_scorer.score("c4", None, "# Hello\n\nSome **bold** text")
    print(r4.score)  # 1.0

asyncio.run(main())

SchemaValidationScorer

Validates JSON output against a JSON Schema. Performs minimal validation: type checking, required fields, and recursive property validation.

Constructor

python
SchemaValidationScorer(schema: dict[str, Any], *, name: str = "schema")
ParameterTypeDefaultDescription
schemadict[str, Any](required)JSON Schema to validate against
namestr"schema"Scorer name

Methods

score()

python
async def score(self, case_id: str, input: Any, output: Any) -> ScorerResult

Returns 1.0 if the output is valid JSON matching the schema, 0.0 otherwise. Details include an "errors" list when validation fails.

Supported Schema Keywords

KeywordDescription
typeChecks object, array, string, number, integer, boolean
requiredChecks for presence of required fields in objects
propertiesRecursively validates nested properties

Example

python
import asyncio
from orbiter.eval import SchemaValidationScorer

async def main():
    schema = {
        "type": "object",
        "required": ["name", "age"],
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
        },
    }
    scorer = SchemaValidationScorer(schema)

    r1 = await scorer.score("c1", None, '{"name": "Alice", "age": 30}')
    print(r1.score)  # 1.0

    r2 = await scorer.score("c2", None, '{"name": "Bob"}')
    print(r2.score)  # 0.0
    print(r2.details)  # {'errors': ["Missing required field: 'age'"]}

asyncio.run(main())

OutputCorrectnessScorer

Checks output against a ground truth string or keyword list.

Constructor

python
OutputCorrectnessScorer(
    *,
    ground_truth: str | None = None,
    keywords: list[str] | None = None,
    normalize: bool = True,
    name: str = "correctness",
)
ParameterTypeDefaultDescription
ground_truthstr | NoneNoneExpected output for exact matching
keywordslist[str] | NoneNoneKeywords to check for presence
normalizeboolTrueNormalize whitespace and case for ground truth comparison
namestr"correctness"Scorer name

Methods

score()

python
async def score(self, case_id: str, input: Any, output: Any) -> ScorerResult

With ground_truth: Returns 1.0 for exact match (after optional normalization), 0.0 otherwise. Details include {"match": bool}.

With keywords: Returns fraction of keywords found (case-insensitive substring match). Details include {"found": [...], "missing": [...]}.

With neither: Returns 0.0 with an error.

Example

python
import asyncio
from orbiter.eval import OutputCorrectnessScorer

async def main():
    # Exact match
    exact = OutputCorrectnessScorer(ground_truth="Hello World")
    r1 = await exact.score("c1", None, "  hello   world  ")
    print(r1.score)  # 1.0 (normalized match)

    # Keyword check
    kw = OutputCorrectnessScorer(keywords=["Python", "machine learning", "AI"])
    r2 = await kw.score("c2", None, "Python is great for AI applications")
    print(r2.score)  # 0.667 (2 of 3 keywords found)
    print(r2.details["missing"])  # ['machine learning']

asyncio.run(main())

OutputLengthScorer

Scores based on output length being within a specified range.

Constructor

python
OutputLengthScorer(
    *,
    min_length: int = 1,
    max_length: int = 10_000,
    name: str = "length",
)
ParameterTypeDefaultDescription
min_lengthint1Minimum allowed character count
max_lengthint10_000Maximum allowed character count
namestr"length"Scorer name

Methods

score()

python
async def score(self, case_id: str, input: Any, output: Any) -> ScorerResult

Returns 1.0 if min_length <= len(output) <= max_length, else 0.0. Details include {"length": int, "min": int, "max": int}.

Example

python
import asyncio
from orbiter.eval import OutputLengthScorer

async def main():
    scorer = OutputLengthScorer(min_length=10, max_length=100)

    r1 = await scorer.score("c1", None, "This is a valid length response.")
    print(r1.score)  # 1.0

    r2 = await scorer.score("c2", None, "Short")
    print(r2.score)  # 0.0
    print(r2.details)  # {'length': 5, 'min': 10, 'max': 100}

asyncio.run(main())

OutputRelevanceScorer

Keyword-overlap relevance between input and output. Computes the fraction of input words that appear in the output.

Constructor

python
OutputRelevanceScorer(*, name: str = "relevance")
ParameterTypeDefaultDescription
namestr"relevance"Scorer name

Methods

score()

python
async def score(self, case_id: str, input: Any, output: Any) -> ScorerResult

Score = min(overlap_count / input_word_count, 1.0). Uses case-insensitive word splitting. Returns 0.0 if input is empty.

Details include {"overlap": int, "input_words": int}.

Example

python
import asyncio
from orbiter.eval import OutputRelevanceScorer

async def main():
    scorer = OutputRelevanceScorer()

    r = await scorer.score(
        "c1",
        "What is Python programming?",
        "Python is a popular programming language used for many tasks.",
    )
    print(f"Score: {r.score:.2f}")
    print(r.details)  # {'overlap': 3, 'input_words': 4}

asyncio.run(main())

OutputCompletenessScorer

Checks that output covers all required sections (case-insensitive substring match).

Constructor

python
OutputCompletenessScorer(required_sections: list[str], *, name: str = "completeness")
ParameterTypeDefaultDescription
required_sectionslist[str](required)Sections that must appear in the output
namestr"completeness"Scorer name

Methods

score()

python
async def score(self, case_id: str, input: Any, output: Any) -> ScorerResult

Score = fraction of required sections found. Details include {"found": [...], "missing": [...]}.

Example

python
import asyncio
from orbiter.eval import OutputCompletenessScorer

async def main():
    scorer = OutputCompletenessScorer(
        required_sections=["introduction", "methodology", "results", "conclusion"]
    )

    output = """
    # Introduction
    This study examines...
    # Methodology
    We used a survey approach...
    # Results
    The findings show...
    """

    r = await scorer.score("c1", None, output)
    print(r.score)  # 0.75 (3 of 4 sections found)
    print(r.details["missing"])  # ['conclusion']

asyncio.run(main())