
VeRL integration for reinforcement learning from human feedback (RLHF).

```python
from orbiter.train.verl import (
    RewardSpec,
    VeRLAlgorithm,
    VeRLConfig,
    VeRLTrainer,
)
```

Requires: `pip install "orbiter-train[verl] @ git+https://github.com/Midsphere-AI/orbiter-ai.git#subdirectory=packages/orbiter-train"`


VeRLAlgorithm

```python
class VeRLAlgorithm(StrEnum)
```

Supported VeRL RL algorithms.

| Value | Description |
| --- | --- |
| `PPO = "ppo"` | Proximal Policy Optimization |
| `GRPO = "grpo"` | Group Relative Policy Optimization |

RewardSpec

```python
@dataclass(frozen=True, slots=True)
class RewardSpec
```

Descriptor for a reward function used during RL training. Either `callable` (an in-process function) or `module_path` + `func_name` (an importable reference) must be provided.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `callable` | `Callable[..., float] \| None` | `None` | In-process reward function |
| `module_path` | `str` | `""` | Module containing the reward function |
| `func_name` | `str` | `""` | Function name within the module |

Raises: `TrainerError` if neither a `callable` nor a complete `module_path`/`func_name` pair is provided.
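
A short sketch of the constructor check (the import path for `TrainerError` is not shown on this page, so the failing case is left as a comment):

```python
from orbiter.train.verl import RewardSpec

RewardSpec(module_path="my_rewards", func_name="accuracy_reward")  # ok: importable reference
RewardSpec(callable=lambda output, expected: 0.0)                  # ok: in-process callable
RewardSpec()  # raises TrainerError: neither form was provided
```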

Methods

resolve

```python
def resolve(self) -> Callable[..., float]
```

Return the concrete callable, importing if necessary.

Raises: `TrainerError` if the function cannot be found or is not callable.

Example

```python
from orbiter.train.verl import RewardSpec

# In-process callable
spec = RewardSpec(callable=lambda output, expected: 1.0 if output == expected else 0.0)

# Importable reference
spec = RewardSpec(module_path="my_rewards", func_name="accuracy_reward")
fn = spec.resolve()  # imports my_rewards and returns accuracy_reward
```
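
For context, the importable form above expects a module like this hypothetical `my_rewards.py`; the `(output, expected)` signature is an assumption, since the page only requires `Callable[..., float]`:

```python
# my_rewards.py (hypothetical module matching the importable reference above)
def accuracy_reward(output: str, expected: str) -> float:
    """Exact-match reward: 1.0 when the model output equals the target."""
    return 1.0 if output.strip() == expected.strip() else 0.0
```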

VeRLConfig

```python
@dataclass(slots=True)
class VeRLConfig(TrainConfig)
```

VeRL-specific training configuration. Extends `TrainConfig` with RL algorithm selection, rollout parameters, and model/tokenizer references.

Fields (inherited from `TrainConfig`)

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `epochs` | `int` | `1` | Number of training epochs |
| `batch_size` | `int` | `8` | Batch size |
| `learning_rate` | `float` | `1e-5` | Learning rate |
| `output_dir` | `str` | `""` | Output directory |
| `extra` | `dict[str, Any]` | `{}` | Extra settings |

Fields (VeRL-specific)

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `algorithm` | `VeRLAlgorithm` | `VeRLAlgorithm.GRPO` | RL algorithm |
| `rollout_batch_size` | `int` | `4` | Rollout batch size (must be >= 1) |
| `ppo_epochs` | `int` | `4` | PPO inner-loop epochs (must be >= 1) |
| `kl_coeff` | `float` | `0.1` | KL divergence coefficient |
| `clip_range` | `float` | `0.2` | PPO clip range (must be in [0, 1]) |
| `gamma` | `float` | `1.0` | Discount factor |
| `lam` | `float` | `0.95` | GAE lambda |
| `model_name` | `str` | `""` | Model name/path for VeRL |
| `tokenizer_name` | `str` | `""` | Tokenizer name/path |
| `max_prompt_length` | `int` | `1024` | Maximum prompt length in tokens |
| `max_response_length` | `int` | `512` | Maximum response length in tokens |

Raises: `ValueError` if `rollout_batch_size` < 1, `ppo_epochs` < 1, or `clip_range` is not in [0, 1].
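
A minimal sketch of the constructor-time validation described above:

```python
from orbiter.train.verl import VeRLAlgorithm, VeRLConfig

cfg = VeRLConfig(
    algorithm=VeRLAlgorithm.PPO,
    rollout_batch_size=8,   # >= 1, so accepted
    clip_range=0.2,         # within [0, 1]
)

try:
    VeRLConfig(clip_range=1.5)   # outside [0, 1]
except ValueError as exc:
    print(exc)
```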


VeRLTrainer

```python
class VeRLTrainer(Trainer):
    def __init__(self, config: VeRLConfig | None = None)
```

Concrete trainer that integrates with the VeRL framework.

Lifecycle:

  1. `check_agent(agent)` — validate agent compatibility
  2. `check_dataset(data)` — validate dataset format
  3. `check_reward(spec)` — validate reward function
  4. `check_config(cfg)` — validate and merge VeRL config
  5. `mark_validated()` — transition to VALIDATED
  6. `train()` — execute RL training loop
  7. `evaluate(test_data)` — run evaluation
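
The order matters: `train()` refuses to run until `mark_validated()` has been called. A sketch of the guard (the `TrainerError` export path is an assumption):

```python
from orbiter.train.verl import VeRLTrainer, TrainerError  # TrainerError path is an assumption

async def premature_train() -> None:
    trainer = VeRLTrainer()
    try:
        await trainer.train()   # no check_* calls, no mark_validated()
    except TrainerError:
        print("validate first, then train")
```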

Constructor parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `config` | `VeRLConfig \| None` | `None` | VeRL configuration; defaults to `VeRLConfig()` |

Properties

| Property | Type | Description |
| --- | --- | --- |
| `verl_config` | `VeRLConfig` | Typed access to the VeRL-specific config |
| `state` | `TrainerState` | Current lifecycle state (inherited) |
| `config` | `TrainConfig` | Training configuration (inherited) |

Methods

check_agent

```python
def check_agent(self, agent: Any) -> None
```

Validate that the agent is usable for VeRL training. The agent must be non-`None` and should have an `instructions` attribute.

Raises: `TrainerError` if the agent is `None`.
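
A sketch of what passes and what fails; `EchoAgent` is a hypothetical stand-in:

```python
from orbiter.train.verl import VeRLTrainer

class EchoAgent:
    instructions = "Answer concisely."

trainer = VeRLTrainer()
trainer.check_agent(EchoAgent())   # ok: non-None, carries an instructions attribute
trainer.check_agent(None)          # raises TrainerError
```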

check_dataset

```python
def check_dataset(
    self,
    train_data: Any,
    test_data: Any | None = None,
) -> None
```

Validate training data format. Expects a sequence of dicts, each containing at least an `input` key.

Raises: `TrainerError` if the data is empty, not a list/tuple, items are not dicts, or items lack an `input` key.
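
A sketch of the accepted and rejected shapes; keys beyond `input` (such as `expected` here) are assumed to pass through untouched:

```python
from orbiter.train.verl import VeRLTrainer

trainer = VeRLTrainer()

trainer.check_dataset(
    train_data=[{"input": "2 + 2 = ?", "expected": "4"}],  # ok: dicts with an "input" key
)

trainer.check_dataset(train_data=[])             # raises TrainerError: empty
trainer.check_dataset(train_data=[{"q": "?"}])   # raises TrainerError: no "input" key
```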

check_reward

```python
def check_reward(self, reward_fn: Any | None = None) -> None
```

Validate the reward function. Accepts a `RewardSpec`, a plain callable, or `None` (uses VeRL's built-in reward).

Raises: `TrainerError` if `reward_fn` is an unsupported type.
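
A sketch of the three accepted forms plus one rejected one:

```python
from orbiter.train.verl import RewardSpec, VeRLTrainer

trainer = VeRLTrainer()

trainer.check_reward(RewardSpec(module_path="my_rewards", func_name="accuracy_reward"))
trainer.check_reward(lambda output, expected: float(output == expected))  # plain callable
trainer.check_reward(None)      # fall back to VeRL's built-in reward
trainer.check_reward("oops")    # raises TrainerError: unsupported type
```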

check_config

```python
def check_config(
    self,
    config: TrainConfig | dict[str, Any] | None = None,
) -> None
```

Validate and optionally merge VeRL config overrides. Dict values are merged into `extra`; a `VeRLConfig` replaces the current config entirely.
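
A sketch of merge versus replace; the `wandb_project` key is purely hypothetical:

```python
from orbiter.train.verl import VeRLConfig, VeRLTrainer

trainer = VeRLTrainer(VeRLConfig(epochs=3))

trainer.check_config({"wandb_project": "verl-demo"})   # dict: merged into extra
print(trainer.verl_config.extra)                       # {'wandb_project': 'verl-demo'}

trainer.check_config(VeRLConfig(epochs=5))             # VeRLConfig: replaces the config
```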

train

```python
async def train(self) -> TrainMetrics
```

Execute the VeRL RL training loop. Requires VeRL to be installed.

Returns: `TrainMetrics` with training statistics.

Raises: `TrainerError` if the trainer is not validated or VeRL is not installed.

evaluate

```python
async def evaluate(self, test_data: Any | None = None) -> TrainMetrics
```

Run evaluation on test data. Falls back to the test data supplied to `check_dataset` when `test_data` is not provided.

Returns: `TrainMetrics` with evaluation statistics.

Example

```python
from orbiter.train import VeRLTrainer, VeRLConfig, VeRLAlgorithm, RewardSpec

config = VeRLConfig(
    algorithm=VeRLAlgorithm.GRPO,
    epochs=3,
    batch_size=16,
    rollout_batch_size=8,
    model_name="Qwen/Qwen2.5-7B",
    output_dir="/tmp/verl_output",
)

trainer = VeRLTrainer(config)

# Validation phase
trainer.check_agent(my_agent)
trainer.check_dataset(
    train_data=[{"input": "What is Python?"}, {"input": "Explain ML"}],
    test_data=[{"input": "What is Java?"}],
)
trainer.check_reward(RewardSpec(callable=my_reward_fn))
trainer.check_config()
trainer.mark_validated()

# Training phase
metrics = await trainer.train()
print(f"Steps: {metrics.steps}, Loss: {metrics.loss}")

# Evaluation phase
eval_metrics = await trainer.evaluate()
print(f"Accuracy: {eval_metrics.accuracy}")
```