CLI Reference

The shipped `gym-anything` commands and what they do.

The main command is:

gym-anything

gym-anything run

Run one environment, optionally with a task.

Examples:

gym-anything run moodle --task enroll_student -i
gym-anything run benchmarks/cua_world/environments/moodle_env --task enroll_student

Important flags:

  • --task: task ID to load
  • -i, --interactive: keep the environment alive for interactive use
  • --steps: number of steps in non-interactive mode
  • --seed: reset seed
  • --open-vnc: open a VNC viewer in interactive mode

If you pass a short environment name such as moodle, the CLI resolves it against benchmarks/cua_world/environments/. If you don't pass --task, the CLI picks a random task from that environment.
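
The resolution and task-selection behavior described above can be pictured with a small Python sketch. This is not the CLI's actual implementation: `resolve_environment` and `pick_task` are hypothetical helpers, and the `<name>_env` fallback is an assumption inferred only from the `moodle` / `moodle_env` pair in the examples.

```python
import random
from pathlib import Path

# Root named in the text above; everything else here is illustrative.
DEFAULT_ROOT = Path("benchmarks/cua_world/environments")


def resolve_environment(name, root=DEFAULT_ROOT):
    """Return the environment directory for a path or a short name."""
    direct = Path(name)
    if direct.is_dir():
        # The argument is already a usable directory path.
        return direct
    # Assumption: a short name maps to <root>/<name> or <root>/<name>_env.
    for candidate in (Path(root) / name, Path(root) / f"{name}_env"):
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(f"no environment named {name!r} under {root}")


def pick_task(task_ids, task=None, rng=random):
    """Return the requested task, or a random one when none was given."""
    if task is not None:
        return task
    return rng.choice(sorted(task_ids))
```

With a layout like the one in the examples, `resolve_environment("moodle")` would return the `moodle_env` directory under the default root, and `pick_task(ids)` would choose one ID at random when `--task` is omitted.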

gym-anything benchmark

Run an agent on benchmark tasks. This is the main way to evaluate agents.

Single task:

gym-anything benchmark moodle --task enroll_student --agent ClaudeAgent --model claude-opus-4

All tasks in an environment:

gym-anything benchmark moodle --agent ClaudeAgent --model claude-opus-4 --split test

Full corpus:

gym-anything benchmark all --agent ClaudeAgent --model claude-opus-4 --split test

Important flags:

  • --agent (required): agent class name (e.g. ClaudeAgent)
  • --task: task ID. Omit to run all tasks in the split (batch mode)
  • --model: model identifier (e.g. claude-opus-4)
  • --steps: max steps per task (default: 50)
  • --split: task split for batch mode (default: test)
  • --agent-arg KEY=VALUE: extra agent argument (repeatable)
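
To make the flag semantics above concrete, here is a minimal `argparse` sketch that mirrors the listed flags and defaults. It is an illustration, not the CLI's real parser: `build_parser` and `parse_agent_args` are hypothetical names, and splitting each `--agent-arg` on the first `=` is an assumption.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented benchmark flags."""
    parser = argparse.ArgumentParser(prog="gym-anything benchmark")
    parser.add_argument("environment")
    parser.add_argument("--agent", required=True)
    parser.add_argument("--task", default=None)  # None means batch mode
    parser.add_argument("--model", default=None)
    parser.add_argument("--steps", type=int, default=50)
    parser.add_argument("--split", default="test")
    # action="append" collects every occurrence of a repeatable flag.
    parser.add_argument("--agent-arg", action="append", default=[],
                        metavar="KEY=VALUE")
    return parser


def parse_agent_args(pairs):
    """Turn ["k=v", ...] into a dict, splitting on the first "=" only."""
    result = {}
    for pair in pairs:
        key, separator, value = pair.partition("=")
        if not separator or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        result[key] = value
    return result
```

Parsing `moodle --agent ClaudeAgent --agent-arg temperature=0` with this sketch leaves `steps` at 50, `split` at `test`, and `task` unset; `temperature` is only a placeholder key, not a documented agent argument.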

When --task is omitted, the CLI enters batch mode and runs all tasks for the given environment and split. Each task runs in its own process for fault isolation.
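
The same one-process-per-task idea can be reproduced from a driver script, for example when you want to run a hand-picked list of tasks. The sketch below is an approximation under stated assumptions, not the CLI's batch runner: `benchmark_command` and `run_isolated` are hypothetical helpers that build single-task commands from the flags documented above and run each in its own subprocess.

```python
import subprocess
import sys


def benchmark_command(environment, task, agent, model):
    """Build a single-task benchmark command from the documented flags."""
    return ["gym-anything", "benchmark", environment,
            "--task", task, "--agent", agent, "--model", model]


def run_isolated(commands):
    """Run each argv in its own process and collect the exit codes.

    A crash in one task only affects that task's process, so the loop
    continues with the remaining tasks.
    """
    exit_codes = {}
    for label, argv in commands.items():
        try:
            completed = subprocess.run(argv, check=False)
            exit_codes[label] = completed.returncode
        except OSError:
            # The executable could not be started at all.
            exit_codes[label] = None
    return exit_codes
```

A caller would pass a mapping such as `{"enroll_student": benchmark_command("moodle", "enroll_student", "ClaudeAgent", "claude-opus-4")}` and inspect the returned exit codes afterwards.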

gym-anything agents

List available agent implementations.

gym-anything agents

Use this to see which agent names you can pass to gym-anything benchmark --agent.

gym-anything list

List available environments.

Examples:

gym-anything list
gym-anything list --verbose

--verbose also prints the tasks under each environment.

gym-anything doctor

Check host prerequisites and optional verifier imports.

Examples:

gym-anything doctor
gym-anything doctor --runner avf
gym-anything doctor --json

Useful flags:

  • --runner: limit the report to one runner
  • --verification-root: check verifier imports under a specific root
  • --json: machine-readable output
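
Because `--json` produces machine-readable output, the report can be consumed from a script. The sketch below assumes the JSON document is written to standard output and makes no assumption about its fields, which are not described here; `run_json` is a hypothetical helper.

```python
import json
import subprocess


def run_json(argv):
    """Run a command and parse its standard output as JSON.

    Returns (exit_code, parsed_output). Raises json.JSONDecodeError
    if the output is not valid JSON.
    """
    completed = subprocess.run(argv, capture_output=True, text=True,
                               check=False)
    return completed.returncode, json.loads(completed.stdout)
```

For example, `run_json(["gym-anything", "doctor", "--json"])` would return the exit code together with the parsed report, which a CI job could then inspect or archive.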

gym-anything compatibility

Show the runner compatibility matrix.

Examples:

gym-anything compatibility
gym-anything compatibility --runner docker
gym-anything compatibility --json

gym-anything validate

Validate one environment and its task specs.

Example:

gym-anything validate moodle --task enroll_student

This is the lighter of the two spec checks. For more detailed verification, use gym-anything verify spec.

gym-anything verify spec

Verify one environment directory and its task specs in more detail.

Example:

gym-anything verify spec moodle --task enroll_student

Useful flag:

  • --json: machine-readable output

This is the more detailed of the two spec checks. For a lighter check, use gym-anything validate.

gym-anything verify corpus

Verify all environment and task specs under a root.

Examples:

gym-anything verify corpus
gym-anything verify corpus benchmarks/cua_world/environments --max-failures 100

Useful flags:

  • --max-failures
  • --write-status-manifest
  • --write-verified-split
  • --write-missing-hook-manifest
  • --json

The default root is benchmarks/cua_world/environments.

gym-anything verify task

Run a task through reset and finalization, then execute its final check.

Example:

gym-anything verify task moodle --task enroll_student

Useful flags:

  • --seed
  • --use_cache
  • --cache_level
  • --use_savevm
  • --json
