CLI Reference
The shipped `gym-anything` commands and what they do.
The main command is:
gym-anything
gym-anything run
Run one environment, optionally with a task.
Examples:
gym-anything run moodle --task enroll_student -i
gym-anything run benchmarks/cua_world/environments/moodle_env --task enroll_student
Important flags:
--task: task id to load
-i, --interactive: keep the environment alive for interactive use
--steps: number of steps in non-interactive mode
--seed: reset seed
--open-vnc: open a VNC viewer in interactive mode
If you pass a short environment name such as moodle, the CLI resolves it against benchmarks/cua_world/environments/. If you don't pass --task, the CLI picks a random task from that environment.
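When scripting runs, it can help to assemble the argument list programmatically. The following is a minimal Python sketch that uses only the flags listed above. The helper name build_run_cmd is hypothetical and not part of gym-anything, and the sketch assumes that --steps and --seed take integer values.

```python
import shlex


def build_run_cmd(env, task=None, interactive=False, steps=None, seed=None, open_vnc=False):
    """Build an argv list for `gym-anything run` (hypothetical helper, not shipped with the CLI)."""
    cmd = ["gym-anything", "run", env]
    if task is not None:
        cmd += ["--task", task]
    if interactive:
        cmd.append("-i")
    if steps is not None:
        # Assumption: --steps takes an integer value.
        cmd += ["--steps", str(steps)]
    if seed is not None:
        # Assumption: --seed takes an integer value.
        cmd += ["--seed", str(seed)]
    if open_vnc:
        cmd.append("--open-vnc")
    return cmd


# Print the command line for an interactive run of one task.
print(shlex.join(build_run_cmd("moodle", task="enroll_student", interactive=True)))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues.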
gym-anything benchmark
Run an agent on benchmark tasks. This is the main way to evaluate agents.
Single task:
gym-anything benchmark moodle --task enroll_student --agent ClaudeAgent --model claude-opus-4
All tasks in an environment:
gym-anything benchmark moodle --agent ClaudeAgent --model claude-opus-4 --split test
Full corpus:
gym-anything benchmark all --agent ClaudeAgent --model claude-opus-4 --split test
Important flags:
--agent (required): agent class name (e.g. ClaudeAgent)
--task: task ID. Omit to run all tasks in the split (batch mode)
--model: model identifier (e.g. claude-opus-4)
--steps: max steps per task (default: 50)
--split: task split for batch mode (default: test)
--agent-arg KEY=VALUE: extra agent argument (repeatable)
When --task is omitted, the CLI enters batch mode and runs all tasks for the given environment and split. Each task runs in its own process for fault isolation.
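The idea behind running each task in its own process can be illustrated with a small, generic Python sketch. It is not gym-anything's actual implementation. The task names and the run_isolated helper are hypothetical stand-ins, and each "task" is just a short Python snippet executed in a child interpreter.

```python
import subprocess
import sys


def run_isolated(task_snippets):
    """Run each snippet in its own Python process and return {task_id: exit_code}."""
    results = {}
    for task_id, snippet in task_snippets.items():
        # A crash or non-zero exit in the child does not affect this loop.
        proc = subprocess.run([sys.executable, "-c", snippet])
        results[task_id] = proc.returncode
    return results


# Stand-in tasks: one exits cleanly, one exits with a non-zero code.
demo = {
    "ok_task": "pass",
    "crashing_task": "import sys; sys.exit(3)",
}
results = run_isolated(demo)
print(results)
```

Because the failing task lives in a separate process, the parent still records a result for every task instead of aborting the batch.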
gym-anything agents
List available agent implementations.
gym-anything agents
Use this to see which agent names you can pass to gym-anything benchmark --agent.
gym-anything list
List available environments.
Examples:
gym-anything list
gym-anything list --verbose
--verbose also prints the tasks under each environment.
gym-anything doctor
Check host prerequisites and optional verifier imports.
Examples:
gym-anything doctor
gym-anything doctor --runner avf
gym-anything doctor --json
Useful flags:
--runner: limit the report to one runner
--verification-root: check verifier imports under a specific root
--json: machine-readable output
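The --json flag makes the report easy to consume from a script. The sketch below is one way to do that in Python. The helper names doctor_cmd and doctor_report are hypothetical, and the code assumes that the command writes a single JSON document to standard output. The structure of that document is not described here, so the sketch returns the parsed value without inspecting it.

```python
import json
import shutil
import subprocess


def doctor_cmd(runner=None):
    """Build the argv for `gym-anything doctor --json` (hypothetical helper)."""
    cmd = ["gym-anything", "doctor", "--json"]
    if runner is not None:
        cmd += ["--runner", runner]
    return cmd


def doctor_report(runner=None):
    """Return the parsed report, or None if the CLI is not on PATH."""
    if shutil.which("gym-anything") is None:
        return None
    proc = subprocess.run(doctor_cmd(runner), capture_output=True, text=True)
    # Assumption: stdout holds exactly one JSON document.
    return json.loads(proc.stdout)
```

For example, doctor_report("avf") would limit the report to the avf runner used in the example above.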
gym-anything compatibility
Show the runner compatibility matrix.
Examples:
gym-anything compatibility
gym-anything compatibility --runner docker
gym-anything compatibility --json
gym-anything validate
Validate one environment and its task specs.
Example:
gym-anything validate moodle --task enroll_student
This is the lighter spec check.
gym-anything verify spec
Verify one environment directory and its task specs in more detail.
Example:
gym-anything verify spec moodle --task enroll_student
Useful flag:
--json
This is the more detailed spec verification path.
gym-anything verify corpus
Verify all environment and task specs under a root.
Examples:
gym-anything verify corpus
gym-anything verify corpus benchmarks/cua_world/environments --max-failures 100
Useful flags:
--max-failures
--write-status-manifest
--write-verified-split
--write-missing-hook-manifest
--json
The default root is benchmarks/cua_world/environments.
gym-anything verify task
Run a task through reset and finalization, then execute its final check.
Example:
gym-anything verify task moodle --task enroll_student
Useful flags:
--seed
--use_cache
--cache_level
--use_savevm
--json