CLI Reference

The main command is:

gym-anything

`gym-anything run`

Run one environment, optionally with a task.

Examples:

gym-anything run moodle --task enroll_student -i
gym-anything run benchmarks/cua_world/environments/moodle_env --task enroll_student

Important flags:

--task: task ID to load
-i, --interactive: keep the environment alive for interactive use
--steps: number of steps in non-interactive mode
--seed: reset seed
--open-vnc: open a VNC viewer in interactive mode

If you pass a short environment name such as moodle, the CLI resolves it against benchmarks/cua_world/environments/. If you don't pass --task, the CLI picks a random task from that environment.
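
The resolution rule above could be sketched as follows. This is a guess at the behavior, not the CLI's actual code: the root directory comes from this page, while the _env suffix fallback is an assumption inferred from the two run examples (moodle and .../moodle_env), which appear to name the same environment. The function names candidate_env_dirs and resolve_env are hypothetical.

```python
from pathlib import Path

# Root directory as documented on this page.
ENV_ROOT = Path("benchmarks/cua_world/environments")


def candidate_env_dirs(name, root=ENV_ROOT):
    """Return the directories a name might refer to, most literal first."""
    given = Path(name)
    if len(given.parts) > 1:
        # The name already looks like a path, so use it unchanged.
        return [given]
    # Assumption: a short name may map to "<name>" or "<name>_env" under root.
    return [root / name, root / f"{name}_env"]


def resolve_env(name, root=ENV_ROOT):
    """Return the first candidate that exists as a directory, else None."""
    for candidate in candidate_env_dirs(name, root):
        if candidate.is_dir():
            return candidate
    return None
```

If the real CLI uses a different lookup (for example, a registry file), only candidate_env_dirs would need to change.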

`gym-anything benchmark`

Run an agent on benchmark tasks. This is the main way to evaluate agents.

Single task:

gym-anything benchmark moodle --task enroll_student --agent ClaudeAgent --model claude-opus-4

All tasks in an environment:

gym-anything benchmark moodle --agent ClaudeAgent --model claude-opus-4 --split test

Full corpus:

gym-anything benchmark all --agent ClaudeAgent --model claude-opus-4 --split test

Important flags:

--agent (required): agent class name (e.g. ClaudeAgent)
--task: task ID. Omit to run all tasks in the split (batch mode)
--model: model identifier (e.g. claude-opus-4)
--steps: max steps per task (default: 50)
--split: task split for batch mode (default: test)
--agent-arg KEY=VALUE: extra agent argument (repeatable)
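
When driving benchmark from a script, the flags above can be assembled into an argument list. The sketch below uses only flags listed on this page; the helper name benchmark_argv is hypothetical, and the agent-argument keys used in the test values (temperature, max_tokens) are placeholders, not documented agent options.

```python
def benchmark_argv(env, agent, task=None, model=None, steps=None,
                   split=None, agent_args=None):
    """Build the argument list for a `gym-anything benchmark` call."""
    argv = ["gym-anything", "benchmark", env, "--agent", agent]
    if task is not None:
        argv += ["--task", task]
    if model is not None:
        argv += ["--model", model]
    if steps is not None:
        argv += ["--steps", str(steps)]
    if split is not None:
        argv += ["--split", split]
    # --agent-arg is repeatable: emit one KEY=VALUE pair per entry.
    for key, value in (agent_args or {}).items():
        argv += ["--agent-arg", f"{key}={value}"]
    return argv
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems in values.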

When --task is omitted, the CLI enters batch mode and runs all tasks for the given environment and split. Each task runs in its own process for fault isolation.
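
The idea behind per-process fault isolation can be illustrated with a small sketch. It is not the CLI's implementation: the helper run_isolated is hypothetical, and the demo commands are stand-ins that invoke the Python interpreter so the example runs without gym-anything installed.

```python
import subprocess
import sys


def run_isolated(commands):
    """Run each command in its own process and return {name: exit_code}.

    A failing or crashing command only affects its own entry; the loop
    continues with the remaining commands.
    """
    results = {}
    for name, argv in commands.items():
        try:
            results[name] = subprocess.run(argv).returncode
        except OSError:
            # The process could not be started (e.g. executable not found).
            results[name] = None
    return results


# Stand-in commands: the second one exits with a non-zero status.
demo = {
    "ok_task": [sys.executable, "-c", "raise SystemExit(0)"],
    "failing_task": [sys.executable, "-c", "raise SystemExit(3)"],
    "later_task": [sys.executable, "-c", "raise SystemExit(0)"],
}
```

Because each task has its own process, a failure in failing_task does not prevent later_task from running.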

`gym-anything agents`

List available agent implementations.

gym-anything agents

Use this to see which agent names you can pass to gym-anything benchmark --agent.

`gym-anything list`

List available environments.

Examples:

gym-anything list
gym-anything list --verbose

--verbose also prints the tasks under each environment.

`gym-anything doctor`

Check host prerequisites and optional verifier imports.

Examples:

gym-anything doctor
gym-anything doctor --runner avf
gym-anything doctor --json

Useful flags:

--runner: limit the report to one runner
--verification-root: check verifier imports under a specific root
--json: machine-readable output
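
A script could consume the --json report roughly as sketched below. The report's schema is not described on this page, so the code only decodes the JSON and does not assume any keys. It also assumes that standard output contains nothing but the JSON document. The helper names parse_report and doctor_report are hypothetical.

```python
import json
import subprocess


def parse_report(stdout_text):
    """Decode a JSON report, raising ValueError with context if it is not JSON."""
    try:
        return json.loads(stdout_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"doctor output was not valid JSON: {exc}") from exc


def doctor_report(runner=None):
    """Run `gym-anything doctor --json` and return (exit_code, decoded_report)."""
    argv = ["gym-anything", "doctor", "--json"]
    if runner is not None:
        argv += ["--runner", runner]
    completed = subprocess.run(argv, capture_output=True, text=True)
    return completed.returncode, parse_report(completed.stdout)
```

Inspect the decoded report once (for example with json.dumps(report, indent=2)) before relying on specific fields.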

`gym-anything compatibility`

Show the runner compatibility matrix.

Examples:

gym-anything compatibility
gym-anything compatibility --runner docker
gym-anything compatibility --json

`gym-anything validate`

Validate one environment and its task specs.

Example:

gym-anything validate moodle --task enroll_student

This is the lighter of the two spec checks. For a more detailed check, use gym-anything verify spec.

`gym-anything verify spec`

Verify one environment directory and its task specs in more detail.

Example:

gym-anything verify spec moodle --task enroll_student

Useful flag:

--json: machine-readable output

This is the more detailed counterpart to gym-anything validate.

`gym-anything verify corpus`

Verify all environment and task specs under a root.

Examples:

gym-anything verify corpus
gym-anything verify corpus benchmarks/cua_world/environments --max-failures 100

Useful flags:

--max-failures
--write-status-manifest
--write-verified-split
--write-missing-hook-manifest
--json

The default root is benchmarks/cua_world/environments.

`gym-anything verify task`

Run a task through reset and finalization, then execute its final check.

Example:

gym-anything verify task moodle --task enroll_student

Useful flags:

--seed
--use_cache
--cache_level
--use_savevm
--json
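
The per-environment checks on this page (validate, verify spec, verify task) can be chained from lightest to heaviest, stopping at the first failure. The sketch below assumes that each command signals failure with a non-zero exit status, which this page does not state. The helper names are hypothetical, and the runner parameter exists only so the logic can be exercised without the CLI installed.

```python
import subprocess


def staged_check_commands(env, task):
    """Return the three check commands, lightest first."""
    base = ["gym-anything"]
    return [
        base + ["validate", env, "--task", task],
        base + ["verify", "spec", env, "--task", task],
        base + ["verify", "task", env, "--task", task],
    ]


def run_until_failure(commands, runner=subprocess.run):
    """Run commands in order; return the first failing argv, or None."""
    for argv in commands:
        # Assumption: a non-zero exit status means the check failed.
        if runner(argv).returncode != 0:
            return argv
    return None
```

For example, run_until_failure(staged_check_commands("moodle", "enroll_student")) returns None when all three checks pass under that assumption.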
