CLI Reference

The main command is:

gym-anything

A second binary, gym-anything-extras, dispatches to optional research and infrastructure tools that live alongside the library. See Extras for what's available.

`gym-anything run`

Run one environment, optionally with a task.

Examples:

gym-anything run moodle --task enroll_student -i
gym-anything run benchmarks/cua_world/environments/moodle_env --task enroll_student

Important flags:

--task: task ID to load
-i, --interactive: keep the environment alive for interactive use
--steps: number of steps in non-interactive mode
--seed: reset seed
--open-vnc: open a VNC viewer in interactive mode

If you pass a short environment name such as moodle, the CLI resolves it against benchmarks/cua_world/environments/. If you don't pass --task, the CLI picks a random task from that environment.
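Because omitting --task selects a random task, scripted runs that need to be repeatable may want to pass --task explicitly. The sketch below assembles a gym-anything run command line from Python using only the flags listed above. The helper name build_run_command is hypothetical, and passing --steps and --seed as separate integer arguments is an assumption about how the CLI parses them.

```python
import shlex
from typing import List, Optional


def build_run_command(
    env: str,
    task: Optional[str] = None,
    steps: Optional[int] = None,
    seed: Optional[int] = None,
    interactive: bool = False,
    open_vnc: bool = False,
) -> List[str]:
    """Build an argv list for `gym-anything run` (hypothetical helper)."""
    argv = ["gym-anything", "run", env]
    if task is not None:
        # Pin the task so the CLI does not pick a random one.
        argv += ["--task", task]
    if interactive:
        argv.append("-i")
    if open_vnc:
        argv.append("--open-vnc")
    if steps is not None:
        # Assumption: the value is passed as a separate integer argument.
        argv += ["--steps", str(steps)]
    if seed is not None:
        argv += ["--seed", str(seed)]
    return argv


# Print the command instead of executing it, so the sketch has no side effects.
print(shlex.join(build_run_command("moodle", task="enroll_student", interactive=True)))
# prints: gym-anything run moodle --task enroll_student -i
```

The resulting list can be handed to subprocess.run once the command looks right.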

`gym-anything benchmark`

Run an agent on benchmark tasks. This is the main way to evaluate agents.

Single task:

gym-anything benchmark moodle --task enroll_student --agent ClaudeAgent --model claude-opus-4

All tasks in an environment:

gym-anything benchmark moodle --agent ClaudeAgent --model claude-opus-4 --split test

Full corpus:

gym-anything benchmark all --agent ClaudeAgent --model claude-opus-4 --split test

Important flags:

--agent (required): agent class name (e.g. ClaudeAgent)
--task: task ID. Omit to run all tasks in the split (batch mode)
--model: model identifier (e.g. claude-opus-4)
--steps: max steps per task (default: 50)
--split: task split for batch mode (default: test)
--parallel, --jobs: number of task processes to run concurrently in batch mode
--max-tasks: limit the number of tasks in batch mode
--agent-arg KEY=VALUE: extra agent argument (repeatable)
--remote-url: route environment execution through a remote master or worker
--remote-timeout: HTTP timeout for remote calls
--remote-worker-reset-policy: worker-local reset policy, usually core
--verifier-mode: override task.json verifier mode for the run, for example vlm_checklist
--vlm-checklist-model: model used by the VLM checklist verifier

When --task is omitted, the CLI enters batch mode and runs all tasks for the given environment and split. Each task runs in its own process for fault isolation.
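The per-task process isolation described above follows a common pattern: each unit of work runs in its own child process, so a crash or hang in one does not take down the rest. The sketch below illustrates that general pattern only; it is not gym-anything's implementation. The helper name run_isolated and the -1 and 127 sentinel codes are hypothetical, and the demo commands are Python one-liners standing in for real task runs.

```python
import subprocess
import sys
from typing import Dict, Sequence


def run_isolated(commands: Dict[str, Sequence[str]], timeout: float = 60.0) -> Dict[str, int]:
    """Run each command in its own child process and record its exit code."""
    results: Dict[str, int] = {}
    for name, argv in commands.items():
        try:
            completed = subprocess.run(list(argv), timeout=timeout)
            results[name] = completed.returncode
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child on timeout; -1 is a hypothetical sentinel.
            results[name] = -1
        except OSError:
            # The executable could not be started; 127 is a hypothetical sentinel.
            results[name] = 127
    return results


if __name__ == "__main__":
    # Stand-in commands: one exits cleanly, one fails with exit code 3.
    demo = {
        "task_ok": [sys.executable, "-c", "raise SystemExit(0)"],
        "task_crash": [sys.executable, "-c", "raise SystemExit(3)"],
    }
    print(run_isolated(demo))
```

A failing entry only affects its own result; the loop still reaches every remaining command.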

Remote benchmark example:

gym-anything benchmark moodle \
  --task enroll_student \
  --agent ClaudeAgent \
  --model claude-opus-4 \
  --remote-url http://master-host:5800

Verifier overrides can also be set with environment variables. CLI flags take precedence over environment variables, which take precedence over task.json.

GYM_ANYTHING_VERIFIER_MODE=vlm_checklist \
GYM_ANYTHING_VLM_CHECKLIST_MODEL=gemini-3-flash-preview \
gym-anything benchmark moodle --agent ClaudeAgent --model claude-opus-4
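The precedence order (CLI flag, then environment variable, then task.json) can be sketched as a small lookup chain. This is an illustration under stated assumptions, not the CLI's actual code: the function name resolve_verifier_mode, the "verifier_mode" key inside task.json, and the "task_default" placeholder value are all hypothetical, and treating an empty environment variable as unset is an assumption as well.

```python
from typing import Mapping, Optional

ENV_VERIFIER_MODE = "GYM_ANYTHING_VERIFIER_MODE"


def resolve_verifier_mode(
    cli_value: Optional[str],
    environ: Mapping[str, str],
    task_spec: Mapping[str, str],
) -> Optional[str]:
    """Return the first value found: CLI flag, then environment, then task.json."""
    if cli_value is not None:
        return cli_value
    env_value = environ.get(ENV_VERIFIER_MODE)
    if env_value:  # assumption: an empty string counts as unset
        return env_value
    # Hypothetical key name; the real task.json schema may differ.
    return task_spec.get("verifier_mode")


task_spec = {"verifier_mode": "task_default"}  # placeholder value
env = {ENV_VERIFIER_MODE: "vlm_checklist"}

print(resolve_verifier_mode(None, {}, task_spec))            # task_default
print(resolve_verifier_mode(None, env, task_spec))           # vlm_checklist
print(resolve_verifier_mode("from_flag", env, task_spec))    # from_flag
```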

Common checklist verifier environment variables:

GYM_ANYTHING_VERIFIER_MODE
GYM_ANYTHING_VLM_CHECKLIST_MODEL
GYM_ANYTHING_VLM_CHECKLIST_BACKEND (for example gemini, local, openai, or anthropic)
GYM_ANYTHING_VLM_CHECKLIST_BASE_URL
GYM_ANYTHING_VLM_CHECKLIST_TEMPERATURE
GYM_ANYTHING_VLM_CHECKLIST_MAX_FRAMES
GYM_ANYTHING_VLM_CHECKLIST_COMPLETION_THRESHOLD
GYM_ANYTHING_VLM_CHECKLIST_INTEGRITY_THRESHOLD

`gym-anything agents`

List available agent implementations.

gym-anything agents

Use this to see which agent names you can pass to gym-anything benchmark --agent.

`gym-anything list`

List available environments.

Examples:

gym-anything list
gym-anything list --verbose

--verbose also prints the tasks under each environment.

`gym-anything doctor`

Check host prerequisites and optional verifier imports.

Examples:

gym-anything doctor
gym-anything doctor --runner avf
gym-anything doctor --json

Useful flags:

--runner: limit the report to one runner
--verification-root: check verifier imports under a specific root
--json: machine-readable output
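For scripting, the --json output can be captured and parsed from Python. The sketch below assumes the report is written to standard output as a single JSON document; the report's schema is not documented here, so the code does not inspect any fields. The helper name doctor_report and its injectable run parameter are hypothetical conveniences for testing.

```python
import json
import subprocess
from typing import Any, Callable, Optional


def doctor_report(runner: Optional[str] = None, run: Callable[..., Any] = subprocess.run) -> Any:
    """Run `gym-anything doctor --json` and return the parsed report."""
    argv = ["gym-anything", "doctor", "--json"]
    if runner is not None:
        argv += ["--runner", runner]
    # No check=True: the command may exit non-zero when a prerequisite is
    # missing while still printing a report (an assumption about its behavior).
    completed = run(argv, capture_output=True, text=True)
    # Assumption: stdout holds exactly one JSON document.
    return json.loads(completed.stdout)
```

Calling doctor_report("avf") corresponds to running gym-anything doctor --runner avf --json and requires the CLI to be on PATH.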

`gym-anything compatibility`

Show the runner compatibility matrix.

Examples:

gym-anything compatibility
gym-anything compatibility --runner docker
gym-anything compatibility --json

`gym-anything validate`

Validate one environment and its task specs.

Example:

gym-anything validate moodle --task enroll_student

This is the lighter of the two spec checks; gym-anything verify spec performs the more detailed one.

`gym-anything verify spec`

Verify one environment directory and its task specs in more detail.

Example:

gym-anything verify spec moodle --task enroll_student

Useful flag:

--json

This is the more detailed spec verification path; gym-anything validate is the lighter check.

`gym-anything verify corpus`

Verify all environment and task specs under a root.

Examples:

gym-anything verify corpus
gym-anything verify corpus benchmarks/cua_world/environments --max-failures 100

Useful flags:

--max-failures
--write-status-manifest
--write-verified-split
--write-missing-hook-manifest
--json

The default root is benchmarks/cua_world/environments.

`gym-anything verify task`

Run a task through reset and finalization, then execute its final check.

Example:

gym-anything verify task moodle --task enroll_student

Useful flags:

--seed
--use_cache
--cache_level
--use_savevm
--json
