CLI Reference
The shipped `gym-anything` commands and what they do.
The main command is:
gym-anything
A second binary, gym-anything-extras, dispatches to optional
research and infrastructure tools that live alongside the library.
See Extras for what's available.
gym-anything run
Run one environment, optionally with a task.
Examples:
gym-anything run moodle --task enroll_student -i
gym-anything run benchmarks/cua_world/environments/moodle_env --task enroll_student
Important flags:
--task: task id to load
-i, --interactive: keep the environment alive for interactive use
--steps: number of steps in non-interactive mode
--seed: reset seed
--open-vnc: open a VNC viewer in interactive mode
If you pass a short environment name such as moodle, the CLI resolves it against benchmarks/cua_world/environments/. If you don't pass --task, the CLI picks a random task from that environment.
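The short-name lookup described above can be approximated in a wrapper script. This is only a sketch: `resolve_env_dir` and `ENV_ROOT` are hypothetical names, and the `_env` suffix is an assumption inferred from the `moodle` / `moodle_env` examples, not confirmed CLI behavior.

```shell
# Hypothetical helper: map a short environment name to a directory.
# ENV_ROOT defaults to the documented environments root.
ENV_ROOT=${ENV_ROOT:-benchmarks/cua_world/environments}

resolve_env_dir() {
  name=$1
  if [ -d "$name" ]; then
    # The argument is already a path to an existing directory.
    printf '%s\n' "$name"
  elif [ -d "$ENV_ROOT/$name" ]; then
    printf '%s\n' "$ENV_ROOT/$name"
  elif [ -d "$ENV_ROOT/${name}_env" ]; then
    # Assumption: short names may map to "<name>_env".
    printf '%s\n' "$ENV_ROOT/${name}_env"
  else
    echo "unknown environment: $name" >&2
    return 1
  fi
}
```

A script could call `resolve_env_dir moodle` before invoking the CLI to fail early on a mistyped name.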
gym-anything benchmark
Run an agent on benchmark tasks. This is the main way to evaluate agents.
Single task:
gym-anything benchmark moodle --task enroll_student --agent ClaudeAgent --model claude-opus-4
All tasks in an environment:
gym-anything benchmark moodle --agent ClaudeAgent --model claude-opus-4 --split test
Full corpus:
gym-anything benchmark all --agent ClaudeAgent --model claude-opus-4 --split test
Important flags:
--agent (required): agent class name (e.g. ClaudeAgent)
--task: task ID. Omit to run all tasks in the split (batch mode)
--model: model identifier (e.g. claude-opus-4)
--steps: max steps per task (default: 50)
--split: task split for batch mode (default: test)
--parallel, --jobs: batch task processes to run at once
--max-tasks: limit the number of tasks in batch mode
--agent-arg KEY=VALUE: extra agent argument (repeatable)
--remote-url: route environment execution through a remote master or worker
--remote-timeout: HTTP timeout for remote calls
--remote-worker-reset-policy: worker-local reset policy, usually core
--verifier-mode: override task.json verifier mode for the run, for example vlm_checklist
--vlm-checklist-model: model used by the VLM checklist verifier
When --task is omitted, the CLI enters batch mode and runs all tasks for the given environment and split. Each task runs in its own process for fault isolation.
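Before launching a full batch, a small capped run can be a useful smoke test. The wrapper below is a sketch that uses only the flags listed above. `run_batch_smoke` and `GYM_BIN` are hypothetical names, and the values passed to `--parallel` and `--max-tasks` assume both flags take integers.

```shell
# Hypothetical wrapper: run a small, capped batch for one environment.
# GYM_BIN lets you substitute `echo` to preview the command without running it.
run_batch_smoke() {
  env_name=$1
  agent=$2
  model=$3
  # No --task is passed, so the CLI enters batch mode.
  # --max-tasks caps the batch; --parallel sets concurrent task processes.
  "${GYM_BIN:-gym-anything}" benchmark "$env_name" \
    --agent "$agent" \
    --model "$model" \
    --split test \
    --parallel 2 \
    --max-tasks 5
}
```

Running `GYM_BIN=echo run_batch_smoke moodle ClaudeAgent claude-opus-4` prints the arguments instead of starting the batch.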
Remote benchmark example:
gym-anything benchmark moodle \
--task enroll_student \
--agent ClaudeAgent \
--model claude-opus-4 \
--remote-url http://master-host:5800
Verifier overrides can also be set with environment variables. CLI flags take precedence over environment variables, which take precedence over task.json.
GYM_ANYTHING_VERIFIER_MODE=vlm_checklist \
GYM_ANYTHING_VLM_CHECKLIST_MODEL=gemini-3-flash-preview \
gym-anything benchmark moodle --agent ClaudeAgent --model claude-opus-4
Common checklist verifier environment variables:
GYM_ANYTHING_VERIFIER_MODE
GYM_ANYTHING_VLM_CHECKLIST_MODEL
GYM_ANYTHING_VLM_CHECKLIST_BACKEND, such as gemini, local, openai, or anthropic
GYM_ANYTHING_VLM_CHECKLIST_BASE_URL
GYM_ANYTHING_VLM_CHECKLIST_TEMPERATURE
GYM_ANYTHING_VLM_CHECKLIST_MAX_FRAMES
GYM_ANYTHING_VLM_CHECKLIST_COMPLETION_THRESHOLD
GYM_ANYTHING_VLM_CHECKLIST_INTEGRITY_THRESHOLD
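The precedence order for the verifier mode (CLI flag, then environment variable, then task.json) can be illustrated with a small function. This is a sketch of the stated rule, not the CLI's actual code. `effective_verifier_mode` is a hypothetical name, and `from_task_json` in the usage note is a placeholder rather than a real mode.

```shell
# Hypothetical helper: pick the effective verifier mode.
# $1 = value given to --verifier-mode (empty if the flag was not passed)
# $2 = verifier mode read from task.json
effective_verifier_mode() {
  cli_value=$1
  task_json_value=$2
  if [ -n "$cli_value" ]; then
    # A CLI flag wins over everything else.
    printf '%s\n' "$cli_value"
  elif [ -n "${GYM_ANYTHING_VERIFIER_MODE:-}" ]; then
    # The environment variable wins over task.json.
    printf '%s\n' "$GYM_ANYTHING_VERIFIER_MODE"
  else
    printf '%s\n' "$task_json_value"
  fi
}
```

With `GYM_ANYTHING_VERIFIER_MODE=vlm_checklist` exported, `effective_verifier_mode "" from_task_json` selects the environment value, while passing a non-empty first argument overrides it.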
gym-anything agents
List available agent implementations.
gym-anything agents
Use this to see which agent names you can pass to gym-anything benchmark --agent.
gym-anything list
List available environments.
Examples:
gym-anything list
gym-anything list --verbose
--verbose also prints the tasks under each environment.
gym-anything doctor
Check host prerequisites and optional verifier imports.
Examples:
gym-anything doctor
gym-anything doctor --runner avf
gym-anything doctor --json
Useful flags:
--runner: limit the report to one runner
--verification-root: check verifier imports under a specific root
--json: machine-readable output
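The machine-readable output is convenient to archive alongside benchmark results. The sketch below saves it to a file. `save_doctor_report` and `GYM_BIN` are hypothetical names, the JSON schema is not assumed, and the reliance on a nonzero exit status for failures is an assumption.

```shell
# Hypothetical helper: save the doctor report to a file.
# GYM_BIN can be overridden (for example with `echo`) for a dry run.
save_doctor_report() {
  out_file=$1
  # Assumption: the command exits nonzero if it cannot produce a report.
  "${GYM_BIN:-gym-anything}" doctor --json > "$out_file" || return 1
  # Treat an empty file as a failure; nothing was reported.
  [ -s "$out_file" ]
}
```

For example, `save_doctor_report doctor-report.json` could run at the start of a CI job so that host problems are recorded with the run.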
gym-anything compatibility
Show the runner compatibility matrix.
Examples:
gym-anything compatibility
gym-anything compatibility --runner docker
gym-anything compatibility --json
gym-anything validate
Validate one environment and its task specs.
Example:
gym-anything validate moodle --task enroll_student
This is the lighter of the two spec checks; gym-anything verify spec is the more detailed one.
gym-anything verify spec
Verify one environment directory and its task specs in more detail.
Example:
gym-anything verify spec moodle --task enroll_student
Useful flag:
--json
This is the more detailed spec verification path.
gym-anything verify corpus
Verify all environment and task specs under a root.
Examples:
gym-anything verify corpus
gym-anything verify corpus benchmarks/cua_world/environments --max-failures 100
Useful flags:
--max-failures
--write-status-manifest
--write-verified-split
--write-missing-hook-manifest
--json
The default root is benchmarks/cua_world/environments.
gym-anything verify task
Run a task through reset and finalization, then execute its final check.
Example:
gym-anything verify task moodle --task enroll_student
Useful flags:
--seed
--use_cache
--cache_level
--use_savevm
--json
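The validate, verify spec, and verify task commands can be chained from lightest to heaviest so that a cheap check fails before an environment is started. This is a sketch: `check_task` and `GYM_BIN` are hypothetical names, and it assumes each command exits nonzero on failure.

```shell
# Hypothetical helper: run the three checks in order, stopping at the first failure.
# GYM_BIN can be overridden (for example with `echo`) for a dry run.
check_task() {
  env_name=$1
  task_id=$2
  gym=${GYM_BIN:-gym-anything}
  # Light spec check, then detailed spec check, then the full task run.
  "$gym" validate "$env_name" --task "$task_id" &&
    "$gym" verify spec "$env_name" --task "$task_id" &&
    "$gym" verify task "$env_name" --task "$task_id"
}
```

For example, `check_task moodle enroll_student` runs all three checks for one task and returns the status of the first one that fails.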