
CLI Reference

This page lists the shipped `gym-anything` commands and what each one does.

The main command is:

gym-anything

A second binary, gym-anything-extras, dispatches to optional research and infrastructure tools that live alongside the library. See Extras for what's available.

gym-anything run

Run one environment, optionally with a task.

Examples:

gym-anything run moodle --task enroll_student -i
gym-anything run benchmarks/cua_world/environments/moodle_env --task enroll_student

Important flags:

  • --task: task id to load
  • -i, --interactive: keep the environment alive for interactive use
  • --steps: number of steps in non-interactive mode
  • --seed: reset seed
  • --open-vnc: open a VNC viewer in interactive mode

If you pass a short environment name such as moodle, the CLI resolves it against benchmarks/cua_world/environments/. If you don't pass --task, the CLI picks a random task from that environment.
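
The short-name lookup described above can be approximated in a shell helper, for example to pass an explicit path from a script. This is only a sketch: resolve_env_dir is a hypothetical function, not part of the CLI, and the `_env` suffix fallback is an assumption based on the moodle_env example path. The actual resolution logic inside gym-anything may differ.

```shell
# Hypothetical helper; the real resolution logic inside gym-anything may differ.
ENV_ROOT="${ENV_ROOT:-benchmarks/cua_world/environments}"

resolve_env_dir() {
  name="$1"
  # An argument that is already a directory path is used as given.
  if [ -d "$name" ]; then
    printf '%s\n' "$name"
    return 0
  fi
  # Otherwise try the short name under the root, with and without the
  # "_env" suffix (an assumption based on the moodle_env example path).
  for candidate in "$ENV_ROOT/$name" "$ENV_ROOT/${name}_env"; do
    if [ -d "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "no environment directory found for: $name" >&2
  return 1
}
```

A script could then call gym-anything run "$(resolve_env_dir moodle)" --task enroll_student to make the resolved path visible in logs.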

gym-anything benchmark

Run an agent on benchmark tasks. This is the main way to evaluate agents.

Single task:

gym-anything benchmark moodle --task enroll_student --agent ClaudeAgent --model claude-opus-4

All tasks in an environment:

gym-anything benchmark moodle --agent ClaudeAgent --model claude-opus-4 --split test

Full corpus:

gym-anything benchmark all --agent ClaudeAgent --model claude-opus-4 --split test

Important flags:

  • --agent (required): agent class name (e.g. ClaudeAgent)
  • --task: task ID. Omit to run all tasks in the split (batch mode)
  • --model: model identifier (e.g. claude-opus-4)
  • --steps: max steps per task (default: 50)
  • --split: task split for batch mode (default: test)
  • --parallel, --jobs: number of batch task processes to run at once
  • --max-tasks: limit the number of tasks in batch mode
  • --agent-arg KEY=VALUE: extra agent argument (repeatable)
  • --remote-url: route environment execution through a remote master or worker
  • --remote-timeout: HTTP timeout for remote calls
  • --remote-worker-reset-policy: worker-local reset policy, usually core
  • --verifier-mode: override task.json verifier mode for the run, for example vlm_checklist
  • --vlm-checklist-model: model used by the VLM checklist verifier

When --task is omitted, the CLI enters batch mode and runs all tasks for the given environment and split. Each task runs in its own process for fault isolation.
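
Because --agent-arg is repeatable, a batch command with several agent arguments can get long. The sketch below assembles one for a dry run and prints it instead of executing it. build_benchmark_cmd is a hypothetical helper, and the keys temperature and max_retries are placeholders, not arguments that any shipped agent is known to accept.

```shell
# Hypothetical helper: prints a batch-mode command (no --task) for review.
# The string is for display only; it does not handle values containing spaces.
build_benchmark_cmd() {
  env_name="$1"
  agent="$2"
  model="$3"
  shift 3
  cmd="gym-anything benchmark $env_name --agent $agent --model $model --split test"
  # Each remaining KEY=VALUE pair becomes its own --agent-arg flag.
  for kv in "$@"; do
    cmd="$cmd --agent-arg $kv"
  done
  printf '%s\n' "$cmd"
}

# Placeholder keys, for illustration only.
build_benchmark_cmd moodle ClaudeAgent claude-opus-4 temperature=0 max_retries=2
# prints: gym-anything benchmark moodle --agent ClaudeAgent --model claude-opus-4 --split test --agent-arg temperature=0 --agent-arg max_retries=2
```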

Remote benchmark example:

gym-anything benchmark moodle \
  --task enroll_student \
  --agent ClaudeAgent \
  --model claude-opus-4 \
  --remote-url http://master-host:5800

Verifier overrides can also be set with environment variables. CLI flags take precedence over environment variables, which take precedence over task.json.

GYM_ANYTHING_VERIFIER_MODE=vlm_checklist \
GYM_ANYTHING_VLM_CHECKLIST_MODEL=gemini-3-flash-preview \
gym-anything benchmark moodle --agent ClaudeAgent --model claude-opus-4
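
The precedence order (CLI flag, then environment variable, then task.json) can be pictured as the following sketch. resolve_verifier_mode is a hypothetical function written only to illustrate the documented order. It is not the CLI's implementation, and task_default stands in for whatever mode a task.json declares.

```shell
# Hypothetical illustration of the documented precedence order.
resolve_verifier_mode() {
  cli_value="$1"        # value of --verifier-mode, or empty if the flag was not given
  task_json_value="$2"  # verifier mode recorded in task.json
  if [ -n "$cli_value" ]; then
    # 1. A CLI flag wins over everything else.
    printf '%s\n' "$cli_value"
  elif [ -n "${GYM_ANYTHING_VERIFIER_MODE:-}" ]; then
    # 2. The environment variable is used when no flag is given.
    printf '%s\n' "$GYM_ANYTHING_VERIFIER_MODE"
  else
    # 3. Otherwise the value from task.json applies.
    printf '%s\n' "$task_json_value"
  fi
}
```

With GYM_ANYTHING_VERIFIER_MODE=vlm_checklist set and no flag passed, resolve_verifier_mode "" task_default prints vlm_checklist; with the variable unset it prints task_default.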

Common checklist verifier environment variables:

  • GYM_ANYTHING_VERIFIER_MODE
  • GYM_ANYTHING_VLM_CHECKLIST_MODEL
  • GYM_ANYTHING_VLM_CHECKLIST_BACKEND: backend name, such as gemini, local, openai, or anthropic
  • GYM_ANYTHING_VLM_CHECKLIST_BASE_URL
  • GYM_ANYTHING_VLM_CHECKLIST_TEMPERATURE
  • GYM_ANYTHING_VLM_CHECKLIST_MAX_FRAMES
  • GYM_ANYTHING_VLM_CHECKLIST_COMPLETION_THRESHOLD
  • GYM_ANYTHING_VLM_CHECKLIST_INTEGRITY_THRESHOLD
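
Since these variables silently change verifier behavior, it can help to print which of them are exported before starting a run. The helper below is a hypothetical convenience, not a gym-anything command. It only reads the environment and uses printenv, which is assumed to be available on the host.

```shell
# Hypothetical helper: report which checklist verifier variables are exported.
report_verifier_env() {
  for var in \
    GYM_ANYTHING_VERIFIER_MODE \
    GYM_ANYTHING_VLM_CHECKLIST_MODEL \
    GYM_ANYTHING_VLM_CHECKLIST_BACKEND \
    GYM_ANYTHING_VLM_CHECKLIST_BASE_URL \
    GYM_ANYTHING_VLM_CHECKLIST_TEMPERATURE \
    GYM_ANYTHING_VLM_CHECKLIST_MAX_FRAMES \
    GYM_ANYTHING_VLM_CHECKLIST_COMPLETION_THRESHOLD \
    GYM_ANYTHING_VLM_CHECKLIST_INTEGRITY_THRESHOLD
  do
    # printenv exits non-zero when the variable is not in the environment.
    if value=$(printenv "$var"); then
      printf '%s=%s\n' "$var" "$value"
    else
      printf '%s is unset\n' "$var"
    fi
  done
}

report_verifier_env
```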

gym-anything agents

List available agent implementations.

gym-anything agents

Use this to see which agent names you can pass to gym-anything benchmark --agent.

gym-anything list

List available environments.

Examples:

gym-anything list
gym-anything list --verbose

--verbose also prints the tasks under each environment.

gym-anything doctor

Check host prerequisites and optional verifier imports.

Examples:

gym-anything doctor
gym-anything doctor --runner avf
gym-anything doctor --json

Useful flags:

  • --runner: limit the report to one runner
  • --verification-root: check verifier imports under a specific root
  • --json: machine-readable output

gym-anything compatibility

Show the runner compatibility matrix.

Examples:

gym-anything compatibility
gym-anything compatibility --runner docker
gym-anything compatibility --json

gym-anything validate

Validate one environment and its task specs.

Example:

gym-anything validate moodle --task enroll_student

This is the lighter of the two spec checks. For a more detailed check, use gym-anything verify spec.

gym-anything verify spec

Verify one environment directory and its task specs in more detail.

Example:

gym-anything verify spec moodle --task enroll_student

Useful flag:

  • --json: machine-readable output

This is the more detailed counterpart to gym-anything validate.

gym-anything verify corpus

Verify all environment and task specs under a root.

Examples:

gym-anything verify corpus
gym-anything verify corpus benchmarks/cua_world/environments --max-failures 100

Useful flags:

  • --max-failures
  • --write-status-manifest
  • --write-verified-split
  • --write-missing-hook-manifest
  • --json: machine-readable output

The default root is benchmarks/cua_world/environments.

gym-anything verify task

Run a task through reset and finalization, then execute its final check.

Example:

gym-anything verify task moodle --task enroll_student

Useful flags:

  • --seed
  • --use_cache
  • --cache_level
  • --use_savevm
  • --json: machine-readable output
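
The three per-task checks (validate, verify spec, verify task) can be chained so that the cheaper checks run first. The sketch below assumes each command exits with a non-zero status on failure, which this page does not state. check_task and the GYM_ANYTHING override variable are hypothetical names introduced for illustration.

```shell
# Hypothetical wrapper; assumes each command exits non-zero on failure.
# GYM_ANYTHING can be overridden to point at a different binary or a stub.
GYM_ANYTHING="${GYM_ANYTHING:-gym-anything}"

check_task() {
  env_name="$1"
  task_id="$2"
  # 1. Lighter spec check.
  "$GYM_ANYTHING" validate "$env_name" --task "$task_id" || return 1
  # 2. More detailed spec verification.
  "$GYM_ANYTHING" verify spec "$env_name" --task "$task_id" || return 2
  # 3. Reset, finalization, and the task's final check.
  "$GYM_ANYTHING" verify task "$env_name" --task "$task_id" || return 3
}
```

For example, check_task moodle enroll_student returns 0 only if all three stages pass; the return codes 1, 2, and 3 identify the stage that failed.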
