Evaluation
How to run one agent on one task, or many tasks in a row.
The evaluation programs are the part of agents/ that actually runs agents against environments.
The easiest way to run an evaluation is through the CLI:
gym-anything benchmark moodle --task enroll_student --agent ClaudeAgent --model claude-opus-4
Run gym-anything agents to see which agent names are available.
One Agent, One Task
Run a single agent on a single task:
gym-anything benchmark moodle --task enroll_student --agent ClaudeAgent --model claude-opus-4
The important arguments are:
- env_dir: environment name (e.g. moodle) or full path
- --task: task id inside that environment
- --agent: agent class name (e.g. ClaudeAgent)
- --model: model identifier (e.g. claude-opus-4)
Common optional arguments are:
- --steps: maximum number of agent steps (default: 50)
- --seed: reset seed (default: 42)
- --temperature: sampling temperature
- --agent-arg KEY=VALUE: extra agent arguments (repeatable)
- --use-cache, --cache-level, --use-savevm
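For scripted runs, the command line described above can be assembled in code. The following is a minimal sketch that uses only the flags listed in this section. The helper name build_benchmark_command is hypothetical and not part of gym-anything, and the max_tokens agent argument is a made-up placeholder.

```python
import shlex


def build_benchmark_command(env_dir, agent, model, task=None, steps=None,
                            seed=None, temperature=None, agent_args=None):
    """Build an argv list for `gym-anything benchmark`.

    Hypothetical helper: it only emits flags documented in this section
    and leaves anything unset to the CLI defaults.
    """
    cmd = ["gym-anything", "benchmark", env_dir]
    if task is not None:
        cmd += ["--task", task]
    cmd += ["--agent", agent, "--model", model]
    if steps is not None:
        cmd += ["--steps", str(steps)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    # --agent-arg is repeatable: emit one flag per KEY=VALUE pair.
    for key, value in (agent_args or {}).items():
        cmd += ["--agent-arg", f"{key}={value}"]
    return cmd


cmd = build_benchmark_command(
    "moodle", agent="ClaudeAgent", model="claude-opus-4",
    task="enroll_student", steps=30,
    agent_args={"max_tokens": 2048},  # placeholder key, not a documented argument
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) to launch the run.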
What run_single Does
At a high level, run_single does this:
- load the environment and selected task
- reset the environment
- create the agent class you selected
- call agent.init(...)
- capture an observation
- repeat:
  - call agent.step(obs, action_outputs)
  - send the returned actions to the environment
  - feed the action results back into the agent
- finish the task with mark_done=True
- close the environment
- call agent.finish(...)
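The steps above can be sketched as a toy loop. This is not the code in run_single.py: ToyEnv, ToyAgent, and every environment method name (reset, capture_observation, step, close) are assumptions made for illustration, as are the arguments passed to init and finish. The stop condition (the agent returns no actions, or the step budget runs out) is also an assumption.

```python
class ToyEnv:
    """Stand-in environment; the real environment API may differ."""

    def __init__(self):
        self.closed = False
        self.log = []

    def reset(self, seed=42):
        self.log.append(("reset", seed))

    def capture_observation(self):
        return {"events_so_far": len(self.log)}

    def step(self, actions, mark_done=False):
        self.log.append(("step", tuple(actions), mark_done))
        # One result per action, plus a fresh observation.
        return [f"ok:{a}" for a in actions], self.capture_observation()

    def close(self):
        self.closed = True


class ToyAgent:
    """Stand-in agent that acts twice and then returns no actions."""

    def init(self, task_description):
        self.task = task_description
        self.calls = 0
        self.finished = False

    def step(self, obs, action_outputs):
        self.calls += 1
        return ["click"] if self.calls < 3 else []

    def finish(self):
        self.finished = True


def run_single_sketch(env, agent, task_description, max_steps=50, seed=42):
    env.reset(seed=seed)
    agent.init(task_description)
    obs = env.capture_observation()
    action_outputs = []
    for _ in range(max_steps):
        actions = agent.step(obs, action_outputs)
        if not actions:  # assumed stop signal
            break
        # Send actions to the environment and feed the results back next turn.
        action_outputs, obs = env.step(actions)
    env.step([], mark_done=True)  # finish the task
    env.close()
    agent.finish()


env, agent = ToyEnv(), ToyAgent()
run_single_sketch(env, agent, "enroll_student")
print(env.log)
```

The point of the sketch is the data flow: each call to agent.step receives the latest observation together with the results of the previous actions.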
If you're trying to understand how the loop works, agents/evaluation/run_single.py is the file to read.
One Agent, Many Tasks
Omit --task to run all tasks in a split:
gym-anything benchmark moodle --agent ClaudeAgent --model claude-opus-4 --split test
To run across all environments:
gym-anything benchmark all --agent ClaudeAgent --model claude-opus-4 --split test
The main arguments for batch mode are:
- --split: task list name such as train, test, or all (default: test)
- --surface: raw or verified (default: raw)
Each task runs in its own process for fault isolation, so if one task crashes, the rest continue.
If you're only trying to verify that an agent works at all, start with a single task before running batch mode.
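The per-task process isolation described above can be illustrated with a small sketch. It assumes one child process per task launched through subprocess; run_batch.py may do this differently. The names run_tasks_isolated and toy_command are hypothetical, and the child here is a stand-in script, not a real evaluation.

```python
import subprocess
import sys


def run_tasks_isolated(task_ids, make_command):
    """Run each task in its own child process and collect exit codes.

    A failure in one child only affects that task's entry; the loop
    moves on to the next task.
    """
    results = {}
    for task_id in task_ids:
        completed = subprocess.run(make_command(task_id))
        results[task_id] = completed.returncode
    return results


def toy_command(task_id):
    # Stand-in for a real per-task invocation: the child exits with
    # status 1 for "bad_task" and 0 for everything else.
    code = "import sys; sys.exit(1 if sys.argv[1] == 'bad_task' else 0)"
    return [sys.executable, "-c", code, task_id]


results = run_tasks_isolated(["task_a", "bad_task", "task_c"], toy_command)
print(results)
```

Because the failing task lives in a separate process, its non-zero exit is recorded and task_c still runs.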
If You Want To Change The Loop
Read the evaluation files in this order:
- agents/evaluation/run_single.py
- agents/evaluation/run_batch.py
- agents/evaluation/setup.py
run_single.py is the main loop. run_batch.py is mostly task selection plus repeated calls into run_single.