Evaluation

The evaluation programs are the part of agents/ that runs agents against environments.

The easiest way to run an evaluation is through the CLI:

gym-anything benchmark moodle --task enroll_student --agent ClaudeAgent --model claude-opus-4

Run gym-anything agents to see which agent names are available.

One Agent, One Task

Run a single agent on a single task:

gym-anything benchmark moodle --task enroll_student --agent ClaudeAgent --model claude-opus-4

The important arguments are:

Common optional arguments are:

At a high level, run_single does this:

1. load the environment and selected task
2. reset the environment
3. create the agent class you selected
4. call agent.init(...)
5. capture an observation
6. repeat:
   - call agent.step(obs, action_outputs)
   - send the returned actions to the environment
   - feed the action results back into the agent
7. finish the task with mark_done=True
8. close the environment
9. call agent.finish(...)
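
The steps above can be sketched in a few lines of Python. This is an illustration of the control flow only, not the contents of run_single.py. The agent method names init, step, and finish come from the list above. Everything else is an assumption made for the sketch: the StubEnv and StubAgent classes, the env method names (reset, capture_observation, step, close), the idea that mark_done is passed to an env call, the max_steps cap, and the rule that an empty action list ends the loop.

```python
class StubEnv:
    """Hypothetical stand-in for an environment; records what was called."""

    def __init__(self):
        self.events = []

    def reset(self):
        self.events.append("reset")

    def capture_observation(self):
        return "obs-0"

    def step(self, actions, mark_done=False):
        # Assumption: mark_done is a flag on the env call that ends the task.
        self.events.append("done" if mark_done else "step")
        outputs = [f"ran {action}" for action in actions]
        return f"obs-{len(self.events)}", outputs

    def close(self):
        self.events.append("close")


class StubAgent:
    """Hypothetical agent that replays a fixed plan, one action per step."""

    def __init__(self, plan):
        self.plan = list(plan)
        self.seen_outputs = []
        self.finished = False

    def init(self, task_description):
        self.task = task_description

    def step(self, obs, action_outputs):
        self.seen_outputs.append(action_outputs)
        # Assumption: returning no actions means the agent is done.
        return [self.plan.pop(0)] if self.plan else []

    def finish(self):
        self.finished = True


def run_single(env, agent, task_description, max_steps=10):
    """Run one agent on one task and return the number of env steps taken."""
    env.reset()
    agent.init(task_description)
    obs = env.capture_observation()
    action_outputs = []
    steps = 0
    for _ in range(max_steps):
        actions = agent.step(obs, action_outputs)
        if not actions:
            break
        # Send actions to the env, then feed the results back next iteration.
        obs, action_outputs = env.step(actions)
        steps += 1
    env.step([], mark_done=True)
    env.close()
    agent.finish()
    return steps
```

Calling run_single(StubEnv(), StubAgent(["click", "type"]), "enroll_student") walks through the loop twice before the stub agent runs out of planned actions.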

If you're trying to understand how the loop works, agents/evaluation/run_single.py is the file to read.

Omit --task to run all tasks in a split:

gym-anything benchmark moodle --agent ClaudeAgent --model claude-opus-4 --split test

To run across all environments:

gym-anything benchmark all --agent ClaudeAgent --model claude-opus-4 --split test
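
If you launch these commands from a script, a small helper that assembles the argument list keeps the single-task and batch forms consistent. The sketch below uses only the flags shown in the commands above. The function name build_benchmark_command is hypothetical, and whether the CLI accepts --task and --split together is not something this sketch checks.

```python
import shlex
from typing import List, Optional


def build_benchmark_command(
    env: str,
    agent: str,
    model: str,
    task: Optional[str] = None,
    split: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `gym-anything benchmark` (hypothetical helper)."""
    cmd = ["gym-anything", "benchmark", env]
    if task is not None:
        # A single task; leave it out to run every task in the split.
        cmd += ["--task", task]
    cmd += ["--agent", agent, "--model", model]
    if split is not None:
        cmd += ["--split", split]
    return cmd


# Print the command line for a batch run over all environments.
print(shlex.join(build_benchmark_command(
    "all", "ClaudeAgent", "claude-opus-4", split="test")))
```

The returned list can be passed directly to subprocess.run without shell quoting.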

The main arguments for batch mode are:

Each task runs in its own process for fault isolation: if one task crashes, the rest continue.
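
To illustrate why per-process isolation keeps a batch alive, here is a minimal sketch using the standard library. It is not how run_batch.py is implemented. The function names and the use of short Python snippets as stand-in "tasks" are assumptions made for the example.

```python
import subprocess
import sys
from typing import Dict


def run_task_isolated(task_code: str, timeout: float = 60.0) -> bool:
    """Run one stand-in task in a child Python process.

    Returns True if the child exits with code 0. A crash or timeout in the
    child cannot take down the parent process.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", task_code],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def run_batch_isolated(tasks: Dict[str, str]) -> Dict[str, bool]:
    """Run every task in its own process and record success per task."""
    results = {}
    for name, code in tasks.items():
        results[name] = run_task_isolated(code)
    return results
```

With a task list where the second entry raises an exception, the third entry still runs, and the failure shows up only as a False entry in the results.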

If you're only trying to verify that an agent works at all, start with a single task before running batch mode.

Read the evaluation files in this order:

run_single.py is the main loop. run_batch.py is mostly task selection plus repeated calls into run_single.