Tasks and Checks
How a task is defined, how it starts, and how Gym Anything decides whether it passed.
A task is one specific thing the agent is asked to do inside an environment.
In Moodle, a task might ask the agent to enroll a student in a course.
In some other environment, a task might ask it to configure a setting, export a file, fill in a form, or fix something that's broken.
So a task isn't the whole environment. It's one job inside that environment.
What Makes A Task Different From A Prompt
A task here is more than an instruction string. The code that ships with it is what makes it reusable and testable.
It usually comes with:
- a written goal
- code that prepares the right starting state
- code that exports the result at the end
- code that checks whether the task was solved
One Real Task Folder
Here is a real task folder from the Moodle benchmark:
```text
benchmarks/cua_world/environments/moodle_env/tasks/enroll_student/
  task.json
  setup_task.sh
  export_result.sh
  verifier.py
  README.md
```

The files mean:
- task.json: the task description and settings
- setup_task.sh: prepares the exact starting state
- export_result.sh: saves task-specific results at the end
- verifier.py: checks whether the task passed
- README.md: optional human explanation
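If you keep many task folders side by side, a quick completeness check can catch a missing file before a run. The sketch below is a hypothetical helper, not part of Gym Anything. It assumes the four files above are required for every task and treats README.md as optional, as the list says.

```python
from pathlib import Path

# Assumption: every runnable task folder needs these four files; README.md is optional.
REQUIRED_FILES = ("task.json", "setup_task.sh", "export_result.sh", "verifier.py")


def check_task_folder(task_dir: str) -> list[str]:
    """Return the required files missing from task_dir; an empty list means nothing is missing."""
    root = Path(task_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

Calling check_task_folder on the enroll_student path above should return an empty list if the folder matches the layout shown.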
What task.json Usually Contains
task.json is the main file to read first.
In a typical task, it tells you:
- the instruction shown to the agent
- the timeout and step limit
- which setup script runs before the task
- which export script runs after the agent finishes
- which check runs at the end
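To make those fields concrete, here is a minimal sketch of reading them out of a task.json file. It assumes the layout of the simplified example that follows (description, init, hooks, success.spec.program). TaskConfig and load_task_config are hypothetical names, not Gym Anything's own loader, and the sketch only handles the "program" success mode.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskConfig:
    description: str
    timeout_sec: int
    max_steps: int
    pre_task: Optional[str]   # setup script run before the agent starts
    post_task: Optional[str]  # export script run after the agent finishes
    verifier_file: str
    verifier_func: str


def load_task_config(text: str) -> TaskConfig:
    """Parse task.json text laid out like the simplified example on this page (hypothetical helper)."""
    data = json.loads(text)
    success = data["success"]
    if success.get("mode") != "program":
        raise ValueError("this sketch only handles program-based checks")
    # "verifier.py::verify_enroll_student" -> ("verifier.py", "verify_enroll_student")
    verifier_file, _, verifier_func = success["spec"]["program"].partition("::")
    hooks = data.get("hooks", {})
    return TaskConfig(
        description=data["description"],
        timeout_sec=data["init"]["timeout_sec"],
        max_steps=data["init"]["max_steps"],
        pre_task=hooks.get("pre_task"),
        post_task=hooks.get("post_task"),
        verifier_file=verifier_file,
        verifier_func=verifier_func,
    )
```

Splitting on "::" mirrors how the example names the check: a file, then a function inside that file.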
A simplified example looks like this:
```json
{
  "description": "Enroll the student 'Jane Doe' in the 'Intro to Biology' course.",
  "init": {
    "timeout_sec": 300,
    "max_steps": 40
  },
  "hooks": {
    "pre_task": "/workspace/tasks/enroll_student/setup_task.sh",
    "post_task": "/workspace/tasks/enroll_student/export_result.sh"
  },
  "success": {
    "mode": "program",
    "spec": {
      "program": "verifier.py::verify_enroll_student"
    }
  }
}
```

What Happens When The Task Runs
The normal flow is:
- the environment starts
- the task setup script runs
- the agent receives the task instruction
- the agent interacts with the software
- the run is finished with mark_done=True
- the task export and final check run
If you're driving the environment directly from Python, the finish step usually looks like this:
```python
obs, reward, done, info = env.step([], mark_done=True)
```

That's the step that tells Gym Anything to end the task cleanly and run the final task logic.
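To see how the finish call fits into a whole run, here is a minimal sketch of a driving loop. StubEnv is a hypothetical stand-in that only mimics the four values returned by env.step(...); the real Gym Anything constructor, observation format, and action format are not covered on this page, so they are not reproduced. The agent callable, and its convention of returning None when it considers itself finished, are also assumptions. The loop respects a step limit like max_steps in task.json.

```python
from typing import Any, Callable, Optional


class StubEnv:
    """Hypothetical stand-in for a Gym Anything environment.

    It mimics only the (obs, reward, done, info) tuple returned by env.step(...);
    the real constructor, observations, and actions are not modeled.
    """

    def __init__(self) -> None:
        self.actions_taken: list = []
        self.finished = False

    def step(self, actions: list, mark_done: bool = False):
        if mark_done:
            # In the real environment, the export script and final check run here.
            self.finished = True
            return None, 0.0, True, {"finished": True}
        self.actions_taken.extend(actions)
        return {"num_actions": len(self.actions_taken)}, 0.0, False, {}


def run_episode(env, agent: Callable[[Any], Optional[list]], first_obs: Any, max_steps: int):
    """Let the agent act until it stops or hits max_steps, then end the task cleanly."""
    obs = first_obs
    for _ in range(max_steps):
        actions = agent(obs)
        if actions is None:  # assumed convention: None means "I'm finished"
            break
        obs, reward, done, info = env.step(actions)
        if done:
            # Assumption: if the environment ended the run itself, there is nothing left to finish.
            return obs, reward, done, info
    # Empty action list plus mark_done=True: end the task and run export + final check.
    return env.step([], mark_done=True)
```

The key design point is that the loop always ends with the mark_done=True call, whether the agent stopped on its own or ran out of steps, so the export and check are never skipped.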
What The Setup Script Is For
The setup script makes the starting point specific to that task.
For the Moodle enroll_student task, the setup script:
- resets the Moodle database state
- creates the student and course records
- restarts the Moodle service
- leaves the app ready for the agent to begin
Without that step, the task wouldn't start from a known state.
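Leaving the app ready often means waiting until the restarted service actually responds, not just until the restart command returns. Here is that polling pattern as a small Python sketch; the real setup is a shell script, where the same idea would be a loop around a request. wait_until, http_ok, and any URL you pass in are hypothetical and not taken from the Moodle task.

```python
import time
import urllib.request
from typing import Callable


def wait_until(is_ready: Callable[[], bool], timeout_sec: float = 120.0, interval_sec: float = 2.0) -> bool:
    """Poll is_ready() until it returns True or timeout_sec passes; return whether it became ready."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_sec)


def http_ok(url: str) -> Callable[[], bool]:
    """Build a readiness check that passes once url answers with HTTP 200."""
    def check() -> bool:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
        except OSError:  # refused connections, timeouts, and HTTP errors (URLError subclasses OSError)
            return False
    return check
```

For example, wait_until(http_ok("http://localhost/login/index.php"), timeout_sec=180) would block until a Moodle login page answers; the host, port, and timeout are placeholders.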
What The Final Check Is For
The final check decides whether the task succeeded.
Most of our benchmark tasks use a Python check in verifier.py. That check can look at things like:
- files created during the run
- exported JSON or CSV output
- application state
- database contents
- screenshots from the run
For the Moodle enroll_student task, the checker queries the Moodle database and verifies that the expected student was enrolled in the expected course.
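Here is a minimal sketch of that kind of database check. It swaps in an in-memory SQLite database for the real Moodle database server and keeps only a few columns of Moodle's mdl_user, mdl_course, mdl_enrol, and mdl_user_enrolments tables. The function names, their signatures, and the {"passed", "feedback"} return shape are assumptions; this is not the actual verify_enroll_student.

```python
import sqlite3


def verify_enrollment(conn: sqlite3.Connection, firstname: str, lastname: str, course: str) -> dict:
    """Pass if the named user has an enrolment in the course with the given full name."""
    (count,) = conn.execute(
        """
        SELECT COUNT(*)
        FROM mdl_user_enrolments ue
        JOIN mdl_enrol e  ON e.id = ue.enrolid   -- enrolment instance attached to a course
        JOIN mdl_course c ON c.id = e.courseid
        JOIN mdl_user u   ON u.id = ue.userid
        WHERE u.firstname = ? AND u.lastname = ? AND c.fullname = ?
        """,
        (firstname, lastname, course),
    ).fetchone()
    passed = count > 0
    status = "is" if passed else "is not"
    return {"passed": passed, "feedback": f"{firstname} {lastname} {status} enrolled in '{course}'"}


def build_demo_db(enrolled: bool) -> sqlite3.Connection:
    """In-memory stand-in holding a few columns from Moodle's real tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE mdl_user (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT);
        CREATE TABLE mdl_course (id INTEGER PRIMARY KEY, fullname TEXT);
        CREATE TABLE mdl_enrol (id INTEGER PRIMARY KEY, courseid INTEGER, enrol TEXT);
        CREATE TABLE mdl_user_enrolments (id INTEGER PRIMARY KEY, enrolid INTEGER, userid INTEGER);
        INSERT INTO mdl_user VALUES (1, 'Jane', 'Doe');
        INSERT INTO mdl_course VALUES (1, 'Intro to Biology');
        INSERT INTO mdl_enrol VALUES (1, 1, 'manual');
        """
    )
    if enrolled:
        conn.execute("INSERT INTO mdl_user_enrolments VALUES (1, 1, 1)")
    return conn
```

The join goes through mdl_enrol because, in Moodle's schema, a user is enrolled in an enrolment instance that belongs to a course rather than in the course directly. A stricter check might also require the enrolment to be active.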
We also support image-based checks and mixed checks, but most tasks use code in verifier.py.
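A check doesn't have to query the application directly. When the export script writes the relevant state to a file, the check can read that file instead. The sketch below assumes a hypothetical JSON export with an "enrollments" list; the file path, field names, and return shape are illustrative, not what the Moodle task actually exports.

```python
import json
from pathlib import Path


def verify_from_export(result_path: str, firstname: str, lastname: str, course: str) -> dict:
    """Check an exported JSON result file rather than the live application."""
    path = Path(result_path)
    if not path.is_file():
        # No export means there is nothing to verify; fail instead of crashing.
        return {"passed": False, "feedback": f"no export found at {result_path}"}
    try:
        result = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        return {"passed": False, "feedback": f"export is not valid JSON: {exc}"}
    if not isinstance(result, dict):
        return {"passed": False, "feedback": "export should be a JSON object"}
    # Assumed export shape: {"enrollments": [{"firstname": ..., "lastname": ..., "course": ...}, ...]}
    wanted = {"firstname": firstname, "lastname": lastname, "course": course}
    passed = any(
        all(entry.get(key) == value for key, value in wanted.items())
        for entry in result.get("enrollments", [])
    )
    return {"passed": passed, "feedback": "match found" if passed else "no matching enrollment in export"}
```

Failing closed on a missing or malformed export keeps a broken post-task hook from ever being scored as a success.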
If You Want To Understand A Task Quickly
Read these files in order:
- task.json
- setup_task.sh
- verifier.py
That gives you the shortest path to understanding:
- what the agent is supposed to do
- what the task changes before the agent starts
- what the checker will accept as success
If You Want To Create A New Task
The simplest workflow is:
- copy an existing task folder that's close to what you want
- change the instruction in task.json
- update the setup script so the starting state matches the new goal
- update the final check so it matches the new goal
- run the task once yourself and make sure the check behaves the way you expect
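The first two steps can be scripted. This hypothetical clone_task helper copies a folder, replaces the description in task.json, and rewrites hook paths that embed the old folder name, the way /workspace/tasks/enroll_student/setup_task.sh does in the example earlier. It assumes hook paths follow that pattern; the setup script, the check, and the test run still need your attention.

```python
import json
import shutil
from pathlib import Path


def clone_task(src_dir: str, dst_dir: str, new_description: str) -> dict:
    """Copy a task folder, give it a new instruction, and repoint hook paths at the new folder."""
    src, dst = Path(src_dir), Path(dst_dir)
    shutil.copytree(src, dst)  # raises FileExistsError if dst exists, so nothing is overwritten

    config_path = dst / "task.json"
    config = json.loads(config_path.read_text())
    config["description"] = new_description

    # Hook paths in the example embed the folder name, e.g. /workspace/tasks/enroll_student/setup_task.sh
    old_part, new_part = f"/{src.name}/", f"/{dst.name}/"
    hooks = config.get("hooks", {})
    for name in hooks:
        hooks[name] = hooks[name].replace(old_part, new_part)

    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Refusing to overwrite an existing destination is deliberate: cloning into the wrong name should fail loudly rather than silently replace another task.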