Extras
propose_and_amplify
Generate hard, realistic task folders for an existing gym-anything environment.
Path: extras/research/task_generation/propose_and_amplify/
This is the §4 pipeline from the CUA-World paper.
| Stage | What it does |
|---|---|
| propose | An agentic Claude Code session reads the task creation memory, inspects the env, and writes a small set of hard, realistic seed tasks directly into <env>/tasks/. Its three internal phases (read notes, create tasks, blind nudge) match the original driver one-for-one. |
| amplify | Non-agentic Gemini expands the seeds into many more tasks. Three passes: a README pass that generates task spec markdown, a files pass that fills in setup_task.sh / verifier.py / export_result.sh / README.md, and a snapshot pass that records the generated task names into <env>/tasks/seed_tasks.json so subsequent amplify runs see them as seeds. |
| extract | The files-pass pickle is unpacked into final task folders under <env>/tasks/. |
End-to-end takes a few hours per environment depending on
--amplify-count, the size of the env, and rate limits.
Prerequisites
- gym-anything installed.
- An environment already exists under benchmarks/cua_world/environments/<env_dir>/. Build one with creation_audit first if it doesn't.
- claude (Claude Code CLI) on PATH for the proposer. Set CLAUDE_BIN or pass --claude-bin.
- ANTHROPIC_API_KEY: used by the files pass if you choose a Claude amplifier model.
- GEMINI_API_KEY: used by the default amplifier.
- Optional: the visual_grounding MCP server (the proposer's prompt references it, but the run will proceed without it).
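The prerequisites above can be checked before starting a multi-hour run. The following is a minimal bash sketch, not part of gym-anything: the function name check_prereqs is hypothetical, and it assumes you run it from the repository root so the relative environment path resolves. It skips ANTHROPIC_API_KEY because that key only matters when a Claude amplifier model is chosen.

```shell
#!/usr/bin/env bash
# Preflight sketch for the prerequisites listed above.
# Assumption: the current directory is the repository root.
# check_prereqs is a hypothetical helper, not a gym-anything command.

check_prereqs() {
    local env_dir="$1"
    local missing=0

    # The proposer needs the Claude Code CLI: CLAUDE_BIN if set, else `claude` on PATH.
    local claude_bin="${CLAUDE_BIN:-claude}"
    if ! command -v "$claude_bin" >/dev/null 2>&1; then
        echo "missing: Claude Code CLI ($claude_bin)"
        missing=1
    fi

    # The default amplifier uses Gemini.
    if [ -z "${GEMINI_API_KEY:-}" ]; then
        echo "missing: GEMINI_API_KEY"
        missing=1
    fi

    # The environment must already exist.
    if [ ! -d "benchmarks/cua_world/environments/$env_dir" ]; then
        echo "missing: benchmarks/cua_world/environments/$env_dir"
        missing=1
    fi

    # Exit status 0 means everything was found, 1 means something is missing.
    return "$missing"
}
```

A typical call would be `check_prereqs moodle_env && echo "ready"`; each missing item is printed on its own line.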
Quickstart
gym-anything-extras research task_generation propose_and_amplify \
--software "Moodle" --env-dir moodle_env

What happens after you press enter:
- propose: Claude Code opens a session, reads memory/task_creation_notes/, looks at the existing tasks under moodle_env/tasks/, and writes 5 new hard, realistic seed tasks directly to that folder. A blind nudge round catches anything skipped.
- amplify: Gemini 3 Pro generates 75 more task specs, then produces implementation files for each. Outputs land under task_generation_runs/moodle_env/. After both passes, the generated task names are merged into moodle_env/tasks/seed_tasks.json.
- extract: task folders are written under benchmarks/cua_world/environments/moodle_env/tasks/. Existing tasks (including the 5 seeds the proposer wrote) are preserved; pass --overwrite to replace them.
Common variations
# Generate more tasks
... propose_and_amplify --software "Bahmni" --env-dir bahmni_env \
--amplify-count 150
# Re-run only the amplify stage (e.g. after fixing a verifier template)
... propose_and_amplify --software "Moodle" --env-dir moodle_env \
--stage amplify
# Use a different proposer model
... propose_and_amplify --software "Moodle" --env-dir moodle_env \
--proposer-model opus
# Resume the proposer from phase 2 (after notes are read) of an
# earlier session
... propose_and_amplify --software "Moodle" --env-dir moodle_env \
--propose-start-idx 1 --session-id <existing-session-uuid>

Run ... propose_and_amplify --help for the full flag list.
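When several environments need the same settings, it can help to print the commands first and review them before running anything. The sketch below only builds command lines from flags shown above (--software, --env-dir, --amplify-count). The helper name print_batch_commands and the "Software:env_dir" pairing format are made up for this example and are not part of the CLI.

```shell
#!/usr/bin/env bash
# Print one propose_and_amplify command per environment for review.
# Assumption: each argument after the count is a "Software Name:env_dir" pair.
# This pairing format is hypothetical and exists only in this sketch.

print_batch_commands() {
    local amplify_count="$1"; shift
    local pair software env_dir
    for pair in "$@"; do
        software="${pair%%:*}"   # text before the first colon
        env_dir="${pair#*:}"     # text after the first colon
        printf 'gym-anything-extras research task_generation propose_and_amplify --software "%s" --env-dir %s --amplify-count %s\n' \
            "$software" "$env_dir" "$amplify_count"
    done
}

print_batch_commands 150 "Moodle:moodle_env" "Bahmni:bahmni_env"
```

Nothing is executed here; the function only prints text, so the output can be inspected or saved to a script before any long-running stage starts.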
Output
- benchmarks/cua_world/environments/<env_dir>/tasks/<task_name>/: the task folders. Each contains task.json, setup_task.sh, export_result.sh, verifier.py, and README.md. They conform to the gym-anything TaskSpec contract.
- benchmarks/cua_world/environments/<env_dir>/tasks/seed_tasks.json: task names recorded by the snapshot pass; future amplify runs read them as in-context examples.
- task_generation_runs/<env_dir>/: stage pickles and run logs. Re-running picks up from these.
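Since each task folder is expected to contain the same five files, a quick file-presence check can flag incomplete folders before any deeper validation. This is a minimal bash sketch: find_incomplete_tasks is a hypothetical helper, and it assumes every direct subdirectory of tasks/ is a task folder. It checks only that the files exist, not that they satisfy the TaskSpec contract.

```shell
#!/usr/bin/env bash
# List task folders that lack any of the five expected files.
# Assumption: every direct subdirectory of the tasks directory is a task folder.
# find_incomplete_tasks is a hypothetical helper, not a gym-anything command.

find_incomplete_tasks() {
    local tasks_dir="$1"
    local required="task.json setup_task.sh export_result.sh verifier.py README.md"
    local task_dir f
    for task_dir in "$tasks_dir"/*/; do
        [ -d "$task_dir" ] || continue   # the glob matched nothing
        for f in $required; do
            # task_dir ends with a slash, so plain concatenation gives the file path.
            if [ ! -f "$task_dir$f" ]; then
                echo "$(basename "$task_dir"): missing $f"
            fi
        done
    done
}
```

For example, `find_incomplete_tasks benchmarks/cua_world/environments/moodle_env/tasks` prints one line per missing file and prints nothing when every folder is complete. Plain files in the tasks directory, such as seed_tasks.json, are ignored because the glob only matches directories.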
After it finishes
# Validate every new task spec
gym-anything verify spec <env_dir>
# Run one of the new tasks live to sanity-check setup + verifier
gym-anything run <env_dir> --task <new_task_name> -i --open-vnc