propose_and_amplify

Path: extras/research/task_generation/propose_and_amplify/

This is the pipeline from §4 of the CUA-World paper. It runs in three stages:

Stage	What it does
`propose`	An agentic Claude Code session reads the task creation memory, inspects the env, and writes a small set of hard, realistic seed tasks directly into `<env>/tasks/`. Its three internal phases (read notes, create tasks, blind nudge) match the original driver one-for-one.
`amplify`	Non-agentic Gemini expands the seeds into many more tasks. Three passes: a README pass that generates task spec markdown, a files pass that fills in `setup_task.sh` / `verifier.py` / `export_result.sh` / `README.md`, and a snapshot pass that records the generated task names into `<env>/tasks/seed_tasks.json` so subsequent amplify runs see them as seeds.
`extract`	The files-pass pickle is unpacked into final task folders under `<env>/tasks/`.
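
The snapshot pass can be pictured as an order-preserving merge of task names. The sketch below is illustrative only: it assumes `seed_tasks.json` holds a flat JSON list of task-name strings, which this page does not confirm, and `merge_seed_names` is a hypothetical helper rather than the pipeline's own code.

```python
import json
from pathlib import Path


def merge_seed_names(seed_file, new_names):
    """Merge new task names into a seed file, keeping order and dropping duplicates.

    ASSUMPTION: the file is a flat JSON list of task-name strings.
    """
    path = Path(seed_file)
    # Start from whatever earlier runs recorded, or from nothing.
    existing = json.loads(path.read_text()) if path.exists() else []
    merged = list(existing)
    seen = set(existing)
    for name in new_names:
        if name not in seen:
            merged.append(name)
            seen.add(name)
    path.write_text(json.dumps(merged, indent=2) + "\n")
    return merged
```

Merging instead of overwriting is what would let a later amplify run see both the proposer's seeds and the tasks generated by earlier runs.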

An end-to-end run takes a few hours per environment, depending on --amplify-count, the size of the environment, and API rate limits.

Prerequisites

gym-anything is installed.
An environment already exists under benchmarks/cua_world/environments/<env_dir>/. If it doesn't, build one with creation_audit first.
claude (the Claude Code CLI) is on PATH for the proposer. To use a binary elsewhere, set CLAUDE_BIN or pass --claude-bin.
ANTHROPIC_API_KEY: used by the files pass if you choose a Claude amplifier model.
GEMINI_API_KEY: used by the default amplifier.
Optional: the visual_grounding MCP server. The proposer's prompt references it, but the run proceeds without it.
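
A quick preflight check before a multi-hour run can save a failed start. The following is a minimal sketch, not part of gym-anything: `check_prerequisites` is a hypothetical helper, and it assumes that CLAUDE_BIN, when set, holds the name or path of the Claude Code binary. It only covers the default (Gemini) amplifier.

```python
import os
import shutil
from pathlib import Path


def check_prerequisites(env_dir, environ=None, which=shutil.which,
                        envs_root="benchmarks/cua_world/environments"):
    """Return a list of problems; an empty list means the basics are in place."""
    environ = os.environ if environ is None else environ
    problems = []

    # ASSUMPTION: CLAUDE_BIN overrides the default binary name "claude".
    claude_bin = environ.get("CLAUDE_BIN", "claude")
    if which(claude_bin) is None:
        problems.append(f"Claude Code CLI not found: {claude_bin}")

    # The default amplifier reads GEMINI_API_KEY.
    if not environ.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY is not set")

    # The environment directory must exist before the pipeline runs.
    env_path = Path(envs_root) / env_dir
    if not env_path.is_dir():
        problems.append(f"environment directory missing: {env_path}")

    return problems
```

Called from the repository root, `check_prerequisites("moodle_env")` returns the list of problems found, so a wrapper script can print them and exit early.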

Quickstart

gym-anything-extras research task_generation propose_and_amplify \
    --software "Moodle" --env-dir moodle_env

What happens after you press enter:

propose: Claude Code opens a session, reads memory/task_creation_notes/, looks at the existing tasks under benchmarks/cua_world/environments/moodle_env/tasks/, and writes 5 new hard, realistic seed tasks directly to that folder. A blind nudge round catches anything skipped.
amplify: Gemini 3 Pro generates 75 more task specs (the README pass), then produces implementation files for each (the files pass). Outputs land under task_generation_runs/moodle_env/. After both passes, the snapshot pass merges the generated task names into seed_tasks.json in the same tasks folder.
extract: task folders are written under benchmarks/cua_world/environments/moodle_env/tasks/. Existing tasks (including the 5 seeds the proposer wrote) are preserved; pass --overwrite to replace them.
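
The preserve-unless-overwrite behaviour of the extract stage can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: `extract_tasks` is a hypothetical function, and the input shape (a mapping from task name to a `{filename: content}` dict) is invented here because this page does not describe the pickle's layout.

```python
import shutil
from pathlib import Path


def extract_tasks(generated, tasks_dir, overwrite=False):
    """Write generated tasks to disk, skipping folders that already exist.

    ASSUMPTION: `generated` maps task name -> {filename: file content}.
    Returns (written, skipped) lists of task names.
    """
    written, skipped = [], []
    for name, files in generated.items():
        task_dir = Path(tasks_dir) / name
        if task_dir.exists():
            if not overwrite:
                # Existing tasks, including proposer seeds, are left untouched.
                skipped.append(name)
                continue
            # With overwrite, drop the old folder so no stale files remain.
            shutil.rmtree(task_dir)
        task_dir.mkdir(parents=True)
        for filename, content in files.items():
            (task_dir / filename).write_text(content)
        written.append(name)
    return written, skipped
```

Returning the skipped names makes it easy to see which generated tasks collided with ones already on disk.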

Common variations

# Generate more tasks
... propose_and_amplify --software "Bahmni" --env-dir bahmni_env \
    --amplify-count 150

# Re-run only the amplify stage (e.g. after fixing a verifier template)
... propose_and_amplify --software "Moodle" --env-dir moodle_env \
    --stage amplify

# Use a different proposer model
... propose_and_amplify --software "Moodle" --env-dir moodle_env \
    --proposer-model opus

# Resume the proposer from phase 2 (after notes are read) of an
# earlier session
... propose_and_amplify --software "Moodle" --env-dir moodle_env \
    --propose-start-idx 1 --session-id <existing-session-uuid>

Run ... propose_and_amplify --help for the full flag list.
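
The resume example above pairs "phase 2" with --propose-start-idx 1, which suggests the index is zero-based. The sketch below spells out that reading; it is an inference from the example, and `PROPOSE_PHASES` and `phases_from` are hypothetical names, not the driver's own.

```python
# Phase names as listed in the stage table, in execution order.
PROPOSE_PHASES = ("read notes", "create tasks", "blind nudge")


def phases_from(start_idx=0):
    """Return the proposer phases that would run for a given start index.

    ASSUMPTION: the start index is zero-based, so 1 skips "read notes".
    """
    if not 0 <= start_idx < len(PROPOSE_PHASES):
        raise ValueError(f"start_idx must be in 0..{len(PROPOSE_PHASES) - 1}")
    return list(PROPOSE_PHASES[start_idx:])
```

Under this reading, `phases_from(1)` covers task creation and the blind nudge, which matches resuming after the notes have been read.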

Output

benchmarks/cua_world/environments/<env_dir>/tasks/<task_name>/: the task folders. Each contains task.json, setup_task.sh, export_result.sh, verifier.py, and README.md, and conforms to the gym-anything TaskSpec contract.
benchmarks/cua_world/environments/<env_dir>/tasks/seed_tasks.json: task names recorded by the snapshot pass; future amplify runs read them as in-context examples.
task_generation_runs/<env_dir>/: stage pickles and run logs. A re-run resumes from these.
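
Before running the validators, a quick completeness check over the task folders can catch a half-written extract. This is a minimal sketch using only the file names listed above; `find_incomplete_tasks` is a hypothetical helper and does not check the TaskSpec contract itself.

```python
from pathlib import Path

# The five files each task folder is expected to contain.
REQUIRED_FILES = (
    "task.json",
    "setup_task.sh",
    "export_result.sh",
    "verifier.py",
    "README.md",
)


def find_incomplete_tasks(tasks_dir):
    """Map each incomplete task folder name to the list of files it lacks."""
    incomplete = {}
    for task_dir in sorted(Path(tasks_dir).iterdir()):
        # Skip plain files such as seed_tasks.json.
        if not task_dir.is_dir():
            continue
        missing = [f for f in REQUIRED_FILES if not (task_dir / f).is_file()]
        if missing:
            incomplete[task_dir.name] = missing
    return incomplete
```

An empty result means every folder has all five files; it says nothing about whether their contents are valid.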

After it finishes

# Validate every new task spec
gym-anything verify spec <env_dir>

# Run one of the new tasks live to sanity-check setup + verifier
gym-anything run <env_dir> --task <new_task_name> -i --open-vnc