creation_audit
Convert a software application name into a working gym-anything environment.
Path: extras/research/software_as_env/creation_audit/
A coding+computer-use agent reads a creation prompt, writes the install / configure / per-task scripts, and runs the environment to gather evidence. An independent audit agent then reviews that evidence and reports issues, and the creation agent ingests the audit and fixes whatever was flagged. This is the §3 pipeline from the CUA-World paper.
A run typically takes 30 minutes to several hours per environment depending on how complex the target software is. Plan for it.
Prerequisites
- gym-anything installed (pip install -e ".[all]").
- An agent CLI on PATH:
  - Claude Code (claude), used by default.
  - Codex CLI (codex), enabled with --backend codex.
- A runtime that can boot the target environment, typically QEMU + Apptainer for cluster work, or Docker locally. Run gym-anything doctor to see what your machine has.
- Disk + memory headroom for VM images (10–40 GB per env), plus network access for installing the target software inside the VM.
- Optional: the visual_grounding MCP server in mcp/. The creation prompt asks the agent to call it for pixel-coordinate UI grounding. Setup is manual and one-time.
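Before starting a long run, it can help to confirm that at least one agent CLI resolves on PATH. The following is a minimal sketch using the standard command -v builtin; the helper names have_cli and report_agent_clis are hypothetical and not part of gym-anything.

```shell
# Hypothetical helpers (not part of gym-anything): check whether the
# agent CLIs named in the prerequisites resolve on PATH.

# Succeeds if the given command name resolves on PATH.
have_cli() {
  command -v "$1" >/dev/null 2>&1
}

# Print one status line per agent CLI. Runs in a subshell so the loop
# variable does not leak into the caller's environment.
report_agent_clis() (
  for cli in claude codex; do
    if have_cli "$cli"; then
      echo "$cli: found at $(command -v "$cli")"
    else
      echo "$cli: not on PATH"
    fi
  done
)

# Example: report_agent_clis
```

This only checks that the binaries exist. It does not verify that they are logged in or configured, and gym-anything doctor remains the way to inspect available runtimes.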
Quickstart
gym-anything-extras research software_as_env creation_audit \
  --software "Moodle" --env-dir moodle_env

What happens after you press enter:
- Initial pass: the creation agent reads the creation prompt and spends most of the run authoring scripts and booting the environment.
- Blind nudge ×N (default 1): the agent is re-prompted to recheck the creation prompt. This recovers omissions caused by long-context drift.
- Audit ×M (default 2): a fresh audit agent (with no access to the creator's chain of thought) reviews evidence_docs/ and writes audits/audit_<env>.md. The creation agent ingests that audit and fixes issues, then the next audit round runs.
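The ordering of the steps above can be sketched as a small shell function that prints the sequence for a given number of nudges and audit rounds. This is illustrative only: phase_plan is a hypothetical name, the step labels are invented for readability, and the listing makes no claim about how the tool numbers phases internally (for example for --start-idx).

```shell
# Hypothetical illustration of the step order described above.
# Usage: phase_plan [nudges] [audit_rounds]
# Defaults mirror the documented defaults: 1 nudge, 2 audit rounds.
# The function body is a subshell so its variables stay local.
phase_plan() (
  nudges=${1:-1}
  audits=${2:-2}

  echo "initial pass"

  i=1
  while [ "$i" -le "$nudges" ]; do
    echo "blind nudge $i"
    i=$((i + 1))
  done

  # Each audit is followed by the creation agent fixing what was flagged.
  i=1
  while [ "$i" -le "$audits" ]; do
    echo "audit $i"
    echo "fix after audit $i"
    i=$((i + 1))
  done
)

# Example: phase_plan 1 4   # one nudge, four audit rounds
```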
Each phase is logged to creation_audit_logs/<env>.txt. Tail it to
see progress.
Common variations
# Use Codex instead of Claude Code
... creation_audit --software "Inkscape" --env-dir inkscape_env --backend codex
# Heavier auditing for tricky software
... creation_audit --software "Bahmni" --env-dir bahmni_env --audit-rounds 4
# Resume from phase 2 (after the first nudge) of a previous session
... creation_audit --software "Moodle" --env-dir moodle_env \
  --start-idx 2 --session-id <existing-session-uuid>

Run ... creation_audit --help for the full flag list.
Output
- benchmarks/cua_world/environments/<env_dir>/: the new environment. Contains env.json, scripts/, config/, tasks/, evidence_docs/. Conforms to the gym-anything EnvSpec/TaskSpec contract.
- audits/audit_<env_dir>.md: the final audit report.
- creation_audit_logs/<env_dir>.txt: phase-by-phase run log.
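A quick way to confirm that a run produced the layout listed above is to test for each entry. The sketch below assumes only the file and directory names given in this section; check_env_layout is a hypothetical helper, and it checks presence only, not whether the contents satisfy the EnvSpec/TaskSpec contract.

```shell
# Hypothetical helper: verify that an environment directory contains the
# entries listed in the Output section. Prints each missing entry and
# returns non-zero if anything is missing.
check_env_layout() (
  env_root=$1
  missing=0

  # env.json is the only regular file in the list.
  [ -f "$env_root/env.json" ] || { echo "missing: env.json"; missing=1; }

  # The remaining entries are directories.
  for d in scripts config tasks evidence_docs; do
    [ -d "$env_root/$d" ] || { echo "missing: $d/"; missing=1; }
  done

  [ "$missing" -eq 0 ]
)

# Example:
#   check_env_layout benchmarks/cua_world/environments/moodle_env
```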
After it finishes
# Validate the new env spec
gym-anything verify spec <env_dir>
# Boot it and watch the desktop
gym-anything run <env_dir> -i --open-vnc

If verify spec fails or the VNC view shows the env in a wrong state, look at audits/audit_<env_dir>.md for what the auditor flagged in the last round.
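The check-then-read-the-audit step can be wrapped in a small function. This is a sketch under stated assumptions: verify_or_show_audit is a hypothetical name, the GYM_ANYTHING variable is an invented override used only so the command can be stubbed, and the function relies on gym-anything verify spec exiting non-zero on failure, which this document does not state explicitly.

```shell
# Hypothetical wrapper: run spec verification and, if it fails, point at
# the audit report for that environment.
# GYM_ANYTHING is an invented override (defaults to the real CLI name).
# Assumes `verify spec` exits non-zero when the spec is invalid.
verify_or_show_audit() (
  env_dir=$1
  audit="audits/audit_${env_dir}.md"

  if ${GYM_ANYTHING:-gym-anything} verify spec "$env_dir"; then
    echo "spec OK: $env_dir"
  else
    echo "spec check failed; review $audit"
    return 1
  fi
)

# Example: verify_or_show_audit moodle_env
```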