Turn Any Software into an Agent Environment
Have a piece of software? Gym-Anything automatically converts it into a computer-use agent environment, set up with realistic data, tasks, and verification.
Our key insight: setting up computer-use agent environments is itself a coding and computer-use agent task. Gym-Anything automates the entire pipeline.
We identify which software matters most by decomposing U.S. GDP data into per-software economic contributions, then select 200 high-impact applications across all major occupation groups.
A coding agent automatically installs, configures, and populates each software with real-world data. An independent audit agent verifies correctness against quality checklists, iterating until the environment is production-ready.
An agentic model creates 5 high-quality seed tasks per software by actually running it. An LLM then amplifies these to 75 tasks each, yielding 10K+ verified tasks with checklist-based evaluation.
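The install-then-audit step above can be pictured as a simple loop. The following is a minimal sketch, not the project's implementation: `build`, `audit`, `AuditReport`, and `max_rounds` are hypothetical names standing in for the coding agent, the audit agent, its checklist result, and an iteration cap.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AuditReport:
    """Outcome of one audit pass: the checklist items that failed (hypothetical)."""
    failed_items: List[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.failed_items


def build_until_audited(
    build: Callable[[List[str]], str],
    audit: Callable[[str], AuditReport],
    max_rounds: int = 5,
) -> str:
    """Alternate build and audit until the audit passes or rounds run out.

    `build` receives the failed checklist items from the previous audit
    (an empty list on the first round) and returns an environment identifier.
    """
    feedback: List[str] = []
    for _ in range(max_rounds):
        env_id = build(feedback)
        report = audit(env_id)
        if report.passed:
            return env_id
        # Feed the failures back so the next build can address them.
        feedback = report.failed_items
    raise RuntimeError(f"audit still failing after {max_rounds} rounds: {feedback}")
```

Keeping the auditor separate from the builder means the component that checks the environment is not the one that produced it, which matches the "independent audit agent" described above.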
CUA-World covers tasks and software from all 22 major occupation groups, spanning healthcare and education to engineering, finance, and creative arts.
200 long-horizon tasks, one per application. Tasks often require 500+ steps. The best configuration achieves a 27.5% pass rate, highlighting the difficulty of long-horizon tasks.
| # | Model | Provider | Method | Max Steps / Task | Avg Score | Pass Rate | Notes |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5.4 | OpenAI | Direct (2000 steps) | 2,000 | 55.5 | 27.5% | Extended 2,000-step budget, no cost cap |
| 2 | Gemini 3 Flash | Google | + Test-Time Auditing | 2,000 | 39.9 | 14.0% | TTA: auditor reviews trajectory mid-run and provides corrective feedback |
| 3 | Gemini 3 Flash | Google | Direct (2000 steps) | 2,000 | 37.4 | 11.5% | 4× budget: extended to 2,000 steps vs standard 500 |
| 4 | Gemini 3 Flash | Google | Direct (500 steps) | 500 | 35.4 | 7.5% | Standard baseline 500-step run |
| 5 | Sonnet 4.6 | Anthropic | Direct (500 steps) | 500 | 20.5 | 6.0% | Computer Use: native computer-use capability |
| 6 | Kimi-K 2.5 | Moonshot AI | Direct (500 steps) | 500 | 33.9 | 5.5% | Computer Use: native computer-use capability |
| 7 | GPT-5.4 | OpenAI | Direct (500 steps) | 500 | 22.7 | 3.0% | Computer Use: via Operator / CUA API |
Evaluation uses checklist-based VLM verification with privileged information.
Pass Rate = fraction of tasks fully completed. Standard budget: 500 steps or $5; extended entries use 2,000 steps.
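To make the two leaderboard metrics concrete, here is a minimal sketch of how they could be aggregated from per-task checklists. It rests on assumptions not stated above: that a task's score is the percentage of checklist items judged satisfied, and that "fully completed" means every item is satisfied. The function names and the dictionary layout are illustrative only.

```python
from typing import Dict, List


def checklist_score(checklist: Dict[str, bool]) -> float:
    """Percentage of checklist items judged satisfied (0 to 100). Assumed definition."""
    if not checklist:
        raise ValueError("checklist must contain at least one item")
    return 100.0 * sum(checklist.values()) / len(checklist)


def summarize(results: List[Dict[str, bool]]) -> Dict[str, float]:
    """Aggregate per-task checklists into an average score and a pass rate."""
    if not results:
        raise ValueError("results must contain at least one task")
    scores = [checklist_score(c) for c in results]
    # A task counts as passed only if every checklist item is satisfied.
    passes = [all(c.values()) for c in results]
    return {
        "avg_score": sum(scores) / len(scores),
        "pass_rate": 100.0 * sum(passes) / len(passes),
    }


if __name__ == "__main__":
    demo = [
        {"file exported": True, "units correct": True},
        {"file exported": True, "units correct": False},
    ]
    print(summarize(demo))
```

Under these assumptions, partial credit raises the average score while a pass requires a complete checklist, which would explain why Avg Score sits well above Pass Rate in every row of the table.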
Submit your results: coming soon
Performance scales with both training data and test-time compute, while generalization to unseen software remains a challenge.
Avg. Score on CUA-World-Test improves log-linearly with more software and tasks.
Training on a subset of software recovers most gains on seen software, but barely helps on unseen software.
Pass rate on CUA-World-Long scales with step budget. Test-Time Auditing provides further gains.
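Test-Time Auditing is described above only as an auditor that reviews the trajectory mid-run and provides corrective feedback. One way to picture it is the loop below. This is a sketch under assumptions: the `auditor` callable, the `audit_every` interval, and the two-argument `agent.act(screenshot, feedback)` signature are all hypothetical; only `reset()`, `step()`, and the `"screen"` observation key come from the API example on this page.

```python
from typing import Any, Callable, List, Optional, Tuple


def run_with_auditing(
    env: Any,
    agent: Any,
    auditor: Callable[[List[Tuple[Any, Any]]], Optional[str]],
    max_steps: int = 2000,
    audit_every: int = 100,
) -> int:
    """Run one episode, pausing every `audit_every` steps for an audit.

    Returns the number of steps taken.
    """
    trajectory: List[Tuple[Any, Any]] = []
    feedback: Optional[str] = None
    obs = env.reset()
    for step in range(1, max_steps + 1):
        action = agent.act(obs["screen"], feedback)
        feedback = None  # each piece of feedback is delivered once
        obs, reward, done, info = env.step(action)
        trajectory.append((action, obs["screen"]))
        if done:
            return step
        if step % audit_every == 0:
            # The auditor sees the whole trajectory so far and may return advice.
            feedback = auditor(trajectory)
    return max_steps
```

The auditor never acts on the environment itself in this sketch; it only adds text to the agent's next input, so the step budget is spent entirely by the agent.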
Browse 200+ software environments, tasks, and agent trajectories.
A Gym-like Python API to turn any software into a computer-use agent environment with a simple make() call.
    import gym_anything

    # Turn any software into an env
    env = gym_anything.make("blender.modeling@1.0")
    obs = env.reset()

    # `agent` is your own policy object; it must expose act(screenshot)
    for step in range(500):
        screenshot = obs["screen"]
        action = agent.act(screenshot)
        obs, reward, done, info = env.step(action)
        if done:
            break

    score = env.verify()
    print(f"Task score: {score}")
    env.close()
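The example above leaves `agent` undefined and does not close the environment if a step raises. A minimal sketch of a reusable helper follows, assuming only the calls shown in the example (`reset`, `step`, `verify`, `close`, and the `"screen"` observation key). The `Agent` protocol and the `run_episode` name are illustrative and not part of the gym_anything package.

```python
from typing import Any, Protocol


class Agent(Protocol):
    """Anything with an act() method that maps a screenshot to an action."""

    def act(self, screenshot: Any) -> Any: ...


def run_episode(env: Any, agent: Agent, max_steps: int = 500) -> Any:
    """Run one task and return env.verify(), always closing the env."""
    try:
        obs = env.reset()
        for _ in range(max_steps):
            action = agent.act(obs["screen"])
            obs, reward, done, info = env.step(action)
            if done:
                break
        return env.verify()
    finally:
        # Release the container or VM even if the agent or env raised.
        env.close()
```

With this helper, the example reduces to `score = run_episode(gym_anything.make("blender.modeling@1.0"), agent)`, where `agent` is any object satisfying the protocol.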
Desktop apps, web platforms, Android apps — wrap any software as an environment. Your agent interacts through screenshots and mouse/keyboard, just like a human would.
Docker, QEMU, Apptainer, or Apple Virtualization Framework. Works on your laptop, a cloud VM, or a rootless SLURM cluster.
Run hundreds of environments across remote machines with health monitoring, fault recovery, and automatic load balancing.
Per-step screenshot capture, structured action logs, a live dashboard, and VM state checkpointing for fast resets.
@misc{aggarwal2026gymanythingturnsoftwareagent,
title={Gym-Anything: Turn any Software into an Agent Environment},
author={Pranjal Aggarwal and Graham Neubig and Sean Welleck},
year={2026},
eprint={2604.06126},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2604.06126},
}