Gym-Anything

Turn Any Software into an Agent Environment

Pranjal Aggarwal · Graham Neubig · Sean Welleck
Carnegie Mellon University

Have a piece of software? Gym-Anything automatically converts it into a computer-use agent environment, set up with realistic data, tasks, and verification.

📄 Paper 💻 Code 📦 Library 📄 Interactive Paper 🏆 Leaderboard 🔍 Explore CUA-World

200+ Software Environments

10,000+ Tasks & Environments

22/22 SOC Occupation Groups

Operating Systems

The Pipeline

How It Works

Our key insight: setting up computer-use agent environments is itself a coding and computer-use agent task. Gym-Anything automates the entire pipeline.

01 · SELECT

GDP-Grounded Selection

We identify which software matters most by decomposing U.S. GDP data into per-software economic contributions, then select 200 high-impact applications across all major occupation groups.
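As a rough illustration of the decomposition idea, the sketch below attributes each occupation's GDP contribution to software by a usage share and ranks the totals. The function name `rank_software`, the input shapes, and all numbers are hypothetical placeholders; the paper's actual attribution method and data may differ.

```python
from collections import defaultdict


def rank_software(occupation_gdp, usage_share):
    """Attribute each occupation's GDP to software by usage share, then rank.

    occupation_gdp: {occupation: GDP contribution (any consistent unit)}
    usage_share:    {(occupation, software): fraction of that occupation's
                     work attributed to the software, in [0, 1]}
    Both inputs are hypothetical placeholders, not the paper's data.
    """
    contribution = defaultdict(float)
    for (occupation, software), share in usage_share.items():
        contribution[software] += occupation_gdp[occupation] * share
    # Highest estimated contribution first; ties broken by name for determinism.
    return sorted(contribution.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy numbers for illustration only.
occupation_gdp = {"healthcare": 100.0, "engineering": 80.0}
usage_share = {
    ("healthcare", "3D Slicer"): 0.10,
    ("engineering", "Blender"): 0.05,
    ("healthcare", "Blender"): 0.01,
}
ranking = rank_software(occupation_gdp, usage_share)
for software, value in ranking:
    print(f"{software}: {value:.1f}")
```

Taking the top entries of such a ranking, while checking that every occupation group is represented, would be one way to arrive at a fixed-size selection.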

02 · CREATE

Agent-Built Environments

A coding agent automatically installs, configures, and populates each software with real-world data. An independent audit agent verifies correctness against quality checklists, iterating until the environment is production-ready.
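The build-then-audit iteration described above can be pictured as a simple loop. This is a minimal sketch, assuming the coding agent and the audit agent are exposed as two callables; `build`, `audit`, `max_rounds`, and the toy checklist are illustrative names and not part of Gym-Anything.

```python
def build_until_audited(build, audit, max_rounds=5):
    """Alternate build and audit steps until the audit reports no issues.

    build(feedback) -> environment          (hypothetical coding-agent call)
    audit(environment) -> list of failed checklist items (hypothetical audit agent)
    Returns (environment, rounds_used); raises if issues remain after max_rounds.
    """
    feedback = []
    for round_number in range(1, max_rounds + 1):
        environment = build(feedback)
        feedback = audit(environment)
        if not feedback:  # every checklist item passed
            return environment, round_number
    raise RuntimeError(f"audit still failing after {max_rounds} rounds: {feedback}")


# Stand-in agents so the sketch runs end to end.
CHECKLIST = ["app_launches", "sample_data_loaded"]


def make_toy_builder():
    fixed = {"app_launches"}      # the first build misses the data step
    def build(feedback):
        fixed.update(feedback)    # address every item the auditor flagged
        return set(fixed)
    return build


def toy_audit(environment):
    return [item for item in CHECKLIST if item not in environment]


env_state, rounds = build_until_audited(make_toy_builder(), toy_audit)
print(sorted(env_state), rounds)
```

Keeping the auditor separate from the builder means the builder cannot mark its own work as done; the loop only exits when the independent check comes back clean.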

03 · SCALE

Task Generation

An agentic model creates 5 high-quality seed tasks per software by actually running it. An LLM then amplifies these to 75 tasks each, yielding 10K+ verified tasks with checklist-based evaluation.

Coverage

Covering Every Occupation and Domain

CUA-World covers tasks and software for all 22 major occupation groups, spanning healthcare and education to engineering, finance, and the creative arts.

Occupation · Domain · Software · Sample Task

3D Slicer (medical image analysis)

Sample Task: Segment the complete tumor region on the loaded brain MRI (FLAIR, T1, T1ce, T2 sequences) and create a 3D visualization. Save the segmentation and report the tumor volume in mL...

Benchmark

Leaderboard: CUA-World-Long

200 long-horizon tasks, one per software application. Tasks often require 500+ steps. The best configuration achieves a 27.5% pass rate, which shows how difficult long-horizon tasks remain.

#	Model	Method	Avg Score	Pass Rate
1	GPT-5.4	Direct (2000 steps)	55.5	27.5%
   Provider: OpenAI · Extended: 2,000-step budget, no cost cap · Max steps per task: 2,000
2	Gemini 3 Flash	+ Test-Time Auditing	39.9	14.0%
   Provider: Google · TTA: auditor reviews trajectory mid-run and provides corrective feedback · Max steps per task: 2,000
3	Gemini 3 Flash	Direct (2000 steps)	37.4	11.5%
   Provider: Google · 4× budget: extended to 2,000 steps vs. standard 500 · Max steps per task: 2,000
4	Gemini 3 Flash	Direct (500 steps)	35.4	7.5%
   Provider: Google · Standard: baseline 500-step run · Max steps per task: 500
5	Sonnet 4.6	Direct (500 steps)	20.5	6.0%
   Provider: Anthropic · Computer Use: native computer-use capability · Max steps per task: 500
6	Kimi-K 2.5	Direct (500 steps)	33.9	5.5%
   Provider: Moonshot AI · Computer Use: native computer-use capability · Max steps per task: 500
7	GPT-5.4	Direct (500 steps)	22.7	3.0%
   Provider: OpenAI · Computer Use: via Operator / CUA API · Max steps per task: 500

Evaluation uses checklist-based VLM verification with privileged information. Pass Rate = fraction of tasks fully completed. Standard budget: 500 steps or $5; extended entries use 2,000 steps.
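To make the two leaderboard metrics concrete, here is a small sketch of how they could be computed from per-task checklist outcomes. It assumes a task's score is the percentage of checklist items satisfied, which the page does not state; only the pass-rate definition (all items satisfied) comes from the note above. The function name `summarize` and the example inputs are illustrative.

```python
def summarize(results):
    """results: one list of booleans per task, one boolean per checklist item.

    Assumption: a task's score is the percentage of checklist items satisfied.
    A task passes only if every item is satisfied (fully completed).
    """
    scores = [100.0 * sum(items) / len(items) for items in results]
    passed = [all(items) for items in results]
    avg_score = sum(scores) / len(scores)
    pass_rate = 100.0 * sum(passed) / len(passed)
    return avg_score, pass_rate


# Three hypothetical tasks: fully done, half done, nothing done.
avg_score, pass_rate = summarize([[True, True], [True, False], [False, False]])
print(f"avg score {avg_score:.1f}, pass rate {pass_rate:.1f}%")

# With 200 tasks, each leaderboard pass rate maps to a whole number of tasks.
tasks_passed = {
    rate: round(rate / 100 * 200)
    for rate in (27.5, 14.0, 11.5, 7.5, 6.0, 5.5, 3.0)
}
print(tasks_passed)
```

If average score does give partial credit in this way, it would explain how an entry can have a higher Avg Score than the one ranked above it while fully completing fewer tasks.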
Submit your results: coming soon

Findings

Key Results

Performance scales with both training data and test-time compute, while generalization to unseen software remains a challenge.

Training Data Scaling

Avg. Score on CUA-World-Test improves log-linearly with more software and tasks.
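"Log-linear" here means the score grows roughly linearly in the logarithm of the training-set size, so each doubling adds about the same number of points. The sketch below fits such a line with ordinary least squares; the data points are synthetic and chosen only to make the arithmetic easy to follow, not taken from the paper.

```python
import math


def fit_log_linear(sizes, scores):
    """Least-squares fit of score = intercept + slope * ln(size)."""
    xs = [math.log(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic points (not results from the paper): the score rises by 5
# each time the training set doubles.
sizes = [100, 200, 400, 800]
scores = [10.0, 15.0, 20.0, 25.0]
intercept, slope = fit_log_linear(sizes, scores)
gain_per_doubling = slope * math.log(2)
print(f"gain per doubling: {gain_per_doubling:.2f} points")
```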

Generalization

Training on a subset of software recovers most gains on seen software, but barely helps on unseen software.

Test-Time Compute

Pass rate on CUA-World-Long scales with step budget. Test-Time Auditing provides further gains.

Collection

Explore CUA-World

Browse 200+ software environments, tasks, and agent trajectories.

Unipro UGENE

CUA-World

Sample Task: You are a molecular biologist designing a recombinant expression construct to clone the human erythropoietin (EPO) coding sequence into a pET-28a(+) expression...

VitalRecorder

CUA-World

Sample Task: You are an anesthesiologist preparing a case for the department's Morbidity & Mortality conference. Vital Recorder is open with surgical case 0001.vital loaded...

JASP

CUA-World

Sample Task: The file '/home/ga/Documents/JASP/WorldHappiness.csv' is already open in JASP. This is real data from the World Happiness Report covering 155 countries with...

OpenBCI GUI

CUA-World

Sample Task: You are a neurofeedback clinician setting up a hemispheric alpha asymmetry monitoring station, a standard clinical protocol for depression assessment. The...

Panoply

CUA-World

Sample Task: You are a synoptic meteorologist preparing a Northern Hemisphere winter circulation diagnostic for a WMO regional training workshop. Using the briefing at...

Subsurface

CUA-World

Sample Task: You are a technical diving instructor preparing an Advanced Trimix training course at the Blue Hole, Dahab, Egypt (GPS: 28.5720 N, 34.5412 E). Configure the...

Weasis

CUA-World

Sample Task: Perform a quantitative tissue density characterization of the loaded chest CT phantom study. Find the slice where the central mediastinal structure has its...

KStars

CUA-World

Sample Task: A ZTF transient alert reports a supernova candidate near host galaxy NGC 4526. As the on-duty transient astronomer, perform a complete photometric...

Browse All 200+ Environments →

Open Source

The Gym-Anything Library

A Gym-like Python API to turn any software into a computer-use agent environment with a simple make() call.

quickstart.py

import gym_anything

# Turn any software into an env
env = gym_anything.make("blender.modeling@1.0")

obs = env.reset()

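# `agent` is your own policy object (not defined in this snippet);
# it must map a screenshot to an action.
for step in range(500):
    screenshot = obs["screen"]
    action = agent.act(screenshot)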

    obs, reward, done, info = env.step(action)
    if done:
        break

score = env.verify()
print(f"Task score: {score}")

env.close()
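The quickstart loop can be wrapped in a helper that always releases the environment, even if the agent raises mid-episode. The sketch below uses only the method names shown in the quickstart (`reset`, `step`, `verify`, `close`); `run_episode`, `FakeEnv`, `NoOpAgent`, and the `{"type": "noop"}` action format are hypothetical stand-ins so the example runs without the library installed, and they say nothing about the real action schema.

```python
def run_episode(env, agent, max_steps=500):
    """Run one episode and return (score, steps_taken); always closes env."""
    steps_taken = 0
    try:
        obs = env.reset()
        for _ in range(max_steps):
            action = agent.act(obs["screen"])
            obs, reward, done, info = env.step(action)
            steps_taken += 1
            if done:
                break
        return env.verify(), steps_taken
    finally:
        env.close()  # runs on success and on error


class FakeEnv:
    """Stand-in with the quickstart's method names; finishes after 3 steps."""

    def __init__(self):
        self.steps = 0
        self.closed = False

    def reset(self):
        self.steps = 0
        return {"screen": b""}

    def step(self, action):
        self.steps += 1
        return {"screen": b""}, 0.0, self.steps >= 3, {}

    def verify(self):
        return 1.0

    def close(self):
        self.closed = True


class NoOpAgent:
    """Hypothetical agent that ignores the screenshot."""

    def act(self, screenshot):
        return {"type": "noop"}  # placeholder action, not the real schema


fake_env = FakeEnv()
score, steps = run_episode(fake_env, NoOpAgent(), max_steps=500)
print(f"Task score: {score} after {steps} steps")
```

In real use, the `FakeEnv()` instance would be replaced by the object returned from `gym_anything.make(...)` and `NoOpAgent` by your own policy.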

Any Software, One API

Desktop apps, web platforms, Android apps — wrap any software as an environment. Your agent interacts through screenshots and mouse/keyboard, just like a human would.

Runs Anywhere

Docker, QEMU, Apptainer, or Apple Virtualization Framework. Works on your laptop, a cloud VM, or a rootless SLURM cluster.

Distributed Execution

Run hundreds of environments across remote machines with health monitoring, fault recovery, and automatic load balancing.

Trajectory Logging & Monitoring

Per-step screenshot capture, structured action logs, a live dashboard, and VM state checkpointing for fast resets.

💻 View on GitHub 📖 Documentation

Citation

BibTeX

@misc{aggarwal2026gymanythingturnsoftwareagent,
  title={Gym-Anything: Turn any Software into an Agent Environment},
  author={Pranjal Aggarwal and Graham Neubig and Sean Welleck},
  year={2026},
  eprint={2604.06126},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2604.06126},
}

📄 arXiv 💻 GitHub 📊 Dataset