Gym-Anything

Turn Any Software into an Agent Environment

Pranjal Aggarwal · Graham Neubig · Sean Welleck
Carnegie Mellon University

Have a piece of software? Gym-Anything automatically converts it into a computer-use agent environment, set up with realistic data, tasks, and verification.

200+ real-world applications, from GIMP and Blender to NinjaTrader and OpenEMR. Each is installed, configured, and populated with realistic data inside a sandboxed VM.

200+
Software Environments

~50 tasks per software, created through agentic seed generation and LLM amplification. Each task has a checklist-based verifier with privileged information for reliable evaluation.

10,000+
Tasks & Environments

The Standard Occupational Classification (SOC) system defines 22 major occupation groups used by the U.S. Bureau of Labor Statistics. CUA-World covers all 22, from Healthcare to Construction to Arts.

22/22
SOC Occupation Groups

Environments run on Linux (desktop apps), Windows (via QEMU), and Android (via emulators), all managed through the same unified API.

3
Operating Systems

How It Works

Our key insight: setting up computer-use agent environments is itself a coding and computer-use agent task. Gym-Anything automates the entire pipeline.

01 · SELECT

GDP-Grounded Selection

We identify which software matters most by decomposing U.S. GDP data into per-software economic contributions, then select 200 high-impact applications across all major occupation groups.
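One way to picture this selection step is as a two-stage allocation: each occupation group's economic weight is split across the software its workers use, the per-software totals are summed, and applications are picked so that every group is represented. The sketch below is illustrative only. The group weights, usage shares, placeholder app names, and the `select_software` helper are all hypothetical and are not taken from the project.

```python
from collections import defaultdict

# Hypothetical inputs (arbitrary units): economic weight per occupation group
# and the share of each group's work attributed to each application.
group_weight = {"Healthcare": 100.0, "Arts": 40.0, "Construction": 60.0}
usage_share = {
    "Healthcare": {"app_a": 0.6, "app_b": 0.4},
    "Arts": {"app_c": 0.7, "app_d": 0.3},
    "Construction": {"app_d": 0.2, "app_e": 0.8},
}


def software_contributions(weights, shares):
    """Sum each application's contribution over all occupation groups."""
    totals = defaultdict(float)
    for group, weight in weights.items():
        for app, share in shares[group].items():
            totals[app] += weight * share
    return dict(totals)


def select_software(weights, shares, k):
    """Pick k apps: the top app of every group first, then fill by total contribution.

    Assumes k is at least the number of groups, so coverage is preserved.
    """
    totals = software_contributions(weights, shares)
    chosen = []
    for group in weights:
        best = max(shares[group], key=lambda app: shares[group][app])
        if best not in chosen:
            chosen.append(best)
    for app in sorted(totals, key=totals.get, reverse=True):
        if len(chosen) >= k:
            break
        if app not in chosen:
            chosen.append(app)
    return chosen[:k]


totals = software_contributions(group_weight, usage_share)
selected = select_software(group_weight, usage_share, k=4)
```

Seeding the selection with one application per group is a simple way to guarantee coverage before ranking by impact; the actual procedure may balance these goals differently.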

02 · CREATE

Agent-Built Environments

A coding agent automatically installs, configures, and populates each software with real-world data. An independent audit agent verifies correctness against quality checklists, iterating until the environment is production-ready.
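The build-then-audit loop described above can be sketched as follows. This is a toy model under stated assumptions: the `Environment` fields, the three checklist items, and the `build_step` and `audit` functions are hypothetical stand-ins for the coding agent and the audit agent, not the project's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Toy environment state tracked during setup (hypothetical fields)."""
    installed: bool = False
    configured: bool = False
    populated: bool = False
    log: list = field(default_factory=list)


# Hypothetical quality checklist: item name -> attribute that must be True.
CHECKLIST = {
    "software installed": "installed",
    "software configured": "configured",
    "realistic data loaded": "populated",
}


def audit(env):
    """Stand-in for the audit agent: return the checklist items that fail."""
    return [item for item, attr in CHECKLIST.items() if not getattr(env, attr)]


def build_step(env, failures):
    """Stand-in for the coding agent: address the first reported failure."""
    item = failures[0]
    setattr(env, CHECKLIST[item], True)
    env.log.append(f"fixed: {item}")


def create_environment(max_rounds=10):
    """Alternate auditing and fixing until the checklist passes."""
    env = Environment()
    for round_idx in range(max_rounds):
        failures = audit(env)
        if not failures:
            return env, round_idx
        build_step(env, failures)
    raise RuntimeError("environment did not pass the audit within max_rounds")


env, rounds = create_environment()
```

Keeping the auditor separate from the builder means the stopping condition does not depend on the builder's own judgment of its work, which is the point of an independent audit.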

03 · SCALE

Task Generation

An agentic model creates 5 high-quality seed tasks per software by actually running it. An LLM then amplifies these to 75 tasks each, yielding 10K+ verified tasks with checklist-based evaluation.

Gym-Anything pipeline overview

Covering Every Occupation and Domain

CUA-World covers tasks and software related to all 22 major occupation groups, from healthcare and education to engineering, finance, and creative arts.

Occupation
Domain
Software
Sample Task
3D Slicer — Medical image analysis
Segment the complete tumor region on the loaded brain MRI (FLAIR, T1, T1ce, T2 sequences) and create a 3D visualization. Save the segmentation and report the tumor volume in mL...

Leaderboard: CUA-World-Long

200 long-horizon tasks, one per software. Tasks often require 500+ steps. The best configuration achieves a 27.5% pass rate, highlighting the difficulty of long-horizon tasks.

| # | Model | Method | Provider | Max Steps / Task | Avg Score | Pass Rate | Notes |
|---|-------|--------|----------|------------------|-----------|-----------|-------|
| 1 | GPT-5.4 | Direct (2000 steps) | OpenAI | 2,000 | 55.5 | 27.5% | Extended: 2,000-step budget, no cost cap |
| 2 | Gemini 3 Flash | + Test-Time Auditing | Google | 2,000 | 39.9 | 14.0% | TTA: auditor reviews trajectory mid-run and provides corrective feedback |
| 3 | Gemini 3 Flash | Direct (2000 steps) | Google | 2,000 | 37.4 | 11.5% | 4× budget: extended to 2,000 steps vs standard 500 |
| 4 | Gemini 3 Flash | Direct (500 steps) | Google | 500 | 35.4 | 7.5% | Standard: baseline 500-step run |
| 5 | Sonnet 4.6 | Direct (500 steps) | Anthropic | 500 | 20.5 | 6.0% | Computer Use: native computer-use capability |
| 6 | Kimi-K 2.5 | Direct (500 steps) | Moonshot AI | 500 | 33.9 | 5.5% | Computer Use: native computer-use capability |
| 7 | GPT-5.4 | Direct (500 steps) | OpenAI | 500 | 22.7 | 3.0% | Computer Use: via Operator / CUA API |

Evaluation uses checklist-based VLM verification with privileged information. Pass Rate = fraction of tasks fully completed. Standard budget: 500 steps or $5; extended entries use 2,000 steps.
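To make the two metrics concrete, the sketch below computes them from per-task checklists. The scoring rule is an assumption: it treats a task's score as the percentage of checklist items satisfied and counts a task as passed only when every item is satisfied. The page states only that Pass Rate is the fraction of tasks fully completed, and the four checklists here are made-up illustrations.

```python
def task_score(checklist):
    """Percentage of checklist items satisfied (assumed scoring rule)."""
    return 100.0 * sum(checklist) / len(checklist)


def summarize(results):
    """Aggregate per-task checklists into an average score and a pass rate."""
    scores = [task_score(items) for items in results]
    passed = [all(items) for items in results]
    return {
        "avg_score": sum(scores) / len(scores),
        "pass_rate": 100.0 * sum(passed) / len(passed),
    }


# Made-up verifier outcomes for four tasks; True means the item was satisfied.
results = [
    [True, True, True, True],      # fully completed
    [True, True, False, False],    # half of the items
    [True, False, False, False],   # one item
    [False, False, False, False],  # nothing completed
]
summary = summarize(results)
```

Under this reading, partial progress raises the average score without raising the pass rate, which is one way an agent can have a much higher Avg Score than Pass Rate.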
Submit your results: coming soon

Key Results

Performance scales with both training data and test-time compute, while generalization to unseen software remains a challenge.

Training Data Scaling

Avg. Score on CUA-World-Test improves log-linearly with more software and tasks.
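A log-linear trend means the score is roughly a straight line in the logarithm of the training-set size, score ≈ a + b · ln(n), so each doubling of data adds about the same number of points. The sketch below fits that form with ordinary least squares. The data points are synthetic, generated from a known line purely to show the fit; they are not results from the project.

```python
import math


def fit_log_linear(ns, scores):
    """Ordinary least squares for score = a + b * ln(n)."""
    xs = [math.log(n) for n in ns]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(scores) / len(scores)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, scores))
    variance = sum((x - x_mean) ** 2 for x in xs)
    b = covariance / variance
    a = y_mean - b * x_mean
    return a, b


# Synthetic points generated from score = 5 + 3 * ln(n); NOT measured results.
ns = [10, 100, 1000, 10000]
scores = [5 + 3 * math.log(n) for n in ns]

a, b = fit_log_linear(ns, scores)
gain_per_doubling = b * math.log(2)  # constant gain for every 2x more data
```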

Generalization

Training on a subset of software recovers most gains on seen software, but barely helps on unseen software.

Test-Time Compute

Pass rate on CUA-World-Long scales with step budget. Test-Time Auditing provides further gains.
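The leaderboard describes Test-Time Auditing as an auditor that reviews the trajectory mid-run and provides corrective feedback. A minimal sketch of such a loop is shown below. The function names, the fixed audit interval, and the toy agent and auditor are hypothetical; how the real auditor is triggered and how its feedback reaches the agent are not specified on this page.

```python
def run_with_auditing(agent_step, auditor, max_steps, audit_every):
    """Run an agent under a step budget, auditing every `audit_every` steps."""
    trajectory = []
    feedback = None
    for step in range(1, max_steps + 1):
        action, done = agent_step(trajectory, feedback)
        trajectory.append(action)
        feedback = None  # feedback is consumed by exactly one step
        if done:
            break
        if step % audit_every == 0:
            feedback = auditor(trajectory)  # may be None if nothing is wrong
    return trajectory


def toy_agent(trajectory, feedback):
    """Toy policy: repeats a wrong action until it receives feedback."""
    if feedback is not None:
        return "corrected", False
    if trajectory and trajectory[-1] == "corrected":
        return "submit", True
    return "wrong", False


def toy_auditor(trajectory):
    """Toy auditor: flags the run if the latest action is wrong."""
    return "last action looks wrong" if trajectory[-1] == "wrong" else None


def silent_auditor(trajectory):
    """Baseline without auditing: never gives feedback."""
    return None


audited = run_with_auditing(toy_agent, toy_auditor, max_steps=10, audit_every=3)
unaudited = run_with_auditing(toy_agent, silent_auditor, max_steps=10, audit_every=3)
```

In this toy setup the audited run recovers and submits, while the unaudited run exhausts its budget repeating the same mistake.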

Explore CUA-World

Browse 200+ software environments, tasks, and agent trajectories.

Unipro UGENE
Sample Task: You are a molecular biologist designing a recombinant expression construct to clone the human erythropoietin (EPO) coding sequence into a pET-28a(+) expression...

VitalRecorder
Sample Task: You are an anesthesiologist preparing a case for the department's Morbidity & Mortality conference. Vital Recorder is open with surgical case 0001.vital loaded...

JASP
Sample Task: The file '/home/ga/Documents/JASP/WorldHappiness.csv' is already open in JASP. This is real data from the World Happiness Report covering 155 countries with...

OpenBCI GUI
Sample Task: You are a neurofeedback clinician setting up a hemispheric alpha asymmetry monitoring station, a standard clinical protocol for depression assessment. The...

Panoply
Sample Task: You are a synoptic meteorologist preparing a Northern Hemisphere winter circulation diagnostic for a WMO regional training workshop. Using the briefing at...

Subsurface
Sample Task: You are a technical diving instructor preparing an Advanced Trimix training course at the Blue Hole, Dahab, Egypt (GPS: 28.5720 N, 34.5412 E). Configure the...

Weasis
Sample Task: Perform a quantitative tissue density characterization of the loaded chest CT phantom study. Find the slice where the central mediastinal structure has its...

KStars
Sample Task: A ZTF transient alert reports a supernova candidate near host galaxy NGC 4526. As the on-duty transient astronomer, perform a complete photometric...


The Gym-Anything Library

A Gym-like Python API to turn any software into a computer-use agent environment with a simple make() call.

quickstart.py

import gym_anything

# Turn any software into an env
env = gym_anything.make("blender.modeling@1.0")

# `agent` is your own policy object; it is not defined in this snippet.
obs = env.reset()

for step in range(500):  # step budget for the episode
    screenshot = obs["screen"]
    action = agent.act(screenshot)

    obs, reward, done, info = env.step(action)
    if done:
        break

# Run the task's verifier and report the score
score = env.verify()
print(f"Task score: {score}")

env.close()

Any Software, One API

Desktop apps, web platforms, Android apps — wrap any software as an environment. Your agent interacts through screenshots and mouse/keyboard, just like a human would.
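To try the agent loop without installing anything, the sketch below pairs a random-click agent with a stand-in environment that exposes the same method names as the quickstart (`reset`, `step`, `verify`, `close`). Everything here is a hypothetical illustration: `FakeEnv`, `RandomClickAgent`, the grid-based task, and the action dictionary format are assumptions, and the library's real action schema is not specified on this page.

```python
import random


class FakeEnv:
    """Stand-in environment mirroring the quickstart's method names."""

    def __init__(self, target=(3, 4)):
        self.target = target
        self.solved = False

    def reset(self):
        self.solved = False
        return {"screen": b""}  # placeholder for screenshot bytes

    def step(self, action):
        if action["type"] == "click" and (action["x"], action["y"]) == self.target:
            self.solved = True
        return {"screen": b""}, 0.0, self.solved, {}

    def verify(self):
        return 1.0 if self.solved else 0.0

    def close(self):
        pass


class RandomClickAgent:
    """Ignores the screenshot and clicks a random cell of a small grid."""

    def __init__(self, width, height, seed=0):
        self.width, self.height = width, height
        self.rng = random.Random(seed)

    def act(self, screenshot):
        return {
            "type": "click",
            "x": self.rng.randrange(self.width),
            "y": self.rng.randrange(self.height),
        }


def run_episode(env, agent, max_steps):
    """Same control flow as the quickstart, with cleanup guaranteed."""
    obs = env.reset()
    try:
        for _ in range(max_steps):
            obs, reward, done, info = env.step(agent.act(obs["screen"]))
            if done:
                break
        return env.verify()
    finally:
        env.close()


score = run_episode(FakeEnv(), RandomClickAgent(width=5, height=5), max_steps=500)
```

Wrapping the loop in `try`/`finally` ensures `close()` runs even if the agent raises, which matters more when each environment holds a VM or container.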

Runs Anywhere

Docker, QEMU, Apptainer, or Apple Virtualization Framework. Works on your laptop, a cloud VM, or a rootless SLURM cluster.

Distributed Execution

Run hundreds of environments across remote machines with health monitoring, fault recovery, and automatic load balancing.

Trajectory Logging & Monitoring

Per-step screenshot capture, structured action logs, a live dashboard, and VM state checkpointing for fast resets.

💻 View on GitHub 📖 Documentation

BibTeX

@misc{aggarwal2026gymanythingturnsoftwareagent,
  title={Gym-Anything: Turn any Software into an Agent Environment},
  author={Pranjal Aggarwal and Graham Neubig and Sean Welleck},
  year={2026},
  eprint={2604.06126},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2604.06126},
}