L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

Length Control for Reasoning Language Models with just a Prompt!

Precisely control reasoning length

Pranjal Aggarwal, Sean Welleck

Carnegie Mellon University

The Challenge of Length Control

Reasoning language models have shown an uncanny ability to improve performance at test-time by "thinking longer"—that is, by generating longer chain-of-thought sequences and hence using more compute. However, these models lack dynamic control over output length, leading to three critical problems:

Computational Waste

In some cases, sequences span tens of thousands of tokens, wasting compute when shorter reasoning would suffice.

Premature Halting

Without length control, models may stop too early on complex problems, failing to allocate enough reasoning steps.

Unexplored Trade-offs

There is no way to calibrate inference compute budgets for target performance levels, leaving potential efficiency gains unexplored.

Our Solution: Length Controlled Policy Optimization (LCPO)

We propose Length Controlled Policy Optimization (LCPO), a simple reinforcement learning method that gives reasoning language models adaptive control over reasoning length using just a prompt.

Two Variants

LCPO-Exact: Requires the generated reasoning to be exactly equal to the target length.

Example prompt: "Think for exactly 512 tokens."

Use case: When precise control is needed for benchmarking or exact token budgeting.

LCPO-Max: Requires the generated reasoning to be no longer than the target length, allowing flexibility while respecting an upper bound.

Example prompt: "Think for maximum 1024 tokens."

Use case: When limiting maximum computation while allowing flexibility for problem difficulty.

How It Works

1. Problem Formulation

Given an input prompt x and a target length n_gold, generate a response whose length n_y minimizes |n_gold - n_y| while producing the correct answer.

2. Prompt Augmentation

Each prompt is augmented with a target length instruction:

x_new = Concat(x, "Think for exactly n_gold tokens.")
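
For concreteness, here is a minimal sketch of this augmentation step in Python, covering both variants described above. The function name and string handling are illustrative assumptions rather than the released training code; only the two instruction templates come from the prompts shown earlier.

# Minimal sketch of LCPO prompt augmentation (illustrative, not the released code).

def augment_prompt(prompt: str, n_gold: int, mode: str = "exact") -> str:
    """Append a target-length instruction to the input prompt.

    mode="exact" corresponds to LCPO-Exact, mode="max" to LCPO-Max.
    """
    if mode == "exact":
        instruction = f"Think for exactly {n_gold} tokens."
    elif mode == "max":
        instruction = f"Think for maximum {n_gold} tokens."
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return f"{prompt} {instruction}"


# Example usage:
x = "What is the sum of the first 100 positive integers?"
x_exact = augment_prompt(x, n_gold=512)              # "... Think for exactly 512 tokens."
x_max = augment_prompt(x, n_gold=1024, mode="max")   # "... Think for maximum 1024 tokens."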

3. Reinforcement Learning

We optimize using a reward function that balances accuracy and length adherence:

r(y, y_gold, n_gold) = I(y = y_gold) - α · |n_gold - n_y|

where α controls the trade-off between correctness and length matching.
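
The sketch below shows one way to compute this reward in Python. The α value is a placeholder rather than the tuned hyperparameter, and the LCPO-Max variant included at the end is a simplified stand-in for the paper's budget-constrained reward, shown only to convey the intent of the "no longer than" constraint.

# Minimal sketch of the LCPO reward (alpha is a placeholder, not the paper's tuned value).

def lcpo_exact_reward(y: str, y_gold: str, n_y: int, n_gold: int,
                      alpha: float = 1e-3) -> float:
    """r(y, y_gold, n_gold) = I(y == y_gold) - alpha * |n_gold - n_y|."""
    correct = 1.0 if y == y_gold else 0.0
    return correct - alpha * abs(n_gold - n_y)


def lcpo_max_reward(y: str, y_gold: str, n_y: int, n_gold: int) -> float:
    """Simplified stand-in for the LCPO-Max objective (not the paper's exact
    formulation): reward correctness only when the generation stays within
    the token budget."""
    correct = 1.0 if y == y_gold else 0.0
    return correct if n_y <= n_gold else 0.0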

Key Results

Up to 2×

Performance improvement per token over the S1 method

~3%

Mean length deviation on math reasoning tasks

+2%

Margin by which our 1.5B model outperforms GPT-4o at equal reasoning lengths

Performance Across Token Budgets

Figure: Performance across token budgets for our methods (L1-Exact, L1-Max) and baselines (S1 budget forcing, Agentica-4K, Agentica-24K, DeepSeek-R1-1.5B).

L1 significantly outperforms the S1 method, by up to 100% relative and 20% absolute improvement, across all token budgets.

Surprising Findings

Long CoT Models are Secretly Strong Short CoT Models

Key Insight:

Our L1-1.5B model trained with LCPO outperforms its original counterpart by significant margins (up to 10% improvement) and even matches GPT-4o while using the same token budget.

Model Performance Comparison

Each pair of models uses the same generation length.

This is the first demonstration that a 1.5B model can match the performance of GPT-4o while using the same generation length.

Generalizes to Out-of-Distribution Tasks

Key Insight:

L1's length control capabilities generalize to domains outside its training distribution, including logical reasoning (GPQA, LSAT) and general knowledge (MMLU).

OOD Task Performance

Performance scales positively with token budget even on OOD tasks

L1's length control can generalize to new domains, matching base model performance at comparable token budgets.

Citation

@misc{aggarwal2025l1controllinglongreasoning,
      title={L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning},
      author={Pranjal Aggarwal and Sean Welleck},
      year={2025},
      eprint={2503.04697},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.04697},
}