L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

Length Control for Reasoning Language Models with just a Prompt!

Precisely control reasoning length

Pranjal Aggarwal, Sean Welleck

Carnegie Mellon University

The Challenge of Length Control

Reasoning language models have shown an uncanny ability to improve performance at test-time by "thinking longer"—that is, by generating longer chain-of-thought sequences and hence using more compute. However, these models lack dynamic control over output length, leading to three critical problems:

Computational Waste

In some cases, sequences span tens of thousands of tokens, wasting compute when shorter reasoning would suffice.

Premature Halting

Without length control, models may stop too early on complex problems, failing to allocate enough reasoning steps.

Unexplored Trade-offs

There is no way to calibrate inference compute budgets for target performance levels, leaving potential efficiency gains unexplored.

Our Solution: Length Controlled Policy Optimization (LCPO)

We propose Length Controlled Policy Optimization (LCPO), a simple reinforcement learning method that gives reasoning language models adaptive control over reasoning length using just a prompt.

Two Variants

LCPO-Exact: Requires the generated reasoning to be exactly equal to the target length.

Example prompt: "Think for exactly 512 tokens."

Use case: When precise control is needed for benchmarking or exact token budgeting.

LCPO-Max: Requires the generated reasoning to be no longer than the target length, allowing flexibility while respecting an upper bound.

Example prompt: "Think for maximum 1024 tokens."

Use case: When limiting maximum computation while allowing flexibility for problem difficulty.

How It Works

1. Problem Formulation

Given an input prompt x and a target length n_gold, generate a response whose length n_y minimizes |n_gold - n_y| while producing the correct answer.

2. Prompt Augmentation

Each prompt is augmented with a target length instruction:

x_new = Concat(x, "Think for exactly n_gold tokens.")
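
For concreteness, here is a minimal sketch of this augmentation step in Python, covering both variants described above. The function name and string handling are illustrative assumptions rather than the released training code; only the two instruction templates come from the prompts shown earlier.

# Minimal sketch of LCPO prompt augmentation (illustrative, not the released code).

def augment_prompt(prompt: str, n_gold: int, mode: str = "exact") -> str:
    """Append a target-length instruction to the input prompt.

    mode="exact" corresponds to LCPO-Exact, mode="max" to LCPO-Max.
    """
    if mode == "exact":
        instruction = f"Think for exactly {n_gold} tokens."
    elif mode == "max":
        instruction = f"Think for maximum {n_gold} tokens."
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return f"{prompt} {instruction}"


# Example usage:
x = "What is the sum of the first 100 positive integers?"
x_exact = augment_prompt(x, n_gold=512)              # "... Think for exactly 512 tokens."
x_max = augment_prompt(x, n_gold=1024, mode="max")   # "... Think for maximum 1024 tokens."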

3. Reinforcement Learning

We optimize using a reward function that balances accuracy and length adherence:

r(y, y_gold, n_gold) = I(y = y_gold) - α · |n_gold - n_y|

where α controls the trade-off between correctness and length matching.
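
The sketch below shows one way to compute this reward in Python. The α value is a placeholder rather than the tuned hyperparameter, and the LCPO-Max variant included at the end is a simplified stand-in for the paper's budget-constrained reward, shown only to convey the intent of the "no longer than" constraint.

# Minimal sketch of the LCPO reward (alpha is a placeholder, not the paper's tuned value).

def lcpo_exact_reward(y: str, y_gold: str, n_y: int, n_gold: int,
                      alpha: float = 1e-3) -> float:
    """r(y, y_gold, n_gold) = I(y == y_gold) - alpha * |n_gold - n_y|."""
    correct = 1.0 if y == y_gold else 0.0
    return correct - alpha * abs(n_gold - n_y)


def lcpo_max_reward(y: str, y_gold: str, n_y: int, n_gold: int) -> float:
    """Simplified stand-in for the LCPO-Max objective (not the paper's exact
    formulation): reward correctness only when the generation stays within
    the token budget."""
    correct = 1.0 if y == y_gold else 0.0
    return correct if n_y <= n_gold else 0.0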

Key Results

Up to 2×

Performance improvement per token over the S1 method

~3%

Mean length deviation on math reasoning tasks

+2%

Margin by which our 1.5B model outperforms GPT-4o at equal reasoning lengths

Performance Across Token Budgets

Figure: Performance across token budgets for our methods (L1-Exact, L1-Max) and baselines (S1 budget forcing, Agentica-4K, Agentica-24K, DeepSeek-R1-1.5B).

L1 significantly outperforms the S1 method, by up to 100% relative and 20% absolute improvement, across all token budgets.

Surprising Findings

Long CoT Models are Secretly Strong Short CoT Models

Key Insight:

Our L1-1.5B model trained with LCPO outperforms its original counterpart by significant margins (up to 10% improvement) and even matches GPT-4o while using the same token budget.

Model Performance Comparison

Each pair of models uses the same generation length.

This is the first demonstration that a 1.5B model can match the performance of GPT-4o while using the same generation length.

Generalizes to Out-of-Distribution Tasks

Key Insight:

L1's length control capabilities generalize to domains outside its training distribution, including logical reasoning (GPQA, LSAT) and general knowledge (MMLU).

OOD Task Performance

Performance scales positively with token budget even on OOD tasks

L1's length control can generalize to new domains, matching base model performance at comparable token budgets.

Citation

@misc{aggarwal2025l1controllinglongreasoning,
      title={L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning},
      author={Pranjal Aggarwal and Sean Welleck},
      year={2025},
      eprint={2503.04697},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.04697},
}