NeurIPS 2024 Tutorial:
Beyond Decoding: Meta-Generation Algorithms for Large Language Models

Sean Welleck¹

Amanda Bertsch¹

Matthew Finlayson²

Alex Xie¹

Graham Neubig¹

Konstantin Golobokov⁵

Hailey Schoelkopf³

Ilia Kulikov⁴

Zaid Harchaoui⁵

¹Carnegie Mellon University ²University of Southern California ³Work done while at EleutherAI ⁴Meta AI ⁵University of Washington

Tuesday December 10, 1:30-4:00pm @ East Exhibition Hall C, NeurIPS

[Slides] [Code] [TMLR Survey Paper] [NeurIPS.cc Page]

About this tutorial

One of the most striking findings in modern research on large language models (LLMs) is that, given a model and dataset of sufficient scale, scaling up compute at training time leads to better final results. A second, less frequently discussed scaling phenomenon also exists: adopting more sophisticated methods and/or scaling compute at inference time can yield significantly better output from LLMs. This tutorial covers past and present classes of algorithms for generating text from autoregressive LLMs, ranging from greedy decoding to the sophisticated meta-generation algorithms that power compound AI systems. We place special emphasis on techniques for making these algorithms efficient, in terms of both token cost and generation speed. The tutorial unifies perspectives from three research communities: traditional natural language processing, modern LLMs, and machine learning systems. We aim to make attendees aware of (meta-)generation algorithms as a promising direction for improving quality, increasing diversity, and enabling resource-constrained research on LLMs.
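To make the distinction between a generation algorithm and a meta-generation algorithm concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not code from the tutorial: `TOY_MODEL`, `greedy_decode`, `sample`, `best_of_n`, and `score` are all hypothetical names, the "model" is a hand-written table of next-token probabilities, and the scorer is a stand-in for a reward model or verifier.

```python
import random

# Hypothetical toy "language model": next-token probabilities
# conditioned only on the previous token (an assumption for brevity).
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.5, "b": 0.3, "</s>": 0.2},
    "b": {"b": 0.7, "</s>": 0.3},
}


def greedy_decode(model, max_len=5):
    """Generation algorithm: take the most probable token at every step."""
    tokens, prev = [], "<s>"
    for _ in range(max_len):
        dist = model[prev]
        prev = max(dist, key=dist.get)
        if prev == "</s>":
            break
        tokens.append(prev)
    return tokens


def sample(model, rng, max_len=5):
    """Generation algorithm: ancestral sampling, one token at a time."""
    tokens, prev = [], "<s>"
    for _ in range(max_len):
        dist = model[prev]
        prev = rng.choices(list(dist), weights=list(dist.values()))[0]
        if prev == "</s>":
            break
        tokens.append(prev)
    return tokens


def score(tokens):
    """Hypothetical scorer standing in for a reward model or verifier."""
    return tokens.count("b")


def best_of_n(model, scorer, n, rng):
    """Meta-generation algorithm: call a generator n times, keep the best."""
    candidates = [sample(model, rng) for _ in range(n)]
    return max(candidates, key=scorer)


if __name__ == "__main__":
    print("greedy:   ", greedy_decode(TOY_MODEL))
    print("best-of-8:", best_of_n(TOY_MODEL, score, 8, random.Random(0)))
```

In this sketch, `best_of_n` treats the sampler as a black box and spends more inference-time compute (larger `n`) to find a higher-scoring output, which is the general pattern behind the meta-generation methods in the reading list.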

Schedule

Our tutorial will be held on Tuesday, December 10, 1:30pm - 4:00pm (all times are Vancouver local time).

[ALL SLIDES]

Time	Section	Presenter
1:30pm - 1:40pm	Section 1: Introduction [Slides]	Sean
1:40pm - 2:10pm	Section 2: Generation algorithms [Slides]	Matthew
2:10pm - 2:50pm	Section 3: Meta-generation algorithms [Slides]	Sean
2:50pm - 3:20pm	Section 4: Efficient generation [Slides]	Hailey
3:20pm - 3:25pm	Section 5: Conclusion [Slides]	Sean
3:25pm - 3:55pm	Panel discussion	Ilia

Reading List

Primary Reference

From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models (Welleck et al., 2024)

Section 1: Introduction

From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, Section 1 (Welleck et al., 2024)
Scaling Laws for Neural Language Models (Kaplan et al., 2020)
Scaling Instruction-Finetuned Language Models (Chung et al., 2022)
Learning to Reason with LLMs (OpenAI, 2024)
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)
Competition-Level Code Generation with AlphaCode (Li et al., 2022)
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling (Brown et al., 2024)
Leveraging training and search for better software engineering agents (Nebius, 2024)
Critique-out-Loud Reward Models (Ankner et al., 2024)
The Shift from Models to Compound AI Systems (Zaharia et al., 2024)

Section 2: Generation Algorithms

From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, Sections 2, 3 (Welleck et al., 2024)
Beam Search Strategies for Neural Machine Translation (Freitag and Al-Onaizan, 2017)
A Thorough Examination of Decoding Methods in the Era of LLMs (Shi et al., 2024)
On NMT Search Errors and Model Errors: Cat Got Your Tongue? (Stahlberg and Byrne, 2019)
Locally Typical Sampling (Meister et al., 2022)
If Beam Search is the Answer, What Was the Question? (Meister et al., 2020)
Truncation Sampling as Language Model Desmoothing (Hewitt et al., 2022)
The Curious Case of Neural Text Degeneration (Holtzman et al., 2020)
Closing the Curious Case of Neural Text Degeneration (Finlayson et al., 2024)
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, 2018)
Neural Text Generation with Unlikelihood Training (Welleck et al., 2019)
DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts (Liu et al., 2021)
Contrastive Decoding: Open-ended Text Generation as Optimization (Li et al., 2022)

Section 3: Meta-Generation Algorithms

From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, Sections 4, 5, 6 (Welleck et al., 2024)
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling (Brown et al., 2024)
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models (Wu et al., 2024)
Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters (Snell et al., 2024)
Competition-Level Code Generation with AlphaCode (Li et al., 2022)
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)
The Expressive Power of Transformers with Chain of Thought (Merrill et al., 2023)
Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective (Feng et al., 2023)
On the Representational Capacity of Neural Language Models with Chain-of-Thought Reasoning (Nowak et al., 2024)
Measuring and Narrowing the Compositionality Gap in Language Models (Press et al., 2023)
Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP (Khattab et al., 2022)
Large Language Model Programs (Schlag et al., 2023)
Language Model Cascades (Dohan et al., 2022)
The Shift from Models to Compound AI Systems (Zaharia et al., 2024)
System 2 Attention (is something you might need too) (Weston and Sukhbaatar, 2023)
Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs (Jiang et al., 2023)
Training Verifiers to Solve Math Word Problems (Cobbe et al., 2021)
Learning to Summarize with Human Feedback (Stiennon et al., 2020)
WebGPT: Browser-Assisted Question-Answering with Human Feedback (Nakano et al., 2022)
Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2023)
It’s MBR All the Way Down: Modern Generation Techniques Through the Lens of Minimum Bayes Risk (Bertsch et al., 2023)
Making Large Language Models Better Reasoners with Step-Aware Verifier (Li et al., 2022)
Solving math word problems with process- and outcome-based feedback (Uesato et al., 2022)
Let's Verify Step by Step (Lightman et al., 2023)
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision (Sun et al., 2024)
Generative Verifiers: Reward Modeling as Next-Token Prediction (Zhang et al., 2024)
Math-Shepherd: Verify and Reinforce LLMs Step-by-Step without Human Annotations (Wang et al., 2024)
Mastering the Game of Go with Deep Neural Networks and Tree Search (Silver et al., 2016)
Mastering the Game of Go Without Human Knowledge (Silver et al., 2017)
Generative Language Modeling for Automated Theorem Proving (Polu et al., 2020)
Formal Mathematics Statement Curriculum Learning (Polu et al., 2023)
HyperTree Proof Search for Neural Theorem Proving (Lample et al., 2022)
Tree Search for Language Model Agents (Koh et al., 2024)
AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement (Aggarwal et al., 2024)
Self-Refine: Iterative Refinement with Self-Feedback (Madaan et al., 2023)
Teaching Large Language Models to Self-Debug (Chen et al., 2024)
A Theoretical Understanding of Self-Correction through In-context Alignment (Wang et al., 2024)
OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs (Asai et al., 2024)
Large Language Models Cannot Self-Correct Reasoning Yet (Huang et al., 2024)
Generating Sequences by Learning to Self-Correct (Welleck et al., 2023)
Training Language Models to Self-Correct via Reinforcement Learning (Kumar et al., 2024)

Section 4: Efficient Generation

From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, Section 7 (Welleck et al., 2024)
Making Deep Learning Go Brrrr from First Principles (He, 2022)
How Does Batching Work on Modern GPUs? (Timbers, 2024)
A Visual Guide to Quantization (Grootendorst, 2024)
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers et al., 2022)
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., 2022)
Orca: A Distributed Serving System for Transformer-Based Generative Models (Yu et al., 2022)
Fast Inference from Transformers via Speculative Decoding (Leviathan et al., 2022)
Accelerating Large Language Model Decoding with Speculative Sampling (Chen et al., 2023)
Cold-Compress 1.0: A Hackable Toolkit for KV-Cache Compression (Adams et al., 2024)
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Ainslie et al., 2023)
Fast Transformer Decoding: One Write-Head is All You Need (Shazeer, 2019)
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (Zhang et al., 2023)
Hydragen: High-Throughput LLM Inference with Shared Prefixes (Juravsky et al., 2024)
Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023)
SGLang: Efficient Execution of Structured Language Model Programs (Zheng et al., 2024)

Section 5: Conclusion

From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, Section 8 (Welleck et al., 2024)
AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement (Aggarwal et al., 2024)
Stream of Search (SoS): Learning to Search in Language (Gandhi et al., 2024)
QwQ: Reflect Deeply on the Boundaries of the Unknown (Qwen, 2024)
DeepSeek R1 Lite (DeepSeek, 2024)
Tree Search for Language Model Agents (Koh et al., 2024)
Agent Refinement example (@gneubig, 2024)
Leveraging training and search for better software engineering agents (Nebius, 2024)
Archon: An Architecture Search Framework for Inference-Time Techniques (Saad-Falcon et al., 2024)

Panel discussion

Join us for an insightful panel discussion featuring a selected group of experts in research related to Large Language Models (LLMs) and meta-generation algorithms. Our panelists are listed below!

Beidi Chen¹

Nouha Dziri²

Rishabh Agarwal³

Jakob Foerster⁴

Noam Brown⁵

¹Carnegie Mellon University ²AI2 ³DeepMind, McGill ⁴Meta AI ⁵OpenAI

BibTeX

@article{welleck2024metageneration,
  title={From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models},
  author={Sean Welleck and Amanda Bertsch and Matthew Finlayson and Hailey Schoelkopf and Alex Xie and Graham Neubig and Ilia Kulikov and Zaid Harchaoui},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2024},
  url={https://openreview.net/forum?id=eskQMcIbMS},
  note={Survey Certification}
}