Advanced Natural Language Processing / Spring 2025

Advanced natural language processing is an introductory graduate-level course on natural language processing aimed at students who are interested in doing cutting-edge research in the field. In it, we describe fundamental tasks in natural language processing as well as methods to solve these tasks. The course focuses on modern methods using neural networks, and covers the basic modeling, learning, and inference algorithms required therefore. The class culminates in a project in which students attempt to reimplement and improve upon a research paper in a topic of their choosing.

Course Details

Instructor

Sean Welleck

Teaching Assistants

Darsh Agrawal

Hugo Contant

Alex Fang

Akshita Gupta

Trisha Sarkar

Manan Sharma

Sanidhya Vijayvargiya

Logistics

Class times: TR 3:30pm - 4:50pm
Room: TEP 1403
Course identifier: LTI 11-711

Office hours:

	Location	Day	Time
Akshita Gupta	GHC 5417	Friday	12:30pm - 1:30pm
Alex Fang	GHC 5417	Monday	10:00am - 11:00am
Darsh Agrawal	GHC 5417	Friday	5:00pm - 6:00pm
Hugo Contant	GHC 5115	Tuesday	8:00am - 9:00am
Manan Sharma	GHC 5417	Tuesday	1:30pm - 2:30pm
Sanidhya Vijayvargiya	GHC 5417	Monday	2:00pm - 3:00pm
Trisha Sarkar	Wean Hall 3002	Tuesday	5:00pm - 6:00pm
Sean Welleck	GHC 6513	Wednesday	5:00pm - 6:00pm

Grading

The assignments will be given a grade of A+ (100), A (96), A- (92), B+ (88), B (85), B- (82), or below.
The final grades will be determined based on the weighted average of the quizzes, assignments, and project. Cutoffs for final grades will be approximately 97+ A+, 93+ A, 90+ A-, 87+ B+, 83+ B, 80+ B-, etc., although we reserve some flexibility to change these thresholds slightly.
Quizzes: Worth 20% of the grade. Your lowest 3 quiz grades will be dropped.
Assignments: There will be 4 assignments (the final one being the project), worth respectively 15%, 15%, 20%, 30% of the grade.

Course description

The course covers key algorithmic foundations and applications of advanced natural language processing.

There are no hard pre-requisites for the course, but programming experience in Python and knowledge of probability and linear algebra are expected. It will be helpful if you have used neural networks previously.

Acknowledgements. This semester's course is adapted from Advanced NLP Fall 2024, designed and taught by Graham Neubig. The course structure (e.g., grading, course description, class format, assignments, poster presentation) is from Advanced NLP Fall 2024. Many lectures are adapted from Advanced NLP Fall 2024; please refer to individual slides.

Class format

Lectures: For each class there will be:

Reading: Most classes will have associated reading material that we recommend you read before the class to familiarize yourself with the topic.
Lecture and Discussion: There will be a lecture and discussion regarding the class material. This will be recorded and posted online for those who cannot make the in-person class.
Code/Data Walkthrough: Some classes will involve looking through code or data.
Quiz: There will be a quiz covering the reading material and/or lecture material that you can fill out on Canvas. The quiz will be released by the end of the day of the class and will be due at the end of the following day.

Recitation Sections: additional recitation sections offer hands-on introductions to practical tools. These take place during the office hours of the TA leading the recitation. For those unable to attend in person, recordings and a hybrid meeting link are provided.

Questions and Discussion: Ideally in class or through piazza so we can share information with the class, but emailing the TA mailing list and coming to office hours are also encouraged.

Schedule

Class

Type

Topic

Resources
# 1 01/14/2025

Lecture

Intro and Basics
Introduction
[slides]
[code]
Main readings:
- Natural Language Understanding with Distributed Representation (Ch. 1) (Cho 2015)
# 2 01/16/2025

Lecture

Intro and Basics
Word representation and text classification
[slides]
[code]
Main readings:
- Natural Language Understanding with Distributed Representation (Ch. 2, Ch. 3) (Cho 2015)
Additional references:
- (Video) Let's build the GPT Tokenizer (Karpathy 2024)
# 3 01/21/2025

Lecture

Intro and Basics
Sequence Modeling I
Language Modeling Fundamentals
[slides]
[code]
Main readings:
- Natural Language Understanding with Distributed Representation (Ch. 5 up to 5.4.2) (Cho 2015)
Additional references:
- A Neural Probabilistic Language Model (Bengio et al 2003)
- Understanding the difficulty of training deep feedforward neural networks (Glorot & Bengio 2010)
- Adam: A Method for Stochastic Optimization (Kingma & Ba 2015)
# 4 01/23/2025

Lecture

Intro and Basics
Sequence Modeling II
Recurrent neural networks
[slides]
[code]
Main readings:
- Natural Language Understanding with Distributed Representation (Ch. 4, Ch. 5.5-5.6, Ch. 6) (Cho 2015)
Additional references:
- Recurrent neural network based language model (Mikolov et al 2010)
- Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Cho et al 2014)
- Why LSTMs Stop Your Gradients From Vanishing: A View from the Backwards Pass (Weber 2017)
- Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al 2015)
# 5 01/28/2025

Lecture

Building Blocks
Sequence Modeling III
Attention and Transformers
[slides]
[code]
Main readings:
- Attention Is All You Need (Vaswani et al 2017)
- The Annotated Transformer (Rush et al 2018)
Additional references:
- Root Mean Square Layer Normalization (Zhang & Sennrich 2019)
- On Layer Normalization in the Transformer Architecture (Xiong et al 2020)
- RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al 2021)
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Ainslie et al 2023)
- (Helpful Blog Post): Why Are Sines and Cosines Used For Positional Encoding? (Muhammad 2023)
# 5 01/28/2025

Assignments
Assignment 1 Out
# 6 01/30/2025

Lecture

Building Blocks
Learning I
Pretraining
[slides]
[code]
Main readings:
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al 2018)
- Language Models are Unsupervised Multitask Learners (Radford et al 2019)
Additional references:
- LLaMA: Open and Efficient Foundation Language Models (Touvron et al 2023)
- The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale (Penedo et al 2024)
- OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text (Paster et al 2023)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research (Soldaini et al 2024)
- Scaling Laws for Neural Language Models (Kaplan et al 2020)
- Training Compute-Optimal Large Language Models (Hoffmann et al 2022)
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (Deepseek AI 2024)
# 7 02/04/2025

Lecture
Guest Lecture:
Amanda Bertsch

Building Blocks
Inference I
Decoding and Generation Algorithms
[slides]
Additional references:
- From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models (Sections 1-3) (Welleck et al 2024)
# 7 02/05/2025

Recitation

Building Blocks
Annotated Transformer
# 8 02/06/2025

Lecture

Building Blocks
Inference II
Prompting
[slides]
[code]
Additional references:
- Prompting Survey (Liu et al 2021)
- Language Models are Few-Shot Learners (Brown et al 2020)
- Many-Shot In-Context Learning (Agarwal et al 2024)
- Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design (Sclar et al 2023)
- Large Language Models as Optimizers (Yang et al 2023)
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al 2022)
- DSPy (Khattab et al 2023)
# 9 02/11/2025

Lecture

Building Blocks
Learning II
Fine-Tuning
[slides]
[code]
Additional references:
- Universal Language Model Fine-tuning for Text Classification (Howard & Ruder 2018)
- Cross-Task Generalization via Natural Language Crowdsourcing Instructions (Mishra et al 2021)
- Finetuned Language Models Are Zero-Shot Learners (Wei et al 2021)
- Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks (Wang et al 2022)
- Self-Instruct: Aligning Language Models with Self-Generated Instructions (Wang et al 2023)
- Orca: Progressive Learning from Complex Explanation Traces of GPT-4 (Mukherjee et al 2023)
- Distillation Tutorial (Agarwal 2025)
- Distilling the Knowledge in a Neural Network (Hinton et al 2015)
- Sequence-Level Knowledge Distillation (Kim & Rush 2016)
- Symbolic Knowledge Distillation (West et al 2022)
- LoRA: Low-Rank Adaptation of Large Language Models (Hu et al 2021)
- QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al 2023)
# 9
Office Hours (See Piazza)

Recitation

Building Blocks
HuggingFace Transformers
# 10 02/13/2025

Lecture
Guest Lecture:
Akari Asai

Building Blocks
Modeling IV
Retrieval and RAG
# 10
Office Hours (See Piazza)

Recitation

Building Blocks
LiteLLM and LLM APIs
# 10 02/13/2025

Assignments
Assignment 2 Out
# 10 02/14/2025

Assignments
Assignment 1 Due
# 10
Office Hours (See Piazza)

Recitation

Building Blocks
LangChain/LlamaIndex
# 11 02/18/2025

Lecture

Building Blocks
Learning III
Reinforcement Learning
[slides]
Additional references:
- Deep Reinforcement Learning: Pong from Pixels (Karpathy 2016)
- Spinning Up in Deep RL (Part 1, Part 3, Vanilla PG, PPO) (OpenAI)
- Proximal Policy Optimization Algorithms (Schulman et al 2017)
- Deep reinforcement learning from human preferences (Christiano et al 2017)
- Fine-Tuning Language Models from Human Preferences (Ziegler et al 2019)
- Learning to summarize from human feedback (Stiennon et al 2020)
- Training language models to follow instructions with human feedback (Ouyang et al 2022)
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI 2025)
# 12 02/20/2025

Lecture
Guest Lecture:
Seungone Kim

Building Blocks
Evaluation
Evaluating Language Generators
[slides]
# 12 02/20/2025

Assignments
Assignment 3, 4 Out
# 13 02/25/2025

Lecture

Building Blocks
Research Skills
Experimental Design
[slides]
Additional references:
- With Little Power Comes Great Responsibility (Card et al 2020)
# 14 02/27/2025

Lecture

Advanced Topics
Agents
Agents
[slides]
[code]
Additional references:
- World of Bits: An Open-Domain Platform for Web-Based Agents (Shi et al 2017)
- WebGPT: Browser-assisted question-answering with human feedback (Nakano et al 2022)
- WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents (Yao et al 2022)
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (Yang et al 2024)
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks (Koh et al 2024)
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (Xie et al 2024)
- Programming with Pixels: Computer-Use Meets Software Engineering (Aggarwal & Welleck 2025)
# 15 03/04/2025

Break

No Class
Spring Break
# 16 03/06/2025

Break

No Class
Spring Break
# 17 03/11/2025

Lecture
Guest Lecture:
Tim Dettmers

Advanced Topics
Efficiency
Quantization
[slides]
Additional references:
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers et al 2022)
- 8-bit Optimizers via Block-wise Quantization (Dettmers et al 2021)
- The case for 4-bit precision: k-bit Inference Scaling Laws (Dettmers & Zettlemoyer 2022)
- QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al 2023)
# 18 03/13/2025

Lecture

Advanced Topics
Learning III
Advanced Pretraining: Parallelism and Scaling
[slides]
Main readings:
- The Ultra-Scale Playbook: Training LLMs on GPU Clusters (Tazi et al 2025)
# 18 03/14/2025

Assignments
Assignment 2 Due
# 19 03/18/2025

Lecture

Course Project
Project Discussion
# 20 03/20/2025

Lecture

Advanced Topics
Modeling V
Long Sequence Models
[slides]
Additional references:
- Self-attention Does Not Need O(n2) Memory (Rabe & Staats 2021)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al 2022)
- Ring Attention with Blockwise Transformers for Near-Infinite Context (Liu et al 2023)
- The Ultra-Scale Playbook: Training LLMs on GPU Clusters (Tazi et al 2025)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Gu & Dao 2023)
# 20
Office Hours (See Piazza)

Recitation

Advanced Topics
OpenRLHF
# 21 03/25/2025

Lecture

Advanced Topics
Inference III
Advanced Inference Strategies
[slides]
Main readings:
- From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models (Section 4, 5, 6) (Welleck et al 2024)
Additional references:
- NeurIPS 2024 LLM Inference Tutorial (Reading List)
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI 2025)
- s1: Simple test-time scaling (Muennighoff et al 2025)
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning (Aggarwal & Welleck 2025)
# 22 03/27/2025

Lecture

Advanced Topics
Inference IV
Efficient Inference
[slides]
Main readings:
- From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models (Section 7) (Welleck et al 2024)
Additional references:
- NeurIPS 2024 LLM Inference Tutorial (Reading List)
# 22
Office Hours (See Piazza)

Recitation

Advanced Topics
vLLM / SGLang
# 23 04/01/2025

Lecture

Advanced Topics
Learning IV
Advanced Post Training
[slides]
Additional references:
# 24 04/03/2025

Break

No Class
Spring Carnival
# 25 04/07/2025

Assignments
Assignment 3 Due
# 25 04/08/2025

Project Discussion

Project Discussion
Project Discussion
# 26 04/10/2025

Lecture

Applications and Society
Multimodal models I
[slides]
Additional references:
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al 2020)
- Learning Transferable Visual Models From Natural Language Supervision (Radford et al 2021)
- Visual Instruction Tuning (Liu et al 2023)
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models (Deitke et al 2024)
# 27 04/15/2025

Lecture

Applications and Society
AI for Mathematics
[slides]
Additional references:
- Lean-STaR: Learning to Interleave Thinking and Proving (Lin et al 2024)
- Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs (Jiang et al 2022)
- miniCTX: Neural Theorem Proving with (Long-)Contexts (Hu et al 2024)
- LLMLean ()
# 28 04/17/2025

Lecture

Applications and Society
Multimodal models II
[slides]
Main readings:
- Neural Discrete Representation Learning (van den Oord et al 2017)
- Taming Transformers for High-Resolution Image Synthesis (Esser et al 2021)
Additional references:
- Chameleon: Mixed-Modal Early-Fusion Foundation Models (Meta 2024)
# 29 04/22/2025

Lecture

Course Project
Posters
# 30 04/24/2025

Lecture

Course Project
Posters
# 30 04/27/2025

Assignments
Assignment 4 Due

Assignments

The aim of the assignment and project is to build basic understanding and advanced implementation skills needed to build cutting-edge systems or do cutting-edge research using neural networks for NLP, culminating with a project that demonstrates these abilities through a project.

Read all the instructions on this page carefully
You are responsible for reading these instructions and following them carefully. If you do not, you may be marked down as a result.

Assignment Policies

Working in Teams:

There are 4 assignments in the class. Assignment 1 must be done individually, while Assignments 2, 3, and 4 must be done in teams of 2-3 (individual submissions will not be accepted for these assignments). If you are having trouble finding a group, the instructor and TAs will help you find one after the first initial survey.

Submission Information:

To submit your assignment you must submit via Canvas a zip file containing:

your code: This should be in a directory “code” in the top directory unless specified otherwise.
system outputs (assignments 1 and 2): The format will be specified separately for each assignment.
a report (assignments 2, 3 and 4, optional for assignment 1): This should be named “report.pdf” in the top directory. This is for assignments 2, 3 and 4, and can be up to 7 pages for assignments 2 and 3 and 9 pages for assignment 4. References are not included in the page count, and it is OK to submit appendices that include supplementary information such as hyperparameter settings or additional output examples, although there is no guarantee that the TAs will read them. Submissions that exceed the page count will be penalized one third grade for each page over (e.g., A to A- or A- to B+). You may also submit report.pdf for assignment 1 if you have any interesting information to convey to the TAs, for example, if you did anything interesting above and beyond the minimal requirements.
a link to a GitHub repository containing your code (assignments 2, 3 and 4): This should be a single line file “github.txt” in the top directory. Your GitHub repository must be viewable to the TAs in charge of the assignment by the submission deadline. If your repository is private, make it accessible to the TAs by the submission deadline. If your repository is not visible to the TAs, your assignment will not be considered complete, so if you are worried, please submit well in advance of the deadline so we can confirm the submission is visible. We use this repository to check contributions of all team members.

Late Day Policy:

In case there are unforeseen circumstances that don’t let you turn in your assignment on time, 5 late days total for assignments 2 and 3 will be allowed. Note that other than these late days, we will not be making exceptions and extending deadlines except for health reasons, so please try to be frugal with your late days and use them only if necessary. Assignments that are late beyond the allowed late days will be graded down one third-grade per day late (e.g., A to A- for one day, and A to B+ for two days).

Plagiarism/Code Reuse Policy:

All assignments are expected to be conducted under the CMU policy for academic integrity. All rules here apply and violations will be subject to penalty including zero credit on the assignment, failing the course, or other disciplinary measures. In particular, in your implementation:

Code or pseudo-code provided by the TAs or instructor may be used freely without restriction.
For assignment 2, you may not just re-use an existing implementation written by someone else. The implementation should basically be your own.
Code written by other students in the class cannot be used (except, obviously, you can share code within your group for assignments 2, 3, and 4).
If you are doing a similar project for a graded class at CMU (including independent studies or directed research), you must declare so on your report, and note which parts of the project are for 11-711, and which parts are for the other class. Consult with the TA mailing list if you are unsure.

Consulting w/ Instructors/TAs:

For assignments and projects, you are free to consult as much as you want, any time you want with the instructors and TAs. That is what we’re here for, and in no way is this considered cheating. In fact, if you don’t have much experience with NLP previously, it will be helpful to liberally consult with the instructors and TAs to learn about how to do the implementation and finish the assignments. So please do so.

Because this is a project-based course, we assume that many of the students taking the course will be interested in turning their assignments or project into research papers. In this case, if you have received useful advice from the instructor or TAs that made the project significantly better, consider inviting them to be co-authors on the paper. Of course, you do not need to do so just because the paper is a result of the class, only if you feel that their advice or help made a contribution.

Details of Each Assignment

Assignment 1: Release Date: 01/28 11:59pm, Due Date: 02/14 11:59pm
Assignment 2: Release Date: 02/13 11:59pm, Due Date: 03/14 11:59pm
Assignment 3: Release Date: 02/20 11:59pm, Due Date: 04/07 11:59pm

Group assignment.

You will perform a literature survey on a topic of interest, and propose a project topic based on this literature survey.
You will reproduce the evaluation numbers of a competitive baseline model for a task related to this project topic (not necessarily the same). In other words, you must get the same numbers as the previous paper on the same dataset.

The grading rubric for the project proposal component is as follows:

A+: Exceptional or surprising. Goes far beyond most other submissions.
A: A survey that covers all the major relevant papers in the field and a well-grounded project proposal based on this survey.
A-: The survey has a good analysis but is missing a few pieces of relevant related work, or is quite complete but is lacking in critical analysis or forward directions.
B+: The survey is either quite lacking in coverage or analysis, or is decent but not complete in both aspects.
B or B-: The survey is lacking in both coverage and analysis, but does make an attempt to cover some related research.
C+ or below: Clear lack of effort or incompleteness.

The grading rubric for the reproduction component is as follows:

A+: Exceptional or surprising. Goes far beyond most other submissions.
A: Numbers that meet or exceed the previously reported results. A comprehensive analysis of the results, and forward-looking plans for further development.
A-: A complete re-implementation with competitive result numbers, but less analysis or forward-looking plans for development than assignments rewarded an A.
B+: An implementation and evaluation numbers exist, but they do not match previous work in the field. Or the analysis or forward-looking plans may be seriously lacking.
B or B-: Two or more of the above three elements are lacking.
C+ or below: Clear lack of effort or incompleteness.

Assignment 4: Release Date: 02/20 11:59pm, Due Date: 04/27 11:59pm

Group assignment.

A+: Exceptional or surprising. Goes far beyond most other submissions.
A: A respectable research contribution that is novel and effective, and could be submitted largely as-is as a paper to an academic conference.
A-: A respectable research contribution that has some small incomplete parts, but is largely complete and promising.
B+: An idea that is novel, but the results may not be there yet, or the analysis is short.
B or B-: Results, analysis, or novelty are lacking.
C+ or below: Clear lack of effort or incompleteness.

Negative Results:

Poster Presentation

Time/Location

Time: 3:30PM-5:00PM, 22nd April, 2025 and 24th April, 2025
Location: Hallway below LTI (GHC4400)

We will announce which teams will be presenting on which days during the course. If you have a major, immovable conflict that will prevent your team from presenting on one day please contact us via piazza and we will try to make accommodations.

Goals and Grading

The intention of the poster is several-fold:

That you share your preliminary results with the TAs and instructor so we can give feedback to make any last adjustments to improve your final project report.
That you can see the other projects in the class to learn from them and get any ideas that may improve your final project report.
That you can practice explaining the work that you did.

The poster is graded for attendance, so you need to show up with a poster or will be graded down on the project. However, it is basically pass/fail, so basically if you show up with a poster you will not be graded down. That being said, putting the important information on the poster will help you get better feedback.

What information should be included in a poster? It should be mostly:

What is the problem you’re solving
What is your method for solving that problem
What are the results

There is not a set format for creating a poster, but if you would like some guidance, I would suggest creating three columns, where the left one describes “1”, the middle one describes “2”, and the right one describes “3”. The middle one can be a bit wider.

Poster Printing

If you are a member of SCS, we suggest that you use SCS poster printing. If you are not a member of SCS, you can send your PDF to the TAs no less than 5 days before your presentation, and we will print it for you.