We release miniCTX-v2, an updated version of miniCTX with new theorems extracted after a more recent cutoff date of November 28, 2024. The dataset is available here.
Real-world formal theorem proving often depends on a wealth of context, including definitions, lemmas, comments, file structure, and other information. We introduce miniCTX, which tests a model's ability to prove formal mathematical theorems that depend on new context that is not seen during training.
miniCTX contains theorems sourced from real Lean projects and textbooks, each associated with a context that can span tens of thousands of tokens. Models are tasked with proving a theorem given access to code from the theorem's repository, which contains context needed for the proof. As baselines for miniCTX, we tested fine-tuning and prompting methods that condition theorem proving on the preceding file context. Both approaches substantially outperform traditional methods that rely solely on the proof state.
We found that this ability to use context is not captured by previous benchmarks such as miniF2F. Alongside miniCTX, we offer ntp-toolkit for automatically extracting and annotating theorem proving data, making it easy to add new projects into miniCTX to ensure that contexts are not seen during training. miniCTX offers a challenging and realistic evaluation of neural theorem provers.
Unlike popular benchmarks that focus on isolated competition problems, miniCTX includes complex theorems from a variety of ongoing Lean projects, such as the Prime Number Theorem and the Polynomial Freiman-Ruzsa (PFR) Conjecture formalizations. This diversity tests a model's ability to generalize to real-world formalization projects.
miniCTX pairs each theorem with new context, such as definitions, lemmas, and comments, that the proof may depend on. This contextual information helps evaluate whether models can leverage real-world context, rather than only proving standalone problems.
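As an illustration, consider the hypothetical file excerpt below (a toy example of ours, not an actual benchmark entry), where the target theorem depends on a definition and a helper lemma introduced earlier in the same file:

-- Hypothetical file excerpt: the target theorem at the bottom depends
-- on a new definition and a helper lemma from the same file.

/-- `m` is triangular if it has the form `n * (n + 1) / 2`. -/
def IsTriangular (m : Nat) : Prop := ∃ n, m = n * (n + 1) / 2

theorem isTriangular_zero : IsTriangular 0 := ⟨0, rfl⟩

-- Proving this requires knowing how `IsTriangular` is defined; a model
-- that sees only the goal state cannot unfold it or choose a witness.
theorem isTriangular_three : IsTriangular 3 := ⟨2, by decide⟩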
miniCTX keeps evaluation data fresh by automatically updating the benchmark with new Lean projects using ntp-toolkit. This mitigates data contamination when evaluating models and maintains a realistic evaluation setting.
Each theorem in miniCTX consists of three main components: the theorem statement, preceding file contents, and metadata in JSON format. These elements provide the complete context required for proving the theorem.
Users can reconstruct the complete context for each theorem using the metadata, which includes both in-file and cross-file context, enabling realistic and comprehensive evaluation.
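For a concrete sense of this, here is a minimal sketch of loading one entry and reconstructing the proving task; the field names ("theoremStatement", "srcContext", "module") are illustrative assumptions, not the dataset's authoritative schema:

import json

# One miniCTX-style entry: statement, preceding file contents, metadata.
# Field names here are illustrative, not the dataset's exact schema.
entry = json.loads("""
{
  "theoremStatement": "theorem isTriangular_three : IsTriangular 3",
  "srcContext": "def IsTriangular (m : Nat) : Prop := ...",
  "module": "MyProject.Triangular"
}
""")

# Reconstruct the proving task: preceding file contents, then the theorem
# statement, leaving the proof body for the model to fill in.
prompt = entry["srcContext"] + "\n\n" + entry["theoremStatement"] + " := by\n"
print(prompt)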
Our experiments evaluate several baseline models on the miniCTX benchmark. We find that models using richer context significantly outperform those relying solely on proof states. Specifically, the file-tuned model, which conditions on the full preceding file contents, achieves an average success rate of 35.94%, compared to 19.53% for a state-tactic model trained without context. Adding relevant premises to the prompt also improves performance when proofs have heavy cross-file dependencies, as seen with GPT-4o on the PFR and SciLean splits. These results demonstrate that miniCTX captures the importance of both in-file and cross-file context in formal theorem proving.
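As a rough sketch of what premise-augmented, context-conditioned prompting can look like (the prompt template and the generate call below are our own assumptions, not the paper's exact setup):

def build_prompt(file_context: str, premises: list[str], statement: str) -> str:
    # Prepend retrieved cross-file premises as comments, then the (possibly
    # truncated) in-file context, then the goal theorem to be proved.
    premise_block = "\n".join(f"-- premise: {p}" for p in premises)
    return f"{premise_block}\n\n{file_context}\n\n{statement} := by\n"

prompt = build_prompt(
    file_context="def IsTriangular (m : Nat) : Prop := ...",
    premises=["theorem Nat.succ_pos (n : Nat) : 0 < n.succ"],
    statement="theorem isTriangular_three : IsTriangular 3",
)
# completion = generate(prompt)  # any LLM call; check the proof in Lean afterwards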
The NTP Toolkit is an automated tool for extracting training and evaluation data from Lean repositories with minimal manual effort. It transforms Lean 4 repositories into next-tactic or full proof examples, enabling seamless fine-tuning and evaluation of language models. All problems and training data for miniCTX are automatically extracted using the toolkit. The toolkit also generates instruction-tuning data with natural language instructions to facilitate task learning. To learn more, visit the GitHub repository.
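For a sense of the output, an extracted next-tactic example might take roughly the following shape; the field names are illustrative and the actual toolkit schema may differ:

# Illustrative shape of one extracted next-tactic example; the actual
# ntp-toolkit output schema may differ.
example = {
    "state": "⊢ IsTriangular 0",             # goal state before the tactic
    "nextTactic": "exact ⟨0, rfl⟩",           # ground-truth next step
    "srcUpToTactic": "def IsTriangular ...",  # file contents up to this point
    "decl": "isTriangular_zero",              # enclosing declaration name
}

# An instruction-tuning pair could then be formed from such an example:
instruction = "Given the Lean state, suggest the next tactic.\n" + example["state"]
target = example["nextTactic"]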
@article{hu2024minictx,
  title={miniCTX: Neural Theorem Proving with (Long-)Contexts},
  author={Hu, Jiewen and Zhu, Thomas and Welleck, Sean},
  journal={arXiv preprint arXiv:2408.03350},
  year={2024}
}