> For the complete documentation index, see [llms.txt](https://www.daiwk.net/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://www.daiwk.net/6.llm_math.md).

# 6.llm\_math

## LLM+math

### mathscale

[【LLM-数学】MathScale 用于数学推理的指令调优扩展方法](https://mp.weixin.qq.com/s/tQUIGdViMZTb_9NNh3b3RQ)

[MathScale: Scaling Instruction Tuning for Mathematical Reasoning](https://arxiv.org/pdf/2403.02884.pdf)

### AlphaGeometry

[奥数能力金牌级：DeepMind几何推理模型登上Nature，代码开源，菲尔兹奖得主点赞](https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==\&mid=2650904746\&idx=1\&sn=d39a3d92078cecbd29bd0fc82560d1da\&chksm=84e45cd4b393d5c24747f163fa0338761690447904a1a654aacf2344d7a16d36dfa2ac3ccdb0\&scene=21#wechat_redirect)提出了AlphaGeometry

### AlphaProof & AlphaGeometry 2

[谷歌AI拿下IMO奥数银牌，数学推理模型AlphaProof面世，强化学习 is so back](https://mp.weixin.qq.com/s/LNzbyf0w412BIz71sROyzw)提出AlphaProof和AlphaGeometry 2

### WE-Math基准

[真相了！大模型解数学题和人类真不一样：死记硬背、知识欠缺明显，GPT-4o表现最佳](https://mp.weixin.qq.com/s/uU1lZV0Ymj31cmZryhffyQ)

[WE-MATH: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?](https://arxiv.org/pdf/2407.01284)

<https://github.com/We-Math/We-Math>

<https://huggingface.co/datasets/We-Math/We-Math>

### case-based or rule-based

[ICML 2024｜Transformer究竟如何推理？基于样例还是基于规则](https://mp.weixin.qq.com/s/aVRiGW3xU_LpvxZzjDpwzQ)

[Case-Based or Rule-Based: How Do Transformers Do the Math?](https://arxiv.org/pdf/2402.17709)

<https://github.com/GraphPKU/Case_or_Rule>

## AlphaProof(nature)

[Olympiad-level formal mathematical reasoning with reinforcement learning](https://www.nature.com/articles/s41586-025-09833-y)

[代码](https://github.com/daiwk/collections/blob/master/assets/alphaproof_pseudocode.py)，从<https://www.nature.com/articles/s41586-025-09833-y>这里的Supplementary Data下载

![](/files/DD4TM8aFkI7f31Q2WxEN)

* 状态：Lean prover的逻辑状态，即Lean tactic state，表示已经建立的假设，和剩下的目标
* 环境：给定一个状态，和agent采取的动作，输出下一个状态
* 动作：一个证明的中间步骤，是一段文本，例如“a=2,b=2a,==>b=4”
* episode结束条件：证明完成or预算花完（例如超时）
* reward：每个tactic的reward都是-1，因为要鼓励用更少的步骤完成证明，即到达终点的分支路径最短

prover agent是把dnn和alphazero的搜索算法相结合：

* proof network：是一个3B的encoder-decoder transformer，输入Lean tactic state，输出2个东西：
  * policy：下一步要尝试的N个动作
  * value：当前**状态**的价值，即从此刻到episode结束时的每步reward之和
* tree search：和alphazero类似，使用AND-OR tree structure将证明拆成多个独立的子目标(类似[Hypertree proof search for neural theorem proving](https://arxiv.org/pdf/2205.11491))，采样使用的progressive sampling

![](/files/u6O3XOpUl0YpppagMkPp)

* 预训练：proof network在大概300 billion tokens的代码和数学语料通过next token prediction预训练
* SFT：在300k的人类用matlab标注的证明语料(state-tactic pairs)上sft，
* main RL：
  * 用gemini的LLM搞了一个自动形式化（formalization）的系统，将大概1M的非形式化的问题形式化成了80M的问题
  * proof network+tree search和Lean环境交互，生成形式化的证明和反证（disproof），然后用Alphazero的方式基于得到的经验去RL训练
* inference：
  * 首先参考Alphazero，增加tree search的预算，例如产生更多的搜索路径
  * 如果增加了搜索路径还不够，那就用Test-time RL（TTRL）：
    * 给定一个问题，variant generator生成其变种（例如简化或者泛化）
    * 基于这些变种去进行RL训练


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://www.daiwk.net/6.llm_math.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
