# 1.9.llm\_others

## Tools

Paper translation site: <https://hjfy.top/>

## Leaked system prompts

<https://github.com/asgeirtj/system_prompts_leaks/blob/main/OpenAI/gpt-5-thinking.md>

## Physics of language models

(toread)

[大模型边推理边纠错，有可能做到吗？这是ICML爆火的演讲](https://mp.weixin.qq.com/s/NOVFYmXiHUJ7x1SU7yH0CA)

<https://www.bilibili.com/video/BV1Yw4m1k7nH>

## Distillation

### ABKD

[ICML Spotlight 2025丨追求概率质量的帕累托最优：基于广义α-β散度引导的知识蒸馏框架ABKD](https://mp.weixin.qq.com/s/UwRwDJJxWrS-9mVoHSUPDQ)

[ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via α-β-Divergence](https://arxiv.org/pdf/2505.04560)

<https://github.com/ghwang-s/abkd>

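Problems with existing approaches:

* Forward KL: the probability allocation is too laid-back; the student spreads mass over all classes and struggles to focus on the target class
* Reverse KL: the probability allocation is too aggressive; the student fixates on high-confidence classes and ignores the teacher's global information

ABKD introduces the α-β divergence, which unifies forward and reverse KL and extends to previously unexplored cases such as the Hellinger distance and the β-divergence.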
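To make the "unifies forward/reverse KL" claim concrete, here is a minimal sketch of the α-β divergence in its standard form from the divergence literature. This is an assumption about the definition; the ABKD paper's exact parameterization and loss may differ. The `teacher` and `student` distributions are made-up numbers.

```python
import math


def ab_divergence(p, q, alpha, beta):
    """Standard alpha-beta divergence between discrete distributions.

    Valid for alpha != 0, beta != 0 and alpha + beta != 0; the KL cases
    are recovered as limits, approximated here with a tiny epsilon.
    """
    s = alpha + beta
    total = sum(
        pi**alpha * qi**beta - alpha / s * pi**s - beta / s * qi**s
        for pi, qi in zip(p, q)
    )
    return -total / (alpha * beta)


def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


teacher = [0.7, 0.2, 0.1]  # illustrative numbers, not from the paper
student = [0.4, 0.4, 0.2]
eps = 1e-6

# alpha=1, beta->0 approaches forward KL(teacher || student).
print(ab_divergence(teacher, student, 1.0, eps), kl(teacher, student))
# alpha->0, beta=1 approaches reverse KL(student || teacher).
print(ab_divergence(teacher, student, eps, 1.0), kl(student, teacher))
# alpha=beta=0.5 equals 4x the squared Hellinger distance.
hellinger_sq = 0.5 * sum(
    (math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(teacher, student)
)
print(ab_divergence(teacher, student, 0.5, 0.5), 4 * hellinger_sq)
```

Intermediate (α, β) values interpolate between the two KL extremes, which is the knob the summary above refers to.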

## Common LLM problems

### LLM as a judge

[A Survey on LLM-as-a-Judge](https://arxiv.org/pdf/2411.15594)

[Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge](https://arxiv.org/pdf/2501.18099), from Meta, January 2025

### Repetitive generation

<https://www.zhihu.com/question/616130636>

<https://mp.weixin.qq.com/s/cSwWapqFhxu9zafzPUeVEw>

[Interpreting the Repeated Token Phenomenon in Large Language Models](https://arxiv.org/pdf/2503.08908)

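A DeepMind paper. It finds that the phenomenon is related to attention sinks (the initial token receives a very high attention score): an early attention layer is responsible for marking the first token of the sequence, and certain neurons in later layers amplify the hidden-state values of the marked token. When the input consists of repeated tokens, this mechanism breaks down and the model behaves abnormally.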

<https://github.com/yossigandelsman/attn_sinkhole>

### Hallucination

#### Surveys

[OpenAI Lilian Weng万字长文解读LLM幻觉：从理解到克服](https://mp.weixin.qq.com/s/UGcui0rLW2Vz7y2Mt4atqA)

<https://lilianweng.github.io/posts/2024-07-07-hallucination/>

#### Semantic entropy

[语义熵识破LLM幻觉！牛津大学新研究登Nature](https://mp.weixin.qq.com/s/fdLZ9DDqG9C_uxAAlKgQbw)

[Detecting hallucinations in large language models using semantic entropy](https://www.nature.com/articles/s41586-024-07421-0)

### Zilliz

[向量数据库的中场战事：长期主义者Zilliz如何全球突围](https://mp.weixin.qq.com/s/lRryjRiUGKdT11qfi62pUg)

### Memorization

[Localizing Paragraph Memorization in Language Models](https://arxiv.org/pdf/2403.19851v1.pdf)

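Code: <https://github.com/googleinterns/localizing-paragraph-memorization>

Can we localize the weights and mechanisms a language model uses to memorize entire paragraphs of its training data?

* Although memorization is spread across multiple layers and components, the gradients of memorized paragraphs show a distinguishable spatial pattern: **they are larger in the lower layers than the gradients of non-memorized examples**.
* **Fine-tuning only the high-gradient weights** is enough to make the model **unlearn the memorized examples**.
* The authors localize a **low-layer attention head** that is especially involved in paragraph memorization; it **mainly attends to** the distinctive, rare tokens that are **least frequent in the corpus-level unigram distribution**.
* Overall, compared with non-memorized continuations, memorized continuations are not only **harder to unlearn** but also **harder to corrupt**.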

#### reasoning

[Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks](https://arxiv.org/pdf/2307.02477) (MIT)

[Do Large Language Models Latently Perform Multi-Hop Reasoning?](https://arxiv.org/pdf/2402.16837) (DeepMind)

[How do Language Models Bind Entities in Context?](https://arxiv.org/pdf/2310.17191) (UC Berkeley, ICLR 2024)

#### memorizing

[Knowledge Neurons in Pretrained Transformers](https://arxiv.org/pdf/2104.08696) ACL 2022

[Language Modeling Is Compression](https://arxiv.org/pdf/2309.10668) ICLR 2024 deepmind

[Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models](https://arxiv.org/pdf/2205.10770) meta NeurIPS 2022

### Jailbreaking

[长文本之罪：Claude团队新越狱技术，Llama 2到GPT-4无一幸免](https://mp.weixin.qq.com/s/C0opoIzLCFojfmoa6poM8A)

### LLM compiler

[开发者狂喜！Meta最新发布的LLM Compiler，实现77%自动调优效率](https://mp.weixin.qq.com/s/Js0lUS_5ZPspVLazthkEOg)

[Meta Large Language Model Compiler: Foundation Models of Compiler Optimization](https://ai.meta.com/research/publications/meta-large-language-model-compiler-foundation-models-of-compiler-optimization/)

### ProLong

[2024 年了，你的长文本训练数据真的够长吗？](https://mp.weixin.qq.com/s/5dVm-VWiZG09ixMMegKCbw)

[Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models](https://arxiv.org/pdf/2405.17915)

<https://github.com/October2001/ProLong>

### Whitening

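In the transformer literature, "whitening" mainly refers to a post-processing method for sentence embeddings: the sentence vectors are transformed to have zero mean and an identity covariance matrix, which addresses the anisotropy of sentence embeddings. The technique improves sentence embeddings on semantic similarity tasks and, when combined with dimensionality reduction, speeds up retrieval.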
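The transformation described above can be sketched with NumPy as follows. This is a minimal illustration rather than the reference implementation: the function names are my own, the random matrix stands in for real sentence embeddings, and keeping only the first `k` output dimensions is the optional reduction step.

```python
import numpy as np


def fit_whitening(embeddings):
    """Return (mu, W) such that (x - mu) @ W has identity covariance."""
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov(embeddings.T)          # (dim, dim) covariance matrix
    u, s, _ = np.linalg.svd(cov)        # cov = u @ diag(s) @ u.T
    w = u @ np.diag(1.0 / np.sqrt(s))
    return mu, w


def apply_whitening(embeddings, mu, w, k=None):
    """Whiten embeddings; optionally keep only the first k dimensions."""
    out = (embeddings - mu) @ w
    return out if k is None else out[:, :k]


# Stand-in data: correlated, non-centered vectors instead of real embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4)) + 3.0

mu, w = fit_whitening(x)
z = apply_whitening(x, mu, w)
print(np.round(np.cov(z.T), 3))         # close to the identity matrix
z_small = apply_whitening(x, mu, w, k=2)
print(z_small.shape)
```

Because the singular values come back sorted in descending order, truncating to the first `k` columns keeps the directions with the most variance.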

[Whitening Sentence Representations for Better Semantics and Faster Retrieval](https://ar5iv.labs.arxiv.org/html/2103.15316)

Code: <https://github.com/bojone/BERT-whitening>

[Transformer Scale Gate for Semantic Segmentation](https://arxiv.org/pdf/2205.07056v1)

### Distillation

[Revisiting Knowledge Distillation for Autoregressive Language Models](https://arxiv.org/pdf/2402.11890)

[Meta开发System 2蒸馏技术，Llama 2对话模型任务准确率接近100%](https://mp.weixin.qq.com/s/QycbrMXsR0nUsvHBx0_GBw)

[Distilling System 2 into System 1](https://arxiv.org/pdf/2407.06023v2)

### Prover-verifier games

[OpenAI超级对齐团队遗作：两个大模型博弈一番，输出更好懂了](https://mp.weixin.qq.com/s/MiLYbYcYUPO9rdQjijF_tQ)

[Prover-Verifier Games improve legibility of LLM outputs](https://arxiv.org/pdf/2407.13692)

Reference: [Learning to Give Checkable Answers with Prover-Verifier Games](https://arxiv.org/pdf/2108.12099)

### Ethical risks

[GPT-4o模仿人类声音，诡异尖叫引OpenAI研究员恐慌！32页技术报告出炉](https://mp.weixin.qq.com/s/XSTNHTILAOkINg7mxssb6g)

OpenAI's report: [GPT-4o System Card](https://cdn.openai.com/gpt-4o-system-card.pdf)

DeepMind published an earlier report: [The Ethics of Advanced AI Assistants](https://arxiv.org/pdf/2404.16244)

### Selection bias

[ACL2024|大模型选择偏差在腾讯广告特征评测上的优化及应用](https://mp.weixin.qq.com/s/0P1D1H1HoXMwZg2nBiM07Q)

[Strengthened Symbol Binding Makes Large Language Models Reliable Multiple-Choice Selectors](https://arxiv.org/pdf/2406.01026)

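Given a question and its options, an LLM may fail to bind each option's content to its symbol (the option identifier A/B/C/D). For example, when the correct answer "the president" is placed at option B, the model selects it correctly; when the correct answer is moved to C, the model still selects "B". In other words, the model is biased toward "B" (the second option) and ignores the content of the correct answer.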
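The probing setup described above (move the correct answer to each position and see whether the model follows it) can be sketched as below. This is a hypothetical harness, not the paper's evaluation code: `choose` stands for any callable that returns a symbol, and `always_b` is a stub that mimics a fully position-biased model.

```python
import string


def permute_correct_position(options, correct_idx):
    """Yield (labelled_options, correct_symbol) with the correct option
    moved to every position in turn; distractor order is preserved."""
    symbols = string.ascii_uppercase[: len(options)]
    correct = options[correct_idx]
    distractors = [o for i, o in enumerate(options) if i != correct_idx]
    for pos in range(len(options)):
        ordered = distractors[:pos] + [correct] + distractors[pos:]
        yield dict(zip(symbols, ordered)), symbols[pos]


def positional_accuracy(choose, question, options, correct_idx):
    """Map each position of the correct answer to whether `choose` got it."""
    return {
        symbol: choose(question, labelled) == symbol
        for labelled, symbol in permute_correct_position(options, correct_idx)
    }


def always_b(question, labelled_options):
    """Stub model that always answers 'B', regardless of content."""
    return "B"


result = positional_accuracy(
    always_b,
    "Who signs bills into law?",
    ["the president", "the senator", "the judge", "the mayor"],
    correct_idx=0,
)
print(result)
```

A model without selection bias would be correct at every position; the stub is only correct when the answer happens to sit at B.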

### lost in the middle

[Lost in the Middle: How Language Models Use Long Contexts](https://arxiv.org/pdf/2307.03172)

### reasoning boundary

[NeurIPS 2024 (Oral) | 如何量化与提升思维链的推理能力边界？](https://mp.weixin.qq.com/s/BwuGacSHKY4RTdvYNMa66Q)

[Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought](https://arxiv.org/abs/2410.05695)

<https://github.com/LightChen233/reasoning-boundary>

### Language ≠ thought

[语言≠思维，大模型学不了推理：一篇Nature让AI社区炸锅了](https://mp.weixin.qq.com/s/BgMNITn5e1RGUOHQLKv7yg)

<https://www.nature.com/articles/s41586-024-07522-w>

## Other important work

### Frequently cited papers

[Scaling instruction-finetuned language models](https://arxiv.org/pdf/2210.11416.pdf), 800+ citations

[How can we know what language models know?](https://arxiv.org/pdf/1911.12543.pdf), 800+ citations

[Chain of thought prompting elicits reasoning in large language models](https://arxiv.org/pdf/2201.11903.pdf), 1800+ citations

### Work from Anthropic

[Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback](https://arxiv.org/pdf/2204.05862.pdf)

[Studying Large Language Model Generalization with Influence Functions](https://arxiv.org/pdf/2308.03296.pdf)

[Measuring Faithfulness in Chain-of-Thought Reasoning](https://www-files.anthropic.com/production/files/measuring-faithfulness-in-chain-of-thought-reasoning.pdf)

[从Claude 3中提取数百万特征，首次详细理解大模型的「思维」](https://mp.weixin.qq.com/s/cZhmvAva6NDLG84kD819Ww)

[Scaling Dictionary Learning to Claude 3 Sonnet](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)

[LLM惊现篡改代码获得奖励，欺骗人类无法根除逆转！Anthropic新作揭露惊人真相](https://mp.weixin.qq.com/s/Fgkkc3p7zIW8OrCvSU-2lA)

[SYCOPHANCY TO SUBTERFUGE: INVESTIGATING REWARD TAMPERING IN LANGUAGE MODELS](https://arxiv.org/pdf/2406.10162)

## Personalized search

Pick an arbitrary paper, [Denoising Attention for Query-aware User Modeling in Personalized Search](https://arxiv.org/pdf/2308.15968.pdf), and look at its references:

Academia:

* [A Transformer-based Embedding Model for Personalized Product Search](https://arxiv.org/pdf/2005.08936.pdf), SIGIR 2020
* [Learning a Fine-Grained Review-based Transformer Model for Personalized Product Search](https://arxiv.org/pdf/2004.09424.pdf), SIGIR 2021
* [RLPer: A Reinforcement Learning Model for Personalized Search](http://playbigdata.ruc.edu.cn/dou/publication/2020_WWW_RLPer.pdf), WWW 2020

Industry:

* [A Zero Attention Model for Personalized Product Search](https://arxiv.org/pdf/1908.11322.pdf), CIKM 2019, Amazon
* [Real-time Personalization using Embeddings for Search Ranking at Airbnb](https://github.com/daiwk/collections/blob/master/assets/airbnb-kdd18.pdf), KDD 2018, Airbnb
* [End-to-End Deep Attentive Personalized Item Retrieval for Online Content-sharing Platforms](https://dl.acm.org/doi/pdf/10.1145/3366423.3380051), WWW 2020, Google
* [Towards Personalized and Semantic Retrieval: An End-to-End Solution for E-commerce Search via Embedding Learning](https://arxiv.org/pdf/2006.02282.pdf), SIGIR 2020, JD.com
* [Personalized Query Suggestions](https://guoweiwei.github.io/files/personalized-query-suggestion.pdf), SIGIR 2020, LinkedIn
* [A GNN-based Multi-task Learning Framework for Personalized Video Search](https://eprints.whiterose.ac.uk/181816/1/GNNVideoSearch_WSDM2022.pdf), WSDM 2022, Baidu

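### Improvements to the q-i two-tower model

#### Adding location + social features

[Embedding-based Retrieval in Facebook Search](https://arxiv.org/pdf/2006.11632.pdf), KDD 2020

A q-d two-tower architecture; the bottom layer of both towers additionally takes:

* location: the user's geographic location, e.g. the city
* social: Facebook is a social network, so user and item embeddings trained by a separate graph-based model are fed in directly

#### Adding user behavior sequences

[Encoding History with Context-aware Representation Learning for Personalized Search](http://playbigdata.ruc.edu.cn/dou/publication/2020_sigir_context_ps.pdf), SIGIR 2020, Renmin University of China; proposes HTPS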

![htps-disambiguate-query](https://1725978874-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MAwALUwRCP16VZ2I1KP%2Fuploads%2Fgit-blob-1b4c57f0b47fd63bf6e94d8ec4017a0cca5cb323%2Fhtps-disambiguate-query.png?alt=media)

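The user's historical q-d pairs, together with the current query, are passed through a short-term transformer and a long-term transformer to produce the output $$q^l$$ (the short-term output is presumably the $$q^s$$ that appears in the personalized score formula).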

![htps-predict-intent](https://1725978874-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MAwALUwRCP16VZ2I1KP%2Fuploads%2Fgit-blob-0d688b67ba62d3ae42d9acb6fbf14b8b7c027bb5%2Fhtps-predict-intent.png?alt=media)

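Append \[mask] to $$q^l$$ and pass the result through a transformer to get the predicted intent $$q^p$$.

Then $$q^l$$ and $$q^p$$ are fused by a gate network to obtain the final context-aware query representation $$q^f$$.

The final doc-query score has two parts, fused by $$\phi$$ (an MLP with tanh activation):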

$$
p(d \mid q, H)=\phi\left(p(d, q), p\left(d, q^H\right)\right)
$$

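* $$p(d, q)$$: the semantic similarity between q and d, which can be obtained from a standard NLP model
* $$p\left(d, q^H\right)$$: the personalized score of q and d, given by the formula below, where $$s^R$$ is cosine similarity: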

$$
p\left(d, q^H\right)=\phi\left(s^R\left(q^s, d^w\right), s^R\left(q^l, d^w\right), s^R\left(q^p, d^w\right), s^R\left(q^f, d^w\right)\right)
$$

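There are two losses:

* pred loss: predicts the intent, i.e. the next query, using the cosine between $$q^p$$ and the average of the word vectors of the next query
* rank loss: a LambdaRank pairwise loss computed from $$p(d \mid q, H)$$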

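#### Three towers + GNN neighbors + MTL

[A GNN-based Multi-task Learning Framework for Personalized Video Search](https://eprints.whiterose.ac.uk/181816/1/GNNVideoSearch_WSDM2022.pdf), WSDM 2022, Baidu; proposes MGNN-PVS

Existing personalized search methods (PSMs) are mostly trained on user feedback such as clicks. Drawbacks:

* The feedback signal mostly expresses attractiveness rather than relevance
* A user's historical signals are sparse, which makes it hard to learn a good PSM

Two bipartite graphs: u-q and q-d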

![gnn-personalized-video-search](https://1725978874-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MAwALUwRCP16VZ2I1KP%2Fuploads%2Fgit-blob-c1c2a9faec3991d2af972243aa718add4a71aebc%2Fgnn-personalized-video-search.png?alt=media)

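Three towers:

* user:
  * the user itself
  * the queries among its one-hop neighbors (u->q)
  * the users among its two-hop neighbors (u->q->u)
* query:
  * the query itself
  * the docs among its one-hop neighbors (q->d)
  * the queries among its two-hop neighbors (q->d->q)
* doc:
  * the doc's own title vector (trained with a triplet loss over query, positive title, negative title) and video vector (trained with a triplet loss over video, positive query, negative query)
  * the docs among its two-hop neighbors (d->q->d)

Two tasks:

* CTR prediction: u and q are concatenated and passed through a network to get a personalized q; its inner product with the vector obtained by passing d through a network gives the prediction, trained with cross-entropy
* Relevance prediction: q goes through another network and d through another network; their inner product is trained with MSE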
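The two task heads above can be sketched on top of already-computed tower outputs. This is a simplified illustration, not the paper's code: applying a sigmoid before the cross-entropy and combining the two losses with a single `relevance_weight` are my assumptions, and the vectors are toy values.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ctr_loss(personalized_q, doc_vec, clicked):
    """Binary cross-entropy on sigmoid(<personalized_q, doc_vec>).

    The sigmoid is an assumption; the notes only say "inner product
    followed by cross-entropy".
    """
    prob = 1.0 / (1.0 + math.exp(-dot(personalized_q, doc_vec)))
    return -(clicked * math.log(prob) + (1 - clicked) * math.log(1.0 - prob))


def relevance_loss(q_vec, doc_vec, relevance_label):
    """Squared error between the inner product and the relevance label."""
    return (dot(q_vec, doc_vec) - relevance_label) ** 2


def multitask_loss(personalized_q, ctr_doc, clicked,
                   rel_q, rel_doc, relevance_label, relevance_weight=1.0):
    """Weighted sum of both task losses (the weighting scheme is hypothetical)."""
    return (ctr_loss(personalized_q, ctr_doc, clicked)
            + relevance_weight * relevance_loss(rel_q, rel_doc, relevance_label))


loss = multitask_loss(
    personalized_q=[0.5, -0.2], ctr_doc=[1.0, 0.3], clicked=1,
    rel_q=[0.4, 0.1], rel_doc=[0.9, 0.2], relevance_label=0.5,
)
print(loss)
```

Keeping separate query and doc projections per task, as the notes describe, lets the click signal and the relevance signal disagree without overwriting each other.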

## LLM model merging

<https://github.com/arcee-ai/mergekit>

[SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling](https://arxiv.org/pdf/2312.15166)

[LLM 合并新思路：进化算法+零训练->新任务](https://mp.weixin.qq.com/s/eSWdLT0p5uyd32OOod5lKQ)

[Evolutionary Optimization of Model Merging Recipes](https://arxiv.org/pdf/2403.13187)

<https://github.com/SakanaAI/evolutionary-model-merge>

## LLM auto-ml

### LLaMA-NAS

[用神经架构搜索给LLM瘦身，模型变小，准确度有时反而更高](https://mp.weixin.qq.com/s/_cKq4a3uM4r6s5P5s9mWaA)

[LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models](https://arxiv.org/pdf/2405.18377)

### SELA

[MetaGPT开源SELA，用AI设计AI，效果超越OpenAI使用的AIDE](https://mp.weixin.qq.com/s/9m933xV95uU-cX3qOQLC6Q)

[SELA: Tree-Search Enhanced LLM Agents for Automated Machine Learning](https://arxiv.org/abs/2410.17238)

<https://github.com/geekan/MetaGPT/tree/main/metagpt/ext/sela>

## Explainable AI

[XAI有什么用？探索LLM时代利用可解释性的10种策略](https://mp.weixin.qq.com/s/V35k4UJZPtJkAHqYlZiO1A)

[Usable XAI: 10 Strategies Towards Exploiting Explainability in the LLM Era](https://arxiv.org/pdf/2403.08946.pdf)

<https://github.com/JacksonWuxs/UsableXAI_LLM>

### Surveys

[可解释性终极追问，什么才是第一性解释？20篇CCF-A+ICLR论文给你答案](https://mp.weixin.qq.com/s/vCAw0d2uZ_MnLrl5MT9OKA)

### TransformerLens

A project by Neel Nanda (DeepMind)

<https://transformerlensorg.github.io/TransformerLens/>

### ecco

<https://www.eccox.io/>

<https://jalammar.github.io/explaining-transformers/>

<https://jalammar.github.io/hidden-states/>

### interpretability in the wild

[Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small](https://arxiv.org/pdf/2211.00593)

<https://github.com/redwoodresearch/Easy-Transformer>

### activation engineering

[Activation Addition: Steering Language Models Without Optimization](https://arxiv.org/pdf/2308.10248)

### representation engineering

[Representation Engineering: A Top-Down Approach to AI Transparency](https://arxiv.org/pdf/2310.01405)

### transformer-debugger

<https://github.com/openai/transformer-debugger/tree/main>

### painter

[八问八答搞懂Transformer内部运作原理](https://mp.weixin.qq.com/s/5qhpfHfzOIdKsG_wtgTR4A)

[Transformer Layers as Painters](https://arxiv.org/pdf/2407.09298v1)

### transformer explainer

[黑匣子被打开了！能玩的Transformer可视化解释工具，本地运行GPT-2、还可实时推理](https://mp.weixin.qq.com/s/vLyIrRyoWYjhMN4gTRgA6g)

[TRANSFORMER EXPLAINER: Interactive Learning of Text-Generative Models](https://arxiv.org/pdf/2408.04619)

<http://poloclub.github.io/transformer-explainer/>

<https://bbycroft.net/llm>

### superposition

[Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)

### 3Blue1Brown

[用最直观的动画，讲解LLM如何存储事实，3Blue1Brown的这个视频又火了](https://mp.weixin.qq.com/s/PSMfQLBBQZyG2GwgzatqvA)

<https://www.youtube.com/watch?v=9-Jl0dxWQs8>

### Monitor

[他们掰开神经元，终于让大模型9.8大于9.11了：神秘创业公司，开源AI「洗脑」工具](https://mp.weixin.qq.com/s/pOOBY6cBZUn86xRtO12FtQ)

<https://transluce.org/observability-interface>

<https://monitor.transluce.org/dashboard/chat>

### LLM reflection (self-correction)

[ACL 2025｜自我怀疑还是自我纠正？清华团队揭示LLMs反思技术的暗面](https://mp.weixin.qq.com/s/Y8AOILcmnwoW68YrCR3LAQ)

[Understanding the Dark Side of LLMs’ Intrinsic Self-Correction](https://arxiv.org/abs/2412.14959)

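Why reflection fails:

* Internal answer wavering: on multi-turn QA tasks, the prompt "Are you sure? Please think and answer again" makes LLMs change their answers repeatedly. Reflection therefore causes the model's internal answer to fluctuate, a tendency toward "self-doubt" that can end in a wrong answer
* Prompt bias: when reflection fails, LLMs over-attend to the reflection prompt ("Are you sure? Think and answer again.") and neglect the question itself. In failed reflections the LLM attends more to the reflection instruction in 76.1% of cases, whereas when it sticks with the correct answer, **its attention to the reflection instruction and to the question is nearly equal**, at 50.8% and 49.2% respectively.
* Cognitive biases:
  * Overthinking: excessive planning without taking action
  * Cognitive overload: missing key information when reflecting over long text
  * Perfectionism bias: ignoring environmental constraints in pursuit of efficiency

Mitigations for reflection failure:

* Question repetition: append the original question at the end of the reflection prompt to keep the LLM focused on it.
* Few-shot fine-tuning: fine-tuning on a few samples (4-10) that introduce no new knowledge can correct the abnormal behavior behind reflection failures.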
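The question-repetition mitigation is easy to sketch as a prompt builder. The wording of both the reflection prompt and the restated-question suffix below is illustrative, not the paper's exact template.

```python
# Illustrative wording; the paper's exact prompt may differ.
REFLECTION_PROMPT = "Are you sure? Please think and answer again."


def build_reflection_turn(original_question, reflection_prompt=REFLECTION_PROMPT):
    """Build the follow-up user turn with the original question restated
    at the end, so the final tokens point back to the task itself."""
    return f"{reflection_prompt}\n\nThe original question was: {original_question}"


question = "Which is larger, 9.8 or 9.11?"
turn = build_reflection_turn(question)
print(turn)
```

Placing the question last is the point of the mitigation: the summary above attributes failures to the model attending to the reflection instruction instead of the question.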

### sparse circuits

[OpenAI新论文拆解语言模型内部机制：用「稀疏电路」解释模型行为](https://mp.weixin.qq.com/s/dAUzwXkQnw7bqiKkv1T1Hw)

[OpenAI又Open了一下：发布可解释性新研究，作者来自Ilya超级对齐团队](https://mp.weixin.qq.com/s/jF4qlkMH3l7A1ZBbpe4pig)

[Weight-sparse transformers have interpretable circuits](https://cdn.openai.com/pdf/41df8f28-d4ef-43e9-aed2-823f9393e470/circuit-sparsity-paper.pdf)

## Introspection

[LLM 比之前预想的更像人类，竟也能「三省吾身」](https://mp.weixin.qq.com/s/Ri-Wdl_Xk5OxWF5IIJmrxg)

[Looking Inward: Language Models Can Learn About Themselves by Introspection](https://arxiv.org/pdf/2410.13787)

## LLM + chip design

[登上Nature的AI芯片设计屡遭质疑，谷歌发文反击，Jeff Dean：质疑者连预训练都没做](https://mp.weixin.qq.com/s/u1NNmulcykGkgZjJb_A-UA)

[That Chip Has Sailed: A Critique of Unfounded Skepticism Around AI for Chip Design](https://arxiv.org/pdf/2411.10053)

## Miscellaneous

### Safety

[Anthropic安全负责人：在超级AI「毁灭」人类之前，我们可以做这些准备](https://mp.weixin.qq.com/s/nxD8qeCfG1tjfpvlJ6uacg)

[OpenAI最新53页论文：ChatGPT看人下菜碟，对“小美”比“小帅”更友好](https://mp.weixin.qq.com/s/NnLAjHuBPHa-aBoT6IV4Pg)

[First-Person Fairness in Chatbots](https://cdn.openai.com/papers/first-person-fairness-in-chatbots.pdf)

[翁荔B站分享原文：AI安全与“培养”之道](https://mp.weixin.qq.com/s/92QyZcwteFXaKfJk3GdTcQ)

### time-LLM

[谁说大象不能起舞! 重编程大语言模型实现跨模态交互的时序预测 | ICLR 2024](https://mp.weixin.qq.com/s/K04haPMcbKiS6OkCihXAqQ)

[Time-LLM: Time Series Forecasting by Reprogramming Large Language Models](https://arxiv.org/pdf/2310.01728.pdf)

<https://github.com/KimMeen/Time-LLM>

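Time-LLM recasts the **time-series forecasting task** as a **language task** that LLMs can solve effectively, activating the LLM's ability to do **high-accuracy time-series reasoning**.

* Reprogramming of the time-series input
* Prompt-as-prefix

### Long-tailed learning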

[A Systematic Review on Long-Tailed Learning](https://arxiv.org/pdf/2408.00483)

### Models that work reasonably well for text matching

Most are based on sentence-bert. I tried m3e-base on an e-commerce corpus and it worked well.

<https://huggingface.co/moka-ai/m3e-base>

<https://huggingface.co/shibing624/text2vec-base-chinese>

### Local knowledge base

<https://github.com/chatchat-space/Langchain-Chatchat>

### LLM application roundup

* ChatGPT aggregator: <https://hokex.com>
* Game generation: <https://latitude.io/>
* Homework assistant: <https://ontimeai.com/>
* Text-to-speech: <https://www.resemble.ai/>
* Art generation: <https://starryai.com/>
* Logo maker: <https://www.logoai.com/>
* AI writing: <https://www.getconch.ai/>
* Music production: <https://soundraw.io/>
* Voice imitation: <https://fakeyou.com/>
* Video from a single sentence: <https://runwayml.com/>
* Text-to-speech: <https://murf.ai/>

### swiftsage

[大语言模型在开放世界中的推理能力探索实践](https://mp.weixin.qq.com/s/LZ6lkTTOom-mbqV9IJ-OZg)

[SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks](https://arxiv.org/pdf/2305.17390.pdf)

### DAMO Academy LLM tech talk

<https://developer.aliyun.com/live/248332>

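Slides: [link](https://pan.baidu.com/s/1tbckFpa8W8qJ5yRw9yvJ9A#list/path=%2F) password: 5yyf

### Back-translation

One of the most effective ways to improve an NMT model with monolingual data is back-translation. If the goal is to train an English-to-German model, we can first train a German-to-English model and use it to translate all the monolingual German data into English. We then train the final English-to-German model on the original English-German parallel data plus the newly generated synthetic pairs.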

[Understanding Back-Translation at Scale](https://arxiv.org/pdf/1808.09381v2.pdf)

### NaN issues

[解决pytorch半精度amp训练nan问题](https://zhuanlan.zhihu.com/p/443166496)

### Time series

[LLM用于时序预测真的不行，连推理能力都没用到](https://mp.weixin.qq.com/s/C-N0tyQrEOoNoADtH_thTA)

[Are Language Models Actually Useful for Time Series Forecasting?](https://arxiv.org/pdf/2406.16964)

### Human brain

[MIT大牛新作震惊学界！AI「长脑子」了？LLM惊现「人类脑叶」结构并有数学代码分区](https://mp.weixin.qq.com/s/6lRAS8m4XqEfFFEP1Qa43A)

[The Geometry of Concepts: Sparse Autoencoder Feature Structure](https://arxiv.org/abs/2410.19750)

### LLMs for algorithm design

[调研180多篇论文，这篇综述终于把大模型做算法设计理清了](https://mp.weixin.qq.com/s/hfDzIBcw5HTxtSzpS_694g)

[A Systematic Survey on Large Language Models for Algorithm Design](https://arxiv.org/abs/2410.14716)

### Deep generative models course

[教授何恺明在MIT的第二门课——《深度生成模型》，讲座PPT陆续已出](https://mp.weixin.qq.com/s/t8S7cXVAXDWhS0ypzMXiCg)

<https://mit-6s978.github.io/schedule.html>

<https://mit-6s978.github.io/assets/pdfs/lec1_intro.pdf>

<https://mit-6s978.github.io/assets/pdfs/lec2_vae.pdf>

<https://mit-6s978.github.io/assets/pdfs/lec3_ar.pdf>

<https://mit-6s978.github.io/assets/pdfs/lec4_gan.pdf>

<https://mit-6s978.github.io/assets/pdfs/lec5_diffusion.pdf>

### Time-series databases

influxdb

<https://github.com/influxdata/influxdb>

<https://jasper-zhang1.gitbooks.io/influxdb/content/Introduction/getting_start.html>

### Miscellaneous

[原来，这些顶级大模型都是蒸馏的](https://mp.weixin.qq.com/s/GdwH7jxK2T_Vhus2ZvwQbw)

[Distillation Quantification for Large Language Models](https://arxiv.org/abs/2501.12619)

[小模型性能饱和、表现不佳，根源是因为Softmax?](https://mp.weixin.qq.com/s/bvv-frM8bKhkZiqOa9nqDA)

[Why do small language models underperform? Studying LM Saturation via the Softmax Bottleneck](https://arxiv.org/pdf/2404.07647.pdf)

[Ilya Sutskever's recommended reading list](https://arc.net/folder/D0472A20-9C20-4D3F-B145-D2865C0A9FEE)

[传说中Ilya Sutskever精选论文清单：AI领域40大论文完整版「破解」完成](https://mp.weixin.qq.com/s/7Bj_K1Vjp2FtfklfJsAMbQ)

[2024年大模型LLM还有哪些可研究的方向？](https://www.zhihu.com/question/637595961)

[Hinton万字访谈：用更大模型「预测下一个词」值得全力以赴](https://mp.weixin.qq.com/s/OydltjpVwsQ7hNBH6hq_Og)

[ChatGPT如何「思考」？心理学和神经科学破解AI大模型，Nature发文](https://mp.weixin.qq.com/s/4nO4DQE6Llfo3fiFSPSMhQ)

[适应多形态多任务，最强开源机器人学习系统「八爪鱼」诞生](https://mp.weixin.qq.com/s/HPTfOlw25F5JcvlY-Vy9Tw)

[Octo: An Open-Source Generalist Robot Policy](https://arxiv.org/pdf/2405.12213)

[LeCun新作：神经网络在实践中的灵活性到底有多大？](https://mp.weixin.qq.com/s/PjlXwwG3t5Fqp5MfrBVvBQ)

[Just How Flexible are Neural Networks in Practice?](https://arxiv.org/pdf/2406.11463)

[清华包揽最佳论文+时间检验奖，山大获荣誉提名，SIGIR 2024奖项出炉](https://mp.weixin.qq.com/s/Z2Mj7etx6KvYrSn8LhrJwg)

<https://zhuanlan.zhihu.com/p/654910335>

## Notes

### Printing a model's parameter count

<https://stackoverflow.com/questions/49201236/check-the-total-number-of-parameters-in-a-pytorch-model>

```python
from transformers import BartForConditionalGeneration, T5ForConditionalGeneration


def count_params(model, trainable_only=True):
    """Count model parameters; by default only those with requires_grad=True."""
    return sum(
        p.numel()
        for p in model.parameters()
        if p.requires_grad or not trainable_only
    )


# Trailing comments record: layers per encoder/decoder stack, exact count, rounded size.
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
print("bart-base")
print(count_params(model)) # 6L 139420416 139M

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
print("bart-large")
print(count_params(model)) # 12L 406291456 406M

model = T5ForConditionalGeneration.from_pretrained("t5-small")
print("t5-small")
print(count_params(model)) # 6L 60506624 60.5M

model = T5ForConditionalGeneration.from_pretrained("t5-base")
print("t5-base")
print(count_params(model)) # 12L 222903552 223M

model = T5ForConditionalGeneration.from_pretrained("t5-large")
print("t5-large")
print(count_params(model)) # 24L 737668096 738M
```

### Adding special tokens to an existing tokenizer

<https://stackoverflow.com/questions/69191305/how-to-add-new-special-token-to-the-tokenizer>

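```python
# Assumes `tokenizer` and `model` are already loaded.
num_added_toks = tokenizer.add_tokens(['[EOT]'], special_tokens=True)
# Grow the embedding matrix so the new token id has a row.
model.resize_token_embeddings(len(tokenizer))

# The tokenizer has to be saved if it is going to be reused.
output_dir = "path/to/output_dir"  # placeholder path
tokenizer.save_pretrained(output_dir)
```

Example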

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print("Before")
print(tokenizer.all_special_tokens) # --> ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
print(tokenizer.all_special_ids)    # --> [100, 102, 0, 101, 103]


special_tokens_dict = {'additional_special_tokens': ['[EOT]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
# model.resize_token_embeddings(len(tokenizer))  # --> Embedding(30523, 768)

tok_id = tokenizer.convert_tokens_to_ids('[EOT]')  # --> 30522

print("After")
print(tokenizer.all_special_tokens) # --> the list above plus '[EOT]'
print(tokenizer.all_special_ids)    # --> the list above plus 30522
```

### Reader-writer locks in Python

One writer, multiple concurrent readers: <https://pypi.org/project/readerwriterlock/>

### GPU memory leaks in PyTorch

<https://github.com/pytorch/pytorch/issues/13246#issuecomment-445770039>

### torch profiling

<https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html>

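The trace can be visualized with <https://ui.perfetto.dev/>:

* Click "Open trace file" and upload the JSON file
* The timeline shows two python processes; clicking a CUDA kernel draws an arrow that makes it easy to find **which op launched that kernel**
  * The upper python process is the host side (mostly the **APIs/PyTorch ops** called from user code, which map easily back to the training code)
  * The lower python process is the device (GPU) side (the actual CUDA kernel executions and some performance-related data)

A sparse device timeline means poor training performance and low GPU utilization; the training code may need to be checked for problems.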
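"Sparse device timeline" can also be quantified directly from the exported trace. The sketch below assumes the Chrome trace event layout (a `traceEvents` list whose complete events have `ph == "X"`, `ts` and `dur` in microseconds) and that GPU kernels carry the category `kernel`; the category name is an assumption, so check it against your own trace. The events in the example are synthetic.

```python
def device_busy_fraction(trace, kernel_cat="kernel"):
    """Fraction of the span between the first and last kernel during
    which at least one kernel was running (overlaps are merged)."""
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    spans = sorted(
        (e["ts"], e["ts"] + e["dur"])
        for e in events
        if e.get("ph") == "X" and str(e.get("cat", "")).lower() == kernel_cat
    )
    if not spans:
        return 0.0
    busy = 0.0
    cur_start, cur_end = spans[0]
    for start, end in spans[1:]:
        if start > cur_end:           # gap: close the current merged span
            busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:                         # overlap: extend the current span
            cur_end = max(cur_end, end)
    busy += cur_end - cur_start
    window = cur_end - spans[0][0]
    return busy / window if window else 0.0


# Synthetic trace: two overlapping kernels, a long gap, then one more kernel.
demo_trace = {
    "traceEvents": [
        {"ph": "X", "cat": "kernel", "name": "gemm", "ts": 0, "dur": 10},
        {"ph": "X", "cat": "kernel", "name": "relu", "ts": 5, "dur": 15},
        {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "ts": 0, "dur": 90},
        {"ph": "X", "cat": "kernel", "name": "gemm", "ts": 80, "dur": 20},
    ]
}
fraction = device_busy_fraction(demo_trace)
print(f"device busy fraction: {fraction:.2f}")
```

For a real trace, load the file with `json.load` first (via `gzip.open` if it was exported as `.json.gz`) and pass the resulting dict in.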

### Debugging GPU memory leaks

<https://pytorch.ac.cn/docs/stable/torch_cuda_memory.html>

<https://pytorch.org/blog/understanding-gpu-memory-1/>

<https://pytorch.org/blog/understanding-gpu-memory-2/>

Inspecting GPU memory

```python
# (c) Meta Platforms, Inc. and affiliates. 
# https://pytorch.org/blog/understanding-gpu-memory-1/
import logging
import socket
from datetime import datetime, timedelta

import torch

from torchvision import models

logging.basicConfig(
   format="%(levelname)s:%(asctime)s %(message)s",
   level=logging.INFO,
   datefmt="%Y-%m-%d %H:%M:%S",
)
logger: logging.Logger = logging.getLogger(__name__)
logger.setLevel(level=logging.INFO)

TIME_FORMAT_STR: str = "%b_%d_%H_%M_%S"

# Keep a max of 100,000 alloc/free events in the recorded history
# leading up to the snapshot.
MAX_NUM_OF_MEM_EVENTS_PER_SNAPSHOT: int = 100000

def start_record_memory_history() -> None:
   if not torch.cuda.is_available():
       logger.info("CUDA unavailable. Not recording memory history")
       return

   logger.info("Starting snapshot record_memory_history")
   torch.cuda.memory._record_memory_history(
       max_entries=MAX_NUM_OF_MEM_EVENTS_PER_SNAPSHOT
   )

def stop_record_memory_history() -> None:
   if not torch.cuda.is_available():
       logger.info("CUDA unavailable. Not recording memory history")
       return

   logger.info("Stopping snapshot record_memory_history")
   torch.cuda.memory._record_memory_history(enabled=None)

def export_memory_snapshot() -> None:
   if not torch.cuda.is_available():
       logger.info("CUDA unavailable. Not exporting memory snapshot")
       return

   # Prefix for file names.
   host_name = socket.gethostname()
   timestamp = datetime.now().strftime(TIME_FORMAT_STR)
   file_prefix = f"{host_name}_{timestamp}"

   try:
       logger.info(f"Saving snapshot to local file: {file_prefix}.pickle")
       torch.cuda.memory._dump_snapshot(f"{file_prefix}.pickle")
   except Exception as e:
       logger.error(f"Failed to capture memory snapshot {e}")
       return

# Simple Resnet50 example to demonstrate how to capture memory visuals.
def run_resnet50(num_iters=5, device="cuda:0"):
   model = models.resnet50().to(device=device)
   inputs = torch.randn(1, 3, 224, 224, device=device)
   labels = torch.rand_like(model(inputs))
   optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
   loss_fn = torch.nn.CrossEntropyLoss()

   # Start recording memory snapshot history
   start_record_memory_history()

   for _ in range(num_iters):
       pred = model(inputs)
       loss_fn(pred, labels).backward()
       optimizer.step()
       optimizer.zero_grad(set_to_none=True)

   # Create the memory snapshot file
   export_memory_snapshot()

   # Stop recording memory snapshot history
   stop_record_memory_history()

if __name__ == "__main__":
    # Run the resnet50 model
    run_resnet50()
```

Profiling CPU and GPU memory together

```python
# (c) Meta Platforms, Inc. and affiliates. 
# https://pytorch.org/blog/understanding-gpu-memory-1/
import logging
import socket
from datetime import datetime, timedelta

import torch

from torch.autograd.profiler import record_function
from torchvision import models

logging.basicConfig(
   format="%(levelname)s:%(asctime)s %(message)s",
   level=logging.INFO,
   datefmt="%Y-%m-%d %H:%M:%S",
)
logger: logging.Logger = logging.getLogger(__name__)
logger.setLevel(level=logging.INFO)

TIME_FORMAT_STR: str = "%b_%d_%H_%M_%S"

def trace_handler(prof: torch.profiler.profile):
   # Prefix for file names.
   host_name = socket.gethostname()
   timestamp = datetime.now().strftime(TIME_FORMAT_STR)
   file_prefix = f"{host_name}_{timestamp}"

   # Construct the trace file.
   prof.export_chrome_trace(f"{file_prefix}.json.gz")

   # Construct the memory timeline file.
   prof.export_memory_timeline(f"{file_prefix}.html", device="cuda:0")

def run_resnet50(num_iters=5, device="cuda:0"):
   model = models.resnet50().to(device=device)
   inputs = torch.randn(1, 3, 224, 224, device=device)
   labels = torch.rand_like(model(inputs))
   optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
   loss_fn = torch.nn.CrossEntropyLoss()

   with torch.profiler.profile(
       activities=[
           torch.profiler.ProfilerActivity.CPU,
           torch.profiler.ProfilerActivity.CUDA,
       ],
       schedule=torch.profiler.schedule(wait=0, warmup=0, active=6, repeat=1),
       record_shapes=True,
       profile_memory=True,
       with_stack=True,
       on_trace_ready=trace_handler,
   ) as prof:
       for _ in range(num_iters):
           prof.step()
           with record_function("## forward ##"):
               pred = model(inputs)

           with record_function("## backward ##"):
               loss_fn(pred, labels).backward()

           with record_function("## optimizer ##"):
               optimizer.step()
               optimizer.zero_grad(set_to_none=True)

if __name__ == "__main__":
    # Warm up
    run_resnet50()
    # Run the resnet50 model
    run_resnet50()
```

### Comparison of GPU models

<https://zhuanlan.zhihu.com/p/441153412>

### Viewing a Python process's stack

```shell
pip install py-spy
py-spy dump --pid 1199

```

Output:

```
Process 1199: /usr/bin/python3.10 -u torch_main.py
Python v3.10.14 (/usr/bin/python3.10)

Thread 0x7F62A2C43740 (active): "MainThread"
    _wait_for_tstate_lock (threading.py:1116)
    join (threading.py:1096)
    main (torch_main.py:776)
    <module> (torch_main.py:785)
Thread 0xAABBBCC (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:607)
    run (threading.py:1376)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 0xAAAAA (idle): "Thread-3 (process)"
    wait (threading.py:320)
    get (queue.py:171)
    process (abase_writer.py:73)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 0xA992ACDA (idle): "Thread-4 (process)"
    wait (threading.py:320)
    get (queue.py:171)
    process (abase_writer.py:73)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 0xAFF11AA (active): "Thread-5 (read_file)"
    get_seq (ecom_seq_reader.py:200)
    read_file (torch_main.py:494)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 0x9922BCDA (idle): "Thread-6"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
```

### HuggingFace model mirror for mainland China

<https://hf-mirror.com/>

### Fixes for some errors

#### flash-attention2

<https://github.com/Dao-AILab/flash-attention/issues/451>

```shell
FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn
```

## GPU model comparison

* L20: <https://www.techpowerup.com/gpu-specs/l20.c4206>
  * TFLOPS:
    * tf32: 59.8
    * fp32: 59.8
    * bf16: 119.5
    * fp16: 119.5
* A800-40G: <https://www.techpowerup.com/gpu-specs/a800-pcie-40-gb.c3964>
  * TFLOPS:
    * tf32: 156
    * fp32: 19.5
    * bf16: 312
    * fp16: 77.97 (the non-Tensor-Core figure listed by TechPowerUp; with Tensor Cores fp16 matches bf16 at 312)

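## Issues and lessons learned

Collapse (the model keeps recalling the same k items): the attention scores are too concentrated; low-rank features (generalization features) tend to cause this.

Look at the distribution of attention scores: if the first layer leans toward the target item but the second layer is nearly uniform, the gain is probably not being realized, most likely because the model has not learned well.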
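One way to put a number on "too concentrated" versus "nearly uniform" is the normalized entropy of each layer's attention weights. This is a suggested diagnostic rather than something from the notes, and the two weight vectors are made-up examples.

```python
import math


def normalized_entropy(weights):
    """Entropy of an attention distribution divided by log(n):
    0 means all mass on one item, 1 means perfectly uniform."""
    n = len(weights)
    if n < 2:
        return 0.0
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(n)


# Made-up attention weights over four history items.
layer1 = [0.85, 0.05, 0.05, 0.05]   # concentrated on one item
layer2 = [0.25, 0.25, 0.25, 0.25]   # uniform

for name, weights in [("layer1", layer1), ("layer2", layer2)]:
    print(name, round(normalized_entropy(weights), 3))
```

Tracking this value per layer over training makes both failure modes visible: values stuck near 0 point to the collapse case, and values near 1 to a layer that is not discriminating between items.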

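Offline AUC improves but there is no online gain:

* Reload correctness: check whether warmup reported any errors
* Distribution efficiency of the NN: from 20 min down to x min; compression and decompression time
* Training sufficiency: the gain showed up once the A/B test had run longer
  * Parameters with too little accumulated gradient are undertrained
  * More historical samples
  * Multiple epochs (see what Kuaishou and Alibaba do, e.g. resetting embeddings)
  * Add auxiliary losses, e.g. generative or distillation losses
* NaNs: issues such as bf16 conversion; add gradient clipping, normalization, etc.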
