1.9.llm_others
Leaked system prompts
https://github.com/asgeirtj/system_prompts_leaks/blob/main/OpenAI/gpt-5-thinking.md
Physics of language
(toread)
https://www.bilibili.com/video/BV1Yw4m1k7nH
Distillation
ABKD
ICML 2025 Spotlight: ABKD, a knowledge distillation framework guided by generalized α-β divergence that pursues Pareto-optimal probability mass allocation
https://github.com/ghwang-s/abkd
Existing problems:
Forward KL: probability allocation is too "relaxed"; the student spreads probability mass evenly across classes and struggles to focus on the target class.
Reverse KL: probability allocation is too "aggressive"; the student fixates on the high-confidence classes and ignores the teacher's global information.
ABKD introduces the α-β divergence, which unifies forward/reverse KL and generalizes to previously unexplored cases such as the Hellinger distance and β-divergence (a forward/reverse KL sketch follows).
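A minimal PyTorch sketch contrasting forward- and reverse-KL distillation losses as described above (function names and the temperature are illustrative; this is not the ABKD implementation):
import torch
import torch.nn.functional as F
def forward_kl(student_logits, teacher_logits, T=1.0):
    # KL(teacher || student): mass-covering, the student spreads probability broadly
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T
def reverse_kl(student_logits, teacher_logits, T=1.0):
    # KL(student || teacher): mode-seeking, the student locks onto high-confidence classes
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    log_p_t = F.log_softmax(teacher_logits / T, dim=-1)
    return (log_p_s.exp() * (log_p_s - log_p_t)).sum(-1).mean() * T * T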
LLM+math
mathscale
[LLM + Math] MathScale: a method for scaling instruction tuning for mathematical reasoning
MathScale: Scaling Instruction Tuning for Mathematical Reasoning
AlphaGeometry
Gold-medal-level olympiad geometry: DeepMind's geometry reasoning model published in Nature, code open-sourced, praised by a Fields Medalist; proposes AlphaGeometry
AlphaProof & AlphaGeometry 2
Google AI wins an IMO silver medal as the math reasoning model AlphaProof debuts ("reinforcement learning is so back"); proposes AlphaProof and AlphaGeometry 2
WE-Math benchmark
The truth is out: large models solve math problems very differently from humans, with obvious rote memorization and knowledge gaps; GPT-4o performs best
WE-MATH: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
https://github.com/We-Math/We-Math
https://huggingface.co/datasets/We-Math/We-Math
case-based or rule-based
ICML 2024 | How do Transformers actually reason: case-based or rule-based?
Case-Based or Rule-Based: How Do Transformers Do the Math?
https://github.com/GraphPKU/Case_or_Rule
Common LLM problems
LLM as a judge
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge (Meta, January 2025)
Repetitive generation
https://www.zhihu.com/question/616130636
https://mp.weixin.qq.com/s/cSwWapqFhxu9zafzPUeVEw
Interpreting the Repeated Token Phenomenon in Large Language Models
A DeepMind paper; it finds the phenomenon is related to the attention sink (the initial token receives very high attention scores): early attention layers mark the first word of the sequence, and specific neurons in later layers amplify the hidden-state values of those marked tokens. When processing repeated words this mechanism breaks down, causing abnormal model behavior.
https://github.com/yossigandelsman/attn_sinkhole
Hallucination
Surveys
OpenAI's Lilian Weng wrote a long post on LLM hallucination: from understanding to overcoming it
https://lilianweng.github.io/posts/2024-07-07-hallucination/
Semantic entropy
Detecting hallucinations in large language models using semantic entropy
Zilliz
Memorization
Localizing Paragraph Memorization in Language Models
Code: https://github.com/googleinterns/localizing-paragraph-memorization
Can we localize the weights and mechanisms a language model uses to memorize entire paragraphs from its training data?
Although memorization is spread across many layers and components of the model, the gradients of memorized paragraphs show a discernible spatial pattern: they are larger in the lower layers than the gradients of non-memorized examples.
Fine-tuning only the high-gradient weights can make the model unlearn the memorized examples.
The paper localizes a lower-layer attention head that is particularly involved in paragraph memorization; it mainly attends to distinctive, rare tokens that are least frequent in the corpus-level word-frequency distribution.
Overall, compared with non-memorized continuations, memorized continuations are both harder to unlearn and harder to corrupt.
reasoning
Do Large Language Models Latently Perform Multi-Hop Reasoning? (DeepMind)
How do Language Models Bind Entities in Context? (UC Berkeley, ICLR 2024)
memorizing
Knowledge Neurons in Pretrained Transformers ACL 2022
Language Modeling Is Compression ICLR 2024 deepmind
Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models meta NeurIPS 2022
Jailbreaking
The sin of long context: a new jailbreak technique from the Claude team, and none of Llama 2 through GPT-4 are spared
LLM compiler
A treat for developers: Meta's newly released LLM Compiler reaches 77% auto-tuning efficiency
Meta Large Language Model Compiler: Foundation Models of Compiler Optimization
ProLong
Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models
https://github.com/October2001/ProLong
Whitening
In the transformer context, "whitening" mainly refers to a post-processing method for sentence embeddings: transform the sentence vectors so that their mean is 0 and their covariance matrix is the identity, which addresses the anisotropy of sentence embeddings. The technique improves sentence embeddings on semantic-similarity tasks and also speeds up retrieval (see the sketch after the links below).
Whitening Sentence Representations for Better Semantics and Faster Retrieval
Code: https://github.com/bojone/BERT-whitening
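A minimal NumPy sketch of the whitening transform described above, following the BERT-whitening recipe (variable names are illustrative):
import numpy as np
def compute_whitening(embs):
    # embs: (N, d) sentence embeddings; returns the mean and the whitening matrix
    mu = embs.mean(axis=0, keepdims=True)
    cov = np.cov((embs - mu).T)
    u, s, _ = np.linalg.svd(cov)
    W = u @ np.diag(1.0 / np.sqrt(s + 1e-12))  # whitening matrix
    return mu, W
def whiten(embs, mu, W, dim=None):
    # Map embeddings to zero mean / identity covariance; optionally keep only the first `dim` dims
    out = (embs - mu) @ W
    return out[:, :dim] if dim else out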
Transformer Scale Gate for Semantic Segmentation
Distillation
Revisiting Knowledge Distillation for Autoregressive Language Models
Meta develops a System 2 distillation technique; the Llama 2 chat model reaches nearly 100% task accuracy
Distilling System 2 into System 1
Prover-verifier games
A late work from OpenAI's superalignment team: two large models play a game against each other, and the outputs become easier to understand
Prover-Verifier Games improve legibility of LLM outputs
See also: Learning to Give Checkable Answers with Prover-Verifier Games
Ethical risks
GPT-4o imitates human voices, and its eerie screams alarmed OpenAI researchers; a 32-page technical report is released
OpenAI's report: GPT-4o System Card
DeepMind also published an earlier report: The Ethics of Advanced AI Assistants
Selection bias
ACL 2024 | Optimizing and applying LLM selection bias for feature evaluation in Tencent Ads
Strengthened Symbol Binding Makes Large Language Models Reliable Multiple-Choice Selectors
Given a question and its options, the LLM fails to bind the option content to its symbol (A/B/C/D). For example, when the correct answer "the president" is placed under option B, the model picks the right answer; but when it is moved to option C, the model still chooses "B", i.e., it is biased toward "B" (or the second option) and ignores the content of the correct answer.
lost in the middle
Lost in the Middle: How Language Models Use Long Contexts
reasoning boundary
NeurIPS 2024 (Oral) | How to quantify and extend the reasoning boundary of chain-of-thought?
https://github.com/LightChen233/reasoning-boundary
Language ≠ thought
Language ≠ thought, and large models cannot learn reasoning from it: a Nature paper that set the AI community on fire
https://www.nature.com/articles/s41586-024-07522-w
Some other important work
A few frequently cited papers
Scaling instruction-finetuned language models (800+ citations)
How can we know what language models know? (800+ citations)
Chain of thought prompting elicits reasoning in large language models (1800+ citations)
Some work from Anthropic
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Studying Large Language Model Generalization with Influence Functions
Measuring Faithfulness in Chain-of-Thought Reasoning
Extracting millions of features from Claude 3: the first detailed look into how a large model "thinks"
Scaling Dictionary Learning to Claude 3 Sonnet
LLMs caught tampering with code to get rewards, and their deception of humans cannot be fully removed or reversed: Anthropic's new work reveals a startling truth
Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models
Personalized search
Pick an arbitrary paper, e.g. Denoising Attention for Query-aware User Modeling in Personalized Search, and look at its references:
Academia:
Industry:
Personalized Query Suggestions, SIGIR 2020, LinkedIn
Improvements to the q-i two-tower model
Adding location + social features
Embedding-based Retrieval in Facebook Search, KDD 2020
A q-d two-tower structure; at the bottom layer of both towers they add:
location: the user's geographic location, e.g., city
social: Facebook is a social network, so user and item embeddings trained by a separate graph-based model are fed in directly
Adding user behavior sequences
Encoding History with Context-aware Representation Learning for Personalized Search, SIGIR 2020, Renmin University; proposes HTPS

The user's historical q-d pairs, together with the current query, are passed through a short-term transformer and a long-term transformer to get the outputs.

A [MASK] token is appended and fed through the transformer to obtain the predicted intent.
These are then fused through a gating NN to get the final context-aware query representation.
The final doc-query score has two parts, fused by an MLP with tanh activation (a rough sketch follows this block):
the semantic similarity between q and d, which can come from a standard NLP model
the personalized score between q and d, computed with cosine similarity (see the paper for the exact formula)
Two losses:
pred loss: predict the intent, i.e., the next query, by taking the cosine between the predicted intent and the average of the word vectors of the next query
rank loss: a LambdaRank-style pairwise loss computed on the scores
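A rough PyTorch sketch of the two-part scoring described above (dimensions and names are assumptions, not the HTPS paper's implementation):
import torch
import torch.nn as nn
import torch.nn.functional as F
class ScoreFusion(nn.Module):
    # Fuse a semantic q-d score and a personalized (context-aware) q-d score
    # with a small tanh-activated MLP, as in the notes above (illustrative only).
    def __init__(self, hidden=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(), nn.Linear(hidden, 1))
    def forward(self, q_sem, d_sem, q_ctx, d_ctx):
        sem_score = F.cosine_similarity(q_sem, d_sem, dim=-1)  # semantic similarity
        per_score = F.cosine_similarity(q_ctx, d_ctx, dim=-1)  # personalized score
        return self.mlp(torch.stack([sem_score, per_score], dim=-1)).squeeze(-1)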
Three towers + GNN neighbors + MTL
A GNN-based Multi-task Learning Framework for Personalized Video Search, WSDM 2022, Baidu; proposes MGNN-PVS
Most existing PSMs (personalized search methods) are trained on user feedback (e.g., clicks), which has drawbacks:
the feedback signal mostly reflects attractiveness rather than relevance
users' historical signals are sparse, so it is hard to learn a good PSM
Two bipartite graphs: u-q and q-d

Three towers:
user:
the user itself
one-hop neighbors: the queries (u->q)
two-hop neighbors: the users (u->q->u)
query:
the query itself
one-hop neighbors: the docs (q->d)
two-hop neighbors: the queries (q->d->q)
doc:
the doc's own title vector (trained with a query / positive-title / negative-title triplet loss) and video vector (trained with a video / positive-query / negative-query triplet loss)
two-hop neighbors: the docs (d->q->d)
Two tasks (a rough sketch follows this list):
CTR prediction: concatenate u and q and pass them through an NN to get a personalized query vector, then take the inner product with the vector d gets from its own NN; cross-entropy loss
Relevance prediction: q through another NN, d through another NN, inner product; MSE loss
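An illustrative PyTorch sketch of the two task heads above (dimensions and layer choices are assumptions, not the MGNN-PVS implementation):
import torch
import torch.nn as nn
import torch.nn.functional as F
class TwoTaskHeads(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        # CTR branch: personalized query from [u; q], doc vector from its own NN
        self.ctr_q = nn.Sequential(nn.Linear(dim * 2, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.ctr_d = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Relevance branch: separate NNs for q and d
        self.rel_q = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.rel_d = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
    def forward(self, u, q, d, click_label, rel_label):
        ctr_logit = (self.ctr_q(torch.cat([u, q], dim=-1)) * self.ctr_d(d)).sum(-1)
        rel_pred = (self.rel_q(q) * self.rel_d(d)).sum(-1)
        ctr_loss = F.binary_cross_entropy_with_logits(ctr_logit, click_label)  # CTR task
        rel_loss = F.mse_loss(rel_pred, rel_label)                             # relevance task
        return ctr_loss + rel_loss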
LLM model merging
https://github.com/arcee-ai/mergekit
SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling
Evolutionary Optimization of Model Merging Recipes
https://github.com/SakanaAI/evolutionary-model-merge
LLM auto-ml
LLaMA-NAS
LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models
SELA
MetaGPT open-sources SELA, using AI to design AI, with results surpassing the AIDE used by OpenAI
SELA: Tree-Search Enhanced LLM Agents for Automated Machine Learning
https://github.com/geekan/MetaGPT/tree/main/metagpt/ext/sela
Explainable AI
Usable XAI: 10 Strategies Towards Exploiting Explainability in the LLM Era
https://github.com/JacksonWuxs/UsableXAI_LLM
Surveys
The ultimate question of interpretability: what counts as a first-principles explanation? 20 CCF-A and ICLR papers give the answer
TransformerLens
A project by Neel Nanda (DeepMind)
https://transformerlensorg.github.io/TransformerLens/
ecco
https://jalammar.github.io/explaining-transformers/
https://jalammar.github.io/hidden-states/
interpretability in the wild
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
https://github.com/redwoodresearch/Easy-Transformer
activation engineering
Activation Addition: Steering Language Models Without Optimization
representation engineering
Representation Engineering: A Top-Down Approach to AI Transparency
transformer-debugger
https://github.com/openai/transformer-debugger/tree/main
painter
Transformer Layers as Painters
transformer explainer
The black box has been opened: a playable Transformer visualization tool that runs GPT-2 locally with real-time inference
Transformer Explainer: Interactive Learning of Text-Generative Models
http://poloclub.github.io/transformer-explainer/
superposition
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
3Blue1Brown
Explaining how LLMs store facts with highly intuitive animations; this 3Blue1Brown video went viral again
https://www.youtube.com/watch?v=9-Jl0dxWQs8
Monitor
They pried open the neurons and finally got a large model to say 9.8 > 9.11: a mysterious startup open-sources an AI "brainwashing" tool
https://transluce.org/observability-interface
https://monitor.transluce.org/dashboard/chat
LLM self-reflection
ACL 2025 | Self-doubt or self-correction? A Tsinghua team reveals the dark side of LLM reflection techniques
Understanding the Dark Side of LLMs’ Intrinsic Self-Correction
Why reflection fails:
Internal answer wavering: on multi-turn QA tasks, the prompt "Are you sure? Think again before answering." makes LLMs repeatedly change their answers. Reflection thus induces internal answer wavering and a tendency toward "self-doubt", which can ultimately lead to wrong answers.
Prompt bias: when reflection fails, LLMs over-attend to the prompt "Are you sure? Think again before answering." and neglect the question itself; in failed reflections they focus on the reflection instruction 76.1% of the time, whereas when they stick with a correct answer, attention to the reflection instruction and the question is nearly balanced at 50.8% vs. 49.2%.
Cognitive biases:
Overthinking: excessive strategizing without taking action
Cognitive overload: missing key information when reflecting over long texts
Perfectionism bias: ignoring environmental constraints in pursuit of efficiency
Mitigations for reflection failures:
Question repetition: append the original question to the end of the reflection prompt to keep the LLM focused on it (see the sketch after this list).
Few-shot fine-tuning: fine-tuning on a few samples (4-10) that introduce no new knowledge can correct the abnormal behaviors behind reflection failures.
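A minimal sketch of the question-repetition mitigation (the prompt wording is illustrative, not the paper's exact template):
def build_reflection_prompt(question: str, prev_answer: str) -> str:
    # Repeat the original question after the reflection instruction so the model
    # keeps attending to the question rather than the instruction.
    return (
        f"Question: {question}\n"
        f"Your previous answer: {prev_answer}\n"
        "Are you sure? Think again before answering.\n"
        f"(Reminder, the original question is: {question})"
    )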
Introspection
Looking Inward: Language Models Can Learn About Themselves by Introspection
Embodied AI
Large models move into the physical world: TeleAI releases a survey of large-model-driven embodied AI covering 300 papers
Embodied-AI with large models: research and challenges
ReKep
Fei-Fei Li's team proposes ReKep, giving robots spatial intelligence and integrating GPT-4o
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation
https://github.com/huangwl18/ReKep
GR-2
GR-2 arrives: ByteDance Research proposes a robot foundation model with world modeling and strong generalization
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
https://gr2-manipulation.github.io/
RDT-1B
Tsinghua open-sources RDT, the world's largest bimanual-robot diffusion model; it can mix drinks and walk dogs, and tops the Hugging Face embodied-AI leaderboard
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
HIL-SERL
One to two hours of RL training and 100% autonomous task completion: has robotics' ChatGPT moment really arrived?
Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning
Genesis
Two years in the making, an impressive work by a Chinese team: the generative physics engine Genesis is open-sourced and can simulate virtually everything
https://github.com/Genesis-Embodied-AI/Genesis
language of motion
Fei-Fei Li's team unifies motion and language: the new multimodal model not only follows instructions well but also reads implied emotions
The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion
Visual-spatial intelligence
Fei-Fei Li, Saining Xie, et al. explore "visual-spatial intelligence" in MLLMs; netizens: 2025 looks promising
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Embodied multimodal reasoning
Embodied multimodal reasoning in a unified framework: 自变量机器人 lets AI put down Heidegger's hammer
PEVA
A Berkeley & Meta world model for embodied AI: letting AI "see" the future through whole-body actions
Whole-Body Conditioned Egocentric Video Prediction
https://dannytran123.github.io/PEVA/
MTU3D
ICCV 2025 full-score paper: one model unifies spatial understanding and active exploration
https://github.com/MTU3D/MTU3D
CoA
A new imitation-learning paradigm, Chain-of-Action: action reasoning via trajectory autoregression
Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation
https://github.com/ByteDance-Seed/Chain-of-Action
LLM + chip design
The Nature-published AI chip-design work keeps getting questioned; Google publishes a rebuttal, and Jeff Dean notes the skeptics didn't even do pre-training
That Chip Has Sailed: A Critique of Unfounded Skepticism Around AI for Chip Design
Others
Safety
Anthropic's head of safety: what we can do to prepare before a superintelligent AI "destroys" humanity
OpenAI's latest 53-page paper: ChatGPT treats users differently by name, being friendlier to "Xiaomei" than to "Xiaoshuai"
First-Person Fairness in Chatbots
time-LLM
Who says elephants can't dance! Reprogramming large language models for cross-modal time-series forecasting | ICLR 2024
Time-LLM: Time Series Forecasting by Reprogramming Large Language Models
https://github.com/KimMeen/Time-LLM
Reframes time-series forecasting as a language task that LLMs can solve effectively, successfully unlocking LLMs' ability to do high-accuracy time-series reasoning.
Time-series input reprogramming
Prompt-as-prefix
Long tail
A Systematic Review on Long-Tailed Learning
Text-matching models that work reasonably well
Most are based on sentence-BERT; m3e-base has been tried on e-commerce corpora with good results
https://huggingface.co/moka-ai/m3e-base
https://huggingface.co/shibing624/text2vec-base-chinese
Local knowledge bases
https://github.com/chatchat-space/Langchain-Chatchat
Collections of LLM applications
ChatGPT aggregator: https://hokex.com
Game generation: https://latitude.io/
Homework help: https://ontimeai.com/
Text-to-speech: https://www.resemble.ai/
AI art: https://starryai.com/
Logo design: https://www.logoai.com/
AI writing: https://www.getconch.ai/
Music generation: https://soundraw.io/
Voice cloning: https://fakeyou.com/
Text-to-video from a single sentence: https://runwayml.com/
Text-to-speech: https://murf.ai/
swiftsage
SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks
DAMO Academy large-model tech talk
https://developer.aliyun.com/live/248332
Slides: link (password: 5yyf)
Back-translation
One of the most effective ways to improve an NMT model with monolingual data is back-translation. If the goal is an English-to-German model, first train a German-to-English model and use it to translate all of the monolingual German data; then train the final English-to-German model on the original English-German data plus the newly generated pairs (a sketch follows the paper link below).
Understanding Back-Translation at Scale
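A minimal sketch of back-translation data generation as described above (the de_en_model object and its translate() method are illustrative, not a real API):
def back_translate(mono_de_sentences, de_en_model):
    # Translate monolingual German into synthetic English sources,
    # keeping the original German as the target side.
    synthetic_pairs = []
    for de in mono_de_sentences:
        en = de_en_model.translate(de)    # hypothetical translation call
        synthetic_pairs.append((en, de))  # (synthetic EN source, real DE target)
    return synthetic_pairs
# Final training data = original parallel EN-DE pairs + back_translate(mono_de, de_en_model)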
NaN problems
Time series
Are Language Models Actually Useful for Time Series Forecasting?
The human brain
A stunning new work from an MIT heavyweight: has AI "grown a brain"? LLMs show human-like "brain lobe" structure with math and code regions
The Geometry of Concepts: Sparse Autoencoder Feature Structure
LLM for algorithm design
A Systematic Survey on Large Language Models for Algorithm Design
Deep generative models course
Professor Kaiming He's second course at MIT, "Deep Generative Models"; the lecture slides are being released
https://mit-6s978.github.io/schedule.html
https://mit-6s978.github.io/assets/pdfs/lec1_intro.pdf
https://mit-6s978.github.io/assets/pdfs/lec2_vae.pdf
https://mit-6s978.github.io/assets/pdfs/lec3_ar.pdf
https://mit-6s978.github.io/assets/pdfs/lec4_gan.pdf
https://mit-6s978.github.io/assets/pdfs/lec5_diffusion.pdf
Time-series databases
influxdb
https://github.com/influxdata/influxdb
https://jasper-zhang1.gitbooks.io/influxdb/content/Introduction/getting_start.html
Others
Distillation Quantification for Large Language Models
Why do small language models underperform? Studying LM Saturation via the Softmax Bottleneck
传说中Ilya Sutskever精选论文清单:AI领域40大论文完整版「破解」完成
Hinton万字访谈:用更大模型「预测下一个词」值得全力以赴
ChatGPT如何「思考」?心理学和神经科学破解AI大模型,Nature发文
Octo: An Open-Source Generalist Robot Policy
Just How Flexible are Neural Networks in Practice?
Tsinghua sweeps best paper and the test-of-time award, Shandong University gets an honorable mention: the SIGIR 2024 awards are out
https://zhuanlan.zhihu.com/p/654910335
Miscellaneous notes
Printing the number of model parameters
https://stackoverflow.com/questions/49201236/check-the-total-number-of-parameters-in-a-pytorch-model
pytorch_total_params = sum(p.numel() for p in model.parameters())
pytorch_total_trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
# Load the model
from transformers import BartForConditionalGeneration
from transformers import T5ForConditionalGeneration
def cal(model):
    pytorch_total_trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return pytorch_total_trainable_params
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
print("bart-base")
print(cal(model))  # 6L 139420416 139M
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
print("bart-large")
print(cal(model))  # 12L 406291456 406M
model = T5ForConditionalGeneration.from_pretrained("t5-small")
print("t5-small")
print(cal(model))  # 6L 60506624 60M
model = T5ForConditionalGeneration.from_pretrained("t5-base")
print("t5-base")
print(cal(model))  # 12L 222903552 223M
model = T5ForConditionalGeneration.from_pretrained("t5-large")
print("t5-large")
print(cal(model))  # 24L 737668096 738M
Adding special tokens to an existing tokenizer
https://stackoverflow.com/questions/69191305/how-to-add-new-special-token-to-the-tokenizer
num_added_toks = tokenizer.add_tokens(['[EOT]'], special_tokens=True) ##This line is updated
model.resize_token_embeddings(len(tokenizer))
###The tokenizer has to be saved if it has to be reused
tokenizer.save_pretrained(<output_dir>)
Example:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print("Before")
print(tokenizer.all_special_tokens) # --> ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
print(tokenizer.all_special_ids) # --> [100, 102, 0, 101, 103]
special_tokens_dict = {'additional_special_tokens': ['[EOT]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
# model.resize_token_embeddings(len(tokenizer)) # --> Embedding(30523, 768)
tok_id = tokenizer.convert_tokens_to_ids('[EOT]') # --> 30522
print("After")
print(tokenizer.all_special_tokens) # --> ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]', '[EOT]']
print(tokenizer.all_special_ids) # --> [100, 102, 0, 101, 103, 30522]
Python reader-writer lock
One writer, multiple concurrent readers: https://pypi.org/project/readerwriterlock/
PyTorch GPU memory leaks
https://github.com/pytorch/pytorch/issues/13246#issuecomment-445770039
torch profiling
https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
Visualize with: https://ui.perfetto.dev/
Click "Open trace file" and upload the JSON file
The timeline shows two Python processes; clicking a CUDA kernel shows arrows that make it easy to find which op launched that kernel
The upper Python process is the host-side process (mainly the APIs / PyTorch ops invoked in user code, which map conveniently back to the training code)
The lower Python process is the device (GPU)-side process (it records the actual CUDA kernel executions and some performance-related data)
If the device timeline is sparse, training performance is poor and GPU utilization is low; the training code may need to be checked for problems
Debugging GPU memory leaks
https://pytorch.ac.cn/docs/stable/torch_cuda_memory.html
https://pytorch.org/blog/understanding-gpu-memory-1/
https://pytorch.org/blog/understanding-gpu-memory-2/
Inspecting GPU memory
# (c) Meta Platforms, Inc. and affiliates.
# https://pytorch.org/blog/understanding-gpu-memory-1/
import logging
import socket
from datetime import datetime, timedelta

import torch
from torchvision import models

logging.basicConfig(
    format="%(levelname)s:%(asctime)s %(message)s",
    level=logging.INFO,
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger: logging.Logger = logging.getLogger(__name__)
logger.setLevel(level=logging.INFO)

TIME_FORMAT_STR: str = "%b_%d_%H_%M_%S"

# Keep a max of 100,000 alloc/free events in the recorded history
# leading up to the snapshot.
MAX_NUM_OF_MEM_EVENTS_PER_SNAPSHOT: int = 100000

def start_record_memory_history() -> None:
    if not torch.cuda.is_available():
        logger.info("CUDA unavailable. Not recording memory history")
        return
    logger.info("Starting snapshot record_memory_history")
    torch.cuda.memory._record_memory_history(
        max_entries=MAX_NUM_OF_MEM_EVENTS_PER_SNAPSHOT
    )

def stop_record_memory_history() -> None:
    if not torch.cuda.is_available():
        logger.info("CUDA unavailable. Not recording memory history")
        return
    logger.info("Stopping snapshot record_memory_history")
    torch.cuda.memory._record_memory_history(enabled=None)

def export_memory_snapshot() -> None:
    if not torch.cuda.is_available():
        logger.info("CUDA unavailable. Not exporting memory snapshot")
        return
    # Prefix for file names.
    host_name = socket.gethostname()
    timestamp = datetime.now().strftime(TIME_FORMAT_STR)
    file_prefix = f"{host_name}_{timestamp}"
    try:
        logger.info(f"Saving snapshot to local file: {file_prefix}.pickle")
        torch.cuda.memory._dump_snapshot(f"{file_prefix}.pickle")
    except Exception as e:
        logger.error(f"Failed to capture memory snapshot {e}")
        return

# Simple Resnet50 example to demonstrate how to capture memory visuals.
def run_resnet50(num_iters=5, device="cuda:0"):
    model = models.resnet50().to(device=device)
    inputs = torch.randn(1, 3, 224, 224, device=device)
    labels = torch.rand_like(model(inputs))
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    # Start recording memory snapshot history
    start_record_memory_history()
    for _ in range(num_iters):
        pred = model(inputs)
        loss_fn(pred, labels).backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
    # Create the memory snapshot file
    export_memory_snapshot()
    # Stop recording memory snapshot history
    stop_record_memory_history()

if __name__ == "__main__":
    # Run the resnet50 model
    run_resnet50()
Profiling CPU and GPU memory at the same time
# (c) Meta Platforms, Inc. and affiliates.
# https://pytorch.org/blog/understanding-gpu-memory-1/
import logging
import socket
from datetime import datetime, timedelta

import torch
from torch.autograd.profiler import record_function
from torchvision import models

logging.basicConfig(
    format="%(levelname)s:%(asctime)s %(message)s",
    level=logging.INFO,
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger: logging.Logger = logging.getLogger(__name__)
logger.setLevel(level=logging.INFO)

TIME_FORMAT_STR: str = "%b_%d_%H_%M_%S"

def trace_handler(prof: torch.profiler.profile):
    # Prefix for file names.
    host_name = socket.gethostname()
    timestamp = datetime.now().strftime(TIME_FORMAT_STR)
    file_prefix = f"{host_name}_{timestamp}"
    # Construct the trace file.
    prof.export_chrome_trace(f"{file_prefix}.json.gz")
    # Construct the memory timeline file.
    prof.export_memory_timeline(f"{file_prefix}.html", device="cuda:0")

def run_resnet50(num_iters=5, device="cuda:0"):
    model = models.resnet50().to(device=device)
    inputs = torch.randn(1, 3, 224, 224, device=device)
    labels = torch.rand_like(model(inputs))
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        schedule=torch.profiler.schedule(wait=0, warmup=0, active=6, repeat=1),
        record_shapes=True,
        profile_memory=True,
        with_stack=True,
        on_trace_ready=trace_handler,
    ) as prof:
        for _ in range(num_iters):
            prof.step()
            with record_function("## forward ##"):
                pred = model(inputs)
            with record_function("## backward ##"):
                loss_fn(pred, labels).backward()
            with record_function("## optimizer ##"):
                optimizer.step()
                optimizer.zero_grad(set_to_none=True)

if __name__ == "__main__":
    # Warm up
    run_resnet50()
    # Run the resnet50 model
    run_resnet50()
Comparison of GPU models
https://zhuanlan.zhihu.com/p/441153412
Inspecting the Python stack
pip install py-spy
py-spy dump --pid 1199
Output:
Process 1199: /usr/bin/python3.10 -u torch_main.py
Python v3.10.14 (/usr/bin/python3.10)
Thread 0x7F62A2C43740 (active): "MainThread"
_wait_for_tstate_lock (threading.py:1116)
join (threading.py:1096)
main (torch_main.py:776)
<module> (torch_main.py:785)
Thread 0xAABBBCC (idle): "Thread-1"
wait (threading.py:324)
wait (threading.py:607)
run (threading.py:1376)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0xAAAAA (idle): "Thread-3 (process)"
wait (threading.py:320)
get (queue.py:171)
process (abase_writer.py:73)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0xA992ACDA (idle): "Thread-4 (process)"
wait (threading.py:320)
get (queue.py:171)
process (abase_writer.py:73)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0xAFF11AA (active): "Thread-5 (read_file)"
get_seq (ecom_seq_reader.py:200)
read_file (torch_main.py:494)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x9922BCDA (idle): "Thread-6"
wait (threading.py:324)
wait (threading.py:607)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
HuggingFace model download mirrors in China
Fixes for some errors
flash-attention2
https://github.com/Dao-AILab/flash-attention/issues/451
FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn
GPU spec comparison
L20:https://www.techpowerup.com/gpu-specs/l20.c4206
TFLOPS:
tf32:59.8
fp32:59.8
bf16:119.5
fp16:119.5
A800-40G:https://www.techpowerup.com/gpu-specs/a800-pcie-40-gb.c3964
TFLOPS:
tf32:156
fp32:19.5
bf16:312
fp16:77.97
Some problems and lessons learned
Collapse (the model keeps recalling the same k items): the attention scores are too concentrated; low-rank (generalization) features easily cause this problem
Look at the attention-score distribution: if the first layer leans toward the target item but the second layer is already nearly uniform, the model probably won't release any gains; it likely hasn't learned well
Offline AUC improves but there is no online gain:
check reload correctness and whether warmup reports errors
NN distribution efficiency (e.g., from 20 min down to x min); compression/decompression overhead
training sufficiency: the gains only showed up once the A/B test had run longer
accumulated gradients that are too small indicate insufficient training
use more historical samples
multiple epochs (see practices from Kuaishou and Alibaba, e.g., resetting embeddings)
add auxiliary losses, e.g., generative or distillation losses
NaNs: issues such as bf16 conversion; add gradient clipping, normalization, etc. (a minimal sketch follows)
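A minimal, self-contained sketch of gradient clipping before the optimizer step to guard against NaN blowups (model, data, and the clipping threshold are illustrative):
import torch
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(8, 16), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
optimizer.step()
optimizer.zero_grad(set_to_none=True)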