LLM & NLP
Notes from CSE 5525, The Ohio State University · Lecture Slides · Prof. Sachin Kumar
Natural Language Processing (NLP) teaches machines to understand and generate human language. Large Language Models (LLMs) like ChatGPT have made this field mainstream — but how do they actually work? This page walks through the key ideas from the ground up: how text becomes numbers, how models learn language patterns, and how raw models become the AI assistants we use today.
The Evolution of NLP
From Text to Numbers
Computers only understand numbers. Before any NLP task, we must convert words into numerical form. This section covers three approaches: simple counting, learned vectors, and how to break text into pieces.
Text Classification: From Naive Bayes to Neural Networks
"Is this email spam or not?" — that's text classification: assigning a category to a piece of text. The simplest approach is Naive Bayes: count how often each word appears in each category, then use Bayes' theorem to compute probabilities. Despite its "naive" assumption that words are independent (obviously "New" and "York" aren't!), it works surprisingly well as a quick baseline.
Logistic Regression is the next step up: it learns a weight for each word (positive = more likely this category, negative = less likely). The weights are interpretable — you can see exactly which words push toward "spam."
Neural classifiers go further: instead of hand-designed features, they automatically learn which patterns matter. The modern approach is to use a pre-trained model like BERT, add a classification layer on top, and fine-tune with a small labeled dataset.
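To make the baseline concrete, here is a minimal sketch of the counting approach with scikit-learn; the four training examples are invented for illustration:

```python
# A minimal spam classifier, assuming scikit-learn is installed.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize now", "limited offer click here",   # spam
    "meeting moved to 3pm", "see you at lunch tomorrow",  # ham
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts + Naive Bayes: count word frequencies per class,
# then apply Bayes' theorem at prediction time.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize click now"]))  # ['spam']
```

Swapping `MultinomialNB()` for `LogisticRegression()` gives the next model up with no other changes, and the learned coefficients are the interpretable per-word weights described above.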
Word Vectors & Embeddings
The simplest way to represent a word is one-hot encoding: if your vocabulary has 50,000 words, each word becomes a vector with 49,999 zeros and a single 1. The problem: "cat" is exactly as far from "dog" as it is from "refrigerator"; the representation carries no sense of meaning.
Word2Vec (Google, 2013) solved this with a brilliant insight: train a simple neural network to predict surrounding words from a center word (or vice versa). The byproduct? Each word gets a dense vector (typically 300 numbers) where similar words are close together.
The famous demonstration: king - man + woman lands near queen in the vector space. This "word arithmetic" works because the model has learned that gender is a consistent direction in the vector space. GloVe achieves similar results by analyzing how often words co-occur across an entire corpus.
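You can try this directly, assuming gensim is installed and an internet connection for the pre-trained GloVe download:

```python
# Word analogies with pre-trained vectors via gensim's dataset downloader.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")  # 50-dim GloVe vectors, ~66 MB

# king - man + woman ≈ queen: add/subtract vectors, find nearest neighbor.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', ...)]

print(wv.similarity("cat", "dog"))           # high
print(wv.similarity("cat", "refrigerator"))  # much lower
```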
Tokenization: Breaking Text into Pieces
Before a model can process text, it needs to break it into pieces called tokens. Split by words? Problem: new words like "ChatGPT" aren't in the dictionary. Split by characters? Every word becomes a very long sequence, making learning hard.
The modern solution is subword tokenization (e.g., BPE — Byte Pair Encoding): start with individual characters, then repeatedly merge the most frequent pairs. Common words stay whole ("the", "and"), while rare words get split into meaningful pieces ("unbelievable" → "un" + "believ" + "able"). This means "un-" and "-able" can be shared across many words.
GPT uses BPE, BERT uses WordPiece (similar idea), and SentencePiece works for any language — including Chinese, Japanese, and Korean, which don't use spaces between words.
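A toy version of the BPE training loop, just to show the merge mechanics; production tokenizers are far more optimized and typically operate on bytes:

```python
# Toy Byte Pair Encoding: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def train_bpe(words, num_merges):
    corpus = Counter(tuple(w) for w in words)  # a word is a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere in the corpus.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(train_bpe(["low", "lower", "lowest", "newest", "widest"], 5))
# merges like ('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't'), ...
```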
How Language Models Work
A language model predicts what word comes next. That's it. Your phone's autocomplete is a simple language model. ChatGPT is a very advanced one. The difference is in how well they predict — and that depends on the architecture.
N-gram: The Simplest Language Model
The most basic approach: predict the next word based on the previous N−1 words. A bigram model (N=2) only looks at the last word; a trigram (N=3) looks at the last two. Prediction is just counting: "How often does word X appear after words A B?"
We measure language models using perplexity — think of it as "how many words the model is choosing between on average." Lower = better. A random guess over 50,000 words gives perplexity of 50,000; a decent N-gram model might get 200.
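A bigram model really is just counting. A minimal sketch with add-one smoothing, on an invented corpus:

```python
# Bigram LM: P(w_i | w_{i-1}) from raw counts, with add-one (Laplace)
# smoothing so unseen pairs don't get zero probability.
import math
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size

def prob(prev, word):
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def perplexity(words):
    log_prob = sum(math.log(prob(p, w)) for p, w in zip(words, words[1:]))
    return math.exp(-log_prob / (len(words) - 1))

print(prob("the", "cat"))                            # how often "cat" follows "the"
print(perplexity("the cat sat on the mat".split()))  # lower = better fit
```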
The Transformer: Attention Is All You Need
The 2017 paper that changed everything. The key idea is self-attention: instead of reading text one word at a time (like older models), every word can directly look at every other word in the sentence at once.
How? Each word creates three things: a Query ("what am I looking for?"), a Key ("what do I contain?"), and a Value ("what information do I carry?"). Attention scores are computed by matching Queries with Keys, then using those scores to weight the Values:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

where d_k is the Key dimension; dividing by sqrt(d_k) keeps the dot products in a range where the softmax behaves well.
Multi-Head Attention runs this process multiple times in parallel (usually 8-16 "heads"), so the model can pay attention to different types of relationships simultaneously — one head might focus on grammar, another on meaning.
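Here is a single head of the computation above in a few lines of NumPy, with random weights standing in for the learned projections:

```python
# Single-head scaled dot-product attention, matching the formula above.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # each token makes Q, K, V
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # match Queries against Keys
    return softmax(scores) @ V               # weight the Values

rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 8                          # 4 tokens, 8-dim embeddings
X = rng.normal(size=(n, d))
out = self_attention(X, *(rng.normal(size=(d, d_k)) for _ in range(3)))
print(out.shape)  # (4, 8): one context-mixed vector per token
```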
BERT vs. GPT: Two Ways to Use a Transformer
The Transformer architecture can be used in different ways, and this choice has huge implications:
BERT (encoder-only)
- Sees: Both left and right context (bidirectional)
- Pre-training: Fill in the blank (mask 15% of words, predict what's missing)
- Good at: Understanding text (classification, question answering, information extraction)
- Usage: Pre-train once, then fine-tune with labeled data for your task
GPT (decoder-only)
- Sees: Only what came before (left-to-right)
- Pre-training: Predict the next word (read left to right, guess what follows)
- Good at: Generating text (writing, dialogue, code, creative tasks)
- Usage: Pre-train once, then use with prompts (no fine-tuning needed)
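The two objectives are easy to see side by side with HuggingFace Transformers; the model checkpoints download on first use:

```python
from transformers import pipeline

# BERT: bidirectional, trained to fill in masked words.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The capital of France is [MASK].")[0]["token_str"])  # "paris"

# GPT-2: left-to-right, trained to predict the next word.
gen = pipeline("text-generation", model="gpt2")
print(gen("The capital of France is", max_new_tokens=5)[0]["generated_text"])
```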
From Raw Model to AI Assistant
A pre-trained model only knows how to complete text — it doesn't know how to be helpful. Turning it into ChatGPT requires three more steps: scale it up, teach it to follow instructions, and align it with human preferences.
Scaling Laws: Bigger = Better (Predictably)
One of the most surprising discoveries in AI: model performance improves predictably as you increase three things — model size, training data, and compute. Double the parameters? Performance improves by a predictable amount. This is called a scaling law.
Even more surprising: at certain scale thresholds, models suddenly gain abilities they completely lacked before — multi-step reasoning, complex math, understanding jokes. These emergent abilities can't be predicted from smaller models. It's as if the model "clicks" at a certain size.
The Chinchilla law (DeepMind, 2022) added an important insight: many early models were too big for their training data. The optimal strategy is to scale model size and data equally.
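As a back-of-the-envelope sketch: a common reading of the Chinchilla result is roughly 20 training tokens per parameter, and training compute is usually estimated as C ≈ 6ND FLOPs. Both are approximations, not exact laws:

```python
# Rough compute-optimal sizing in the Chinchilla regime.
def chinchilla_optimal(n_params):
    tokens = 20 * n_params         # ~20 tokens per parameter (rule of thumb)
    flops = 6 * n_params * tokens  # standard training-FLOPs estimate
    return tokens, flops

tokens, flops = chinchilla_optimal(70e9)  # a 70B-parameter model
print(f"~{tokens / 1e12:.1f}T tokens, ~{flops:.2e} FLOPs")
# ~1.4T tokens, which is what the 70B Chinchilla model was trained on
```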
Prompting: Using LLMs Without Training
One of the most magical properties of large models: you can get them to do new tasks just by describing the task in words, with no additional training. This is called prompting, and it comes in three flavors:
Zero-shot: Just describe what you want. "Translate this to French: Hello" → "Bonjour"
Few-shot: Give a few examples, and the model learns the pattern. "cat→chat, dog→chien, bird→?" → "oiseau"
Chain-of-Thought: Ask the model to "think step by step." This dramatically improves reasoning, especially for math and logic.
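The three styles are just differently shaped strings. A sketch, where `ask` is a hypothetical stand-in for whatever model API you call:

```python
zero_shot = "Translate this to French: Hello"

few_shot = """English to French:
cat -> chat
dog -> chien
bird ->"""

chain_of_thought = (
    "Q: I had 23 apples, bought 2 dozen more, and gave away 15. "
    "How many do I have?\nA: Let's think step by step."
)

def ask(prompt: str) -> str:
    # Hypothetical stand-in. With OpenAI's Python client it would look like:
    #   from openai import OpenAI
    #   resp = OpenAI().chat.completions.create(
    #       model="gpt-4o-mini",  # any chat model
    #       messages=[{"role": "user", "content": prompt}])
    #   return resp.choices[0].message.content
    return "<model output here>"

for prompt in (zero_shot, few_shot, chain_of_thought):
    print(ask(prompt))
```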
Instruction Tuning & RLHF: Teaching Models to Be Helpful
A raw GPT model is like a very well-read person who only knows how to continue a conversation, but doesn't know how to help you. Two more training stages fix this:
Step 1 — Supervised Fine-Tuning (SFT): Show the model thousands of examples of "good assistant behavior" — questions paired with high-quality answers. This teaches it to respond helpfully instead of just completing text. InstructGPT used only ~13,000 such examples yet dramatically improved response quality.
Step 2 — RLHF (Reinforcement Learning from Human Feedback): Human preferences are subtle — it's hard to write down exactly what makes a "good" answer. So instead, humans rank multiple model responses ("Response A is better than Response B"), and the model learns from these preferences.
DPO (Direct Preference Optimization) is a simpler alternative to RLHF: skip the reward model and optimize directly on the preference pairs. Under the same preference model it optimizes an equivalent objective, yet it is much easier to implement.
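The DPO loss itself is compact. A PyTorch sketch for preference pairs, where the inputs are summed log-probabilities of the chosen (y_w) and rejected (y_l) responses under the policy being trained and under a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # How much more the policy prefers each response than the reference does.
    chosen_margin = policy_logp_w - ref_logp_w
    rejected_margin = policy_logp_l - ref_logp_l
    # Push the chosen margin above the rejected one, scaled by beta.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy tensors standing in for real summed log-probs:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # a scalar to backprop through
```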
Evaluating LLMs
Benchmarks (MMLU for knowledge, HumanEval for coding, GSM8K for math) provide standardized tests, but models can overfit to them. Human evaluation (like Chatbot Arena, where people vote on which response is better) is the gold standard but expensive. LLM-as-Judge uses a strong model like GPT-4 to evaluate others: cheap but potentially biased. Beware of data contamination: if test questions leaked into training data, scores are meaningless.
Making LLMs Practical
LLMs are powerful but face three real-world problems: they're expensive to customize, their knowledge has a cutoff date, and they can only generate text. Here's how each is being solved.
LoRA: Fine-Tuning on a Budget
Fine-tuning a 70-billion-parameter model normally requires hundreds of GB of GPU memory — out of reach for most people. LoRA (Low-Rank Adaptation) offers an elegant shortcut: freeze the original model and only train tiny "adapter" matrices added beside each layer.
Result: roughly 10× less memory, dramatically lower training cost, and performance nearly identical to full fine-tuning. QLoRA goes even further, combining 4-bit quantization with LoRA to fine-tune a 65B model on a single 48 GB GPU.
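The core trick fits in one module. A from-scratch PyTorch sketch; real projects would use a library like peft rather than this hand-rolled version:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the original weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        # W x + (B A) x * scale: the adapter adds a rank-r update.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8192 trainable adapter weights vs ~262k frozen ones
```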
RAG: Giving LLMs Access to Fresh Knowledge
LLMs have a knowledge cutoff date and sometimes "hallucinate" — confidently stating things that aren't true. RAG (Retrieval-Augmented Generation) fixes this by having the model look up information before answering.
How it works: Your question gets converted into a vector → The system searches a database of documents for the most relevant passages → Those passages are fed to the LLM as context → The LLM generates an answer based on the retrieved information.
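A minimal sketch of the retrieval half using sentence-transformers; the document snippets and model choice are illustrative, and the final LLM call is left as a prompt string:

```python
from sentence_transformers import SentenceTransformer, util

docs = [
    "The 2024 office closure dates are Dec 24 through Jan 1.",
    "Expense reports are due by the 5th of each month.",
    "The cafeteria is closed on weekends.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, convert_to_tensor=True)

def retrieve(question, k=2):
    q_vec = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_vec, doc_vecs, top_k=k)[0]
    return [docs[h["corpus_id"]] for h in hits]

question = "When is the office closed for the holidays?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # feed to any LLM; the answer is grounded in retrieved text
```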
Multimodal LLMs: Beyond Text
Modern LLMs aren't limited to text. By connecting a vision encoder (like CLIP) to the language model, they can understand images too. Show it a photo of a restaurant menu and ask "What vegetarian options are there?" — it reads the image, understands the menu, and answers.
Representative models: GPT-4V, Gemini, LLaVA. They can describe images, answer visual questions, analyze charts, and even understand memes.
Language Agents: LLMs That Take Action
Regular LLMs can only generate text. Language agents give LLMs the ability to use tools — search the web, run code, call APIs, operate software. This transforms them from "can talk" to "can do."
The core framework is ReAct (Reasoning + Acting): the model alternates between thinking ("I need to find the current price") and acting ("Let me search the web for..."), looping until the task is done.
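In code, the loop is short. A skeleton where `llm` and the entries in `tools` are hypothetical stand-ins; real agent frameworks add output parsing, retries, and guardrails around exactly this cycle:

```python
def react_agent(task, llm, tools, max_steps=5):
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        # 1. Reason: the model reads the transcript and proposes a thought
        #    plus an action (assumed here to come back as a dict).
        step = llm(transcript)
        transcript += f"Thought: {step['thought']}\n"
        if step["action"] == "finish":
            return step["answer"]
        # 2. Act: run the chosen tool, feed the observation back in.
        observation = tools[step["action"]](step["input"])
        transcript += (f"Action: {step['action']}({step['input']})\n"
                       f"Observation: {observation}\n")
    return "Stopped: step budget exhausted."

# Example wiring (hypothetical tools):
# answer = react_agent("What is the GDP of Ohio?", my_llm,
#                      {"search": web_search, "python": run_code})
```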
The NLP Toolkit
Beyond LLMs, traditional NLP tools handle essential tasks like identifying names, analyzing grammar, and extracting structured information from text.
NER, POS Tagging & Dependency Parsing
The classic NLP pipeline: Tokenization (split into words) → POS Tagging (noun? verb? adjective?) → NER (is "Apple" a fruit or a company?) → Dependency Parsing (which word modifies which?).
Modern tools like spaCy handle this entire pipeline in one line of code. While LLMs can also do these tasks, dedicated NLP pipelines are faster, cheaper, and more reliable for structured extraction.
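The one-line pipeline in practice (requires `pip install spacy` and `python -m spacy download en_core_web_sm` first):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new store in Columbus next May.")

# POS tags and dependency arcs, one token per line.
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Named entities: expect something like Apple/ORG, Columbus/GPE, next May/DATE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```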
Ethics & Broader Implications
LLMs bring enormous capabilities but also serious questions that researchers and practitioners must grapple with.
Bias
- Models learn from internet text, which contains societal biases
- Word2Vec: "doctor" closer to "male", "nurse" to "female"
- Mitigation: debiased data, RLHF fairness constraints, continuous auditing
Interpretability
- Billions of parameters = "black box"
- Methods: attention visualization, probing, feature attribution
- Caveat: high attention ≠ causal explanation
Multilinguality
- 7000+ languages, but most LLMs are English-centric
- Surprise: multilingual models enable zero-shot cross-lingual transfer
- Challenge: low-resource languages still perform poorly — a digital divide
References & Software
Key References
- [Beginner] Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing (3rd ed. draft). https://web.stanford.edu/~jurafsky/slp3/
- [Transformer] Vaswani, A. et al. (2017). Attention is all you need. NeurIPS.
- [BERT] Devlin, J. et al. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT.
- [Scaling] Kaplan, J. et al. (2020). Scaling laws for neural language models. arXiv:2001.08361.
- [RLHF] Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.
- [LoRA] Hu, E. J. et al. (2022). LoRA: Low-rank adaptation of large language models. ICLR.
- [RAG] Lewis, P. et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS.
Software
HuggingFace Transformers
The hub for pre-trained models. BERT, GPT-2, LLaMA, and thousands more. Fine-tuning and inference in Python.
spaCy
Production NLP pipeline. Tokenization, POS, NER, dependency parsing. Fast and easy to use.
OpenAI / Anthropic APIs
Access GPT-4, Claude, etc. via API. Zero-shot classification, text generation, embeddings.