LLM & NLP
Notes from CSE 5525, The Ohio State University · Lecture Slides · Prof. Sachin Kumar
Natural Language Processing (NLP) teaches machines to understand and generate human language. Large Language Models (LLMs) like ChatGPT have made this field mainstream — but how do they actually work? This page walks through the key ideas from the ground up: how text becomes numbers, how models learn language patterns, and how raw models become the AI assistants we use today.
The Evolution of NLP
From Text to Numbers
Computers only understand numbers. Before any NLP task, we must convert words into numerical form. This section covers three approaches: simple counting, learned vectors, and how to break text into pieces.
Text Classification: From Naive Bayes to Neural Networks
"Is this email spam or not?" — that's text classification: assigning a category to a piece of text. The simplest approach is Naive Bayes: count how often each word appears in each category, then use Bayes' theorem to compute probabilities. Despite its "naive" assumption that words are independent (obviously "New" and "York" aren't!), it works surprisingly well as a quick baseline.
Logistic Regression is the next step up: it learns a weight for each word (positive = more likely this category, negative = less likely). The weights are interpretable — you can see exactly which words push toward "spam."
Neural classifiers go further: instead of hand-designed features, they automatically learn which patterns matter. The modern approach is to use a pre-trained model like BERT, add a classification layer on top, and fine-tune with a small labeled dataset.
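To make the baseline concrete, here is a minimal sketch of the counting approach with scikit-learn; the four training examples are invented for illustration:

```python
# A minimal spam classifier, assuming scikit-learn is installed.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize now", "limited offer click here",   # spam
    "meeting moved to 3pm", "see you at lunch tomorrow",  # ham
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts + Naive Bayes: count word frequencies per class,
# then apply Bayes' theorem at prediction time.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize click now"]))  # ['spam']
```

Swapping `MultinomialNB()` for `LogisticRegression()` gives the next model up with no other changes, and the learned coefficients are the interpretable per-word weights described above.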
Word Vectors & Embeddings
The simplest way to represent a word is one-hot encoding: if your vocabulary has 50,000 words, each word becomes a vector with 49,999 zeros and a single 1. The problem: "cat" is exactly as far from "dog" as it is from "refrigerator"; the representation carries no sense of meaning.
Word2Vec (Google, 2013) solved this with a brilliant insight: train a simple neural network to predict surrounding words from a center word (or vice versa). The byproduct? Each word gets a dense vector (typically 300 numbers) where similar words are close together.
The famous demonstration: king - man + woman lands near queen in the vector space. This "word arithmetic" works because the model has learned that gender is a consistent direction in the vector space. GloVe achieves similar results by analyzing how often words co-occur across an entire corpus.
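You can try this directly, assuming gensim is installed and an internet connection for the pre-trained GloVe download:

```python
# Word analogies with pre-trained vectors via gensim's dataset downloader.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")  # 50-dim GloVe vectors, ~66 MB

# king - man + woman ≈ queen: add/subtract vectors, find nearest neighbor.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', ...)]

print(wv.similarity("cat", "dog"))           # high
print(wv.similarity("cat", "refrigerator"))  # much lower
```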
Tokenization: Breaking Text into Pieces
Before a model can process text, it needs to break it into pieces called tokens. Split by words? Problem: new words like "ChatGPT" aren't in the dictionary. Split by characters? Every word becomes a very long sequence, making learning hard.
The modern solution is subword tokenization (e.g., BPE — Byte Pair Encoding): start with individual characters, then repeatedly merge the most frequent pairs. Common words stay whole ("the", "and"), while rare words get split into meaningful pieces ("unbelievable" → "un" + "believ" + "able"). This means "un-" and "-able" can be shared across many words.
GPT uses BPE, BERT uses WordPiece (similar idea), and SentencePiece works for any language — including Chinese, Japanese, and Korean, which don't use spaces between words.
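A toy version of the BPE training loop, just to show the merge mechanics; production tokenizers are far more optimized and typically operate on bytes:

```python
# Toy Byte Pair Encoding: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def train_bpe(words, num_merges):
    corpus = Counter(tuple(w) for w in words)  # a word is a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere in the corpus.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(train_bpe(["low", "lower", "lowest", "newest", "widest"], 5))
# merges like ('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't'), ...
```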
How Language Models Work
A language model predicts what word comes next. That's it. Your phone's autocomplete is a simple language model. ChatGPT is a very advanced one. The difference is in how well they predict — and that depends on the architecture.
N-gram: The Simplest Language Model
The most basic approach: predict the next word based on the previous N−1 words. A bigram model (N=2) only looks at the last word; a trigram (N=3) looks at the last two. Prediction is just counting: "How often does word X appear after words A B?"
We measure language models using perplexity — think of it as "how many words the model is choosing between on average." Lower = better. A random guess over 50,000 words gives perplexity of 50,000; a decent N-gram model might get 200.
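A bigram model really is just counting. A minimal sketch with add-one smoothing, on an invented corpus:

```python
# Bigram LM: P(w_i | w_{i-1}) from raw counts, with add-one (Laplace)
# smoothing so unseen pairs don't get zero probability.
import math
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size

def prob(prev, word):
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def perplexity(words):
    log_prob = sum(math.log(prob(p, w)) for p, w in zip(words, words[1:]))
    return math.exp(-log_prob / (len(words) - 1))

print(prob("the", "cat"))                            # how often "cat" follows "the"
print(perplexity("the cat sat on the mat".split()))  # lower = better fit
```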
The Transformer: Attention Is All You Need
The 2017 paper that changed everything. The key idea is self-attention: instead of reading text one word at a time (like older models), every word can directly look at every other word in the sentence at once.
How? Each word creates three things: a Query ("what am I looking for?"), a Key ("what do I contain?"), and a Value ("what information do I carry?"). Attention scores are computed by matching Queries with Keys, then using those scores to weight the Values:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

where d_k is the Key dimension; dividing by sqrt(d_k) keeps the dot products in a range where the softmax behaves well.
Multi-Head Attention runs this process multiple times in parallel (usually 8-16 "heads"), so the model can pay attention to different types of relationships simultaneously — one head might focus on grammar, another on meaning.
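Here is a single head of the computation above in a few lines of NumPy, with random weights standing in for the learned projections:

```python
# Single-head scaled dot-product attention, matching the formula above.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # each token makes Q, K, V
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # match Queries against Keys
    return softmax(scores) @ V               # weight the Values

rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 8                          # 4 tokens, 8-dim embeddings
X = rng.normal(size=(n, d))
out = self_attention(X, *(rng.normal(size=(d, d_k)) for _ in range(3)))
print(out.shape)  # (4, 8): one context-mixed vector per token
```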
BERT vs. GPT: Two Ways to Use a Transformer
The Transformer architecture can be used in different ways, and this choice has huge implications:
BERT (encoder-only)
- Sees: Both left and right context (bidirectional)
- Pre-training: Fill in the blank (mask 15% of words, predict what's missing)
- Good at: Understanding text (classification, question answering, information extraction)
- Usage: Pre-train once, then fine-tune with labeled data for your task
GPT (decoder-only)
- Sees: Only what came before (left-to-right)
- Pre-training: Predict the next word (read left to right, guess what follows)
- Good at: Generating text (writing, dialogue, code, creative tasks)
- Usage: Pre-train once, then use with prompts (no fine-tuning needed)
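The two objectives are easy to see side by side with HuggingFace Transformers; the model checkpoints download on first use:

```python
from transformers import pipeline

# BERT: bidirectional, trained to fill in masked words.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The capital of France is [MASK].")[0]["token_str"])  # "paris"

# GPT-2: left-to-right, trained to predict the next word.
gen = pipeline("text-generation", model="gpt2")
print(gen("The capital of France is", max_new_tokens=5)[0]["generated_text"])
```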
From Raw Model to AI Assistant
A pre-trained model only knows how to complete text — it doesn't know how to be helpful. Turning it into ChatGPT requires three more steps: scale it up, teach it to follow instructions, and align it with human preferences.
Scaling Laws: Bigger = Better (Predictably)
One of the most surprising discoveries in AI: model performance improves predictably as you increase three things — model size, training data, and compute. Double the parameters? Performance improves by a predictable amount. This is called a scaling law.
Even more surprising: at certain scale thresholds, models suddenly gain abilities they completely lacked before — multi-step reasoning, complex math, understanding jokes. These emergent abilities can't be predicted from smaller models. It's as if the model "clicks" at a certain size.
The Chinchilla law (DeepMind, 2022) added an important insight: many early models were too big for their training data. The optimal strategy is to scale model size and data equally.
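As a back-of-the-envelope sketch: a common reading of the Chinchilla result is roughly 20 training tokens per parameter, and training compute is usually estimated as C ≈ 6ND FLOPs. Both are approximations, not exact laws:

```python
# Rough compute-optimal sizing in the Chinchilla regime.
def chinchilla_optimal(n_params):
    tokens = 20 * n_params         # ~20 tokens per parameter (rule of thumb)
    flops = 6 * n_params * tokens  # standard training-FLOPs estimate
    return tokens, flops

tokens, flops = chinchilla_optimal(70e9)  # a 70B-parameter model
print(f"~{tokens / 1e12:.1f}T tokens, ~{flops:.2e} FLOPs")
# ~1.4T tokens, which is what the 70B Chinchilla model was trained on
```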
Prompting: Using LLMs Without Training
One of the most magical properties of large models: you can get them to do new tasks just by describing the task in words, with no additional training. This is called prompting, and it comes in three flavors:
Zero-shot: Just describe what you want. "Translate this to French: Hello" → "Bonjour"
Few-shot: Give a few examples, and the model learns the pattern. "cat→chat, dog→chien, bird→?" → "oiseau"
Chain-of-Thought: Ask the model to "think step by step." This dramatically improves reasoning, especially for math and logic.
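The three styles are just differently shaped strings. A sketch, where `ask` is a hypothetical stand-in for whatever model API you call:

```python
zero_shot = "Translate this to French: Hello"

few_shot = """English to French:
cat -> chat
dog -> chien
bird ->"""

chain_of_thought = (
    "Q: I had 23 apples, bought 2 dozen more, and gave away 15. "
    "How many do I have?\nA: Let's think step by step."
)

def ask(prompt: str) -> str:
    # Hypothetical stand-in. With OpenAI's Python client it would look like:
    #   from openai import OpenAI
    #   resp = OpenAI().chat.completions.create(
    #       model="gpt-4o-mini",  # any chat model
    #       messages=[{"role": "user", "content": prompt}])
    #   return resp.choices[0].message.content
    return "<model output here>"

for prompt in (zero_shot, few_shot, chain_of_thought):
    print(ask(prompt))
```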
Instruction Tuning & RLHF: Teaching Models to Be Helpful
A raw GPT model is like a very well-read person who only knows how to continue a conversation, but doesn't know how to help you. Two more training stages fix this:
Step 1 — Supervised Fine-Tuning (SFT): Show the model thousands of examples of "good assistant behavior" — questions paired with high-quality answers. This teaches it to respond helpfully instead of just completing text. InstructGPT used only ~13,000 such examples yet dramatically improved response quality.
Step 2 — RLHF (Reinforcement Learning from Human Feedback): Human preferences are subtle — it's hard to write down exactly what makes a "good" answer. So instead, humans rank multiple model responses ("Response A is better than Response B"), and the model learns from these preferences.
DPO (Direct Preference Optimization) is a simpler alternative to RLHF: skip the reward model and optimize directly on the preference pairs. Under the same preference model it optimizes an equivalent objective, yet it is much easier to implement.
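The DPO loss itself is compact. A PyTorch sketch for preference pairs, where the inputs are summed log-probabilities of the chosen (y_w) and rejected (y_l) responses under the policy being trained and under a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # How much more the policy prefers each response than the reference does.
    chosen_margin = policy_logp_w - ref_logp_w
    rejected_margin = policy_logp_l - ref_logp_l
    # Push the chosen margin above the rejected one, scaled by beta.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy tensors standing in for real summed log-probs:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # a scalar to backprop through
```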
Evaluating LLMs
Benchmarks (MMLU for knowledge, HumanEval for coding, GSM8K for math) provide standardized tests, but models can overfit to them. Human evaluation (like Chatbot Arena, where people vote on which response is better) is the gold standard but expensive. LLM-as-Judge uses a strong model like GPT-4 to evaluate others: cheap but potentially biased. Beware of data contamination: if test questions leaked into training data, scores are meaningless.
Making LLMs Practical
LLMs are powerful but face three real-world problems: they're expensive to customize, their knowledge has a cutoff date, and they can only generate text. Here's how each is being solved.
LoRA: Fine-Tuning on a Budget
Fine-tuning a 70-billion-parameter model normally requires hundreds of GB of GPU memory — out of reach for most people. LoRA (Low-Rank Adaptation) offers an elegant shortcut: freeze the original model and only train tiny "adapter" matrices added beside each layer.
Result: roughly 10× less memory, dramatically lower training cost, and performance nearly identical to full fine-tuning. QLoRA goes even further, combining 4-bit quantization with LoRA to fine-tune a 65B model on a single 48 GB GPU.
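The core trick fits in one module. A from-scratch PyTorch sketch; real projects would use a library like peft rather than this hand-rolled version:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the original weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        # W x + (B A) x * scale: the adapter adds a rank-r update.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8192 trainable adapter weights vs ~262k frozen ones
```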
RAG: Giving LLMs Access to Fresh Knowledge
LLMs have a knowledge cutoff date and sometimes "hallucinate" — confidently stating things that aren't true. RAG (Retrieval-Augmented Generation) fixes this by having the model look up information before answering.
How it works: Your question gets converted into a vector → The system searches a database of documents for the most relevant passages → Those passages are fed to the LLM as context → The LLM generates an answer based on the retrieved information.
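A minimal sketch of the retrieval half using sentence-transformers; the document snippets and model choice are illustrative, and the final LLM call is left as a prompt string:

```python
from sentence_transformers import SentenceTransformer, util

docs = [
    "The 2024 office closure dates are Dec 24 through Jan 1.",
    "Expense reports are due by the 5th of each month.",
    "The cafeteria is closed on weekends.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, convert_to_tensor=True)

def retrieve(question, k=2):
    q_vec = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_vec, doc_vecs, top_k=k)[0]
    return [docs[h["corpus_id"]] for h in hits]

question = "When is the office closed for the holidays?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # feed to any LLM; the answer is grounded in retrieved text
```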
Multimodal LLMs: Beyond Text
Modern LLMs aren't limited to text. By connecting a vision encoder (like CLIP) to the language model, they can understand images too. Show it a photo of a restaurant menu and ask "What vegetarian options are there?" — it reads the image, understands the menu, and answers.
Representative models: GPT-4V, Gemini, LLaVA. They can describe images, answer visual questions, analyze charts, and even understand memes.
Language Agents: LLMs That Take Action
Regular LLMs can only generate text. Language agents give LLMs the ability to use tools — search the web, run code, call APIs, operate software. This transforms them from "can talk" to "can do."
The core framework is ReAct (Reasoning + Acting): the model alternates between thinking ("I need to find the current price") and acting ("Let me search the web for..."), looping until the task is done.
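In code, the loop is short. A skeleton where `llm` and the entries in `tools` are hypothetical stand-ins; real agent frameworks add output parsing, retries, and guardrails around exactly this cycle:

```python
def react_agent(task, llm, tools, max_steps=5):
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        # 1. Reason: the model reads the transcript and proposes a thought
        #    plus an action (assumed here to come back as a dict).
        step = llm(transcript)
        transcript += f"Thought: {step['thought']}\n"
        if step["action"] == "finish":
            return step["answer"]
        # 2. Act: run the chosen tool, feed the observation back in.
        observation = tools[step["action"]](step["input"])
        transcript += (f"Action: {step['action']}({step['input']})\n"
                       f"Observation: {observation}\n")
    return "Stopped: step budget exhausted."

# Example wiring (hypothetical tools):
# answer = react_agent("What is the GDP of Ohio?", my_llm,
#                      {"search": web_search, "python": run_code})
```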
The NLP Toolkit
Beyond LLMs, traditional NLP tools handle essential tasks like identifying names, analyzing grammar, and extracting structured information from text.
NER, POS Tagging & Dependency Parsing
The classic NLP pipeline: Tokenization (split into words) → POS Tagging (noun? verb? adjective?) → NER (is "Apple" a fruit or a company?) → Dependency Parsing (which word modifies which?).
Modern tools like spaCy handle this entire pipeline in one line of code. While LLMs can also do these tasks, dedicated NLP pipelines are faster, cheaper, and more reliable for structured extraction.
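The one-line pipeline in practice (requires `pip install spacy` and `python -m spacy download en_core_web_sm` first):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new store in Columbus next May.")

# POS tags and dependency arcs, one token per line.
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Named entities: expect something like Apple/ORG, Columbus/GPE, next May/DATE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```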
Ethics & Broader Implications
LLMs bring enormous capabilities but also serious questions that researchers and practitioners must grapple with.
Bias
- Models learn from internet text, which contains societal biases
- Word2Vec: "doctor" closer to "male", "nurse" to "female"
- Mitigation: debiased data, RLHF fairness constraints, continuous auditing
Interpretability
- Billions of parameters = "black box"
- Methods: attention visualization, probing, feature attribution
- Caveat: high attention ≠ causal explanation
Multilinguality
- 7000+ languages, but most LLMs are English-centric
- Surprise: multilingual models enable zero-shot cross-lingual transfer
- Challenge: low-resource languages still perform poorly — a digital divide
References & Software
Key References
- [Beginner] Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing (3rd ed. draft). https://web.stanford.edu/~jurafsky/slp3/
- [Transformer] Vaswani, A. et al. (2017). Attention is all you need. NeurIPS.
- [BERT] Devlin, J. et al. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT.
- [Scaling] Kaplan, J. et al. (2020). Scaling laws for neural language models. arXiv:2001.08361.
- [RLHF] Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.
- [LoRA] Hu, E. J. et al. (2022). LoRA: Low-rank adaptation of large language models. ICLR.
- [RAG] Lewis, P. et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS.
Software
HuggingFace Transformers
The hub for pre-trained models. BERT, GPT-2, LLaMA, and thousands more. Fine-tuning and inference in Python.
spaCy
Production NLP pipeline. Tokenization, POS, NER, dependency parsing. Fast and easy to use.
OpenAI / Anthropic APIs
Access GPT-4, Claude, etc. via API. Zero-shot classification, text generation, embeddings.