Text as Data

Computational Social Science · Foundations 03

Notes from ICPSR 2024 "Data Science & Text Analysis", University of Michigan · Prof. Yaoyao Dai


The Text Analysis Pipeline


Imagine you have 10,000 newspaper articles and want to find out what topics they discuss. You can't read them all — but a computer can, if we first turn words into numbers. That's what text analysis does: it converts messy human language into structured data a computer can crunch. The process follows a clear pipeline: collect text, clean it up, turn it into numbers, analyze it, and validate the results.


01
Collect
APIs, scraping, corpora
02
Preprocess
Tokenize, clean, normalize
03
Represent
BoW, TF-IDF, embeddings
04
Analyze
Classify, cluster, scale
05
Validate
Evaluate, interpret

Text Preprocessing


Think of raw text like vegetables fresh from the market — before cooking, you need to wash, peel, and chop them. Text preprocessing is similar: we clean and standardize raw text so the computer can work with it effectively. The main steps are: tokenization (chopping a sentence into individual words), lowercasing ("The" and "the" become the same word), removing stop words (common words like "the," "is," "and" that carry little meaning), and stemming/lemmatization (reducing words to their root form, so "running," "ran," and "runs" all become "run"). These choices matter a lot — different preprocessing can lead to different conclusions (Denny & Spirling, 2018).


Stemming

Method: Rule-based suffix stripping (e.g., Porter Stemmer)

Example: "running" → "run", "studies" → "studi"

Pro: Fast, simple

Con: Can produce non-words ("studi")


Lemmatization

Method: Dictionary-based reduction to canonical form

Example: "running" → "run", "better" → "good"

Pro: Produces real words, context-aware

Con: Slower, requires POS tagging

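A minimal sketch of the full preprocessing pipeline in plain Python. The stop-word list and the suffix-stripping rules below are toy illustrations invented for this example; a real project would use NLTK's PorterStemmer or spaCy's lemmatizer instead of `crude_stem`.

```python
import re

# Toy stop-word list (illustrative; NLTK and spaCy ship much larger ones)
STOPWORDS = {"the", "is", "and", "a", "of"}

def crude_stem(word):
    """Toy suffix stripper, nothing like a full Porter stemmer."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            if len(word) > 2 and word[-1] == word[-2]:  # "runn" -> "run"
                word = word[:-1]
            break
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z']+", text.lower())        # lowercase + tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]   # drop stop words
    return [crude_stem(t) for t in tokens]               # normalize to stems

print(preprocess("The studies show running is fun"))
# ['studi', 'show', 'run', 'fun']
```

Note how "studies" comes out as the non-word "studi", exactly the stemming artifact described above; a lemmatizer would return "study" instead.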


Text Representation


Computers only understand numbers, not words. So before we can analyze text, we need to convert it into numerical form. There are two broad strategies: counting words (simple but effective) and learning "word meanings" as vectors (more powerful but complex).


Representation

Bag of Words & TF-IDF

Bag of Words (BoW) is the simplest idea: throw all the words of a document into a "bag," shake it up (forget the order), and just count how many times each word appears. The result is a big table (called a Document-Term Matrix) where each row is a document and each column is a word.

Analogy: Imagine dumping all the words of an essay onto a table and sorting them into piles. You know what words were used and how often, but you've lost the order they appeared in — "dog bites man" and "man bites dog" look the same.

TF-IDF improves on raw word counts by asking: "Is this word special to THIS document, or does it appear everywhere?" Words that appear in almost every document (like "the" or "is") get downweighted, while words unique to a few documents get boosted. This helps surface the words that truly distinguish one document from another.


TF-IDF(t, d) = TF(t, d) × log(N / DF(t))
When to use: A great starting point for most text projects. Works well for document classification and search. Limitation: treats words as independent — doesn't understand that "happy" and "joyful" are similar.
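The formula above can be computed by hand in a few lines. This sketch uses a toy three-document corpus and the raw, unsmoothed formula; libraries such as scikit-learn's TfidfVectorizer apply smoothing and normalization on top, so their numbers will differ slightly.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
N = len(docs)

df = Counter()            # document frequency: how many docs contain each term
for doc in docs:
    df.update(set(doc))

def tfidf(term, doc):
    tf = doc.count(term)  # raw term frequency in this document
    return tf * math.log(N / df[term])

# "the" appears in every document, so its weight collapses to zero
print(tfidf("the", docs[0]))   # 0.0
# "mat" is unique to doc 0, so it gets the full log(3/1) boost
print(tfidf("mat", docs[0]))   # ≈ 1.0986
```

This is exactly the downweighting described above: frequent-everywhere words vanish, distinctive words surface.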
Representation

Word Embeddings (Word2Vec, GloVe)

BoW treats every word as completely unrelated — but we know "happy" and "joyful" are similar! Word embeddings fix this by representing each word as a list of numbers (a "vector") where similar words have similar numbers. The core idea is beautifully simple: "you shall know a word by the company it keeps" (Firth, 1957). If "cat" and "dog" often appear near the same words ("pet," "feed," "cute"), they must mean similar things.

Analogy: Think of words as people at a party. If two people always hang out in the same social circles, they probably have similar interests — even if they've never met each other directly.

Word2Vec (Mikolov et al., 2013) learns these vectors by training a simple neural network to predict either a word from its neighbors (CBOW) or neighbors from a word (Skip-gram). The famous result: vec("king") − vec("man") + vec("woman") ≈ vec("queen") — the model learned gender relationships just from reading text!

GloVe (Pennington et al., 2014) takes a different approach: instead of looking at local word windows, it analyzes the overall co-occurrence statistics of the entire corpus.


When to use: When you need the computer to understand that words can be similar. Great for measuring word similarity, detecting bias in language, or as input features for more advanced models.

Unsupervised Methods


What if you have thousands of documents but no labels — nobody has told you which category each one belongs to? Unsupervised methods let the data speak for itself. They discover hidden patterns, topics, and dimensions without any labeled training data. Think of it as asking "what's in here?" rather than "is this positive or negative?"


Unsupervised

Dictionary Methods & Sentiment Analysis


The simplest text analysis you can do: make a word list, then count. For example, a sentiment dictionary might say "happy" = +1, "terrible" = -1, "okay" = 0. To score a movie review, just add up the scores of all its words. That's it!

Analogy: It's like grading an essay by counting how many "good" words vs. "bad" words it uses — crude, but surprisingly useful as a first pass.

Popular dictionaries include LIWC (psychology), AFINN (sentiment), and Hu & Liu (opinion mining). Strengths: Transparent (you can see exactly why a document got its score), reproducible, and requires zero training data. Limitations: Can't handle context — it thinks "not good" is positive (it sees "good"!), misses sarcasm entirely, and the same word can mean different things in different domains ("sick" is negative in healthcare but positive in slang).

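The "make a list, then count" idea fits in a few lines. The tiny lexicon below is invented for illustration; real analyses would load AFINN, LIWC, or the Hu & Liu lists instead.

```python
# Toy sentiment lexicon (illustrative only; real work uses AFINN, LIWC, etc.)
LEXICON = {"happy": 1, "great": 1, "love": 1,
           "terrible": -1, "boring": -1, "awful": -1}

def score(text):
    """Sum the lexicon scores of all words; unknown words count as 0."""
    return sum(LEXICON.get(w, 0) for w in text.lower().split())

print(score("a great film i love it"))           # 2
print(score("terrible plot and boring acting"))  # -2
print(score("not great at all"))                 # 1  <- negation fools it
```

The last line shows the context limitation described above: the dictionary sees "great" and scores the review as positive, even though "not great" is negative.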

Unsupervised

Topic Models (LDA)


"What are people talking about?" — that's the question topic models answer. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) imagines that every document was written by first picking a few topics, then picking words from those topics. LDA works backwards: given the words we observe, it figures out what the hidden topics must be.

Analogy: Imagine a newspaper. Each article mixes a few themes — an article about climate policy might be 60% "environment," 30% "politics," 10% "economics." LDA discovers these themes automatically by noticing that words like "carbon," "emissions," "temperature" tend to appear together.

The researcher must choose how many topics K to look for — there's no single right answer, just like there's no single right number of folders to organize your files. You try different values and see which gives the most interpretable results. Common tools: held-out likelihood, topic coherence scores, and good old-fashioned "do these topics make sense to a human?"


When to use: You have a large collection of texts and want to discover what topics they cover — without reading them all yourself. Great for content analysis at scale.
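To make the "words that appear together reveal topics" idea concrete, here is a toy collapsed Gibbs sampler, one common way to fit LDA. Everything here is a pedagogical sketch: the corpus, the hyperparameters `alpha` and `beta`, and the iteration count are all illustrative, and real analyses would use Gensim's LdaModel or the R stm package rather than this loop.

```python
import random

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    wid = {w: i for i, w in enumerate(vocab)}
    ndk = [[0] * K for _ in docs]        # document -> topic counts
    nkw = [[0] * V for _ in range(K)]    # topic -> word counts
    nk = [0] * K                         # tokens assigned to each topic
    z = []                               # current topic of every token
    for d, doc in enumerate(docs):       # random initial assignments
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][wid[w]] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]              # remove this token's assignment
                ndk[d][k] -= 1; nkw[k][wid[w]] -= 1; nk[k] -= 1
                # P(topic) ∝ (how much this doc uses it) x (how much it likes this word)
                wts = [(ndk[d][j] + alpha) * (nkw[j][wid[w]] + beta) / (nk[j] + V * beta)
                       for j in range(K)]
                r = rng.random() * sum(wts)
                k = 0
                while r > wts[k]:
                    r -= wts[k]; k += 1
                z[d][n] = k
                ndk[d][k] += 1; nkw[k][wid[w]] += 1; nk[k] += 1
    # topic-word distributions (each row sums to 1)
    phi = [[(nkw[k][v] + beta) / (nk[k] + V * beta) for v in range(V)] for k in range(K)]
    return vocab, phi, ndk

docs = [
    "carbon emissions temperature climate".split(),
    "climate carbon emissions warming".split(),
    "tax budget economy growth".split(),
    "economy budget tax spending".split(),
]
vocab, phi, ndk = lda_gibbs(docs, K=2)
for k in range(2):
    top = max(range(len(vocab)), key=lambda v: phi[k][v])
    print(f"topic {k}: top word = {vocab[top]}")
```

On this tiny corpus the climate words and the economy words tend to sort into separate topics, for exactly the co-occurrence reason described above; which topic gets which label is arbitrary.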
Unsupervised

Structural Topic Model (STM)


Regular LDA discovers topics but can't tell you why topics vary across documents. STM (Roberts et al., 2014) is LDA's smarter cousin: it lets you plug in extra information about each document — like who wrote it, when, or where — and then the model can tell you how topics differ across those categories.

Example: Feed STM a collection of Congressional speeches with party labels. STM can tell you not just what topics are discussed, but that Democrats talk about healthcare 3x more than Republicans, and when they both discuss "economy," they use different words.

When to use: When you have metadata (party, time, source) and want to know how topics or language change across groups or over time. The go-to tool for social scientists.
Scaling

Text Scaling: Wordscores & Wordfish


Sometimes you don't want topics — you want to know where someone stands. Text scaling places documents (or their authors) on a spectrum, like a political left–right scale, based purely on word choices.

Analogy: If someone keeps saying "freedom," "market," and "deregulation," they're probably on the right. If they say "equality," "welfare," and "regulation," they're probably on the left. Text scaling automates this intuition.

Wordscores (Laver et al., 2003) needs reference texts — you give it a few documents with known positions (e.g., a clearly left-wing and a clearly right-wing manifesto), and it figures out where new documents fall by looking at which reference they use more similar words to.

Wordfish (Slapin & Proksch, 2008) doesn't need references at all — it discovers the latent dimension purely from how word frequencies differ across documents. Think of it as "let the data tell me who's on which end."


When to use: Political science — estimating party positions from manifestos, legislative speeches, or policy documents along a single dimension (e.g., left vs. right, pro vs. anti).

Supervised Methods


Unlike unsupervised methods, supervised methods need a "teacher" — a set of documents that humans have already labeled (e.g., "positive" / "negative," or "about healthcare" / "about economy"). The model learns from these examples and then predicts labels for new, unseen documents. Think of it as training with an answer key, then taking the test solo.


Supervised

Random Forest

Imagine asking 100 people to classify a document, but each person only sees a random portion of the words. Each person makes an imperfect judgment, but when you take a majority vote, the crowd is surprisingly accurate. That's Random Forest: it builds hundreds of decision trees, each looking at a random subset of features, then averages their predictions.

Analogy: One quiz from a single question is unreliable. But average 100 quizzes, each with different random questions, and you get a very reliable assessment.

A bonus: Random Forest tells you which words matter most for the prediction (called feature importance), so you can peek inside and understand why the model classifies things the way it does.

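This is not a real Random Forest (in practice that would be scikit-learn's RandomForestClassifier); the simulation below only illustrates the majority-vote intuition from the quiz analogy, under the simplifying assumption of 101 independent voters who are each right 60% of the time. Real trees make correlated errors, so actual gains are smaller than this best case.

```python
import random

def vote_accuracy(p_correct, n_voters, n_trials, seed=42):
    """Compare one weak classifier against a majority vote of many."""
    rng = random.Random(seed)
    single_hits = ensemble_hits = 0
    for _ in range(n_trials):
        votes = [rng.random() < p_correct for _ in range(n_voters)]
        single_hits += votes[0]                     # one voter alone
        ensemble_hits += sum(votes) > n_voters / 2  # majority vote
    return single_hits / n_trials, ensemble_hits / n_trials

single, ensemble = vote_accuracy(p_correct=0.6, n_voters=101, n_trials=2000)
print(f"single voter: {single:.2f}, majority vote: {ensemble:.2f}")
```

A single 60%-accurate voter stays near 60%, while the majority vote climbs well above 90%: many weak, diverse judgments beat one weak judgment.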

Deep Learning

Neural Networks for Text


Neural networks don't need you to tell them what features to look for — they figure it out themselves from the data. For text, three architectures matter:

CNNs — Like a sliding magnifying glass that scans text for useful local patterns. Good at spotting short phrases like "not good" or "highly recommended."

RNNs / LSTMs — Read text word by word, remembering what came before. Like a person reading a book from start to finish, keeping track of the plot. LSTMs are the improved version that can remember things from much earlier in the text.

Transformers — The breakthrough architecture behind ChatGPT. Instead of reading word by word, transformers look at ALL words at once and figure out which words are important for understanding each other. (More details on the LLM & NLP page.)


Deep Learning

Large Language Models (BERT, GPT)


LLMs have read billions of web pages during pre-training, so they already "understand" language before you give them any task-specific data. Two key players:

BERT — Reads text in both directions (like reading a sentence forwards AND backwards at once). Give it a few hundred labeled examples and it quickly learns to classify your specific task. The "fine-tuning" approach.

GPT (ChatGPT, GPT-4) — A text generator trained to predict the next word. The magic: you can simply describe what you want in natural language (a "prompt") and it does it — no training data needed. Give it zero examples (zero-shot), a few examples (few-shot), or step-by-step instructions (chain-of-thought).

For text-as-data research: "Classify this tweet as pro-climate or anti-climate" → GPT can do this with zero training examples. Need higher accuracy? Fine-tune BERT on a few hundred hand-labeled tweets. Need even better? Combine both: use GPT to pre-label thousands of documents cheaply, then fine-tune BERT on the best ones.

How Do We Know If It's Working? — Evaluation Metrics

Accuracy = how many did you get right overall? Simple but misleading if your data is imbalanced (99% accuracy is easy if 99% of documents are in one class). Precision = "of the documents you labeled positive, how many actually were?" (avoiding false alarms). Recall = "of all the actual positives, how many did you catch?" (avoiding misses). F1 = the balance between precision and recall. Always report multiple metrics and use cross-validation!

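These four metrics follow directly from the confusion-matrix counts. The sketch below computes them for binary labels and uses an invented imbalanced example to reproduce the "99% accuracy is easy" trap; scikit-learn's precision_score, recall_score, and f1_score are the library equivalents.

```python
def metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Imbalanced data: a lazy "predict negative for everything" classifier
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
lazy   = [0] * 10
acc, prec, rec, f1 = metrics(y_true, lazy)
print(acc, rec)   # 0.8 0.0  -> decent accuracy, zero recall
```

The lazy classifier scores 80% accuracy while catching zero positives, which is precisely why multiple metrics should be reported.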


Annotation & Human-in-the-Loop


Supervised learning is only as good as its labels — "garbage in, garbage out." If your training labels are wrong or inconsistent, even the fanciest model will learn the wrong patterns. Good annotation is like building a solid foundation for a house: invisible but essential.


How to Measure Agreement

If two people label the same 100 tweets, how often do they agree? Raw agreement isn't enough — two people flipping coins would agree 50% of the time by chance!

Cohen's Kappa (κ): Corrects for chance agreement between two annotators. κ > .80 = good, κ < .40 = poor.

Krippendorff's Alpha (α): More flexible — works with multiple annotators, missing data, and different label types. α > .80 = reliable; .67–.80 = tentative conclusions only.

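Cohen's κ can be computed directly from two label lists: observed agreement minus the agreement expected from each annotator's label frequencies, rescaled. The annotator labels below are invented for illustration; scikit-learn's cohen_kappa_score is the library equivalent.

```python
def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators (categorical labels)."""
    n = len(a)
    labels = set(a) | set(b)
    po = sum(x == y for x, y in zip(a, b)) / n                      # observed
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)   # by chance
    return (po - pe) / (1 - pe)

ann1 = [1, 1, 1, 1, 0, 0, 0, 0]
ann2 = [1, 1, 1, 0, 0, 0, 0, 1]
print(cohens_kappa(ann1, ann2))   # 0.5
```

The two annotators agree on 75% of items, but because each uses the labels 50/50, chance alone predicts 50% agreement, so κ corrects the raw figure down to 0.5, which is only moderate reliability.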

LLMs as Annotators

Why hire 10 research assistants when GPT-4 can label 10,000 documents overnight? Recent research shows LLMs can match or exceed crowd-worker quality on many tasks — and they never get tired or bored.

But be careful: Prompt design matters enormously (clear instructions + a few examples = much better results), and you still need to validate against expert human labels. LLMs have biases too, and they might be systematically wrong in ways that are hard to detect.


Active Learning — Be Smart About What You Label

Labeling data is expensive and slow. Active learning is a clever shortcut: instead of randomly picking documents to label, let the model tell you which ones it's most confused about, and label those first. This way, every label you add teaches the model the most. It's like studying for an exam by focusing on the questions you got wrong, not reviewing what you already know.

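The simplest active-learning strategy, uncertainty sampling, is just a sort: ask the model for its predicted probabilities and label the documents closest to the decision boundary first. The probabilities below are invented for illustration; with more than two classes, entropy of the predicted distribution is the usual uncertainty measure.

```python
def most_uncertain(probabilities, k):
    """Indices of the k documents whose predicted P(positive) is nearest 0.5."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]

# Model's P(positive) for 6 unlabeled documents (hypothetical values)
probs = [0.97, 0.51, 0.08, 0.46, 0.88, 0.52]
print(most_uncertain(probs, 2))   # [1, 5] -> label the 0.51 and 0.52 docs first
```

Documents scored 0.97 or 0.08 teach the model almost nothing, so the labeling budget goes to the 0.51 and 0.52 cases it is genuinely unsure about.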


Ethics in Text Analysis


Just because you can analyze text doesn't mean you always should — or that it's safe to do so carelessly. Working with text data raises real ethical concerns that researchers must think about from day one, not as an afterthought.


Privacy


  • Text often reveals who people are — names, locations, writing style
  • Just because a tweet is public doesn't mean the person agreed to be in your study
  • Even "anonymized" text can be traced back — people have unique writing fingerprints
Critical

Bias


  • Models learn from human-written text, which reflects historical biases — and then amplify them
  • Word embeddings literally encode stereotypes: "doctor" closer to "man," "nurse" closer to "woman"
  • Sentiment tools may rate African American English as more negative — the tool isn't neutral

Reproducibility


  • Share your code and data (when possible) so others can check your work
  • Record every choice: which model, which version, which random seed, which preprocessing
  • LLM outputs change between runs — always report your exact prompt and model version (e.g., "GPT-4-turbo, Jan 2024")

Choosing a Method


With so many methods, how do you pick? The biggest deciding factor is often the simplest: do you have labeled data or not? Here's a quick guide:


No Labeled Data

Dictionary methods — "I know what categories I want, just count the words" (simplest)

Topic models (LDA/STM) — "What topics are hiding in this pile of text?" (exploratory)

Wordfish — "Where does each author stand on a left-right scale?" (scaling)

LLM zero-shot — "Just ask ChatGPT to classify it" (quick & surprisingly good)


With Labeled Data

Wordscores — "I have reference texts, tell me where new ones fall" (scaling)

Random Forest / SVM — "Give me a solid, interpretable baseline" (classic ML)

Fine-tuned BERT — "I want the best accuracy possible" (state-of-the-art)

LLM few-shot — "I only have a handful of labeled examples" (flexible)



References & Software


Key References


  • Beginner: Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
  • Preprocessing: Denny, M. J., & Spirling, A. (2018). Text preprocessing for unsupervised learning: Why it matters, when it misleads, and what to do about it. Political Analysis, 26(2), 168–189.
  • Topic models: Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
  • STM: Roberts, M. E., Stewart, B. M., Tingley, D., et al. (2014). Structural topic models for open-ended survey responses. American Journal of Political Science, 58(4), 1064–1082.
  • Scaling: Slapin, J. B., & Proksch, S.-O. (2008). A scaling model for estimating time-series party positions from texts. American Journal of Political Science, 52(3), 705–722.
  • Scaling: Laver, M., Benoit, K., & Garry, J. (2003). Extracting policy positions from political texts using words as data. American Political Science Review, 97(2), 311–331.
  • Embeddings: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv:1301.3781.
  • Embeddings: Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. EMNLP.
  • LLM: Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT.

Software


Python — NLTK / spaCy

Core NLP libraries. NLTK for learning, spaCy for production. Tokenization, POS, NER, lemmatization.


Python — scikit-learn

TF-IDF vectorizer, Random Forest, SVM, evaluation metrics, cross-validation. The ML workhorse.


Python — Gensim

Topic modeling (LDA), Word2Vec, Doc2Vec. Efficient for large corpora.


R — quanteda

Text analysis in R. DTM construction, dictionaries, Wordfish, Wordscores. Social science standard.


R — stm

Structural Topic Models with covariate effects. Roberts et al.'s implementation.


HuggingFace Transformers

Pre-trained BERT, GPT, and other LLMs. Fine-tuning and inference. Python ecosystem.
