Empirical Modeling of Social Science Theory

New to Empirical Modeling?

什么是经验建模？

This course bridges theory and data. In social science, we often want to know not just whether X causes Y, but how — how does Y respond when X changes, given everything else going on? The "Model It" strategy means using theory to specify your empirical model precisely: what variables matter, how they interact, and how they change over time. Prerequisites: familiarity with basic regression is helpful but we review everything. Software: Stata and R code provided.

这门课是理论与数据之间的桥梁。在社会科学中，我们不仅想知道 X 是否影响 Y，还想知道"怎样影响"——当 X 变化时，Y 如何回应，考虑到其他所有因素？"模型化"策略意味着用理论来精确指定你的经验模型：哪些变量重要、它们如何互动、如何随时间变化。前置知识：了解基础回归有帮助但我们会复习所有内容。软件：提供 Stata 和 R 代码。

Four Modes of Empirical Analysis

经验分析的四种模式

Not all empirical questions are the same. Some ask "What is this?" (measurement). Others ask "What comes next?" (prediction). Still others ask "Did X cause Y?" (causal inference). And some ask "How much does Y change when X changes?" (causal response). Each has different standards for success.

Why this distinction matters: An RCT is perfect for answering Mode III — "Did the treatment work?" But if you want to know the magnitude of response across different contexts, or how effects propagate through a system, Mode III evidence may not be enough. Mode IV requires understanding not just treatment effects in isolation, but how variables respond to each other in the full system.

并非所有经验问题都是一样的。有些问"这是什么？"（测量）。有些问"接下来会怎样？"（预测）。有些问"X 是否导致 Y？"（因果推断）。还有些问"当 X 变化时，Y 变化多少？"（因果反应）。每一种都有不同的成功标准。

为什么这个区分很重要：RCT 完美地回答了模式三——"处理有效吗？"但如果你想知道反应的幅度在不同环境中，或效应如何通过系统传播，模式三证据可能不够。模式四需要理解不仅是隔离中的处理效应，还有变量如何在完整系统中相互反应。

Mode I: Measurement & Description

模式一：测量与描述

Question: What is X? How often does Y occur?

Gold Standard: Usefulness — do practitioners find the measure useful?

Methods: Descriptive statistics, index construction, validation studies

Example: Building a corruption index from multiple indicators

问题：X 是什么？Y 发生多频繁？

黄金标准：有用性 — 实践者觉得这个度量有用吗？

方法：描述统计、指标构建、验证研究

例子：从多个指标构建腐败指数

Mode II: Prediction & Forecasting

模式二：预测与预报

Question: What happens next? Can we forecast?

Gold Standard: Out-of-sample error — accuracy on new, unseen data

Methods: Time series models, ML classifiers, ensemble methods

Example: Predicting stock market movements or election outcomes

问题：接下来会怎样？我们能预测吗？

黄金标准：样本外误差 — 在新的、未见过的数据上的准确性

方法：时间序列模型、ML 分类器、集成方法

例子：预测股票市场或选举结果

Mode III: Causal Effect Inference

模式三：因果效应推断

Question: Does X cause Y? What is the treatment effect?

Gold Standard: RCT — randomized controlled trial

Methods: IV estimation, matching, DID, RDD, experiments

Example: Does increasing class size hurt student outcomes?

Key insight: Mode III tests whether causation exists (eliminates reverse causality and confounders), but often in artificially isolated contexts. The RCT that proves X causes Y is excellent for establishing causality direction, but may tell you less about how Y responds to X in the broader system where X and Y cause each other.

问题：X 是否导致 Y？处理效应是什么？

黄金标准：RCT — 随机对照实验

方法：工具变量、配对、双重差分、断点回归、实验

例子：增加班级规模是否伤害学生成绩？

关键见解：模式三测试因果关系是否存在（排除反向因果和混淆因子），但通常在人为隔离的环境中。证明 X 导致 Y 的 RCT 非常适合建立因果方向，但可能告诉你更少关于在 X 和 Y 相互影响的更广泛系统中 Y 如何对 X 反应。

Mode IV: Causal-Model & Causal-Response

模式四：因果模型与因果反应

Question: How does Y respond to X in the full system? What are the short-run and long-run effects?

Gold Standard: Out-of-sample estimated-response error — can we predict new situations accurately?

Methods: Structural estimation, systems of equations, dynamic models, simulation

Example: How much do wages adjust when unemployment rises? This isn't just "does unemployment affect wages?" but "what is the trajectory of wage adjustment, given labor market equilibration and expectations?"

The crucial difference from Mode III: "The isolation that makes RCT good for testing WHETHER makes it poor for estimating HOW." When democracies become more developed (or vice versa), you can't randomize away the feedback effects. In the real system, Democracy = f(Development) AND Development = g(Democracy) — a simultaneous system. Mode IV asks: given these mutual influences, how do the variables adjust when one changes?

问题：在完整系统中 Y 如何对 X 作出反应？短期和长期效应是什么？

黄金标准：样本外估计反应误差——我们能准确预测新情况吗？

方法：结构估计、方程组系统、动态模型、模拟

例子：失业率上升时，工资调整多少？这不仅仅是"失业是否影响工资？"而是"考虑到劳动力市场均衡和预期，工资调整的轨迹是什么？"

与模式三的关键差异："使 RCT 适合测试'是否'的隔离使其不适合估计'如何'。"当民主变得更发达（反之亦然）时，你不能随机化掉反馈效应。在真实系统中，民主 = f(发展) 且发展 = g(民主)——一个同时系统。模式四问：给定这些相互影响，当一个变化时，变量如何调整？

The Five Fundamental Challenges

五大根本挑战

Social systems are complicated. Building empirical models that honestly capture this complexity — without overfitting or making silly simplifications — requires grappling with five fundamental challenges:

社会系统很复杂。构建能真实捕捉这种复杂性的经验模型——不过度拟合、不做傻里傻气的简化——需要应对五大根本挑战：

01

Multicausality

多因性

Just about everything matters. Predicting exam scores? Income, family support, sleep, teacher quality, peer effects, prior knowledge, and hundreds of other factors all play a role. Your model can't include everything — so which variables are most important?

The mathematical insight: In OLS, your coefficient on X₁ is: b₁ = Cov(Y,X₁)/Var(X₁). But when multiple X's exist, each coefficient also depends on correlations with other variables. If you omit X₂, then b₁ will absorb X₂'s effect: b₁ ≈ β₁ + β₂·Corr(X₁,X₂). Your estimate is biased by the omitted variable bias formula. This is why variable selection guided by theory (not data) matters critically.

几乎什么都重要。预测考试分数？收入、家庭支持、睡眠、教师质量、同伴效应、前置知识和数百个其他因素都起作用。你的模型不能什么都包含——那么哪些变量最重要呢？

数学见解：在 OLS 中，你的 X₁ 系数是：b₁ = Cov(Y,X₁)/Var(X₁)。但当多个 X 存在时，每个系数也取决于与其他变量的相关性。如果你遗漏 X₂，那么 b₁ 将吸收 X₂ 的效应：b₁ ≈ β₁ + β₂·Corr(X₁,X₂)。你的估计受遗漏变量偏差公式的影响。这就是为什么由理论指导的变量选择（不是数据）至关重要。

02

Context Conditionality

情境条件性

How everything matters depends on everything else. Education's effect on earnings differs by gender, region, and time period. Democracy's stability depends on income distribution, ethnic diversity, and institutional design. Effects are not universal — they're conditional.

The key implication: If effects vary by context, you need MORE contexts, not fewer. Some scholars pool across contexts to get more observations — this is backward! If you believe the effect of X on Y is different in Context A vs Context B, then pooling them only masks the variation. Instead, you need to estimate separate effects for each context and understand why they differ. This often requires interactions (β₁ + β₃Z, where Z is context) or even separate models.

什么如何重要取决于其他什么。教育对收入的影响因性别、地区和时期而异。民主的稳定性取决于收入分配、种族多样性和制度设计。效应不是普遍的——它们是有条件的。

关键含义：如果效应因情境而异，你需要更多情境，而不是更少。一些学者在各种情境中汇总以获得更多观测值——这是倒退！如果你相信 X 对 Y 的影响在情境 A 与情境 B 中不同，那么汇总它们只会掩盖变异。相反，你需要为每个情境估计单独的效应并理解为什么它们不同。这通常需要交互（β₁ + β₃Z，其中 Z 是情境）甚至单独的模型。

03

Dynamics & Dependence

动态与依赖

Everything is dynamic. Yesterday's inflation affects today's interest rates. Last year's investment drives this year's productivity. Nothing stands still. Observations are not independent — what happened before shapes what happens now.

一切都是动态的。昨天的通货膨胀影响今天的利率。去年的投资推动今年的生产力。没有什么是静止的。观测值不独立——过去发生的事塑造现在发生的事。

04

Ubiquitous Endogeneity

普遍内生性

Everything causes everything else. Does education increase wages, or do talented people both get educated and earn more? Does unemployment cause crime, or do criminals avoid employment? Cause and effect run both directions. Your right-hand-side variables are probably correlated with your errors.

The identification problem: Suppose Y = a·X and X = b·Y (simultaneous causation). Solving these: Y = a·(b·Y), so ab = 1. Any pair (a, b) satisfying ab = 1 is consistent with the observed correlation between X and Y. Without additional information, you cannot identify the individual effects a and b — you only know their product is 1. This is why instrumental variables are needed: they provide external information that breaks the simultaneity.

什么都相互影响。教育是否增加工资，还是有才华的人既受教育又赚钱更多？失业是否导致犯罪，还是犯罪分子回避就业？因果关系双向流动。你的自变量可能与你的误差项相关。

识别问题：假设 Y = a·X 和 X = b·Y（同时因果）。求解这些：Y = a·(b·Y)，所以 ab = 1。任何满足 ab = 1 的对 (a, b) 与 X 和 Y 之间的观测相关性一致。没有额外信息，你无法识别个别效应 a 和 b——你只知道它们的乘积是 1。这就是为什么需要工具变量：它们提供打破同时性的外部信息。

05

Micronumerosity

样本稀缺

Usually far too little data to figure it all out. You want to model inflation in 50 countries over 40 years with interactions and lag structures? That's 2,000 observations but thousands of potential parameters. The data is stingy — you must choose what matters most.

The matrix algebra perspective: Suppose you have M units (countries, firms, etc.) and measure y₁, y₂, ..., yₘ for each. A fully general model allows every yᵢ to affect every yⱼ. There are M² - M possible relationships (excluding self-loops), but only ½(M² - M) observable pairwise correlations (since correlation is symmetric). By counting, you need at least ½(M² - M) additional pieces of information (exclusion restrictions, theoretical constraints) just to identify the system. As M grows, this becomes crippling: with 10 units, you need 45 restrictions; with 100 units, you need 4,950 restrictions. Theory is your only way out.

通常数据远不足以搞清楚一切。你想用交互项和滞后结构对 50 个国家 40 年的通胀建模？那是 2,000 个观测值但成千上万的潜在参数。数据很少——你必须选择什么最重要。

矩阵代数视角：假设你有 M 个单位（国家、公司等）并为每个测量 y₁, y₂, ..., yₘ。完全一般模型允许每个 yᵢ 影响每个 yⱼ。有 M² - M 种可能的关系（排除自循环），但只有 ½(M² - M) 个观测到的两两相关性（因为相关性是对称的）。通过计数，你至少需要 ½(M² - M) 个额外的信息片段（排除限制、理论约束）来识别系统。当 M 增长时，这变得令人窒息：10 个单位需要 45 个限制；100 个单位需要 4,950 个限制。理论是你唯一的出路。

The EMTI Strategy: Model It!

EMTI 策略：模型化！

When faced with these five challenges, how do you build a credible empirical model? The answer is theory-guided specification. Rather than let the data choose everything (which leads to overfitting and capitalization on chance), you use substantive theory to guide which variables matter, how they interact, and what functional forms make sense. This is the idea behind EMTI — theory and intuitions inform your empirical model.

Clarke & Primo's insight: Theoretical models are useful simplifications of reality. Similarly, empirical models are useful simplifications of the data. Neither the theory nor the empirics can capture everything. The goal is to simplify thoughtfully, using theory to decide which simplifications matter and which don't.

Why not use "robust" or "nonparametric" approaches? Robust methods (like those without distributional assumptions) and nonparametric methods (like kernel regression) sound appealing — they relax restrictive assumptions. But they relax structural impositions at the cost of statistical efficiency. You need more data to estimate effects as precisely. Given Challenge 05 (micronumerosity), you usually have too little data to afford this efficiency loss. Theory tells you which impositions are reasonable, and you should use them.

面对这五大挑战，你如何构建可信的经验模型？答案是理论指导的设定。与其让数据选择一切（导致过度拟合和运气成分），你应该用实质性理论指导哪些变量重要、如何互动、什么函数形式有意义。这就是 EMTI 的思想——理论和直觉指导你的经验模型。

Clarke & Primo 的见解：理论模型是现实的有用简化。同样，经验模型是数据的有用简化。理论和经验都不能捕捉一切。目标是周到地简化，使用理论决定哪些简化重要，哪些不重要。

为什么不使用"稳健"或"非参数"方法？稳健方法（如那些没有分布假设的）和非参数方法（如核回归）听起来很吸引人——它们放松了限制性假设。但它们以统计效率成本放松结构限制。你需要更多数据来估计效应并做到同样精确。考虑到挑战 05（样本稀缺），你通常没有足够的数据来承受这种效率损失。理论告诉你哪些限制是合理的，你应该使用它们。

EITM (Classic)

EITM（经典）

Theory → Predictions → Tests

Sharper predictions from theory allow stronger empirical tests. Does the data falsify the theory?

理论 → 预测 → 检验

来自理论的更锐利预测允许更强的经验检验。数据是否反驳理论？

TMEI (Strong)

TMEI（强有力）

Theory → Model → Estimation → Interpretation

Theory fully structures the empirical model. Every variable choice, interaction, and functional form comes from theory.

理论 → 模型 → 估计 → 解释

理论完全结构化经验模型。每个变量选择、交互和函数形式都来自理论。

TIEM (Learning)

TIEM（学习）

Theory ← Empirical Findings ← Data

Empirical results inform and refine theory. Surprising findings prompt theoretical revision.

理论 ← 经验发现 ← 数据

经验结果告知和完善理论。令人惊讶的发现促进理论修订。

EMTI (This Course)

EMTI（本课程）

Empirical → Model → Theory → Intuitions

Intuitions from theory guide specification. Model specification choices are theory-informed, not data-driven.

经验 → 模型 → 理论 → 直觉

来自理论的直觉指导设定。模型设定选择是理论知情的，不是数据驱动的。

The Course Structure

课程结构

1

Reviews

基础回顾

OLS, GLS, MLE

OLS、GLS、MLE

2

Interactions

交互项

Multilevel models

多层次模型

3

Dynamics

动态性

Time series, ECM

时间序列、误差修正

4

Endogeneity

内生性

IV, systems, GMM

工具变量、系统、GMM

Module 1: Multicausality & Reviews

模块一：多因性与基础复习

We start by reviewing the core regression methods. These are not new — you've seen them before. But we'll frame them through the lens of causal-response estimation, emphasizing how to specify them correctly given your theory.

我们从复习核心回归方法开始。这些不是新的——你以前见过。但我们将通过因果反应估计的镜头来框架化它们，强调如何根据你的理论正确地指定它们。

Foundation

OLS & Classical Linear Regression

OLS 与经典线性回归

The workhorse of empirical social science. You specify a linear relationship: y = Xβ + ε. The coefficients β represent how Y changes with each X (holding others constant). OLS minimizes the sum of squared residuals — geometrically, it fits a hyperplane that minimizes vertical distance to the data.

y = Xβ + ε, β̂ = (X'X)⁻¹X'y

OLS is like finding the best flat surface to lay on a bumpy landscape of data points. You're not trying to fit every bump perfectly — that would overfit. Instead, you find the one flat plane that minimizes total distance to all points, balancing errors up and down.

经验社会科学的主要工具。你指定线性关系：y = Xβ + ε。系数 β 表示 Y 与每个 X 的变化（固定其他变量）。OLS 最小化残差平方和——从几何上讲，它拟合一个超平面，使到数据的垂直距离最小。

y = Xβ + ε, β̂ = (X'X)⁻¹X'y

OLS 就像在凹凸不平的数据点景观上找到最佳的平坦表面。你不是试图完美地适应每个凹凸——那会过度拟合。相反，你找到一个平面，最小化到所有点的总距离，平衡上下误差。

The Gauss-Markov Theorem: Under five classical assumptions, OLS produces the Best Linear Unbiased Estimator (BLUE) — no other linear estimator is more efficient.

The Five Classical Assumptions:

Linearity: The true relationship is linear in parameters (y = Xβ + ε).
Exogeneity: E(ε|X) = 0 — errors are uncorrelated with regressors.
No Multicollinearity: X'X is invertible (no perfect linear dependence among regressors).
Homoscedasticity: Var(ε|X) = σ² — errors have constant variance.
No Autocorrelation: Cov(εᵢ, εⱼ|X) = 0 for i ≠ j — errors are uncorrelated with each other.

Gauss-Markov 定理：在五个经典假设下，OLS 产生最佳线性无偏估计量 (BLUE)——没有其他线性估计量更有效。

五个经典假设：

线性性：真实关系在参数中是线性的 (y = Xβ + ε)。
外生性：E(ε|X) = 0 — 误差与回归变量不相关。
无多重共线性：X'X 是可逆的（回归变量间无完全线性依赖）。
同方差性：Var(ε|X) = σ² — 误差有常数方差。
无自相关：Cov(εᵢ, εⱼ|X) = 0 对于 i ≠ j — 误差相互不相关。

Three Important Intuitions:

(1) Omitted Variable Bias (OVB): If you omit variable Z that both affects Y and correlates with X, then plim(b₁) = β₁ + β₃·(Cov(X,Z)/Var(X)) — your coefficient absorbs Z's effect, scaled by their correlation. The direction and magnitude of bias depends on the signs and magnitudes of these correlations.

(2) Measurement Error & Attenuation Bias: If X is measured with error (X* = X + u), then the observed regressor X* has added noise. The coefficient on X* becomes b ≈ β·(Var(X)/(Var(X) + Var(u))) — biased toward zero. This is called attenuation bias, and it's especially problematic for concluding "no effect" when in fact there's just too much noise.

(3) Simultaneity Bias: If X and Y cause each other, then the error ε in the Y equation is correlated with X (because Y appears in the X equation's error). This creates bias in both directions of causality, and the direction/magnitude depends on the system parameters.

三个重要的直觉：

(1) 遗漏变量偏差 (OVB)：如果你遗漏变量 Z 既影响 Y 又与 X 相关，那么 plim(b₁) = β₁ + β₃·(Cov(X,Z)/Var(X)) — 你的系数吸收 Z 的效应，按它们的相关性缩放。偏差的方向和幅度取决于这些相关性的符号和幅度。

(2) 测量误差与衰减偏差：如果 X 用误差测量 (X* = X + u)，那么观测到的回归变量 X* 有额外噪声。X* 上的系数变成 b ≈ β·(Var(X)/(Var(X) + Var(u))) — 向零有偏。这称为衰减偏差，对于结论"无效应"特别有问题，而实际上只是噪声太多。

(3) 同时性偏差：如果 X 和 Y 相互影响，那么 Y 方程中的误差 ε 与 X 相关（因为 Y 出现在 X 方程的误差中）。这导致因果关系两个方向的偏差，方向/幅度取决于系统参数。

When to use: Default choice for continuous outcomes, when you believe the exogeneity assumption holds. Fast, interpretable. Always check: are your X variables truly exogenous? Have you included the right variables? If not, coefficients are meaningless — think carefully about whether your model is correctly specified. Report robust standard errors (Huber-White) to account for heteroscedasticity and clustering.

何时使用：连续结果变量的默认选择，当你相信外生性假设成立时。快速、可解释。总是检查：你的 X 变量真的是外生的吗？你包含了正确的变量吗？如果没有，系数毫无意义——仔细思考你的模型是否正确指定。报告稳健标准误（Huber-White）以说明异方差和聚集。

Foundation

Generalized Least Squares (GLS)

广义最小二乘法 (GLS)

OLS assumes errors are independent with constant variance. Reality is messier: errors might be correlated across time (autocorrelation) or across groups (clustering). GLS down-weights observations with more noise and accounts for error correlations. The idea: if you know something is measured with more error, don't trust it as much.

β̂_GLS = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y, where Ω is the error covariance matrix

GLS is like wearing noise-cancelling headphones while listening to data. Some observations are "loud" (high variance, or highly correlated with neighbors), so you turn down the volume on those. Other observations are "clear signals" (low variance, independent), so you trust them more. The weight matrix Ω⁻¹ tunes this down-weighting.

OLS 假设误差独立且方差恒定。现实更复杂：误差可能随时间相关（自相关）或跨组相关（聚集）。GLS 降低噪声较大的观测的权重并说明误差相关性。思想：如果你知道某物测量噪声较大，就不要那么相信它。

β̂_GLS = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y, 其中 Ω 是误差协方差矩阵

GLS 就像在听数据时戴着降噪耳机。一些观测"很吵"（高方差或与邻近相关性很高），所以你降低这些的音量。其他观测是"清晰信号"（低方差、独立），所以你更信任它们。权重矩阵 Ω⁻¹ 调整这个降权。

Specific Cases:

Heteroscedasticity (unequal variances): If Ω is diagonal but not σ²I, use Weighted Least Squares (WLS). Weight each observation inversely by its variance: observations with higher variance get lower weight.

Autocorrelation (time series): If Ω has off-diagonal elements due to temporal correlation, use methods like Cochrane-Orcutt or Prais-Winsten that iteratively estimate the correlation coefficient ρ and transform the data.

Clustering (grouped data): If Ω has blocks of correlation within groups, use Feasible GLS (FGLS) that estimates Ω from the data, or simply report cluster-robust standard errors (down-weight within-cluster variance).

Research scenario: You survey students nested in 50 schools. Students in the same school are more alike than students in different schools (positive intraclass correlation). OLS standard errors are artificially small because you treat each student as independent, when they're not. Use multilevel GLS or cluster-robust standard errors to correct for this.

特殊情况：

异方差（不等方差）：如果 Ω 是对角线但不是 σ²I，使用加权最小二乘法 (WLS)。通过方差的倒数加权每个观测：方差更高的观测获得较低权重。

自相关（时间序列）：如果 Ω 因时间相关性具有非对角线元素，使用 Cochrane-Orcutt 或 Prais-Winsten 等方法，迭代估计相关系数 ρ 并转换数据。

聚集（分组数据）：如果 Ω 在组内具有相关性块，使用可行 GLS (FGLS) 从数据估计 Ω，或简单地报告聚集稳健标准误（降低聚集内方差）。

研究场景：你在 50 所学校中调查学生。同一学校的学生比不同学校的学生更相似（正的类内相关）。OLS 标准误人为地较小，因为你认为每个学生是独立的，但他们不是。使用多层次 GLS 或聚集稳健标准误来纠正这一点。

When to use: When errors are not independent (panel data with clustering, time series with autocorrelation). Always use robust standard errors or specify the correlation structure explicitly. In practice, cluster-robust SEs are the most commonly used GLS approach.

何时使用：当误差不独立时（有聚集的面板数据，有自相关的时间序列）。总是使用稳健标准误或显式指定相关结构。在实践中，聚集稳健 SE 是最常用的 GLS 方法。

Foundation

Maximum Likelihood Estimation (MLE) for Nonlinear Models

非线性模型的最大似然估计 (MLE)

When your outcome is not continuous (binary: yes/no, count: 0, 1, 2, ..., or categorical: A, B, C, D), OLS is inappropriate. Instead, you specify a likelihood function based on the theoretical distribution of your outcome, then find the parameters that maximize it.

L(θ) = ∏ᵢ f(yᵢ | xᵢ, θ), θ̂ = argmax log L(θ)

当你的结果不是连续的（二元：是/否，计数：0、1、2 等，或分类：A、B、C、D），OLS 不合适。相反，你基于结果的理论分布指定似然函数，然后找到最大化它的参数。

L(θ) = ∏ᵢ f(yᵢ | xᵢ, θ), θ̂ = argmax log L(θ)

How MLE Works Conceptually: You propose a model where observations come from a probability distribution, parameterized by θ. Each observed data point has some probability under this model. The likelihood is the joint probability of observing all the data, given θ. MLE finds the θ that makes the observed data most probable. We maximize the log-likelihood (equivalent but numerically stable) using numerical optimization (Newton-Raphson, BFGS, etc.).

The Log-Likelihood Trick: The log-likelihood is ℓ(θ) = Σᵢ log f(yᵢ | xᵢ, θ). Why use logs? (1) Products become sums, easier to compute; (2) log is monotonic, so maximizing ℓ is equivalent to maximizing L; (3) avoids numerical underflow when likelihoods are very small.

Standard Errors & Information Matrix: The information matrix I(θ) ≈ -∂²ℓ/∂θ² tells you the curvature of the likelihood surface. Sharper curvature = more certainty about θ. Standard errors are roughly proportional to the inverse of the information matrix: SE(θ̂) ≈ √(Var(θ̂)), where Var(θ̂) ≈ I(θ̂)⁻¹.

MLE 工作的概念方式：你提出一个模型，其中观测来自概率分布，由 θ 参数化。每个观测数据点在此模型下有某个概率。似然是观测所有数据的联合概率，给定 θ。MLE 找到使观测数据最可能的 θ。我们使用数值优化（Newton-Raphson、BFGS 等）最大化对数似然（等价但数值稳定）。

对数似然技巧：对数似然是 ℓ(θ) = Σᵢ log f(yᵢ | xᵢ, θ)。为什么使用对数？(1) 乘积变成和，更容易计算；(2) 对数是单调的，所以最大化 ℓ 等同于最大化 L；(3) 避免当似然非常小时的数值下溢。

标准误与信息矩阵：信息矩阵 I(θ) ≈ -∂²ℓ/∂θ² 告诉你似然表面的曲率。更锐利的曲率 = 关于 θ 更多的确定性。标准误大致与信息矩阵的逆成比例：SE(θ̂) ≈ √(Var(θ̂))，其中 Var(θ̂) ≈ I(θ̂)⁻¹。

Common Models:

Logit (binary outcomes): P(y=1|x) = Λ(xβ) = 1/(1 + e^{-xβ}). The latent variable interpretation: assume y* = xβ + ε where ε ~ Logistic, and observe y = 1 if y* > 0, else 0. The logistic CDF becomes Λ.

Probit (binary outcomes): P(y=1|x) = Φ(xβ), where Φ is the standard normal CDF. Similar to logit but with normal errors instead of logistic. Probit and logit give very similar predictions; choose based on theory or convention.

Poisson (count outcomes): E(y|x) = λ(x) = e^{xβ}. Assumes y follows a Poisson distribution with mean λ. Good for counts (0, 1, 2, ...) where variance might be small. If variance >> mean (overdispersion), use Negative Binomial instead.

Negative Binomial (overdispersed counts): Extends Poisson by allowing var(y) ≠ E(y). Uses an additional dispersion parameter α.

Example: Modeling whether a country transitions to democracy. You can't use OLS because (1) y is binary (0 or 1), and (2) OLS would predict probabilities outside [0, 1]. Logit constrains predictions to (0, 1) via the logistic function. You estimate P(transition) = Λ(β₀ + β₁·GDP + β₂·Education + ...), find the maximum likelihood estimates of β, and interpret as: a 1-unit increase in Education increases the log-odds of transition by β₂.

常见模型：

Logit（二元结果）： P(y=1|x) = Λ(xβ) = 1/(1 + e^{-xβ})。潜在变量解释：假设 y* = xβ + ε 其中 ε ~ Logistic，观测 y = 1 如果 y* > 0，否则 0。逻辑 CDF 变成 Λ。

Probit（二元结果）： P(y=1|x) = Φ(xβ)，其中 Φ 是标准正态 CDF。类似于 logit 但具有正态误差而不是逻辑误差。Probit 和 logit 给出非常相似的预测；根据理论或惯例选择。

Poisson（计数结果）： E(y|x) = λ(x) = e^{xβ}。假设 y 遵循具有平均值 λ 的 Poisson 分布。适合计数 (0, 1, 2, ...)，方差可能很小。如果方差 >> 平均值（过度离散），使用负二项分布。

负二项分布（过度离散计数）：通过允许 var(y) ≠ E(y) 扩展 Poisson。使用额外的离散参数 α。

例子：对民主转变进行建模。你不能使用 OLS，因为 (1) y 是二元的 (0 或 1)，(2) OLS 会预测超出 [0, 1] 的概率。Logit 通过逻辑函数约束预测为 (0, 1)。你估计 P(转变) = Λ(β₀ + β₁·GDP + β₂·教育 + ...)，找到 β 的最大似然估计，并解释为：教育中的 1 单位增加增加转变的对数赔率 β₂。

When to use: Always, when your outcome is not continuous. The functional form comes from theory (what is your outcome's true distribution?), not convenience. Report both coefficients and average marginal effects (the slope ∂P/∂x averaged over x), since nonlinear models make interpretation trickier than OLS.

何时使用：总是，当你的结果不是连续的。函数形式来自理论（你的结果的真实分布是什么？），而不是便利。报告系数和平均边际效应（斜率 ∂P/∂x 在 x 上平均化），因为非线性模型使解释比 OLS 更棘手。

Module 2: Context Conditionality

模块二：情境条件性

Effects are rarely uniform. Gender gaps in income vary by occupation. Democratic stability depends on income equality. The effect of competition on firm innovation differs by industry maturity. How do you model these conditional effects?

效应很少是一致的。收入性别差距因职业而异。民主稳定性取决于收入平等。竞争对公司创新的影响因行业成熟度而异。你如何建模这些条件效应？

Interactions

Linear Interaction Models

线性交互模型

The simplest way: add a term that multiplies two variables. If you think "the effect of X on Y depends on Z," you write:

y = β₀ + β₁x + β₂z + β₃(x × z) + ε

The marginal effect of X on Y is now: ∂y/∂x = β₁ + β₃z. It depends on z! When z increases, so does X's effect on Y (if β₃ > 0).

最简单的方式：加一个乘以两个变量的项。如果你认为"X 对 Y 的影响取决于 Z"，你写：

y = β₀ + β₁x + β₂z + β₃(x × z) + ε

X 对 Y 的边际效应现在是：∂y/∂x = β₁ + β₃z。它取决于 z！当 z 增加时，X 对 Y 的影响也增加（如果 β₃ > 0）。

Three Things You Must Report for Interactions:

The Marginal Effect: ∂y/∂x = β₁ + β₃z (this is what the effect actually is, not just β₁)
Its Standard Error: SE(∂y/∂x) — varies with z, so report it for meaningful values of z (e.g., mean, ±1 SD)
The Range of Significance: For what values of z is the effect statistically significant? (Solve: β₁ + β₃z ≈ 1.96·SE for the bounds)

Common Mistake: Looking at the coefficient β₃ on the interaction term alone is meaningless. β₃ tells you how the effect of X changes with Z, but you need β₁ too to understand the effect at baseline. Similarly, ignoring β₁ and β₂ (the "constitutive terms") leads to wrong interpretations.

Brambor, Clark & Golder (2006) Rules: (1) Always include the main terms (β₁x and β₂z); (2) Interpret effects using the marginal effect formula; (3) Plot the relationship for visual clarity; (4) Test whether the interaction is significant using a hypothesis test, not just looking at β₃'s p-value.

Central Bank Independence (CBI) and Inflation: Franzese's research shows that CBI reduces inflation, but this effect depends on exchange rate regime (fixed vs. floating). Under fixed exchange rates, CBI's inflation-control power is limited (because the exchange rate constrains monetary policy). Under floating rates, CBI can operate more freely. The interaction captures this: ∂Inflation/∂CBI = β₁ + β₃·(Regime), where β₃ > 0 means the effect is weaker under fixed regimes.

你必须为交互报告三件事：

边际效应： ∂y/∂x = β₁ + β₃z（这是效应实际上是什么，不仅仅是 β₁）
其标准误： SE(∂y/∂x) — 随 z 变化，所以为 z 的有意义值报告它（例如，均值、±1 SD）
显著性范围：对于 z 的哪些值，效应在统计上显著？（求解：β₁ + β₃z ≈ 1.96·SE 对于边界）

常见错误：单独查看交互项的系数 β₃ 毫无意义。β₃ 告诉你 X 的效应随 Z 如何变化，但你也需要 β₁ 来理解基线处的效应。同样，忽略 β₁ 和 β₂（"构成项"）导致错误解释。

Brambor、Clark & Golder (2006) 规则： (1) 总是包含主项 (β₁x 和 β₂z)；(2) 使用边际效应公式解释效应；(3) 为清晰起见绘制关系；(4) 使用假设检验测试交互是否显著，不仅仅是查看 β₃ 的 p 值。

央行独立性 (CBI) 和通胀：Franzese 的研究表明 CBI 降低通胀，但这个效应取决于汇率制度（固定 vs. 浮动）。在固定汇率下，CBI 的通胀控制力有限（因为汇率约束货币政策）。在浮动汇率下，CBI 可以更自由地运作。交互捕捉了这一点：∂通胀/∂CBI = β₁ + β₃·(制度)，其中 β₃ > 0 意味着在固定制度下效应更弱。

When to use: Theory suggests one variable modifies another's effect. Always center your variables (subtract the mean) to reduce multicollinearity and make the main effects interpretable. Report effects at meaningful values of the moderator (Z), not just at the regression table.

何时使用：理论提示一个变量改变另一个的效应。总是使变量居中（减去均值）以减少多重共线性并使主效应可解释。在主持变量 (Z) 的有意义值报告效应，而不仅仅是回归表。

Interactions

Nonlinear Interaction Models (NLS)

非线性交互模型 (NLS)

Linear interactions work great for continuous outcomes. But what if your model is inherently nonlinear? For instance, a demand function might be multiplicative: Q = A·P^β·I^γ (quantity depends on price and income). Or a production function: Y = A·K^α·L^β. You can't just multiply the variables — you need nonlinear least squares or maximum likelihood to estimate the exponents directly.

线性交互适用于连续结果。但如果你的模型本质上是非线性的呢？例如，需求函数可能是乘法的：Q = A·P^β·I^γ（数量取决于价格和收入）。或生产函数：Y = A·K^α·L^β。你不能只是乘以变量——你需要非线性最小二乘法或最大似然法来直接估计指数。

Additive vs. Non-Additive Error Structures: A linear model has an additive error: y = Xβ + ε. A nonlinear model might be multiplicative: y = f(X, β)·ε (errors scale with fitted values). Or purely nonlinear: y = f(X, β) + ε. The error structure matters for likelihood specification and standard errors.

How NLS Works: You specify a nonlinear function f(X, β) and minimize the sum of squared residuals: S(β) = Σ(yᵢ - f(Xᵢ, β))². Unlike OLS, there's no closed-form solution — you use iterative algorithms: Grid search + gradient descent (Newton-Raphson, BFGS). Start with multiple initial guesses because the objective function is not convex — you might get stuck in local optima.

Multiple Local Optima Problem: The function f(X, β) might have many combinations of β that fit the data nearly equally well. To handle this: (1) try many starting values; (2) plot the likelihood surface if possible; (3) use grid search to narrow the region, then refine; (4) check second-order conditions (Hessian) to verify you found a global optimum.

Cobb-Douglas Production Function: Y = A·K^α·L^β·ε, where Y is output, K is capital, L is labor. Log-linearizing: log Y = log A + α·log K + β·log L + log ε, which looks linear! But if you estimate it as linear OLS on log-transformed variables, you're implicitly assuming log ε ~ N(0, σ²). If ε ~ Lognormal, log ε is not normally distributed, and you should use MLE with the Cobb-Douglas likelihood instead. The nonlinear approach also lets you directly estimate A (scale), not just log A.

可加与非可加错误结构：线性模型有可加误差：y = Xβ + ε。非线性模型可能是乘法的：y = f(X, β)·ε（误差按拟合值缩放）。或纯非线性：y = f(X, β) + ε。错误结构对似然规范和标准误很重要。

NLS 工作原理：你指定非线性函数 f(X, β) 并最小化残差平方和：S(β) = Σ(yᵢ - f(Xᵢ, β))²。与 OLS 不同，没有闭式解——你使用迭代算法：网格搜索+梯度下降（Newton-Raphson、BFGS）。从多个初始猜测开始，因为目标函数不是凸的——你可能被困在局部最优值中。

多个局部最优值问题：函数 f(X, β) 可能有许多 β 组合几乎同样好地适应数据。处理方法：(1) 尝试许多起始值；(2) 如果可能，绘制似然表面；(3) 使用网格搜索缩小区域，然后细化；(4) 检查二阶条件（Hessian）以验证你找到了全局最优值。

Cobb-Douglas 生产函数：Y = A·K^α·L^β·ε，其中 Y 是产出，K 是资本，L 是劳动力。对数线性化：log Y = log A + α·log K + β·log L + log ε，看起来线性！但如果你在对数转换变量上估计为线性 OLS，你隐式假设 log ε ~ N(0, σ²)。如果 ε ~ Lognormal，log ε 不是正态分布，你应该使用具有 Cobb-Douglas 似然的 MLE。非线性方法也让你直接估计 A（规模），不仅仅是 log A。

When to use: Theory suggests a multiplicative or exponential form. Use NLS or MLE. Be careful: nonlinear models can have multiple local optima — try many starting values, use grid search to find a good initial region, and always verify you found a global optimum. For the Cobb-Douglas example, using MLE on the original (non-log) form ensures proper error specification.

何时使用：理论提示乘法或指数形式。使用 NLS 或 MLE。小心：非线性模型可能有多个局部最优值——尝试许多起始值，使用网格搜索找到一个好的初始区域，并始终验证你找到了全局最优值。对于 Cobb-Douglas 示例，在原始（非对数）形式上使用 MLE 确保适当的错误规范。

Multilevel

Multilevel & Random-Coefficient Models

多层次与随机系数模型

Students are nested in classrooms, classrooms in schools, schools in districts. Observations are not independent — students in the same classroom are more alike than students in different classrooms. Standard regression ignores this clustering. Multilevel models explicitly account for it, allowing both fixed effects (intercepts and slopes that differ by group) and random effects (intercepts and slopes that are drawn from a distribution).

Think of a random-intercept model as: each school has its own baseline achievement level, but the effect of teacher quality is the same across schools. A random-slope model: the effect of teacher quality varies by school (some schools benefit more than others).

学生嵌套在教室，教室嵌套在学校，学校嵌套在地区。观测值不独立——同一教室的学生比不同教室的学生更相似。标准回归忽视了这种聚集。多层次模型显式考虑了它，允许固定效应（因群体不同的截距和斜率）和随机效应（从分布中抽取的截距和斜率）。

把随机截距模型看作：每所学校都有自己的基线成就水平，但教师质量的影响在学校间相同。随机斜率模型：教师质量的影响因学校而异（某些学校受益更多）。

Intraclass Correlation Coefficient (ICC): How much variance is between groups vs. within groups? ICC = τ/(τ + σ²), where τ is between-group variance and σ² is within-group variance. If ICC = 0, observations are completely independent; if ICC = 1, all variation is between groups. Rule of thumb: If ICC > 0.05, you should use multilevel modeling. If ICC > 0.10, ignoring it leads to very misleading standard errors.

Random Intercept Model: yᵢⱼ = β₀ + u₀ⱼ + β₁xᵢⱼ + εᵢⱼ, where u₀ⱼ ~ N(0, τ) is the group-level deviation. Each group j has its own intercept (β₀ + u₀ⱼ), but the slope β₁ is the same across groups. Useful when groups differ in baseline but respond similarly to predictors.

Random Slope Model: yᵢⱼ = β₀ + u₀ⱼ + (β₁ + u₁ⱼ)xᵢⱼ + εᵢⱼ. Both intercept and slope vary by group. The slope u₁ⱼ captures group-specific responses to x. More complex but needed when theory suggests the effect of x differs by group.

Shrinkage/Partial Pooling: Multilevel models borrow information across groups. A small group (e.g., a school with only 5 students) gets its estimate "pulled" toward the grand mean, weighted by the ICC. More extreme estimates for small groups are down-weighted because they're likely just noise. This is more principled than listwise averaging within each group.

Students nested in schools: Suppose you have 50 schools, but one school has only 5 students while others have 200. In standard OLS per school, the small school's estimated mean could be quite noisy. In a multilevel model, the small school's estimate gets shrunk toward the overall mean, more weight placed on the larger schools' estimates, and uncertainty quantified properly. This is not "artificial" — it reflects genuine uncertainty.

类内相关系数 (ICC)：有多少方差在组之间 vs. 组内？ICC = τ/(τ + σ²)，其中 τ 是组间方差，σ² 是组内方差。如果 ICC = 0，观测完全独立；如果 ICC = 1，所有变异在组间。经验法则：如果 ICC > 0.05，你应该使用多层次建模。如果 ICC > 0.10，忽视它导致非常误导的标准误。

随机截距模型： yᵢⱼ = β₀ + u₀ⱼ + β₁xᵢⱼ + εᵢⱼ，其中 u₀ⱼ ~ N(0, τ) 是组级偏差。每个组 j 有其自己的截距 (β₀ + u₀ⱼ)，但斜率 β₁ 在组间相同。当组在基线上不同但对预测器的反应相似时很有用。

随机斜率模型： yᵢⱼ = β₀ + u₀ⱼ + (β₁ + u₁ⱼ)xᵢⱼ + εᵢⱼ。截距和斜率都按组变化。斜率 u₁ⱼ 捕捉组特定的对 x 的反应。更复杂但当理论提示 x 的效应因组而异时需要。

收缩/部分汇总：多层次模型在组间借用信息。一个小组（例如，只有 5 个学生的学校）其估计被"拉向"整体均值，按 ICC 加权。小组的更极端估计被降权，因为它们可能只是噪声。这比在每个组内列表平均更原则性。

学生嵌套在学校：假设你有 50 所学校，但一所学校只有 5 个学生，而其他学校有 200 个。在标准 OLS 每学校，小学校的估计均值可能相当有噪声。在多层次模型中，小学校的估计被收缩向整体均值，更多权重放在较大学校的估计上，不确定性被正确量化。这不是"人为的"——它反映了真实的不确定性。

When to use: Data is hierarchical (students in schools, firms in industries, regions in countries). Always compute ICC first. If ICC > 0.05, use multilevel modeling. Ignoring clustering leads to artificially small standard errors — your confidence intervals will be too narrow, and you'll find false positives. Multilevel models also allow you to ask: "How much do effects vary by group?" This is often the substantive question.

何时使用：数据是分层的（学校中的学生、行业中的公司、国家中的地区）。始终先计算 ICC。如果 ICC > 0.05，使用多层次建模。忽视聚集导致人为较小的标准误——你的置信区间将过于狭窄，你会找到假阳性。多层次模型也让你问："效应因组的变化有多大？"这通常是实质性问题。

Module 3: Dynamics & Dependence

模块三：动态与依赖

In time series data, the past affects the present. Inflation yesterday influences interest rates today. Unemployment last quarter shapes employment this quarter. Lagged dependent variables, autoregressive structures, and error-correction models let you capture these dynamics.

在时间序列数据中，过去影响现在。昨天的通胀影响今天的利率。上季度的失业塑造本季度的就业。滞后因变量、自回归结构和误差修正模型让你捕捉这些动态。

Temporal

Temporal Dynamic Models (LDV, ADL, ECM)

时间动态模型 (LDV、ADL、ECM)

Lagged Dependent Variable (LDV): The simplest approach — include y_(t-1) as a predictor of y_t. Captures persistence (some things continue because they were already happening). But LDV models are controversial: the lagged y is endogenous if errors are correlated over time, requiring IV estimation.

Autoregressive Distributed Lag (ADL): Include both lagged values of the dependent variable AND lagged values of regressors: y_t = β₀ + β₁y_(t-1) + γ₁x_t + γ₂x_(t-1) + ε_t. Richer dynamics — how long does an X shock take to affect Y? Compute impulse responses by simulating the system forward.

Error Correction Model (ECM): If X and Y are cointegrated (they share a long-run equilibrium relationship), an ECM lets you model both short-run adjustment and long-run equilibrium. Variables that drift together will eventually correct back toward their equilibrium.

滞后因变量 (LDV)：最简单的方法——包含 y_(t-1) 作为 y_t 的预测器。捕捉持久性（某些事情继续因为它们已经在发生）。但 LDV 模型是有争议的：如果误差随时间相关，滞后的 y 是内生的，需要 IV 估计。

自回归分布滞后 (ADL)：同时包含因变量的滞后值和回归变量的滞后值：y_t = β₀ + β₁y_(t-1) + γ₁x_t + γ₂x_(t-1) + ε_t。更丰富的动态——X 冲击需要多长时间才能影响 Y？通过向前模拟系统计算脉冲响应。

误差修正模型 (ECM)：如果 X 和 Y 协整（它们共享长期均衡关系），ECM 让你同时建模短期调整和长期均衡。一起漂移的变量最终会向其均衡调整回来。

Unit Roots & Stationarity: A series is stationary if it has a constant mean and variance, and autocorrelations decay over time. A unit-root process (I(1)) is nonstationary — it's a random walk where shocks permanently affect the level. Spurious Regression Problem: If you regress two independent random walks on each other, you'll find a "significant" relationship just by chance. The solution: test for unit roots (Augmented Dickey-Fuller test), difference the data if nonstationary, or use cointegration methods.

Cointegration: Two I(1) series are cointegrated if some linear combination of them is I(0). Think of it as: the variables wander together like "a drunk and her dog" — they don't move independently, but they don't have a fixed relationship either. An ECM models this relationship.

Long-Run Multiplier in ADL: In an ADL(1,1): y_t = β₀ + β₁y_(t-1) + γ₁x_t + γ₂x_(t-1) + ε_t, the long-run effect of X on Y is the "long-run multiplier": (γ₁ + γ₂)/(1 - β₁). This is the total impact after all adjustments have played out. The immediate impact is γ₁, but future lags add γ₂, and the lagged y_(t-1) perpetuates the effect dynamically.

Government spending and GDP over time: A one-time increase in spending increases GDP immediately (impact multiplier). But higher GDP increases tax revenue and future spending (feedback), which increases GDP further (dynamic multiplier). Over time, prices rise and the effect diminishes (long-run equilibrium). An ADL or ECM captures all three: short-run shock, dynamic feedback, and long-run equilibrium.

单位根与平稳性：如果序列有常数均值和方差，自相关随时间衰减，则该序列是平稳的。单位根过程 (I(1)) 是非平稳的——这是一个随机游走，其中冲击永久影响水平。虚假回归问题：如果你对两个独立的随机游走进行回归，你只会凭机会找到"显著"关系。解决方案：测试单位根（增强 Dickey-Fuller 检验），如果非平稳差分数据，或使用协整方法。

协整：两个 I(1) 序列是协整的，如果它们的某个线性组合是 I(0)。把它看作：变量一起漂移，像"一个醉汉和她的狗"——它们不独立移动，但也没有固定关系。ECM 建模这个关系。

ADL 中的长期乘数：在 ADL(1,1)：y_t = β₀ + β₁y_(t-1) + γ₁x_t + γ₂x_(t-1) + ε_t 中，X 对 Y 的长期效应是"长期乘数"：(γ₁ + γ₂)/(1 - β₁)。这是所有调整都已发生后的总体影响。立即影响是 γ₁，但未来滞后添加 γ₂，滞后 y_(t-1) 动态地延续效应。

政府支出和 GDP 随时间：支出的一次性增加立即增加 GDP（冲击乘数）。但更高的 GDP 增加税收和未来支出（反馈），进一步增加 GDP（动态乘数）。随时间，价格上升，效应减少（长期均衡）。ADL 或 ECM 捕捉所有三个：短期冲击、动态反馈和长期均衡。

When to use: Time series or panel data with temporal ordering. Always check for stationarity using ADF test. If nonstationary (unit root), difference the data or test for cointegration. For ADL, compute impulse responses and long-run multipliers, not just coefficients. For ECM, establish cointegration first (Johansen test) before estimating.

何时使用：具有时间顺序的时间序列或面板数据。始终使用 ADF 检验检查平稳性。如果非平稳（单位根），差分数据或测试协整。对于 ADL，计算脉冲响应和长期乘数，不仅仅是系数。对于 ECM，在估计前先建立协整（Johansen 检验）。

Spatial

Spatial & Spatiotemporal Models (SAR, STAR, STADL)

空间与时空模型 (SAR、STAR、STADL)

Just as past events affect the present, nearby places affect each other. Countries' trade policies influence neighbors. Riots spread geographically. Pollution drifts across borders. Spatial econometrics lets you model these spillovers.

If you ignore spatial dependence, you're assuming Idaho's economy is unaffected by Washington's — false. Neighbors' outcomes should enter your equation, just like lagged values do for time series.

就如过去事件影响现在一样，附近的地方相互影响。国家的贸易政策影响邻近国家。骚乱地理上传播。污染跨越边界漂移。空间计量经济学让你建模这些溢出效应。

如果你忽视空间依赖，你就假设爱达荷州的经济不受华盛顿州的影响——这是假的。邻近的结果应该进入你的方程，就像滞后值对时间序列那样。

The Spatial Weight Matrix W: Define which units are "neighbors." Common approaches: (1) Contiguity: W_ij = 1 if i and j share a border, 0 otherwise; (2) Inverse distance: W_ij = 1/distance_ij, down-weighting far units; (3) k-nearest neighbors: connect each unit to its k nearest neighbors. Row-standardize W so each row sums to 1 for interpretability.

Spatial Autoregressive Model (SAR): y = ρWy + Xβ + ε. The outcome y depends on X (as in OLS), but also on neighbors' outcomes Wy, scaled by ρ (spatial coefficient). If ρ = 0, no spatial dependence. If ρ > 0, neighbors' high outcomes pull up your outcome. This is like a spatial analog of a lagged-y model in time series.

Spatial Error Model (SEM): y = Xβ + ε, where ε = λWε + u. The errors (not the outcomes themselves) are spatially correlated. This arises when unmeasured confounders are spatially clustered. Less commonly used than SAR but sometimes more appropriate theoretically.

Moran's I Test: Detects spatial autocorrelation in the residuals. I = (n/(W sum)) · (ε'Wε / ε'ε). If I is significantly different from 0, you have spatial dependence. Always run this test on OLS residuals to check whether spatial methods are needed.

Spatiotemporal Models (STAR, STADL): Combine both time and space: y_it = ρ₁W_sy_(t-1) + ρ₂ y_i(t-1) + Xβ + ε. This says: your outcome depends on neighbors last period, your own lag last period, and contemporary X. More complex but captures both spatial spillovers and temporal dynamics simultaneously.

Tax competition among states: Do neighboring states' tax rates affect your state's tax rate? Build W as contiguity (neighbors = states sharing a border). Estimate SAR with y = state tax rate. Find ρ > 0: yes, states undercut neighbors' tax rates (tax competition). STAR extends this: past neighbors' tax cuts also affect your current rate (tax competition dynamics).

空间权重矩阵 W：定义哪些单位是"邻近"。常见方法：(1) 邻接性：W_ij = 1 如果 i 和 j 共享边界，否则 0；(2) 逆距离：W_ij = 1/distance_ij，降低远单位的权重；(3) k-最近邻：连接每个单位到其 k 个最近邻。行标准化 W 使每行求和为 1，便于解释。

空间自回归模型 (SAR)： y = ρWy + Xβ + ε。结果 y 依赖于 X（如 OLS），但也依赖于邻近结果 Wy，按 ρ（空间系数）缩放。如果 ρ = 0，无空间依赖。如果 ρ > 0，邻近高结果拉高你的结果。这就像时间序列中滞后 y 模型的空间类似物。

空间误差模型 (SEM)： y = Xβ + ε，其中 ε = λWε + u。误差（不是结果本身）是空间相关的。这发生在未测量的混淆因子空间聚集时。比 SAR 使用较少，但有时在理论上更合适。

Moran 的 I 检验：检测残差中的空间自相关。I = (n/(W sum)) · (ε'Wε / ε'ε)。如果 I 显著不同于 0，你有空间依赖。始终在 OLS 残差上运行此检验以检查是否需要空间方法。

时空模型 (STAR、STADL)：同时结合时间和空间：y_it = ρ₁W_sy_(t-1) + ρ₂ y_i(t-1) + Xβ + ε。这说：你的结果取决于上一期邻近，上一期你自己的滞后，和当代 X。更复杂但同时捕捉空间溢出和时间动态。

州之间的税收竞争：邻近州的税率是否影响你的州的税率？将 W 构建为邻接性（邻近 = 共享边界的州）。用 y = 州税率估计 SAR。找到 ρ > 0：是的，州下调邻近的税率（税收竞争）。STAR 扩展这一点：过去邻近的税率削减也影响你当前的税率（税收竞争动态）。

When to use: Cross-sectional or panel data with geographic units (countries, regions, cities). Build the spatial weight matrix W thoughtfully (contiguity, inverse distance, or k-NN). Test for spatial autocorrelation in OLS residuals (Moran's I). If present, estimate SAR or SEM. For panel data, use STAR or STADL to capture both spatial and temporal spillovers simultaneously.

何时使用：带有地理单位（国家、地区、城市）的截面或面板数据。深思熟虑地构建空间权重矩阵 W（邻接性、逆距离或 k-NN）。测试 OLS 残差中的空间自相关（Moran 的 I）。如果存在，估计 SAR 或 SEM。对于面板数据，使用 STAR 或 STADL 同时捕捉空间和时间溢出。

Module 4: Endogeneity & Causal Estimation

模块四：内生性与因果估计

The deepest challenge: when X and Y cause each other (or are both caused by an unmeasured confounder), regular regression breaks down. Your coefficient estimates are biased and potentially meaningless. This is where systems of equations, instrumental variables, and the concept of causal multipliers come in.

最深层的挑战：当 X 和 Y 相互影响（或都由未测量的混淆因子引起）时，常规回归崩溃。你的系数估计有偏且可能无意义。这是方程组系统、工具变量和因果乘数概念发挥作用的地方。

Endogeneity

Systems & IV Estimation (2SLS, 3SLS, GMM)

系统与 IV 估计 (2SLS、3SLS、GMM)

The Problem: If X and Y cause each other simultaneously, OLS is biased. Example: higher wages might increase education demand (Y on X), but more education also increases wages (X on Y). Which direction is the effect? Without external information, you can't identify them separately (as Challenge 04 showed).

The Solution: Instrumental Variables. Find a variable Z that affects X but does NOT directly affect Y (an "instrument"). Then use Z to predict X, and use those predictions (not X itself) in your regression. This breaks the simultaneity because Z affects Y only through X.

Two Conditions for a Valid Instrument: (1) Relevance: Cov(Z, X) ≠ 0 (Z must be correlated with X). (2) Exclusion Restriction: Cov(Z, ε) = 0 (Z must be uncorrelated with the error in the Y equation, i.e., affect Y only through X).

Two-Stage Least Squares (2SLS): Stage 1: regress X on Z to get predicted values X̂ = Z(Z'Z)⁻¹Z'X. Stage 2: regress Y on X̂ to get β̂₂SLS = (X̂'X̂)⁻¹X̂'y. The coefficient on X̂ is consistent and asymptotically normal under the two conditions above.

Weak Instruments Problem: If Z is weakly correlated with X (low R² in first stage), the 2SLS estimates become very noisy. Stock & Yogo recommend: if first-stage F < 10, your instruments are "weak" and inference can be severely distorted. Report first-stage F-statistics always.

Over-identification Test (Sargan/Hansen J-test): If you have more instruments than endogenous variables (over-identified), you can test whether the excluded instruments are truly exogenous. This tests the exclusion restrictions. If J-test rejects, some instruments may be invalid.

Three-Stage Least Squares (3SLS): When you have multiple equations simultaneously, 3SLS estimates them jointly, accounting for correlations in errors across equations. More efficient than estimating each equation by 2SLS separately.

Generalized Method of Moments (GMM): A more flexible framework that works with general moment conditions E[g(y, X, θ)] = 0. Handles endogeneity, dynamics (lagged dependent variables), and weak instruments better than 2SLS. Requires choosing how to weight different moment conditions (via weighting matrix).

Do literacy programs reduce poverty? The endogeneity problem: poor countries invest less in literacy, but literacy also reduces poverty (X ← Y and Y ← X). Solution: Find Z = "legacy of colonial education policy" which affected historical literacy investments, persisting to today, but doesn't directly affect current poverty except through current literacy. Use 2SLS: Stage 1: literacy = f(colonial legacy, other controls). Stage 2: poverty = f(predicted literacy, other controls). Report first-stage F. If F < 10, colonial legacy is a weak instrument — estimates unreliable.

问题：如果 X 和 Y 同时相互影响，OLS 有偏。例子：更高工资可能增加对教育的需求（Y 在 X 上），但更多教育也增加工资（X 在 Y 上）。哪个方向是效应？没有外部信息，你不能单独识别它们（如挑战 04 所示）。

解决方案：工具变量。找一个变量 Z 影响 X 但不直接影响 Y（一个"工具"）。然后用 Z 预测 X，用这些预测（不是 X 本身）在你的回归中。这打破了同时性，因为 Z 只通过 X 影响 Y。

有效工具的两个条件： (1) 相关性：Cov(Z, X) ≠ 0（Z 必须与 X 相关）。(2) 排除限制：Cov(Z, ε) = 0（Z 必须与 Y 方程的误差不相关，即仅通过 X 影响 Y）。

两阶段最小二乘法 (2SLS)：阶段 1：对 Z 回归 X 得到预测值 X̂ = Z(Z'Z)⁻¹Z'X。阶段 2：对 X̂ 回归 Y 得到 β̂₂SLS = (X̂'X̂)⁻¹X̂'y。X̂ 上的系数在上述两个条件下是一致的且渐近正态的。

弱工具问题：如果 Z 与 X 的相关性弱（第一阶段 R² 低），2SLS 估计变得非常噪声。Stock & Yogo 建议：如果第一阶段 F < 10，你的工具是"弱"的，推断可能严重扭曲。总是报告第一阶段 F 统计量。

过度识别检验（Sargan/Hansen J 检验）：如果你有比内生变量更多的工具（过度识别），你可以测试被排除的工具是否真正外生。这测试排除限制。如果 J 检验拒绝，某些工具可能无效。

三阶段最小二乘法 (3SLS)：当你有多个方程同时时，3SLS 联合估计它们，考虑方程间误差中的相关性。比单独为每个方程估计 2SLS 更有效。

广义矩量法 (GMM)：一个更灵活的框架，使用一般矩条件 E[g(y, X, θ)] = 0。比 2SLS 更好地处理内生性、动态性（滞后因变量）和弱工具。需要选择如何加权不同的矩条件（通过加权矩阵）。

扫盲项目是否减少贫困？内生性问题：贫困国家投资较少于扫盲，但扫盲也减少贫困（X ← Y 和 Y ← X）。解决方案：找到 Z =「殖民地教育政策遗留」影响历史扫盲投资，持续到今天，但不直接影响当前贫困，除非通过当前扫盲。使用 2SLS：阶段 1：扫盲 = f(殖民地遗留、其他控制)。阶段 2：贫困 = f(预测扫盲、其他控制)。报告第一阶段 F。如果 F < 10，殖民地遗留是弱工具——估计不可靠。

When to use: When you suspect endogeneity (simultaneity, reverse causality, or unmeasured confounders). Finding good instruments is hard — they must be relevant (F > 10) and exogenous (exclusion restriction). Always report first-stage F-statistics, test for weak instruments (Stock & Yogo), and run over-identification tests (if over-identified). Document why you believe your instruments satisfy the exclusion restriction.

何时使用：当你怀疑内生性（同时性、反向因果或未测量的混淆因子）时。找好的工具很难——它们必须是相关的（F > 10）和外生的（排除限制）。总是报告第一阶段 F 统计量，测试弱工具（Stock & Yogo），并运行过度识别检验（如果过度识别）。记录为什么你相信你的工具满足排除限制。

Systems

Vector Autoregression (VAR)

向量自回归 (VAR)

When you have multiple time series that cause each other, VAR is ideal. Instead of trying to figure out which variable causes which, VAR treats all variables symmetrically: each variable is regressed on its own lags and the lags of all other variables. The system captures feedback loops.

Imagine GDP, inflation, and unemployment. They all affect each other over time. Rather than argue about causality, VAR asks: "How do these series move together? When inflation spikes, what happens to GDP and unemployment in the next quarter?"

当你有多个相互影响的时间序列时，VAR 是理想的。与其试图弄清楚哪个变量导致哪个，VAR 对所有变量平等对待：每个变量对其自身滞后和所有其他变量的滞后回归。该系统捕捉反馈环。

想象 GDP、通胀和失业。他们都随时间相互影响。与其争论因果性，VAR 问："这些序列如何一起移动？通胀飙升时，下一季度 GDP 和失业会发生什么？"

VAR Specification: With M variables and p lags, the reduced-form VAR(p) is: Y_t = A₁Y_(t-1) + A₂Y_(t-2) + ... + ApY_(t-p) + ε_t, where Y_t is M×1, each A_i is M×M, and ε_t ~ N(0, Σ). Estimate each equation by OLS (feasible because regressors are the same across equations).

Impulse Response Functions (IRFs): Trace the effect of a one-unit shock to one variable through time. Example: Federal Reserve raises interest rates by 100 basis points. How does GDP respond? Inflation? Unemployment? IRFs show the time path of responses over the next 8-12 quarters. Computed by the Moving Average representation of the VAR.

Forecast Error Variance Decomposition (FEVD): At a given horizon (e.g., 8 quarters ahead), how much of the forecast error for GDP is explained by shocks to interest rates vs. other variables? FEVD tells you the relative importance of different shocks for forecasting each variable.

Granger Causality: X "Granger-causes" Y if past values of X help predict Y better than Y's own past alone. Test: regress Y on its own lags, then add lags of X and test if they're jointly significant (F-test). Note: Granger causality is predictive precedence, not true causality — it's controversial because reverse causality can produce spurious Granger causality.

Structural VAR (SVAR) & Identification: The reduced-form VAR has contemporaneous correlations in ε_t (errors across equations are correlated). To interpret IRFs as causal, you need to identify the structural shocks. Common approach: Choleski decomposition — assume a recursive causal structure among variables (e.g., Fed doesn't respond to current inflation, only lagged inflation; inflation doesn't respond to current unemployment, only lagged). This orders the variables and identifies the shocks.

Central bank policy transmission: 3-variable VAR with interest rates (r), inflation (π), and unemployment (u). Reduced-form VAR captures correlations. Choleski ordering: assume r leads π and u (Fed sets rates based on lagged info), π doesn't respond contemporaneously to u. IRFs show: 1% rate increase → inflation falls over 2-4 quarters, unemployment rises after lag. Total effects visible through all feedback channels.

VAR 规范：有 M 个变量和 p 个滞后，约化形式 VAR(p) 是：Y_t = A₁Y_(t-1) + A₂Y_(t-2) + ... + ApY_(t-p) + ε_t，其中 Y_t 是 M×1，每个 A_i 是 M×M，ε_t ~ N(0, Σ)。通过 OLS 估计每个方程（可行因为回归变量在方程间相同）。

脉冲反应函数 (IRF)：追踪一个变量的一单位冲击通过时间的效应。例子：联邦储备加息 100 个基点。GDP 如何反应？通胀？失业？IRF 显示未来 8-12 季度的响应时间路径。通过 VAR 的动态平均表示计算。

预测误差方差分解 (FEVD)：在给定地平线上（例如，8 季度前），GDP 预测误差有多少由利率冲击解释 vs. 其他变量？FEVD 告诉你不同冲击对预测每个变量的相对重要性。

Granger 因果性：X "Granger 导致"Y 如果 X 的过去值比 Y 仅自身过去更好地预测 Y。测试：对 Y 仅其自身滞后回归，然后添加 X 的滞后并测试它们是否联合显著（F 检验）。注意：Granger 因果是预测优先级，不是真正的因果——有争议因为反向因果可能产生虚假的 Granger 因果。

结构 VAR (SVAR) 和识别：约化形式 VAR 在 ε_t 中有当代相关性（方程间的误差相关）。要解释 IRF 为因果，你需要识别结构冲击。常见方法：Choleski 分解——假设变量间的递归因果结构（例如，联邦不对当前通胀反应，仅对滞后通胀；通胀不对当前失业反应，仅对滞后失业）。这排序变量并识别冲击。

央行政策传导：3 变量 VAR，含利率 (r)、通胀 (π) 和失业 (u)。约化形式 VAR 捕捉相关性。Choleski 排序：假设 r 导致 π 和 u（联邦基于滞后信息设置利率），π 不对 u 当代反应。IRF 显示：1% 利率增加 → 通胀在 2-4 季度下降，失业在滞后后上升。通过所有反馈渠道可见总效应。

When to use: Multiple time series with feedback. VAR requires stationarity (difference your series if they have unit roots). Choose lag order (p) using information criteria (AIC, BIC). Estimate by OLS, compute IRFs with confidence bands, FEVD, and Granger causality tests. For structural interpretation, use SVAR with Choleski or sign restrictions for identification. Interpret IRFs as illustrative of system dynamics, not as definitive causal claims.

何时使用：具有反馈的多个时间序列。VAR 需要平稳性（如果有单位根差分你的序列）。使用信息标准 (AIC, BIC) 选择滞后阶数 (p)。通过 OLS 估计，计算带置信区间的 IRF、FEVD 和 Granger 因果检验。为了结构解释，使用带 Choleski 或符号限制的 SVAR 来识别。将 IRF 解释为系统动态的说明性，而不是明确的因果声明。

Systems

The Systems Multiplier & Causal Responses

系统乘数与因果反应

Here's a crucial insight: coefficients are impulses, not responses. In a system where variables cause each other, the full causal response includes feedback through the system.

Imagine a Keynesian system: I (investment) causes Y (output), and Y causes C (consumption). If I increases by 1 dollar:

Directly: Y increases by some amount (say, 1.5 times the investment — the multiplier)
Indirectly: Higher Y causes higher C, which causes even higher Y (feedback loop)
The ultimate response is larger than the direct effect

这里有一个关键的见解：系数是冲击，不是反应。在变量相互影响的系统中，完全因果反应包括通过系统的反馈。

想象一个凯恩斯制系统：I（投资）导致 Y（产出），Y 导致 C（消费）。如果 I 增加 1 美元：

直接：Y 增加某个数量（比如投资的 1.5 倍——乘数）
间接：更高的 Y 导致更高的 C，这导致更高的 Y（反馈环）
最终反应大于直接效应

The Multiplier Matrix Formula: Consider a simultaneous system: Ay = x, where A is M×M matrix of contemporaneous coefficients, y is the M×1 endogenous variables, x is the exogenous shocks. The solution is:

y = A⁻¹ x

The matrix A⁻¹ is the "impact multiplier" — it converts exogenous shocks into equilibrium responses. For dynamic systems with lags, you must compute (I - A₁L - A₂L² - ...)⁻¹ where L is the lag operator, then evaluate at different horizons to get impact, interim, and long-run multipliers.

Why Not Just Look at Coefficients? Suppose in a trade model, when China increases tariffs by 1%, US tariffs also increase (retaliation). But that 1% increase in US tariffs affects US GDP, which affects demand for imports, which feeds back to China. The total US GDP response is much larger than any single coefficient. You must trace all these feedback loops.

Distinction: Impact vs. Long-Run Multipliers: Impact multiplier: immediate effect (same period). Interim multiplier: cumulative effect over k periods. Long-run multiplier: final equilibrium after all adjustments. Example: fiscal stimulus → immediate GDP boost, but inflation builds → interest rates rise → private investment crowded out → long-run boost smaller than impact.

Static vs. Dynamic Multipliers: A static multiplier compares two equilibria (before vs. after tariff change). A dynamic multiplier traces the time path of adjustment (quarter-by-quarter). For policy analysis, you usually want the dynamic multiplier — how does the economy actually transition, not just where it ends up?

Δy = (I - A)⁻¹ Δt, or in dynamics: y(t) = (I - A₁L)⁻¹ ε(t)

where A is the matrix of direct effects and (I - A)⁻¹ is the "multiplier matrix" — it translates impulses into full responses through the system.

乘数矩阵公式：考虑一个同时系统：Ay = x，其中 A 是 M×M 当代系数矩阵，y 是 M×1 内生变量，x 是外生冲击。解是：

y = A⁻¹ x

矩阵 A⁻¹ 是"冲击乘数"——它将外生冲击转化为均衡反应。对于具有滞后的动态系统，你必须计算 (I - A₁L - A₂L² - ...)⁻¹，其中 L 是滞后算子，然后在不同地平线评估以获得冲击、临时和长期乘数。

为什么不仅查看系数？假设在贸易模型中，当中国将关税增加 1% 时，美国关税也增加（报复）。但美国关税的 1% 增加影响美国 GDP，这影响进口需求，反馈给中国。美国 GDP 的总反应比任何单一系数大得多。你必须追踪所有这些反馈环。

区分：冲击 vs. 长期乘数：冲击乘数：立即效应（同一时期）。临时乘数：k 个时期内的累积效应。长期乘数：所有调整后的最终均衡。例子：财政刺激 → 立即 GDP 提升，但通胀建立 → 利率上升 → 私人投资挤出 → 长期提升小于冲击。

静态 vs. 动态乘数：静态乘数比较两个均衡（前 vs. 后关税变化）。动态乘数追踪调整的时间路径（逐季度）。对于政策分析，你通常想要动态乘数——经济如何实际过渡，而不仅仅是最终位置。

Δy = (I - A)⁻¹ Δt，或在动态中：y(t) = (I - A₁L)⁻¹ ε(t)

其中 A 是直接效应矩阵，(I - A)⁻¹ 是"乘数矩阵"——它通过系统将冲击转化为完整反应。

Key lesson: If you want to know how Y responds to X in a system where variables cause each other, you can't just look at the coefficient on X. You must (1) specify all relevant equations (theory guides this); (2) solve the full system; (3) account for equilibrium restrictions; (4) compute impact, interim, and long-run multipliers. This is where theory becomes essential — it tells you which equations belong in the system and what restrictions the model satisfies.

关键教训：如果你想知道在变量相互影响的系统中 Y 如何对 X 反应，你不能仅查看 X 的系数。你必须 (1) 指定所有相关方程（理论指导这一点）；(2) 求解完整系统；(3) 说明均衡限制；(4) 计算冲击、临时和长期乘数。这是理论变得至关重要的地方——它告诉你哪些方程属于系统以及模型满足什么限制。

Software & Resources

软件与资源

This course provides code in Stata and R. Both are powerful, but have different strengths. Python is growing for causal inference work. Choose the one that fits your research community.

这门课程提供 Stata 和 R 代码。两者都很强大，但优势不同。Python 在因果推断工作中增长。选择适合你的研究社区的那个。

Stata

Dominant in political science, economics, sociology. Excellent for regression, time series, and spatial models. Large community, many user-written packages. Point-and-click interface or do-file programming.

在政治学、经济学、社会学中占主导。适合回归、时间序列和空间模型。大社区、许多用户编写的包。点击界面或 do 文件编程。

R

Free, open-source. Incredible packages for multilevel models (lme4), spatial econometrics (sf, spdep), and time series (forecast, vars). Growing in popularity. Steeper learning curve but more flexibility.

免费、开源。多层次模型 (lme4)、空间计量经济学 (sf、spdep) 和时间序列 (forecast、vars) 的不可思议的包。日益流行。学习曲线更陡但更灵活。

Python

Growing for causal inference (causality, DoWhy, EconML). Excellent for machine learning + causal work. Integration with deep learning frameworks. Most flexible but requires more coding.

在因果推断（因果关系、DoWhy、EconML）中增长。适合机器学习+因果工作。与深度学习框架的集成。最灵活但需要更多编码。

Stata Commands Quick Reference

Stata 命令快速参考

reg y x1 x2 x3 — OLS regression
reg y x1 x2 x3, robust — OLS with robust standard errors
xtreg y x1 x2, fe — Fixed effects (within estimator)
xtreg y x1 x2, re — Random effects (GLS)
logit y x1 x2 — Logit for binary outcomes
ivregress 2sls y (x1 = z1 z2) x2 x3 — 2SLS with instruments
var y1 y2 y3, lags(2) — VAR with 2 lags
spreg y x1 x2, id(id) model(ols) moran — Spatial regression with Moran's I test

reg y x1 x2 x3 — OLS 回归
reg y x1 x2 x3, robust — OLS 带稳健标准误
xtreg y x1 x2, fe — 固定效应（内估计量）
xtreg y x1 x2, re — 随机效应 (GLS)
logit y x1 x2 — 二元结果的 Logit
ivregress 2sls y (x1 = z1 z2) x2 x3 — 2SLS 带工具
var y1 y2 y3, lags(2) — VAR 2 个滞后
spreg y x1 x2, id(id) model(ols) moran — 空间回归与 Moran 的 I 检验

R Packages & Functions

R 包和函数

lme4: lmer(y ~ x1 + (1|group)) — Multilevel models with random intercepts
sandwich: vcovHC(model) — Robust standard errors
AER: ivreg(y ~ x1 | z1) — Instrumental variables (2SLS)
spdep: lagsarlm(y ~ x1 + x2) — Spatial autoregressive models
vars: VAR(data, p=2) — Vector autoregression
forecast: auto.arima(y) — ARIMA for univariate time series
ggplot2: Visualization of results and marginal effects

lme4： lmer(y ~ x1 + (1|group)) — 多层次模型带随机截距
sandwich： vcovHC(model) — 稳健标准误
AER： ivreg(y ~ x1 | z1) — 工具变量 (2SLS)
spdep： lagsarlm(y ~ x1 + x2) — 空间自回归模型
vars： VAR(data, p=2) — 向量自回归
forecast： auto.arima(y) — 单变量时间序列的 ARIMA
ggplot2：结果和边际效应的可视化

Key References

关键参考文献

Franzese, R. J., & Kam, C. D. (2009). Modeling and Interpreting Interactive Hypotheses in Regression Analysis. University of Michigan Press. [The definitive text on interaction models in social science]
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. (2nd ed.) MIT Press. [Comprehensive reference for GLS, multilevel, fixed effects, 2SLS]
Brambor, T., Clark, W. R., & Golder, M. (2006). Understanding Interaction Models. Political Analysis, 14, 63–82. [Essential reading on interpreting interactions; establishes the three reporting rules]
Keele, L., Tingley, D., & Yamamoto, T. (2015). The Causal Interpretation of Estimated Associations in Regression Models. Political Science Research and Methods, 3, 550–563. [On the gap between coefficients and causal responses; connects to Mode III vs IV]
Clarke, K. A., & Primo, D. M. (2012). A Model Discipline: Political Science and the Logic of Representations. Oxford University Press. [On theory-guided empirical specification and the EMTI framework]
Stock, J. H., & Yogo, M. (2005). Testing for Weak Instruments in Linear IV Regression. In Identification and Inference for Econometric Models. [Essential for evaluating instrument strength; establishes F > 10 threshold]

Franzese, R. J., & Kam, C. D. (2009). Modeling and Interpreting Interactive Hypotheses in Regression Analysis. University of Michigan Press. [社会科学交互模型的决定性文本]
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. (2nd ed.) MIT Press. [GLS、多层次、固定效应、2SLS 的综合参考]
Brambor, T., Clark, W. R., & Golder, M. (2006). Understanding Interaction Models. Political Analysis, 14, 63–82. [解释交互的必读；建立三个报告规则]
Keele, L., Tingley, D., & Yamamoto, T. (2015). The Causal Interpretation of Estimated Associations in Regression Models. Political Science Research and Methods, 3, 550–563. [关于系数与因果反应之间的差距；连接模式三与四]
Clarke, K. A., & Primo, D. M. (2012). A Model Discipline: Political Science and the Logic of Representations. Oxford University Press. [关于理论指导的经验规范和 EMTI 框架]
Stock, J. H., & Yogo, M. (2005). Testing for Weak Instruments in Linear IV Regression. In Identification and Inference for Econometric Models. [评估工具强度的必要；建立 F > 10 阈值]