Machine Learning
Notes from ECE 5307, The Ohio State University · Video Lectures · Prof. Schniter
Machine learning is the science of getting computers to learn patterns from data, rather than being explicitly programmed with rules. Instead of telling the computer "if the email contains 'free money', mark it as spam," we show it thousands of emails labeled as spam or not-spam, and it figures out the rules on its own.
This knowledge tree organizes the core concepts into four questions: What problem are you solving? (choosing the right model), How do you train it? (optimization & evaluation), How do neural networks work? (deep learning), and What math do you need? (foundations). Click any node to explore.
New to machine learning? Don't worry — this course starts from scratch. "Machine learning" sounds fancy, but at its core, it's about finding patterns in data using math you mostly already know (linear algebra, calculus, probability). The tree below builds up from simple models (linear regression) to complex ones (neural networks). Prerequisites: comfort with basic calculus (derivatives) and matrix notation helps, but each concept includes intuitive analogies. Software: Python with scikit-learn or PyTorch is standard, though the concepts are language-agnostic.
Regression problems are about predicting a number — like a house price, temperature, or exam score. Unlike classification (picking a label), regression outputs a continuous value.
The most fundamental regression model. The core idea is simple: assume a straight-line relationship between output y and input x (y = wx + b), then find the line that minimizes prediction error.
How? Using Least Squares — sum up the squared differences between predicted and actual values, then find the w and b that minimize this total:
Taking the derivative of J(θ) and setting it to zero yields a clean closed-form solution (Normal Equation):
Note: this requires XᵀX to be invertible. When there are more features than samples (d > n), the matrix is singular; even when it is invertible, the O(d³) cost of solving the normal equations is slow for large d, so gradient descent is more practical there.
With many features, the model may overfit (performs well on training data but poorly on new data). Regularization constrains model complexity:
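The Normal Equation above is easy to verify numerically. A minimal numpy sketch on synthetic data (the line y = 2x + 1 plus small noise; the data and names here are illustrative, not from the course):

```python
import numpy as np

# Synthetic data: y = 2x + 1 plus a little Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=50)

# Design matrix with a bias column of ones
X = np.column_stack([np.ones_like(x), x])

# Normal Equation: theta = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over forming an explicit inverse)
theta = np.linalg.solve(X.T @ X, X.T @ y)
b, w = theta   # recovered intercept and slope
```

With this little noise the recovered slope and intercept land very close to the true (2, 1).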
Many real relationships are curves, not straight lines. Polynomial regression constructs a feature mapping:
Then fits y = θᵀφ(x) using linear regression. It's still linear in parameters θ — just with transformed inputs.
A key issue: a degree-d polynomial has d+1 parameters; too many degrees of freedom when d is large. Bias-Variance Tradeoff: d too small → underfitting (high bias), d too large → overfitting (high variance). Use cross-validation to select optimal d.
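The "linear in θ, nonlinear in x" point can be shown in a few lines: build φ(x) = [1, x, x², ...] and solve with ordinary least squares. A sketch on noise-free synthetic data from a known quadratic (illustrative, not course code):

```python
import numpy as np

def poly_features(x, d):
    """Feature map phi(x) = [1, x, x^2, ..., x^d] for each sample."""
    return np.vander(x, d + 1, increasing=True)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100)
y = 1 - 2 * x + 3 * x**2        # true degree-2 relationship, no noise

Phi = poly_features(x, 2)       # transformed inputs; model still linear in theta
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```

Because the data is noise-free, least squares recovers the coefficients [1, −2, 3] exactly.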
When relationships in data are too complex for simple formulas, neural networks can automatically learn nonlinear feature transformations. The output layer has no activation, directly outputting a number, trained with L2 loss (MSE).
Classification is about assigning a label given some features. The output is a category, not a number. For example: Is this email spam or not? Is this photo a cat or dog? Is this tumor benign or malignant?
Don't be fooled by the name — despite being called "regression," this is a classification model. Core idea: compute a score (wᵀx + b) like linear regression, then squeeze it through the Sigmoid function into a probability between 0 and 1.
If probability > 0.5, predict class 1; otherwise class 0.
Deriving the loss from MLE: Assuming n independent samples, the likelihood is:
Taking the negative log-likelihood yields Binary Cross-Entropy:
The gradient is strikingly similar to linear regression: ∇J = (1/n) Xᵀ(σ(Xθ) − y), but J(θ) has no closed-form solution — use gradient descent or Newton's method (IRLS). Good news: J(θ) is convex, guaranteeing convergence to the global optimum.
The biggest advantage is interpretability: each coefficient θₖ converts to an odds ratio, exp(θₖ), invaluable in medicine and social science.
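The gradient formula ∇J = (1/n) Xᵀ(σ(Xθ) − y) drops straight into a gradient-descent loop. A minimal sketch on synthetic Bernoulli-sampled data (the true parameters and learning rate here are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data drawn from a known logistic model
rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(0, 1, n)])
true_theta = np.array([-1.0, 3.0])
y = (sigmoid(X @ true_theta) > rng.uniform(size=n)).astype(float)

# Gradient descent on binary cross-entropy:
# grad J = (1/n) X^T (sigmoid(X theta) - y), same form as in the notes
theta = np.zeros(2)
for _ in range(5000):
    grad = X.T @ (sigmoid(X @ theta) - y) / n
    theta -= 0.5 * grad

acc = np.mean((sigmoid(X @ theta) > 0.5) == y)   # training accuracy
```

Since J(θ) is convex, plain gradient descent reliably finds the global optimum here.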
SVM aims to find the "widest road" separating two classes. Imagine two piles of different-colored chess pieces on a table — you want to place a ruler between them so the empty space on both sides is as wide as possible. This empty space is the margin; wider means more confident.
Hard-Margin primal problem:
Soft-Margin (allowing misclassification): Introduce slack variables ξᵢ ≥ 0 and penalty C:
Via Lagrange multipliers, transform into the dual problem (the theoretical gem of SVM):
The dual's benefits: (1) it depends only on inner products xᵢᵀxⱼ, enabling the kernel trick; (2) samples with αᵢ > 0 are the support vectors.
Kernel Trick: Replace inner products with K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ), implicitly mapping to a high-dimensional space without computing φ explicitly. Common kernels:
- Linear: K(x,z) = xᵀz
- RBF (Gaussian): K(x,z) = exp(−γ||x−z||²), implicit infinite-dimensional mapping
- Polynomial: K(x,z) = (xᵀz + c)ᵈ
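The three kernels above are one-liners in numpy. A small sketch evaluating each on a pair of vectors (the test points are arbitrary):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def rbf_kernel(x, z, gamma=1.0):
    # exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def poly_kernel(x, z, c=1.0, d=3):
    return (x @ z + c) ** d

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])

K_lin = linear_kernel(x, z)      # orthogonal vectors: 0
K_rbf = rbf_kernel(x, z)         # exp(-||x-z||^2) = exp(-2)
K_self = rbf_kernel(x, x)        # RBF of a point with itself is always 1
```

Note how the RBF kernel measures similarity: identical points score 1, and the score decays with squared distance.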
Multiclass extension of Logistic Regression. Computes a linear score zₖ = bₖ + wₖᵀx for each class, then normalizes via Softmax to probabilities that sum to 1:
The corresponding Categorical Cross-Entropy loss (derived from MLE):
The gradient is elegant: ∂J/∂zₖ = pₖ − 𝟙(y=k), i.e., "predicted probability minus true label" — identical in form to Logistic Regression.
Softmax's elegance: it not only picks a class but gives a "confidence" for each. When confident, one class's probability approaches 1; when uncertain, probabilities spread out.
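The Softmax normalization and its gradient pₖ − 𝟙(y=k) are easy to check directly. A sketch (the logits are arbitrary example values; subtracting the max before exponentiating is the standard numerical-stability trick):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # shift by max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])    # linear scores z_k for 3 classes
p = softmax(z)                    # probabilities summing to 1

# Gradient of cross-entropy wrt the logits: p - one_hot(y)
y = 0                             # suppose the true class is 0
one_hot = np.eye(3)[y]
grad = p - one_hot
```

Because both p and the one-hot vector sum to 1, the gradient's components always sum to 0.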
If your model can only do binary classification natively (e.g., SVM), how do you handle multiclass? Two classic strategies: One-vs-Rest (train K classifiers, each separating one class from all the others, and pick the highest score) and One-vs-One (train K(K−1)/2 pairwise classifiers and take a majority vote).
When input is an image, traditional methods struggle because images are 2D pixel matrices where spatial relationships matter. CNN is purpose-built for images: small convolutional filters slide across the image, automatically learning hierarchical features from edges to semantics.
Pipeline: Conv layers (local features) → ReLU (nonlinearity) → Pooling (downsampling) → Fully-connected (classification)
Milestone: AlexNet (2012) reduced ImageNet (1.2M training images, 1000 classes) top-5 error from 25.6% to 15.3%, launching the deep learning era.
Regression and classification above both assume each data point has a "correct answer" (label). But often we have no labels — just raw data from which we want to discover patterns. That's unsupervised learning.
The most classic clustering algorithm. Objective (minimize within-cluster sum of squares):
Steps: (1) randomly initialize K centers μₖ; (2) assign each point to nearest center (E-step); (3) recompute group means as new centers (M-step); (4) repeat until convergence. Each step guarantees J never increases, but only converges to a local optimum — so K-Means++ initialization is used in practice for better starting points.
Limitation: K must be specified in advance. The elbow method helps — plot J vs. K and look for the point where improvement levels off.
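The four steps above fit in a short function. A minimal Lloyd's-algorithm sketch with plain random initialization (not K-Means++), run on two synthetic, well-separated blobs; all data here is illustrative:

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Plain K-Means: random init from the data, then alternate E/M steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # E-step: assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # M-step: recompute each center as its cluster's mean
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):
            break                     # converged: J can no longer decrease
        centers = new_centers
    return labels, centers

# Two well-separated Gaussian blobs around (0,0) and (5,5)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centers = kmeans(X, K=2)
```

With blobs this well separated, the local optimum K-Means finds is the correct two-cluster split.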
No need to pre-specify K — it builds a dendrogram by iteratively merging the most similar points/clusters (agglomerative). Cut at any height to get different levels of grouping.
When data has dozens or hundreds of features, direct analysis is impractical. PCA finds the directions of maximum variance (principal components) and projects high-dimensional data onto a few key dimensions, preserving maximum information with minimum dimensions.
Mathematically, PCA eigendecomposes the centered data's covariance matrix:
Equivalently: max ||Xv||² s.t. ||v|| = 1 — find the projection direction with maximum variance. The 1st PC has max variance; the 2nd is orthogonal to the 1st with max remaining variance, and so on.
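The eigendecomposition view translates directly to numpy: center, form the covariance, take eigenvectors. A sketch on synthetic 2-D data whose variance is concentrated along the (1, 1) direction (the data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
# Correlated data: both coordinates share the signal t, plus small noise,
# so most variance lies along the direction (1, 1)/sqrt(2)
t = rng.normal(0, 3, 200)
X = np.column_stack([t, t]) + rng.normal(0, 0.3, (200, 2))

Xc = X - X.mean(axis=0)                   # center the data
C = Xc.T @ Xc / len(Xc)                   # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)      # eigh: ascending eigenvalues
order = eigvals.argsort()[::-1]           # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1 = eigvecs[:, 0]                       # first principal component
explained = eigvals[0] / eigvals.sum()    # explained variance ratio
```

Here the first component captures nearly all the variance and points (up to sign) along (1, 1)/√2.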
A nonlinear dimensionality reduction method that excels at visualizing high-dimensional data in 2D/3D. While PCA preserves global structure, t-SNE focuses on local neighborhoods — similar points form clear clusters in the visualization.
Training a model is essentially one thing: adjusting parameters to make predictions as close to reality as possible. Three steps: (1) define a "loss function" measuring how bad predictions are, (2) use optimization to adjust parameters and reduce loss, (3) use regularization to prevent memorization.
The most basic optimizer. Imagine standing on a foggy mountain — you can't see the full landscape, only the slope at your feet. Gradient descent's strategy: always step in the steepest downhill direction, walking step by step to the valley (minimum loss).
α is the learning rate (step size), ∇J is the loss gradient (slope direction). Too large = overshoot or diverge; too small = painfully slow.
Problem: classic GD uses the entire dataset for each step — extremely slow with large data.
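The update θ ← θ − α∇J, and the effect of the learning rate, can be seen on a 1-D quadratic (the function and step sizes are illustrative):

```python
import numpy as np

# Minimize J(theta) = (theta - 3)^2; its gradient is 2*(theta - 3)
def grad(theta):
    return 2.0 * (theta - 3.0)

# A sensible learning rate converges to the minimum at theta = 3
theta = 0.0
for _ in range(200):
    theta -= 0.1 * grad(theta)

# Too large a learning rate overshoots and diverges
theta_bad = 0.0
for _ in range(50):
    theta_bad -= 1.1 * grad(theta_bad)
```

With α = 0.1 the error shrinks by a factor 0.8 per step; with α = 1.1 it grows by 1.2 per step, exactly the overshoot/divergence behavior described above.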
Solution: instead of the full dataset, use a random mini-batch each step to estimate the gradient. Less precise per step, but much faster! This is the standard training method for deep learning.
One epoch = all mini-batches seen once = n/B steps (n = total samples, B = batch size). Typically need tens to hundreds of epochs.
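The epoch/mini-batch bookkeeping looks like this in code. A sketch of mini-batch SGD for linear regression on synthetic data (n = 1000, B = 50, so one epoch is 20 steps; all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, B = 1000, 50                          # n samples, batch size B
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.1, n)

theta = np.zeros(2)
alpha = 0.1
for epoch in range(20):                  # one epoch = n/B = 20 steps
    perm = rng.permutation(n)            # reshuffle each epoch
    for i in range(0, n, B):
        idx = perm[i:i + B]
        Xb, yb = X[idx], y[idx]
        # MSE gradient estimated on the mini-batch only
        g = 2 * Xb.T @ (Xb @ theta - yb) / B
        theta -= alpha * g
```

Each step's gradient is noisy, but the parameters still settle near the true (1, 2).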
Uses second-order derivatives (Hessian matrix) for faster convergence — knowing not just the slope but "how fast the slope changes" enables reaching the minimum in fewer steps. Update rule:
Intuition: GD approximates the loss with a "plane" and takes fixed steps; Newton's method uses a "quadratic surface" and jumps to its minimum in one step — hence quadratic convergence, far faster than GD's linear convergence.
But computing H⁻¹ is O(d³), storing H is O(d²). When d = 1 million, this is infeasible. In practice, Quasi-Newton methods (e.g., L-BFGS) approximate H⁻¹ and work well for medium-scale convex problems.
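In 1-D the Hessian is just the second derivative, which makes the quadratic convergence easy to watch. A sketch minimizing an illustrative quartic (minimum at x = 9/4, where f′ = x²(4x − 9) = 0):

```python
# Minimize f(x) = x^4 - 3x^3 + 2; in 1-D the "Hessian" is f''(x)
def f_prime(x):
    return 4 * x**3 - 9 * x**2

def f_second(x):
    return 12 * x**2 - 18 * x

x = 3.0                                  # start near the minimum
for _ in range(20):
    x -= f_prime(x) / f_second(x)        # Newton step: x - H^{-1} * grad
```

Twenty steps is far more than needed: quadratic convergence roughly doubles the number of correct digits per iteration once near the minimum.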
The loss function is the model's "grading rubric" — it defines how bad a prediction is. Different problems need different loss functions:
- Regression → MSE:
J = (1/n) ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)². Penalizes large errors more (they are squared). Gradient: ∂J/∂ŷᵢ = −2(yᵢ − ŷᵢ)/n, simple and intuitive. Under a Gaussian noise assumption, minimizing MSE is equivalent to MLE.
- Binary Classification → Binary Cross-Entropy:
J = −(1/n) ∑ᵢ [yᵢ log(pᵢ) + (1−yᵢ) log(1−pᵢ)]. When the model is confidently wrong (e.g., p = 0.99 but y = 0), −log(0.01) ≈ 4.6, a severe penalty. This is why Cross-Entropy suits classification better than MSE: it punishes confident mistakes heavily.
- Multiclass → Categorical Cross-Entropy:
J = −(1/n) ∑ᵢ ∑ₖ₌₁ᴷ yᵢₖ log(ŷᵢₖ). Paired with Softmax. yᵢₖ is one-hot encoded, so the inner sum retains only the correct class's term.
- SVM → Hinge Loss:
J = ∑ᵢ max(0, 1 − yᵢ f(xᵢ)). No penalty as long as classification is correct with margin ≥ 1; linear penalty otherwise. This is the source of "maximum margin."
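The contrasting behavior of these losses shows up immediately in code. A sketch evaluating each on tiny hand-picked examples (the example values are illustrative):

```python
import numpy as np

def mse(y, yhat):
    return np.mean((y - yhat) ** 2)

def bce(y, p):
    # binary cross-entropy; y in {0,1}, p = predicted probability
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def hinge(y, f):
    # hinge loss; y in {-1,+1}, f = raw classifier score
    return np.sum(np.maximum(0.0, 1 - y * f))

# Cross-entropy punishes a confidently wrong prediction hard:
loss_confident_wrong = bce(np.array([0.0]), np.array([0.99]))  # -log(0.01)
loss_mild_wrong      = bce(np.array([0.0]), np.array([0.6]))   # -log(0.4)

# Hinge loss: a correct prediction with margin >= 1 costs nothing
loss_margin_ok = hinge(np.array([1.0]), np.array([1.5]))
```

The confident mistake costs about 4.6 nats versus about 0.9 for the mild one, matching the −log(0.01) ≈ 4.6 figure above.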
Overfitting is when a model memorizes training data, including noise and coincidental patterns, performing poorly on new data. Like a student who memorized textbook answers but can't solve new problems.
Regularization is the "anti-cheating" toolkit, ensuring models learn genuine patterns:
- L2 (Ridge) — Adds λ||θ||² to loss, keeping parameters small. Effect: smoother, more "conservative" model.
- L1 (Lasso) — Adds λ||θ||₁, can zero out unimportant parameters — automatic feature selection.
- Dropout — Randomly deactivates neurons during training (e.g., 50%), preventing over-reliance on specific pathways. All neurons active during testing.
- Batch Normalization — Normalizes each layer's inputs (mean=0, variance=1), speeding up training and stabilizing gradients.
- Cross-Validation — Split data into K folds; rotate one fold as test, rest as training. Average over K trials gives an objective estimate of generalization.
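The L2 item above has a closed form worth seeing: adding λ||θ||² to least squares turns the Normal Equation into (XᵀX + λI)θ = Xᵀy. A sketch comparing ridge to plain least squares on a synthetic few-samples/many-features problem (all data illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 30, 20                            # few samples, many features
X = rng.normal(size=(n, d))
y = X[:, 0] + rng.normal(0, 0.1, n)      # only feature 0 actually matters

lam = 1.0
# Ridge closed form: theta = (X^T X + lam * I)^{-1} X^T y
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
theta_ols   = np.linalg.lstsq(X, y, rcond=None)[0]

# The penalty shrinks the coefficient vector toward zero
norm_ridge = np.linalg.norm(theta_ridge)
norm_ols   = np.linalg.norm(theta_ols)
```

Via the SVD, each ridge coefficient is the least-squares one scaled by σ²/(σ² + λ) < 1, so the ridge solution's norm is strictly smaller.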
Neural networks are brain-inspired (but don't literally simulate brains) computational models. Their power: given sufficient size, they can theoretically approximate any function (Universal Approximation Theorem). The deep learning explosion happened because we finally had enough data and compute to train large networks.
The simplest neural network has three layers: Input (receives features) → Hidden (learnable feature transformation) → Output (produces predictions).
Each layer does two things: (1) linear transformation (matrix multiply + bias), (2) nonlinear activation. Without activations, a multi-layer network collapses to a single linear model — activations are the key to learning complex patterns.
Activation functions introduce nonlinearity after each layer's linear transformation. Without them, even a deep network can only draw straight lines. With them, the network can draw arbitrarily complex decision boundaries.
Extremely simple: positive values pass through, negatives become zero. Gradient is either 1 or 0 — it never vanishes. Fast to compute. Nearly all modern deep networks use ReLU or variants (Leaky ReLU, GELU, etc.).
Vanishing gradient explained: Sigmoid/Tanh gradients are at most 1 (Sigmoid's is at most 0.25). In deep networks, backpropagation multiplies gradients across layers; many factors below 1 multiplied together approach zero, so early layers barely learn. ReLU's gradient is exactly 1 for positive inputs, which largely eliminates the problem (though units stuck in the negative region still get zero gradient).
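The multiplied-factors argument is a one-line computation. A sketch comparing the best-case Sigmoid gradient chain against ReLU across 20 layers (the depth is an illustrative choice):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)               # peaks at 0.25, at z = 0

def relu_grad(z):
    return (z > 0).astype(float)     # 1 for positive inputs, else 0

# Backprop multiplies one such factor per layer. Even at Sigmoid's
# best case (z = 0, gradient 0.25), 20 layers shrink it to 0.25^20.
depth = 20
sig_chain = float(sigmoid_grad(0.0)) ** depth
relu_chain = float(relu_grad(np.array(1.0))) ** depth
```

0.25²⁰ is on the order of 10⁻¹², while the ReLU chain stays exactly 1 along active paths.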
Backpropagation is the core algorithm for training neural networks. Two phases:
Forward pass: Data flows through each layer to produce predictions, then compute loss.
Backward pass: Starting from the output, use the chain rule layer by layer. For a two-layer network (input x → hidden → output ŷ):
Define the error signal δ (each layer's "contribution" to the final loss):
With δ, each layer's gradients are simple: ∂J/∂Wₕ = δₕ xᵀ, ∂J/∂Wₒ = δₒ aₕᵀ. This is why the activation function's derivative h′(z) matters so much — Sigmoid's h′ is tiny, causing δ to decay layer by layer (vanishing gradients).
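The δ formulas can be checked against a numerical derivative, a standard sanity test for any backprop implementation. A sketch for a tiny two-layer regression network with tanh activation and L2 loss (the sizes and initialization are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=(3, 1))              # one input with 3 features
y = np.array([[0.5]])                    # regression target

Wh = rng.normal(size=(4, 3)) * 0.5       # hidden-layer weights
Wo = rng.normal(size=(1, 4)) * 0.5       # output-layer weights

def forward(Wh, Wo):
    z = Wh @ x                           # hidden pre-activation
    a = np.tanh(z)                       # hidden activation
    yhat = Wo @ a                        # linear output (regression)
    return z, a, yhat, 0.5 * float((yhat - y) ** 2)

# Backward pass: error signals delta, as in the notes
z, a, yhat, J = forward(Wh, Wo)
delta_o = yhat - y                               # dJ/d(yhat) for L2 loss
delta_h = (Wo.T @ delta_o) * (1 - np.tanh(z)**2) # chain rule through tanh
grad_Wo = delta_o @ a.T                          # dJ/dWo = delta_o a^T
grad_Wh = delta_h @ x.T                          # dJ/dWh = delta_h x^T

# Numerical check on one entry of Wh via finite differences
eps = 1e-6
Wh2 = Wh.copy(); Wh2[0, 0] += eps
numeric = (forward(Wh2, Wo)[3] - J) / eps
```

The analytic entry grad_Wh[0, 0] and the finite-difference estimate agree to several decimal places, which is the usual evidence that the δ bookkeeping is right.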
Convolution kernels (e.g., 3×3 patches) slide across the image like a small window, computing dot products at each position. Different kernels learn to detect different patterns — horizontal edges, color changes, corners, etc.
Padding modes determine output size:
- Valid (no padding): output = (N−K+1), image shrinks
- Same (zero-pad): output = N, size preserved
- Full (full padding): output = (N+K−1), image grows
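The three output-size formulas can be confirmed with numpy's 1-D convolution, which supports exactly these modes. A sketch with N = 10 and K = 3 (an averaging kernel, chosen for easy-to-check values):

```python
import numpy as np

x = np.arange(10.0)          # a 1-D "image" of length N = 10
k = np.ones(3) / 3           # a length-K = 3 averaging kernel

y_valid = np.convolve(x, k, mode="valid")   # length N - K + 1 = 8
y_same  = np.convolve(x, k, mode="same")    # length N     = 10
y_full  = np.convolve(x, k, mode="full")    # length N + K - 1 = 12
```

The first "valid" output is the average of [0, 1, 2], i.e., 1.0; no padding values ever enter a "valid" convolution.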
CNN's most fascinating property: early layers automatically learn edges and corners → middle layers learn textures and parts → deep layers learn full semantic concepts (faces, cars, animals). This hierarchy remarkably mirrors the human visual cortex V1 → V2 → ... processing pipeline.
5 conv + 3 FC layers, 60 million parameters. Three key innovations: ReLU replacing Sigmoid, Max Pooling for downsampling, Dropout for regularization. It reduced ImageNet (1.2M training images, 1000 classes) top-5 error from 25.6% to 15.3%, the moment the world realized deep learning's power. Deep learning has dominated computer vision ever since.
ML math has three pillars: linear algebra (handling data and transformations), probability & statistics (reasoning from data), and optimization (training models). You don't need to learn everything upfront — learn as you go, and come back to look things up as needed.
In ML, data, parameters, and transformations are all matrices and vectors. Linear algebra is the toolkit for working with them.
- SVD — Any matrix A can be decomposed as A = UΣVᵀ, where U, V are orthogonal and Σ is diagonal (singular values). The math behind PCA: apply SVD to the covariance matrix; the top k singular vectors are the principal components.
A = UΣVᵀ, A ∈ ℝᵐ×ⁿ, U ∈ ℝᵐ×ᵐ, Σ ∈ ℝᵐ×ⁿ, V ∈ ℝⁿ×ⁿ
- Eigenvalues/Eigenvectors — Av = λv. For centered data, the covariance matrix C = (1/n)XᵀX has eigenvalues λᵢ equal to the variance along the i-th principal component. PCA retains the directions of the top k eigenvalues, maximizing the retained variance:
Explained variance ratio = ∑ᵢ₌₁ᵏ λᵢ / ∑ᵢ₌₁ᵈ λᵢ
- Matrix Inverse & Pseudoinverse — Closed-form needs (XᵀX)⁻¹. When singular, use Moore-Penrose pseudoinverse X⁺ = VΣ⁺Uᵀ (directly from SVD).
- Vector Norms — L2 norm ||x||₂ = √(∑xᵢ²), L1 norm ||x||₁ = ∑|xᵢ|. Regularization terms are parameter norms.
- Positive Definite Matrices — A is positive definite ⇔ xᵀAx > 0 for all x ≠ 0. A positive definite Hessian means the function is strictly convex (unique global optimum).
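The SVD and pseudoinverse facts in the list are directly checkable with numpy. A sketch on a small 3×2 full-column-rank matrix (the matrix is arbitrary):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])               # 3x2, rank 2

# Thin SVD: A = U diag(s) V^T reconstructs A exactly
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_rebuilt = U @ np.diag(s) @ Vt

# Moore-Penrose pseudoinverse (computed from the SVD internally)
A_pinv = np.linalg.pinv(A)
# For full column rank, A^+ is a left inverse: A^+ A = I (2x2)
check = A_pinv @ A
```

For rank-deficient matrices A⁺A is no longer the identity, but A⁺ still gives the minimum-norm least-squares solution, which is exactly why it replaces (XᵀX)⁻¹ in the singular case.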
The theoretical foundation for learning from data — making optimal inferences under uncertainty.
- Bayes' Theorem:
P(θ|D) = P(D|θ) P(θ) / P(D) ∝ likelihood × prior. Core idea: "After seeing data, how should I update my beliefs?" MAP (Maximum A Posteriori) = MLE + prior; with a Gaussian prior, MAP is equivalent to L2 regularization!
- Maximum Likelihood (MLE):
θₘₗₑ = argmax ∏ᵢ P(xᵢ|θ) = argmax ∑ᵢ log P(xᵢ|θ). Taking the log turns products into sums (numerically stable, easy to differentiate). MSE is MLE under Gaussian noise; Cross-Entropy is MLE under Bernoulli.
- KL Divergence — Measures the "distance" between distributions:
Dₖₗ(P||Q) = ∑ₓ P(x) log(P(x)/Q(x)) ≥ 0. Minimizing Cross-Entropy ⇔ minimizing the KL divergence between the predicted distribution Q and the true distribution P.
- Covariance Matrix — C = E[(X−μ)(X−μ)ᵀ]; diagonal entries are variances, off-diagonal entries are covariances. The multivariate Gaussian N(μ, Σ) is fully defined by its mean and covariance.
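Two of the claims above are quick to verify numerically: the MLE of a Gaussian mean is the sample mean, and KL divergence is nonnegative and zero only for identical distributions. A sketch (data and distributions are illustrative):

```python
import numpy as np

# MLE for a Gaussian mean: argmax of sum_i log N(x_i | mu, 1)
# is the sample mean (set the derivative of the log-likelihood to zero)
rng = np.random.default_rng(8)
data = rng.normal(2.0, 1.0, 10000)
mu_mle = data.mean()                       # close to the true mean 2.0

# KL divergence between two discrete distributions
def kl(P, Q):
    return float(np.sum(P * np.log(P / Q)))

P = np.array([0.5, 0.5])
Q = np.array([0.9, 0.1])
d_pq = kl(P, Q)                            # > 0: the distributions differ
d_pp = kl(P, P)                            # = 0: identical distributions
```

Note KL is not symmetric (kl(P, Q) ≠ kl(Q, P) in general), which is why "distance" appears in quotes above.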
Training models = minimizing loss = optimization. Understanding optimization helps diagnose "why isn't my model learning?"
- Convex Functions — f is convex ⇔ f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y), equivalently Hessian H ⪰ 0 (positive semi-definite). Convex functions have a single global optimum. Linear/Logistic regression losses are convex; SVM is convex (quadratic program). Deep learning losses are non-convex (saddle points and local optima exist).
- Gradient & Hessian — The gradient ∇f ∈ ℝᵈ points in the direction of steepest ascent. The Hessian H ∈ ℝᵈ×ᵈ describes curvature:
∇f = [∂f/∂θ₁, ..., ∂f/∂θᵈ]ᵀ, Hᵢⱼ = ∂²f / ∂θᵢ∂θⱼ. Optimality conditions (unconstrained): first-order ∇f = 0, second-order H ⪰ 0.
- Lagrange Multipliers & KKT — For constrained optimization min f(x) s.t. gᵢ(x) ≤ 0:
L(x, α) = f(x) + ∑ᵢ αᵢ gᵢ(x), with αᵢ ≥ 0 and αᵢ gᵢ(x) = 0 (complementary slackness). SVM's dual is derived via the KKT conditions; complementary slackness explains why only support vectors have αᵢ > 0.
- Convergence Rates — On smooth convex problems, GD reaches error O(1/t) (and a geometric, so-called "linear," rate under strong convexity); Newton's method converges locally quadratically; SGD converges at roughly O(1/√t). Adaptive methods like Adam combine momentum with per-coordinate learning-rate scaling:
Adam: mᵗ = β₁mᵗ⁻¹ + (1−β₁)gᵗ, vᵗ = β₂vᵗ⁻¹ + (1−β₂)(gᵗ)², θ = θ − α m̂ᵗ/(√v̂ᵗ + ε)
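The Adam update line above maps one-to-one onto code. A minimal sketch using the common default hyperparameters (α = 0.01, β₁ = 0.9, β₂ = 0.999), run on an illustrative 1-D quadratic:

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update, matching the formula above (t counts from 1)."""
    m = b1 * m + (1 - b1) * g            # momentum (1st-moment estimate)
    v = b2 * v + (1 - b2) * g**2         # 2nd-moment estimate
    m_hat = m / (1 - b1**t)              # bias correction
    v_hat = v / (1 - b2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize (theta - 5)^2 with Adam
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 3001):
    g = 2 * (theta - 5.0)
    theta, m, v = adam_step(theta, g, m, v, t)
```

Because m̂/√v̂ has magnitude near 1, Adam's early progress is roughly α per step regardless of the gradient's scale, one reason it needs less learning-rate tuning than plain SGD.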