Machine Learning
Notes from ECE 5307, The Ohio State University · Video Lectures · Prof. Schniter
Machine learning is the science of getting computers to learn patterns from data, rather than being explicitly programmed with rules. Instead of telling the computer "if the email contains 'free money', mark it as spam," we show it thousands of emails labeled as spam or not-spam, and it figures out the rules on its own.
This knowledge tree organizes the core concepts into four questions: What problem are you solving? (choosing the right model), How do you train it? (optimization & evaluation), How do neural networks work? (deep learning), and What math do you need? (foundations). Click any node to explore.
New to machine learning? Don't worry — this course starts from scratch. "Machine learning" sounds fancy, but at its core, it's about finding patterns in data using math you mostly already know (linear algebra, calculus, probability). The tree below builds up from simple models (linear regression) to complex ones (neural networks). Prerequisites: comfort with basic calculus (derivatives) and matrix notation helps, but each concept includes intuitive analogies. Software: Python with scikit-learn or PyTorch is standard, though the concepts are language-agnostic.
Regression problems are about predicting a number — like a house price, temperature, or exam score. Unlike classification (picking a label), regression outputs a continuous value.
The most fundamental regression model. The core idea is simple: assume a straight-line relationship between output y and input x (y = wx + b), then find the line that minimizes prediction error.
How? Using Least Squares — sum up the squared differences between predicted and actual values, then find the w and b that minimize this total:
Taking the derivative of J(θ) and setting it to zero yields a clean closed-form solution (Normal Equation):
Note: this requires XᵀX to be invertible. When there are more features than samples (d > n), the matrix is singular; even when it is invertible, the O(d³) cost of solving the normal equations is slow for large d, so gradient descent is more practical there.
With many features, the model may overfit (performs well on training data but poorly on new data). Regularization constrains model complexity:
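The Normal Equation above is easy to verify numerically. A minimal numpy sketch on synthetic data (the line y = 2x + 1 plus small noise; the data and names here are illustrative, not from the course):

```python
import numpy as np

# Synthetic data: y = 2x + 1 plus a little Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=50)

# Design matrix with a bias column of ones
X = np.column_stack([np.ones_like(x), x])

# Normal Equation: theta = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over forming an explicit inverse)
theta = np.linalg.solve(X.T @ X, X.T @ y)
b, w = theta   # recovered intercept and slope
```

With this little noise the recovered slope and intercept land very close to the true (2, 1).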
Many real relationships are curves, not straight lines. Polynomial regression constructs a feature mapping:
Then fits y = θᵀφ(x) using linear regression. It's still linear in parameters θ — just with transformed inputs.
A key issue: a degree-d polynomial has d+1 parameters; too many degrees of freedom when d is large. Bias-Variance Tradeoff: d too small → underfitting (high bias), d too large → overfitting (high variance). Use cross-validation to select optimal d.
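The "linear in θ, nonlinear in x" point can be shown in a few lines: build φ(x) = [1, x, x², ...] and solve with ordinary least squares. A sketch on noise-free synthetic data from a known quadratic (illustrative, not course code):

```python
import numpy as np

def poly_features(x, d):
    """Feature map phi(x) = [1, x, x^2, ..., x^d] for each sample."""
    return np.vander(x, d + 1, increasing=True)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100)
y = 1 - 2 * x + 3 * x**2        # true degree-2 relationship, no noise

Phi = poly_features(x, 2)       # transformed inputs; model still linear in theta
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```

Because the data is noise-free, least squares recovers the coefficients [1, −2, 3] exactly.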
When relationships in data are too complex for simple formulas, neural networks can automatically learn nonlinear feature transformations. The output layer has no activation, directly outputting a number, trained with L2 loss (MSE).
Classification is about assigning a label given some features. The output is a category, not a number. For example: Is this email spam or not? Is this photo a cat or dog? Is this tumor benign or malignant?
Don't be fooled by the name — despite being called "regression," this is a classification model. Core idea: compute a score (wᵀx + b) like linear regression, then squeeze it through the Sigmoid function into a probability between 0 and 1.
If probability > 0.5, predict class 1; otherwise class 0.
Deriving the loss from MLE: Assuming n independent samples, the likelihood is:
Taking the negative log-likelihood yields Binary Cross-Entropy:
The gradient is strikingly similar to linear regression: ∇J = (1/n) Xᵀ(σ(Xθ) − y), but J(θ) has no closed-form solution — use gradient descent or Newton's method (IRLS). Good news: J(θ) is convex, guaranteeing convergence to the global optimum.
The biggest advantage is interpretability: each coefficient θₖ converts to an odds ratio, exp(θₖ), invaluable in medicine and social science.
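The gradient formula ∇J = (1/n) Xᵀ(σ(Xθ) − y) drops straight into a gradient-descent loop. A minimal sketch on synthetic Bernoulli-sampled data (the true parameters and learning rate here are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data drawn from a known logistic model
rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(0, 1, n)])
true_theta = np.array([-1.0, 3.0])
y = (sigmoid(X @ true_theta) > rng.uniform(size=n)).astype(float)

# Gradient descent on binary cross-entropy:
# grad J = (1/n) X^T (sigmoid(X theta) - y), same form as in the notes
theta = np.zeros(2)
for _ in range(5000):
    grad = X.T @ (sigmoid(X @ theta) - y) / n
    theta -= 0.5 * grad

acc = np.mean((sigmoid(X @ theta) > 0.5) == y)   # training accuracy
```

Since J(θ) is convex, plain gradient descent reliably finds the global optimum here.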
SVM aims to find the "widest road" separating two classes. Imagine two piles of different-colored chess pieces on a table — you want to place a ruler between them so the empty space on both sides is as wide as possible. This empty space is the margin; wider means more confident.
Hard-Margin primal problem:
Soft-Margin (allowing misclassification): Introduce slack variables ξᵢ ≥ 0 and penalty C:
Via Lagrange multipliers, transform into the dual problem (the theoretical gem of SVM):
The dual's benefits: (1) it depends only on inner products xᵢᵀxⱼ, enabling the kernel trick; (2) samples with αᵢ > 0 are the support vectors.
Kernel Trick: Replace inner products with K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ), implicitly mapping to a high-dimensional space without computing φ explicitly. Common kernels:
- Linear: K(x,z) = xᵀz
- RBF (Gaussian): K(x,z) = exp(−γ||x−z||²), implicit infinite-dimensional mapping
- Polynomial: K(x,z) = (xᵀz + c)ᵈ
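The three kernels above are one-liners in numpy. A small sketch evaluating each on a pair of vectors (the test points are arbitrary):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def rbf_kernel(x, z, gamma=1.0):
    # exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def poly_kernel(x, z, c=1.0, d=3):
    return (x @ z + c) ** d

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])

K_lin = linear_kernel(x, z)      # orthogonal vectors: 0
K_rbf = rbf_kernel(x, z)         # exp(-||x-z||^2) = exp(-2)
K_self = rbf_kernel(x, x)        # RBF of a point with itself is always 1
```

Note how the RBF kernel measures similarity: identical points score 1, and the score decays with squared distance.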
Multiclass extension of Logistic Regression. Computes a linear score zₖ = bₖ + wₖᵀx for each class, then normalizes via Softmax to probabilities that sum to 1:
The corresponding Categorical Cross-Entropy loss (derived from MLE):
The gradient is elegant: ∂J/∂zₖ = pₖ − 𝟙(y=k), i.e., "predicted probability minus true label" — identical in form to Logistic Regression.
Softmax's elegance: it not only picks a class but gives a "confidence" for each. When confident, one class's probability approaches 1; when uncertain, probabilities spread out.
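The Softmax normalization and its gradient pₖ − 𝟙(y=k) are easy to check directly. A sketch (the logits are arbitrary example values; subtracting the max before exponentiating is the standard numerical-stability trick):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # shift by max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])    # linear scores z_k for 3 classes
p = softmax(z)                    # probabilities summing to 1

# Gradient of cross-entropy wrt the logits: p - one_hot(y)
y = 0                             # suppose the true class is 0
one_hot = np.eye(3)[y]
grad = p - one_hot
```

Because both p and the one-hot vector sum to 1, the gradient's components always sum to 0.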
If your model can only do binary classification natively (e.g., SVM), how do you handle multiclass? Two classic strategies: One-vs-Rest (train K classifiers, each separating one class from all the others, and pick the highest score) and One-vs-One (train K(K−1)/2 pairwise classifiers and take a majority vote).
When input is an image, traditional methods struggle because images are 2D pixel matrices where spatial relationships matter. CNN is purpose-built for images: small convolutional filters slide across the image, automatically learning hierarchical features from edges to semantics.
Pipeline: Conv layers (local features) → ReLU (nonlinearity) → Pooling (downsampling) → Fully-connected (classification)
Milestone: AlexNet (2012) reduced ImageNet (1.2M training images, 1000 classes) top-5 error from 25.6% to 15.3%, launching the deep learning era.
Regression and classification above both assume each data point has a "correct answer" (label). But often we have no labels — just raw data from which we want to discover patterns. That's unsupervised learning.
The most classic clustering algorithm. Objective (minimize within-cluster sum of squares):
Steps: (1) randomly initialize K centers μₖ; (2) assign each point to nearest center (E-step); (3) recompute group means as new centers (M-step); (4) repeat until convergence. Each step guarantees J never increases, but only converges to a local optimum — so K-Means++ initialization is used in practice for better starting points.
Limitation: K must be specified in advance. The elbow method helps — plot J vs. K and look for the point where improvement levels off.
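The four steps above fit in a short function. A minimal Lloyd's-algorithm sketch with plain random initialization (not K-Means++), run on two synthetic, well-separated blobs; all data here is illustrative:

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Plain K-Means: random init from the data, then alternate E/M steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # E-step: assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # M-step: recompute each center as its cluster's mean
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):
            break                     # converged: J can no longer decrease
        centers = new_centers
    return labels, centers

# Two well-separated Gaussian blobs around (0,0) and (5,5)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centers = kmeans(X, K=2)
```

With blobs this well separated, the local optimum K-Means finds is the correct two-cluster split.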
No need to pre-specify K — it builds a dendrogram by iteratively merging the most similar points/clusters (agglomerative). Cut at any height to get different levels of grouping.
When data has dozens or hundreds of features, direct analysis is impractical. PCA finds the directions of maximum variance (principal components) and projects high-dimensional data onto a few key dimensions, preserving maximum information with minimum dimensions.
Mathematically, PCA eigendecomposes the centered data's covariance matrix:
Equivalently: max ||Xv||² s.t. ||v|| = 1 — find the projection direction with maximum variance. The 1st PC has max variance; the 2nd is orthogonal to the 1st with max remaining variance, and so on.
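The eigendecomposition view translates directly to numpy: center, form the covariance, take eigenvectors. A sketch on synthetic 2-D data whose variance is concentrated along the (1, 1) direction (the data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
# Correlated data: both coordinates share the signal t, plus small noise,
# so most variance lies along the direction (1, 1)/sqrt(2)
t = rng.normal(0, 3, 200)
X = np.column_stack([t, t]) + rng.normal(0, 0.3, (200, 2))

Xc = X - X.mean(axis=0)                   # center the data
C = Xc.T @ Xc / len(Xc)                   # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)      # eigh: ascending eigenvalues
order = eigvals.argsort()[::-1]           # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1 = eigvecs[:, 0]                       # first principal component
explained = eigvals[0] / eigvals.sum()    # explained variance ratio
```

Here the first component captures nearly all the variance and points (up to sign) along (1, 1)/√2.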
A nonlinear dimensionality reduction method that excels at visualizing high-dimensional data in 2D/3D. While PCA preserves global structure, t-SNE focuses on local neighborhoods — similar points form clear clusters in the visualization.
Training a model is essentially one thing: adjusting parameters to make predictions as close to reality as possible. Three steps: (1) define a "loss function" measuring how bad predictions are, (2) use optimization to adjust parameters and reduce loss, (3) use regularization to prevent memorization.
The most basic optimizer. Imagine standing on a foggy mountain — you can't see the full landscape, only the slope at your feet. Gradient descent's strategy: always step in the steepest downhill direction, walking step by step to the valley (minimum loss).
α is the learning rate (step size), ∇J is the loss gradient (slope direction). Too large = overshoot or diverge; too small = painfully slow.
Problem: classic GD uses the entire dataset for each step — extremely slow with large data.
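The update θ ← θ − α∇J, and the effect of the learning rate, can be seen on a 1-D quadratic (the function and step sizes are illustrative):

```python
import numpy as np

# Minimize J(theta) = (theta - 3)^2; its gradient is 2*(theta - 3)
def grad(theta):
    return 2.0 * (theta - 3.0)

# A sensible learning rate converges to the minimum at theta = 3
theta = 0.0
for _ in range(200):
    theta -= 0.1 * grad(theta)

# Too large a learning rate overshoots and diverges
theta_bad = 0.0
for _ in range(50):
    theta_bad -= 1.1 * grad(theta_bad)
```

With α = 0.1 the error shrinks by a factor 0.8 per step; with α = 1.1 it grows by 1.2 per step, exactly the overshoot/divergence behavior described above.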
Solution: instead of the full dataset, use a random mini-batch each step to estimate the gradient. Less precise per step, but much faster! This is the standard training method for deep learning.
One epoch = all mini-batches seen once = n/B steps (n = total samples, B = batch size). Typically need tens to hundreds of epochs.
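The epoch/mini-batch bookkeeping looks like this in code. A sketch of mini-batch SGD for linear regression on synthetic data (n = 1000, B = 50, so one epoch is 20 steps; all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, B = 1000, 50                          # n samples, batch size B
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.1, n)

theta = np.zeros(2)
alpha = 0.1
for epoch in range(20):                  # one epoch = n/B = 20 steps
    perm = rng.permutation(n)            # reshuffle each epoch
    for i in range(0, n, B):
        idx = perm[i:i + B]
        Xb, yb = X[idx], y[idx]
        # MSE gradient estimated on the mini-batch only
        g = 2 * Xb.T @ (Xb @ theta - yb) / B
        theta -= alpha * g
```

Each step's gradient is noisy, but the parameters still settle near the true (1, 2).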
Uses second-order derivatives (Hessian matrix) for faster convergence — knowing not just the slope but "how fast the slope changes" enables reaching the minimum in fewer steps. Update rule:
Intuition: GD approximates the loss with a "plane" and takes fixed steps; Newton's method uses a "quadratic surface" and jumps to its minimum in one step — hence quadratic convergence, far faster than GD's linear convergence.
But computing H⁻¹ is O(d³), storing H is O(d²). When d = 1 million, this is infeasible. In practice, Quasi-Newton methods (e.g., L-BFGS) approximate H⁻¹ and work well for medium-scale convex problems.
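In 1-D the Hessian is just the second derivative, which makes the quadratic convergence easy to watch. A sketch minimizing an illustrative quartic (minimum at x = 9/4, where f′ = x²(4x − 9) = 0):

```python
# Minimize f(x) = x^4 - 3x^3 + 2; in 1-D the "Hessian" is f''(x)
def f_prime(x):
    return 4 * x**3 - 9 * x**2

def f_second(x):
    return 12 * x**2 - 18 * x

x = 3.0                                  # start near the minimum
for _ in range(20):
    x -= f_prime(x) / f_second(x)        # Newton step: x - H^{-1} * grad
```

Twenty steps is far more than needed: quadratic convergence roughly doubles the number of correct digits per iteration once near the minimum.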
The loss function is the model's "grading rubric" — it defines how bad a prediction is. Different problems need different loss functions:
- Regression → MSE:
J = (1/n) ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)². Penalizes large errors more (they are squared). Gradient: ∂J/∂ŷᵢ = −2(yᵢ − ŷᵢ)/n, simple and intuitive. Under a Gaussian noise assumption, minimizing MSE is equivalent to MLE.
- Binary Classification → Binary Cross-Entropy:
J = −(1/n) ∑ᵢ [yᵢ log(pᵢ) + (1−yᵢ) log(1−pᵢ)]. When the model is confidently wrong (e.g., p = 0.99 but y = 0), −log(0.01) ≈ 4.6, a severe penalty. This is why Cross-Entropy suits classification better than MSE: it punishes confident mistakes heavily.
- Multiclass → Categorical Cross-Entropy:
J = −(1/n) ∑ᵢ ∑ₖ₌₁ᴷ yᵢₖ log(ŷᵢₖ). Paired with Softmax. yᵢₖ is one-hot encoded, so the inner sum retains only the correct class's term.
- SVM → Hinge Loss:
J = ∑ᵢ max(0, 1 − yᵢ f(xᵢ)). No penalty as long as classification is correct with margin ≥ 1; linear penalty otherwise. This is the source of "maximum margin."
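The contrasting behavior of these losses shows up immediately in code. A sketch evaluating each on tiny hand-picked examples (the example values are illustrative):

```python
import numpy as np

def mse(y, yhat):
    return np.mean((y - yhat) ** 2)

def bce(y, p):
    # binary cross-entropy; y in {0,1}, p = predicted probability
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def hinge(y, f):
    # hinge loss; y in {-1,+1}, f = raw classifier score
    return np.sum(np.maximum(0.0, 1 - y * f))

# Cross-entropy punishes a confidently wrong prediction hard:
loss_confident_wrong = bce(np.array([0.0]), np.array([0.99]))  # -log(0.01)
loss_mild_wrong      = bce(np.array([0.0]), np.array([0.6]))   # -log(0.4)

# Hinge loss: a correct prediction with margin >= 1 costs nothing
loss_margin_ok = hinge(np.array([1.0]), np.array([1.5]))
```

The confident mistake costs about 4.6 nats versus about 0.9 for the mild one, matching the −log(0.01) ≈ 4.6 figure above.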
Overfitting is when a model memorizes training data, including noise and coincidental patterns, performing poorly on new data. Like a student who memorized textbook answers but can't solve new problems.
Regularization is the "anti-cheating" toolkit, ensuring models learn genuine patterns:
- L2 (Ridge) — Adds λ||θ||² to loss, keeping parameters small. Effect: smoother, more "conservative" model.
- L1 (Lasso) — Adds λ||θ||₁, can zero out unimportant parameters — automatic feature selection.
- Dropout — Randomly deactivates neurons during training (e.g., 50%), preventing over-reliance on specific pathways. All neurons active during testing.
- Batch Normalization — Normalizes each layer's inputs (mean=0, variance=1), speeding up training and stabilizing gradients.
- Cross-Validation — Split data into K folds; rotate one fold as test, rest as training. Average over K trials gives an objective estimate of generalization.
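The L2 item above has a closed form worth seeing: adding λ||θ||² to least squares turns the Normal Equation into (XᵀX + λI)θ = Xᵀy. A sketch comparing ridge to plain least squares on a synthetic few-samples/many-features problem (all data illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 30, 20                            # few samples, many features
X = rng.normal(size=(n, d))
y = X[:, 0] + rng.normal(0, 0.1, n)      # only feature 0 actually matters

lam = 1.0
# Ridge closed form: theta = (X^T X + lam * I)^{-1} X^T y
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
theta_ols   = np.linalg.lstsq(X, y, rcond=None)[0]

# The penalty shrinks the coefficient vector toward zero
norm_ridge = np.linalg.norm(theta_ridge)
norm_ols   = np.linalg.norm(theta_ols)
```

Via the SVD, each ridge coefficient is the least-squares one scaled by σ²/(σ² + λ) < 1, so the ridge solution's norm is strictly smaller.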
Neural networks are brain-inspired (but don't literally simulate brains) computational models. Their power: given sufficient size, they can theoretically approximate any function (Universal Approximation Theorem). The deep learning explosion happened because we finally had enough data and compute to train large networks.
The simplest neural network has three layers: Input (receives features) → Hidden (learnable feature transformation) → Output (produces predictions).
Each layer does two things: (1) linear transformation (matrix multiply + bias), (2) nonlinear activation. Without activations, a multi-layer network collapses to a single linear model — activations are the key to learning complex patterns.
Activation functions introduce nonlinearity after each layer's linear transformation. Without them, even a deep network can only draw straight lines. With them, the network can draw arbitrarily complex decision boundaries.
Extremely simple: positive values pass through, negatives become zero. Gradient is either 1 or 0 — it never vanishes. Fast to compute. Nearly all modern deep networks use ReLU or variants (Leaky ReLU, GELU, etc.).
Vanishing gradient explained: Sigmoid/Tanh gradients are at most 1 (Sigmoid's is at most 0.25). In deep networks, backpropagation multiplies gradients across layers; many factors below 1 multiplied together approach zero, so early layers barely learn. ReLU's gradient is exactly 1 for positive inputs, which largely eliminates the problem (though units stuck in the negative region still get zero gradient).
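The multiplied-factors argument is a one-line computation. A sketch comparing the best-case Sigmoid gradient chain against ReLU across 20 layers (the depth is an illustrative choice):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)               # peaks at 0.25, at z = 0

def relu_grad(z):
    return (z > 0).astype(float)     # 1 for positive inputs, else 0

# Backprop multiplies one such factor per layer. Even at Sigmoid's
# best case (z = 0, gradient 0.25), 20 layers shrink it to 0.25^20.
depth = 20
sig_chain = float(sigmoid_grad(0.0)) ** depth
relu_chain = float(relu_grad(np.array(1.0))) ** depth
```

0.25²⁰ is on the order of 10⁻¹², while the ReLU chain stays exactly 1 along active paths.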
Backpropagation is the core algorithm for training neural networks. Two phases:
Forward pass: Data flows through each layer to produce predictions, then compute loss.
Backward pass: Starting from the output, use the chain rule layer by layer. For a two-layer network (input x → hidden → output ŷ):
Define the error signal δ (each layer's "contribution" to the final loss):
With δ, each layer's gradients are simple: ∂J/∂Wₕ = δₕ xᵀ, ∂J/∂Wₒ = δₒ aₕᵀ. This is why the activation function's derivative h′(z) matters so much — Sigmoid's h′ is tiny, causing δ to decay layer by layer (vanishing gradients).
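The δ formulas can be checked against a numerical derivative, a standard sanity test for any backprop implementation. A sketch for a tiny two-layer regression network with tanh activation and L2 loss (the sizes and initialization are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=(3, 1))              # one input with 3 features
y = np.array([[0.5]])                    # regression target

Wh = rng.normal(size=(4, 3)) * 0.5       # hidden-layer weights
Wo = rng.normal(size=(1, 4)) * 0.5       # output-layer weights

def forward(Wh, Wo):
    z = Wh @ x                           # hidden pre-activation
    a = np.tanh(z)                       # hidden activation
    yhat = Wo @ a                        # linear output (regression)
    return z, a, yhat, 0.5 * float((yhat - y) ** 2)

# Backward pass: error signals delta, as in the notes
z, a, yhat, J = forward(Wh, Wo)
delta_o = yhat - y                               # dJ/d(yhat) for L2 loss
delta_h = (Wo.T @ delta_o) * (1 - np.tanh(z)**2) # chain rule through tanh
grad_Wo = delta_o @ a.T                          # dJ/dWo = delta_o a^T
grad_Wh = delta_h @ x.T                          # dJ/dWh = delta_h x^T

# Numerical check on one entry of Wh via finite differences
eps = 1e-6
Wh2 = Wh.copy(); Wh2[0, 0] += eps
numeric = (forward(Wh2, Wo)[3] - J) / eps
```

The analytic entry grad_Wh[0, 0] and the finite-difference estimate agree to several decimal places, which is the usual evidence that the δ bookkeeping is right.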
Convolution kernels (e.g., 3×3 patches) slide across the image like a small window, computing dot products at each position. Different kernels learn to detect different patterns — horizontal edges, color changes, corners, etc.
Padding modes determine output size:
- Valid (no padding): output = (N−K+1), image shrinks
- Same (zero-pad): output = N, size preserved
- Full (full padding): output = (N+K−1), image grows
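The three output-size formulas can be confirmed with numpy's 1-D convolution, which supports exactly these modes. A sketch with N = 10 and K = 3 (an averaging kernel, chosen for easy-to-check values):

```python
import numpy as np

x = np.arange(10.0)          # a 1-D "image" of length N = 10
k = np.ones(3) / 3           # a length-K = 3 averaging kernel

y_valid = np.convolve(x, k, mode="valid")   # length N - K + 1 = 8
y_same  = np.convolve(x, k, mode="same")    # length N     = 10
y_full  = np.convolve(x, k, mode="full")    # length N + K - 1 = 12
```

The first "valid" output is the average of [0, 1, 2], i.e., 1.0; no padding values ever enter a "valid" convolution.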
CNN's most fascinating property: early layers automatically learn edges and corners → middle layers learn textures and parts → deep layers learn full semantic concepts (faces, cars, animals). This hierarchy remarkably mirrors the human visual cortex V1 → V2 → ... processing pipeline.
5 conv + 3 FC layers, 60 million parameters. Three key innovations: ReLU replacing Sigmoid, Max Pooling for downsampling, Dropout for regularization. It reduced ImageNet (1.2M training images, 1000 classes) top-5 error from 25.6% to 15.3%, the moment the world realized deep learning's power. Deep learning has dominated computer vision ever since.
ML math has three pillars: linear algebra (handling data and transformations), probability & statistics (reasoning from data), and optimization (training models). You don't need to learn everything upfront — learn as you go, and come back to look things up as needed.
In ML, data, parameters, and transformations are all matrices and vectors. Linear algebra is the toolkit for working with them.
- SVD — Any matrix A can be decomposed as A = UΣVᵀ, where U, V are orthogonal and Σ is diagonal (singular values). The math behind PCA: apply SVD to the covariance matrix; the top k singular vectors are the principal components.
A = UΣVᵀ, A ∈ ℝᵐ×ⁿ, U ∈ ℝᵐ×ᵐ, Σ ∈ ℝᵐ×ⁿ, V ∈ ℝⁿ×ⁿ
- Eigenvalues/Eigenvectors — Av = λv. For centered data, the covariance matrix C = (1/n)XᵀX has eigenvalues λᵢ equal to the variance along the i-th principal component. PCA retains the directions of the top k eigenvalues, maximizing the retained variance:
Explained variance ratio = ∑ᵢ₌₁ᵏ λᵢ / ∑ᵢ₌₁ᵈ λᵢ
- Matrix Inverse & Pseudoinverse — Closed-form needs (XᵀX)⁻¹. When singular, use Moore-Penrose pseudoinverse X⁺ = VΣ⁺Uᵀ (directly from SVD).
- Vector Norms — L2 norm ||x||₂ = √(∑xᵢ²), L1 norm ||x||₁ = ∑|xᵢ|. Regularization terms are parameter norms.
- Positive Definite Matrices — A is positive definite ⇔ xᵀAx > 0 for all x ≠ 0. A positive definite Hessian means the function is strictly convex (unique global optimum).
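The SVD and pseudoinverse facts in the list are directly checkable with numpy. A sketch on a small 3×2 full-column-rank matrix (the matrix is arbitrary):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])               # 3x2, rank 2

# Thin SVD: A = U diag(s) V^T reconstructs A exactly
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_rebuilt = U @ np.diag(s) @ Vt

# Moore-Penrose pseudoinverse (computed from the SVD internally)
A_pinv = np.linalg.pinv(A)
# For full column rank, A^+ is a left inverse: A^+ A = I (2x2)
check = A_pinv @ A
```

For rank-deficient matrices A⁺A is no longer the identity, but A⁺ still gives the minimum-norm least-squares solution, which is exactly why it replaces (XᵀX)⁻¹ in the singular case.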
The theoretical foundation for learning from data — making optimal inferences under uncertainty.
- Bayes' Theorem:
P(θ|D) = P(D|θ) P(θ) / P(D) ∝ likelihood × prior. Core idea: "After seeing data, how should I update my beliefs?" MAP (Maximum A Posteriori) = MLE + prior; with a Gaussian prior, MAP is equivalent to L2 regularization!
- Maximum Likelihood (MLE):
θₘₗₑ = argmax ∏ᵢ P(xᵢ|θ) = argmax ∑ᵢ log P(xᵢ|θ). Taking the log turns products into sums (numerically stable, easy to differentiate). MSE is MLE under Gaussian noise; Cross-Entropy is MLE under Bernoulli.
- KL Divergence — Measures the "distance" between distributions:
Dₖₗ(P||Q) = ∑ₓ P(x) log(P(x)/Q(x)) ≥ 0. Minimizing Cross-Entropy ⇔ minimizing the KL divergence between the predicted distribution Q and the true distribution P.
- Covariance Matrix — C = E[(X−μ)(X−μ)ᵀ]; diagonal entries are variances, off-diagonal entries are covariances. The multivariate Gaussian N(μ, Σ) is fully defined by its mean and covariance.
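Two of the claims above are quick to verify numerically: the MLE of a Gaussian mean is the sample mean, and KL divergence is nonnegative and zero only for identical distributions. A sketch (data and distributions are illustrative):

```python
import numpy as np

# MLE for a Gaussian mean: argmax of sum_i log N(x_i | mu, 1)
# is the sample mean (set the derivative of the log-likelihood to zero)
rng = np.random.default_rng(8)
data = rng.normal(2.0, 1.0, 10000)
mu_mle = data.mean()                       # close to the true mean 2.0

# KL divergence between two discrete distributions
def kl(P, Q):
    return float(np.sum(P * np.log(P / Q)))

P = np.array([0.5, 0.5])
Q = np.array([0.9, 0.1])
d_pq = kl(P, Q)                            # > 0: the distributions differ
d_pp = kl(P, P)                            # = 0: identical distributions
```

Note KL is not symmetric (kl(P, Q) ≠ kl(Q, P) in general), which is why "distance" appears in quotes above.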
Training models = minimizing loss = optimization. Understanding optimization helps diagnose "why isn't my model learning?"
- Convex Functions — f is convex ⇔ f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y), equivalently Hessian H ⪰ 0 (positive semi-definite). Convex functions have a single global optimum. Linear/Logistic regression losses are convex; SVM is convex (quadratic program). Deep learning losses are non-convex (saddle points and local optima exist).
- Gradient & Hessian — The gradient ∇f ∈ ℝᵈ points in the direction of steepest ascent. The Hessian H ∈ ℝᵈ×ᵈ describes curvature:
∇f = [∂f/∂θ₁, ..., ∂f/∂θᵈ]ᵀ, Hᵢⱼ = ∂²f / ∂θᵢ∂θⱼ. Optimality conditions (unconstrained): first-order ∇f = 0, second-order H ⪰ 0.
- Lagrange Multipliers & KKT — For constrained optimization min f(x) s.t. gᵢ(x) ≤ 0:
L(x, α) = f(x) + ∑ᵢ αᵢ gᵢ(x), with αᵢ ≥ 0 and αᵢ gᵢ(x) = 0 (complementary slackness). SVM's dual is derived via the KKT conditions; complementary slackness explains why only support vectors have αᵢ > 0.
- Convergence Rates — On smooth convex problems, GD reaches error O(1/t) (and a geometric, so-called "linear," rate under strong convexity); Newton's method converges locally quadratically; SGD converges at roughly O(1/√t). Adaptive methods like Adam combine momentum with per-coordinate learning-rate scaling:
Adam: mᵗ = β₁mᵗ⁻¹ + (1−β₁)gᵗ, vᵗ = β₂vᵗ⁻¹ + (1−β₂)(gᵗ)², θ = θ − α m̂ᵗ/(√v̂ᵗ + ε)
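The Adam update line above maps one-to-one onto code. A minimal sketch using the common default hyperparameters (α = 0.01, β₁ = 0.9, β₂ = 0.999), run on an illustrative 1-D quadratic:

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update, matching the formula above (t counts from 1)."""
    m = b1 * m + (1 - b1) * g            # momentum (1st-moment estimate)
    v = b2 * v + (1 - b2) * g**2         # 2nd-moment estimate
    m_hat = m / (1 - b1**t)              # bias correction
    v_hat = v / (1 - b2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize (theta - 5)^2 with Adam
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 3001):
    g = 2 * (theta - 5.0)
    theta, m, v = adam_step(theta, g, m, v, t)
```

Because m̂/√v̂ has magnitude near 1, Adam's early progress is roughly α per step regardless of the gradient's scale, one reason it needs less learning-rate tuning than plain SGD.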