Chapter 5 Machine Learning Basics
Reading notes on the book *Deep Learning*
I am writing these notes for two reasons: first, to tie the whole chapter together from a high-level perspective, which deepens my own understanding and gives me something to review later; second, so that if this topic comes up in a future group meeting I will have something ready to present (yes, I'm being lazy -_-).
Since I am just getting started, there are many things I do not understand, or do not understand thoroughly, so there are bound to be mistakes and omissions; corrections and suggestions are very welcome~~
5.1 Learning Algorithms
First, the chapter gives the classic definition of ML:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P , if its performance at tasks in T , as measured by P , improves with experience E.
The chapter then devotes three subsections to describing the three entities in this definition in detail (to provide a formal definition of what may be used for each of these entities), and in the last subsection walks through a Linear Regression example.
In addition, the concept of the design matrix is introduced here.
See the book for the details.
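As a quick reminder of what a design matrix looks like, here is a minimal NumPy sketch (the numbers are made up purely for illustration): each example is a row and each feature is a column.

```python
import numpy as np

# Design matrix: one row per example, one column per feature.
# Here m = 4 examples and n = 3 features; the values are illustrative only.
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 3.4, 5.4],
    [5.9, 3.0, 5.1],
])
y = np.array([0, 0, 1, 1])  # one label per row of X (supervised setting)

m, n = X.shape
print(m, n)  # -> 4 3
```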
5.2 Capacity, Overfitting and Underfitting
Machine Learning trains its models with Optimization methods, but it is not the same thing as Optimization. Pure Optimization simply wants the accuracy on the training data to be as high as possible (which eventually overfits), whereas what ML ultimately wants is for the generalization/test error to be as low as possible. Chapter 8 has a subsection dedicated to comparing the two; see there for details.
How can we affect performance on the test set when we get to observe only the training set?
Make some assumptions: These assumptions are that the examples in each dataset are independent from each other, and that the train set and test set are identically distributed, drawn from the same probability distribution as each other.
Capacity
Capacity here refers to the model's ability to fit the training dataset, so higher Capacity is not necessarily better: it may lead to overfitting. One way to control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm is allowed to select as being the solution. Note also that the specific form of those functions affects the model's Capacity. In connection with Capacity, the book mentions the principle of Occam's razor: when models perform comparably, prefer the simpler one, the goal being better generalization.
We must remember that while simpler functions are more likely to generalize (to have a small gap between training and test error) we must still choose a sufficiently complex hypothesis to achieve low training error.
In other words, in practice one tends to choose a sufficiently complex model and then use Regularization techniques (weight decay, dropout, early stopping, and so on, with cross validation used to choose among them) to limit the degree of overfitting (the CS231n course emphasizes this point). The relationship between Capacity and Generalization error is illustrated in the corresponding figure in the book. Parametric models learn a function described by a parameter vector whose size is finite and fixed before any data is observed. Non-parametric models have no such limitation. Note that non-parametric models are not models without parameters; it is just that the number of parameters is not fixed.
Note that it is possible for the model to have optimal capacity and yet still have a large gap between training and generalization error. In this situation, we may be able to reduce this gap by gathering more training examples.
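A small, self-contained illustration of this capacity/generalization trade-off, assuming nothing beyond NumPy: fit polynomials of increasing degree to a toy quadratic dataset and compare training and test MSE (the data, degrees, and noise level are all made up for this sketch).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data from a quadratic ground truth plus noise (values are illustrative).
def make_data(m):
    x = rng.uniform(-1, 1, m)
    y = 1.5 * x**2 - x + 0.5 + rng.normal(0, 0.1, m)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

# Sweep model capacity via polynomial degree: too low underfits, too high overfits.
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit on the training set only
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree}: train MSE={train_err:.4f}, test MSE={test_err:.4f}")
```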
5.2.1 The No Free Lunch Theorem
In some sense, no machine learning algorithm is universally any better than any other.
The no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task.
5.2.2 Regularization
Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
Regularization is there to reduce the degree of overfitting as much as possible; for example, the fit obtained with L2/weight decay regularization is shown in the corresponding figure in the book, and the effect of regularization there is very noticeable. Regularization can be implemented in many ways, but which one to use must be chosen according to the specific problem being solved (we must choose a form of regularization that is well-suited to the particular task). Chapter 7 is devoted entirely to this; refer to the book for details.
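As a concrete sketch of weight decay, here is L2-regularized linear regression in closed form; the `ridge_fit` helper, the penalty form, and the toy data are my own illustrative choices, not code from the book.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Weight decay (L2) in closed form: minimize ||Xw - y||^2 + lam * ||w||^2,
    # whose solution is w = (X^T X + lam * I)^{-1} X^T y.
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Toy example (values are illustrative): larger lam shrinks the weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(0, 0.1, 50)

for lam in (0.0, 1.0, 100.0):
    w = ridge_fit(X, y, lam)
    print(lam, np.round(w, 3))
```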
5.3 Hyperparameters and Validation Sets
Hyperparameters (such as the learning rate, the batch size, or the strength of regularization) are not learned, for two reasons:
- they are difficult to optimize directly;
- if they were learned, the result could be overfitting (If learned on the training set, such hyperparameters would always choose the maximum possible model capacity, resulting in overfitting (refer to figure 5.3)).
In contrast to Hyperparameters are the parameters (such as the weight matrix W and the bias b), which are learned; they control how the features the model has learned affect the subsequent prediction (page 107). From a certain point of view hyperparameters can also be "learned". For the learning rate, for example, you initially do not know which value is good, so you can define an interval and use random search (the CS231n course says random search works better than grid search; Chapter 8 of this book also discusses and compares the two, see that section) to pick values from the interval, run a few epochs on a small mini-batch to check which learning rate (or which sub-interval) works better, then narrow the interval and repeat the selection in the same way, which is itself a kind of learning (the first CS231n assignment has a small question about this). It is important that the test examples are not used in any way to make choices about the model, including its hyperparameters.
That is, during training you must not use the test error on the test dataset to decide which Hyperparameters to pick or when to stop training, because that effectively turns the test dataset into a training dataset in disguise. The correct approach is to set aside part of the training dataset (disjoint subsets, typically 20%) as a validation dataset for this job (the CS231n course also stresses this point).
The validation set is used to "train" the hyperparameters.
After all hyperparameter optimization is complete, the generalization error may be estimated using the test set.
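Below is a minimal sketch of the random-search procedure described above, assuming only NumPy; the linear model, the `val_error` helper, the log-uniform range for the learning rate, and all the data are my own assumptions for illustration, not something prescribed by the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data; the last 20% is held out as a validation set.
X = rng.normal(size=(500, 10))
true_w = rng.normal(size=10)
y = X @ true_w + rng.normal(0, 0.1, 500)
split = int(0.8 * len(X))
X_tr, y_tr, X_val, y_val = X[:split], y[:split], X[split:], y[split:]

def val_error(lr, epochs=20, batch=32):
    # Train a linear model with mini-batch gradient descent at this learning
    # rate, then report MSE on the validation set (never on the test set).
    w = np.zeros(X_tr.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X_tr))
        for start in range(0, len(X_tr), batch):
            b = idx[start:start + batch]
            grad = 2 * X_tr[b].T @ (X_tr[b] @ w - y_tr[b]) / len(b)
            w -= lr * grad
    return np.mean((X_val @ w - y_val) ** 2)

# Random search: sample learning rates log-uniformly and keep the best one.
candidates = 10 ** rng.uniform(-4, -1, size=10)
best_lr = min(candidates, key=val_error)
print("best learning rate:", best_lr)
```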
5.3.1 Cross-Validation
The usual choice is k-fold cross-validation (the first CS231n assignment has a small question about cross-validation). Note that the split must guarantee non-overlapping subsets.
I still have an open question here.
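For reference, a small k-fold cross-validation sketch assuming only NumPy; the `k_fold_indices` and `cross_val_mse` helpers and the least-squares model are illustrative choices of mine.

```python
import numpy as np

def k_fold_indices(m, k, seed=0):
    # Split m example indices into k non-overlapping folds.
    idx = np.random.default_rng(seed).permutation(m)
    return np.array_split(idx, k)

def cross_val_mse(X, y, fit, predict, k=5):
    # Each fold is used once as the held-out set; the rest is used for training.
    errors = []
    for fold in k_fold_indices(len(X), k):
        train = np.setdiff1d(np.arange(len(X)), fold)
        model = fit(X[train], y[train])
        errors.append(np.mean((predict(model, X[fold]) - y[fold]) ** 2))
    return np.mean(errors)

# Usage with a plain least-squares linear model (illustrative data).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 100)
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda w, X: X @ w
print("5-fold CV MSE:", cross_val_mse(X, y, fit, predict))
```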
5.4 Estimators, Bias and Variance
These are all tools for assessing how good the model learned by a machine learning algorithm is; each of them measures the performance of the algorithm from a different angle.
5.4.1 Point Estimation
Point Estimator
Because the definition places no constraints on the function (The definition does not require that g return a value that is close to the true θ or even that the range of g is the same as the set of allowable values of θ), almost any function qualifies as an estimator.
There is nevertheless such a thing as a good estimator: a good estimator is a function whose output is close to the true underlying θ that generated the training data.
The estimator is exactly the model (function) that the machine learning algorithm is trying to learn, and the goal of course is to end up with a good estimator.
Function Estimation
- The model learned by a machine learning algorithm is really just a function. Point Estimation finds the parameters of that function, which from a certain point of view is the same as finding the function itself.
5.4.2 Bias
- Bias
- Measures the model's ability to fit the training data (not necessarily the whole training dataset, but only the portion actually used to train it, e.g. a mini-batch).
- The smaller the bias, the stronger the ability to fit (possibly producing overfitting); the larger the bias, the weaker the fit (possibly producing underfitting).
- Note that bias is defined with respect to a single model, which is one way it differs from variance.
This is fairly easy to grasp: the word bias itself suggests a deviation from some reference value, and for an estimator that reference value is the true parameter, with the deviation measured in expectation over the data; the formal definition is written out below.
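For reference, the definition from the book (the equation is the book's; only the sample-mean remark below it is an added illustration):

$$
\operatorname{bias}(\hat{\theta}_m) \;=\; \mathbb{E}\big[\hat{\theta}_m\big] - \theta
$$

For example, the sample mean $\hat{\mu}_m = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$ of i.i.d. Gaussian samples satisfies $\mathbb{E}[\hat{\mu}_m] = \mu$, so its bias is zero (it is an unbiased estimator).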
5.4.3 Variance and Standard Error
Variance
- Measures the model's generalization ability.
- The smaller the variance, the better the model generalizes; the larger the variance, the worse it generalizes.
- It is defined with respect to multiple models. What does "with respect to multiple models" mean? See the analysis below.
When training a model we do not necessarily train on the entire training dataset in one go; instead we sample data points from the training dataset (e.g. a mini-batch) and train on those.
Because the data points are always sampled from the same dataset, they are guaranteed to be identically distributed. But precisely because the sampled points differ from draw to draw, the learned models may well differ too.
For example, suppose we sample data points four times and train the algorithm on each of the resulting mini-batches; the four learned models can turn out quite different.
Is that a good outcome? Certainly not: the algorithm learns a different model every time, which means it has not captured the underlying distribution of the data, so how could it possibly make good predictions?
What we want is that no matter how the data points are sampled, the learned model does not change much (for instance, it is always a straight line, or always a quadratic curve). That would indicate the algorithm has learned the data distribution well, so when new (identically distributed) data arrives, the learned model can predict it well (i.e. the model generalizes well).
Variance is precisely the quantity that measures how much the different models, trained each time on a different mini-batch, differ from one another. The smaller the variance, the smaller the differences between the models trained on different mini-batches -> the learned model can predict new (identically distributed) data well (the algorithm has captured the distribution of the data) -> the model generalizes well. Conversely, the larger the variance, the worse the generalization. Seen this way, it is now clear why variance is defined with respect to multiple (different) models, and the difference from bias is also plain to see. So how does variance connect to cross-validation? See the analysis in the next subsection.
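Here is a small simulation of this idea, assuming only NumPy: the same learning algorithm (polynomial fitting, with degrees chosen by me purely for illustration) is trained on many independently sampled training sets, and we look at how much the resulting predictions at a fixed point spread out; that spread is the variance being discussed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth function and a sampler for small training sets (all illustrative).
f = lambda x: np.sin(3 * x)
def sample_training_set(m=15):
    x = rng.uniform(-1, 1, m)
    return x, f(x) + rng.normal(0, 0.2, m)

# Train the "same" algorithm on many independently sampled training sets and
# look at how much the resulting models disagree at a fixed query point x0.
x0 = 0.3
for degree in (1, 9):
    preds = []
    for _ in range(200):
        x, y = sample_training_set()
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x0))
    # High spread across models = high variance of the learned predictor.
    print(f"degree={degree}: variance of prediction at x0 = {np.var(preds):.4f}")
```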
5.4.4 Trading off Bias and Variance to Minimize Mean Squared Error
Bias and variance measure two different sources of error in an estimator.
Bias measures the expected deviation from the true value of the function or parameter.
Variance on the other hand, provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause.
The most common way to negotiate this trade-off is to use cross-validation. Empirically, cross-validation is highly successful on many real-world tasks.
What is the relationship between bias, variance, and the error rate? When the loss function is MSE, the following decomposition holds:

$$
\mathrm{MSE} = \mathbb{E}\big[(\hat{\theta}_m - \theta)^2\big] = \mathrm{Bias}(\hat{\theta}_m)^2 + \mathrm{Var}(\hat{\theta}_m)
$$

The relationship between bias and variance as capacity changes is shown in the figure in section 5.4.4 of the book. Note that the model's optimal capacity is not the point where the bias and variance curves intersect.
5.4.5 Consistency
- Consistency ensures that the bias induced by the estimator diminishes as the number of data examples grows.
5.5 Maximum Likelihood Estimation
This section is explained very well.
Maximum Likelihood Estimation: from the observed sample outcomes, infer the parameter values that are most likely (have the highest probability) to have produced those outcomes. This part mainly explains how maximum likelihood is turned into minimization of the negative log-likelihood, and it does so very well.
The book points out that many people use the term "cross-entropy" to identify specifically the negative log-likelihood of a Bernoulli or softmax distribution, but that is a misnomer:
Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model.
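The transformation the notes refer to can be written out as follows (this is the book's chain of reasoning: taking the log does not change the argmax, and flipping the sign turns maximization into minimization of the negative log-likelihood):

$$
\theta_{\mathrm{ML}}
  = \arg\max_{\theta} \prod_{i=1}^{m} p_{\mathrm{model}}\big(x^{(i)};\theta\big)
  = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\mathrm{model}}\big(x^{(i)};\theta\big)
  = \arg\min_{\theta} \; -\,\mathbb{E}_{\mathbf{x}\sim\hat{p}_{\mathrm{data}}}\big[\log p_{\mathrm{model}}(x;\theta)\big]
$$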
Question: what is the relationship between Maximum Likelihood Estimation and BP / GD?
5.5.1 Conditional Log-Likelihood and Mean Squared Error
The maximum likelihood estimator can readily be generalized to the case where our goal is to estimate a conditional probability P(y | x; θ) in order to predict y given x. This is actually the most common situation because it forms the basis for most supervised learning. If X represents all our inputs and Y all our observed targets, then the conditional maximum likelihood estimator is

$$
\theta_{\mathrm{ML}} = \arg\max_{\theta} P(Y \mid X; \theta)
$$

If the examples are assumed to be i.i.d., then this can be decomposed into

$$
\theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log P\big(y^{(i)} \mid x^{(i)}; \theta\big)
$$
Then, using the Linear Regression as Maximum Likelihood example, the book shows how minimizing MSE can be recovered from Maximum Likelihood Estimation.
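In outline (following the book's example, with $\hat{y}^{(i)}$ denoting the model's prediction): if we model $p(y \mid x) = \mathcal{N}\big(y;\, \hat{y}(x; w),\, \sigma^2\big)$, then the conditional log-likelihood is

$$
\sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)};\theta\big)
  = -\,m \log \sigma - \frac{m}{2}\log(2\pi)
    - \sum_{i=1}^{m} \frac{\big\|\hat{y}^{(i)} - y^{(i)}\big\|^2}{2\sigma^2}
$$

Since $\mathrm{MSE}_{\mathrm{train}} = \frac{1}{m}\sum_{i=1}^{m}\|\hat{y}^{(i)} - y^{(i)}\|^2$, maximizing this log-likelihood with respect to $w$ is the same as minimizing the training MSE.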
5.5.2 Properties of Maximum Likelihood
This mainly covers the advantages of Maximum Likelihood.
On why one should use Maximum Likelihood Estimation rather than MSE, you can refer to the blog post below, which also explains it very well.
5.6 Bayesian Statistics
Ahhh, as soon as I see the math I lose the will to read on -_-
5.6.1 Maximum A Posteriori (MAP) Estimation
5.7 Supervised Learning Algorithms
This part only briefly introduces some relatively simple Supervised Learning Algorithms; if you want to understand these algorithms in depth, this alone is certainly not enough.
- One weakness of k-nearest neighbors is that it cannot learn that one feature is more discriminative than another. (p. 143)
5.7.1 Probabilistic Supervised Learning
5.7.2 Support Vector Machines
5.7.3 Other Simple Supervised Learning Algorithms
5.8 Unsupervised Learning Algorithms
Informally, unsupervised learning refers to most attempts to extract information from a distribution that do not require human labor to annotate examples.
There are multiple ways of defining a simpler representation. Three of the most common include (p. 146):
- lower dimensional representations
- sparse representations
- independent representations
5.8.1 Principal Components Analysis
- Works through the PCA example (a small sketch follows below).
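A minimal PCA sketch via the SVD, assuming only NumPy; the `pca` helper and the toy 3-D data are illustrative choices of mine, not the book's code.

```python
import numpy as np

def pca(X, k):
    # Center the data, then take the top-k right singular vectors as the
    # principal directions; project onto them for a k-dimensional representation.
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W = Vt[:k].T                     # (n, k) projection matrix
    return X_centered @ W, W

# Illustrative data: 3-D points that mostly vary along one direction.
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = t @ np.array([[2.0, 1.0, -1.0]]) + rng.normal(0, 0.05, (200, 3))

Z, W = pca(X, k=1)
print(Z.shape, W.shape)  # -> (200, 1) (3, 1)
```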
5.8.2 k-means Clustering
- On page 149 the k-means example is used to explain that the one-hot form (a rather extreme kind of sparse representation) loses a lot of useful information, and that a distributed representation (such as word embeddings in NLP) can achieve better results. A small k-means sketch follows below.
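For reference, a plain k-means sketch assuming only NumPy (the `kmeans` helper and the two-blob data are mine); note how each point ends up with a single cluster index, i.e. effectively a one-hot code, which is exactly the information-poor representation the note above refers to.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # Plain k-means: alternate between assigning each point to its nearest
    # centroid (a one-hot cluster code) and recomputing the centroids.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)                    # cluster index per point
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    return assign, centroids

# Illustrative 2-D data with two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
assign, centroids = kmeans(X, k=2)
print(centroids)
```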
5.9 Stochastic Gradient Descent
Strictly speaking, SGD refers to using only 1 example per update (which makes training very slow), but the term is often conflated with mini-batch GD (which speeds training up); in these notes the two are kept separate. Why mini-batch GD can speed up convergence can be explained with an extreme example (this example comes from the CS231n course): suppose there are 1000 training examples in 10 classes, 100 examples per class, the mini-batch size is 10, and every time we sample a mini-batch we happen to pick exactly one example from each class. In this extreme case the mini-batch is a perfect representation of the whole set of training examples, so even a completely off-the-mark parameter setting can be moved (almost) all the way to a good configuration very quickly.
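A minimal sketch of the distinction, assuming only NumPy and a linear least-squares model of my own choosing: `batch=1` corresponds to SGD in the strict sense, while larger values give mini-batch GD.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.05, batch=16, epochs=30, seed=0):
    # Mini-batch gradient descent for linear least squares: each step uses the
    # gradient estimated on a small random batch instead of the full dataset.
    # batch=1 is SGD in the strict sense; larger batch gives mini-batch GD.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch):
            b = idx[start:start + batch]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w

# Illustrative data: the estimate should end up close to true_w.
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(0, 0.1, 1000)
print(np.round(minibatch_sgd(X, y), 2))
```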
5.10 Building a Machine Learning Algorithm
Here, Building a Machine Learning Algorithm is viewed as something like assembling building blocks: the whole algorithm decomposes into a few parts. The benefit of this view is that it connects different Machine Learning Algorithms to one another instead of treating them as isolated. (Nearly all deep learning algorithms can be described as particular instances of a fairly simple recipe: combine a specification of a dataset, a cost function, an optimization procedure and a model.)
This way of looking at things applies to both supervised and unsupervised learning (The recipe for constructing a learning algorithm by combining models, costs, and optimization algorithms supports both supervised and unsupervised learning.).
5.11 Challenges Motivating Deep Learning
This section mainly describes the Challenges encountered in Deep Learning; it reads a bit like popular science and makes for a nice casual read.
5.11.1 The Curse of Dimensionality
Dimensionality explosion.
For example, when words are represented in one-hot form, the dimensionality of each word's one-hot vector grows with the size of the vocabulary. The number of possible configurations of several words then grows multiplicatively, and the space quickly becomes too large to handle.
5.11.2 Local Constancy and Smoothness Regularization
- Among the most widely used of these implicit “priors” is the smoothness prior or local constancy prior. This prior states that the function we learn should not change very much within a small region.
5.11.3 Manifold Learning
Manifold learning.
Its main purpose is Dimension Reduction, of which PCA is one example. You can refer to the links below.