2017-02-25

Chapter 7 Regularization for Deep Learning

The chapter opens with a definition of regularization:

any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.

In other words, the goal of regularization is to reduce overfitting and improve generalization by controlling model complexity.

Regularization strategies can be roughly grouped as follows:
- By target / mechanism
  - put extra constraints on a machine learning model, such as adding restrictions on the parameter values.
  - add extra terms in the objective function that can be thought of as corresponding to a soft constraint on the parameter values.
- By motivation
  - encode specific kinds of prior knowledge
  - express a generic preference for a simpler model class in order to promote generalization
An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing the bias.
Controlling the complexity of the model is not a simple matter of finding the model of the right size, with the right number of parameters.
Instead, we might find—and indeed in practical deep learning scenarios, we almost always do find—that the best fitting model (in the sense of minimizing generalization error) is a large model that has been regularized appropriately.

7.1 Parameter Norm Penalties

The most common approach is to add a parameter norm penalty to the objective function. The rationale is to limit the capacity of the model.

The regularized objective function then has the general form:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)

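Here α controls the relative strength of the penalty term Ω versus the original objective J. In neural networks a different α could be chosen for each layer, but it is hard to determine the right values (it can be expensive to search for the correct value of multiple hyperparameters), so the same value is usually used for all layers. Different choices of Ω(θ) produce different effects.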
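Biases are usually not regularized. First, each bias controls only a single variable, so leaving it unregularized adds little variance and regularizing it brings little benefit. Second, regularizing the biases can introduce significant underfitting.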

The different parameter norm penalties are introduced below, mainly L2 and L1.

7.1.1 L2 Parameter Regularization

This is what is commonly called weight decay. The objective function becomes:

J̃(w; X, y) = (α/2) wᵀw + J(w; X, y)

This is the most commonly used form of regularization. Its effect on each parameter update step is:

the addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a constant factor on each step
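The multiplicative shrink in the quote above can be checked numerically. The following is a minimal sketch, assuming plain gradient descent with learning rate `lr`; the function name, the example weights, and the gradient values are hypothetical and chosen only for illustration.

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad_J, lr, alpha):
    """One gradient step on J~(w) = (alpha / 2) * w.T @ w + J(w).

    The gradient of the penalty is alpha * w, so the update
    w - lr * (alpha * w + grad_J) can be regrouped as a multiplicative
    shrink of w by (1 - lr * alpha) followed by the usual gradient step.
    """
    return (1.0 - lr * alpha) * w - lr * grad_J

w = np.array([1.0, -2.0, 0.5])
grad_J = np.array([0.1, 0.0, -0.3])  # hypothetical gradient of the unregularized loss
lr, alpha = 0.1, 0.5

w_new = sgd_step_with_weight_decay(w, grad_J, lr, alpha)
# Reference: step along the gradient of the full regularized objective.
w_ref = w - lr * (alpha * w + grad_J)
```

Both forms give the same result; the first simply makes the constant shrink factor `1 - lr * alpha` explicit.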

But what happens over the entire course of training?

To answer this question, the book follows with a series of mathematical derivations...

The final answer is:

We can see that L2 regularization causes the learning algorithm to “perceive” the input X as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance

7.1.2 L1 Regularization

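The objective function becomes:

J̃(w; X, y) = α‖w‖₁ + J(w; X, y)

I did not fully follow this subsection. What I took away is its main effect:

Sparsity: some weights are driven to exactly zero.
Because of this sparsity, L1 is often used for feature selection (the CS231n course also mentions this).

Compared with L2, however, it is used less often.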
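To see where the sparsity comes from, it may help to compare the two penalties in the simplified setting the book uses: a quadratic approximation of the loss around the unregularized optimum `w_star`, with a diagonal Hessian whose entries `h_diag` are positive. The sketch below relies on those assumptions; the function names and numbers are illustrative only.

```python
import numpy as np

def l2_solution(w_star, h_diag, alpha):
    """Minimizer with penalty (alpha / 2) * ||w||^2: every weight is rescaled toward zero."""
    return h_diag / (h_diag + alpha) * w_star

def l1_solution(w_star, h_diag, alpha):
    """Minimizer with penalty alpha * ||w||_1: every weight is soft-thresholded."""
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / h_diag, 0.0)

w_star = np.array([3.0, -0.2, 0.05, -1.5])  # hypothetical unregularized optimum
h_diag = np.ones(4)                         # hypothetical diagonal Hessian entries
alpha = 0.5

w_l2 = l2_solution(w_star, h_diag, alpha)   # all entries shrink but stay nonzero
w_l1 = l1_solution(w_star, h_diag, alpha)   # entries with |w*| <= alpha / H become exactly 0
```

L2 only rescales the weights, while L1 sets the small ones to exactly zero, which is what makes it usable for feature selection.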

7.2 Norm Penalties as Constrained Optimization

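I did not understand this section -_-

The discussion here is no longer about penalizing the objective function, but about imposing explicit constraints directly.
Advantages of explicit constraints:
- Another reason to use explicit constraints and reprojection rather than enforcing constraints with penalties is that penalties can cause non-convex optimization procedures to get stuck in local minima corresponding to small θ.
- With a large learning rate, training can fall into a positive feedback loop in which large weights induce large gradients, which in turn induce even larger weight updates. Explicit constraints with reprojection prevent this by keeping the weights from growing without bound.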
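The "explicit constraints and reprojection" idea can be sketched as projected gradient descent: take an ordinary gradient step, then project the weights back onto the allowed region. The sketch below assumes the constraint is an L2 norm ball of radius `radius`; the function names and values are hypothetical.

```python
import numpy as np

def project_to_l2_ball(w, radius):
    """Return the closest point to w whose L2 norm is at most radius."""
    norm = np.linalg.norm(w)
    if norm <= radius:
        return w
    return w * (radius / norm)

def projected_gradient_step(w, grad, lr, radius):
    """Ordinary gradient step followed by reprojection onto the constraint set."""
    return project_to_l2_ball(w - lr * grad, radius)

w = np.array([0.6, 0.8])
grad = np.array([-10.0, -10.0])  # hypothetical large gradient
w_next = projected_gradient_step(w, grad, lr=1.0, radius=1.0)
```

Even with a very large step, the norm of `w_next` cannot exceed `radius`, so the weights cannot run away.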

7.3 Regularization and Under-Constrained Problems

I did not understand this section -_-

7.4 Dataset Augmentation

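The best way to improve a model's performance is to train it on more data, but in practice the dataset is often limited. To work around this, we can transform the existing data (for example by mirroring, cropping, or extracting patches) and add the results to the training set.
This works well for classification-style tasks (not only classification, but also recognition, detection, and so on), but it is hard to apply to some other tasks. For density estimation, for example, generating new data would require already knowing the data distribution, and if we knew that, dataset augmentation would no longer be needed.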

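Notes:
- When augmenting a dataset (especially images) with transformations such as rotation or scaling, make sure the transformation does not change the label (for example, avoid operations that turn a "b" into a "d").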
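- For images, even shifting the image by a few pixels in each direction can often improve generalization considerably, even when the model is already partially translation invariant through convolution and pooling.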
Injecting noise is also a form of dataset augmentation. Noise can be added to:
- the input.
- the hidden units. Dropout is one example. Adding noise to hidden units can be seen as dataset augmentation at multiple levels of abstraction.
- the weights. This is the main topic of the next section.
- the output units. The next section touches on this, but I did not understand it.

Why inject noise?

Because neural networks are not very robust to noise, and one way to improve their robustness is to train them on data with noise added.

Note the distinction between dataset augmentation and pre-processing:
- operations that are generally applicable (such as adding Gaussian noise to the input) are considered part of the machine learning algorithm,
- while operations that are specific to one application domain (such as randomly cropping an image) are considered to be separate pre-processing steps.
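The operations mentioned in this section (mirroring, cropping, injecting Gaussian noise into the input) can be sketched with NumPy as follows. The function names, image size, and noise level are assumptions made for illustration; real pipelines usually apply such transformations on the fly during training.

```python
import numpy as np

def horizontal_flip(image):
    """Mirror an (H, W, C) image left to right. Only valid if the label is
    unchanged by mirroring (it would not be for telling 'b' from 'd')."""
    return image[:, ::-1]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a random crop_h x crop_w window out of an (H, W, C) image."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

def add_gaussian_noise(image, sigma, rng):
    """Inject zero-mean Gaussian noise with standard deviation sigma into the input."""
    return image + rng.normal(0.0, sigma, size=image.shape)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # hypothetical 32x32 RGB image

flipped = horizontal_flip(image)
cropped = random_crop(image, 28, 28, rng)
noisy = add_gaussian_noise(image, 0.05, rng)
```

Each transformed copy keeps the original label and is added to the training set alongside the original image.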

7.5 Noise Robustness

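Noise injection can improve a neural network much more than simply shrinking the parameters, especially when the noise is added to the hidden units.
Adding noise to the weights works well for regularizing RNNs (though I recall the CS231n course suggesting RNNs do not need it?). This paragraph was removed in the newer version of the book; I do not know why.
The book then uses an MLP as an example of adding noise to the weights, but I did not understand the point it was making...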

7.5.1 Injecting Noise at the Output Targets

I did not understand this subsection -_-

7.6 Semi-Supervised Learning

In the paradigm of semi-supervised learning, both unlabeled examples from P(x) and labeled examples from P(x, y) are used to estimate P(y | x) or predict y from x.
In the context of deep learning, semi-supervised learning usually refers to learning a representation h = f(x). The goal is to learn a representation so that examples from the same class have similar representations.

7.7 Multi-Task Learning

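Several different but related tasks share part of a network, with the aim of improving all of them together. The shared part of the network acts as a feature extractor; information from the different tasks is complementary, which makes the shared features better.
The following is an example of multi-task learning.

Multi-task learning can be split into two stages: first learn the features that can be shared, then carry out the task-specific work.

In the figure above, x is the shared input. From it the network learns a shared feature representation (h(shared) in the figure), and on top of h(shared) each task does its own work (learning a task-specific feature representation and then making its prediction).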
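The two-stage structure described above can be sketched as a forward pass with one shared layer and two task-specific heads. This is only an illustrative sketch: the layer sizes, the tanh nonlinearity, the random weights, and all variable names are assumptions, not the architecture from the book's figure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_shared, n_task = 8, 16, 4

# Shared parameters, updated by the gradients of every task.
W_shared = rng.normal(size=(n_in, n_shared)) * 0.1
# Task-specific parameters, updated only by their own task's loss.
W_task1 = rng.normal(size=(n_shared, n_task)) * 0.1
W_task2 = rng.normal(size=(n_shared, n_task)) * 0.1
V_out1 = rng.normal(size=(n_task, 3)) * 0.1  # hypothetical task 1: 3 outputs
V_out2 = rng.normal(size=(n_task, 1)) * 0.1  # hypothetical task 2: 1 output

def forward(x):
    """Compute both task outputs from the shared representation."""
    h_shared = np.tanh(x @ W_shared)   # stage 1: shared features
    h1 = np.tanh(h_shared @ W_task1)   # stage 2: task-specific features
    h2 = np.tanh(h_shared @ W_task2)
    return h1 @ V_out1, h2 @ V_out2

x = rng.normal(size=(5, n_in))         # batch of 5 shared inputs
y1, y2 = forward(x)
```

During training, the losses of both tasks would be summed, so `W_shared` receives gradient signal from both.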

7.8 Early Stopping

Definition:

obtain a model with better validation set error (and thus, hopefully better test set error) by returning to the parameter setting at the point in time with the lowest validation set error.

Method:

Every time the error on the validation set improves, we store a copy of the model parameters. When the training algorithm terminates, we return these parameters, rather than the latest parameters. The algorithm terminates when no parameters have improved over the best recorded validation error for some pre-specified number of iterations.
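The method quoted above can be sketched as a small training loop. This is a simplified sketch rather than the book's exact algorithm: it evaluates the validation error after every step, and `train_step`, `validation_error`, and the toy problem are hypothetical stand-ins for a real training procedure.

```python
import copy

def train_with_early_stopping(params, train_step, validation_error, patience, max_steps):
    """Train until the validation error has not improved for `patience` evaluations,
    then return the best parameters seen rather than the latest ones."""
    best_params = copy.deepcopy(params)
    best_error = validation_error(params)
    best_step = 0
    evaluations_without_improvement = 0
    for step in range(1, max_steps + 1):
        params = train_step(params)
        error = validation_error(params)
        if error < best_error:
            # Validation error improved: store a copy of the parameters.
            best_error = error
            best_params = copy.deepcopy(params)
            best_step = step
            evaluations_without_improvement = 0
        else:
            evaluations_without_improvement += 1
            if evaluations_without_improvement >= patience:
                break
    return best_params, best_step, best_error

# Toy problem: a scalar parameter grows by 1 per step; validation error is lowest at 5.
toy_train_step = lambda p: p + 1.0
toy_validation_error = lambda p: (p - 5.0) ** 2

best_params, best_step, best_error = train_with_early_stopping(
    0.0, toy_train_step, toy_validation_error, patience=3, max_steps=100
)
```

In the toy run, training continues for three steps past the minimum before stopping, and the parameters from step 5 are returned. `best_step` is the quantity that is later treated as a hyperparameter.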
Advantages:
- effectiveness
- simplicity
- requires almost no change in the underlying training procedure, the objective function, or the set of allowable parameter values
- may be used either alone or in conjunction with other regularization strategies.
- reduces the computational cost of the training procedure
Disadvantages:
- the need to maintain a copy of the best parameters
- requires a validation set

Note: early stopping is different from cross validation. Early stopping uses a single fixed validation set to judge how well the network generalizes. k-fold cross validation instead splits the whole training set into k parts, uses each part in turn as the validation set and the rest as the training set, and therefore requires k separate training runs.
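How to think of it?

One way to think of early stopping is as a very efficient hyperparameter selection algorithm. In this view, the number of training steps is just another hyperparameter.
Part of the data (the validation set) is never used directly to train the model. To make better use of it, once the number of training steps has been determined we can train again on all of the training examples. There are two strategies:
- Reinitialize all parameters randomly and train for the same number of steps. The problem is that there are now more training examples, so the same number of steps may not be enough to update the parameters adequately, and the network may not fully learn the features of the data.
- Initialize the parameters from the stored best parameters and continue training on all of the data. The problem is that there is no longer a validation set, so it is unclear when to stop.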
How early stopping acts as a regularizer

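The main point is that early stopping acts much like L2 regularization, and has an advantage over it: it keeps the weights small as L2 does, while also choosing the amount of regularization automatically by monitoring the validation error. I did not fully understand the detailed explanation of why it is similar to L2.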

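7.9 Parameter Tying and Parameter Sharing

See the book for the details; I do not know how to explain them here. Parameter sharing across filters in CNNs is an instance of this idea.

One question: why should the supervised paradigm be made consistent with the unsupervised paradigm, rather than the other way around?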

7.10 Sparse Representations

What are the benefits?
I am too slow for this one, I did not understand it -_-

7.11 Bagging and Other Ensemble Methods

The idea is to train several different models separately, then have all of the models vote on the output for test examples. This is an example of a general strategy in machine learning called model averaging. Techniques employing this strategy are known as ensemble methods.
The book has an example that nicely explains how bagging works and why it performs better than a single model.
Reading only this is certainly not enough to understand or use bagging; it is worth reading other people's blog posts or a dedicated book.
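Keep the following concepts distinct:
- bagging (bootstrap aggregating): train each model on a different bootstrap resample of the training set, then average or vote.
- boosting: train models sequentially, each one focusing on the examples the previous ones handled poorly.
- stacking: train a second-level model that combines the predictions of the base models.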
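Bagging itself can be sketched in a few lines: each ensemble member is fit on a bootstrap resample (sampling with replacement) of the training set, and the predictions are averaged. The sketch below is for a regression setting, where averaging replaces voting; the cubic polynomial model, the synthetic data, and all names are assumptions made for illustration.

```python
import numpy as np

def bagging_fit_predict(x_train, y_train, x_test, n_models, degree, rng):
    """Fit n_models polynomials on bootstrap resamples and average their predictions."""
    n = len(x_train)
    member_predictions = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # bootstrap: sample n indices with replacement
        coeffs = np.polyfit(x_train[idx], y_train[idx], degree)
        member_predictions.append(np.polyval(coeffs, x_test))
    member_predictions = np.stack(member_predictions)
    return member_predictions.mean(axis=0), member_predictions

rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 40)
y_train = np.sin(3.0 * x_train) + rng.normal(0.0, 0.2, size=40)  # synthetic noisy data
x_test = np.linspace(-1.0, 1.0, 7)

ensemble_pred, member_preds = bagging_fit_predict(
    x_train, y_train, x_test, n_models=10, degree=3, rng=rng
)
```

Because each member sees a slightly different dataset, the members make partly independent errors, and averaging reduces the variance of the prediction.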

7.12 Dropout

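I feel the original paper already explains dropout very well and is fairly easy to follow. Other treatments mostly try to explain why it works, rather than how it is implemented and how it operates.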

7.13 Adversarial Training

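See the article "Deep Learning Adversarial Examples", also written by Goodfellow, which is more detailed and easier to follow.
As mentioned earlier, neural networks are not robust to noise, so even a model that reaches 100% accuracy can make baffling mistakes, as in the example in the figure below.
Note that adversarial training is not the same thing as GANs.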
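One way such adversarial examples can be constructed is the fast gradient sign method, which perturbs the input by a small amount `epsilon` in the direction of the sign of the loss gradient with respect to the input. The sketch below applies it to a logistic regression model instead of a deep network so that the gradient can be written by hand; the weights, input, and `epsilon` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy loss of a logistic regression model on one example."""
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def fgsm(w, b, x, y, epsilon):
    """Fast gradient sign method: x + epsilon * sign(gradient of the loss w.r.t. x)."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w  # closed-form input gradient for logistic regression
    return x + epsilon * np.sign(grad_x)

w = np.array([2.0, -1.0, 0.5])  # hypothetical trained weights
b = 0.1
x = np.array([1.0, 0.5, -0.2])  # hypothetical input with true label y = 1
y = 1.0

x_adv = fgsm(w, b, x, y, epsilon=0.25)
loss_clean = loss(w, b, x, y)
loss_adv = loss(w, b, x_adv, y)
```

No coordinate of `x_adv` differs from `x` by more than `epsilon`, yet the loss goes up. Adversarial training adds such perturbed examples, with their original labels, to the training set.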

7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier

I am too slow for this one, I did not understand it -_-

WatsonYang's Blog
