Chapter 8 Optimization for Training Deep Models

Apologies in advance: I only skimmed this chapter rather than reading it closely.

The reason is that the chapter is mostly theoretical and does not seem very useful for day-to-day engineering work (if you are a PhD student working on this topic and are not yet familiar with it, I would still strongly recommend reading it carefully). Besides, the optimization of Neural Networks is itself something of a black box.

For now this material does not help me much; I plan to come back and review it when I actually run into related problems.

This chapter focuses on one particular case of optimization: finding the parameters θ of a neural network that significantly reduce a cost function J(θ), which typically includes a performance measure evaluated on the entire training set as well as additional regularization terms.

8.1 How Learning Differs from Pure Optimization

An earlier section already touched on optimization and made the same point: machine learning merely uses optimization methods as a tool for training, and its ultimate goal is to make the test error as small as possible, whereas pure optimization simply wants to fit the given data, i.e. to minimize the objective itself.

  • Machine learning usually acts indirectly: we reduce a different cost function J(θ) in the hope that doing so will improve the performance measure P. In pure optimization, minimizing J is a goal in and of itself.

  • The cost function in machine learning is an average over the training data; pure optimization simply minimizes whatever objective it is handed, without this notion of averaging over examples drawn from a data-generating distribution.

  • Pure optimization is not content to stop at a local minimum, since its goal is to fit the given objective exactly; machine learning, whose real target is test error, often is.

8.1.1 Empirical Risk Minimization

  • I'm too slow for this one; I did not really understand this subsection. -_-
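
For my own reference, the two quantities involved, restated roughly in the book's notation (my own approximation of the symbols):

```latex
% Risk: expected loss under the true (unknown) data-generating distribution
J^*(\theta) = \mathbb{E}_{(x,y) \sim p_{\text{data}}} \, L\bigl(f(x;\theta),\, y\bigr)

% Empirical risk: the same expectation under the empirical distribution,
% i.e. the average loss over the m training examples
\hat{J}(\theta) = \frac{1}{m} \sum_{i=1}^{m} L\bigl(f(x^{(i)};\theta),\, y^{(i)}\bigr)
```

Empirical risk minimization means minimizing the second quantity in the hope that the first one also becomes small; the book notes it is rarely used directly in deep learning, both because it is prone to overfitting and because the losses we really care about (such as 0-1 loss) have no useful gradient.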

8.1.2 Surrogate Loss Functions and Early Stopping

  • This subsection is mainly about choosing a good loss function: when a loss is hard to train with directly, consider replacing it with an equivalent, easier-to-optimize form. For example, if MSE trains poorly, switch to negative log-likelihood / cross-entropy.

  • In other words, the loss function is swapped for another, equivalent form. "Equivalent" here does not mean a trivial rewrite; it means the new form admits the same interpretation, e.g. from a probabilistic point of view.

  • For example, the negative log-likelihood of the correct class is typically used as a surrogate for the 0-1 loss.

  • Early stopping is discussed again here, this time contrasted with pure optimization: when deep-learning training is halted by early stopping, the gradient of the loss function is often still large. That is exactly the situation a pure optimization algorithm "loves" most; it would never stop there.
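
A tiny numpy illustration of the surrogate idea (my own example, not code from the book): the 0-1 loss is flat almost everywhere and gives no useful gradient, while the negative log-likelihood of the correct class is smooth and can still be reduced even after the classifier is already correct.

```python
import numpy as np

def zero_one_loss(logits, label):
    """0-1 loss: 1 if the predicted class is wrong, else 0.
    Flat almost everywhere, so it provides no useful gradient."""
    return float(np.argmax(logits) != label)

def nll_surrogate(logits, label):
    """Negative log-likelihood of the correct class under a softmax model:
    a smooth surrogate for the 0-1 loss."""
    z = logits - logits.max()                  # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

logits = np.array([2.0, 0.5, -1.0])            # hypothetical scores for 3 classes
print(zero_one_loss(logits, 0))                # 0.0 -> already classified correctly
print(nll_surrogate(logits, 0))                # ~0.24 -> can still be pushed lower
```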

8.1.3 Batch and Minibatch Algorithms

  • The terms "batch" and "minibatch" are defined differently than you might expect; see the explanation on page 278 of the book. -_-

  • Minibatch stochastic gradient descent is explained very well in the CS231n course; the intuition given there is also good, so that video is worth watching.

    The main idea is that a minibatch is a rough but reasonable sample of the distribution of the whole dataset, so the gradient estimate it gives is good enough to speed up convergence of the objective function and hence speed up training (see the small sketch after this list).

    It is also stressed that minibatches must be selected randomly. The purpose is to reduce correlation between the examples inside a minibatch, so that the network does not keep seeing the same local slice of the data and fail to learn the overall data distribution.

    In practice this means shuffling the dataset after every full pass over it (every epoch). If the dataset is very large, shuffling only once is also acceptable; then only the first epoch gives unbiased gradient estimates and later epochs are biased, but the benefit of reusing the data usually outweighs the overfitting caused by this bias.

    In short, shuffling the data before training is very important; even a single shuffle makes a big difference.

  • If the dataset is extremely large, training may never even make a full pass over it (each training example is used at most once). In that regime overfitting is no longer the issue; underfitting becomes the potential problem.
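
A minimal numpy sketch of minibatch SGD with per-epoch shuffling, as described above (the linear model, data, and hyperparameters are hypothetical placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                              # hypothetical inputs
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)    # hypothetical targets

theta = np.zeros(10)                                         # linear-model parameters
lr, batch_size, n_epochs = 0.1, 32, 5

for epoch in range(n_epochs):
    perm = rng.permutation(len(X))                   # reshuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]         # a randomly selected minibatch
        xb, yb = X[idx], y[idx]
        grad = 2.0 * xb.T @ (xb @ theta - yb) / len(xb)  # MSE gradient on the minibatch
        theta -= lr * grad                           # SGD step
```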

8.2 Challenges in Neural Network Optimization

8.2.1 Ill-Conditioning

8.2.2 Local Minima

  • For a convex objective function, gradient-based optimization eventually converges to a global minimum.

  • For a non-convex objective, it is very likely to converge to a local minimum instead. However, experiments suggest that in many cases local minima have cost values close to the global minimum, so this is not the main problem in training Neural Networks.

  • The main reasons why local minima are not the primary problem:

    • The most important one, as noted above: experiments suggest that in many cases local minima are nearly as good as the global minimum.

    • What Neural Networks training ultimately wants is the lowest possible test error, and both "global minimum" and "local minimum" are defined with respect to the training set. Actually reaching the global minimum of the training loss could even amount to overfitting, so it is not that important.

    • Some current optimization techniques also reduce the chance of getting stuck in poor regions, e.g. momentum (a small sketch of the momentum update follows this list).
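
A minimal sketch of the classical momentum update used with SGD (the toy objective and hyperparameter values are my own choices, not from the book):

```python
import numpy as np

def sgd_momentum(grad_fn, theta0, lr=0.01, mu=0.9, n_steps=200):
    """Classical momentum: a velocity accumulates an exponentially decaying
    average of past gradients, helping the optimizer coast through small
    bumps, plateaus and shallow local minima."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad_fn(theta)
        v = mu * v - lr * g          # velocity update
        theta = theta + v            # parameter update
    return theta

# Toy objective f(x) = x0^2 + 10*x1^2, gradient [2*x0, 20*x1]; approaches [0, 0]
print(sgd_momentum(lambda th: np.array([2.0 * th[0], 20.0 * th[1]]), [3.0, 2.0]))
```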

8.2.3 Plateaus, Saddle Points and Other Flat Regions

  • For many high-dimensional non-convex functions, local minima (and maxima) are in fact rare compared to another kind of point with zero gradient: a saddle point.

    At a saddle point the gradient is zero but the Hessian has both positive and negative eigenvalues, i.e. the surface curves up along some directions and down along others.
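
The standard two-dimensional toy example (my own illustration, not from the book):

```latex
f(x, y) = x^2 - y^2, \qquad
\nabla f(0, 0) = \mathbf{0}, \qquad
H(0, 0) = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix}
```

The origin is a critical point that is a minimum along the x-axis and a maximum along the y-axis; the book's point is that in high dimensions such mixed-curvature critical points vastly outnumber true minima.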

8.2.4 Cliffs and Exploding Gradients

  • Vanishing and exploding gradients are both harmful; an exploding-gradient step in particular can undo the training progress made so far and force training to effectively start over.

  • The objective function for highly nonlinear deep neural networks or for recurrent neural networks often contains sharp nonlinearities in parameter space resulting from the multiplication of several parameters. These nonlinearities give rise to very high derivatives in some places. When the parameters get close to such a cliff region, a gradient descent update can catapult the parameters very far, possibly losing most of the optimization work that had been done.

    (Figure: a cliff region of the objective function.)
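
The remedy the book recommends for cliff regions is the gradient clipping heuristic (detailed in section 10.11.1): if a proposed gradient step is too large, shrink it while keeping its direction. A minimal sketch with a hypothetical threshold:

```python
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    """Norm-based gradient clipping: if the gradient is too large (as at the
    edge of a cliff), rescale it so its norm equals max_norm. The direction
    of the step is preserved; only its size is limited."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])      # a huge gradient near a cliff (norm 50)
print(clip_gradient(g))          # [0.6, -0.8] -> norm 1.0, same direction
```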

8.2.5 Long-Term Dependencies

  • This is the cause of vanishing and exploding gradients: repeatedly multiplying by the same matrix W. Eigenvalues greater than 1 lead to exploding gradients; eigenvalues smaller than 1 lead to vanishing gradients.

  • Section 10.7 is devoted to long-term dependencies in RNNs.
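
A tiny numpy illustration of the point above (the diagonal W is my own choice, purely to make the eigenvalues obvious):

```python
import numpy as np

W = np.diag([1.1, 0.9])          # eigenvalues 1.1 (> 1) and 0.9 (< 1)
h = np.array([1.0, 1.0])

for t in (10, 50, 100):
    print(t, np.linalg.matrix_power(W, t) @ h)
# The component along the >1 eigenvalue explodes (1.1**100 ~ 1.4e4),
# while the component along the <1 eigenvalue vanishes (0.9**100 ~ 2.7e-5).
```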

8.2.6 Inexact Gradients

  • In other cases, the objective function we want to minimize is actually intractable. When the objective function is intractable, typically its gradient is intractable as well. In such cases we can only approximate the gradient.

  • One can also avoid the problem by choosing a surrogate loss function that is easier to approximate than the true loss.

8.2.7 Poor Correspondence between Local and Global Structure

  • Initialization on the wrong side of the "mountain" can cause a failure to find the globally best solution.

    Optimization based on local downhill moves can fail if the local surface does not point toward the global solution.

8.2.8 Theoretical Limits of Optimization

8.3 Basic Algorithms

8.4 Parameter Initialization Strategies

  • Somewhat frustratingly, after describing quite a few initialization methods the book suddenly adds that none of them may actually work very well, and then analyzes why. A lot of preceding material for that conclusion.

  • weight

    • Despite everything said above, in actual engineering practice weights are mostly still initialized randomly to small values close to 0.

    • almost always initialize all the weights in the model to values drawn randomly from a Gaussian or uniform distribution. The choice of Gaussian or uniform distribution does not seem to matter very much, but has not been exhaustively studied.

    • The scale of the initial distribution, however, does have a large effect on both the outcome of the optimization procedure and on the ability of the network to generalize.

    • Initializing to relatively large values:

      • Pros

        • Larger initial weights will yield a stronger symmetry breaking effect, helping to avoid redundant units.

        • They also help to avoid losing signal during forward or back-propagation through the linear component of each layer—larger values in the matrix result in larger outputs of matrix multiplication.

      • Cons

        • Too large a scale can cause exploding gradients.

        • Too large a scale may push some activation functions into saturation.

    In summary, most of the time weights are still initialized to relatively small values.

  • bias

    • Biases are mostly initialized to 0 (some suggest a very small value close to 0, so that the units contribute to training from the very start). In some situations, however, a non-zero initialization is genuinely needed; see the discussion around page 305 of the book.

    • For ReLU networks, a bias of 0.1 may be better for hidden units.

  • Whether for weight or bias, initialization must break symmetry between units (a small sketch follows this list).

  • Besides the random initialization described above, the following approaches can also be used:

    • initialize a supervised model with the parameters learned by an unsupervised model trained on the same inputs.

    • One can also perform supervised training on a related task. Even performing supervised training on an unrelated task can sometimes yield an initialization that offers faster convergence than a random initialization.

  • See also the relevant parts of the two slide decks listed at the end.
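
A minimal sketch of the kind of initialization described above (the 0.01 scale and the layer sizes are hypothetical choices; the book stresses that the scale has a large effect): small random Gaussian weights to break symmetry, zero biases by default, optionally 0.1 for ReLU hidden units.

```python
import numpy as np

def init_layer(n_in, n_out, weight_scale=0.01, relu_bias=False,
               rng=np.random.default_rng()):
    """Small random Gaussian weights break the symmetry between units;
    if all weights started equal, every unit would compute the same
    function and receive the same gradient forever."""
    W = weight_scale * rng.standard_normal((n_in, n_out))
    b = np.full(n_out, 0.1) if relu_bias else np.zeros(n_out)
    return W, b

W1, b1 = init_layer(784, 256, relu_bias=True)   # hypothetical ReLU hidden layer
W2, b2 = init_layer(256, 10)                    # hypothetical output layer
```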

8.5 Algorithms with Adaptive Learning Rates

  • This section mainly covers a few popular algorithms with adaptive learning rates (AdaGrad, RMSProp, Adam).

  • See also the relevant parts of the two slide decks listed at the end.
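
A minimal sketch of one such algorithm, RMSProp (the toy objective and hyperparameter values are my own choices): keep a decaying average of squared gradients and divide each step by its square root, so every parameter gets its own effective learning rate.

```python
import numpy as np

def rmsprop(grad_fn, theta0, lr=0.01, decay=0.9, eps=1e-8, n_steps=2000):
    """RMSProp: each parameter's step is lr / sqrt(running average of that
    parameter's squared gradients)."""
    theta = np.asarray(theta0, dtype=float)
    r = np.zeros_like(theta)                         # accumulated squared gradient
    for _ in range(n_steps):
        g = grad_fn(theta)
        r = decay * r + (1.0 - decay) * g * g        # exponentially decaying average
        theta = theta - lr * g / (np.sqrt(r) + eps)  # per-parameter rescaled step
    return theta

# Same toy quadratic as the momentum sketch: f(x) = x0^2 + 10*x1^2
print(rmsprop(lambda th: np.array([2.0 * th[0], 20.0 * th[1]]), [3.0, 2.0]))
```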

8.6 Approximate Second-Order Methods

  • Training with (approximate) second-order methods; I did not read this section.

8.7 Optimization Strategies and Meta-Algorithms

  • In practice, it is more important to choose a model family that is easy to optimize than to use a powerful optimization algorithm.

References: