2017-02-24

Chapter 6 Deep Feedforward Networks

First, the chapter introduces deep feedforward networks. They are called networks because they are typically represented by composing together many different functions; they are called feedforward because there are no feedback connections in which outputs of the model are fed back into itself.

Next come the output layer and the hidden layers. The training examples directly specify what the output layer must do: produce a value close to the target y. They do not specify the behavior of the hidden layers, so the learning algorithm must decide how to use those layers to approximate the function.

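The chapter then turns to how a model can represent a nonlinear function of the input (this means the final learned function, not the activation function). The idea is to apply a linear model to a transformed input φ(x), and there are 3 ways to obtain the mapping φ:

- Use a very generic mapping, such as the one implicit in the RBF kernel. Such a model can fit the training set, but generalization to the test set often remains poor.
- Hand-engineer it from domain knowledge, which is time-consuming and labor-intensive.
- Let the algorithm learn it. This keeps the generic character of the first approach and also the benefit of the second, because a human designer can still build in some bias from domain knowledge.

Finally, it outlines the main contents of the chapter.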

6.1 Example: Learning XOR

Solving the XOR problem by learning a representation

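First, the XOR example shows that a traditional linear model cannot solve this task, which motivates using neural networks. The layers must be separated by a nonlinear transformation, otherwise the whole model is still a linear model; this is what introduces the activation function.

So the role of the activation function is to bring nonlinearity into neural networks.
The section then points out that this example is simple, whereas real tasks are more complex and involve a very large number of model parameters, so an efficient learning algorithm is needed. This leads into the next section.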
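
As a concrete illustration of the point above, here is a minimal NumPy sketch of a one-hidden-layer ReLU network that computes XOR. The weights are hand-picked values of the kind the book uses for this example, not learned ones, and the names `relu`, `xor_net`, `W`, `c`, `w`, `b` are chosen here for illustration only.

```python
import numpy as np

def relu(z):
    """Rectified linear unit, applied element-wise."""
    return np.maximum(0.0, z)

# The four XOR inputs (one per row) and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_true = np.array([0, 1, 1, 0], dtype=float)

# Hand-specified parameters (illustrative, not learned here).
W = np.array([[1.0, 1.0], [1.0, 1.0]])  # input -> hidden weights
c = np.array([0.0, -1.0])               # hidden biases
w = np.array([1.0, -2.0])               # hidden -> output weights
b = 0.0                                 # output bias

def xor_net(X):
    h = relu(X @ W + c)  # nonlinear hidden representation
    return h @ w + b     # linear output layer

y_pred = xor_net(X)

# Same parameters without the ReLU: the two layers collapse into
# a single linear map, which cannot reproduce the XOR targets.
linear_pred = (X @ W + c) @ w + b

print(y_pred)       # matches y_true
print(linear_pred)  # does not match y_true
```

The only difference between the two predictions is the activation function, which is exactly the nonlinearity the section is about.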

6.2 Gradient-Based Learning

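The biggest difference between neural networks and a linear model is that the nonlinearity (the activation function) makes most cost functions non-convex. They are usually optimized with Gradient Descent, which on a non-convex cost has no guarantee of reaching the global minimum.
For feedforward networks the initialization of the parameters matters: usually all weights are initialized to small random values, and the biases to 0 or to small positive values.
To apply gradient descent, a cost function must be defined, and the choice and form of the cost function depend in turn on the output units, so the material is split into 2 subsections.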
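
The initialization advice above can be sketched as follows. This is only an illustration: the helper `init_layer`, the weight scale 0.01, and the bias value 0.1 are assumptions made here, not values prescribed by the book.

```python
import numpy as np

def init_layer(n_in, n_out, rng, weight_scale=0.01, bias_value=0.0):
    """Return (W, b) for one fully connected layer.

    Weights are small random values; biases are a constant
    (0 by default, or a small positive value if requested).
    """
    W = weight_scale * rng.standard_normal((n_in, n_out))
    b = np.full(n_out, bias_value)
    return W, b

rng = np.random.default_rng(0)
W1, b1 = init_layer(2, 4, rng, bias_value=0.1)  # hidden layer, small positive bias
W2, b2 = init_layer(4, 1, rng)                  # output layer, zero bias
```

Random weights matter because units that start with identical weights receive identical gradients and can never become different from each other.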

6.2.1 Cost Functions

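Most of the time the cost function is a cross-entropy. (One thing I was unsure about: don't regression problems mostly use MSE? The two views agree: MSE is the cross-entropy between the training data and a Gaussian model, see 6.2.2.1.)
A regularization term is usually added as well (typically weight decay, applied directly to the cost function).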

6.2.1.1 Learning Conditional Distributions with Maximum Likelihood

Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution.
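Although the cost function is usually a cross-entropy, its specific form changes from model to model.
The usual recipe is: define the objective function by maximum likelihood estimation; use gradient descent to find the parameters that minimize the objective function; use BP to compute the gradients that each gradient descent update needs.
Compared with MSE, working with the log-likelihood has 2 advantages:
- It turns the products in the cost function into sums.
- It turns e^x back into x, because the log undoes the exp in the output unit.
The first keeps the objective numerically well-behaved, and the second keeps the gradient of the cost function from vanishing when the input to the exp is extremely large or extremely negative.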
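
The first advantage can be seen numerically. The sketch below assumes a made-up dataset of 1000 examples to each of which the model assigns probability 0.01; the numbers are chosen only to make the effect visible.

```python
import numpy as np

# Hypothetical per-example likelihoods p_model(y_i | x_i).
probs = np.full(1000, 0.01)

# Likelihood as a product: 0.01**1000 is far below the smallest
# positive float64, so the product underflows to exactly 0.0.
likelihood = np.prod(probs)

# Negative log-likelihood as a sum: perfectly representable.
nll = -np.sum(np.log(probs))

print(likelihood)  # 0.0
print(nll)         # equals 1000 * log(100)
```

A likelihood of 0.0 carries no gradient information at all, while the summed negative log-likelihood remains a usable training objective.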

6.2.1.2 Learning Conditional Statistics

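I did not follow this part -_-

Roughly, the point seems to be that MSE can make training very slow (vanishing gradients when the output unit saturates), which is why cross-entropy is so popular. The following blog post explains it well.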
- Improving the way neural networks learn

6.2.2 Output Units

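The main point here is that the choice of cost function is closely tied to the choice of output units: different output units lead to different concrete forms of the cost function, even for the same kind of cost function.

The section then covers 4 kinds of output units; describing them is the main content of this subsection.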

6.2.2.1 Linear Units for Gaussian Output Distributions

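My reading: when the final output layer uses no activation function, the linear units do not saturate, so MSE can be used as the cost function? But if the hidden layers use sigmoid, won't the gradient of the cost function still shrink when those sigmoids saturate? (It will. The non-saturation claim is only about the output units; saturating hidden units are a separate issue, covered in 6.3.)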
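
The link between linear output units and MSE can be written out as a short derivation. It assumes, as the subsection title suggests, that the linear output \hat{y} is the mean of a Gaussian with identity covariance over a d-dimensional target; the symbols \hat{y} and d are notation introduced here.

```latex
\begin{aligned}
p(y \mid x) &= \mathcal{N}(y;\, \hat{y},\, I)
  = (2\pi)^{-d/2} \exp\!\left(-\tfrac{1}{2}\lVert y - \hat{y} \rVert^{2}\right), \\
-\log p(y \mid x) &= \tfrac{1}{2}\lVert y - \hat{y} \rVert^{2} + \tfrac{d}{2}\log(2\pi).
\end{aligned}
```

The second term does not depend on the parameters, so under this assumption minimizing the negative log-likelihood is the same as minimizing the squared error.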

6.2.2.2 Sigmoid Units for Bernoulli Output Distributions

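A binary classification problem can be treated as a Bernoulli trial: with a sigmoid on the final output layer, the output p is read as P(y = 1 | x), so P(y = 0 | x) = 1 - p.
The sigmoid saturates to 0 when its input is very negative and saturates to 1 when its input is very positive. With MSE, the gradient of the cost also vanishes whenever the sigmoid saturates, even when the prediction is wrong.
So when the final output layer uses a sigmoid, it is best to train by maximum likelihood, that is, with a probabilistic cost function in log-likelihood form rather than MSE.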
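
The difference between the two cost functions shows up in the gradient with respect to the sigmoid's input z. The sketch below is a small numerical illustration; the function names and the test point z = -10 with target 1 (a confidently wrong prediction) are chosen here for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_mse(z, y):
    """d/dz of 0.5 * (sigmoid(z) - y)**2."""
    s = sigmoid(z)
    return (s - y) * s * (1.0 - s)  # extra factor s*(1-s) vanishes on saturation

def grad_nll(z, y):
    """d/dz of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z)."""
    return sigmoid(z) - y           # the log has cancelled the exp

z, y = -10.0, 1.0  # sigmoid(z) is close to 0, but the target is 1
g_mse = grad_mse(z, y)
g_nll = grad_nll(z, y)

print(g_mse)  # tiny: almost no learning signal
print(g_nll)  # close to -1: strong learning signal
```

With MSE the unit is badly wrong yet barely updated; with the log-likelihood cost the gradient is only small when the prediction is actually right.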

6.2.2.3 Softmax Units for Multinoulli Output Distributions

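A multi-class classification problem can be treated as a Multinoulli trial, so a softmax on the final output layer gives the probability of each class directly.
softmax functions can be used inside the model itself, if we wish the model to choose between one of n different options for some internal variable. One example is the weights in an attention model.
Softmax can be seen as a generalization of the sigmoid above: the sigmoid parameterizes a Bernoulli distribution over two outcomes (P(y=1) = a, P(y=0) = 1 - a), while softmax parameterizes a Multinoulli (categorical) distribution over n outcomes, with outputs that sum to 1.
This subsection also explains more clearly why a cost function in log-likelihood form is used: softmax contains e^x, and when x is very negative the exp saturates and the gradient vanishes. The log in a log-likelihood cost function cancels the exp, which largely avoids this.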
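
A minimal sketch of the softmax log-likelihood cost and its gradient, written in the numerically stable form that subtracts the maximum before exponentiating. The function names and the extreme input vector are assumptions made for this illustration.

```python
import numpy as np

def log_softmax(z):
    """Stable log-softmax: shifting z by a constant leaves softmax unchanged."""
    shifted = z - np.max(z)
    return shifted - np.log(np.sum(np.exp(shifted)))

def nll_and_grad(z, target):
    """Negative log-likelihood of class `target` and its gradient w.r.t. z."""
    log_p = log_softmax(z)
    loss = -log_p[target]
    grad = np.exp(log_p)   # softmax(z)
    grad[target] -= 1.0    # softmax(z) - one_hot(target)
    return loss, grad

# Extreme inputs: a naive np.exp(1000.0) would overflow to inf.
z = np.array([1000.0, 0.0, -1000.0])
loss, grad = nll_and_grad(z, target=1)

print(loss)  # finite, and large because the model is confidently wrong
print(grad)  # not zero: the wrong prediction still gets corrected
```

Because the log cancels the exp, the loss is roughly linear in the offending input, so the gradient does not vanish even for these extreme values.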

6.2.2.4 Other Output Types

I did not read this subsection.

6.3 Hidden Units

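This section covers the definitions, variants, and pros and cons of the various activation functions.

Activation functions are typically applied element-wise to a layer's pre-activation output (the affine transformation of its input).
It is usually impossible to predict in advance which will work best.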

6.3.1 Rectified Linear Units and Their Generalizations

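Some activation functions are not differentiable everywhere; ReLU, for example, is not differentiable at 0. Implementations usually just return one of the one-sided (left or right) derivatives at such a point. The justification is heuristic: when the input is reported as 0, the true value was most likely a small nonzero number that got rounded to 0.
ReLU also has a drawback: they cannot learn via gradient-based methods on examples for which their activation is zero.
ReLU has several important variants (absolute value rectification, leaky ReLU, and parametric ReLU); see the book for details.
Maxout is a further generalization of ReLU. The book describes its advantages; see the book, or the notes here: Maxout Networks & Network in Network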
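
The units discussed above can be sketched in a few lines of NumPy. This is a simplified illustration: the slope `alpha=0.01` is an assumed default, and `maxout` here only takes the maximum over groups of `k` precomputed pre-activations, whereas a real maxout layer also learns the affine maps that produce them.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # At z == 0 this returns the left derivative (0) rather than failing.
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    # A small nonzero slope for z < 0 keeps a gradient on inactive units.
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

def maxout(z, k):
    # Split z into consecutive groups of k values and keep each group's max.
    return z.reshape(-1, k).max(axis=1)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), relu_grad(z))
print(leaky_relu(z), leaky_relu_grad(z))
print(maxout(np.array([1.0, 5.0, -3.0, 2.0]), k=2))
```

The zero entries of `relu_grad` are the drawback mentioned above: an example on which the unit is inactive sends no gradient through it, which is what the leaky variant is meant to soften.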

6.3.2 Logistic Sigmoid and Hyperbolic Tangent

Their use as output units is compatible with the use of gradient-based learning when an appropriate cost function can undo the saturation of the sigmoid in the output layer.
sigmoid and tanh are closely related: tanh(z) = 2sigmoid(2z) - 1. But tanh has nicer properties than sigmoid (tanh(0) = 0, so it behaves like the identity near 0), so when a sigmoidal activation must be used, tanh is usually the better choice.
sigmoid is still useful in many settings, such as gate units in RNN/LSTM, probabilistic models, and autoencoders (Recurrent networks, many probabilistic models, and some autoencoders have additional requirements that rule out the use of piecewise linear activation functions and make sigmoidal units more appealing despite the drawbacks of saturation.).
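
The identity relating tanh and sigmoid is easy to check numerically. The grid of test points below is arbitrary and chosen only for this check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 101)
lhs = np.tanh(z)
rhs = 2.0 * sigmoid(2.0 * z) - 1.0

max_abs_diff = np.max(np.abs(lhs - rhs))  # rounding error only

# At the origin tanh passes through 0, while sigmoid sits at 0.5.
tanh_at_zero = np.tanh(0.0)
sigmoid_at_zero = sigmoid(0.0)
```

So tanh is a rescaled and shifted sigmoid that is centered at zero, which is the property the notes point to.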

6.3.3 Other Hidden Units

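This section introduces some other activation functions. Most perform about as well as the ones above, and in certain specific situations they may do better.

A layer with an activation function can be split into 2 layers: one layer with no activation function (equivalent to using the identity function as the activation function) followed by one layer with the activation function. The benefit is fewer parameters: with n inputs, p outputs, and q units in the linear layer, the two layers hold (n + p)q weights instead of np, at the cost of restricting the linear map to low rank (see page 195 for details).
The softmax function is usually used for output units, but it can sometimes serve as a hidden unit, for example for the weights in an attention model. (These kinds of hidden units are usually only used in more advanced architectures that explicitly learn to manipulate memory, described in section 10.12.)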
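
The parameter saving from inserting a purely linear layer can be illustrated with a quick count. The sizes `n`, `p`, `q` and the matrix names `U`, `V` are example values and notation assumed here, not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 20, 30, 5  # inputs, outputs, width of the linear (identity-activation) layer

U = rng.standard_normal((n, q))  # first layer: no activation function
V = rng.standard_normal((q, p))  # second layer: the activation is applied after this
W = U @ V                        # the single n x p weight matrix the pair represents

factored_params = U.size + V.size  # (n + p) * q
dense_params = W.size              # n * p
rank = np.linalg.matrix_rank(W)    # cannot exceed q

print(factored_params, dense_params, rank)
```

The saving comes with a constraint: the combined map has rank at most `q`, so the factored layer is only a good substitute when a low-rank relationship is enough.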

6.4 Architecture Design

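This section discusses the architecture of neural networks, mainly how to choose depth and width.

The book stresses repeatedly that one should use a deep network and then apply regularization to reduce overfitting (the CS231n course makes the same point).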

6.4.1 Universal Approximation Properties and Depth

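Any Borel measurable function can be approximated to any desired accuracy by a neural network with enough hidden units (the network must contain a non-linear activation function).
Although a neural network with only 1 hidden layer is enough to represent any such function, that hidden layer may need a very large number of hidden units (an exponential number of hidden units). Learning may then fail to find suitable parameters for the function, and generalization will not be good either (the same reasoning applies to shallow neural networks in general).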

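As the figure above shows, multi-layer neural networks are therefore the usual choice (fewer hidden units and parameters, better learning and generalization) (Empirically, greater depth does seem to result in better generalization for a wide variety of tasks).
The deeper the network, the greater its expressive power. To reach the same performance with a shallow network, the number of hidden units per layer must be increased, which raises the risk of overfitting (the figure on page 202, shown below, compares the effect of depth and width on the network).
The main points of the figure above can be summarized as follows (see page 202 for details):
- Increasing only the number of parameters in each layer, without increasing depth, does not improve performance much and makes overfitting more likely.
- The deeper the network, the finer the features it can learn, and the simpler each feature's representation becomes.
- With few layers, one may try to improve performance by adding parameters to each layer, but this easily overfits. Adding depth reduces overfitting to some extent, though a deep network can still overfit (too many parameters, too small a training dataset, too many training epochs, and so on).
- Summary: use a deep network, then reduce overfitting with regularization and similar methods.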

6.4.2 Other Architectural Considerations

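The section stresses that the NN architecture should be chosen for the task at hand, for example CNN for CV and RNN for time series. Internal details (such as whether to use skip connections) and the specific structure (within CNN, for example, FCN, Deconvolution CNN, and ResNet) also need to be considered.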

6.5 Back-Propagation and Other Differentiation Algorithms

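This section mainly covers the BP algorithm, the computation graph used when implementing it, the chain rule, and so on.
Blog write-ups that walk through BP step by step are recommended as further reading.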
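
To make the division of labor concrete, here is a minimal sketch of BP on a tiny two-layer network, checked against a finite-difference gradient. The architecture (tanh hidden layer, linear output, squared-error cost), the sizes, the learning rate, and all function names are assumptions made for this illustration, not the book's code.

```python
import numpy as np

def forward(params, x):
    W1, b1, W2, b2 = params
    h = np.tanh(x @ W1 + b1)  # hidden layer
    y_hat = h @ W2 + b2       # linear output layer
    return y_hat, h

def loss_fn(params, x, y):
    y_hat, _ = forward(params, x)
    return 0.5 * np.mean(np.sum((y_hat - y) ** 2, axis=1))

def backward(params, x, y):
    """Apply the chain rule from the loss back to every parameter."""
    W1, b1, W2, b2 = params
    y_hat, h = forward(params, x)
    d_yhat = (y_hat - y) / x.shape[0]  # dL/dy_hat
    dW2 = h.T @ d_yhat
    db2 = d_yhat.sum(axis=0)
    dh = d_yhat @ W2.T                 # propagate to the hidden layer
    dz1 = dh * (1.0 - h ** 2)          # derivative of tanh
    dW1 = x.T @ dz1
    db1 = dz1.sum(axis=0)
    return [dW1, db1, dW2, db2]

def numerical_grad(params, x, y, i, idx, eps=1e-6):
    """Central finite difference for one entry of params[i]."""
    orig = params[i][idx]
    params[i][idx] = orig + eps
    loss_plus = loss_fn(params, x, y)
    params[i][idx] = orig - eps
    loss_minus = loss_fn(params, x, y)
    params[i][idx] = orig
    return (loss_plus - loss_minus) / (2.0 * eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))
y = rng.standard_normal((5, 2))
params = [rng.standard_normal((3, 4)), np.zeros(4),
          rng.standard_normal((4, 2)), np.zeros(2)]

grads = backward(params, x, y)                 # BP: computes gradients only
check_W1 = numerical_grad(params, x, y, 0, (1, 2))
check_W2 = numerical_grad(params, x, y, 2, (3, 0))

# Gradient descent: the separate step that actually updates the parameters.
loss_before = loss_fn(params, x, y)
new_params = [p - 1e-3 * g for p, g in zip(params, grads)]
loss_after = loss_fn(new_params, x, y)
```

BP only produces the gradients; the update itself is the job of gradient descent or any other optimizer that consumes them.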

6.6 Historical Notes

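This section mainly covers the history of feedforward networks.

As is well known, neural networks were proposed long ago but did not take off for a long time, mainly because hardware lacked the computing power. Their recent resurgence is largely due to increased computing power (the use of GPUs).

So how do today's neural networks differ from the earlier ones? The answer: the core ideas have not changed much; learning and optimization still rely on Backpropagation and Gradient Descent.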

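There are nevertheless some improvements:

Larger datasets, which make statistical generalization easier.
Greater computing power, which has allowed neural networks to become larger and deeper.
Some algorithmic changes, which have also improved the performance of neural networks:
- Replacing MSE with the cross-entropy family of loss functions.
- Replacing sigmoid with ReLU.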
  
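  ReLU was in fact proposed long ago. It fell out of use for two reasons: it is not differentiable at 0, so many people rejected it; and neural networks at the time were small, where sigmoid performed better than ReLU, so ReLU was replaced by sigmoid.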
  
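  ReLU came back because, as neural networks grew, sigmoid ran into the vanishing-gradient problem, which slowed learning, and overfitting also became an issue. ReLU handles the first problem well, since its gradient does not shrink for positive inputs; overfitting is addressed mainly by regularization, as noted in 6.4.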
- Other improvements, such as changes in network structure: RCNN, FCNN, ResNet, and so on (the book does not mention these).

WatsonYang's Blog

Enrich yourself.