Chapter 10 Sequence Modeling: Recurrent and Recursive Nets

I have been restless lately: I cannot focus on reading, and the write-ups are not coming along either. There seem to be many things to do, yet I cannot tell which ones matter most. When will this end?!

I did not read much of this chapter carefully, so this summary is fairly thin; I will gradually fill it in later. -_-

First, the book gives the definition of an RNN:

Recurrent neural networks or RNNs (Rumelhart et al., 1986a) are a family of neural networks for processing sequential data, specialized for processing a sequence of values x(1), . . . , x(τ).

In other words, a new input arrives at each time step; once the recurrence is unfolded in time, an RNN can also be viewed as a very deep feedforward network whose layers all share the same parameters.

In addition, the parameter sharing mechanism of RNNs is introduced: "Each member of the output is produced using the same update rule applied to the previous outputs. This recurrent formulation results in the sharing of parameters through a very deep computational graph."

Sharing parameters is key to RNNs. The book's example of "I went to Nepal in 1999" and "In 1999, I went to Nepal" illustrates why: a model that shares parameters across time can extract the year 1999 no matter where it appears in the sentence, instead of having to learn a separate rule for every position.

Note

  • As pointed out above, the parameter sharing happens over a single computational graph: throughout the whole sequence, the same parameters (there may be several kinds, e.g. W and U) are applied at every time step and accumulate gradients from every step.

  • Parameter sharing in a CNN is realized through the convolution kernel (the same kernel is applied at every spatial position), which is different from an RNN, where the same update rule is applied at every time step. The sketch below makes the contrast concrete.
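
To make the contrast concrete, here is a minimal sketch of my own (not code from the book): a 1-D convolution reuses one kernel across positions, while an RNN reuses one update rule (here scalar weights w, u) across time steps.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10)          # a toy 1-D signal / input sequence

# CNN-style sharing: the same kernel is applied at every *position*.
kernel = np.array([0.25, 0.5, 0.25])
conv_out = np.array([kernel @ x[i:i + 3] for i in range(len(x) - 2)])

# RNN-style sharing: the same update rule (same w, u) is applied at every
# *time step*, and each state depends on all previous inputs.
w, u = 0.9, 0.5
h = 0.0
rnn_out = []
for t in range(len(x)):
    h = np.tanh(w * h + u * x[t])
    rnn_out.append(h)
```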

10.1 Unfolding Computational Graphs

  • This section mainly covers the formal definition of an RNN and how to unfold its recursive definition into a chain-structured computational graph.

  • The hidden state h(t) does not need to retain all information about the past; a lossy summary that keeps the task-relevant aspects of the input sequence is enough. The unrolled form is restated below.
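
As a short restatement of the unfolding idea, in the book's notation for the recurrence:

```latex
% Recurrent definition: the same transition function f with the same
% parameters \theta is applied at every time step.
h^{(t)} = f\big(h^{(t-1)}, x^{(t)};\, \theta\big)

% Unfolding (here for t = 3) turns the recurrence into an ordinary chain,
% i.e. a function of the whole input history:
h^{(3)} = f\big(f\big(f\big(h^{(0)}, x^{(1)}; \theta\big), x^{(2)}; \theta\big), x^{(3)}; \theta\big)
        = g^{(3)}\big(x^{(3)}, x^{(2)}, x^{(1)}\big)
```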

10.2 Recurrent Neural Networks

  • Important design patterns for recurrent neural networks

    • produce an output at each time step and have recurrent connections between hidden units

      Hidden Units --> Hidden Units
      • can choose to put any information it wants about the past into its hidden representation h and transmit h to the future.

      • The book's running examples follow this pattern. Note that the same W and U are used (and receive gradient) throughout the whole sequence; a minimal forward-pass sketch is given after this list.

    • produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step

      Output --> Hidden Units
      • Less powerful (can express a smaller set of functions) than those in the family represented by figure 10.3.

        Because this network lacks hidden-to-hidden recurrence, it requires that the output units capture all of the information about the past that the network will use to predict the future. Because the output units are explicitly trained to match the training set targets, they are unlikely to capture the necessary information about the past history of the input

      • Training can thus be parallelized, with the gradient for each step t computed in isolation. There is no need to compute the output for the previous time step first, because the training set provides the ideal value of that output.

    • recurrent connections between hidden units, that read an entire sequence and then produce a single output

      Single output
      • Such a network can be used to summarize a sequence and produce a fixed-size representation used as input for further processing. There might be a target right at the end (as in the book's figure) or the gradient on the output o(t) can be obtained by back-propagating from further downstream modules.
  • any function computable by a Turing machine can be computed by such a recurrent network of a finite size.
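
For the first (hidden-to-hidden) pattern above, here is a minimal NumPy sketch of the forward pass, following the book's tanh/softmax formulation (a(t) = b + W h(t-1) + U x(t), h(t) = tanh(a(t)), o(t) = c + V h(t), yhat(t) = softmax(o(t))). Shapes and data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 8, 4, 6

U = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input-to-hidden
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # hidden-to-hidden
V = rng.normal(scale=0.1, size=(n_out, n_hidden))     # hidden-to-output
b, c = np.zeros(n_hidden), np.zeros(n_out)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = rng.normal(size=(T, n_in))      # toy input sequence
ys = rng.integers(0, n_out, size=T)  # toy integer targets

h = np.zeros(n_hidden)               # h(0)
loss = 0.0
for t in range(T):
    a = b + W @ h + U @ xs[t]        # a(t)
    h = np.tanh(a)                   # h(t): same W, U, b at every step
    o = c + V @ h                    # o(t)
    y_hat = softmax(o)               # yhat(t)
    loss += -np.log(y_hat[ys[t]])    # negative log-likelihood L(t)
```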

10.2.1 Teacher Forcing and Networks with Output Recurrence

  • Models that have recurrent connections from their outputs leading back into the model may be trained with teacher forcing.

  • Teacher forcing is a procedure that emerges from the maximum likelihood criterion, in which during training the model receives the ground truth output y(t) as input at time t + 1.

    Teacher Forcing
    • (Left) At train time, we feed the correct output y(t) drawn from the train set as input to h(t+1).

    • (Right) When the model is deployed, the true output is generally not known. In this case, we approximate the correct output y(t) with the model’s output O(t), and feed the output back into the model.

  • The disadvantage of strict teacher forcing arises if the network is going to be later used in an open-loop mode, with the network outputs (or samples from the output distribution) fed back as input.
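
A minimal sketch (my own, with made-up shapes) contrasting the two modes for a network whose previous output is fed into the next hidden state:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 8, 3, 5
U = rng.normal(scale=0.1, size=(n_hidden, n_in))    # input-to-hidden
R = rng.normal(scale=0.1, size=(n_hidden, n_out))   # previous-output-to-hidden
V = rng.normal(scale=0.1, size=(n_out, n_hidden))   # hidden-to-output

xs = rng.normal(size=(T, n_in))
ys = rng.normal(size=(T, n_out))    # ground-truth outputs from the training set

def step(x_t, prev_out):
    h = np.tanh(U @ x_t + R @ prev_out)
    return V @ h                     # o(t)

# Teacher forcing (training): feed the *ground truth* y(t) back in at t+1.
prev = np.zeros(n_out)
for t in range(T):
    o = step(xs[t], prev)
    prev = ys[t]                     # ground truth, not the model's own output

# Open loop (deployment): the true output is unknown; feed back o(t) instead.
prev = np.zeros(n_out)
for t in range(T):
    o = step(xs[t], prev)
    prev = o                         # model's own previous output
```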

10.2.2 Computing the Gradient in a Recurrent Neural Network

  • This section mainly covers how to compute the gradient: back-propagation through time (BPTT), i.e., applying ordinary back-propagation to the unrolled computational graph. The key recurrence is sketched below.
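
The core of the calculation is the recurrence for the gradient with respect to the hidden state; for the tanh hidden-to-hidden network sketched earlier it takes roughly this form (my restatement of the book's derivation):

```latex
% BPTT: the gradient w.r.t. h(t) collects the contribution flowing back
% from h(t+1) and the contribution from the output o(t) at the same step.
\nabla_{h^{(t)}} L
  = \Big(\tfrac{\partial h^{(t+1)}}{\partial h^{(t)}}\Big)^{\!\top} \nabla_{h^{(t+1)}} L
    + \Big(\tfrac{\partial o^{(t)}}{\partial h^{(t)}}\Big)^{\!\top} \nabla_{o^{(t)}} L
  = W^{\top} \mathrm{diag}\!\big(1 - (h^{(t+1)})^{2}\big)\, \nabla_{h^{(t+1)}} L
    + V^{\top} \nabla_{o^{(t)}} L
```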

10.2.3 Recurrent Networks as Directed Graphical Models

  • This section mainly presents the RNN as a directed graphical model over the sequence.

  • I have not fully understood what this view buys us; perhaps it mainly aids understanding, e.g. it makes explicit how parameter sharing lets an RNN parameterize a joint distribution over a whole sequence much more compactly than a fully general model over all time steps would.

10.2.4 Modeling Sequences Conditioned on Context with RNNs

  • Ways to feed a fixed-size vector into an RNN as the context that conditions the generated sequence:

    • as an extra input at each time step

    • as the initial state h(0)

    • both of the above. A small sketch of the first two options follows.
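
A minimal sketch of the first two options (the weight matrices R and Wc are my own placeholders for the context-to-hidden interactions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_ctx, n_in, n_hidden, T = 5, 3, 8, 4
context = rng.normal(size=n_ctx)     # the fixed-size vector to condition on

W  = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
U  = rng.normal(scale=0.1, size=(n_hidden, n_in))
R  = rng.normal(scale=0.1, size=(n_hidden, n_ctx))   # context-to-hidden
Wc = rng.normal(scale=0.1, size=(n_hidden, n_ctx))   # context-to-initial-state
xs = rng.normal(size=(T, n_in))

# Option 1: context enters as an extra input at every time step.
h = np.zeros(n_hidden)
for t in range(T):
    h = np.tanh(W @ h + U @ xs[t] + R @ context)

# Option 2: context only determines the initial state h(0).
h = np.tanh(Wc @ context)
for t in range(T):
    h = np.tanh(W @ h + U @ xs[t])
```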

10.3 Bidirectional RNNs

  • The prediction of y(t) may depend on the whole input sequence, not only on the past.

  • Bidirectional RNNs combine an RNN that moves forward through time beginning from the start of the sequence with another RNN that moves backward through time beginning from the end of the sequence.

    Bidirectional RNNs
    • Computation of a typical bidirectional recurrent neural network, meant to learn to map input sequences x to target sequences y, with loss L(t) at each step t.

    • The h recurrence propagates information forward in time (towards the right) while the g recurrence propagates information backward in time (towards the left).

    • Thus at each point t, the output units o(t) can benefit from a relevant summary of the past in its h(t) input and from a relevant summary of the future in its g(t) input.

  • This allows the output units o(t) to compute a representation that depends on both the past and the future but is most sensitive to the input values around time t, without having to specify a fixed-size window around t (as one would have to do with a feedforward network, a convolutional network, or a regular RNN with a fixed-size look-ahead buffer).

  • The idea can also be applied to higher-dimensional data such as 2-D images: "This idea can be naturally extended to 2-dimensional input, such as images, by having four RNNs, each one going in one of the four directions: up, down, left, right." A minimal 1-D bidirectional sketch is given below.
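
A minimal sketch of my own: a forward RNN producing h(t), a backward RNN producing g(t), and an output that reads both (all weights are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 6, 2, 5
xs = rng.normal(size=(T, n_in))

Wf = rng.normal(scale=0.1, size=(n_hidden, n_hidden)); Uf = rng.normal(scale=0.1, size=(n_hidden, n_in))
Wb = rng.normal(scale=0.1, size=(n_hidden, n_hidden)); Ub = rng.normal(scale=0.1, size=(n_hidden, n_in))
Vf = rng.normal(scale=0.1, size=(n_out, n_hidden));    Vb = rng.normal(scale=0.1, size=(n_out, n_hidden))

# Forward recurrence over time: h(t) summarizes the past x(1..t).
h = np.zeros((T, n_hidden)); state = np.zeros(n_hidden)
for t in range(T):
    state = np.tanh(Wf @ state + Uf @ xs[t]); h[t] = state

# Backward recurrence over time: g(t) summarizes the future x(t..T).
g = np.zeros((T, n_hidden)); state = np.zeros(n_hidden)
for t in reversed(range(T)):
    state = np.tanh(Wb @ state + Ub @ xs[t]); g[t] = state

# Each output o(t) sees both the past (via h) and the future (via g).
o = np.array([Vf @ h[t] + Vb @ g[t] for t in range(T)])
```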

10.4 Encoder-Decoder Sequence-to-Sequence Architectures

Sequence-to-sequence architecture
  • Purpose: map an input sequence to an output sequence that is not necessarily of the same length.

  • It consists of two parts:

    • Encoder: a function that maps the input sequence to a final hidden state, called the context C in the book. Note that this final output is a fixed-length vector, which is a weakness; it is exactly what the attention mechanism (introduced later) improves on.

    • Decoder: conditioned on that fixed-length vector, it generates the output sequence.

  • There is no constraint that the encoder must have the same size of hidden layer as the decoder

  • Limitation: when the context C output by the encoder RNN has a dimension that is too small to properly summarize a long sequence, performance degrades.

  • I highly recommend reading this section of reference 1. A minimal encoder-decoder sketch follows.
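
A minimal sketch of the two parts (shapes are made up; real architectures add embeddings, a softmax output layer, and often attention):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_state, n_out = 3, 8, 4
T_in, T_out = 6, 4                         # input/output lengths need not match

We = rng.normal(scale=0.1, size=(n_state, n_state)); Ue = rng.normal(scale=0.1, size=(n_state, n_in))
Wd = rng.normal(scale=0.1, size=(n_state, n_state)); Rd = rng.normal(scale=0.1, size=(n_state, n_out))
Vd = rng.normal(scale=0.1, size=(n_out, n_state))

xs = rng.normal(size=(T_in, n_in))

# Encoder: read the whole input, keep only the final state as the context C.
h = np.zeros(n_state)
for t in range(T_in):
    h = np.tanh(We @ h + Ue @ xs[t])
C = h                                      # fixed-length summary of the input

# Decoder: start from C (here same size; in general C could be projected)
# and generate step by step, feeding the previous output back in.
s, y = C.copy(), np.zeros(n_out)
outputs = []
for t in range(T_out):
    s = np.tanh(Wd @ s + Rd @ y)
    y = Vd @ s
    outputs.append(y)
```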

10.5 Deep Recurrent Networks

  • The network needs enough depth to perform the required mappings.

  • This section discusses how to build deep RNNs, summarized below (see the book for details; I did not fully follow what it is trying to convey -_-). A stacked-layer sketch is given after the list.

    Decomposing the state of an RNN into multiple layers
    • (a) The hidden recurrent state can be broken down into groups organized hierarchically.

    • (b) Deeper computation (e.g., an MLP) can be introduced in the input-to-hidden, hidden-to-hidden and hidden-to-output parts. This may lengthen the shortest path linking different time steps.

    • (c) The path-lengthening effect can be mitigated by introducing skip connections.
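
One common way to add depth, in the spirit of (a)/(b): stack a second recurrent layer on top of the first, so the lower layer feeds the upper layer at every time step. A minimal sketch of my own (placeholder weights):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h1, n_h2, T = 3, 6, 6, 5
xs = rng.normal(size=(T, n_in))

W1 = rng.normal(scale=0.1, size=(n_h1, n_h1)); U1 = rng.normal(scale=0.1, size=(n_h1, n_in))
W2 = rng.normal(scale=0.1, size=(n_h2, n_h2)); U2 = rng.normal(scale=0.1, size=(n_h2, n_h1))

h1 = np.zeros(n_h1)   # lower layer, close to the input
h2 = np.zeros(n_h2)   # upper layer, stacked on top of the lower one
for t in range(T):
    h1 = np.tanh(W1 @ h1 + U1 @ xs[t])
    h2 = np.tanh(W2 @ h2 + U2 @ h1)   # reads the lower layer's new state
```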

10.6 Recursive Neural Networks

  • They are quite different from RNNs: an RNN is essentially a chain structure (over the time dimension), whereas a recursive neural network is structured as a tree.

  • One major advantage over an RNN is that, for a sequence of length n, a balanced tree can reduce the depth from n to O(log n), which shortens the paths that gradients have to travel.

  • As for how to construct the tree, more domain knowledge may be needed (e.g. using a parse tree for natural language). A small composition sketch follows.
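
A minimal sketch of the idea: one shared composition function is applied at every internal node of a given binary tree (the tree is hard-coded here; in practice it might come from a parser):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # dimension of every node representation
W = rng.normal(scale=0.1, size=(d, 2 * d))
b = np.zeros(d)

def compose(left, right):
    """One shared composition function, reused at every internal node."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# Leaves: vector representations of 4 input elements.
x1, x2, x3, x4 = (rng.normal(size=d) for _ in range(4))

# A balanced tree over 4 leaves has depth 2 instead of a chain of length 4.
root = compose(compose(x1, x2), compose(x3, x4))
```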

10.7 The Challenge of Long-Term Dependencies

  • This section mainly discusses the problems of exploding gradients and vanishing gradients.

  • My impression is that LSTM is still the main practical remedy here; the other approaches discussed did not seem that helpful to me. A small numerical illustration of why gradients vanish or explode is given below.
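
A small numerical illustration of my own: back-propagating through many steps multiplies the gradient by (roughly) the transposed recurrent matrix at every step, so the result shrinks or blows up exponentially depending on the magnitudes of W's eigenvalues (the tanh derivative factor is ignored here for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 8, 50
g = rng.normal(size=n)                  # a gradient arriving at the last step

W_small = 0.5 * np.eye(n)               # eigenvalues with magnitude < 1
W_large = 1.5 * np.eye(n)               # eigenvalues with magnitude > 1

g_vanish, g_explode = g.copy(), g.copy()
for _ in range(T):
    g_vanish = W_small.T @ g_vanish     # shrinks exponentially -> vanishing
    g_explode = W_large.T @ g_explode   # grows exponentially  -> exploding

print(np.linalg.norm(g_vanish))         # roughly ||g|| * 0.5**50 (essentially 0)
print(np.linalg.norm(g_explode))        # roughly ||g|| * 1.5**50 (huge)
```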

10.8 Echo State Networks

  • You can look at Wikipedia and the article linked here; both are easier to follow than the book's treatment.

  • There is still a lot here that needs to be understood, but I have not managed to grasp it yet.

10.9 Leaky Units and Other Strategies for Multiple Time Scales

  • The goal is again to handle long-term dependencies, so that information from the past can be used more fully and effectively.

  • Still, these techniques did not seem that helpful to me; LSTM does most of the heavy lifting in practice.

  • The methods in this section all follow one idea: "One way to deal with long-term dependencies is to design a model that operates at multiple time scales, so that some parts of the model operate at fine-grained time scales and can handle small details, while other parts operate at coarse time scales and transfer information from the distant past to the present more efficiently."

  • Apart from skip connections, I did not really understand the rest. -_-

10.9.1 Adding Skip Connections through Time

  • Similar to skip connections in today's CNNs: information from a time step several steps in the past is connected directly to the current time step. "One way to obtain coarse time scales is to add direct connections from variables in the distant past to variables in the present."

10.9.2 Leaky Units and a Spectrum of Different Time Scales

  • Another way to obtain paths on which the product of derivatives is close to one is to have units with linear self-connections and a weight near one on these connections.

  • The self-connection weight (the time constant) has to be chosen by hand, although the book notes it can also be learned as a free parameter. A leaky-unit sketch follows.
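
A minimal sketch of a leaky unit with a linear self-connection of weight alpha, following the running-average update described in the book, μ(t) ← α μ(t−1) + (1 − α) v(t):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
v = rng.normal(size=T)          # some incoming signal at each time step

alpha = 0.99                    # self-connection weight near 1 -> long memory
mu = 0.0
history = []
for t in range(T):
    # Linear self-connection: only (1 - alpha) of the new input leaks in,
    # so information from the distant past decays very slowly.
    mu = alpha * mu + (1 - alpha) * v[t]
    history.append(mu)
```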

10.9.3 Removing Connections

  • Note the difference from skip connections: here ordinary length-one connections are actively removed and replaced with longer ones, forcing groups of units to operate on a coarser time scale.

10.10 The Long Short-Term Memory and Other Gated RNNs

  • This section introduces several network architectures that were proposed to address the long-term dependency problem.

  • The key idea: sometimes it is better to throw information away once it has been used.

  • The neural network decides for itself what to forget, how much to forget, and when to forget it: this is the role of the gating functions.

10.10.1 LSTM

  • I did not read this subsection carefully; reference 4 explains it very accessibly. A minimal LSTM-cell sketch is included below.
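
Since this subsection is thin, here is a minimal NumPy sketch of one step of a standard LSTM cell; the notation is simplified relative to the book's presentation, so treat it as an illustration rather than the book's exact formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 8

# One weight matrix per gate / candidate, each acting on [h(t-1); x(t)].
Wf, Wi, Wo, Wc = (rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in)) for _ in range(4))
bf, bi, bo, bc = (np.zeros(n_hidden) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z + bf)            # forget gate: what to drop from the cell
    i = sigmoid(Wi @ z + bi)            # input gate: what new content to write
    o = sigmoid(Wo @ z + bo)            # output gate: what to expose as h(t)
    c_tilde = np.tanh(Wc @ z + bc)      # candidate cell content
    c = f * c_prev + i * c_tilde        # cell state: gated linear self-loop
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):  # a toy sequence of length 5
    h, c = lstm_step(x_t, h, c)
```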

10.10.2 Other Gated RNNs

  • I did not read this carefully either.

10.11 Optimization for Long-Term Dependencies

  • This section mainly covers optimization techniques for the exploding and vanishing gradients that arise with long-term dependencies.

10.11.1 Clipping Gradients

  • Addresses exploding gradients: when the gradient norm exceeds a threshold, the gradient is rescaled (clipped) before the parameter update. A sketch follows.
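
A minimal sketch of clipping the gradient norm just before the parameter update (the threshold value here is arbitrary):

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """If ||grad|| exceeds the threshold, rescale it so its norm equals the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

# Usage inside a (hypothetical) training loop:
grad = np.array([3.0, 4.0])          # ||grad|| = 5
grad = clip_by_norm(grad, 1.0)       # rescaled to norm 1 -> [0.6, 0.8]
# params -= learning_rate * grad     # then the usual SGD update
```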

10.11.2 Regularizing to Encourage Information Flow

  • Addresses vanishing gradients: a regularization term encourages the back-propagated gradient to maintain its magnitude as it flows backward through time.

10.12 Explicit Memory

  • I did not understand this section yet.

References

  1. DeepLearningBook_10.3_10.9

  2. SequenceModeling.ppt

  3. Deep Learning - Study Circle Sequence Modeling: Recurrent and Recursive Nets

  4. Understanding LSTM Networks