Hypernetworks
(ICLR 2017) Hypernetworks
Paper: https://openreview.net/pdf?id=rkpACe1lx
Code: https://github.com/hardmaru/supercell
Blog: http://blog.otoro.net/2016/09/28/hyper-networks/
Learn a recurrent network whose weights are updated dynamically: a small network is used to learn the weights of another, larger network, and the generated weights are specific to a particular layer of the large network.
Hypernetworks provide a new form of weight sharing that sits between that of CNNs and RNNs, which lets them strike a good balance between the number of parameters on one side and the model's performance and flexibility on the other.
using one network, also known as a hypernetwork, to generate the weights for another network.
We apply hypernetworks to generate adaptive weights for recurrent networks.
hypernetworks can generate non-shared weights for LSTM.
Introduction
using a small network (called a “hypernetwork") to generate the weights for a larger network (called a main network)
the hypernetwork takes a set of inputs that contain information about the structure of the weights and generates the weights for that layer.
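As a minimal sketch of this static setting (all sizes and names below are illustrative assumptions, not taken from the paper's code), a tiny hypernetwork can map a learned per-layer embedding \(z\) to the flattened weight matrix of one layer of the main network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper): the main layer maps 64 -> 64 units,
# and each layer is described by a small embedding vector z of size 4.
n_in, n_out, n_z, n_hidden = 64, 64, 4, 16

# Hypernetwork parameters: a tiny two-layer MLP that outputs n_out * n_in numbers.
W1 = rng.normal(0.0, 0.1, (n_hidden, n_z))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.1, (n_out * n_in, n_hidden))
b2 = np.zeros(n_out * n_in)

def generate_layer_weights(z):
    """Map a layer embedding z to the weight matrix of one main-network layer."""
    h = np.tanh(W1 @ z + b1)
    return (W2 @ h + b2).reshape(n_out, n_in)

# One embedding per main-network layer; the embeddings and the hypernetwork
# parameters are what would be trained end-to-end in practice.
z_layer = rng.normal(size=n_z)
W_layer = generate_layer_weights(z_layer)

x = rng.normal(size=n_in)
y = np.maximum(0.0, W_layer @ x)   # forward pass through the generated layer
print(W_layer.shape, y.shape)      # (64, 64) (64,)
```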
The focus of this work is to use hypernetworks to generate weights for recurrent networks (RNNs).
We perform experiments to investigate the behaviors of hypernetworks in a range of contexts and find that hypernetworks mix well with other techniques such as batch normalization and layer normalization.
Our main result is that hypernetworks can generate non-shared weights for LSTM that work better than the standard version of LSTM.
Related Work
Evolutionary methods find it difficult to directly operate in large search spaces consisting of millions of weight parameters.
HyperNEAT framework: Compositional Pattern-Producing Networks (CPPNs) are evolved to define the weight structure of the much larger main network.
Differentiable Pattern Producing Networks (DPPNs): the structure is evolved but the weights are learned
ACDC-Networks: linear layers are compressed with DCT and the parameters are learned
Methods
when they are applied to recurrent networks, hypernetworks can be seen as a form of relaxed weight-sharing in the time dimension.
HyperRNN
When a hypernetwork is used to generate the weights for an RNN, we refer to it as the HyperRNN.
The standard formulation of a Basic RNN is given by:
\(h_t = \phi(W_h h_{t - 1} + W_x x_t + b)\)
In HyperRNN, we allow \(W_h\) and \(W_x\) to float over time by using a smaller hypernetwork to generate these parameters of the main RNN at each step. More concretely, the parameters \(W_h\), \(W_x\), \(b\) of the main RNN are different at different time steps, so that \(h_t\) can now be computed as:
\(h_t = \phi(W_h(z_h) h_{t - 1} + W_x(z_x) x_t + b(z_b))\), where
\(W_h(z_h) = \left< W_{hz}, z_h \right>\)
\(W_x(z_x) = \left< W_{xz}, z_x \right>\)
\(b(z_b) = W_{bz} z_b + b_0\)
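A direct NumPy sketch of this step (toy dimensions, hypothetical variable names, not the paper's code): the embedding tensors \(W_{hz}\) and \(W_{xz}\) are contracted with \(z_h\) and \(z_x\) at every time step to produce the time-varying weights. As noted further below, this direct form is memory-hungry and is shown only for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: main RNN hidden size N_h, input size N_x, embedding size N_z.
N_h, N_x, N_z = 8, 4, 3

# Embedding tensors/matrices that generate the main RNN's weights.
W_hz = rng.normal(0.0, 0.1, (N_h, N_h, N_z))   # generates W_h(z_h)
W_xz = rng.normal(0.0, 0.1, (N_h, N_x, N_z))   # generates W_x(z_x)
W_bz = rng.normal(0.0, 0.1, (N_h, N_z))        # generates b(z_b)
b_0  = np.zeros(N_h)

def hyper_rnn_step(h_prev, x_t, z_h, z_x, z_b):
    """One step of the main RNN with weights generated from (z_h, z_x, z_b)."""
    W_h = W_hz @ z_h                 # <W_hz, z_h>: tensor-vector product -> (N_h, N_h)
    W_x = W_xz @ z_x                 # <W_xz, z_x> -> (N_h, N_x)
    b   = W_bz @ z_b + b_0
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

# In the full model the z's are produced by the recurrent hypernetwork sketched
# further below; here they are random placeholders.
h, x_t = np.zeros(N_h), rng.normal(size=N_x)
z_h = z_x = z_b = rng.normal(size=N_z)
h = hyper_rnn_step(h, x_t, z_h, z_x, z_b)
print(h.shape)   # (8,)
```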
Figure 1: An overview of HyperRNNs. Black connections and parameters are associated with basic RNNs. Orange connections and parameters are introduced in this work and associated with HyperRNNs. Dotted arrows are for parameter generation.
We use a recurrent hypernetwork to compute \(z_h\), \(z_x\) and \(z_b\) as a function of \(x_t\) and \(h_{t−1}\):
\(\hat{x}_t = \begin{pmatrix} h_{t - 1} \\ x_t \\ \end{pmatrix}\)
\(\hat{h}_t = \phi(W_{\hat{h}} \hat{h}_{t - 1} + W_{\hat{x}} \hat{x}_t + \hat{b})\)
\(z_h = W_{\hat{h}h} \hat{h}_{t - 1} + b_{\hat{h}h}\)
\(z_x = W_{\hat{h}x} \hat{h}_{t - 1} + b_{\hat{h}x}\)
\(z_b = W_{\hat{h}b} \hat{h}_{t - 1}\)
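A matching sketch of this recurrence (again toy sizes and hypothetical names): the hypernetwork is itself a small RNN whose input is the concatenation of \(h_{t-1}\) and \(x_t\), and whose previous state \(\hat{h}_{t-1}\) is linearly projected to the embeddings \(z_h\), \(z_x\), \(z_b\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes: main RNN hidden/input sizes and the (smaller) hypernetwork hidden size.
N_h, N_x, N_z, N_hyper = 8, 4, 3, 6

# Hypernetwork RNN parameters; its input is the concatenation (h_{t-1}, x_t).
W_hhat = rng.normal(0.0, 0.1, (N_hyper, N_hyper))
W_xhat = rng.normal(0.0, 0.1, (N_hyper, N_h + N_x))
b_hat  = np.zeros(N_hyper)

# Linear maps from the hypernetwork's previous state to z_h, z_x, z_b.
W_hh, b_hh = rng.normal(0.0, 0.1, (N_z, N_hyper)), np.zeros(N_z)
W_hx, b_hx = rng.normal(0.0, 0.1, (N_z, N_hyper)), np.zeros(N_z)
W_hb       = rng.normal(0.0, 0.1, (N_z, N_hyper))

def hypernetwork_step(h_hat_prev, h_prev, x_t):
    """Advance the hypernetwork RNN and emit the embeddings for the main RNN."""
    x_hat = np.concatenate([h_prev, x_t])                          # \hat{x}_t
    h_hat = np.tanh(W_hhat @ h_hat_prev + W_xhat @ x_hat + b_hat)  # \hat{h}_t
    z_h = W_hh @ h_hat_prev + b_hh
    z_x = W_hx @ h_hat_prev + b_hx
    z_b = W_hb @ h_hat_prev
    return h_hat, z_h, z_x, z_b

h_hat, h_prev, x_t = np.zeros(N_hyper), np.zeros(N_h), rng.normal(size=N_x)
h_hat, z_h, z_x, z_b = hypernetwork_step(h_hat, h_prev, x_t)
print(z_h.shape, z_x.shape, z_b.shape)   # (3,) (3,) (3,)
```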
However, Equation 2 is not practical: the embedding tensor \(W_{hz}\) alone has \(N_h \times N_h \times N_z\) entries (with \(N_h\) the hidden size and \(N_z\) the embedding size), so the memory usage becomes too large for real problems.
We instead use an intermediate hidden vector \(d(z) \in \mathbb{R}^{N_h}\) to parametrize each weight matrix, where \(d(z)\) is a linear function of \(z\). We refer to \(d\) as a weight scaling vector. Below is the modification to \(W(z)\):
\(W(z) = W(d(z)) = \begin{pmatrix} d_0(z) W_0 \\ d_1(z) W_1 \\ \vdots \\ d_{N_h}(z) W_{N_h} \\ \end{pmatrix}\)
The hypernetwork then only uses extra memory on the order of \(N_z\) times the number of hidden units, which is an acceptable overhead.
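A small NumPy check of this modification (toy sizes, hypothetical names): each row \(W_i\) of a single shared matrix \(W\) is scaled by \(d_i(z)\), which is equivalent to an element-wise rescaling of the pre-activation \(W h\). Only the shared matrix and a small \(N_h \times N_z\) projection need to be stored, and this element-wise view is also what makes the comparison with normalization techniques and multiplicative RNNs below natural.

```python
import numpy as np

rng = np.random.default_rng(2)
N_h, N_z = 8, 3

# One shared weight matrix plus a small linear map z -> d(z); this replaces
# the (N_h x N_h x N_z) embedding tensor of the direct formulation.
W   = rng.normal(0.0, 0.1, (N_h, N_h))
W_z = rng.normal(0.0, 0.1, (N_h, N_z))   # d(z) = W_z z, the weight scaling vector

z      = rng.normal(size=N_z)
h_prev = rng.normal(size=N_h)

d = W_z @ z                              # d(z): one scale per row of W

# Explicit form: scale each row W_i of W by d_i(z), then multiply.
W_of_z = d[:, None] * W
out_explicit = W_of_z @ h_prev

# Equivalent cheap form: never materialize W(z); scale the pre-activation instead.
out_cheap = d * (W @ h_prev)

print(np.allclose(out_explicit, out_cheap))   # True
```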
Related Approaches
The formulation of the HyperRNN in Equation 5 has similarities to Recurrent Batch Normalization (Cooijmans et al., 2016) and Layer Normalization (Ba et al., 2016).
The central idea for the normalization techniques is to calculate the first two statistical moments of the inputs to the activation function, and to linearly scale the inputs to have zero mean and unit variance.
After the normalization, an additional set of parameters, fixed across time steps but learned during training, rescales the inputs if required.
The element-wise operation also has similarities to the Multiplicative RNN and its extensions (mRNN, mLSTM) (Sutskever et al., 2011; Krause et al., 2016) and Multiplicative Integration RNN (MI-RNN) (Wu et al., 2016).
Experiments
Character-level Penn Treebank Language Modelling
Hutter Prize Wikipedia Language Modelling
Handwriting Sequence Generation
Neural Machine Translation
Conclusion
In this paper, we presented a method that uses one network to generate the weights for another neural network. Our hypernetworks are trained end-to-end with backpropagation and are therefore efficient and scalable. We focused on applying hypernetworks to generate weights for recurrent networks. On language modelling and handwriting generation, hypernetworks are competitive with, and sometimes better than, state-of-the-art models. On machine translation, hypernetworks achieve a significant gain on top of a state-of-the-art production-level model.