8.1.5.1: Improving the Decoder (3) [『ゼロつく2』 Notes (Implementation)]

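Introduction

These are implementation notes for beginners working through the book 『ゼロから作るDeep Learning 2』 (自然言語処理編, the natural language processing volume). Explanations are added where they help with studying 『ゼロつく2』, so read these notes alongside the book.

We check the book's content one step at a time and build the code slowly.

This article covers Section 8.1.5, "Decoderの改良3" (Improving the Decoder, part 3). It explains how the Attention layer extracts the needed information from the encoded input, and then implements the layer in Python.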

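[Contents of this section]

Introduction
8.1.5.1 Improving the Decoder (3)
- ・Reviewing the process
  - ・Forward propagation
  - ・Backward propagation
- ・Implementation
References
Closing remarks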

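8.1.5.1 Improving the Decoder (3)

We implement the Attention layer by combining the Weight Sum layer and the Attention Weight layer. The Attention layer extracts the needed information from the encoded input so that the Decoder can use it.

・Reviewing the process

We walk through the process with reference to Figure 8-16 in the book.

・Forward propagation

First we set the values that determine the shapes of the data.

The Attention Weight layer receives the hidden states $\mathbf{hs}^{(\mathrm{Enc})} = (\mathbf{h}_0^{(\mathrm{Enc})}, \cdots, \mathbf{h}_{T-1}^{(\mathrm{Enc})})$ from the Encoder and $\mathbf{h}_t^{(\mathrm{Dec})}$ from the Decoder's $t$-th LSTM layer. The $T$ in $\mathbf{hs}^{(\mathrm{Enc})}$ is the time-series size on the Encoder side.

Here we create simple stand-ins for $\mathbf{hs}^{(\mathrm{Enc})}$ and $\mathbf{h}_t^{(\mathrm{Dec})}$ so that the result of each step is easy to inspect.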

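# Import NumPy
import numpy as np

# Set the values that determine the variable shapes
N = 3 # batch size (number of input sentences)
T = 4 # Encoder time-series size (number of input words)
H = 5 # hidden state size (number of units in the LSTM layer's hidden layer)

# Create T Encoder hidden states (simplified stand-in)
enc_hs = np.random.randn(N, T, H)
print(np.round(enc_hs, 2))
print(enc_hs.shape)

# Create the Decoder's t-th hidden state (simplified stand-in)
dec_h = np.random.randn(N, H)
print(np.round(dec_h, 2))
print(dec_h.shape)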

[[[ 0.06  1.79 -0.63 -0.67  1.05]
  [-0.95 -1.21  0.96  0.36  0.58]
  [ 1.04  0.22 -0.45  1.22  0.22]
  [ 0.36  0.42 -0.93 -0.83  1.3 ]]

 [[-1.24 -1.02  1.75  0.29 -0.78]
  [-2.42  0.72  0.69  0.64 -0.39]
  [ 0.88 -0.24 -1.12  0.07  0.72]
  [ 1.1   0.51  1.52  0.83  0.12]]

 [[-0.51 -0.02  1.6   0.41  1.12]
  [-1.44  0.05 -0.37  1.   -1.59]
  [ 0.03 -1.86  0.2  -0.52  0.81]
  [ 0.61  0.43 -2.2  -0.04 -1.22]]]
(3, 4, 5)
[[-1.2  -0.28  0.44  0.49  0.84]
 [ 1.17  0.94 -0.63 -1.13 -1.6 ]
 [ 0.28  0.03  0.82 -1.36  1.9 ]]
(3, 5)

We create instances of the two layers.

# Create an instance of the Attention Weight layer
# (AttentionWeight and WeightSum are assumed to be defined as in the preceding sections)
attention_weight_layer = AttentionWeight()

# Create an instance of the Weight Sum layer
weight_sum_layer = WeightSum()

We feed $\mathbf{hs}^{(\mathrm{Enc})}$ and $\mathbf{h}_t^{(\mathrm{Dec})}$ into the Attention Weight layer to compute the Attention weights $\mathbf{a}_t$.

# Compute the Attention weights
a = attention_weight_layer.forward(enc_hs, dec_h)
print(np.round(a, 2))
print(np.sum(a, axis=1))
print(a.shape)

[[0.05 0.87 0.03 0.05]
 [0.03 0.03 0.61 0.33]
 [0.59 0.   0.41 0.  ]]
[1. 1. 1.]
(3, 4)

Each row sums to 1.
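As a rough sketch of why each row sums to 1, the Attention weights can be reproduced with plain NumPy: take the dot product between every Encoder hidden state and the Decoder hidden state, then normalize over the $T$ positions with softmax. The function names `softmax` and `attention_weight_forward` and the `_demo` arrays are hypothetical stand-ins for illustration; this is not the AttentionWeight class itself.

```python
import numpy as np

def softmax(x):
    # Subtract the row-wise max for numerical stability
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def attention_weight_forward(hs, h):
    # hs: (N, T, H) Encoder hidden states, h: (N, H) Decoder hidden state
    # Dot product of each Encoder state with the Decoder state -> scores (N, T)
    s = np.sum(hs * h[:, np.newaxis, :], axis=2)
    # Normalize the scores over the T positions
    return softmax(s)

# Hypothetical stand-in data
rng = np.random.default_rng(0)
N, T, H = 3, 4, 5
hs_demo = rng.standard_normal((N, T, H))
h_demo = rng.standard_normal((N, H))

a_demo = attention_weight_forward(hs_demo, h_demo)
print(a_demo.shape)  # (3, 4)
print(np.sum(a_demo, axis=1))
```

Because softmax is applied along the time axis, every row is a set of non-negative weights over the $T$ Encoder positions.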

We feed $\mathbf{hs}^{(\mathrm{Enc})}$ and $\mathbf{a}_t$ into the Weight Sum layer to compute the context $\mathbf{c}_t$.

# Compute the context
c = weight_sum_layer.forward(enc_hs, a)
print(np.round(c, 2))
print(c.shape)

[[-0.76 -0.93  0.74  0.28  0.62]
 [ 0.78  0.01 -0.09  0.35  0.44]
 [-0.29 -0.77  1.03  0.03  0.99]]
(3, 5)
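The context is a weighted sum of the $T$ Encoder hidden states, using the Attention weights as coefficients. The following is a minimal sketch under that reading; `weight_sum_forward` and the `_demo` arrays are hypothetical names, and uniform weights are chosen only because the expected result (the plain mean) is then easy to check.

```python
import numpy as np

def weight_sum_forward(hs, a):
    # hs: (N, T, H), a: (N, T)
    # Scale each hidden state by its weight, then sum over the T axis -> (N, H)
    return np.sum(hs * a[:, :, np.newaxis], axis=1)

# Hypothetical stand-in data
rng = np.random.default_rng(0)
N, T, H = 3, 4, 5
hs_demo = rng.standard_normal((N, T, H))
a_demo = np.full((N, T), 1.0 / T)  # uniform weights

c_demo = weight_sum_forward(hs_demo, a_demo)
print(c_demo.shape)  # (3, 5)

# With uniform weights the context is the mean of the T hidden states
print(np.allclose(c_demo, hs_demo.mean(axis=1)))  # True

# The same contraction written with einsum
c_einsum = np.einsum('nt,nth->nh', a_demo, hs_demo)
print(np.allclose(c_demo, c_einsum))  # True
```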

$\mathbf{c}_t$ and $\mathbf{h}_t^{(\mathrm{Dec})}$ are concatenated and fed into the $t$-th Affine layer.
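A minimal sketch of this concatenation step, assuming the two arrays are joined along the feature axis in the order (c, h). The order, the variable names, and the stand-in values are assumptions for illustration. The Affine input then has shape (N, 2H), and in the backward pass the incoming gradient would be split back into two halves.

```python
import numpy as np

N, H = 3, 5
c_demo = np.zeros((N, H))  # stand-in context
h_demo = np.ones((N, H))   # stand-in Decoder hidden state

# Forward: join along the feature axis -> (N, 2H)
affine_in = np.concatenate((c_demo, h_demo), axis=1)
print(affine_in.shape)  # (3, 10)

# Backward: split the gradient of the Affine input back into dc and dh
d_affine_in = np.arange(N * 2 * H, dtype=float).reshape(N, 2 * H)
dc_demo, dh_demo = d_affine_in[:, :H], d_affine_in[:, H:]
print(dc_demo.shape, dh_demo.shape)  # (3, 5) (3, 5)
```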

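That covers forward propagation. Next we review backward propagation.

・Backward propagation

The gradient of the context $\frac{\partial L}{\partial \mathbf{c}_t}$ comes in from the $t$-th Affine layer. Here we create a simple stand-in for $\frac{\partial L}{\partial \mathbf{c}_t}$.

# Create the gradient of the context (simplified stand-in)
dc = np.ones((N, H))
print(dc.shape)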

(3, 5)

We feed $\frac{\partial L}{\partial \mathbf{c}_t}$ into the Weight Sum layer to compute the gradient of the Encoder hidden states $\frac{\partial L}{\partial \mathbf{hs}^{(\mathrm{Enc})}}$ and the gradient of the Attention weights $\frac{\partial L}{\partial \mathbf{a}_t}$.

# Compute backward propagation through the Weight Sum layer
enc_dhs0, da = weight_sum_layer.backward(dc)
print(np.round(enc_dhs0, 2))
print(enc_dhs0.shape)
print(np.round(da, 2))
print(da.shape)

[[[0.05 0.05 0.05 0.05 0.05]
  [0.87 0.87 0.87 0.87 0.87]
  [0.03 0.03 0.03 0.03 0.03]
  [0.05 0.05 0.05 0.05 0.05]]

 [[0.03 0.03 0.03 0.03 0.03]
  [0.03 0.03 0.03 0.03 0.03]
  [0.61 0.61 0.61 0.61 0.61]
  [0.33 0.33 0.33 0.33 0.33]]

 [[0.59 0.59 0.59 0.59 0.59]
  [0.   0.   0.   0.   0.  ]
  [0.41 0.41 0.41 0.41 0.41]
  [0.   0.   0.   0.   0.  ]]]
(3, 4, 5)
[[ 1.6  -0.26  2.26  0.33]
 [-1.01 -0.76  0.32  4.08]
 [ 2.6  -2.35 -1.35 -2.42]]
(3, 4)
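The pattern in the output above (each row of the hidden-state gradient repeats one Attention weight, and each entry of the weight gradient is the sum of one hidden state) follows from the weighted sum when the upstream gradient is all ones. A hedged NumPy sketch of that backward step, with the hypothetical helper `weight_sum_backward` and stand-in data:

```python
import numpy as np

def weight_sum_backward(dc, hs, a):
    # dc: (N, H) upstream gradient, hs: (N, T, H), a: (N, T)
    dt = dc[:, np.newaxis, :]      # broadcast over T -> (N, 1, H)
    dhs = dt * a[:, :, np.newaxis]  # gradient w.r.t. hs -> (N, T, H)
    da = np.sum(dt * hs, axis=2)    # gradient w.r.t. a  -> (N, T)
    return dhs, da

# Hypothetical stand-in data
rng = np.random.default_rng(0)
N, T, H = 3, 4, 5
hs_demo = rng.standard_normal((N, T, H))
a_demo = rng.random((N, T))
a_demo /= a_demo.sum(axis=1, keepdims=True)  # rows sum to 1
dc_demo = np.ones((N, H))

dhs0_demo, da_demo = weight_sum_backward(dc_demo, hs_demo, a_demo)

# With dc = 1, every element of dhs0[n, t, :] equals a[n, t]
print(np.allclose(dhs0_demo[:, :, 0], a_demo))      # True
# With dc = 1, da[n, t] is the sum of hs[n, t, :]
print(np.allclose(da_demo, hs_demo.sum(axis=2)))    # True
```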

We feed $\frac{\partial L}{\partial \mathbf{a}_t}$ into the Attention Weight layer to compute the gradient of the Encoder hidden states $\frac{\partial L}{\partial \mathbf{hs}^{(\mathrm{Enc})}}$ and the gradient of the Decoder hidden state $\frac{\partial L}{\partial \mathbf{h}_t^{(\mathrm{Dec})}}$.

# Compute backward propagation through the Attention Weight layer
enc_dhs1, dec_dh = attention_weight_layer.backward(da)
print(np.round(enc_dhs1, 2))
print(enc_dhs1.shape)
print(np.round(dec_dh, 2))
print(dec_dh.shape)

[[[-0.1  -0.02  0.04  0.04  0.07]
  [ 0.21  0.05 -0.08 -0.09 -0.15]
  [-0.09 -0.02  0.03  0.04  0.06]
  [-0.02 -0.01  0.01  0.01  0.02]]

 [[-0.1  -0.08  0.05  0.1   0.14]
  [-0.08 -0.06  0.04  0.08  0.11]
  [-0.83 -0.66  0.45  0.79  1.13]
  [ 1.01  0.8  -0.54 -0.97 -1.37]]

 [[ 0.27  0.03  0.78 -1.3   1.81]
  [-0.   -0.   -0.    0.   -0.  ]
  [-0.27 -0.03 -0.78  1.29 -1.81]
  [-0.   -0.   -0.    0.   -0.  ]]]
(3, 4, 5)
[[ 0.26  0.38 -0.27 -0.04  0.03]
 [ 0.6   0.64  1.9   0.59 -0.31]
 [-0.52  1.75  1.35  0.89  0.3 ]]
(3, 5)
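In the output above, the Encoder-side gradient sums to roughly zero over the $T$ rows within each batch item. One way to see why, assuming the layer is a dot product followed by softmax: the softmax backward pass yields score gradients that sum to zero over $T$, and each row of the hidden-state gradient is that score gradient times the same Decoder vector. The helper names below are hypothetical, and the code is a sketch rather than the AttentionWeight class.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def attention_weight_backward(da, hs, h, a):
    # Softmax backward: ds = a * (da - sum_t(a * da)) -> (N, T)
    ds = a * (da - np.sum(a * da, axis=1, keepdims=True))
    # Dot-product backward
    dhs = ds[:, :, np.newaxis] * h[:, np.newaxis, :]  # (N, T, H)
    dh = np.sum(ds[:, :, np.newaxis] * hs, axis=1)    # (N, H)
    return dhs, dh

# Hypothetical stand-in data
rng = np.random.default_rng(0)
N, T, H = 3, 4, 5
hs_demo = rng.standard_normal((N, T, H))
h_demo = rng.standard_normal((N, H))
a_demo = softmax(np.sum(hs_demo * h_demo[:, np.newaxis, :], axis=2))
da_demo = rng.standard_normal((N, T))

dhs1_demo, dh_demo = attention_weight_backward(da_demo, hs_demo, h_demo, a_demo)
print(dhs1_demo.shape, dh_demo.shape)  # (3, 4, 5) (3, 5)

# The gradient through the scores cancels out when summed over T
print(np.allclose(dhs1_demo.sum(axis=1), 0.0))  # True
```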

$\frac{\partial L}{\partial \mathbf{h}_t^{(\mathrm{Dec})}}$ is passed to the Decoder's $t$-th LSTM layer.

Both layers produced a gradient $\frac{\partial L}{\partial \mathbf{hs}^{(\mathrm{Enc})}}$. This is because, in forward propagation, the Encoder hidden states $\mathbf{hs}^{(\mathrm{Enc})}$ branch and enter both layers. As the backward pass of that branch node, we therefore take the sum of the two $\frac{\partial L}{\partial \mathbf{hs}^{(\mathrm{Enc})}}$.

# Sum the gradients of the branched Encoder hidden states
enc_dhs = enc_dhs0 + enc_dhs1
print(np.round(enc_dhs, 2))
print(enc_dhs.shape)

[[[-0.05  0.03  0.09  0.09  0.12]
  [ 1.08  0.92  0.79  0.78  0.72]
  [-0.06  0.01  0.07  0.07  0.1 ]
  [ 0.03  0.05  0.06  0.06  0.07]]

 [[-0.07 -0.05  0.09  0.13  0.17]
  [-0.05 -0.03  0.07  0.11  0.14]
  [-0.22 -0.06  1.05  1.4   1.73]
  [ 1.34  1.13 -0.21 -0.64 -1.04]]

 [[ 0.86  0.62  1.37 -0.71  2.4 ]
  [ 0.    0.   -0.    0.   -0.  ]
  [ 0.14  0.38 -0.37  1.7  -1.4 ]
  [ 0.    0.   -0.    0.   -0.  ]]]
(3, 4, 5)
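One way to convince yourself that the two gradients must be summed is a numerical gradient check. The sketch below re-implements the forward and backward passes in plain NumPy (hypothetical helper names, assuming dot-product scores followed by softmax) and compares the summed analytic gradient with central finite differences for the loss $L = \sum \mathbf{c}_t$, which corresponds to an upstream gradient of all ones.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def attention_forward(hs, h):
    a = softmax(np.sum(hs * h[:, None, :], axis=2))  # weights (N, T)
    c = np.sum(hs * a[:, :, None], axis=1)           # context (N, H)
    return c, a

def attention_backward(dc, hs, h, a):
    # Weight Sum part
    dhs0 = dc[:, None, :] * a[:, :, None]
    da = np.sum(dc[:, None, :] * hs, axis=2)
    # Attention Weight part (softmax, then dot product)
    ds = a * (da - np.sum(a * da, axis=1, keepdims=True))
    dhs1 = ds[:, :, None] * h[:, None, :]
    dh = np.sum(ds[:, :, None] * hs, axis=1)
    # hs branched into both parts, so its two gradients are added
    return dhs0 + dhs1, dh

def numerical_grad(f, x, eps=1e-5):
    # Central differences, perturbing x in place one element at a time
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        orig = x[idx]
        x[idx] = orig + eps
        fp = f()
        x[idx] = orig - eps
        fm = f()
        x[idx] = orig
        grad[idx] = (fp - fm) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
hs_demo = rng.standard_normal((3, 4, 5))
h_demo = rng.standard_normal((3, 5))

c_demo, a_demo = attention_forward(hs_demo, h_demo)
dhs_demo, dh_demo = attention_backward(np.ones_like(c_demo), hs_demo, h_demo, a_demo)

loss = lambda: attention_forward(hs_demo, h_demo)[0].sum()
print(np.allclose(dhs_demo, numerical_grad(loss, hs_demo), atol=1e-6))
print(np.allclose(dh_demo, numerical_grad(loss, h_demo), atol=1e-6))
```

If only one of the two partial gradients were used, the comparison for `dhs_demo` would be expected to fail.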

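The summed $\frac{\partial L}{\partial \mathbf{hs}^{(\mathrm{Enc})}}$ is passed to the Encoder's Time LSTM layer.

That is the whole process of the Attention layer.

・Implementation

Now that we have reviewed the process, we implement the Attention layer as a class.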

# Implementation of the Attention layer
class Attention:
    # Initializer
    def __init__(self):
        # Empty lists, kept for consistency with the other layers
        self.params = [] # parameters
        self.grads = []  # gradients

        # Create the layer instances
        self.attention_weight_layer = AttentionWeight()
        self.weight_sum_layer = WeightSum()

        # Initialize the Attention weights
        self.attention_weight = None

    # Forward propagation
    def forward(self, hs, h):
        # Compute the Attention weights
        a = self.attention_weight_layer.forward(hs, h)

        # Compute the weighted sum of the T Encoder hidden states
        out = self.weight_sum_layer.forward(hs, a)

        # Store the Attention weights
        self.attention_weight = a
        return out

    # Backward propagation
    def backward(self, dout):
        # Compute backward propagation through each layer
        dhs0, da = self.weight_sum_layer.backward(dout)
        dhs1, dh = self.attention_weight_layer.backward(da)

        # Sum the two gradients of the T Encoder hidden states
        dhs = dhs0 + dhs1

        # Return the gradient of the T Encoder hidden states and of the Decoder's t-th hidden state
        return dhs, dh

Let's try out the class.

We create simple stand-ins for the Encoder hidden states $\mathbf{hs}$ and the Decoder hidden state $\mathbf{h}_t$, and then create an instance of the Attention layer.

# Create T Encoder hidden states (simplified stand-in)
hs = np.random.randn(N, T, H)
print(np.round(hs, 2))
print(hs.shape)

# Create the Decoder's t-th hidden state (simplified stand-in)
h = np.random.randn(N, H)
print(np.round(h, 2))
print(h.shape)

# Create the instance
attention_layer = Attention()

[[[ 1.1   1.23 -0.16 -0.57  0.58]
  [-0.37 -0.55  0.18  0.81  2.19]
  [ 1.33 -0.56  0.66 -0.1  -1.81]
  [-0.57  0.77 -1.41  0.34 -0.74]]

 [[-0.99 -0.95  0.56  1.7  -0.82]
  [ 0.85  0.46  0.92 -1.13 -0.16]
  [-0.11  0.95 -0.34 -0.15  0.9 ]
  [ 0.14 -1.89 -0.74 -1.13  0.49]]

 [[ 2.17 -0.1  -1.32  0.52  0.4 ]
  [ 1.13 -0.05 -0.31 -0.25 -0.87]
  [-0.3  -0.95 -0.15 -0.74 -1.75]
  [-0.17 -0.18  0.6   0.09  1.63]]]
(3, 4, 5)
[[ 1.11 -0.26 -1.81 -2.28  0.74]
 [ 0.72 -1.65 -2.16 -0.61  0.73]
 [-2.74  1.02 -0.74 -0.74 -1.99]]
(3, 5)

We compute forward propagation.

# Compute forward propagation
c = attention_layer.forward(hs, h)
print(np.round(c, 2))
print(c.shape)

[[ 0.96  1.12 -0.22 -0.46  0.47]
 [ 0.14 -1.88 -0.73 -1.13  0.49]
 [-0.29 -0.94 -0.15 -0.74 -1.74]]
(3, 5)

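The context c, which carries the information extracted from the encoded input, is fed into the Affine layer at the same time step.

We check the Attention weights stored as an instance variable.

# Retrieve the stored Attention weights
a = attention_layer.attention_weight
print(np.round(a, 3))
print(np.sum(a, axis=1))
print(a.shape)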

[[0.885 0.021 0.024 0.07 ]
 [0.    0.001 0.002 0.997]
 [0.    0.007 0.993 0.001]]
[1. 1. 1.]
(3, 4)

We can confirm that each row sums to 1.
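The weights printed above are sharply peaked, so each context row should be close to the single Encoder hidden state with the largest weight. The sketch below checks this using values copied from the printed (rounded) output in this article; the variable names are hypothetical.

```python
import numpy as np

# Attention weights copied from the printed output above (rounded to 3 digits)
a_printed = np.array([[0.885, 0.021, 0.024, 0.07],
                      [0.0,   0.001, 0.002, 0.997],
                      [0.0,   0.007, 0.993, 0.001]])

# Index of the Encoder time step with the largest weight, per batch item
top = np.argmax(a_printed, axis=1)
print(top)  # [0 3 2]

# Batch item 1: hs[1, 3] versus the context c[1] (both from the printed output)
hs_1_3 = np.array([0.14, -1.89, -0.74, -1.13, 0.49])
c_1 = np.array([0.14, -1.88, -0.73, -1.13, 0.49])
print(np.allclose(hs_1_3, c_1, atol=0.02))  # True

# Batch item 2: hs[2, 2] versus the context c[2]
hs_2_2 = np.array([-0.3, -0.95, -0.15, -0.74, -1.75])
c_2 = np.array([-0.29, -0.94, -0.15, -0.74, -1.74])
print(np.allclose(hs_2_2, c_2, atol=0.02))  # True
```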

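We create a simple stand-in for the gradient of the context $\frac{\partial L}{\partial \mathbf{c}_t}$ (which comes from the backward pass of the $t$-th Affine layer) and compute backward propagation.

# Create the input to backward propagation (simplified stand-in)
dc = np.ones((N, H))
print(dc.shape)

# Compute backward propagation
dhs, dh = attention_layer.backward(dc)
print(np.round(dhs, 2))
print(dhs.shape)
print(np.round(dh, 2))
print(dh.shape)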

(3, 5)
[[[ 1.21  0.81  0.36  0.22  1.1 ]
  [ 0.03  0.02  0.01  0.    0.03]
  [-0.04  0.04  0.13  0.15 -0.02]
  [-0.2   0.13  0.51  0.62 -0.11]]

 [[ 0.   -0.   -0.   -0.    0.  ]
  [ 0.   -0.   -0.   -0.    0.  ]
  [ 0.01 -0.01 -0.02 -0.    0.01]
  [ 0.99  1.02  1.03  1.01  0.99]]

 [[-0.    0.   -0.   -0.   -0.  ]
  [-0.06  0.03 -0.01 -0.01 -0.04]
  [ 1.07  0.97  1.01  1.01  1.05]
  [-0.01  0.   -0.   -0.   -0.01]]]
(3, 4, 5)
[[ 0.38  0.2   0.26 -0.23  0.47]
 [-0.    0.04  0.01  0.01  0.  ]
 [ 0.03  0.02 -0.    0.01  0.03]]
(3, 5)

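dhs is passed to the Encoder's Time LSTM layer, and dh is passed to the Decoder's LSTM layer at the same time step.

That completes the implementation of the Attention layer. In the next subsection, we implement an Attention layer that handles time-series data.

References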

斎藤康毅『ゼロから作るDeep Learning 2――自然言語処理編』オライリー・ジャパン,2018年.

Closing remarks

It really is well designed. Fascinating.
