8.1.5.2: Implementing the Time Attention Layer [ゼロつく2 Notes (Implementation)]

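Introduction

These are implementation notes for beginners working through 『ゼロから作るDeep Learning 2』 (自然言語処理編), nicknamed 『ゼロつく2』. Explanations are added where they help, so please read the notes alongside the book.

We go through the book's content one step at a time and build the code slowly.

This article covers Section 8.1.5, "Decoderの改良③" (Improving the Decoder, Part 3). It explains how an Attention layer is extended to handle time-series data and implements it in Python.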


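[Contents of this section]

Introduction
8.1.5.2 Implementing the Time Attention Layer
- ・Checking the processing
  - ・Forward propagation
  - ・Backward propagation
- ・Implementation
References
Closing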

8.1.5.2 Implementing the Time Attention Layer

We combine $T$ Attention layers into a Time Attention layer that handles time-series data.

# Libraries used in Section 8.1.5
import numpy as np

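・Checking the processing

Let's walk through what the Time Attention layer does.

・Forward propagation

We set the values that determine the array shapes, then create stand-ins for the Encoder hidden states $\mathbf{hs}^{(\mathrm{Enc})} = (\mathbf{h}_0^{(\mathrm{Enc})}, \cdots, \mathbf{h}_{T-1}^{(\mathrm{Enc})})$ and the Decoder hidden states $\mathbf{hs}^{(\mathrm{Dec})} = (\mathbf{h}_0^{(\mathrm{Dec})}, \cdots, \mathbf{h}_{T-1}^{(\mathrm{Dec})})$. The formulas write both sequence lengths as $T$; the code keeps them separate as T_enc and T_dec.

# Set the values that determine the array shapes
N = 3 # batch size (number of input sentences)
T_enc = 4 # Encoder sequence length (number of input words)
T_dec = 7 # Decoder sequence length (number of input words)
H = 5 # hidden state size (number of units in the LSTM layer)

# Create stand-in Encoder hidden states (T_enc time steps)
hs_enc = np.random.randn(N, T_enc, H)
print(hs_enc.shape)

# Create stand-in Decoder hidden states (T_dec time steps)
hs_dec = np.random.randn(N, T_dec, H)
print(hs_dec.shape)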

(3, 4, 5)
(3, 7, 5)

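We create Attention layer instances and compute the forward pass (the Attention class from the preceding notes must already be defined). The results are the contexts $\mathbf{cs} = (\mathbf{c}_0, \cdots, \mathbf{c}_{T-1})$.
The forward method takes the $T$ Encoder hidden states hs_enc and the $t$-th Decoder hidden state hs_dec[:, t, :]. Each output is stored in cs and each instance in layers. The Attention weights, which every layer keeps in its instance variable attention_weight, are stored in attention_weights. This is repeated T_dec times.

# Initialize the container for the T Attention layers
layers = []

# Initialize the container for the T Attention weights
attention_weights = []

# Initialize the container for the T contexts
cs = np.empty((N, T_dec, H))

# Time Attention layer processing
for t in range(T_dec):
    # Create the t-th Attention layer
    layer = Attention()

    # Compute the t-th context
    cs[:, t, :] = layer.forward(hs_enc, hs_dec[:, t, :])

    # Store the t-th Attention layer
    layers.append(layer)

    # Store the t-th Attention weights
    attention_weights.append(layer.attention_weight)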
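
The loop above calls Attention(), which is not reproduced in this article. To make the snippets here runnable on their own, the following is a minimal sketch of such a layer, assuming dot-product scores, a softmax over the Encoder time steps, and a weighted sum of the Encoder states. The book builds this from separate AttentionWeight and WeightSum layers; folding them into one class and using np.einsum are simplifications made here for brevity, not the book's code.

```python
import numpy as np

class Attention:
    """Single-step attention (illustrative sketch, not the book's code)."""
    def __init__(self):
        self.params, self.grads = [], []
        self.attention_weight = None
        self.cache = None

    def forward(self, hs, h):
        # hs: (N, T, H) Encoder states, h: (N, H) one Decoder state
        s = np.einsum('nth,nh->nt', hs, h)       # dot-product scores
        s = s - s.max(axis=1, keepdims=True)     # for numerical stability
        a = np.exp(s)
        a /= a.sum(axis=1, keepdims=True)        # softmax over time
        c = np.einsum('nt,nth->nh', a, hs)       # weighted sum (context)
        self.cache = (hs, h, a)
        self.attention_weight = a
        return c

    def backward(self, dc):
        hs, h, a = self.cache
        # Backward of the weighted sum
        dhs = a[:, :, None] * dc[:, None, :]     # (N, T, H)
        da = np.einsum('nth,nh->nt', hs, dc)     # (N, T)
        # Backward of the softmax
        ds = a * (da - np.sum(da * a, axis=1, keepdims=True))
        # Backward of the dot-product scores
        dhs += ds[:, :, None] * h[:, None, :]
        dh = np.einsum('nt,nth->nh', ds, hs)     # (N, H)
        return dhs, dh

# Quick shape check with stand-in data
rng = np.random.default_rng(0)
hs = rng.standard_normal((3, 4, 5))
h = rng.standard_normal((3, 5))
layer = Attention()
c = layer.forward(hs, h)
dhs, dh = layer.backward(np.ones_like(c))
print(c.shape, dhs.shape, dh.shape)  # (3, 5) (3, 4, 5) (3, 5)
```

With a class like this defined before the loop, the forward and backward snippets in this section run as written.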

Let's check $\mathbf{cs}$.

# Check the T contexts
print(np.round(cs[0], 2)) # contexts of sample 0 in the batch (all T_dec time steps)
print(cs.shape)

[[ 0.54 -0.08  1.23  0.17 -0.95]
 [ 0.31 -0.38  0.75  0.5  -1.69]
 [ 0.59  0.48 -0.02  0.43 -0.25]
 [ 0.48 -0.43  1.67  0.09 -1.33]
 [ 0.54 -0.15  0.73  0.15 -0.5 ]
 [ 0.48 -0.19  0.94  0.25 -1.02]
 [ 0.74  1.12  0.12  0.47 -0.08]]
(3, 7, 5)

We check the list layers, which holds the $T$ Attention layers, and the list attention_weights, which holds the $T$ Attention weights $\mathbf{a}_t$.

# Check the lists
print(len(layers))
print(len(attention_weights))

# Check the Attention weights
print(np.round(attention_weights[0], 2))
print(np.sum(attention_weights[0], axis=1))
print(attention_weights[0].shape)

7
7
[[0.15 0.12 0.67 0.06]
 [0.16 0.1  0.15 0.59]
 [0.2  0.06 0.67 0.07]]
[1. 1. 1.]
(3, 4)

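layers holds $T$ Attention instances and attention_weights holds $T$ NumPy arrays (here $T$ is T_dec = 7). Each array has shape (N, T_enc), and each row sums to 1.

That covers the forward pass. Next, we check the backward pass.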

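・Backward propagation

The gradient of the contexts $\frac{\partial L}{\partial \mathbf{cs}} = \Bigl( \frac{\partial L}{\partial \mathbf{c}_0}, \cdots, \frac{\partial L}{\partial \mathbf{c}_{T-1}} \Bigr)$ arrives from the Time Affine layer. Here we create a stand-in for $\frac{\partial L}{\partial \mathbf{cs}}$.

# Create a stand-in for the gradient of the T contexts
dcs = np.ones((N, T_dec, H))
print(dcs.shape)

(3, 7, 5)

We take the layers out of layers one at a time and compute the backward pass. Each layer returns the gradient of the $T$ Encoder hidden states $\frac{\partial L}{\partial \mathbf{hs}^{(\mathrm{Enc})}}$ and the gradient of the $t$-th Decoder hidden state $\frac{\partial L}{\partial \mathbf{h}_t^{(\mathrm{Dec})}}$. We call them dhs and dh.
$\mathbf{hs}^{(\mathrm{Enc})}$ was branched and fed to every Attention layer, so the dhs returned by each layer is added into dhs_enc (the backward pass of a Repeat node). Each element of $\mathbf{hs}^{(\mathrm{Dec})}$ was fed only to the Attention layer of the same time step, so the dh returned by each layer is stored in the matching slot of dhs_dec.

# Initialize the gradient of the Encoder hidden states
# (the scalar 0 is broadcast on the first addition)
dhs_enc = 0

# Initialize the gradient of the Decoder hidden states
dhs_dec = np.empty_like(dcs)

# Time Attention layer processing
for t in range(T_dec):
    # Get the t-th Attention layer
    layer = layers[t]

    # Compute the gradients of the Encoder and Decoder hidden states
    dhs, dh = layer.backward(dcs[:, t, :])

    # Accumulate the gradient of the Encoder hidden states
    dhs_enc += dhs

    # Store the gradient of the t-th Decoder hidden state
    dhs_dec[:, t, :] = dh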

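The layers in layers do not need to be processed in reverse order. Each Attention layer depends only on its own time step, so the order has no effect on the result.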
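
The accumulation in the backward loop can also be written as an equation. As a sketch, write $\mathbf{dhs}_t$ for the dhs returned by the $t$-th Attention layer (a symbol introduced here for illustration only; it is not the book's notation):

$$
\frac{\partial L}{\partial \mathbf{hs}^{(\mathrm{Enc})}}
    = \sum_{t=0}^{T-1} \mathbf{dhs}_t
,\qquad
\frac{\partial L}{\partial \mathbf{hs}^{(\mathrm{Dec})}}
    = \Bigl(
          \frac{\partial L}{\partial \mathbf{h}_0^{(\mathrm{Dec})}},
          \cdots,
          \frac{\partial L}{\partial \mathbf{h}_{T-1}^{(\mathrm{Dec})}}
      \Bigr)
$$

Here $T$ is the Decoder sequence length (T_dec in the code). A sum does not depend on the order of its terms, so looping over $t$ forwards or backwards gives the same dhs_enc.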

Let's check $\frac{\partial L}{\partial \mathbf{hs}^{(\mathrm{Enc})}}$ and $\frac{\partial L}{\partial \mathbf{hs}^{(\mathrm{Dec})}}$.

# Check the gradient of the Encoder hidden states
print(dhs_enc.shape)

# Check the gradient of the Decoder hidden states
print(dhs_dec.shape)

(3, 4, 5)
(3, 7, 5)

That covers the processing of the Time Attention layer.
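
As a cross-check on the forward loop, the same contexts can be computed for all Decoder time steps at once. This is a sketch that assumes the Attention layer uses dot-product scores, a softmax over the Encoder time steps, and a weighted sum. The function name time_attention_forward is hypothetical, and the batched formulation is an alternative written for illustration, not the book's implementation.

```python
import numpy as np

def time_attention_forward(hs_enc, hs_dec):
    """Compute all contexts at once (hypothetical helper, illustration only)."""
    # Scores: dot product of every Decoder state with every Encoder state
    s = np.einsum('nth,nsh->nts', hs_dec, hs_enc)  # (N, T_dec, T_enc)
    # Softmax over the Encoder time axis
    s = s - s.max(axis=2, keepdims=True)
    a = np.exp(s)
    a /= a.sum(axis=2, keepdims=True)
    # Contexts: weighted sums of the Encoder states
    cs = np.einsum('nts,nsh->nth', a, hs_enc)      # (N, T_dec, H)
    return cs, a

# Stand-in data with the same shapes as in this section
rng = np.random.default_rng(0)
hs_enc = rng.standard_normal((3, 4, 5))
hs_dec = rng.standard_normal((3, 7, 5))
cs, a = time_attention_forward(hs_enc, hs_dec)
print(cs.shape, a.shape)  # (3, 7, 5) (3, 7, 4)
```

The slice a[:, t, :] plays the role of the $t$-th entry of attention_weights in the loop version. Keeping one Attention instance per time step, as this section does, is easier to follow and reuses the existing backward method.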

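・Implementation

Now that we have checked the processing, we implement the Time Attention layer as a class.

# Implementation of the Time Attention layer
class TimeAttention:
    # Initialization method
    def __init__(self):
        # Empty lists kept for compatibility with the other layers
        self.params = [] # parameters
        self.grads = []  # gradients

        # Initialize the containers for the layers and the Attention weights
        self.layers = None
        self.attention_weights = None

    # Forward propagation method
    def forward(self, hs_enc, hs_dec):
        # Get the values that determine the array shapes
        N, T, H = hs_dec.shape

        # Initialize the container for the T contexts
        out = np.empty_like(hs_dec)

        # Reset the containers
        self.layers = []
        self.attention_weights = []

        # Process one Attention layer per time step
        for t in range(T):
            # Create the t-th Attention layer
            layer = Attention()

            # Compute the t-th context (weighted sum of the Encoder hidden states)
            out[:, t, :] = layer.forward(hs_enc, hs_dec[:, t, :])

            # Store the t-th layer and its Attention weights
            self.layers.append(layer)
            self.attention_weights.append(layer.attention_weight)

        return out

    # Backward propagation method
    def backward(self, dout):
        # Get the values that determine the array shapes
        N, T, H = dout.shape

        # Initialize the gradients of the Encoder and Decoder hidden states
        dhs_enc = 0
        dhs_dec = np.empty_like(dout)

        # Process one Attention layer per time step
        for t in range(T):
            # Get the t-th Attention layer
            layer = self.layers[t]

            # Compute the gradients of the Encoder and Decoder hidden states
            dhs, dh = layer.backward(dout[:, t, :])

            # Accumulate the gradient of the Encoder hidden states
            dhs_enc += dhs

            # Store the gradient of the t-th Decoder hidden state
            dhs_dec[:, t, :] = dh

        return dhs_enc, dhs_dec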

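Note that the instance variable self.attention_weights, which holds the Attention weights for all $T$ time steps, ends with an "s". The single-step Attention layer uses attention_weight, without the "s".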
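
Because self.attention_weights is a Python list of $T$ arrays, it is often handier as a single array, for example when plotting the weights. A minimal sketch with stand-in weights follows; the name attention_map and the random data are illustrative assumptions, not taken from the book.

```python
import numpy as np

N, T_enc, T_dec = 3, 4, 7
rng = np.random.default_rng(0)

# Stand-in for TimeAttention.attention_weights:
# a list of T_dec arrays, each of shape (N, T_enc) with rows summing to 1
attention_weights = []
for t in range(T_dec):
    w = rng.random((N, T_enc))
    attention_weights.append(w / w.sum(axis=1, keepdims=True))

# Stack along a new time axis to get shape (N, T_dec, T_enc)
attention_map = np.stack(attention_weights, axis=1)
print(attention_map.shape)  # (3, 7, 4)
```

attention_map[n] is then a T_dec x T_enc matrix for sample n, with one row per Decoder time step.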

Let's try out the class.

We create stand-ins for the Encoder hidden states $\mathbf{hs}^{(\mathrm{Enc})}$ and the Decoder hidden states $\mathbf{hs}^{(\mathrm{Dec})}$, then create a Time Attention layer instance.

# Create stand-in Encoder hidden states
hs_enc = np.random.randn(N, T_enc, H)
print(hs_enc.shape)

# Create stand-in Decoder hidden states
hs_dec = np.random.randn(N, T_dec, H)
print(hs_dec.shape)

# Create the instance
time_attention_layer = TimeAttention()

(3, 4, 5)
(3, 7, 5)

We compute the forward pass.

# Compute the T contexts
cs = time_attention_layer.forward(hs_enc, hs_dec)
print(np.round(cs, 2))
print(cs.shape)

[[[-1.47 -0.71 -1.7  -1.03  0.93]
  [-2.17  0.52  0.48  1.49 -0.47]
  [-0.86 -0.63 -0.74 -0.67  0.27]
  [-1.11 -0.22 -0.26 -0.12 -0.6 ]
  [ 0.68 -2.36  1.59  1.36  2.16]
  [-0.17  2.91  1.69  1.5  -0.24]
  [ 0.04  1.51  0.89  0.88 -0.42]]

 [[ 1.39 -0.5   0.47 -0.43  1.2 ]
  [ 0.56  1.04  0.12  0.32  0.17]
  [-1.3   1.04 -1.21 -1.89  0.32]
  [-1.61  1.04 -1.55 -1.92  0.43]
  [ 1.24  1.33 -0.64 -0.1  -1.03]
  [ 0.44  0.85  0.61  1.64 -0.1 ]
  [ 0.52  1.11  0.71  1.87 -0.12]]

 [[ 1.02  0.81  1.08 -0.37 -1.28]
  [ 0.19  0.5  -0.74 -0.31 -0.43]
  [-1.06  0.26 -0.91 -0.45 -0.7 ]
  [ 2.24  1.04 -0.46  0.95  0.96]
  [-0.47  0.09 -1.02  0.27 -0.86]
  [ 0.72  0.97 -0.49 -0.76  0.12]
  [-0.76  0.31 -0.9  -0.19 -0.7 ]]]
(3, 7, 5)

The values depend on the random inputs and differ on every run; the shape is always (N, T_dec, H). This output is passed on to the Time Affine layer.

We check the $T$ Attention weights stored as an instance variable.

# Get the Attention weights
attention_weights = time_attention_layer.attention_weights
print(np.round(attention_weights[0], 2)) # Attention weights at time step 0
print(np.sum(attention_weights[0], axis=1))
print(np.array(attention_weights[0]).shape)

The first print shows the N x T_enc (3 x 4) weight matrix for time step 0. Its values depend on the random inputs, so only the row sums and the shape are listed here.

[1. 1. 1.]
(3, 4)

The sums along axis 1 (over the Encoder time steps) are 1, as expected.

We create a stand-in for the gradient of the $T$ contexts $\frac{\partial L}{\partial \mathbf{cs}}$ (the gradient that the Time Affine layer passes back) and compute the backward pass.

# Create a stand-in for the backward input
dcs = np.random.randn(N, T_dec, H)
print(dcs.shape)

# Compute the backward pass
dhs_enc, dhs_dec = time_attention_layer.backward(dcs)
print(dhs_enc.shape)
print(dhs_dec.shape)

(3, 7, 5)
(3, 4, 5)
(3, 7, 5)

dhs_enc is passed to the Encoder's Time LSTM layer, and dhs_dec to the Decoder's Time LSTM layer.

In Section 8.1 we implemented the Time Attention layer. In Section 8.2 we implement seq2seq with Attention.

References

斎藤康毅『ゼロから作るDeep Learning 2――自然言語処理編』オライリー・ジャパン,2018年.

Closing

That completes the main building block of Chapter 8. Almost there!
