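Introduction

These are beginner-oriented implementation walkthrough notes for 『ゼロから作るDeep Learning 2――自然言語処理編』 (Deep Learning from Scratch 2: Natural Language Processing), "ゼロつく2" for short. Explanations are added where they help, so that the notes can support your study of the book. Please read them alongside the book.

We check the book's content one piece at a time and build up the code slowly.

This article covers Section 8.2.1, "Implementing the Encoder", and Section 8.2.2, "Implementing the Decoder". It explains the processing performed by the Encoder and Decoder used in seq2seq with Attention, and implements them in Python.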

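【Contents of this section】

Introduction
8.2.1 Implementing the Encoder
- ・Implementation
8.2.2 Implementing the Decoder
- ・Checking the processing
- ・Implementation
References
Closing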

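8.2.1 Implementing the Encoder

We implement the Encoder, the input-side RNN of seq2seq with Attention.

# Libraries used in Section 8.2.1
import numpy as np

The Encoder is built from layers used for RNNs. You therefore need to either re-run the class definition of each layer or load the already implemented classes as shown below. The layer classes are implemented in the file "time_layers.py" in the "common" folder. For details on each layer, see the notes for Chapters 5 and 6.

# Settings for loading the implemented classes
import sys
#sys.path.append('C://Users//<user name>//Documents//・・・//deep-learning-from-scratch-2-master')

# Load the implemented layers
from common.time_layers import TimeEmbedding # Section 5.4.2.1
from common.time_layers import TimeLSTM # Section 6.3.1

The path must point to the "deep-learning-from-scratch-2-master" folder so that Python can find the "common" package.

Alternatively, as in the book, you can inherit from the Encoder class implemented in Section 7.3.1.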

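・Implementation

This Encoder is simpler to implement than the one in Section 7.3.1, so we skip the step-by-step check of the processing and implement the Attention Encoder directly as a class. For the details of the processing, see Section 7.3.1 as well.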

# Implementation of the Encoder for Attention
class AttentionEncoder:
    # Initialization method
    def __init__(self, vocab_size, wordvec_size, hidden_size):
        # Get the values that determine the shapes of the variables
        V, D, H = vocab_size, wordvec_size, hidden_size

        # Initialize the parameters
        embed_W = (np.random.randn(V, D) * 0.01).astype('f')
        lstm_Wx = (np.random.randn(D, 4 * H) / np.sqrt(D)).astype('f')
        lstm_Wh = (np.random.randn(H, 4 * H) / np.sqrt(H)).astype('f')
        lstm_b = np.zeros(4 * H).astype('f')

        # Create the layer instances
        self.embed = TimeEmbedding(embed_W)
        self.lstm = TimeLSTM(lstm_Wx, lstm_Wh, lstm_b, stateful=False)

        # Store the parameters and gradients in lists
        self.params = self.embed.params + self.lstm.params # parameters
        self.grads = self.embed.grads + self.lstm.grads    # gradients

        # Initialize the intermediate variable of the LSTM layer
        self.hs = None

    # Forward propagation method
    def forward(self, xs):
        # Compute forward propagation of each layer
        xs = self.embed.forward(xs)
        hs = self.lstm.forward(xs)
        return hs

    # Backpropagation method
    def backward(self, dhs):
        # Compute backpropagation of each layer in reverse order
        dout = self.lstm.backward(dhs)
        dout = self.embed.backward(dout)
        return dout

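The Encoder implemented in Chapter 7 output "the hidden state at the last time step $\mathbf{h}_{T-1}$ (h)" in forward propagation and received "the gradient of the hidden state at the last time step $\frac{\partial L}{\partial \mathbf{h}_{T-1}}$ (dh)" in backpropagation.
The Encoder for seq2seq with Attention passes all hidden states and their gradients through unchanged. In forward propagation, the output hs of the Time LSTM layer becomes the output of the Encoder; in backpropagation, the gradient passed to the Encoder is handed directly to the Time LSTM layer. This is why the implementation is simpler.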
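
The difference between the two Encoders can be sketched with plain NumPy arrays. This is only an illustration: `hs` stands in for a Time LSTM output of shape (N, T, H), and the values and the names `h_last`, `dh_last`, `dhs_ch7`, and `dhs_attention` are made up for this example rather than taken from the book's code.

```python
import numpy as np

N, T, H = 3, 4, 5  # batch size, time steps, hidden size (illustrative values)

# Stand-in for the Time LSTM output (all hidden states)
hs = np.arange(N * T * H, dtype=float).reshape(N, T, H)

# Chapter 7 style forward: only the last hidden state leaves the Encoder
h_last = hs[:, -1, :]
print(h_last.shape)  # (3, 5)

# Chapter 7 style backward: the incoming gradient covers only the last time step,
# so it is placed into an otherwise all-zero array before reaching the Time LSTM layer
dh_last = np.ones((N, H))
dhs_ch7 = np.zeros_like(hs)
dhs_ch7[:, -1, :] = dh_last
print(dhs_ch7.shape)  # (3, 4, 5)

# Attention style: hs itself is the output, and a gradient of the same shape comes back
dhs_attention = np.ones_like(hs)
print(dhs_attention.shape)  # (3, 4, 5)
```

With Attention, neither the slicing in the forward pass nor the zero-padding in the backward pass is needed.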

Let's try out the class we implemented.

We specify the values that determine the shapes of the data and parameters, and create an Encoder instance.

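# Specify the values that determine the shapes of the data and parameters
N = 3 # Batch size (number of input sentences)
T_enc = 4 # Time size of the Encoder (number of input words)
V = 12 # Vocabulary size (number of distinct words)
D = 6 # Dimensionality of the word vectors (number of neurons in the Embed layer)
H = 5 # Size of the hidden state (number of neurons in the LSTM layer)

# Create an Encoder instance
encoder = AttentionEncoder(V, D, H)

We create simplified input data (sentences) for the Encoder and compute forward propagation.

# Create (simplified) input data for the Encoder
xs = np.random.randint(low=0, high=V, size=(N, T_enc))
print(xs)
print(xs.shape)

# Compute the hidden states of the Encoder
hs_enc = encoder.forward(xs)
print(np.round(hs_enc, 3))
print(hs_enc.shape)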

[[ 9 10  4  4]
 [ 2 11  4 11]
 [11 11  9  9]]
(3, 4)
[[[-0.004  0.002  0.001 -0.     0.001]
  [ 0.002  0.001  0.003 -0.     0.001]
  [ 0.006  0.002  0.001  0.001 -0.001]
  [ 0.006  0.002 -0.     0.001 -0.001]]

 [[ 0.005  0.002 -0.002 -0.002 -0.003]
  [ 0.002  0.002 -0.002 -0.002 -0.002]
  [ 0.003  0.002 -0.002 -0.001 -0.003]
  [ 0.001  0.002 -0.002 -0.002 -0.002]]

 [[ 0.     0.001 -0.001 -0.    -0.001]
  [-0.     0.001 -0.002 -0.    -0.001]
  [-0.005  0.002  0.    -0.001  0.001]
  [-0.006  0.002  0.001 -0.001  0.002]]]
(3, 4, 5)

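Each sentence of $T$ words (here T_enc = 4) has been encoded into hs_enc, which holds one hidden state of size $H$ per word. This is passed to the Decoder.

We also create a simplified version of the gradient of the Encoder hidden states $\frac{\partial L}{\partial \mathbf{hs}}$ (the output of the Decoder's Time Attention layer in backpropagation) and compute backpropagation.

# Create a (simplified) gradient of the Encoder hidden states
dhs = np.random.randn(N, T_enc, H)
print(dhs.shape)

# Compute backpropagation
dout = encoder.backward(dhs)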
print(dout)

(3, 4, 5)
None

The gradients of each layer's parameters are stored inside the instance. With stochastic gradient descent, each parameter is then updated using its gradient.
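
As a rough sketch of that update step, the following applies plain SGD to lists of arrays shaped like `params` and `grads`. The learning rate `lr` and the array values are illustrative assumptions, and the lists here are stand-ins rather than the actual contents of `encoder.params` and `encoder.grads`.

```python
import numpy as np

lr = 0.1  # learning rate (illustrative value)

# Stand-ins for encoder.params and encoder.grads: lists of arrays with matching shapes
params = [np.ones((2, 3)), np.zeros(3)]
grads = [np.full((2, 3), 0.5), np.ones(3)]

# SGD: move each parameter against its gradient, updating the arrays in place
for param, grad in zip(params, grads):
    param -= lr * grad

print(params[0])  # every element is 1 - 0.1 * 0.5 = 0.95
print(params[1])  # every element is 0 - 0.1 * 1 = -0.1
```

Updating in place matters because the layers and the model share the same parameter arrays through these lists.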

This completes the Encoder used in seq2seq with Attention. In the next subsection, we implement the Decoder.

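8.2.2 Implementing the Decoder

We implement the Decoder, the output-side RNN of seq2seq with Attention.

# Libraries used in Section 8.2.2
import numpy as np

The Decoder is built from layers used for RNNs. You therefore need to either re-run the class definition of each layer or load the already implemented classes as shown below. Most layer classes are implemented in the file "time_layers.py" in the "common" folder, and the Time Attention layer is in "attention_layer.py" in the "ch08" folder. For details on each layer, see the notes for Chapters 5 and 6 and for Section 8.1.

# Settings for loading the implemented classes
import sys
#sys.path.append('C://Users//<user name>//Documents//・・・//deep-learning-from-scratch-2-master')

# Load the implemented layers
from common.time_layers import TimeEmbedding # Section 5.4.2.1
from common.time_layers import TimeLSTM # Section 6.3.1
from ch08.attention_layer import TimeAttention # Section 8.1.5
from common.time_layers import TimeAffine # Section 5.4.2.2

The path must point to the "deep-learning-from-scratch-2-master" folder.

Alternatively, as in the book, you can inherit from the Decoder class implemented in Section 7.3.2.

・Checking the processing

Using Figure 8-21 as a guide, we check the processing performed by the Decoder. The basic processing is shared with the earlier Decoder, so see Section 7.3.2 as well.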

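・Setting up the network

First, we build the RNN.

We set the values that determine the shapes of the data and parameters, and create simplified versions of "the Decoder input data $\mathbf{xs} = (x_{0,0}, \cdots, x_{N-1,T-1})$" and "the Encoder hidden states $\mathbf{hs}^{(\mathrm{Enc})} = (h_{0,0,0}, \cdots, h_{N-1,T-1,H-1})$". The $T$ of $\mathbf{xs}$ is the time size of the Decoder, while the $T$ of $\mathbf{hs}^{(\mathrm{Enc})}$ is the time size of the Encoder.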

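# Specify the values that determine the shapes of the data and parameters
N = 3 # Batch size
T_enc = 4 # Time size of the Encoder (number of input words)
T_dec = 7 # Time size of the Decoder (number of input words)
V = 12 # Vocabulary size (number of distinct words)
D = 6 # Size of the word vectors (Embed layer)
H = 5 # Size of the hidden state (LSTM layer)

# Create (simplified) input data for the Decoder
xs = np.random.randint(low=0, high=V, size=(N, T_dec))
print(xs)
print(xs.shape)

# Create (simplified) hidden states of the Encoder
hs_enc = np.random.randn(N, T_enc, H)
print(hs_enc.shape)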

[[ 5  7  6  8  9  7  2]
 [ 2  5 11  7  6  6  4]
 [ 5  3  7  1  6  4 11]]
(3, 7)
(3, 4, 5)

Each element of xs is a word ID.

We randomly generate the initial values of the weights of each layer; the biases are initialized to zero.

# Initialize the parameters of the Time Embed layer
embed_W = (np.random.randn(V, D) * 0.01)

# Initialize the parameters of the Time LSTM layer
lstm_Wx = (np.random.randn(D, 4 * H) / np.sqrt(D))
lstm_Wh = (np.random.randn(H, 4 * H) / np.sqrt(H))
lstm_b = np.zeros(4 * H)

# Initialize the parameters of the Time Affine layer
# (scaled by the input size 2 * H, matching the class implementation below)
affine_W = (np.random.randn(2 * H, V) / np.sqrt(2 * H))
affine_b = np.zeros(V)

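Because the context and the hidden state are concatenated before being fed to the Affine layer, the weight of the Affine layer has 2 * H rows (the book apparently forgot to shade this part in gray).
For the shape of each parameter, see the article on each layer; for how the initial values are set, see Section 6.2 of Volume 1.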
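
The row count can be checked with a small shape experiment. This is a sketch under the assumption that the Affine step is an ordinary matrix product over the last axis; `cs` and `hs_dec` are random stand-ins, and `mismatch_detected` is a name introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T_dec, V, H = 3, 7, 12, 5

# Stand-ins for the context and the Decoder hidden states
cs = rng.standard_normal((N, T_dec, H))
hs_dec = rng.standard_normal((N, T_dec, H))
x = np.concatenate((cs, hs_dec), axis=2)  # shape (N, T_dec, 2H)

# The number of rows must equal the size of the last axis of x, that is 2H
affine_W = rng.standard_normal((2 * H, V)) / np.sqrt(2 * H)
score = x @ affine_W
print(score.shape)  # (3, 7, 12)

# A weight with only H rows cannot be multiplied with the concatenated input
mismatch_detected = False
try:
    x @ rng.standard_normal((H, V))
except ValueError:
    mismatch_detected = True
print(mismatch_detected)  # True
```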

We pass the parameters we created to each layer class to create the layer instances.

# Create the Time Embed layer instance
embed_layer = TimeEmbedding(embed_W)

# Create the Time LSTM layer instance
lstm_layer = TimeLSTM(lstm_Wx, lstm_Wh, lstm_b, stateful=True)

# Create the Time Attention layer instance
attention_layer = TimeAttention()

# Create the Time Affine layer instance
affine_layer = TimeAffine(affine_W, affine_b)

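The Attention layer has no parameters.

The last hidden state of the Encoder $\mathbf{h}_{T-1}^{(\mathrm{Enc})} = (h_{0,0}^{(T-1)}, \cdots, h_{N-1,H-1}^{(T-1)})$ is input to the first LSTM layer of the Decoder. We take the element at index $T-1$ along the time axis of $\mathbf{hs}^{(\mathrm{Enc})}$ and pass it to the set_state() method of the TimeLSTM class. (This $T$ is the time size on the Encoder side.)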

# Get the (T-1)-th hidden state of the Encoder
h = hs_enc[:, -1]
print(h.shape)

# Set the Encoder hidden state as the initial state of the LSTM layer
lstm_layer.set_state(h)

(3, 5)

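The RNN is now built. Next, we check the forward propagation.

・Computing forward propagation

We compute forward propagation of each layer to obtain "the Decoder hidden states $\mathbf{hs}^{(\mathrm{Dec})} = (h_{0,0,0}, \cdots, h_{N-1,T-1,H-1})$" and "the context $\mathbf{cs} = (c_{0,0,0}, \cdots, c_{N-1,T-1,H-1})$". Here $T$ is the time size of the Decoder.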

# Compute the word vectors
out = embed_layer.forward(xs)
print(out.shape)

# Compute the hidden states of the Decoder
hs_dec = lstm_layer.forward(out)
print(np.round(hs_dec[:, 0, :], 2))
print(hs_dec.shape)

# Compute the context
cs = attention_layer.forward(hs_enc, hs_dec)
print(np.round(cs[:, 0, :], 2))
print(cs.shape)

(3, 7, 6)
[[ 0.1   0.03  0.3   0.05 -0.1 ]
 [ 0.01 -0.07 -0.11 -0.06  0.11]
 [ 0.11 -0.14  0.31 -0.04 -0.2 ]]
(3, 7, 5)
[[-0.72  0.27  0.61 -0.75 -0.18]
 [ 0.09 -0.09  0.25  0.17  0.63]
 [ 0.18 -0.54  0.31  0.49  0.8 ]]
(3, 7, 5)
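
To see why the context has the same shape as the Decoder hidden states, here is a minimal NumPy sketch of dot-product attention in the spirit of Section 8.1. It is not the code of `ch08/attention_layer.py`; the function names `softmax` and `attention_forward_sketch` and the use of `np.einsum` are assumptions made for this illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_forward_sketch(hs_enc, hs_dec):
    # scores[n, t, s]: dot product of hs_dec[n, t] and hs_enc[n, s]
    scores = np.einsum('nth,nsh->nts', hs_dec, hs_enc)
    # Attention weights: normalized over the Encoder time steps
    a = softmax(scores, axis=2)
    # Context: weighted sum of the Encoder hidden states
    cs = np.einsum('nts,nsh->nth', a, hs_enc)
    return cs, a

rng = np.random.default_rng(0)
N, T_enc, T_dec, H = 3, 4, 7, 5
hs_enc = rng.standard_normal((N, T_enc, H))
hs_dec = rng.standard_normal((N, T_dec, H))

cs, a = attention_forward_sketch(hs_enc, hs_dec)
print(cs.shape)  # (3, 7, 5)
print(a.shape)   # (3, 7, 4)
```

Each Decoder time step gets its own set of T_enc weights, so the Encoder time axis is summed out and the result is (N, T_dec, H).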

We concatenate $\mathbf{cs}$ and $\mathbf{hs}^{(\mathrm{Dec})}$ along axis 2 (counting from 0).

# Concatenate the context and the Decoder hidden states
out = np.concatenate((cs, hs_dec), axis=2)
print(np.round(out[:, 0, :], 2))
print(out.shape)

[[-0.72  0.27  0.61 -0.75 -0.18  0.1   0.03  0.3   0.05 -0.1 ]
 [ 0.09 -0.09  0.25  0.17  0.63  0.01 -0.07 -0.11 -0.06  0.11]
 [ 0.18 -0.54  0.31  0.49  0.8   0.11 -0.14  0.31 -0.04 -0.2 ]]
(3, 7, 10)

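Taking the elements at time $t$ from the concatenated 3-dimensional array of shape $(N \times T \times 2 H)$ gives the following matrix, with the context in the first $H$ columns and the hidden state in the last $H$ columns:

$$ \begin{pmatrix} c_{0,t,0} & \cdots & c_{0,t,H-1} & h_{0,t,0} & \cdots & h_{0,t,H-1} \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ c_{N-1,t,0} & \cdots & c_{N-1,t,H-1} & h_{N-1,t,0} & \cdots & h_{N-1,t,H-1} \end{pmatrix} $$
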

We input the concatenated array to the Time Affine layer and compute the scores $\mathbf{ss} = (s_{0,0,0}, \cdots, s_{N-1,T-1,V-1})$.

# Compute the scores
score = affine_layer.forward(out)
print(np.round(score[:, 0, :], 2))
print(score.shape)

[[ 0.66 -0.26 -0.5  -0.1  -0.31  0.24  0.11  0.08 -0.52 -0.02 -0.53  0.92]
 [-0.29 -0.02  0.14  0.33  0.34  0.24  0.16  0.44  0.05  0.3   0.07  0.25]
 [-0.09  0.4   0.09  0.35  0.47  0.37  0.25  0.81 -0.35  0.12  0.09 -0.05]]
(3, 7, 12)
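
The shape change from (N, T, 2H) to (N, T, V) can be sketched as follows. This assumes that a time-wise affine layer merges the batch and time axes, applies one matrix product, and restores the shape; `time_affine_forward_sketch` is a name made up for this example and is not the TimeAffine class itself.

```python
import numpy as np

def time_affine_forward_sketch(x, W, b):
    N, T, D_in = x.shape
    rx = x.reshape(N * T, D_in)   # merge the batch and time axes
    out = rx @ W + b              # ordinary affine transformation
    return out.reshape(N, T, -1)  # restore the batch and time axes

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 7, 10))  # stand-in for the concatenated input (2H = 10)
W = rng.standard_normal((10, 12))    # V = 12 output scores per time step
b = np.zeros(12)

score = time_affine_forward_sketch(x, W, b)
print(score.shape)  # (3, 7, 12)
```

The same weight is shared by every sample and every time step, which is why a single reshape is enough.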

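$\mathbf{ss}$ is then input to the Time Softmax with Loss layer.

That covers forward propagation. Next, we check backpropagation.

・Computing backpropagation

The gradient of the scores $\frac{\partial L}{\partial \mathbf{ss}} = \Bigl( \frac{\partial L}{\partial s_{0,0,0}}, \cdots, \frac{\partial L}{\partial s_{N-1,T-1,V-1}} \Bigr)$ is passed from the Time Softmax with Loss layer to the Time Affine layer. Here we create a simplified version of it and compute backpropagation.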

# Create a (simplified) gradient of the scores
dscore = np.random.randn(N, T_dec, V)
print(dscore.shape)

# Compute backpropagation of the Time Affine layer
dout = affine_layer.backward(dscore)
print(np.round(dout[:, 0, :], 2))
print(dout.shape)

(3, 7, 12)
[[-1.25 -0.49  0.88 -0.62  2.4   3.04  0.13  0.9  -0.49  3.18]
 [-1.33  0.92  1.51 -0.55 -1.09 -0.99 -0.42 -0.37  0.26 -1.72]
 [ 0.34  0.81  1.84 -0.21 -2.49  2.1  -3.21  0.8   1.59 -2.1 ]]
(3, 7, 10)

The output is a 3-dimensional array of shape $(N \times T \times 2 H)$ in which the gradient of the context and the gradient of the Decoder hidden states are concatenated. Taking the elements at time $t$ gives the following matrix:

$$ \begin{pmatrix} \frac{\partial L}{\partial c_{0,t,0}} & \cdots & \frac{\partial L}{\partial c_{0,t,H-1}} & \frac{\partial L}{\partial h_{0,t,0}} & \cdots & \frac{\partial L}{\partial h_{0,t,H-1}} \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ \frac{\partial L}{\partial c_{N-1,t,0}} & \cdots & \frac{\partial L}{\partial c_{N-1,t,H-1}} & \frac{\partial L}{\partial h_{N-1,t,0}} & \cdots & \frac{\partial L}{\partial h_{N-1,t,H-1}} \end{pmatrix} $$

We split it into the gradient of the context "$\frac{\partial L}{\partial \mathbf{cs}} = \Bigl( \frac{\partial L}{\partial c_{0,0,0}}, \cdots, \frac{\partial L}{\partial c_{N-1,T-1,H-1}} \Bigr)$" and the gradient of the Decoder hidden states "$\frac{\partial L}{\partial \mathbf{hs}^{(\mathrm{Dec})}} = \Bigl( \frac{\partial L}{\partial h_{0,0,0}}, \cdots, \frac{\partial L}{\partial h_{N-1,T-1,H-1}} \Bigr)$".

# Split into the gradient of the context and the gradient of the hidden states
dcs, dhs_dec0 = dout[:, :, :H], dout[:, :, H:]
print(dcs.shape)
print(dhs_dec0.shape)

(3, 7, 5)
(3, 7, 5)
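
The split is simply the inverse of the concatenation in forward propagation, so the slice order has to match the concatenation order. A small sketch with made-up constant arrays (`dcs_true` and `dhs_true` are hypothetical names used only to make the two halves distinguishable):

```python
import numpy as np

N, T_dec, H = 3, 7, 5

# Distinguishable stand-ins: 1.0 for the context part, 2.0 for the hidden-state part
dcs_true = np.full((N, T_dec, H), 1.0)
dhs_true = np.full((N, T_dec, H), 2.0)

# Same order as in forward propagation: context first, hidden state second
dout = np.concatenate((dcs_true, dhs_true), axis=2)

# Slicing the last axis at H recovers both parts
dcs, dhs_dec0 = dout[:, :, :H], dout[:, :, H:]
print(np.array_equal(dcs, dcs_true))       # True
print(np.array_equal(dhs_dec0, dhs_true))  # True
```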

We input $\frac{\partial L}{\partial \mathbf{cs}}$ to the Time Attention layer and compute the gradient of the Encoder hidden states "$\frac{\partial L}{\partial \mathbf{hs}^{(\mathrm{Enc})}} = \Bigl( \frac{\partial L}{\partial h_{0,0,0}}, \cdots, \frac{\partial L}{\partial h_{N-1,T-1,H-1}} \Bigr)$" and the gradient of the Decoder hidden states "$\frac{\partial L}{\partial \mathbf{hs}^{(\mathrm{Dec})}}$".

# Compute backpropagation of the Time Attention layer
dhs_enc, dhs_dec1 = attention_layer.backward(dcs)
print(dhs_enc.shape)
print(dhs_dec1.shape)

(3, 4, 5)
(3, 7, 5)

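In forward propagation, $\mathbf{hs}^{(\mathrm{Dec})}$ branched and was input to both the Time Attention layer and the Time Affine layer. We therefore take the sum of the two gradients $\frac{\partial L}{\partial \mathbf{hs}^{(\mathrm{Dec})}}$ obtained from these layers.

# Sum the gradients of the branched Decoder hidden states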
dhs_dec = dhs_dec0 + dhs_dec1
print(dhs_dec.shape)

(3, 7, 5)
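
Why a branch turns into a sum in backpropagation can be checked numerically on a toy function. In this sketch, `loss` is a made-up function in which `h` feeds two branches; it is not part of the Decoder.

```python
import numpy as np

def loss(h, a):
    # h is used by two branches: a weighted sum and a sum of squares
    return np.sum(a * h) + np.sum(h ** 2)

rng = np.random.default_rng(0)
h = rng.standard_normal(5)
a = rng.standard_normal(5)

# Analytic gradient of each branch with respect to h
grad_branch0 = a
grad_branch1 = 2 * h
grad_total = grad_branch0 + grad_branch1

# Numerical gradient by central differences
eps = 1e-6
grad_num = np.zeros_like(h)
for i in range(h.size):
    d = np.zeros_like(h)
    d[i] = eps
    grad_num[i] = (loss(h + d, a) - loss(h - d, a)) / (2 * eps)

print(np.allclose(grad_total, grad_num, atol=1e-5))  # True
```

Neither branch gradient alone matches the numerical gradient; only their sum does.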

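We input $\frac{\partial L}{\partial \mathbf{hs}^{(\mathrm{Dec})}}$ to the Time LSTM layer and compute the gradient of the word vectors $\frac{\partial L}{\partial \mathbf{xs}} = \Bigl( \frac{\partial L}{\partial x_{0,0,0}}, \cdots, \frac{\partial L}{\partial x_{N-1,T-1,D-1}} \Bigr)$.
During this step, the gradient of the last hidden state of the Encoder $\frac{\partial L}{\partial \mathbf{h}_{T-1}^{(\mathrm{Enc})}}$ is also computed and stored in the instance variable dh. This is the gradient of the hidden state that was passed from the last LSTM layer of the Encoder to the first LSTM layer of the Decoder. Since this is again the backpropagation of a branch node, we add it to the element at index T-1 along the time axis (axis 1) of dhs_enc.

# Compute backpropagation of the Time LSTM layer
dout = lstm_layer.backward(dhs_dec)
print(dout.shape)

# Add the gradient of the branched (T-1)-th hidden state of the Encoder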
dh = lstm_layer.dh
dhs_enc[:, -1] += dh

(3, 7, 6)
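
The effect of the in-place addition on the last time step can be seen in isolation. In this sketch, `dhs_enc` is a zero array standing in for the gradient returned by the Time Attention layer, and `dh` is a constant stand-in for `lstm_layer.dh`; both are assumptions for illustration.

```python
import numpy as np

N, T_enc, H = 3, 4, 5

dhs_enc = np.zeros((N, T_enc, H))  # stand-in for the gradient from the Attention path
dh = np.ones((N, H))               # stand-in for the gradient from the initial-state path

# Only the last Encoder time step receives the additional gradient
dhs_enc[:, -1] += dh

per_step_total = dhs_enc.sum(axis=(0, 2))
print(per_step_total)  # zero for the first three time steps, N * H for the last one
```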

$\frac{\partial L}{\partial \mathbf{hs}^{(\mathrm{Enc})}}$ is passed on to the Encoder.

We input $\frac{\partial L}{\partial \mathbf{xs}}$ to the Time Embed layer and compute backpropagation.

# Compute backpropagation of the Time Embed layer
dout = embed_layer.backward(dout)
print(dout)

None

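The Time Embed layer, which comes last in backpropagation, returns None. The gradients of the parameters are stored inside each layer instance, and each parameter is updated by stochastic gradient descent.

That covers the processing performed by the Decoder. For the sentence generation method, see Section 7.3.2.

・Implementation

Now that we have checked the processing, we implement the Decoder with Attention as a class.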

# Implementation of the Decoder with Attention
class AttentionDecoder:
    # Initialization method
    def __init__(self, vocab_size, wordvec_size, hidden_size):
        # Get the values that determine the shapes of the variables
        V, D, H = vocab_size, wordvec_size, hidden_size

        # Initialize the parameters
        embed_W = (np.random.randn(V, D) * 0.01).astype('f')
        lstm_Wx = (np.random.randn(D, 4 * H) / np.sqrt(D)).astype('f')
        lstm_Wh = (np.random.randn(H, 4 * H) / np.sqrt(H)).astype('f')
        lstm_b = np.zeros(4 * H).astype('f')
        affine_W = (np.random.randn(2 * H, V) / np.sqrt(2 * H)).astype('f')
        affine_b = np.zeros(V).astype('f')

        # Create the layer instances
        self.embed = TimeEmbedding(embed_W)
        self.lstm = TimeLSTM(lstm_Wx, lstm_Wh, lstm_b, stateful=True)
        self.attention = TimeAttention()
        self.affine = TimeAffine(affine_W, affine_b)

        # Store the layers in a list
        layers = [self.embed, self.lstm, self.attention, self.affine]

        # Store the parameters and gradients in lists
        self.params = [] # parameters
        self.grads = []  # gradients
        for layer in layers:
            self.params += layer.params
            self.grads += layer.grads

    # Forward propagation method
    def forward(self, xs, enc_hs):
        # Input the (T-1)-th hidden state of the Encoder to the 0th LSTM layer
        h = enc_hs[:, -1]
        self.lstm.set_state(h)

        # Compute forward propagation of each layer
        out = self.embed.forward(xs)               # word vectors
        dec_hs = self.lstm.forward(out)            # hidden states
        c = self.attention.forward(enc_hs, dec_hs) # context
        out = np.concatenate((c, dec_hs), axis=2)  # concatenate context and hidden states
        score = self.affine.forward(out)           # scores
        return score

    # Backpropagation method
    def backward(self, dscore):
        # Compute backpropagation of the Time Affine layer
        dout = self.affine.backward(dscore)

        # Get the values that determine the shapes of the variables
        N, T, H2 = dout.shape
        H = H2 // 2

        # Split into the gradient of the context and the gradient of the hidden states
        dc, ddec_hs0 = dout[:, :, :H], dout[:, :, H:]

        # Compute backpropagation of the Time Attention layer
        denc_hs, ddec_hs1 = self.attention.backward(dc)

        # Sum the gradients of the branched Decoder hidden states
        ddec_hs = ddec_hs0 + ddec_hs1

        # Compute backpropagation of the Time LSTM layer
        dout = self.lstm.backward(ddec_hs) # gradient of the word vectors

        # Add the gradient of the branched (T-1)-th hidden state of the Encoder
        dh = self.lstm.dh
        denc_hs[:, -1] += dh

        # Compute backpropagation of the Time Embed layer
        self.embed.backward(dout) # the output is None

        return denc_hs

    # Sentence generation method
    def generate(self, enc_hs, start_id, sample_size):
        # Initialize the list that collects the sampled word IDs
        sampled = []

        # Use the ID of the delimiter character as the first input
        sample_id = start_id

        # Input the last hidden state of the Encoder to the first LSTM layer of the Decoder
        h = enc_hs[:, -1, :]
        self.lstm.set_state(h)

        # Generate the sentence
        for _ in range(sample_size):
            # Convert to a 2-dimensional array for input
            x = np.array(sample_id).reshape((1, 1))

            # Compute the scores
            out = self.embed.forward(x)                # word vectors
            dec_hs = self.lstm.forward(out)            # hidden states
            c = self.attention.forward(enc_hs, dec_hs) # context
            out = np.concatenate((c, dec_hs), axis=2)  # concatenate context and hidden states
            score = self.affine.forward(out)           # scores

            # Get the word ID with the highest score
            sample_id = np.argmax(score.flatten()) # update the input word
            sampled.append(int(sample_id)) # store the sampled word

        return sampled
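
The loop in generate() picks the highest-scoring word at each step and feeds it back as the next input. The following sketch isolates that pattern with a toy model: `score_table`, `toy_step`, and `greedy_generate` are hypothetical names, and the fixed table replaces the Embed, LSTM, Attention, and Affine computations.

```python
import numpy as np

# Hypothetical stand-in for one Decoder step:
# row i holds the scores of the next word given the current word ID i
score_table = np.array([
    [0.1, 0.9, 0.0, 0.0, 0.0],  # after ID 0, ID 1 scores highest
    [0.0, 0.1, 0.8, 0.1, 0.0],  # after ID 1, ID 2 scores highest
    [0.0, 0.0, 0.1, 0.2, 0.7],  # after ID 2, ID 4 scores highest
    [0.6, 0.1, 0.1, 0.1, 0.1],  # after ID 3, ID 0 scores highest
    [0.1, 0.1, 0.1, 0.6, 0.1],  # after ID 4, ID 3 scores highest
])

def toy_step(word_id):
    # Return the scores over the vocabulary for the next word
    return score_table[word_id]

def greedy_generate(start_id, sample_size):
    sampled = []
    sample_id = start_id
    for _ in range(sample_size):
        score = toy_step(sample_id)
        sample_id = int(np.argmax(score))  # the chosen word becomes the next input
        sampled.append(sample_id)
    return sampled

print(greedy_generate(start_id=0, sample_size=4))  # [1, 2, 4, 3]
```

Because the choice is deterministic, the same start ID always yields the same sequence; the real Decoder additionally carries the LSTM state from step to step.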

Let's try out the class we implemented.

We create simplified versions of "the Decoder input data (sentences) $\mathbf{xs}$" and "the input from the Encoder $\mathbf{hs}^{(\mathrm{Enc})}$", together with a Decoder instance.

# Create (simplified) input data for the Decoder
xs = np.random.randint(low=0, high=V, size=(N, T_dec))
print(xs)
print(xs.shape)

# Create (simplified) hidden states of the Encoder
hs_enc = np.random.randn(N, T_enc, H)
print(hs_enc.shape)

# Create a Decoder instance
decoder = AttentionDecoder(V, D, H)

[[ 0  2  0  0  5  2  2]
 [ 6  3  3  3  0  1  9]
 [ 1  5  1 11  7  3  3]]
(3, 7)
(3, 4, 5)

We compute the scores $\mathbf{ss}$.

# Compute the scores
score = decoder.forward(xs, hs_enc)
print(score.shape)

(3, 7, 12)

The scores are input to the Time Softmax with Loss layer, where they are normalized into probabilities and the loss is computed.
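
As a rough picture of that step, the sketch below normalizes the scores with softmax, averages the cross-entropy loss over all N * T positions, and returns the gradient with respect to the scores. It assumes that no position is masked, and `time_softmax_with_loss_sketch` and the target array `ts` are names and data invented for this example, not the book's TimeSoftmaxWithLoss class.

```python
import numpy as np

def time_softmax_with_loss_sketch(score, ts):
    N, T, V = score.shape
    x = score.reshape(N * T, V)
    t = ts.reshape(N * T)

    # Softmax over the vocabulary axis
    x = x - x.max(axis=1, keepdims=True)
    y = np.exp(x) / np.exp(x).sum(axis=1, keepdims=True)

    # Cross-entropy loss averaged over all positions
    loss = -np.mean(np.log(y[np.arange(N * T), t]))

    # Gradient of the averaged loss with respect to the scores
    dscore = y.copy()
    dscore[np.arange(N * T), t] -= 1
    dscore /= N * T
    return loss, dscore.reshape(N, T, V)

N, T_dec, V = 3, 7, 12
score = np.zeros((N, T_dec, V))       # uniform scores
ts = np.zeros((N, T_dec), dtype=int)  # hypothetical target word IDs

loss, dscore = time_softmax_with_loss_sketch(score, ts)
print(loss)          # log(12), since every word gets probability 1/12
print(dscore.shape)  # (3, 7, 12)
```

The returned gradient has the same shape as the scores, which is what decoder.backward() expects as its input.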

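We create a simplified version of the gradient of the scores $\frac{\partial L}{\partial \mathbf{ss}}$ (the output of the Time Softmax with Loss layer in backpropagation) and compute the gradient of the Encoder hidden states $\frac{\partial L}{\partial \mathbf{hs}^{(\mathrm{Enc})}}$.

# Create a (simplified) gradient of the scores
dscore = np.random.randn(N, T_dec, V)
print(dscore.shape)

# Compute the gradient of the Encoder hidden states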
dhs_enc = decoder.backward(dscore)
print(dhs_enc.shape)

(3, 7, 12)
(3, 4, 5)

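$\frac{\partial L}{\partial \mathbf{hs}^{(\mathrm{Enc})}}$ is passed on to the Encoder. The gradients of each layer's parameters are stored inside the instance, and each parameter is updated with its gradient by stochastic gradient descent.

We have now implemented the Encoder and Decoder for seq2seq with Attention. In the next subsection, we implement seq2seq with Attention itself.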

References

Koki Saito (斎藤康毅), 『ゼロから作るDeep Learning ❷ ―自然言語処理編』, O'Reilly Japan.

Closing

One more to go! Just one more!


8.2.1-2: Implementing the Encoder and Decoder 【ゼロつく2 notes (implementation)】