Listen, Attend and Spell

ListenAttendSpell

class kospeech.models.las.model.ListenAttendSpell(input_dim: int, num_classes: int, encoder_hidden_state_dim: int = 512, decoder_hidden_state_dim: int = 1024, num_encoder_layers: int = 3, num_decoder_layers: int = 2, bidirectional: bool = True, extractor: str = 'vgg', activation: str = 'hardtanh', rnn_type: str = 'lstm', max_length: int = 400, pad_id: int = 0, sos_id: int = 1, eos_id: int = 2, attn_mechanism: str = 'multi-head', num_heads: int = 4, encoder_dropout_p: int = 0.2, decoder_dropout_p: int = 0.2, joint_ctc_attention: bool = False)

Listen, Attend and Spell model with configurable encoder and decoder.

Parameters

input_dim (int) – dimension of input vector
num_classes (int) – number of output classes
encoder_hidden_state_dim (int) – the number of features in the encoder hidden state h
decoder_hidden_state_dim (int) – the number of features in the decoder hidden state h
num_encoder_layers (int, optional) – number of recurrent layers (default: 3)
num_decoder_layers (int, optional) – number of recurrent layers (default: 2)
bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)
extractor (str) – type of CNN extractor (default: vgg)
activation (str) – type of activation function (default: hardtanh)
rnn_type (str, optional) – type of RNN cell (default: lstm)
encoder_dropout_p (float, optional) – dropout probability of encoder (default: 0.2)
decoder_dropout_p (float, optional) – dropout probability of decoder (default: 0.2)
pad_id (int, optional) – index of the pad symbol (default: 0)
sos_id (int, optional) – index of the start of sentence symbol (default: 1)
eos_id (int, optional) – index of the end of sentence symbol (default: 2)
attn_mechanism (str, optional) – type of attention mechanism (default: multi-head)
num_heads (int, optional) – number of attention heads. (default: 4)
max_length (int, optional) – max decoding step (default: 400)
joint_ctc_attention (bool, optional) – flag indicating whether joint CTC-attention is used (default: False)

Inputs: inputs, input_lengths, targets, teacher_forcing_ratio

inputs (torch.Tensor): padded tensor of input sequences, typically of size (batch, seq_length, dimension). This information is forwarded to the encoder.
input_lengths (torch.Tensor): tensor containing the length of each input sequence. (batch)
targets (torch.Tensor): tensor of target sequences of size (batch, seq_length), where each sequence is a list of token IDs. This information is forwarded to the decoder.
teacher_forcing_ratio (float): The probability that teacher forcing will be used. A random number is drawn uniformly between 0 and 1 for every decoding token, and if the sample is smaller than the given value, teacher forcing is used (default: 1.0)
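
The sampling rule for teacher_forcing_ratio can be sketched in plain Python. This only illustrates the rule described above and is not kospeech's implementation; feed_decoder, predict_next and the token values are hypothetical names chosen for the example.

```python
import random


def feed_decoder(targets, predict_next, sos_id, teacher_forcing_ratio, rng=None):
    """Return the tokens fed to the decoder at each step.

    targets: ground-truth token IDs for one utterance (hypothetical values).
    predict_next: hypothetical stand-in for the decoder; maps the previous
        input token to the token the model predicts next.
    """
    rng = rng or random.Random()
    fed_tokens = []
    previous = sos_id  # decoding starts from the start-of-sentence symbol
    for target_token in targets:
        fed_tokens.append(previous)
        predicted_token = predict_next(previous)
        # Draw a uniform sample per token; below the ratio means teacher forcing.
        use_teacher_forcing = rng.random() < teacher_forcing_ratio
        previous = target_token if use_teacher_forcing else predicted_token
    return fed_tokens
```

With teacher_forcing_ratio=1.0 every step is fed the ground-truth token, and with 0.0 every step is fed the model's own previous prediction.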

Returns

(Tensor, Tensor, Tensor)

predicted_log_probs (torch.FloatTensor): Log probabilities of the model predictions.
encoder_output_lengths: The length of each encoder output. (batch)
encoder_log_probs: Log probabilities of the encoder outputs, to be passed to the CTC loss. If joint_ctc_attention is False, this is None.

forward(inputs: torch.Tensor, input_lengths: torch.Tensor, targets: Optional[torch.Tensor] = None, teacher_forcing_ratio: float = 1.0) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor]

Forward propagate an inputs and targets pair for training.

Parameters

inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this is a padded FloatTensor of size (batch, seq_length, dimension).
input_lengths (torch.LongTensor) – The length of each input sequence. (batch)
targets (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)
teacher_forcing_ratio (float) – ratio of teacher forcing
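
The relationship between the padded inputs and input_lengths can be sketched without any framework. The helper below is a hypothetical illustration using nested Python lists in place of tensors; pad_batch and pad_value are assumed names and are not part of kospeech.

```python
def pad_batch(sequences, pad_value=0.0):
    """Pad variable-length feature sequences to (batch, max_seq_length, dimension).

    sequences: list of utterances, each a list of feature frames (lists of floats).
    Returns the padded batch and the original length of every utterance.
    """
    lengths = [len(sequence) for sequence in sequences]
    max_length = max(lengths)
    dimension = len(sequences[0][0])
    padded = []
    for sequence in sequences:
        padding = [[pad_value] * dimension for _ in range(max_length - len(sequence))]
        padded.append([list(frame) for frame in sequence] + padding)
    return padded, lengths


# Two utterances with 3 and 1 frames of 2-dimensional features.
batch, batch_lengths = pad_batch([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [[7.0, 8.0]]])
```

The lengths are kept alongside the padded batch so that padded frames can be told apart from real ones.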

Returns

(Tensor, Tensor, Tensor)

predicted_log_probs (torch.FloatTensor): Log probabilities of the model predictions.
encoder_output_lengths: The length of each encoder output. (batch)
encoder_log_probs: Log probabilities of the encoder outputs, to be passed to the CTC loss. If joint_ctc_attention is False, this is None.
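
Because the third return value is None when joint_ctc_attention is False, calling code has to handle both cases. A minimal sketch, assuming the attention loss and the CTC loss are mixed with a fixed interpolation weight; combine_losses and ctc_weight are hypothetical and are not documented here, and the losses are plain floats for illustration.

```python
def combine_losses(attention_loss, ctc_loss=None, ctc_weight=0.3):
    """Mix the decoder (attention) loss with an optional CTC loss.

    ctc_loss is None when no encoder log probabilities were returned,
    in which case only the attention loss is used.
    ctc_weight is an assumed hyperparameter, not a documented default.
    """
    if ctc_loss is None:
        return attention_loss
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss
```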

Encoder

class kospeech.models.las.encoder.EncoderRNN(input_dim: int, num_classes: int = None, hidden_state_dim: int = 512, dropout_p: float = 0.3, num_layers: int = 3, bidirectional: bool = True, rnn_type: str = 'lstm', extractor: str = 'vgg', activation: str = 'hardtanh', joint_ctc_attention: bool = False)

Converts low-level speech signals into higher-level features.

Parameters

input_dim (int) – dimension of input vector
num_classes (int) – number of output classes
hidden_state_dim (int) – the number of features in the encoder hidden state h
num_layers (int, optional) – number of recurrent layers (default: 3)
bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)
extractor (str) – type of CNN extractor (default: vgg)
activation (str) – type of activation function (default: hardtanh)
rnn_type (str, optional) – type of RNN cell (default: lstm)
dropout_p (float, optional) – dropout probability of encoder (default: 0.3)
joint_ctc_attention (bool, optional) – flag indicating whether joint CTC-attention is used (default: False)

Inputs: inputs, input_lengths

inputs: padded batch of input sequences, typically of size (batch, seq_length, dimension)
input_lengths: list of sequence lengths, one per input sequence

Returns: encoder_outputs, output_lengths, encoder_log_probs

encoder_outputs: tensor containing the encoded features of the input sequence
output_lengths: list of sequence lengths produced by Listener
encoder_log_probs: tensor containing log probabilities for the CTC loss

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]

Forward propagate inputs for encoder training.

Parameters

inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this is a padded FloatTensor of size (batch, seq_length, dimension).
input_lengths (torch.LongTensor) – The length of each input sequence. (batch)

Returns

encoder_outputs: An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)
encoder_output_lengths: The length of each encoder output. (batch)
encoder_log_probs: Log probabilities of the encoder outputs, to be passed to the CTC loss. If joint_ctc_attention is False, this is None.

Return type

(Tensor, Tensor, Tensor)

Decoder

class kospeech.models.las.decoder.BeamDecoderRNN(decoder: kospeech.models.las.decoder.DecoderRNN, beam_size: int, batch_size: int)

Beam Search Decoder RNN

decode(encoder_outputs: torch.Tensor, encoder_output_lengths: torch.Tensor) → torch.Tensor

Applies beam search decoding (top-k decoding).
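
The idea behind beam search can be sketched in plain Python: at each step only the beam_size highest-scoring partial hypotheses are kept. This is a generic illustration under stated assumptions, not the BeamDecoderRNN implementation; beam_search, step_log_probs and toy_model are hypothetical names, and toy_model stands in for a real decoder step.

```python
import math


def beam_search(step_log_probs, beam_size, sos_id, eos_id, max_length):
    """Return the best (tokens, score) pair found with a beam of width beam_size.

    step_log_probs: hypothetical callable mapping a token prefix to a dict
        {next_token: log_probability}.
    """
    beams = [([sos_id], 0.0)]
    finished = []
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            for token, log_prob in step_log_probs(tokens).items():
                candidates.append((tokens + [token], score + log_prob))
        # Keep only the top-k candidates by accumulated log probability.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos_id:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses that reached max_length without EOS
    return max(finished, key=lambda item: item[1])


def toy_model(tokens):
    """Toy next-token distribution (SOS=1, EOS=2) used only for this example."""
    last = tokens[-1]
    if last == 1:
        return {3: math.log(0.6), 4: math.log(0.4)}
    if last == 3:
        return {2: math.log(0.55), 5: math.log(0.45)}
    if last == 4:
        return {2: math.log(0.9), 5: math.log(0.1)}
    return {2: 0.0}


greedy_tokens, _ = beam_search(toy_model, beam_size=1, sos_id=1, eos_id=2, max_length=5)
beam_tokens, beam_score = beam_search(toy_model, beam_size=2, sos_id=1, eos_id=2, max_length=5)
```

In this toy setup a beam of width 1 commits to the locally best first token, while a beam of width 2 keeps the alternative alive and ends with a higher-scoring sequence.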

forward(encoder_outputs: torch.Tensor) → list

Forward propagate encoder_outputs for training.

Parameters

targets (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)
encoder_outputs (torch.FloatTensor) – An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

Returns

Log probability of model predictions.

Return type

predicted_log_probs (torch.FloatTensor)

class kospeech.models.las.decoder.DecoderRNN(num_classes: int, max_length: int = 150, hidden_state_dim: int = 1024, pad_id: int = 0, sos_id: int = 1, eos_id: int = 2, attn_mechanism: str = 'multi-head', num_heads: int = 4, num_layers: int = 2, rnn_type: str = 'lstm', dropout_p: float = 0.3)

Converts higher level features (from encoder) into output utterances by specifying a probability distribution over sequences of characters.

Parameters

num_classes (int) – number of output classes
hidden_state_dim (int) – the number of features in the decoder hidden state h
num_layers (int, optional) – number of recurrent layers (default: 2)
rnn_type (str, optional) – type of RNN cell (default: lstm)
pad_id (int, optional) – index of the pad symbol (default: 0)
sos_id (int, optional) – index of the start of sentence symbol (default: 1)
eos_id (int, optional) – index of the end of sentence symbol (default: 2)
attn_mechanism (str, optional) – type of attention mechanism (default: multi-head)
num_heads (int, optional) – number of attention heads. (default: 4)
dropout_p (float, optional) – dropout probability of decoder (default: 0.3)

Inputs: inputs, encoder_outputs, teacher_forcing_ratio

inputs (batch, seq_len, input_size): list of sequences, whose length is the batch size and within which each sequence is a list of token IDs. It is used for teacher forcing when provided (default: None).
encoder_outputs (batch, seq_len, hidden_state_dim): tensor containing the outputs of the encoder. Used for the attention mechanism (default: None).
teacher_forcing_ratio (float): The probability that teacher forcing will be used. A random number is drawn uniformly between 0 and 1 for every decoding token, and if the sample is smaller than the given value, teacher forcing is used (default: 1.0).

Returns: predicted_log_probs

predicted_log_probs: the decoding result, given as log probabilities

decode(encoder_outputs: torch.Tensor, encoder_output_lengths: torch.Tensor) → torch.Tensor

Decode encoder_outputs.

Parameters

encoder_outputs (torch.FloatTensor) – An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)
encoder_output_lengths (torch.LongTensor) – The length of each encoder output. (batch)

Returns

Log probability of model predictions.

Return type

predicted_log_probs (torch.FloatTensor)
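
One common way to turn predicted log probabilities into token IDs is to take the highest-scoring class at each step and stop at eos_id. The sketch below assumes the log probabilities for a single utterance are given as nested Python lists of shape (seq_length, num_classes); log_probs_to_tokens is a hypothetical helper and the exact shape returned by decode is not specified here.

```python
def log_probs_to_tokens(log_probs, eos_id):
    """Pick the most likely class per step and truncate at the first EOS."""
    tokens = []
    for step_scores in log_probs:
        best_class = max(range(len(step_scores)), key=step_scores.__getitem__)
        if best_class == eos_id:
            break
        tokens.append(best_class)
    return tokens


# Toy scores over 4 classes; class 2 plays the role of eos_id.
example_log_probs = [
    [-3.0, -2.0, -5.0, -0.1],
    [-0.2, -4.0, -3.0, -2.0],
    [-4.0, -3.0, -0.1, -2.0],
    [-0.1, -3.0, -4.0, -5.0],
]
example_tokens = log_probs_to_tokens(example_log_probs, eos_id=2)
```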

forward(targets: Optional[torch.Tensor], encoder_outputs: torch.Tensor, teacher_forcing_ratio: float = 1.0) → torch.Tensor

Forward propagate encoder_outputs for training.

Parameters

targets (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)
encoder_outputs (torch.FloatTensor) – An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)
teacher_forcing_ratio (float) – ratio of teacher forcing

Returns

Log probability of model predictions.

Return type

predicted_log_probs (torch.FloatTensor)