Listen, Attend and Spell
ListenAttendSpell
class kospeech.models.las.model.ListenAttendSpell(input_dim: int, num_classes: int, encoder_hidden_state_dim: int = 512, decoder_hidden_state_dim: int = 1024, num_encoder_layers: int = 3, num_decoder_layers: int = 2, bidirectional: bool = True, extractor: str = 'vgg', activation: str = 'hardtanh', rnn_type: str = 'lstm', max_length: int = 400, pad_id: int = 0, sos_id: int = 1, eos_id: int = 2, attn_mechanism: str = 'multi-head', num_heads: int = 4, encoder_dropout_p: int = 0.2, decoder_dropout_p: int = 0.2, joint_ctc_attention: bool = False)
Listen, Attend and Spell model with configurable encoder and decoder.
- Parameters
input_dim (int) – dimension of input vector
num_classes (int) – number of output classes
encoder_hidden_state_dim (int) – the number of features in the encoder hidden state h
decoder_hidden_state_dim (int) – the number of features in the decoder hidden state h
num_encoder_layers (int, optional) – number of recurrent layers (default: 3)
num_decoder_layers (int, optional) – number of recurrent layers (default: 2)
bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)
extractor (str) – type of CNN extractor (default: vgg)
activation (str) – type of activation function (default: hardtanh)
rnn_type (str, optional) – type of RNN cell (default: lstm)
encoder_dropout_p (float, optional) – dropout probability of encoder (default: 0.2)
decoder_dropout_p (float, optional) – dropout probability of decoder (default: 0.2)
pad_id (int, optional) – index of the pad symbol (default: 0)
sos_id (int, optional) – index of the start of sentence symbol (default: 1)
eos_id (int, optional) – index of the end of sentence symbol (default: 2)
attn_mechanism (str, optional) – type of attention mechanism (default: multi-head)
num_heads (int, optional) – number of attention heads. (default: 4)
max_length (int, optional) – max decoding step (default: 400)
joint_ctc_attention (bool, optional) – flag indicating whether to use joint CTC-attention (default: False)
- Inputs: inputs, input_lengths, targets, teacher_forcing_ratio
inputs (torch.Tensor): padded tensor of input feature sequences, whose first dimension is the batch size. This information is forwarded to the encoder.
input_lengths (torch.Tensor): tensor containing the length of each input sequence.
targets (torch.Tensor): tensor of sequences, whose length is the batch size and within which each sequence is a list of token IDs. This information is forwarded to the decoder.
teacher_forcing_ratio (float): The probability that teacher forcing will be used. A random number is drawn uniformly from 0-1 for every decoding token, and if the sample is smaller than the given value, teacher forcing is used (default: 1.0)
- Returns
(Tensor, Tensor, Tensor)
predicted_log_probs (torch.FloatTensor): Log probabilities of the model predictions.
encoder_output_lengths: The lengths of the encoder outputs, of size (batch).
encoder_log_probs: Log probabilities of the encoder outputs, to be passed to the CTC loss. None if joint_ctc_attention is False.
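The teacher_forcing_ratio input described above can be illustrated with a minimal, framework-free sketch. It follows the per-token wording of this page (one uniform draw per decoding step); the helper name `choose_decoder_inputs` and the toy token lists are hypothetical, and the library's actual implementation may differ.

```python
import random

def choose_decoder_inputs(targets, predictions, teacher_forcing_ratio, rng):
    """Pick, per decoding step, the ground-truth token or the model's own prediction.

    `targets` and `predictions` are hypothetical lists of token IDs of equal length.
    """
    chosen = []
    for target_token, predicted_token in zip(targets, predictions):
        # Draw uniformly from [0, 1); use teacher forcing if the sample is below the ratio.
        use_teacher_forcing = rng.random() < teacher_forcing_ratio
        chosen.append(target_token if use_teacher_forcing else predicted_token)
    return chosen

rng = random.Random(0)
targets = [5, 6, 7, 8]       # ground-truth token IDs (illustrative)
predictions = [5, 9, 7, 3]   # tokens the model predicted at the previous steps (illustrative)

always = choose_decoder_inputs(targets, predictions, 1.0, rng)  # ratio 1.0: always ground truth
never = choose_decoder_inputs(targets, predictions, 0.0, rng)   # ratio 0.0: always own predictions
```

With a ratio strictly between 0 and 1, the decoder sees a mix of ground-truth and self-predicted tokens.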
forward(inputs: torch.Tensor, input_lengths: torch.Tensor, targets: Optional[torch.Tensor] = None, teacher_forcing_ratio: float = 1.0) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
Forward propagate an inputs and targets pair for training.
- Parameters
inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically a padded FloatTensor of size (batch, seq_length, dimension).
input_lengths (torch.LongTensor) – The lengths of the input sequences, of size (batch).
targets (torch.LongTensor) – A target sequence passed to the decoder. IntTensor of size (batch, seq_length).
teacher_forcing_ratio (float) – ratio of teacher forcing (default: 1.0)
- Returns
(Tensor, Tensor, Tensor)
predicted_log_probs (torch.FloatTensor): Log probabilities of the model predictions.
encoder_output_lengths: The lengths of the encoder outputs, of size (batch).
encoder_log_probs: Log probabilities of the encoder outputs, to be passed to the CTC loss. None if joint_ctc_attention is False.
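The relationship between the padded inputs of size (batch, seq_length, dimension) and input_lengths of size (batch) can be sketched as follows. This is a plain-Python illustration with nested lists standing in for tensors; `pad_batch` and the zero padding value are assumptions made for the example, not part of the library.

```python
def pad_batch(sequences, pad_value=0.0):
    """Pad variable-length feature sequences to a common length.

    `sequences` is a list of sequences; each sequence is a list of frames,
    and each frame is a list of floats of the same dimension.
    Returns (padded, lengths), where `padded` has shape
    (batch, max_seq_length, dimension) and `lengths` has shape (batch).
    """
    lengths = [len(seq) for seq in sequences]
    max_length = max(lengths)
    dimension = len(sequences[0][0])
    padded = []
    for seq in sequences:
        # Append all-pad frames until the sequence reaches the longest length.
        pad_frames = [[pad_value] * dimension for _ in range(max_length - len(seq))]
        padded.append([list(frame) for frame in seq] + pad_frames)
    return padded, lengths

batch = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # 3 frames, dimension 2
    [[0.7, 0.8]],                           # 1 frame, dimension 2
]
inputs, input_lengths = pad_batch(batch)
```

The lengths let downstream code ignore the padded frames when it processes the batch.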
Encoder
class kospeech.models.las.encoder.EncoderRNN(input_dim: int, num_classes: int = None, hidden_state_dim: int = 512, dropout_p: float = 0.3, num_layers: int = 3, bidirectional: bool = True, rnn_type: str = 'lstm', extractor: str = 'vgg', activation: str = 'hardtanh', joint_ctc_attention: bool = False)
Converts low level speech signals into higher level features
- Parameters
input_dim (int) – dimension of input vector
num_classes (int) – number of output classes
hidden_state_dim (int) – the number of features in the encoder hidden state h
num_layers (int, optional) – number of recurrent layers (default: 3)
bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)
extractor (str) – type of CNN extractor (default: vgg)
activation (str) – type of activation function (default: hardtanh)
rnn_type (str, optional) – type of RNN cell (default: lstm)
dropout_p (float, optional) – dropout probability of encoder (default: 0.3)
joint_ctc_attention (bool, optional) – flag indicating whether to use joint CTC-attention (default: False)
- Inputs: inputs, input_lengths
inputs: padded tensor of input feature sequences, whose first dimension is the batch size
input_lengths: tensor of sequence lengths
- Returns: encoder_outputs, encoder_log_probs, output_lengths
encoder_outputs: tensor containing the encoded features of the input sequence
encoder_log_probs: tensor containing log probabilities for the CTC loss
output_lengths: list of sequence lengths produced by the Listener
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
Forward propagate inputs for encoder training.
- Parameters
inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically a padded FloatTensor of size (batch, seq_length, dimension).
input_lengths (torch.LongTensor) – The lengths of the input sequences, of size (batch).
- Returns
encoder_outputs: An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension).
encoder_output_lengths: The lengths of the encoder outputs, of size (batch).
encoder_log_probs: Log probabilities of the encoder outputs, to be passed to the CTC loss. None if joint_ctc_attention is False.
- Return type
(Tensor, Tensor, Tensor)
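To make the term "log probability" concrete, here is a small sketch that turns one frame of raw class scores into log probabilities with a numerically stable log-softmax. That the encoder derives encoder_log_probs this way is an assumption of this example (the page only states that log probabilities are returned), and the function name and scores are illustrative.

```python
import math

def log_softmax(scores):
    """Return log probabilities for a list of raw class scores.

    Subtracting the maximum score before exponentiating avoids overflow.
    """
    max_score = max(scores)
    log_sum = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return [s - log_sum for s in scores]

frame_scores = [2.0, 1.0, 0.1]           # illustrative scores over 3 classes
log_probs = log_softmax(frame_scores)
total = sum(math.exp(lp) for lp in log_probs)  # exponentiated log probabilities sum to 1
```

Every entry of `log_probs` is at most zero, and the largest score keeps the largest log probability.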
Decoder
class kospeech.models.las.decoder.BeamDecoderRNN(decoder: kospeech.models.las.decoder.DecoderRNN, beam_size: int, batch_size: int)
Beam Search Decoder RNN
decode(encoder_outputs: torch.Tensor, encoder_output_lengths: torch.Tensor) → torch.Tensor
Applies beam search decoding (top-k decoding).
forward(encoder_outputs: torch.Tensor) → list
Forward propagate encoder_outputs for training.
- Parameters
targets (torch.LongTensor) – A target sequence passed to the decoder. IntTensor of size (batch, seq_length).
encoder_outputs (torch.FloatTensor) – An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension).
- Returns
Log probability of model predictions.
- Return type
predicted_log_probs (torch.FloatTensor)
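The idea behind beam search (top-k) decoding can be sketched for a single utterance in plain Python. This is a generic textbook version, not the BeamDecoderRNN implementation: `beam_search`, the `step_log_probs` callback, and `toy_step` are hypothetical, and the sketch applies no length normalization.

```python
import math

def beam_search(step_log_probs, beam_size, sos_id, eos_id, max_length):
    """Keep the `beam_size` best partial hypotheses at every decoding step.

    `step_log_probs(prefix)` is a hypothetical callback returning a list of
    log probabilities over all classes for the token that follows `prefix`.
    """
    beams = [([sos_id], 0.0)]  # (tokens, cumulative log probability)
    finished = []
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            for token_id, log_prob in enumerate(step_log_probs(tokens)):
                candidates.append((tokens + [token_id], score + log_prob))
        # Keep only the top-k candidates by cumulative log probability.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos_id:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses that reached max_length without eos
    return max(finished, key=lambda item: item[1])[0]

def toy_step(prefix):
    # Toy model over 5 classes: prefers token 3, then token 4, then eos (id 2).
    preferred = {1: 3, 2: 4}.get(len(prefix), 2)
    probs = [0.01] * 5
    probs[preferred] = 0.96
    return [math.log(p) for p in probs]

best = beam_search(toy_step, beam_size=2, sos_id=1, eos_id=2, max_length=10)
```

With beam_size set to 1 this reduces to greedy decoding; larger beams trade computation for a wider search.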
class kospeech.models.las.decoder.DecoderRNN(num_classes: int, max_length: int = 150, hidden_state_dim: int = 1024, pad_id: int = 0, sos_id: int = 1, eos_id: int = 2, attn_mechanism: str = 'multi-head', num_heads: int = 4, num_layers: int = 2, rnn_type: str = 'lstm', dropout_p: float = 0.3)
Converts higher level features (from encoder) into output utterances by specifying a probability distribution over sequences of characters.
- Parameters
num_classes (int) – number of output classes
hidden_state_dim (int) – the number of features in the decoder hidden state h
num_layers (int, optional) – number of recurrent layers (default: 2)
rnn_type (str, optional) – type of RNN cell (default: lstm)
pad_id (int, optional) – index of the pad symbol (default: 0)
sos_id (int, optional) – index of the start of sentence symbol (default: 1)
eos_id (int, optional) – index of the end of sentence symbol (default: 2)
attn_mechanism (str, optional) – type of attention mechanism (default: multi-head)
num_heads (int, optional) – number of attention heads. (default: 4)
dropout_p (float, optional) – dropout probability of decoder (default: 0.3)
- Inputs: inputs, encoder_outputs, teacher_forcing_ratio
inputs (batch, seq_len, input_size): list of sequences, whose length is the batch size and within which each sequence is a list of token IDs. It is used for teacher forcing when provided. (default None)
encoder_outputs (batch, seq_len, hidden_state_dim): tensor containing the outputs of the encoder. Used for the attention mechanism (default is None).
teacher_forcing_ratio (float): The probability that teacher forcing will be used. A random number is drawn uniformly from 0-1 for every decoding token, and if the sample is smaller than the given value, teacher forcing is used (default is 1.0).
- Returns: predicted_log_probs
predicted_log_probs: list containing the decoding result (log probabilities)
decode(encoder_outputs: torch.Tensor, encoder_output_lengths: torch.Tensor) → torch.Tensor
Decode encoder_outputs.
- Parameters
encoder_outputs (torch.FloatTensor) – An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension).
encoder_output_lengths (torch.LongTensor) – The lengths of the encoder outputs, of size (batch).
- Returns
Log probability of model predictions.
- Return type
predicted_log_probs (torch.FloatTensor)
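The roles of the sos_id, eos_id, and max_length parameters during decoding can be sketched with a simple greedy loop. This is an assumption about how step-by-step decoding generally works, not a description of the library's decode method; `greedy_decode`, `step_log_probs`, and `toy_step` are hypothetical names.

```python
import math

def greedy_decode(step_log_probs, sos_id, eos_id, max_length):
    """Start from sos_id and repeatedly append the most likely next token.

    Stops when eos_id is produced or after `max_length` decoding steps.
    `step_log_probs(prefix)` is a hypothetical callback returning log
    probabilities over all classes for the token that follows `prefix`.
    """
    tokens = [sos_id]
    for _ in range(max_length):
        log_probs = step_log_probs(tokens)
        next_token = max(range(len(log_probs)), key=lambda i: log_probs[i])
        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens

def toy_step(prefix):
    # Toy model over 5 classes that emits tokens 3, 4 and then eos (id 2).
    script = [3, 4, 2]
    position = len(prefix) - 1
    preferred = script[position] if position < len(script) else 2
    probs = [0.05] * 5
    probs[preferred] = 0.8
    return [math.log(p) for p in probs]

full = greedy_decode(toy_step, sos_id=1, eos_id=2, max_length=10)      # stops at eos
truncated = greedy_decode(toy_step, sos_id=1, eos_id=2, max_length=2)  # stops at max_length
```

The second call shows why max_length matters: it bounds decoding when the end-of-sentence symbol is never reached.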
forward(targets: Optional[torch.Tensor], encoder_outputs: torch.Tensor, teacher_forcing_ratio: float = 1.0) → torch.Tensor
Forward propagate encoder_outputs for training.
- Parameters
targets (torch.LongTensor) – A target sequence passed to the decoder. IntTensor of size (batch, seq_length).
encoder_outputs (torch.FloatTensor) – An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension).
teacher_forcing_ratio (float) – ratio of teacher forcing (default: 1.0)
- Returns
Log probability of model predictions.
- Return type
predicted_log_probs (torch.FloatTensor)