Listen, Attend and Spell

ListenAttendSpell

class kospeech.models.las.model.ListenAttendSpell(input_dim: int, num_classes: int, encoder_hidden_state_dim: int = 512, decoder_hidden_state_dim: int = 1024, num_encoder_layers: int = 3, num_decoder_layers: int = 2, bidirectional: bool = True, extractor: str = 'vgg', activation: str = 'hardtanh', rnn_type: str = 'lstm', max_length: int = 400, pad_id: int = 0, sos_id: int = 1, eos_id: int = 2, attn_mechanism: str = 'multi-head', num_heads: int = 4, encoder_dropout_p: int = 0.2, decoder_dropout_p: int = 0.2, joint_ctc_attention: bool = False)[source]

Listen, Attend and Spell model with configurable encoder and decoder.

Parameters
  • input_dim (int) – dimension of input vector

  • num_classes (int) – number of output classes

  • encoder_hidden_state_dim (int) – the number of features in the encoder hidden state h

  • decoder_hidden_state_dim (int) – the number of features in the decoder hidden state h

  • num_encoder_layers (int, optional) – number of recurrent layers (default: 3)

  • num_decoder_layers (int, optional) – number of recurrent layers (default: 2)

  • bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)

  • extractor (str) – type of CNN extractor (default: vgg)

  • activation (str) – type of activation function (default: hardtanh)

  • rnn_type (str, optional) – type of RNN cell (default: lstm)

  • encoder_dropout_p (float, optional) – dropout probability of encoder (default: 0.2)

  • decoder_dropout_p (float, optional) – dropout probability of decoder (default: 0.2)

  • pad_id (int, optional) – index of the pad symbol (default: 0)

  • sos_id (int, optional) – index of the start of sentence symbol (default: 1)

  • eos_id (int, optional) – index of the end of sentence symbol (default: 2)

  • attn_mechanism (str, optional) – type of attention mechanism (default: multi-head)

  • num_heads (int, optional) – number of attention heads. (default: 4)

  • max_length (int, optional) – max decoding step (default: 400)

  • joint_ctc_attention (bool, optional) – flag indicating whether joint CTC-attention is used (default: False)

Inputs: inputs, input_lengths, targets, teacher_forcing_ratio
  • inputs (torch.Tensor): padded batch of input feature sequences, of size (batch, seq_length, dimension). This information is forwarded to the encoder.

  • input_lengths (torch.Tensor): tensor containing the length of each input sequence, of size (batch).

  • targets (torch.Tensor): tensor of sequences, whose length is the batch size and within which each sequence is a list of token IDs. This information is forwarded to the decoder.

  • teacher_forcing_ratio (float): The probability that teacher forcing will be used. A random number is drawn uniformly from 0-1 for every decoding token, and if the sample is smaller than the given value, teacher forcing is used (default: 1.0)
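The sampling rule described for teacher_forcing_ratio can be sketched without any framework. This is a minimal illustration of the rule as documented, not the library's implementation; the helper name choose_decoder_inputs and the toy token lists are hypothetical.

```python
import random
from typing import List


def choose_decoder_inputs(
    targets: List[int],
    predictions: List[int],
    teacher_forcing_ratio: float,
    rng: random.Random,
) -> List[int]:
    """For each decoding step, pick the ground-truth or the predicted token.

    Hypothetical helper: `targets` are reference token IDs and `predictions`
    are the tokens the model would have produced at the same steps.
    """
    chosen = []
    for target_token, predicted_token in zip(targets, predictions):
        # Draw u uniformly from [0, 1); teacher forcing applies when u < ratio.
        use_teacher_forcing = rng.random() < teacher_forcing_ratio
        chosen.append(target_token if use_teacher_forcing else predicted_token)
    return chosen


rng = random.Random(0)
targets = [5, 6, 7, 8]
predictions = [5, 9, 7, 3]
always = choose_decoder_inputs(targets, predictions, 1.0, rng)  # ratio 1.0
never = choose_decoder_inputs(targets, predictions, 0.0, rng)   # ratio 0.0
```

With a ratio of 1.0 every step is fed the reference token, and with 0.0 every step is fed the model's own prediction; values in between mix the two.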

Returns

(Tensor, Tensor, Tensor)

  • predicted_log_probs (torch.FloatTensor): Log probability of model predictions.

  • encoder_output_lengths: The length of encoder outputs. (batch)

  • encoder_log_probs: Log probabilities of the encoder outputs, to be passed to CTC loss.

    If joint_ctc_attention is False, return None.

forward(inputs: torch.Tensor, input_lengths: torch.Tensor, targets: Optional[torch.Tensor] = None, teacher_forcing_ratio: float = 1.0) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Forward propagates an inputs and targets pair for training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this is a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of input tensor. (batch)

  • targets (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • teacher_forcing_ratio (float) – ratio of teacher forcing

Returns

(Tensor, Tensor, Tensor)

  • predicted_log_probs (torch.FloatTensor): Log probability of model predictions.

  • encoder_output_lengths: The length of encoder outputs. (batch)

  • encoder_log_probs: Log probabilities of the encoder outputs, to be passed to CTC loss.

    If joint_ctc_attention is False, return None.
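Because the third return value is None when joint_ctc_attention is False, training code has to handle both cases. The sketch below shows one common way to combine an attention (cross-entropy) loss with a CTC loss as a weighted sum. The function name joint_loss and the ctc_weight parameter (with its illustrative value 0.3) are assumptions and are not taken from this page.

```python
from typing import Optional


def joint_loss(
    attention_loss: float,
    ctc_loss: Optional[float],
    ctc_weight: float = 0.3,  # hypothetical interpolation weight
) -> float:
    """Interpolate attention and CTC losses; fall back to attention only.

    `ctc_loss` is None when no encoder log probabilities were returned,
    i.e. when joint CTC-attention is disabled.
    """
    if ctc_loss is None:
        return attention_loss
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


attention_only = joint_loss(2.0, None)
combined = joint_loss(2.0, 4.0, ctc_weight=0.25)
```

The weighted sum keeps the attention decoder as the main objective while the CTC term encourages monotonic alignments in the encoder.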

Encoder

class kospeech.models.las.encoder.EncoderRNN(input_dim: int, num_classes: int = None, hidden_state_dim: int = 512, dropout_p: float = 0.3, num_layers: int = 3, bidirectional: bool = True, rnn_type: str = 'lstm', extractor: str = 'vgg', activation: str = 'hardtanh', joint_ctc_attention: bool = False)[source]

Converts low level speech signals into higher level features

Parameters
  • input_dim (int) – dimension of input vector

  • num_classes (int) – number of output classes

  • hidden_state_dim (int) – the number of features in the encoder hidden state h

  • num_layers (int, optional) – number of recurrent layers (default: 3)

  • bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)

  • extractor (str) – type of CNN extractor (default: vgg)

  • activation (str) – type of activation function (default: hardtanh)

  • rnn_type (str, optional) – type of RNN cell (default: lstm)

  • dropout_p (float, optional) – dropout probability of encoder (default: 0.3)

  • joint_ctc_attention (bool, optional) – flag indicating whether joint CTC-attention is used (default: False)

Inputs: inputs, input_lengths
  • inputs: padded batch of input feature sequences, of size (batch, seq_length, dimension)

  • input_lengths: list of sequence lengths

Returns: encoder_outputs, encoder_log_probs, output_lengths
  • encoder_outputs: tensor containing the encoded features of the input sequence

  • encoder_log_probs: tensor containing log probabilities for CTC loss

  • output_lengths: list of sequence lengths produced by Listener

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Forward propagates inputs for encoder training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this is a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of input tensor. (batch)

Returns

  • encoder_outputs: An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_output_lengths: The length of encoder outputs. (batch)

  • encoder_log_probs: Log probabilities of the encoder outputs, to be passed to CTC loss.

    If joint_ctc_attention is False, return None.

Return type

(Tensor, Tensor, Tensor)
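The encoder expects a padded batch together with the true length of each sequence. The following framework-free sketch shows how variable-length feature sequences map to the (batch, seq_length, dimension) layout and the (batch) length vector described above. The helper pad_batch is hypothetical and uses nested lists in place of tensors.

```python
from typing import List, Tuple

Frame = List[float]


def pad_batch(
    sequences: List[List[Frame]], pad_value: float = 0.0
) -> Tuple[List[List[Frame]], List[int]]:
    """Pad feature sequences to a common length and record original lengths."""
    input_lengths = [len(seq) for seq in sequences]
    max_len = max(input_lengths)
    dimension = len(sequences[0][0])
    inputs = []
    for seq in sequences:
        # Append all-pad frames until the sequence reaches max_len.
        padding = [[pad_value] * dimension for _ in range(max_len - len(seq))]
        inputs.append([list(frame) for frame in seq] + padding)
    return inputs, input_lengths


utterances = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # 3 frames, dimension 2
    [[0.7, 0.8]],                          # 1 frame, dimension 2
]
inputs, input_lengths = pad_batch(utterances)
```

Here inputs has shape (2, 3, 2) and input_lengths records which frames are real, so downstream code can ignore the padded positions.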

Decoder

class kospeech.models.las.decoder.BeamDecoderRNN(decoder: kospeech.models.las.decoder.DecoderRNN, beam_size: int, batch_size: int)[source]

Beam Search Decoder RNN

decode(encoder_outputs: torch.Tensor, encoder_output_lengths: torch.Tensor) → torch.Tensor[source]

Applies beam search decoding (top-k decoding).
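As a rough illustration of beam search (top-k) decoding in general, the sketch below keeps the beam_size best partial hypotheses at every step and returns the highest-scoring one. It is not the BeamDecoderRNN implementation: step_log_probs is a hypothetical stand-in for one decoder step, the toy vocabulary is invented, and refinements such as length normalization are omitted.

```python
import math
from typing import Callable, List, Tuple

Hypothesis = Tuple[List[int], float]  # (token IDs, cumulative log probability)


def beam_search(
    step_log_probs: Callable[[List[int]], List[float]],
    beam_size: int,
    sos_id: int = 1,
    eos_id: int = 2,
    max_length: int = 10,
) -> List[int]:
    """Return the best-scoring token sequence found by beam search."""
    beams: List[Hypothesis] = [([sos_id], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_length):
        candidates: List[Hypothesis] = []
        for prefix, score in beams:
            # Expand every live hypothesis by every vocabulary entry.
            for token, log_p in enumerate(step_log_probs(prefix)):
                candidates.append((prefix + [token], score + log_p))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            # Hypotheses ending in eos are complete; the rest stay in the beam.
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by max_length
    return max(finished, key=lambda item: item[1])[0]


def toy_step(prefix: List[int]) -> List[float]:
    """Toy next-token model over the vocabulary 0=pad, 1=sos, 2=eos, 3=a, 4=b."""
    if prefix[-1] == 1:
        probs = [0.0, 0.0, 0.0, 0.6, 0.4]
    elif prefix[-1] == 3:
        probs = [0.0, 0.0, 0.5, 0.25, 0.25]
    else:
        probs = [0.0, 0.0, 0.9, 0.05, 0.05]
    return [math.log(p) if p > 0 else float("-inf") for p in probs]


greedy_result = beam_search(toy_step, beam_size=1)
beam_result = beam_search(toy_step, beam_size=2)
```

In this toy model a beam of size 1 commits to the locally best first token (probability 0.6 × 0.5 = 0.30 overall), while a beam of size 2 finds the sequence with higher total probability (0.4 × 0.9 = 0.36).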

forward(encoder_outputs: torch.Tensor) → list[source]

Forward propagates encoder_outputs for training.

Parameters
  • targets (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • encoder_outputs (torch.FloatTensor) – An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

Returns

Log probability of model predictions.

Return type

  • predicted_log_probs (torch.FloatTensor)

class kospeech.models.las.decoder.DecoderRNN(num_classes: int, max_length: int = 150, hidden_state_dim: int = 1024, pad_id: int = 0, sos_id: int = 1, eos_id: int = 2, attn_mechanism: str = 'multi-head', num_heads: int = 4, num_layers: int = 2, rnn_type: str = 'lstm', dropout_p: float = 0.3)[source]

Converts higher level features (from encoder) into output utterances by specifying a probability distribution over sequences of characters.

Parameters
  • num_classes (int) – number of output classes

  • hidden_state_dim (int) – the number of features in the decoder hidden state h

  • num_layers (int, optional) – number of recurrent layers (default: 2)

  • rnn_type (str, optional) – type of RNN cell (default: lstm)

  • pad_id (int, optional) – index of the pad symbol (default: 0)

  • sos_id (int, optional) – index of the start of sentence symbol (default: 1)

  • eos_id (int, optional) – index of the end of sentence symbol (default: 2)

  • attn_mechanism (str, optional) – type of attention mechanism (default: multi-head)

  • num_heads (int, optional) – number of attention heads. (default: 4)

  • dropout_p (float, optional) – dropout probability of decoder (default: 0.3)

Inputs: inputs, encoder_outputs, teacher_forcing_ratio
  • inputs (batch, seq_len, input_size): list of sequences, whose length is the batch size and within which each sequence is a list of token IDs. It is used for teacher forcing when provided. (default None)

  • encoder_outputs (batch, seq_len, hidden_state_dim): tensor containing the outputs of the encoder. Used for the attention mechanism (default is None).

  • teacher_forcing_ratio (float): The probability that teacher forcing will be used. A random number is drawn uniformly from 0-1 for every decoding token, and if the sample is smaller than the given value, teacher forcing is used (default is 1.0).

Returns: predicted_log_probs
  • predicted_log_probs: list containing the decoding results (log probabilities)

decode(encoder_outputs: torch.Tensor, encoder_output_lengths: torch.Tensor) → torch.Tensor[source]

Decode encoder_outputs.

Parameters
  • encoder_outputs (torch.FloatTensor) – An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_output_lengths (torch.LongTensor) – The length of encoder outputs. (batch)

Returns

Log probability of model predictions.

Return type

  • predicted_log_probs (torch.FloatTensor)
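The returned log probabilities still have to be turned into token IDs. A minimal sketch of greedy post-processing for a single utterance follows, assuming the log probabilities are laid out as one row per decoding step with one entry per class; that layout, the helper name greedy_tokens, and the toy values are assumptions rather than documented behavior.

```python
import math
from typing import List


def greedy_tokens(log_probs: List[List[float]], eos_id: int = 2) -> List[int]:
    """Take the arg-max class at each step and stop at the first eos token."""
    tokens = []
    for step in log_probs:
        best = max(range(len(step)), key=step.__getitem__)
        if best == eos_id:
            break  # everything after end-of-sentence is discarded
        tokens.append(best)
    return tokens


# Toy log probabilities over 5 classes (0=pad, 1=sos, 2=eos, 3 and 4 are symbols).
log_probs = [
    [math.log(p) for p in (0.05, 0.05, 0.10, 0.70, 0.10)],
    [math.log(p) for p in (0.05, 0.05, 0.10, 0.20, 0.60)],
    [math.log(p) for p in (0.05, 0.05, 0.80, 0.05, 0.05)],
    [math.log(p) for p in (0.05, 0.05, 0.10, 0.70, 0.10)],
]
token_ids = greedy_tokens(log_probs)
```

The resulting IDs can then be mapped back to characters with whatever vocabulary was used to build the targets.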

forward(targets: Optional[torch.Tensor], encoder_outputs: torch.Tensor, teacher_forcing_ratio: float = 1.0) → torch.Tensor[source]

Forward propagates encoder_outputs for training.

Parameters
  • targets (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • encoder_outputs (torch.FloatTensor) – An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • teacher_forcing_ratio (float) – ratio of teacher forcing

Returns

Log probability of model predictions.

Return type

  • predicted_log_probs (torch.FloatTensor)