RNN Transducer

class kospeech.models.rnnt.model.RNNTransducer(num_classes: int, input_dim: int, num_encoder_layers: int = 4, num_decoder_layers: int = 1, encoder_hidden_state_dim: int = 320, decoder_hidden_state_dim: int = 512, output_dim: int = 512, rnn_type: str = 'lstm', bidirectional: bool = True, encoder_dropout_p: float = 0.2, decoder_dropout_p: float = 0.2, sos_id: int = 1, eos_id: int = 2)[source]

The RNN-Transducer is a sequence-to-sequence model that does not employ an attention mechanism. Unlike most sequence-to-sequence models, which typically process the entire input sequence (the waveform, in our case) before producing an output (the sentence), the RNN-T continuously processes input samples and streams output symbols, a property that is welcome for speech dictation. In this implementation, the output symbols are the characters of the alphabet.
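For orientation, here is a minimal sketch of the transducer composition: the encoder consumes acoustic frames, the prediction network (decoder) consumes previously emitted labels, and a joint network scores every (frame, label) pair. The broadcast-and-concatenate joint shown is one common formulation and an assumption, not necessarily this library's exact implementation; shapes are illustrative:

    import torch

    enc_out = torch.randn(8, 100, 512)   # (batch, T, output_dim) from the encoder
    dec_out = torch.randn(8, 20, 512)    # (batch, U, output_dim) from the decoder

    # Broadcast both sequences onto a (T, U) lattice and concatenate.
    enc_grid = enc_out.unsqueeze(2).expand(-1, -1, dec_out.size(1), -1)
    dec_grid = dec_out.unsqueeze(1).expand(-1, enc_out.size(1), -1, -1)
    joint = torch.cat((enc_grid, dec_grid), dim=-1)   # (batch, T, U, 2 * output_dim)

    fc = torch.nn.Linear(joint.size(-1), 2000)        # num_classes = 2000 (assumed)
    log_probs = fc(joint).log_softmax(dim=-1)         # (batch, T, U, num_classes)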

Parameters
  • num_classes (int) – number of classes

  • input_dim (int) – dimension of input vector

  • num_encoder_layers (int, optional) – number of encoder layers (default: 4)

  • num_decoder_layers (int, optional) – number of decoder layers (default: 1)

  • encoder_hidden_state_dim (int, optional) – hidden state dimension of encoder (default: 320)

  • decoder_hidden_state_dim (int, optional) – hidden state dimension of decoder (default: 512)

  • output_dim (int, optional) – output dimension of encoder and decoder (default: 512)

  • rnn_type (str, optional) – type of rnn cell (default: lstm)

  • bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)

  • encoder_dropout_p (float, optional) – dropout probability of encoder (default: 0.2)

  • decoder_dropout_p (float, optional) – dropout probability of decoder (default: 0.2)

  • sos_id (int, optional) – start-of-sentence token id (default: 1)

  • eos_id (int, optional) – end-of-sentence token id (default: 2)

Inputs: inputs, input_lengths, targets, target_lengths
  • inputs (torch.FloatTensor): An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor): The length of each input sequence. (batch)

  • targets (torch.LongTensor): A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • target_lengths (torch.LongTensor): The length of each target sequence. (batch)

Returns

The model's predictions.

Return type

  • predictions (torch.FloatTensor)
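A minimal usage sketch (the import path comes from the class signature above; the feature dimension, vocabulary size, and sequence lengths are illustrative assumptions):

    import torch
    from kospeech.models.rnnt.model import RNNTransducer

    # Build the model with its documented defaults; 80-dim filterbank
    # features and a 2000-class vocabulary are assumptions.
    model = RNNTransducer(num_classes=2000, input_dim=80)

    batch, seq_length, num_targets = 4, 200, 30
    inputs = torch.randn(batch, seq_length, 80)                 # padded acoustic features
    input_lengths = torch.full((batch,), seq_length, dtype=torch.long)
    targets = torch.randint(3, 2000, (batch, num_targets))      # label ids; 0-2 reserved (assumption)
    target_lengths = torch.full((batch,), num_targets, dtype=torch.long)

    predictions = model(inputs, input_lengths, targets, target_lengths)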

forward(inputs: torch.Tensor, input_lengths: torch.Tensor, targets: torch.Tensor, target_lengths: torch.Tensor) → torch.Tensor[source]

Forward propagate an (inputs, targets) pair for training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of each input sequence. (batch)

  • targets (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • target_lengths (torch.LongTensor) – The length of each target sequence. (batch)

Returns

The model's predictions.

Return type

  • predictions (torch.FloatTensor)
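For training, the returned predictions are typically fed to an RNN-T loss. A sketch using torchaudio.functional.rnnt_loss (this assumes the predictions form a (batch, T, U + 1, num_classes) lattice and that the blank id is 0; verify both against your configuration):

    import torchaudio.functional as AF

    # predictions, targets, input_lengths, target_lengths are the tensors
    # from the forward() call above. rnnt_loss expects int32 lengths/targets.
    # The plain RNN encoder applies no temporal subsampling, so the frame
    # lengths are assumed unchanged from input_lengths.
    loss = AF.rnnt_loss(
        logits=predictions,
        targets=targets.int(),
        logit_lengths=input_lengths.int(),
        target_lengths=target_lengths.int(),
        blank=0,  # assumed blank id
    )
    loss.backward()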

Encoder

class kospeech.models.rnnt.encoder.EncoderRNNT(input_dim: int, hidden_state_dim: int, output_dim: int, num_layers: int, rnn_type: str = 'lstm', dropout_p: float = 0.2, bidirectional: bool = True)[source]

Encoder of RNN-Transducer.

Parameters
  • input_dim (int) – dimension of input vector

  • hidden_state_dim (int, optional) – hidden state dimension of encoder (default: 320)

  • output_dim (int, optional) – output dimension of encoder and decoder (default: 512)

  • num_layers (int, optional) – number of encoder layers (default: 4)

  • rnn_type (str, optional) – type of rnn cell (default: lstm)

  • dropout_p (float, optional) – dropout probability of encoder (default: 0.2)

  • bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)

Inputs: inputs, input_lengths
  • inputs (torch.FloatTensor): An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor): The length of each input sequence. (batch)

Returns

(Tensor, Tensor)

  • outputs (torch.FloatTensor): An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • hidden_states (torch.FloatTensor): A hidden state of the encoder. FloatTensor of size (batch, seq_length, dimension)

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs for encoder training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of each input sequence. (batch)

Returns

(Tensor, Tensor)

  • outputs (torch.FloatTensor): An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • output_lengths (torch.LongTensor): The length of each output sequence. (batch)
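A standalone sketch of running the encoder (dimensions are illustrative; the import path comes from the class signature above):

    import torch
    from kospeech.models.rnnt.encoder import EncoderRNNT

    encoder = EncoderRNNT(input_dim=80, hidden_state_dim=320,
                          output_dim=512, num_layers=4)

    inputs = torch.randn(4, 200, 80)                         # (batch, seq_length, input_dim)
    input_lengths = torch.full((4,), 200, dtype=torch.long)

    outputs, output_lengths = encoder(inputs, input_lengths)
    # outputs: (4, 200, 512) per the documented output_dim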

Decoder

class kospeech.models.rnnt.decoder.DecoderRNNT(num_classes: int, hidden_state_dim: int, output_dim: int, num_layers: int, rnn_type: str = 'lstm', sos_id: int = 1, eos_id: int = 2, dropout_p: float = 0.2)[source]

Decoder of RNN-Transducer.

Parameters
  • num_classes (int) – number of classes

  • hidden_state_dim (int, optional) – hidden state dimension of decoder (default: 512)

  • output_dim (int, optional) – output dimension of encoder and decoder (default: 512)

  • num_layers (int, optional) – number of decoder layers (default: 1)

  • rnn_type (str, optional) – type of rnn cell (default: lstm)

  • sos_id (int, optional) – start-of-sentence token id (default: 1)

  • eos_id (int, optional) – end-of-sentence token id (default: 2)

  • dropout_p (float, optional) – dropout probability of decoder (default: 0.2)

Inputs: inputs, input_lengths, hidden_states

  • inputs (torch.LongTensor): A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • input_lengths (torch.LongTensor): The length of each input sequence. (batch)

  • hidden_states (torch.FloatTensor): A previous hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)

Returns

  • decoder_outputs (torch.FloatTensor): An output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)

  • hidden_states (torch.FloatTensor): A hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)

Return type

(Tensor, Tensor)

forward(inputs: torch.Tensor, input_lengths: torch.Tensor = None, hidden_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs (targets) for training.

Parameters
  • inputs (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • input_lengths (torch.LongTensor) – The length of each input sequence. (batch)

  • hidden_states (torch.FloatTensor) – A previous hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)

Returns

  • decoder_outputs (torch.FloatTensor): An output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)

  • hidden_states (torch.FloatTensor): A hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)

Return type

(Tensor, Tensor)
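A standalone sketch of running the decoder / prediction network (vocabulary size and lengths are illustrative; during training the model typically prepends the sos token to the targets before this call):

    import torch
    from kospeech.models.rnnt.decoder import DecoderRNNT

    decoder = DecoderRNNT(num_classes=2000, hidden_state_dim=512,
                          output_dim=512, num_layers=1)

    targets = torch.randint(3, 2000, (4, 30))                # (batch, seq_length) label ids
    target_lengths = torch.full((4,), 30, dtype=torch.long)

    decoder_outputs, hidden_states = decoder(targets, target_lengths)
    # decoder_outputs: (4, 30, 512) per the documented output_dim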