RNN Transducer

class kospeech.models.rnnt.model.RNNTransducer(num_classes: int, input_dim: int, num_encoder_layers: int = 4, num_decoder_layers: int = 1, encoder_hidden_state_dim: int = 320, decoder_hidden_state_dim: int = 512, output_dim: int = 512, rnn_type: str = 'lstm', bidirectional: bool = True, encoder_dropout_p: float = 0.2, decoder_dropout_p: float = 0.2, sos_id: int = 1, eos_id: int = 2)[source]

The RNN-Transducer is a sequence-to-sequence model that does not employ an attention mechanism. Unlike most sequence-to-sequence models, which typically process the entire input sequence (the waveform, in our case) before producing an output (the sentence), the RNN-T continuously processes input samples and streams output symbols, a property that is welcome for speech dictation. In this implementation, the output symbols are the characters of the alphabet.
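For orientation, here is a minimal sketch of the transducer composition: the encoder consumes acoustic frames, the prediction network (decoder) consumes previously emitted labels, and a joint network scores every (frame, label) pair. The broadcast-and-concatenate joint shown is one common formulation and an assumption, not necessarily this library's exact implementation; shapes are illustrative:

    import torch

    enc_out = torch.randn(8, 100, 512)   # (batch, T, output_dim) from the encoder
    dec_out = torch.randn(8, 20, 512)    # (batch, U, output_dim) from the decoder

    # Broadcast both sequences onto a (T, U) lattice and concatenate.
    enc_grid = enc_out.unsqueeze(2).expand(-1, -1, dec_out.size(1), -1)
    dec_grid = dec_out.unsqueeze(1).expand(-1, enc_out.size(1), -1, -1)
    joint = torch.cat((enc_grid, dec_grid), dim=-1)   # (batch, T, U, 2 * output_dim)

    fc = torch.nn.Linear(joint.size(-1), 2000)        # num_classes = 2000 (assumed)
    log_probs = fc(joint).log_softmax(dim=-1)         # (batch, T, U, num_classes)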

Parameters
  • num_classes (int) – number of classes

  • input_dim (int) – dimension of input vector

  • num_encoder_layers (int, optional) – number of encoder layers (default: 4)

  • num_decoder_layers (int, optional) – number of decoder layers (default: 1)

  • encoder_hidden_state_dim (int, optional) – hidden state dimension of encoder (default: 320)

  • decoder_hidden_state_dim (int, optional) – hidden state dimension of decoder (default: 512)

  • output_dim (int, optional) – output dimension of encoder and decoder (default: 512)

  • rnn_type (str, optional) – type of rnn cell (default: lstm)

  • bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)

  • encoder_dropout_p (float, optional) – dropout probability of encoder (default: 0.2)

  • decoder_dropout_p (float, optional) – dropout probability of decoder (default: 0.2)

  • sos_id (int, optional) – start-of-sentence token id (default: 1)

  • eos_id (int, optional) – end-of-sentence token id (default: 2)

Inputs: inputs, input_lengths, targets, target_lengths
  • inputs (torch.FloatTensor): An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor): The length of each input sequence. (batch)

  • targets (torch.LongTensor): A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • target_lengths (torch.LongTensor): The length of each target sequence. (batch)

Returns

The model's predictions.

Return type

  • predictions (torch.FloatTensor)
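A minimal usage sketch (the import path comes from the class signature above; the feature dimension, vocabulary size, and sequence lengths are illustrative assumptions):

    import torch
    from kospeech.models.rnnt.model import RNNTransducer

    # Build the model with its documented defaults; 80-dim filterbank
    # features and a 2000-class vocabulary are assumptions.
    model = RNNTransducer(num_classes=2000, input_dim=80)

    batch, seq_length, num_targets = 4, 200, 30
    inputs = torch.randn(batch, seq_length, 80)                 # padded acoustic features
    input_lengths = torch.full((batch,), seq_length, dtype=torch.long)
    targets = torch.randint(3, 2000, (batch, num_targets))      # label ids; 0-2 reserved (assumption)
    target_lengths = torch.full((batch,), num_targets, dtype=torch.long)

    predictions = model(inputs, input_lengths, targets, target_lengths)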

forward(inputs: torch.Tensor, input_lengths: torch.Tensor, targets: torch.Tensor, target_lengths: torch.Tensor) → torch.Tensor[source]

Forward propagate an (inputs, targets) pair for training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of each input sequence. (batch)

  • targets (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • target_lengths (torch.LongTensor) – The length of each target sequence. (batch)

Returns

The model's predictions.

Return type

  • predictions (torch.FloatTensor)
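For training, the returned predictions are typically fed to an RNN-T loss. A sketch using torchaudio.functional.rnnt_loss (this assumes the predictions form a (batch, T, U + 1, num_classes) lattice and that the blank id is 0; verify both against your configuration):

    import torchaudio.functional as AF

    # predictions, targets, input_lengths, target_lengths are the tensors
    # from the forward() call above. rnnt_loss expects int32 lengths/targets.
    # The plain RNN encoder applies no temporal subsampling, so the frame
    # lengths are assumed unchanged from input_lengths.
    loss = AF.rnnt_loss(
        logits=predictions,
        targets=targets.int(),
        logit_lengths=input_lengths.int(),
        target_lengths=target_lengths.int(),
        blank=0,  # assumed blank id
    )
    loss.backward()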

Encoder

class kospeech.models.rnnt.encoder.EncoderRNNT(input_dim: int, hidden_state_dim: int, output_dim: int, num_layers: int, rnn_type: str = 'lstm', dropout_p: float = 0.2, bidirectional: bool = True)[source]

Encoder of RNN-Transducer.

Parameters
  • input_dim (int) – dimension of input vector

  • hidden_state_dim (int, optional) – hidden state dimension of encoder (default: 320)

  • output_dim (int, optional) – output dimension of encoder and decoder (default: 512)

  • num_layers (int, optional) – number of encoder layers (default: 4)

  • rnn_type (str, optional) – type of rnn cell (default: lstm)

  • dropout_p (float, optional) – dropout probability of encoder (default: 0.2)

  • bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)

Inputs: inputs, input_lengths
  • inputs (torch.FloatTensor): An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor): The length of each input sequence. (batch)

Returns

(Tensor, Tensor)

  • outputs (torch.FloatTensor): An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • hidden_states (torch.FloatTensor): A hidden state of the encoder. FloatTensor of size (batch, seq_length, dimension)

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs for encoder training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of each input sequence. (batch)

Returns

(Tensor, Tensor)

  • outputs (torch.FloatTensor): An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • output_lengths (torch.LongTensor): The length of each output sequence. (batch)
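A standalone sketch of running the encoder (dimensions are illustrative; the import path comes from the class signature above):

    import torch
    from kospeech.models.rnnt.encoder import EncoderRNNT

    encoder = EncoderRNNT(input_dim=80, hidden_state_dim=320,
                          output_dim=512, num_layers=4)

    inputs = torch.randn(4, 200, 80)                         # (batch, seq_length, input_dim)
    input_lengths = torch.full((4,), 200, dtype=torch.long)

    outputs, output_lengths = encoder(inputs, input_lengths)
    # outputs: (4, 200, 512) per the documented output_dim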

Decoder

class kospeech.models.rnnt.decoder.DecoderRNNT(num_classes: int, hidden_state_dim: int, output_dim: int, num_layers: int, rnn_type: str = 'lstm', sos_id: int = 1, eos_id: int = 2, dropout_p: float = 0.2)[source]

Decoder of RNN-Transducer.

Parameters
  • num_classes (int) – number of classes

  • hidden_state_dim (int, optional) – hidden state dimension of decoder (default: 512)

  • output_dim (int, optional) – output dimension of encoder and decoder (default: 512)

  • num_layers (int, optional) – number of decoder layers (default: 1)

  • rnn_type (str, optional) – type of rnn cell (default: lstm)

  • sos_id (int, optional) – start-of-sentence token id (default: 1)

  • eos_id (int, optional) – end-of-sentence token id (default: 2)

  • dropout_p (float, optional) – dropout probability of decoder (default: 0.2)

Inputs: inputs, input_lengths, hidden_states

  • inputs (torch.LongTensor): A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • input_lengths (torch.LongTensor): The length of each input sequence. (batch)

  • hidden_states (torch.FloatTensor): A previous hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)

Returns

  • decoder_outputs (torch.FloatTensor): An output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)

  • hidden_states (torch.FloatTensor): A hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)

Return type

(Tensor, Tensor)

forward(inputs: torch.Tensor, input_lengths: torch.Tensor = None, hidden_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs (targets) for training.

Parameters
  • inputs (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • input_lengths (torch.LongTensor) – The length of each input sequence. (batch)

  • hidden_states (torch.FloatTensor) – A previous hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)

Returns

  • decoder_outputs (torch.FloatTensor): An output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)

  • hidden_states (torch.FloatTensor): A hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)

Return type

(Tensor, Tensor)
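A standalone sketch of running the decoder / prediction network (vocabulary size and lengths are illustrative; during training the model typically prepends the sos token to the targets before this call):

    import torch
    from kospeech.models.rnnt.decoder import DecoderRNNT

    decoder = DecoderRNNT(num_classes=2000, hidden_state_dim=512,
                          output_dim=512, num_layers=1)

    targets = torch.randint(3, 2000, (4, 30))                # (batch, seq_length) label ids
    target_lengths = torch.full((4,), 30, dtype=torch.long)

    decoder_outputs, hidden_states = decoder(targets, target_lengths)
    # decoder_outputs: (4, 30, 512) per the documented output_dim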