RNN Transducer¶
class kospeech.models.rnnt.model.RNNTransducer(num_classes: int, input_dim: int, num_encoder_layers: int = 4, num_decoder_layers: int = 1, encoder_hidden_state_dim: int = 320, decoder_hidden_state_dim: int = 512, output_dim: int = 512, rnn_type: str = 'lstm', bidirectional: bool = True, encoder_dropout_p: float = 0.2, decoder_dropout_p: float = 0.2, sos_id: int = 1, eos_id: int = 2)[source]¶
The RNN-Transducer is a form of sequence-to-sequence model that does not employ an attention mechanism. Unlike most sequence-to-sequence models, which typically need to process the entire input sequence (the waveform in our case) to produce an output (the sentence), the RNN-T continuously processes input samples and streams output symbols, a property that is welcome for speech dictation. In our implementation, the output symbols are the characters of the alphabet.
- Parameters
num_classes (int) – number of classes
input_dim (int) – dimension of input vector
num_encoder_layers (int, optional) – number of encoder layers (default: 4)
num_decoder_layers (int, optional) – number of decoder layers (default: 1)
encoder_hidden_state_dim (int, optional) – hidden state dimension of encoder (default: 320)
decoder_hidden_state_dim (int, optional) – hidden state dimension of decoder (default: 512)
output_dim (int, optional) – output dimension of encoder and decoder (default: 512)
rnn_type (str, optional) – type of rnn cell (default: lstm)
bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)
encoder_dropout_p (float, optional) – dropout probability of encoder (default: 0.2)
decoder_dropout_p (float, optional) – dropout probability of decoder (default: 0.2)
sos_id (int, optional) – start-of-sentence token id (default: 1)
eos_id (int, optional) – end-of-sentence token id (default: 2)
- Inputs: inputs, input_lengths, targets, target_lengths
- inputs (torch.FloatTensor): An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).
- input_lengths (torch.LongTensor): The length of input tensor. (batch)
- targets (torch.LongTensor): A target sequence passed to the decoder. IntTensor of size (batch, seq_length).
- target_lengths (torch.LongTensor): The length of target tensor. (batch)
- Returns
Result of model predictions.
- Return type
predictions (torch.FloatTensor)
forward(inputs: torch.Tensor, input_lengths: torch.Tensor, targets: torch.Tensor, target_lengths: torch.Tensor) → torch.Tensor[source]¶
Forward propagate an (inputs, targets) pair for training.
- Parameters
inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).
input_lengths (torch.LongTensor) – The length of input tensor. (batch)
targets (torch.LongTensor) – A target sequence passed to the decoder. IntTensor of size (batch, seq_length).
target_lengths (torch.LongTensor) – The length of target tensor. (batch)
- Returns
Result of model predictions.
- Return type
predictions (torch.FloatTensor)
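To make the shapes above concrete, here is a minimal sketch in plain PyTorch of the joint computation an RNN-Transducer performs: encoder outputs over time steps and decoder outputs over label steps are broadcast together and projected to per-label logits. The layer sizes and the concatenation-based joint layout are illustrative assumptions, not kospeech's exact implementation.

```python
import torch
import torch.nn as nn

# Illustrative RNN-T joint step (assumed layout, not kospeech's exact code):
# encoder outputs (batch, T, output_dim) and decoder outputs (batch, U, output_dim)
# are expanded to a common (batch, T, U, output_dim) grid, concatenated,
# and projected to per-label log-probabilities of shape (batch, T, U, num_classes).
batch, T, U, output_dim, num_classes = 2, 50, 10, 512, 30

encoder_outputs = torch.randn(batch, T, output_dim)   # from the encoder
decoder_outputs = torch.randn(batch, U, output_dim)   # from the decoder

fc = nn.Linear(output_dim * 2, num_classes)           # joint projection (assumed)

enc = encoder_outputs.unsqueeze(2).expand(batch, T, U, output_dim)
dec = decoder_outputs.unsqueeze(1).expand(batch, T, U, output_dim)
logits = fc(torch.cat((enc, dec), dim=-1)).log_softmax(dim=-1)

print(logits.shape)  # torch.Size([2, 50, 10, 30])
```

The resulting (batch, T, U, num_classes) lattice is what a transducer loss (e.g. RNN-T loss) consumes during training.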
Encoder¶
class kospeech.models.rnnt.encoder.EncoderRNNT(input_dim: int, hidden_state_dim: int, output_dim: int, num_layers: int, rnn_type: str = 'lstm', dropout_p: float = 0.2, bidirectional: bool = True)[source]¶
Encoder of RNN-Transducer.
- Parameters
input_dim (int) – dimension of input vector
hidden_state_dim (int, optional) – hidden state dimension of encoder (default: 320)
output_dim (int, optional) – output dimension of encoder and decoder (default: 512)
num_layers (int, optional) – number of encoder layers (default: 4)
rnn_type (str, optional) – type of rnn cell (default: lstm)
dropout_p (float, optional) – dropout probability of encoder (default: 0.2)
bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)
- Inputs: inputs, input_lengths
- inputs (torch.FloatTensor): An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).
- input_lengths (torch.LongTensor): The length of input tensor. (batch)
- Returns
(Tensor, Tensor)
- outputs (torch.FloatTensor): An output sequence of encoder. FloatTensor of size (batch, seq_length, dimension)
- output_lengths (torch.LongTensor): The length of output tensor. (batch)
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶
Forward propagate inputs for encoder training.
- Parameters
inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).
input_lengths (torch.LongTensor) – The length of input tensor. (batch)
- Returns
(Tensor, Tensor)
- outputs (torch.FloatTensor): An output sequence of encoder. FloatTensor of size (batch, seq_length, dimension)
- output_lengths (torch.LongTensor): The length of output tensor. (batch)
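A minimal sketch of what an encoder like EncoderRNNT does, in plain PyTorch: pack the padded batch by its lengths, run a bidirectional LSTM, unpack, and project to output_dim. The layer names and the use of pack_padded_sequence are illustrative assumptions; the dimensions follow the defaults documented above.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Illustrative bidirectional LSTM encoder (assumed structure, not kospeech's exact code).
input_dim, hidden_state_dim, output_dim, num_layers = 80, 320, 512, 4

rnn = nn.LSTM(input_dim, hidden_state_dim, num_layers,
              batch_first=True, bidirectional=True)
out_proj = nn.Linear(hidden_state_dim * 2, output_dim)  # 2x for bidirectional

inputs = torch.randn(2, 100, input_dim)   # (batch, seq_length, dimension), padded
input_lengths = torch.tensor([100, 73])   # (batch)

# Packing lets the LSTM skip the padded frames.
packed = pack_padded_sequence(inputs, input_lengths, batch_first=True,
                              enforce_sorted=False)
packed_outputs, _ = rnn(packed)
outputs, output_lengths = pad_packed_sequence(packed_outputs, batch_first=True)
outputs = out_proj(outputs)               # (batch, seq_length, output_dim)

print(outputs.shape, output_lengths)
```

Projecting to output_dim makes the encoder output directly joinable with the decoder output in the transducer's joint network.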
Decoder¶
class kospeech.models.rnnt.decoder.DecoderRNNT(num_classes: int, hidden_state_dim: int, output_dim: int, num_layers: int, rnn_type: str = 'lstm', sos_id: int = 1, eos_id: int = 2, dropout_p: float = 0.2)[source]¶
Decoder of RNN-Transducer.
- Parameters
num_classes (int) – number of classes
hidden_state_dim (int, optional) – hidden state dimension of decoder (default: 512)
output_dim (int, optional) – output dimension of encoder and decoder (default: 512)
num_layers (int, optional) – number of decoder layers (default: 1)
rnn_type (str, optional) – type of rnn cell (default: lstm)
sos_id (int, optional) – start-of-sentence token id (default: 1)
eos_id (int, optional) – end-of-sentence token id (default: 2)
dropout_p (float, optional) – dropout probability of decoder (default: 0.2)
- Inputs: inputs, input_lengths, hidden_states
- inputs (torch.LongTensor): A target sequence passed to the decoder. IntTensor of size (batch, seq_length)
- input_lengths (torch.LongTensor): The length of input tensor. (batch)
- hidden_states (torch.FloatTensor): A previous hidden state of decoder. FloatTensor of size (batch, seq_length, dimension)
- Returns
- decoder_outputs (torch.FloatTensor): An output sequence of decoder. FloatTensor of size (batch, seq_length, dimension)
- hidden_states (torch.FloatTensor): A hidden state of decoder. FloatTensor of size (batch, seq_length, dimension)
- Return type
(Tensor, Tensor)
forward(inputs: torch.Tensor, input_lengths: torch.Tensor = None, hidden_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]¶
Forward propagate inputs (targets) for training.
- Parameters
inputs (torch.LongTensor) – A target sequence passed to the decoder. IntTensor of size (batch, seq_length)
input_lengths (torch.LongTensor) – The length of input tensor. (batch)
hidden_states (torch.FloatTensor) – A previous hidden state of decoder. FloatTensor of size (batch, seq_length, dimension)
- Returns
- decoder_outputs (torch.FloatTensor): An output sequence of decoder. FloatTensor of size (batch, seq_length, dimension)
- hidden_states (torch.FloatTensor): A hidden state of decoder. FloatTensor of size (batch, seq_length, dimension)
- Return type
(Tensor, Tensor)
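A minimal sketch of an RNN-T prediction network in the style of DecoderRNNT, in plain PyTorch: embed the label sequence (presumably prefixed with the sos_id token, given the sos_id parameter above), run it through an LSTM, and project to output_dim. Layer names and the sos-prefixing detail are illustrative assumptions, not kospeech's exact code.

```python
import torch
import torch.nn as nn

# Illustrative RNN-T prediction network (assumed structure, not kospeech's exact code).
num_classes, hidden_state_dim, output_dim, num_layers = 30, 512, 512, 1
sos_id = 1

embedding = nn.Embedding(num_classes, hidden_state_dim)
rnn = nn.LSTM(hidden_state_dim, hidden_state_dim, num_layers, batch_first=True)
out_proj = nn.Linear(hidden_state_dim, output_dim)

targets = torch.randint(0, num_classes, (2, 10))     # (batch, seq_length)
# Prefix each sequence with <sos> (an assumption suggested by the sos_id parameter).
targets = torch.cat(
    (torch.full((2, 1), sos_id, dtype=torch.long), targets), dim=1)

embedded = embedding(targets)
decoder_outputs, hidden_states = rnn(embedded)       # hidden_states can be carried
decoder_outputs = out_proj(decoder_outputs)          # (batch, seq_length + 1, output_dim)

print(decoder_outputs.shape)
```

During streaming decoding, the returned hidden_states would be fed back as the `hidden_states` argument of the next forward call so the prediction network advances one label at a time.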