Conformer
- class kospeech.models.conformer.model.Conformer(num_classes: int, input_dim: int = 80, encoder_dim: int = 512, decoder_dim: int = 640, num_encoder_layers: int = 17, num_decoder_layers: int = 1, decoder_rnn_type: str = 'lstm', num_attention_heads: int = 8, feed_forward_expansion_factor: int = 4, conv_expansion_factor: int = 2, input_dropout_p: float = 0.1, feed_forward_dropout_p: float = 0.1, attention_dropout_p: float = 0.1, conv_dropout_p: float = 0.1, decoder_dropout_p: float = 0.1, conv_kernel_size: int = 31, half_step_residual: bool = True, device: torch.device = 'cuda', decoder: str = None)
Conformer: Convolution-augmented Transformer for Speech Recognition. The paper pairs the encoder with a single-layer LSTM Transducer decoder; currently only the Conformer encoder shown in the paper is implemented.
- Parameters
num_classes (int) – Number of classification classes
input_dim (int, optional) – Dimension of input vector
encoder_dim (int, optional) – Dimension of conformer encoder
decoder_dim (int, optional) – Dimension of conformer decoder
num_encoder_layers (int, optional) – Number of conformer blocks
num_decoder_layers (int, optional) – Number of decoder layers
decoder_rnn_type (str, optional) – type of RNN cell
num_attention_heads (int, optional) – Number of attention heads
feed_forward_expansion_factor (int, optional) – Expansion factor of feed forward module
conv_expansion_factor (int, optional) – Expansion factor of conformer convolution module
input_dropout_p (float, optional) – Probability of input dropout
feed_forward_dropout_p (float, optional) – Probability of feed forward module dropout
attention_dropout_p (float, optional) – Probability of attention module dropout
conv_dropout_p (float, optional) – Probability of conformer convolution module dropout
decoder_dropout_p (float, optional) – Probability of conformer decoder dropout
conv_kernel_size (int or tuple, optional) – Size of the convolving kernel
half_step_residual (bool) – Flag indicating whether to use half-step residual connections
device (torch.device) – torch device (cuda or cpu)
decoder (str, optional) – Decoder to use; if None, the model is trained with CTC decoding
- Inputs: inputs, input_lengths
inputs (batch, time, dim): Tensor containing input vector
input_lengths (batch): list of sequence input lengths
- Returns
Result of model predictions.
- Return type
predictions (torch.FloatTensor)
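A minimal construction sketch (assuming kospeech and PyTorch are installed; the sizes below are illustrative and smaller than the defaults):

    import torch
    from kospeech.models.conformer.model import Conformer

    # Pick a device; the constructor defaults to 'cuda'.
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # decoder=None selects the CTC training path, per the parameter list above.
    model = Conformer(
        num_classes=2000,        # illustrative vocabulary size
        input_dim=80,            # e.g. 80-dim log-mel filterbank features
        encoder_dim=144,         # reduced from the 512 default for a quick test
        num_encoder_layers=4,    # reduced from the 17 default
        device=device,
        decoder=None,
    ).to(device)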
- decode(encoder_outputs: torch.Tensor, max_length: int = None) → torch.Tensor
Decode encoder_outputs.
- Parameters
encoder_outputs (torch.FloatTensor) – An output sequence of the encoder. FloatTensor of size (seq_length, dimension)
max_length (int) – max decoding time step
- Returns
Log probability of model predictions.
- Return type
predicted_log_probs (torch.FloatTensor)
- forward(inputs: torch.Tensor, input_lengths: torch.Tensor, targets: torch.Tensor, target_lengths: torch.Tensor) → Tuple[torch.Tensor, Optional[torch.Tensor]]
Forward propagate an (inputs, targets) pair for training.
- Parameters
inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically a padded FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor) – The length of each input sequence. LongTensor of size (batch)
targets (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)
target_lengths (torch.LongTensor) – The length of each target sequence. LongTensor of size (batch)
- Returns
Result of model predictions.
- Return type
predictions (torch.FloatTensor)
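A hedged sketch of a training-time forward call with dummy tensors shaped per the parameter list above, reusing the model built earlier; the exact contents of the returned predictions depend on the decoder configuration:

    # Dummy padded batch: 3 utterances, up to 400 frames of 80-dim features.
    inputs = torch.randn(3, 400, 80).to(device)
    input_lengths = torch.LongTensor([400, 350, 300])

    # Dummy target transcripts of shape (batch, seq_length), zero-padded.
    targets = torch.LongTensor([[1, 5, 20, 9, 2],
                                [1, 7, 33, 2, 0],
                                [1, 12, 2, 0, 0]]).to(device)
    target_lengths = torch.LongTensor([5, 4, 3])

    predictions = model(inputs, input_lengths, targets, target_lengths)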
- recognize(inputs: torch.Tensor, input_lengths: torch.Tensor) → torch.Tensor
Recognize input speech.
- Parameters
inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically a padded FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor) – The length of each input sequence. LongTensor of size (batch)
- Returns
Result of model predictions.
- Return type
predictions (torch.FloatTensor)
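For inference, recognize() needs only the speech inputs; a sketch under the same assumptions as above:

    model.eval()
    with torch.no_grad():
        predictions = model.recognize(inputs, input_lengths)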
Encoder
- class kospeech.models.conformer.encoder.ConformerBlock(encoder_dim: int = 512, num_attention_heads: int = 8, feed_forward_expansion_factor: int = 4, conv_expansion_factor: int = 2, feed_forward_dropout_p: float = 0.1, attention_dropout_p: float = 0.1, conv_dropout_p: float = 0.1, conv_kernel_size: int = 31, half_step_residual: bool = True, device: torch.device = 'cuda')
A conformer block contains two feed forward modules sandwiching the multi-headed self-attention module and the convolution module. This sandwich structure is inspired by Macaron-Net, which proposes replacing the original feed-forward layer in the Transformer block with two half-step feed-forward layers, one before the attention layer and one after.
- Parameters
encoder_dim (int, optional) – Dimension of conformer encoder
num_attention_heads (int, optional) – Number of attention heads
feed_forward_expansion_factor (int, optional) – Expansion factor of feed forward module
conv_expansion_factor (int, optional) – Expansion factor of conformer convolution module
feed_forward_dropout_p (float, optional) – Probability of feed forward module dropout
attention_dropout_p (float, optional) – Probability of attention module dropout
conv_dropout_p (float, optional) – Probability of conformer convolution module dropout
conv_kernel_size (int or tuple, optional) – Size of the convolving kernel
half_step_residual (bool) – Flag indicating whether to use half-step residual connections
device (torch.device) – torch device (cuda or cpu)
- Inputs: inputs
inputs (batch, time, dim): Tensor containing input vector
- Returns: outputs
outputs (batch, time, dim): Tensor produced by the conformer block.
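A minimal sketch of one block applied to a dummy tensor, assuming a CPU device (the constructor defaults to 'cuda'):

    import torch
    from kospeech.models.conformer.encoder import ConformerBlock

    block = ConformerBlock(encoder_dim=144, num_attention_heads=4,
                           device=torch.device('cpu'))
    x = torch.randn(3, 120, 144)   # (batch, time, dim)
    y = block(x)                   # (batch, time, dim), same shape as the input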
- class kospeech.models.conformer.encoder.ConformerEncoder(input_dim: int = 80, encoder_dim: int = 512, num_layers: int = 17, num_attention_heads: int = 8, feed_forward_expansion_factor: int = 4, conv_expansion_factor: int = 2, input_dropout_p: float = 0.1, feed_forward_dropout_p: float = 0.1, attention_dropout_p: float = 0.1, conv_dropout_p: float = 0.1, conv_kernel_size: int = 31, half_step_residual: bool = True, device: torch.device = 'cuda')
The conformer encoder first processes the input with a convolution subsampling layer and then with a number of conformer blocks.
- Parameters
input_dim (int, optional) – Dimension of input vector
encoder_dim (int, optional) – Dimension of conformer encoder
num_layers (int, optional) – Number of conformer blocks
num_attention_heads (int, optional) – Number of attention heads
feed_forward_expansion_factor (int, optional) – Expansion factor of feed forward module
conv_expansion_factor (int, optional) – Expansion factor of conformer convolution module
input_dropout_p (float, optional) – Probability of input dropout
feed_forward_dropout_p (float, optional) – Probability of feed forward module dropout
attention_dropout_p (float, optional) – Probability of attention module dropout
conv_dropout_p (float, optional) – Probability of conformer convolution module dropout
conv_kernel_size (int or tuple, optional) – Size of the convolving kernel
half_step_residual (bool) – Flag indicating whether to use half-step residual connections
device (torch.device) – torch device (cuda or cpu)
- Inputs: inputs, input_lengths
inputs (batch, time, dim): Tensor containing input vector
input_lengths (batch): list of sequence input lengths
- Returns: outputs, output_lengths
outputs (batch, time, dim): Tensor produced by the conformer encoder.
output_lengths (batch): list of sequence output lengths
- forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
Forward propagate inputs for encoder training.
- Parameters
inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically a padded FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor) – The length of each input sequence. LongTensor of size (batch)
- Returns
- outputs (torch.FloatTensor): An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)
- output_lengths (torch.LongTensor): The length of the output tensor. LongTensor of size (batch)
- Return type
(Tensor, Tensor)
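A hedged usage sketch on dummy features; because of the convolution subsampling layer, output_lengths comes back shorter than input_lengths:

    import torch
    from kospeech.models.conformer.encoder import ConformerEncoder

    encoder = ConformerEncoder(input_dim=80, encoder_dim=144, num_layers=2,
                               device=torch.device('cpu'))
    inputs = torch.randn(3, 400, 80)                  # (batch, time, dim)
    input_lengths = torch.LongTensor([400, 350, 300])
    outputs, output_lengths = encoder(inputs, input_lengths)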
Decoder
- class kospeech.models.rnnt.decoder.DecoderRNNT(num_classes: int, hidden_state_dim: int, output_dim: int, num_layers: int, rnn_type: str = 'lstm', sos_id: int = 1, eos_id: int = 2, dropout_p: float = 0.2)
Decoder of RNN-Transducer.
- Parameters
num_classes (int) – number of classification classes
hidden_state_dim (int, optional) – hidden state dimension of decoder (default: 512)
output_dim (int, optional) – output dimension of encoder and decoder (default: 512)
num_layers (int, optional) – number of decoder layers (default: 1)
rnn_type (str, optional) – type of rnn cell (default: lstm)
sos_id (int, optional) – id of the start-of-sentence token (default: 1)
eos_id (int, optional) – id of the end-of-sentence token (default: 2)
dropout_p (float, optional) – dropout probability of decoder
- Inputs: inputs, input_lengths, hidden_states
inputs (torch.LongTensor): A target sequence passed to the decoder. LongTensor of size (batch, seq_length)
input_lengths (torch.LongTensor): The length of the input tensor. LongTensor of size (batch)
hidden_states (torch.FloatTensor): A previous hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)
- Returns
- decoder_outputs (torch.FloatTensor): An output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)
- hidden_states (torch.FloatTensor): A hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)
- Return type
(Tensor, Tensor)
- forward(inputs: torch.Tensor, input_lengths: torch.Tensor = None, hidden_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor]
Forward propagate inputs (targets) for training.
- Parameters
inputs (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)
input_lengths (torch.LongTensor) – The length of the input tensor. LongTensor of size (batch)
hidden_states (torch.FloatTensor) – A previous hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)
- Returns
- decoder_outputs (torch.FloatTensor): An output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)
- hidden_states (torch.FloatTensor): A hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)
- Return type
(Tensor, Tensor)
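A minimal sketch of the prediction network on dummy targets; the token ids are illustrative, with 1 and 2 as the default sos/eos ids:

    import torch
    from kospeech.models.rnnt.decoder import DecoderRNNT

    decoder = DecoderRNNT(num_classes=2000, hidden_state_dim=512,
                          output_dim=512, num_layers=1)
    targets = torch.LongTensor([[1, 5, 20, 2],        # (batch, seq_length)
                                [1, 7,  9, 2]])
    target_lengths = torch.LongTensor([4, 4])
    decoder_outputs, hidden_states = decoder(targets, target_lengths)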
Modules
- class kospeech.models.conformer.modules.ConformerConvModule(in_channels: int, kernel_size: int = 31, expansion_factor: int = 2, dropout_p: float = 0.1, device: torch.device = 'cuda')
The conformer convolution module starts with a pointwise convolution and a gated linear unit (GLU). This is followed by a single 1-D depthwise convolution layer. Batch normalization is deployed just after the convolution to aid training of deep models.
- Parameters
in_channels (int) – Number of input channels
kernel_size (int or tuple, optional) – Size of the convolving kernel
expansion_factor (int, optional) – Expansion factor of conformer convolution module
dropout_p (float, optional) – Probability of conformer convolution module dropout
device (torch.device) – torch device (cuda or cpu)
- Inputs: inputs
inputs (batch, time, dim): Tensor containing input sequences
- Outputs: outputs
outputs (batch, time, dim): Tensor produced by the conformer convolution module.
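A sketch of the convolution module in isolation, again assuming a CPU device; the (batch, time, dim) shape is preserved:

    import torch
    from kospeech.models.conformer.modules import ConformerConvModule

    conv_module = ConformerConvModule(in_channels=144, device=torch.device('cpu'))
    x = torch.randn(3, 120, 144)   # (batch, time, dim)
    y = conv_module(x)             # (batch, time, dim)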
- class kospeech.models.conformer.modules.FeedForwardModule(encoder_dim: int = 512, expansion_factor: int = 4, dropout_p: float = 0.1, device: torch.device = 'cuda')
The conformer feed forward module follows pre-norm residual units and applies layer normalization within the residual unit and on the input before the first linear layer. It also applies Swish activation and dropout, which help regularize the network.
- Parameters
encoder_dim (int, optional) – Dimension of conformer encoder
expansion_factor (int, optional) – Expansion factor of feed forward module
dropout_p (float, optional) – Probability of feed forward module dropout
device (torch.device) – torch device (cuda or cpu)
- Inputs: inputs
inputs (batch, time, dim): Tensor containing input sequences
- Outputs: outputs
outputs (batch, time, dim): Tensor produced by the feed forward module.
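A corresponding sketch for the feed forward module, under the same CPU assumption:

    import torch
    from kospeech.models.conformer.modules import FeedForwardModule

    ff_module = FeedForwardModule(encoder_dim=144, expansion_factor=4,
                                  device=torch.device('cpu'))
    x = torch.randn(3, 120, 144)   # (batch, time, dim)
    y = ff_module(x)               # (batch, time, dim)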
- class kospeech.models.conformer.modules.MultiHeadedSelfAttentionModule(d_model: int, num_heads: int, dropout_p: float = 0.1, device: torch.device = 'cuda')
Conformer employs multi-headed self-attention (MHSA) while integrating an important technique from Transformer-XL: the relative sinusoidal positional encoding scheme. The relative positional encoding allows the self-attention module to generalize better to different input lengths, and the resulting encoder is more robust to variance in utterance length. Conformer uses pre-norm residual units with dropout, which helps in training and regularizing deeper models.
- Parameters
d_model (int) – Dimension of the model
num_heads (int) – Number of attention heads
dropout_p (float, optional) – Probability of attention module dropout
device (torch.device) – torch device (cuda or cpu)
- Inputs: inputs, mask
inputs (batch, time, dim): Tensor containing input vector
mask (batch, 1, time2) or (batch, time1, time2): Tensor containing indices to be masked
- Returns
Tensor produced by the relative multi-headed self-attention module.
- Return type
outputs (batch, time, dim)
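A sketch of the attention module on a dummy tensor, assuming the mask may be omitted when nothing is padded (the Inputs above list it alongside inputs):

    import torch
    from kospeech.models.conformer.modules import MultiHeadedSelfAttentionModule

    mhsa = MultiHeadedSelfAttentionModule(d_model=144, num_heads=4,
                                          device=torch.device('cpu'))
    x = torch.randn(3, 120, 144)   # (batch, time, dim)
    y = mhsa(x)                    # (batch, time, dim)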