Conformer

Conformer

class kospeech.models.conformer.model.Conformer(num_classes: int, input_dim: int = 80, encoder_dim: int = 512, decoder_dim: int = 640, num_encoder_layers: int = 17, num_decoder_layers: int = 1, decoder_rnn_type: str = 'lstm', num_attention_heads: int = 8, feed_forward_expansion_factor: int = 4, conv_expansion_factor: int = 2, input_dropout_p: float = 0.1, feed_forward_dropout_p: float = 0.1, attention_dropout_p: float = 0.1, conv_dropout_p: float = 0.1, decoder_dropout_p: float = 0.1, conv_kernel_size: int = 31, half_step_residual: bool = True, device: torch.device = 'cuda', decoder: str = None)[source]

Conformer: Convolution-augmented Transformer for Speech Recognition. The paper used a single-LSTM Transducer decoder; currently, only the Conformer encoder described in the paper is implemented.

Parameters
  • num_classes (int) – Number of classification classes

  • input_dim (int, optional) – Dimension of input vector

  • encoder_dim (int, optional) – Dimension of conformer encoder

  • decoder_dim (int, optional) – Dimension of conformer decoder

  • num_encoder_layers (int, optional) – Number of conformer blocks

  • num_decoder_layers (int, optional) – Number of decoder layers

  • decoder_rnn_type (str, optional) – type of RNN cell

  • num_attention_heads (int, optional) – Number of attention heads

  • feed_forward_expansion_factor (int, optional) – Expansion factor of feed forward module

  • conv_expansion_factor (int, optional) – Expansion factor of conformer convolution module

  • input_dropout_p (float, optional) – Probability of input dropout

  • feed_forward_dropout_p (float, optional) – Probability of feed forward module dropout

  • attention_dropout_p (float, optional) – Probability of attention module dropout

  • conv_dropout_p (float, optional) – Probability of conformer convolution module dropout

  • decoder_dropout_p (float, optional) – Probability of conformer decoder dropout

  • conv_kernel_size (int or tuple, optional) – Size of the convolving kernel

  • half_step_residual (bool) – Flag indicating whether to use half-step residual connections

  • device (torch.device) – torch device (cuda or cpu)

  • decoder (str) – If None, the model is trained with CTC decoding

Inputs: inputs, input_lengths
  • inputs (batch, time, dim): Tensor containing input vector

  • input_lengths (batch): list of sequence input lengths

Returns

Result of model predictions.

Return type

  • predictions (torch.FloatTensor)

decode(encoder_outputs: torch.Tensor, max_length: int = None) → torch.Tensor[source]

Decode encoder_outputs.

Parameters
  • encoder_outputs (torch.FloatTensor) – An output sequence from the encoder. FloatTensor of size (seq_length, dimension)

  • max_length (int) – max decoding time step

Returns

Log probability of model predictions.

Return type

  • predicted_log_probs (torch.FloatTensor)

forward(inputs: torch.Tensor, input_lengths: torch.Tensor, targets: torch.Tensor, target_lengths: torch.Tensor) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Forward propagate an (inputs, targets) pair for training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically, this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of input tensor. (batch)

  • targets (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • target_lengths (torch.LongTensor) – The length of target tensor. (batch)

Returns

Result of model predictions.

Return type

  • predictions (torch.FloatTensor)

recognize(inputs: torch.Tensor, input_lengths: torch.Tensor) → torch.Tensor[source]

Recognize input speech.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically, this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of input tensor. (batch)

Returns

Result of model predictions.

Return type

  • predictions (torch.FloatTensor)
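As a quick orientation, here is a minimal usage sketch based on the signatures documented above. The reduced encoder_dim, the CPU device, and the random features are illustrative assumptions for a quick test, not values from the paper.

    import torch
    from kospeech.models.conformer.model import Conformer

    # Illustrative sketch: a small Conformer run on CPU with random features.
    # encoder_dim=144, num_encoder_layers=2, and num_classes=10 are arbitrary test values.
    model = Conformer(
        num_classes=10,
        input_dim=80,
        encoder_dim=144,
        num_encoder_layers=2,
        num_attention_heads=4,
        device=torch.device('cpu'),
        decoder=None,              # None selects CTC decoding, per the docs above
    )

    inputs = torch.randn(4, 200, 80)                        # (batch, time, dim)
    input_lengths = torch.LongTensor([200, 180, 150, 120])  # (batch)

    predictions = model.recognize(inputs, input_lengths)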

Encoder

class kospeech.models.conformer.encoder.ConformerBlock(encoder_dim: int = 512, num_attention_heads: int = 8, feed_forward_expansion_factor: int = 4, conv_expansion_factor: int = 2, feed_forward_dropout_p: float = 0.1, attention_dropout_p: float = 0.1, conv_dropout_p: float = 0.1, conv_kernel_size: int = 31, half_step_residual: bool = True, device: torch.device = 'cuda')[source]

Conformer block contains two Feed Forward modules sandwiching the Multi-Headed Self-Attention module and the Convolution module. This sandwich structure is inspired by Macaron-Net, which proposes replacing the original feed-forward layer in the Transformer block with two half-step feed-forward layers, one before the attention layer and one after.

Parameters
  • encoder_dim (int, optional) – Dimension of conformer encoder

  • num_attention_heads (int, optional) – Number of attention heads

  • feed_forward_expansion_factor (int, optional) – Expansion factor of feed forward module

  • conv_expansion_factor (int, optional) – Expansion factor of conformer convolution module

  • feed_forward_dropout_p (float, optional) – Probability of feed forward module dropout

  • attention_dropout_p (float, optional) – Probability of attention module dropout

  • conv_dropout_p (float, optional) – Probability of conformer convolution module dropout

  • conv_kernel_size (int or tuple, optional) – Size of the convolving kernel

  • half_step_residual (bool) – Flag indicating whether to use half-step residual connections

  • device (torch.device) – torch device (cuda or cpu)

Inputs: inputs
  • inputs (batch, time, dim): Tensor containing input vector

Returns: outputs
  • outputs (batch, time, dim): Tensor produced by the conformer block.
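For intuition, a single block can be exercised on random features. The dimensions and the CPU device below are illustrative assumptions.

    import torch
    from kospeech.models.conformer.encoder import ConformerBlock

    # Illustrative sketch: one Conformer block on random input (CPU assumed).
    block = ConformerBlock(encoder_dim=144, num_attention_heads=4,
                           device=torch.device('cpu'))
    inputs = torch.randn(2, 100, 144)  # (batch, time, dim)
    outputs = block(inputs)            # same shape: (batch, time, dim)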

class kospeech.models.conformer.encoder.ConformerEncoder(input_dim: int = 80, encoder_dim: int = 512, num_layers: int = 17, num_attention_heads: int = 8, feed_forward_expansion_factor: int = 4, conv_expansion_factor: int = 2, input_dropout_p: float = 0.1, feed_forward_dropout_p: float = 0.1, attention_dropout_p: float = 0.1, conv_dropout_p: float = 0.1, conv_kernel_size: int = 31, half_step_residual: bool = True, device: torch.device = 'cuda')[source]

Conformer encoder first processes the input with a convolution subsampling layer and then with a number of conformer blocks.

Parameters
  • input_dim (int, optional) – Dimension of input vector

  • encoder_dim (int, optional) – Dimension of conformer encoder

  • num_layers (int, optional) – Number of conformer blocks

  • num_attention_heads (int, optional) – Number of attention heads

  • feed_forward_expansion_factor (int, optional) – Expansion factor of feed forward module

  • conv_expansion_factor (int, optional) – Expansion factor of conformer convolution module

  • input_dropout_p (float, optional) – Probability of input dropout

  • feed_forward_dropout_p (float, optional) – Probability of feed forward module dropout

  • attention_dropout_p (float, optional) – Probability of attention module dropout

  • conv_dropout_p (float, optional) – Probability of conformer convolution module dropout

  • conv_kernel_size (int or tuple, optional) – Size of the convolving kernel

  • half_step_residual (bool) – Flag indicating whether to use half-step residual connections

  • device (torch.device) – torch device (cuda or cpu)

Inputs: inputs, input_lengths
  • inputs (batch, time, dim): Tensor containing input vector

  • input_lengths (batch): list of sequence input lengths

Returns: outputs, output_lengths
  • outputs (batch, time, dim): Tensor produced by the conformer encoder.

  • output_lengths (batch): list of sequence output lengths

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs for encoder training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically, this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of input tensor. (batch)

Returns

(Tensor, Tensor)

  • outputs (torch.FloatTensor): An output sequence from the encoder. FloatTensor of size (batch, seq_length, dimension)

  • output_lengths (torch.LongTensor): The length of output tensor. (batch)
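A minimal sketch of running the encoder alone follows; the sizes and the CPU device are illustrative assumptions for testing.

    import torch
    from kospeech.models.conformer.encoder import ConformerEncoder

    # Illustrative sketch: encode random features on CPU.
    encoder = ConformerEncoder(input_dim=80, encoder_dim=144, num_layers=2,
                               num_attention_heads=4, device=torch.device('cpu'))
    inputs = torch.randn(2, 300, 80)              # (batch, time, dim)
    input_lengths = torch.LongTensor([300, 250])  # (batch)
    outputs, output_lengths = encoder(inputs, input_lengths)
    # The convolution subsampling front end shortens the time axis,
    # so output_lengths will be smaller than input_lengths.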

Decoder

class kospeech.models.rnnt.decoder.DecoderRNNT(num_classes: int, hidden_state_dim: int, output_dim: int, num_layers: int, rnn_type: str = 'lstm', sos_id: int = 1, eos_id: int = 2, dropout_p: float = 0.2)[source]

Decoder of RNN-Transducer

Parameters
  • num_classes (int) – number of classification

  • hidden_state_dim (int, optional) – hidden state dimension of decoder (default: 512)

  • output_dim (int, optional) – output dimension of encoder and decoder (default: 512)

  • num_layers (int, optional) – number of decoder layers (default: 1)

  • rnn_type (str, optional) – type of rnn cell (default: lstm)

  • sos_id (int, optional) – start of sentence identification

  • eos_id (int, optional) – end of sentence identification

  • dropout_p (float, optional) – dropout probability of decoder

Inputs: inputs, input_lengths, hidden_states
  • inputs (torch.LongTensor): A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • input_lengths (torch.LongTensor): The length of input tensor. (batch)

  • hidden_states (torch.FloatTensor): A previous hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)

Returns

  • decoder_outputs (torch.FloatTensor): An output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)

  • hidden_states (torch.FloatTensor): A hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)

Return type

(Tensor, Tensor)

forward(inputs: torch.Tensor, input_lengths: torch.Tensor = None, hidden_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs (targets) for training.

Parameters
  • inputs (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • input_lengths (torch.LongTensor) – The length of input tensor. (batch)

  • hidden_states (torch.FloatTensor) – A previous hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)

Returns

  • decoder_outputs (torch.FloatTensor): An output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)

  • hidden_states (torch.FloatTensor): A hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)

Return type

(Tensor, Tensor)
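The decoder can likewise be exercised in isolation. The token ids and dimensions below are illustrative assumptions; sos_id=1 and eos_id=2 follow the defaults documented above.

    import torch
    from kospeech.models.rnnt.decoder import DecoderRNNT

    # Illustrative sketch: run the prediction network on dummy token sequences.
    decoder = DecoderRNNT(num_classes=10, hidden_state_dim=320,
                          output_dim=320, num_layers=1)
    targets = torch.LongTensor([[1, 3, 4, 5, 2],
                                [1, 6, 7, 2, 0]])  # (batch, seq_length), 1=sos, 2=eos
    target_lengths = torch.LongTensor([5, 4])
    decoder_outputs, hidden_states = decoder(targets, target_lengths)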

Modules

class kospeech.models.conformer.modules.ConformerConvModule(in_channels: int, kernel_size: int = 31, expansion_factor: int = 2, dropout_p: float = 0.1, device: torch.device = 'cuda')[source]

Conformer convolution module starts with a pointwise convolution and a gated linear unit (GLU). This is followed by a single 1-D depthwise convolution layer. Batchnorm is deployed just after the convolution to aid training deep models.

Parameters
  • in_channels (int) – Number of channels in the input

  • kernel_size (int or tuple, optional) – Size of the convolving kernel. Default: 31

  • expansion_factor (int, optional) – Expansion factor of the convolution module. Default: 2

  • dropout_p (float, optional) – probability of dropout

  • device (torch.device) – torch device (cuda or cpu)

Inputs: inputs
  • inputs (batch, time, dim): Tensor containing input sequences

Outputs: outputs
  • outputs (batch, time, dim): Tensor produced by the conformer convolution module.
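A minimal sketch, assuming a CPU run and an arbitrary channel size of 144:

    import torch
    from kospeech.models.conformer.modules import ConformerConvModule

    # Illustrative sketch: the convolution module preserves the (batch, time, dim) shape.
    conv = ConformerConvModule(in_channels=144, device=torch.device('cpu'))
    inputs = torch.randn(2, 100, 144)
    outputs = conv(inputs)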

class kospeech.models.conformer.modules.FeedForwardModule(encoder_dim: int = 512, expansion_factor: int = 4, dropout_p: float = 0.1, device: torch.device = 'cuda')[source]

The Conformer Feed Forward Module follows the pre-norm residual unit scheme, applying layer normalization within the residual unit and on the input before the first linear layer. This module also applies Swish activation and dropout, which help regularize the network.

Parameters
  • encoder_dim (int) – Dimension of conformer encoder

  • expansion_factor (int) – Expansion factor of feed forward module.

  • dropout_p (float) – Ratio of dropout

  • device (torch.device) – torch device (cuda or cpu)

Inputs: inputs
  • inputs (batch, time, dim): Tensor containing input sequences

Outputs: outputs
  • outputs (batch, time, dim): Tensor produced by the feed forward module.
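A minimal sketch, again with illustrative dimensions and a CPU device:

    import torch
    from kospeech.models.conformer.modules import FeedForwardModule

    # Illustrative sketch: the module expands to encoder_dim * expansion_factor
    # internally and projects back to encoder_dim, so the shape is preserved.
    ff = FeedForwardModule(encoder_dim=144, expansion_factor=4,
                           device=torch.device('cpu'))
    inputs = torch.randn(2, 100, 144)
    outputs = ff(inputs)  # (batch, time, dim)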

class kospeech.models.conformer.modules.MultiHeadedSelfAttentionModule(d_model: int, num_heads: int, dropout_p: float = 0.1, device: torch.device = 'cuda')[source]

Conformer employs multi-headed self-attention (MHSA) while integrating an important technique from Transformer-XL: the relative sinusoidal positional encoding scheme. Relative positional encoding allows the self-attention module to generalize better to different input lengths, making the resulting encoder more robust to variance in utterance length. Conformer uses pre-norm residual units with dropout, which helps in training and regularizing deeper models.

Parameters
  • d_model (int) – The dimension of model

  • num_heads (int) – The number of attention heads.

  • dropout_p (float) – probability of dropout

  • device (torch.device) – torch device (cuda or cpu)

Inputs: inputs, mask
  • inputs (batch, time, dim): Tensor containing input vector

  • mask (batch, 1, time2) or (batch, time1, time2): Tensor containing indices to be masked

Returns

Tensor produced by the relative multi-headed self-attention module.

Return type

  • outputs (batch, time, dim)
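A minimal sketch, with illustrative dimensions and a CPU device. Calling the module without a mask is an assumption here; to mask padded positions, pass a tensor shaped (batch, 1, time2) or (batch, time1, time2) as documented above.

    import torch
    from kospeech.models.conformer.modules import MultiHeadedSelfAttentionModule

    # Illustrative sketch: relative MHSA over random features on CPU.
    # Omitting the mask is an assumption for this toy example.
    mhsa = MultiHeadedSelfAttentionModule(d_model=144, num_heads=4,
                                          device=torch.device('cpu'))
    inputs = torch.randn(2, 100, 144)
    outputs = mhsa(inputs)  # (batch, time, dim)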