Conformer

Conformer

class kospeech.models.conformer.model.Conformer(num_classes: int, input_dim: int = 80, encoder_dim: int = 512, decoder_dim: int = 640, num_encoder_layers: int = 17, num_decoder_layers: int = 1, decoder_rnn_type: str = 'lstm', num_attention_heads: int = 8, feed_forward_expansion_factor: int = 4, conv_expansion_factor: int = 2, input_dropout_p: float = 0.1, feed_forward_dropout_p: float = 0.1, attention_dropout_p: float = 0.1, conv_dropout_p: float = 0.1, decoder_dropout_p: float = 0.1, conv_kernel_size: int = 31, half_step_residual: bool = True, device: torch.device = 'cuda', decoder: str = None)[source]

Conformer: Convolution-augmented Transformer for Speech Recognition. The paper used a single-LSTM Transducer decoder; currently, only the Conformer encoder described in the paper is implemented.

Parameters
  • num_classes (int) – Number of classification classes

  • input_dim (int, optional) – Dimension of input vector

  • encoder_dim (int, optional) – Dimension of conformer encoder

  • decoder_dim (int, optional) – Dimension of conformer decoder

  • num_encoder_layers (int, optional) – Number of conformer blocks

  • num_decoder_layers (int, optional) – Number of decoder layers

  • decoder_rnn_type (str, optional) – type of RNN cell

  • num_attention_heads (int, optional) – Number of attention heads

  • feed_forward_expansion_factor (int, optional) – Expansion factor of feed forward module

  • conv_expansion_factor (int, optional) – Expansion factor of conformer convolution module

  • input_dropout_p (float, optional) – Probability of input dropout

  • feed_forward_dropout_p (float, optional) – Probability of feed forward module dropout

  • attention_dropout_p (float, optional) – Probability of attention module dropout

  • conv_dropout_p (float, optional) – Probability of conformer convolution module dropout

  • decoder_dropout_p (float, optional) – Probability of conformer decoder dropout

  • conv_kernel_size (int or tuple, optional) – Size of the convolving kernel

  • half_step_residual (bool) – Flag indicating whether to use half-step residual connections

  • device (torch.device) – torch device (cuda or cpu)

  • decoder (str) – If None, the model is trained with CTC decoding

Inputs: inputs, input_lengths
  • inputs (batch, time, dim): Tensor containing input vector

  • input_lengths (batch): list of sequence input lengths

Returns

Result of model predictions.

Return type

  • predictions (torch.FloatTensor)

decode(encoder_outputs: torch.Tensor, max_length: int = None) → torch.Tensor[source]

Decode encoder_outputs.

Parameters
  • encoder_outputs (torch.FloatTensor) – An output sequence from the encoder. FloatTensor of size (seq_length, dimension)

  • max_length (int) – max decoding time step

Returns

Log probability of model predictions.

Return type

  • predicted_log_probs (torch.FloatTensor)

forward(inputs: torch.Tensor, input_lengths: torch.Tensor, targets: torch.Tensor, target_lengths: torch.Tensor) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Forward propagate an (inputs, targets) pair for training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically, this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of input tensor. (batch)

  • targets (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • target_lengths (torch.LongTensor) – The length of target tensor. (batch)

Returns

Result of model predictions.

Return type

  • predictions (torch.FloatTensor)

recognize(inputs: torch.Tensor, input_lengths: torch.Tensor) → torch.Tensor[source]

Recognize input speech.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically, this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of input tensor. (batch)

Returns

Result of model predictions.

Return type

  • predictions (torch.FloatTensor)
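As a quick orientation, here is a minimal usage sketch based on the signatures documented above. The reduced encoder_dim, the CPU device, and the random features are illustrative assumptions for a quick test, not values from the paper.

    import torch
    from kospeech.models.conformer.model import Conformer

    # Illustrative sketch: a small Conformer run on CPU with random features.
    # encoder_dim=144, num_encoder_layers=2, and num_classes=10 are arbitrary test values.
    model = Conformer(
        num_classes=10,
        input_dim=80,
        encoder_dim=144,
        num_encoder_layers=2,
        num_attention_heads=4,
        device=torch.device('cpu'),
        decoder=None,              # None selects CTC decoding, per the docs above
    )

    inputs = torch.randn(4, 200, 80)                        # (batch, time, dim)
    input_lengths = torch.LongTensor([200, 180, 150, 120])  # (batch)

    predictions = model.recognize(inputs, input_lengths)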

Encoder

class kospeech.models.conformer.encoder.ConformerBlock(encoder_dim: int = 512, num_attention_heads: int = 8, feed_forward_expansion_factor: int = 4, conv_expansion_factor: int = 2, feed_forward_dropout_p: float = 0.1, attention_dropout_p: float = 0.1, conv_dropout_p: float = 0.1, conv_kernel_size: int = 31, half_step_residual: bool = True, device: torch.device = 'cuda')[source]

Conformer block contains two Feed Forward modules sandwiching the Multi-Headed Self-Attention module and the Convolution module. This sandwich structure is inspired by Macaron-Net, which proposes replacing the original feed-forward layer in the Transformer block with two half-step feed-forward layers, one before the attention layer and one after.

Parameters
  • encoder_dim (int, optional) – Dimension of conformer encoder

  • num_attention_heads (int, optional) – Number of attention heads

  • feed_forward_expansion_factor (int, optional) – Expansion factor of feed forward module

  • conv_expansion_factor (int, optional) – Expansion factor of conformer convolution module

  • feed_forward_dropout_p (float, optional) – Probability of feed forward module dropout

  • attention_dropout_p (float, optional) – Probability of attention module dropout

  • conv_dropout_p (float, optional) – Probability of conformer convolution module dropout

  • conv_kernel_size (int or tuple, optional) – Size of the convolving kernel

  • half_step_residual (bool) – Flag indicating whether to use half-step residual connections

  • device (torch.device) – torch device (cuda or cpu)

Inputs: inputs
  • inputs (batch, time, dim): Tensor containing input vector

Returns: outputs
  • outputs (batch, time, dim): Tensor produced by the conformer block.
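For intuition, a single block can be exercised on random features. The dimensions and the CPU device below are illustrative assumptions.

    import torch
    from kospeech.models.conformer.encoder import ConformerBlock

    # Illustrative sketch: one Conformer block on random input (CPU assumed).
    block = ConformerBlock(encoder_dim=144, num_attention_heads=4,
                           device=torch.device('cpu'))
    inputs = torch.randn(2, 100, 144)  # (batch, time, dim)
    outputs = block(inputs)            # same shape: (batch, time, dim)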

class kospeech.models.conformer.encoder.ConformerEncoder(input_dim: int = 80, encoder_dim: int = 512, num_layers: int = 17, num_attention_heads: int = 8, feed_forward_expansion_factor: int = 4, conv_expansion_factor: int = 2, input_dropout_p: float = 0.1, feed_forward_dropout_p: float = 0.1, attention_dropout_p: float = 0.1, conv_dropout_p: float = 0.1, conv_kernel_size: int = 31, half_step_residual: bool = True, device: torch.device = 'cuda')[source]

Conformer encoder first processes the input with a convolution subsampling layer and then with a number of conformer blocks.

Parameters
  • input_dim (int, optional) – Dimension of input vector

  • encoder_dim (int, optional) – Dimension of conformer encoder

  • num_layers (int, optional) – Number of conformer blocks

  • num_attention_heads (int, optional) – Number of attention heads

  • feed_forward_expansion_factor (int, optional) – Expansion factor of feed forward module

  • conv_expansion_factor (int, optional) – Expansion factor of conformer convolution module

  • input_dropout_p (float, optional) – Probability of input dropout

  • feed_forward_dropout_p (float, optional) – Probability of feed forward module dropout

  • attention_dropout_p (float, optional) – Probability of attention module dropout

  • conv_dropout_p (float, optional) – Probability of conformer convolution module dropout

  • conv_kernel_size (int or tuple, optional) – Size of the convolving kernel

  • half_step_residual (bool) – Flag indicating whether to use half-step residual connections

  • device (torch.device) – torch device (cuda or cpu)

Inputs: inputs, input_lengths
  • inputs (batch, time, dim): Tensor containing input vector

  • input_lengths (batch): list of sequence input lengths

Returns: outputs, output_lengths
  • outputs (batch, time, dim): Tensor produced by the conformer encoder.

  • output_lengths (batch): list of sequence output lengths

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs for encoder training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically, this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of input tensor. (batch)

Returns

(Tensor, Tensor)

  • outputs (torch.FloatTensor): An output sequence from the encoder. FloatTensor of size (batch, seq_length, dimension)

  • output_lengths (torch.LongTensor): The length of output tensor. (batch)
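A minimal sketch of running the encoder alone follows; the sizes and the CPU device are illustrative assumptions for testing.

    import torch
    from kospeech.models.conformer.encoder import ConformerEncoder

    # Illustrative sketch: encode random features on CPU.
    encoder = ConformerEncoder(input_dim=80, encoder_dim=144, num_layers=2,
                               num_attention_heads=4, device=torch.device('cpu'))
    inputs = torch.randn(2, 300, 80)              # (batch, time, dim)
    input_lengths = torch.LongTensor([300, 250])  # (batch)
    outputs, output_lengths = encoder(inputs, input_lengths)
    # The convolution subsampling front end shortens the time axis,
    # so output_lengths will be smaller than input_lengths.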

Decoder

class kospeech.models.rnnt.decoder.DecoderRNNT(num_classes: int, hidden_state_dim: int, output_dim: int, num_layers: int, rnn_type: str = 'lstm', sos_id: int = 1, eos_id: int = 2, dropout_p: float = 0.2)[source]

Decoder of RNN-Transducer

Parameters
  • num_classes (int) – number of classification

  • hidden_state_dim (int, optional) – hidden state dimension of decoder (default: 512)

  • output_dim (int, optional) – output dimension of encoder and decoder (default: 512)

  • num_layers (int, optional) – number of decoder layers (default: 1)

  • rnn_type (str, optional) – type of rnn cell (default: lstm)

  • sos_id (int, optional) – start of sentence identification

  • eos_id (int, optional) – end of sentence identification

  • dropout_p (float, optional) – dropout probability of decoder

Inputs: inputs, input_lengths, hidden_states
  • inputs (torch.LongTensor): A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • input_lengths (torch.LongTensor): The length of input tensor. (batch)

  • hidden_states (torch.FloatTensor): A previous hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)

Returns

  • decoder_outputs (torch.FloatTensor): An output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)

  • hidden_states (torch.FloatTensor): A hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)

Return type

(Tensor, Tensor)

forward(inputs: torch.Tensor, input_lengths: torch.Tensor = None, hidden_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs (targets) for training.

Parameters
  • inputs (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • input_lengths (torch.LongTensor) – The length of input tensor. (batch)

  • hidden_states (torch.FloatTensor) – A previous hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)

Returns

  • decoder_outputs (torch.FloatTensor): An output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)

  • hidden_states (torch.FloatTensor): A hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)

Return type

(Tensor, Tensor)
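The decoder can likewise be exercised in isolation. The token ids and dimensions below are illustrative assumptions; sos_id=1 and eos_id=2 follow the defaults documented above.

    import torch
    from kospeech.models.rnnt.decoder import DecoderRNNT

    # Illustrative sketch: run the prediction network on dummy token sequences.
    decoder = DecoderRNNT(num_classes=10, hidden_state_dim=320,
                          output_dim=320, num_layers=1)
    targets = torch.LongTensor([[1, 3, 4, 5, 2],
                                [1, 6, 7, 2, 0]])  # (batch, seq_length), 1=sos, 2=eos
    target_lengths = torch.LongTensor([5, 4])
    decoder_outputs, hidden_states = decoder(targets, target_lengths)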

Modules

class kospeech.models.conformer.modules.ConformerConvModule(in_channels: int, kernel_size: int = 31, expansion_factor: int = 2, dropout_p: float = 0.1, device: torch.device = 'cuda')[source]

Conformer convolution module starts with a pointwise convolution and a gated linear unit (GLU). This is followed by a single 1-D depthwise convolution layer. Batchnorm is deployed just after the convolution to aid training deep models.

Parameters
  • in_channels (int) – Number of channels in the input

  • kernel_size (int or tuple, optional) – Size of the convolving kernel. Default: 31

  • expansion_factor (int, optional) – Expansion factor of the convolution module. Default: 2

  • dropout_p (float, optional) – probability of dropout

  • device (torch.device) – torch device (cuda or cpu)

Inputs: inputs
  • inputs (batch, time, dim): Tensor containing input sequences

Outputs: outputs
  • outputs (batch, time, dim): Tensor produced by the conformer convolution module.
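A minimal sketch, assuming a CPU run and an arbitrary channel size of 144:

    import torch
    from kospeech.models.conformer.modules import ConformerConvModule

    # Illustrative sketch: the convolution module preserves the (batch, time, dim) shape.
    conv = ConformerConvModule(in_channels=144, device=torch.device('cpu'))
    inputs = torch.randn(2, 100, 144)
    outputs = conv(inputs)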

class kospeech.models.conformer.modules.FeedForwardModule(encoder_dim: int = 512, expansion_factor: int = 4, dropout_p: float = 0.1, device: torch.device = 'cuda')[source]

The Conformer Feed Forward Module follows the pre-norm residual unit scheme, applying layer normalization within the residual unit and on the input before the first linear layer. This module also applies Swish activation and dropout, which help regularize the network.

Parameters
  • encoder_dim (int) – Dimension of conformer encoder

  • expansion_factor (int) – Expansion factor of feed forward module.

  • dropout_p (float) – Ratio of dropout

  • device (torch.device) – torch device (cuda or cpu)

Inputs: inputs
  • inputs (batch, time, dim): Tensor containing input sequences

Outputs: outputs
  • outputs (batch, time, dim): Tensor produced by the feed forward module.
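A minimal sketch, again with illustrative dimensions and a CPU device:

    import torch
    from kospeech.models.conformer.modules import FeedForwardModule

    # Illustrative sketch: the module expands to encoder_dim * expansion_factor
    # internally and projects back to encoder_dim, so the shape is preserved.
    ff = FeedForwardModule(encoder_dim=144, expansion_factor=4,
                           device=torch.device('cpu'))
    inputs = torch.randn(2, 100, 144)
    outputs = ff(inputs)  # (batch, time, dim)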

class kospeech.models.conformer.modules.MultiHeadedSelfAttentionModule(d_model: int, num_heads: int, dropout_p: float = 0.1, device: torch.device = 'cuda')[source]

Conformer employs multi-headed self-attention (MHSA) while integrating an important technique from Transformer-XL: the relative sinusoidal positional encoding scheme. Relative positional encoding allows the self-attention module to generalize better to different input lengths, making the resulting encoder more robust to variance in utterance length. Conformer uses pre-norm residual units with dropout, which helps in training and regularizing deeper models.

Parameters
  • d_model (int) – The dimension of model

  • num_heads (int) – The number of attention heads.

  • dropout_p (float) – probability of dropout

  • device (torch.device) – torch device (cuda or cpu)

Inputs: inputs, mask
  • inputs (batch, time, dim): Tensor containing input vector

  • mask (batch, 1, time2) or (batch, time1, time2): Tensor containing indices to be masked

Returns

Tensor produced by the relative multi-headed self-attention module.

Return type

  • outputs (batch, time, dim)
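A minimal sketch, with illustrative dimensions and a CPU device. Calling the module without a mask is an assumption here; to mask padded positions, pass a tensor shaped (batch, 1, time2) or (batch, time1, time2) as documented above.

    import torch
    from kospeech.models.conformer.modules import MultiHeadedSelfAttentionModule

    # Illustrative sketch: relative MHSA over random features on CPU.
    # Omitting the mask is an assumption for this toy example.
    mhsa = MultiHeadedSelfAttentionModule(d_model=144, num_heads=4,
                                          device=torch.device('cpu'))
    inputs = torch.randn(2, 100, 144)
    outputs = mhsa(inputs)  # (batch, time, dim)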