Modules

Activation

class kospeech.models.activation.GLU(dim: int)[source]

The gating mechanism is called Gated Linear Units (GLU). It was first introduced for natural language processing in the paper “Language Modeling with Gated Convolutional Networks”.
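
Example: a minimal usage sketch, assuming the standard GLU behavior of splitting the input in half along dim and gating one half with the sigmoid of the other. Tensor shapes are illustrative.

    import torch
    from kospeech.models.activation import GLU

    glu = GLU(dim=1)
    inputs = torch.randn(4, 16, 100)   # (batch, channels, time)
    outputs = glu(inputs)              # assumed (batch, channels // 2, time), i.e. a * sigmoid(b)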

class kospeech.models.activation.Swish[source]

Swish is a smooth, non-monotonic function that consistently matches or outperforms ReLU on deep networks applied to a variety of challenging domains such as image classification and machine translation.
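
Example: a minimal usage sketch. Swish is commonly defined as x * sigmoid(x); this example assumes that elementwise form.

    import torch
    from kospeech.models.activation import Swish

    swish = Swish()
    x = torch.randn(4, 512)
    y = swish(x)    # same shape as x; typically equal to x * torch.sigmoid(x)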

Attention

class kospeech.models.attention.AdditiveAttention(dim: int)[source]

Applies an additive (Bahdanau) attention mechanism to the output features from the decoder. Additive attention was proposed in the “Neural Machine Translation by Jointly Learning to Align and Translate” paper.

Parameters

dim (int) – dimension of model

Inputs: query, key, value
  • query (batch_size, q_len, hidden_dim): tensor containing the output features from the decoder.

  • key (batch, k_len, d_model): tensor containing projection vector for encoder.

  • value (batch_size, v_len, hidden_dim): tensor containing features of the encoded input sequence.

Returns: context, attn
  • context: tensor containing the context vector from attention mechanism.

  • attn: tensor containing the alignment from the encoder outputs.
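
Example: a minimal usage sketch based on the documented signature and input shapes; the argument order is assumed to follow the Inputs listing above.

    import torch
    from kospeech.models.attention import AdditiveAttention

    batch_size, q_len, k_len, hidden_dim = 4, 1, 50, 512
    attention = AdditiveAttention(dim=hidden_dim)

    query = torch.randn(batch_size, q_len, hidden_dim)   # decoder output features
    key = torch.randn(batch_size, k_len, hidden_dim)     # encoder projections
    value = torch.randn(batch_size, k_len, hidden_dim)   # encoder features

    context, attn = attention(query, key, value)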

class kospeech.models.attention.LocationAwareAttention(dim: int = 1024, attn_dim: int = 1024, smoothing: bool = False)[source]

Applies a location-aware attention mechanism to the output features from the decoder. Location-aware attention was proposed in the “Attention-Based Models for Speech Recognition” paper. The location-aware attention mechanism performs well in speech recognition tasks. This implementation follows the ClovaCall attention style.

Parameters
  • dim (int) – dimension of model

  • attn_dim (int) – dimension of attention

  • smoothing (bool) – flag indicating whether to use smoothing.

Inputs: query, value, last_attn
  • query (batch, q_len, hidden_dim): tensor containing the output features from the decoder.

  • value (batch, v_len, hidden_dim): tensor containing features of the encoded input sequence.

  • last_attn (batch_size, v_len): tensor containing the previous timestep's attention (alignment)

Returns: output, attn
  • output (batch, output_len, dimensions): tensor containing the attended features from the encoder outputs

  • attn (batch, v_len): tensor containing the attention (alignment) from the encoder outputs.
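
Example: a minimal usage sketch based on the documented signature. Feeding a zero tensor as the initial last_attn is an assumption; the returned attn can be fed back as last_attn at the next decoding step.

    import torch
    from kospeech.models.attention import LocationAwareAttention

    batch_size, v_len, hidden_dim = 4, 50, 1024
    attention = LocationAwareAttention(dim=hidden_dim, attn_dim=hidden_dim, smoothing=False)

    query = torch.randn(batch_size, 1, hidden_dim)       # decoder output at the current step
    value = torch.randn(batch_size, v_len, hidden_dim)   # encoder outputs
    last_attn = torch.zeros(batch_size, v_len)           # previous step's alignment (assumed initialization)

    output, attn = attention(query, value, last_attn)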

class kospeech.models.attention.MultiHeadAttention(dim: int = 512, num_heads: int = 8)[source]

Multi-Head Attention proposed in “Attention Is All You Need”. Instead of performing a single attention function with d_model-dimensional keys, values, and queries, the queries, keys, and values are projected h times with different, learned linear projections to d_head dimensions. These are concatenated and once again projected, resulting in the final values. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

MultiHead(Q, K, V) = Concat(head_1, …, head_h) · W_o

where head_i = Attention(Q · W_q, K · W_k, V · W_v)

Parameters
  • dim (int) – The dimension of model (default: 512)

  • num_heads (int) – The number of attention heads. (default: 8)

Inputs: query, key, value, mask
  • query (batch, q_len, d_model): tensor containing projection vector for decoder.

  • key (batch, k_len, d_model): tensor containing projection vector for encoder.

  • value (batch, v_len, d_model): tensor containing features of the encoded input sequence.

  • mask (-): tensor containing indices to be masked

Returns: output, attn
  • output (batch, output_len, dimensions): tensor containing the attended output features.

  • attn (batch * num_heads, v_len): tensor containing the attention (alignment) from the encoder outputs.
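
Example: a minimal usage sketch based on the documented signature. Passing mask=None is an assumption; a tensor of indices to be masked can be supplied to exclude padded positions.

    import torch
    from kospeech.models.attention import MultiHeadAttention

    batch_size, q_len, k_len, d_model = 4, 10, 50, 512
    attention = MultiHeadAttention(dim=d_model, num_heads=8)

    query = torch.randn(batch_size, q_len, d_model)   # decoder projections
    key = torch.randn(batch_size, k_len, d_model)     # encoder projections
    value = torch.randn(batch_size, k_len, d_model)   # encoder features

    output, attn = attention(query, key, value, mask=None)   # mask=None assumed to be accepted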

class kospeech.models.attention.RelativeMultiHeadAttention(dim: int = 512, num_heads: int = 16, dropout_p: float = 0.1)[source]

Multi-head attention with relative positional encoding. This concept was proposed in the “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context” paper.

Parameters
  • dim (int) – The dimension of model

  • num_heads (int) – The number of attention heads.

  • dropout_p (float) – probability of dropout

Inputs: query, key, value, pos_embedding, mask
  • query (batch, time, dim): Tensor containing query vector

  • key (batch, time, dim): Tensor containing key vector

  • value (batch, time, dim): Tensor containing value vector

  • pos_embedding (batch, time, dim): Positional embedding tensor

  • mask (batch, 1, time2) or (batch, time1, time2): Tensor containing indices to be masked

Returns

Tensor produced by the relative multi-head attention module.

Return type

  • outputs
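
Example: a minimal usage sketch based on the documented signature. The self-attention setting (query, key, and value of the same length) and mask=None are assumptions.

    import torch
    from kospeech.models.attention import RelativeMultiHeadAttention

    batch_size, seq_len, dim = 4, 50, 512
    attention = RelativeMultiHeadAttention(dim=dim, num_heads=16, dropout_p=0.1)

    query = torch.randn(batch_size, seq_len, dim)
    key = torch.randn(batch_size, seq_len, dim)
    value = torch.randn(batch_size, seq_len, dim)
    pos_embedding = torch.randn(batch_size, seq_len, dim)   # relative positional embedding

    outputs = attention(query, key, value, pos_embedding, mask=None)   # mask=None assumed to be accepted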

class kospeech.models.attention.ScaledDotProductAttention(dim: int, scale: bool = True)[source]

Scaled Dot-Product Attention proposed in “Attention Is All You Need”. Computes the dot products of the query with all keys, divides each by sqrt(dim), and applies a softmax function to obtain the weights on the values.

Args: dim, mask
  • dim (int): dimension of attention

  • mask (torch.Tensor): tensor containing indices to be masked

Inputs: query, key, value, mask
  • query (batch, q_len, d_model): tensor containing projection vector for decoder.

  • key (batch, k_len, d_model): tensor containing projection vector for encoder.

  • value (batch, v_len, d_model): tensor containing features of the encoded input sequence.

  • mask (-): tensor containing indices to be masked

Returns: context, attn
  • context: tensor containing the context vector from attention mechanism.

  • attn: tensor containing the attention (alignment) from the encoder outputs.
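
Example: a minimal usage sketch based on the documented signature. Omitting mask is an assumption; the computation corresponds to softmax(Q · K^T / sqrt(dim)) · V.

    import torch
    from kospeech.models.attention import ScaledDotProductAttention

    batch_size, q_len, k_len, d_model = 4, 10, 50, 512
    attention = ScaledDotProductAttention(dim=d_model)

    query = torch.randn(batch_size, q_len, d_model)
    key = torch.randn(batch_size, k_len, d_model)
    value = torch.randn(batch_size, k_len, d_model)

    context, attn = attention(query, key, value)   # mask omitted, assumed optional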

Convolution

class kospeech.models.convolution.Conv2dExtractor(input_dim: int, activation: str = 'hardtanh')[source]

Provides an interface for convolutional extractors.

Note

Do not use this class directly; use one of its subclasses. Subclasses must define the ‘self.conv’ class variable.

Inputs: inputs, input_lengths
  • inputs (batch, time, dim): Tensor containing input vectors

  • input_lengths: Tensor containing sequence lengths

Returns: outputs, output_lengths
  • outputs: Tensor produced by the convolution

  • output_lengths: Tensor containing sequence lengths produced by the convolution

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

inputs: torch.FloatTensor (batch, time, dimension)
input_lengths: torch.IntTensor (batch)

class kospeech.models.convolution.Conv2dSubsampling(input_dim: int, in_channels: int, out_channels: int, activation: str = 'relu')[source]

Convolutional 2D subsampling (to 1/4 length)

Parameters
  • input_dim (int) – Dimension of input vector

  • in_channels (int) – Number of channels in the input vector

  • out_channels (int) – Number of channels produced by the convolution

  • activation (str) – Activation function

Inputs: inputs, input_lengths
  • inputs (batch, time, dim): Tensor containing sequence of inputs

  • input_lengths (batch): list of sequence input lengths

Returns: outputs, output_lengths
  • outputs (batch, time, dim): Tensor produced by the convolution

  • output_lengths (batch): list of sequence output lengths

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

inputs: torch.FloatTensor (batch, time, dimension)
input_lengths: torch.IntTensor (batch)
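
Example: a minimal usage sketch based on the documented signature; the feature dimension of 80 and the sequence lengths are illustrative. The time axis is reduced to roughly a quarter of its original length.

    import torch
    from kospeech.models.convolution import Conv2dSubsampling

    batch_size, time, input_dim = 4, 400, 80
    subsampling = Conv2dSubsampling(input_dim=input_dim, in_channels=1, out_channels=256)

    inputs = torch.randn(batch_size, time, input_dim)
    input_lengths = torch.IntTensor([400, 380, 350, 300])

    outputs, output_lengths = subsampling(inputs, input_lengths)   # time reduced to ~1/4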

class kospeech.models.convolution.DeepSpeech2Extractor(input_dim: int, in_channels: int = 1, out_channels: int = 32, activation: str = 'hardtanh')[source]

DeepSpeech2 extractor for automatic speech recognition described in “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin” paper - https://arxiv.org/abs/1512.02595

Parameters
  • input_dim (int) – Dimension of input vector

  • in_channels (int) – Number of channels in the input vector

  • out_channels (int) – Number of channels produced by the convolution

  • activation (str) – Activation function

Inputs: inputs, input_lengths
  • inputs (batch, time, dim): Tensor containing input vectors

  • input_lengths: Tensor containing sequence lengths

Returns: outputs, output_lengths
  • outputs: Tensor produced by the convolution

  • output_lengths: Tensor containing sequence lengths produced by the convolution

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

inputs: torch.FloatTensor (batch, time, dimension)
input_lengths: torch.IntTensor (batch)
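
Example: a minimal usage sketch based on the documented signature; shapes and lengths are illustrative.

    import torch
    from kospeech.models.convolution import DeepSpeech2Extractor

    batch_size, time, input_dim = 4, 400, 80
    extractor = DeepSpeech2Extractor(input_dim=input_dim, in_channels=1, out_channels=32)

    inputs = torch.randn(batch_size, time, input_dim)
    input_lengths = torch.IntTensor([400, 380, 350, 300])

    outputs, output_lengths = extractor(inputs, input_lengths)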

class kospeech.models.convolution.DepthwiseConv1d(in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, padding: int = 0, bias: bool = False)[source]

When groups == in_channels and out_channels == K * in_channels, where K is a positive integer, this operation is termed in the literature as a depthwise convolution.

Parameters
  • in_channels (int) – Number of channels in the input

  • out_channels (int) – Number of channels produced by the convolution

  • kernel_size (int or tuple) – Size of the convolving kernel

  • stride (int, optional) – Stride of the convolution. Default: 1

  • padding (int or tuple, optional) – Zero-padding added to both sides of the input. Default: 0

  • bias (bool, optional) – If True, adds a learnable bias to the output. Default: False

Inputs: inputs
  • inputs (batch, in_channels, time): Tensor containing input vector

Returns: outputs
  • outputs (batch, out_channels, time): Tensor produced by the depthwise 1-D convolution.
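
Example: a minimal usage sketch based on the documented signature; the channel count, kernel size, and padding are illustrative.

    import torch
    from kospeech.models.convolution import DepthwiseConv1d

    batch_size, in_channels, time = 4, 64, 100
    # Each input channel is convolved with its own filters (groups == in_channels);
    # out_channels is a multiple K of in_channels (K = 1 here).
    conv = DepthwiseConv1d(in_channels=in_channels, out_channels=in_channels,
                           kernel_size=3, padding=1)

    inputs = torch.randn(batch_size, in_channels, time)
    outputs = conv(inputs)    # (batch, out_channels, time)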

class kospeech.models.convolution.MaskCNN(sequential: torch.nn.modules.container.Sequential)[source]

Masking Convolutional Neural Network

Adds padding to the output of the module based on the given lengths. This is to ensure that the results of the model do not change when batch sizes change during inference. Input needs to be in the shape of (batch_size, channel, hidden_dim, seq_len)

Refer to https://github.com/SeanNaren/deepspeech.pytorch/blob/master/model.py (Copyright (c) 2017 Sean Naren, MIT License).

Parameters

sequential (torch.nn.Sequential) – sequential container of convolution layers

Inputs: inputs, seq_lengths
  • inputs (torch.FloatTensor): The input of size BxCxHxT

  • seq_lengths (torch.IntTensor): The actual length of each sequence in the batch

Returns: output, seq_lengths
  • output: Masked output from the sequential

  • seq_lengths: Sequence length of output from the sequential
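
Example: a minimal usage sketch. The wrapped nn.Sequential below is an illustrative stand-in, not the configuration used by any particular model; the input shape follows the documented (batch, channel, hidden_dim, seq_len) layout.

    import torch
    import torch.nn as nn
    from kospeech.models.convolution import MaskCNN

    conv = MaskCNN(nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
        nn.Hardtanh(0, 20, inplace=True),
    ))

    inputs = torch.randn(4, 1, 80, 400)                 # (batch, channel, hidden_dim, seq_len)
    seq_lengths = torch.IntTensor([400, 380, 350, 300])

    output, output_lengths = conv(inputs, seq_lengths)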

class kospeech.models.convolution.MaskConv1d(in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, padding: int = 0, dilation: int = 1, groups: int = 1, bias: bool = False)[source]

1D convolution with masking

Parameters
  • in_channels (int) – Number of channels in the input vector

  • out_channels (int) – Number of channels produced by the convolution

  • kernel_size (int or tuple) – Size of the convolving kernel

  • stride (int) – Stride of the convolution. Default: 1

  • padding (int) – Zero-padding added to both sides of the input. Default: 0

  • dilation (int) – Spacing between kernel elements. Default: 1

  • groups (int) – Number of blocked connections from input channels to output channels. Default: 1

  • bias (bool) – If True, adds a learnable bias to the output. Default: False

Inputs: inputs, seq_lengths
  • inputs (torch.FloatTensor): The input of size (batch, dimension, time)

  • seq_lengths (torch.IntTensor): The actual length of each sequence in the batch

Returns: output, seq_lengths
  • output: Masked output from the conv1d

  • seq_lengths: Sequence length of output from the conv1d

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

inputs: (batch, dimension, time)
input_lengths: (batch)

class kospeech.models.convolution.PointwiseConv1d(in_channels: int, out_channels: int, stride: int = 1, padding: int = 0, bias: bool = True)[source]

A conv1d with kernel size == 1 is termed in the literature as a pointwise convolution. This operation is often used to match dimensions.

Parameters
  • in_channels (int) – Number of channels in the input

  • out_channels (int) – Number of channels produced by the convolution

  • stride (int, optional) – Stride of the convolution. Default: 1

  • padding (int or tuple, optional) – Zero-padding added to both sides of the input. Default: 0

  • bias (bool, optional) – If True, adds a learnable bias to the output. Default: True

Inputs: inputs
  • inputs (batch, in_channels, time): Tensor containing input vector

Returns: outputs
  • outputs (batch, out_channels, time): Tensor produced by the pointwise 1-D convolution.
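
Example: a minimal usage sketch showing the typical use of a pointwise convolution to change the channel dimension; channel counts are illustrative.

    import torch
    from kospeech.models.convolution import PointwiseConv1d

    batch_size, in_channels, out_channels, time = 4, 64, 128, 100
    conv = PointwiseConv1d(in_channels=in_channels, out_channels=out_channels)

    inputs = torch.randn(batch_size, in_channels, time)
    outputs = conv(inputs)    # (batch, out_channels, time): channels remapped, time unchanged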

class kospeech.models.convolution.VGGExtractor(input_dim: int, in_channels: int = 1, out_channels: tuple = (64, 128), activation: str = 'hardtanh')[source]

VGG extractor for automatic speech recognition described in “Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM” paper - https://arxiv.org/pdf/1706.02737.pdf

Parameters
  • input_dim (int) – Dimension of input vector

  • in_channels (int) – Number of channels in the input image

  • out_channels (int or tuple) – Number of channels produced by the convolution

  • activation (str) – Activation function

Inputs: inputs, input_lengths
  • inputs (batch, time, dim): Tensor containing input vectors

  • input_lengths: Tensor containing sequence lengths

Returns: outputs, output_lengths
  • outputs: Tensor produced by the convolution

  • output_lengths: Tensor containing sequence lengths produced by the convolution

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

inputs: torch.FloatTensor (batch, time, dimension)
input_lengths: torch.IntTensor (batch)
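
Example: a minimal usage sketch based on the documented signature; shapes and lengths are illustrative.

    import torch
    from kospeech.models.convolution import VGGExtractor

    batch_size, time, input_dim = 4, 400, 80
    extractor = VGGExtractor(input_dim=input_dim, in_channels=1, out_channels=(64, 128))

    inputs = torch.randn(batch_size, time, input_dim)
    input_lengths = torch.IntTensor([400, 380, 350, 300])

    outputs, output_lengths = extractor(inputs, input_lengths)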