Modules
Activation
Attention
class kospeech.models.attention.AdditiveAttention(dim: int)
Applies an additive (Bahdanau) attention mechanism on the output features from the decoder. Additive attention was proposed in the “Neural Machine Translation by Jointly Learning to Align and Translate” paper.
- Parameters
dim (int) – dimension of model
- Inputs: query, key, value
query (batch_size, q_len, hidden_dim): tensor containing the output features from the decoder.
key (batch, k_len, d_model): tensor containing projection vector for encoder.
value (batch_size, v_len, hidden_dim): tensor containing features of the encoded input sequence.
- Returns: context, attn
context: tensor containing the context vector from attention mechanism.
attn: tensor containing the alignment from the encoder outputs.
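A minimal usage sketch consistent with the constructor signature and the Inputs/Returns above; the batch size, sequence lengths, and dimension below are illustrative assumptions:
>>> import torch
>>> from kospeech.models.attention import AdditiveAttention
>>> attention = AdditiveAttention(dim=512)
>>> query = torch.randn(4, 1, 512)    # decoder output features (batch_size, q_len, hidden_dim)
>>> key = torch.randn(4, 120, 512)    # encoder projection vectors (batch, k_len, d_model)
>>> value = torch.randn(4, 120, 512)  # encoded input features (batch_size, v_len, hidden_dim)
>>> context, attn = attention(query, key, value)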
class kospeech.models.attention.LocationAwareAttention(dim: int = 1024, attn_dim: int = 1024, smoothing: bool = False)
Applies a location-aware attention mechanism on the output features from the decoder. Location-aware attention was proposed in the “Attention-Based Models for Speech Recognition” paper. The location-aware attention mechanism performs well in speech recognition tasks. This implementation follows the attention style of ClovaCall.
- Parameters
dim (int) – dimension of model
attn_dim (int) – dimension of attention
smoothing (bool) – flag indicating whether to apply smoothing to the attention weights
- Inputs: query, value, last_attn
query (batch, q_len, hidden_dim): tensor containing the output features from the decoder.
value (batch, v_len, hidden_dim): tensor containing features of the encoded input sequence.
last_attn (batch_size, v_len): tensor containing previous timestep's attention (alignment)
- Returns: output, attn
output (batch, output_len, dimensions): tensor containing the feature from encoder outputs
attn (batch * num_heads, v_len): tensor containing the attention (alignment) from the encoder outputs.
- Reference:
Attention-Based Models for Speech Recognition: https://arxiv.org/abs/1506.07503
ClovaCall: https://github.com/clovaai/ClovaCall/blob/master/las.pytorch/models/attention.py
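A minimal usage sketch based on the Inputs above; the shapes are illustrative, and feeding an all-zero last_attn on the first decoding step is an assumption about this implementation:
>>> import torch
>>> from kospeech.models.attention import LocationAwareAttention
>>> attention = LocationAwareAttention(dim=512, attn_dim=512, smoothing=False)
>>> query = torch.randn(4, 1, 512)    # decoder output features (batch, q_len, hidden_dim)
>>> value = torch.randn(4, 120, 512)  # encoded input features (batch, v_len, hidden_dim)
>>> last_attn = torch.zeros(4, 120)   # previous timestep's alignment (batch_size, v_len); zeros at step 0 (assumption)
>>> output, attn = attention(query, value, last_attn)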
class kospeech.models.attention.MultiHeadAttention(dim: int = 512, num_heads: int = 8)
Multi-head attention proposed in the “Attention Is All You Need” paper. Instead of performing a single attention function with d_model-dimensional keys, values, and queries, the queries, keys, and values are projected h times with different, learned linear projections to d_head dimensions. Attention is applied to each projection, and the results are concatenated and once again projected, resulting in the final values. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
- MultiHead(Q, K, V) = Concat(head_1, …, head_h) · W_o
where head_i = Attention(Q · W_q, K · W_k, V · W_v)
- Parameters
dim (int) – dimension of model (default: 512)
num_heads (int) – number of attention heads (default: 8)
- Inputs: query, key, value, mask
query (batch, q_len, d_model): tensor containing projection vector for decoder.
key (batch, k_len, d_model): tensor containing projection vector for encoder.
value (batch, v_len, d_model): tensor containing features of the encoded input sequence.
mask (-): tensor containing indices to be masked
- Returns: output, attn
output (batch, output_len, dimensions): tensor containing the attended output features.
attn (batch * num_heads, v_len): tensor containing the attention (alignment) from the encoder outputs.
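A minimal usage sketch based on the Inputs/Returns above; the shapes are illustrative, and the mask is omitted on the assumption that it is an optional argument:
>>> import torch
>>> from kospeech.models.attention import MultiHeadAttention
>>> attention = MultiHeadAttention(dim=512, num_heads=8)
>>> query = torch.randn(4, 30, 512)   # decoder projections (batch, q_len, d_model)
>>> key = torch.randn(4, 120, 512)    # encoder projections (batch, k_len, d_model)
>>> value = torch.randn(4, 120, 512)  # encoded input features (batch, v_len, d_model)
>>> output, attn = attention(query, key, value)  # mask omitted (assumed optional)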
class kospeech.models.attention.RelativeMultiHeadAttention(dim: int = 512, num_heads: int = 16, dropout_p: float = 0.1)
Multi-head attention with relative positional encoding. This concept was proposed in the “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context” paper.
- Parameters
dim (int) – dimension of model (default: 512)
num_heads (int) – number of attention heads (default: 16)
dropout_p (float) – probability of dropout (default: 0.1)
- Inputs: query, key, value, pos_embedding, mask
query (batch, time, dim): Tensor containing query vector
key (batch, time, dim): Tensor containing key vector
value (batch, time, dim): Tensor containing value vector
pos_embedding (batch, time, dim): Positional embedding tensor
mask (batch, 1, time2) or (batch, time1, time2): Tensor containing indices to be masked
- Returns
Tensor produced by the relative multi-head attention module.
- Return type
outputs
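A minimal usage sketch based on the Inputs/Returns above; the shapes are illustrative, and the mask is omitted on the assumption that it is an optional argument:
>>> import torch
>>> from kospeech.models.attention import RelativeMultiHeadAttention
>>> attention = RelativeMultiHeadAttention(dim=512, num_heads=16, dropout_p=0.1)
>>> batch, time, dim = 4, 120, 512
>>> query = torch.randn(batch, time, dim)
>>> key = torch.randn(batch, time, dim)
>>> value = torch.randn(batch, time, dim)
>>> pos_embedding = torch.randn(batch, time, dim)  # relative positional embedding (batch, time, dim)
>>> outputs = attention(query, key, value, pos_embedding)  # mask omitted (assumed optional)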
class kospeech.models.attention.ScaledDotProductAttention(dim: int, scale: bool = True)
Scaled dot-product attention proposed in the “Attention Is All You Need” paper. Compute the dot products of the query with all keys, divide each by sqrt(dim), and apply a softmax function to obtain the weights on the values.
- Args: dim, mask
dim (int): dimension of attention
mask (torch.Tensor): tensor containing indices to be masked
- Inputs: query, key, value, mask
query (batch, q_len, d_model): tensor containing projection vector for decoder.
key (batch, k_len, d_model): tensor containing projection vector for encoder.
value (batch, v_len, d_model): tensor containing features of the encoded input sequence.
mask (-): tensor containing indices to be masked
- Returns: context, attn
context: tensor containing the context vector from attention mechanism.
attn: tensor containing the attention (alignment) from the encoder outputs.
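A minimal usage sketch based on the Inputs/Returns above; the shapes are illustrative, and the mask is omitted on the assumption that it is an optional argument:
>>> import torch
>>> from kospeech.models.attention import ScaledDotProductAttention
>>> attention = ScaledDotProductAttention(dim=512)
>>> query = torch.randn(4, 30, 512)   # decoder projections (batch, q_len, d_model)
>>> key = torch.randn(4, 120, 512)    # encoder projections (batch, k_len, d_model)
>>> value = torch.randn(4, 120, 512)  # encoded input features (batch, v_len, d_model)
>>> context, attn = attention(query, key, value)  # mask omitted (assumed optional)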
Convolution
class kospeech.models.convolution.Conv2dExtractor(input_dim: int, activation: str = 'hardtanh')
Provides an interface for convolutional extractors.
Note
Do not use this class directly; use one of its subclasses. Subclasses must define the ‘self.conv’ class variable.
- Inputs: inputs, input_lengths
inputs (batch, time, dim): Tensor containing input vectors
input_lengths: Tensor containing sequence lengths
- Returns: outputs, output_lengths
outputs: Tensor produced by the convolution
output_lengths: Tensor containing sequence lengths produced by the convolution
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
inputs: torch.FloatTensor (batch, time, dimension)
input_lengths: torch.IntTensor (batch)
class kospeech.models.convolution.Conv2dSubsampling(input_dim: int, in_channels: int, out_channels: int, activation: str = 'relu')
Convolutional 2D subsampling (to 1/4 length).
- Parameters
input_dim (int) – dimension of input vector
in_channels (int) – number of channels in the input
out_channels (int) – number of channels produced by the convolution
activation (str) – activation function to use (default: 'relu')
- Inputs: inputs, input_lengths
inputs (batch, time, dim): Tensor containing sequence of inputs
input_lengths (batch): list of sequence input lengths
- Returns: outputs, output_lengths
outputs (batch, time, dim): Tensor produced by the convolution
output_lengths (batch): list of sequence output lengths
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
inputs: torch.FloatTensor (batch, time, dimension)
input_lengths: torch.IntTensor (batch)
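A minimal usage sketch based on the signature and Inputs above; the feature dimension, batch size, and lengths are illustrative assumptions:
>>> import torch
>>> from kospeech.models.convolution import Conv2dSubsampling
>>> subsampling = Conv2dSubsampling(input_dim=80, in_channels=1, out_channels=256)
>>> inputs = torch.randn(4, 400, 80)                       # spectrogram features (batch, time, dim)
>>> input_lengths = torch.IntTensor([400, 380, 350, 300])  # per-utterance lengths (batch)
>>> outputs, output_lengths = subsampling(inputs, input_lengths)  # time axis reduced to roughly 1/4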
class kospeech.models.convolution.DeepSpeech2Extractor(input_dim: int, in_channels: int = 1, out_channels: int = 32, activation: str = 'hardtanh')
DeepSpeech2 extractor for automatic speech recognition, described in the “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin” paper - https://arxiv.org/abs/1512.02595
- Parameters
input_dim (int) – dimension of input vector
in_channels (int) – number of channels in the input (default: 1)
out_channels (int) – number of channels produced by the convolution (default: 32)
activation (str) – activation function to use (default: 'hardtanh')
- Inputs: inputs, input_lengths
inputs (batch, time, dim): Tensor containing input vectors
input_lengths: Tensor containing sequence lengths
- Returns: outputs, output_lengths
outputs: Tensor produced by the convolution
output_lengths: Tensor containing sequence lengths produced by the convolution
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
inputs: torch.FloatTensor (batch, time, dimension)
input_lengths: torch.IntTensor (batch)
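A minimal usage sketch based on the signature and Inputs above; the feature dimension, batch size, and lengths are illustrative assumptions:
>>> import torch
>>> from kospeech.models.convolution import DeepSpeech2Extractor
>>> extractor = DeepSpeech2Extractor(input_dim=80, in_channels=1, out_channels=32, activation='hardtanh')
>>> inputs = torch.randn(4, 400, 80)                       # (batch, time, dim)
>>> input_lengths = torch.IntTensor([400, 380, 350, 300])  # per-utterance lengths (batch)
>>> outputs, output_lengths = extractor(inputs, input_lengths)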
class kospeech.models.convolution.DepthwiseConv1d(in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, padding: int = 0, bias: bool = False)
When groups == in_channels and out_channels == K * in_channels, where K is a positive integer, this operation is termed in the literature as a depthwise convolution.
- Parameters
in_channels (int) – Number of channels in the input
out_channels (int) – Number of channels produced by the convolution
kernel_size (int) – Size of the convolving kernel
stride (int, optional) – Stride of the convolution. Default: 1
padding (int or tuple, optional) – Zero-padding added to both sides of the input. Default: 0
bias (bool, optional) – If True, adds a learnable bias to the output. Default: False
- Inputs: inputs
inputs (batch, in_channels, time): Tensor containing input vector
- Returns: outputs
outputs (batch, out_channels, time): Tensor produced by depthwise 1-D convolution.
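A minimal usage sketch based on the signature and Inputs above; channel counts and lengths are illustrative assumptions:
>>> import torch
>>> from kospeech.models.convolution import DepthwiseConv1d
>>> # out_channels is a multiple of in_channels, as required for a depthwise convolution
>>> conv = DepthwiseConv1d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
>>> inputs = torch.randn(4, 64, 120)   # (batch, in_channels, time)
>>> outputs = conv(inputs)             # (batch, out_channels, time)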
class kospeech.models.convolution.MaskCNN(sequential: torch.nn.modules.container.Sequential)
Masking Convolutional Neural Network.
Adds padding to the output of the module based on the given lengths. This is to ensure that the results of the model do not change when batch sizes change during inference. Input needs to be in the shape of (batch_size, channel, hidden_dim, seq_len).
Refer to https://github.com/SeanNaren/deepspeech.pytorch/blob/master/model.py (Copyright (c) 2017 Sean Naren, MIT License).
- Parameters
sequential (torch.nn.Sequential) – sequential container of convolution layers
- Inputs: inputs, seq_lengths
inputs (torch.FloatTensor): The input of size BxCxHxT
seq_lengths (torch.IntTensor): The actual length of each sequence in the batch
- Returns: output, seq_lengths
output: Masked output from the sequential
seq_lengths: Sequence length of output from the sequential
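A minimal usage sketch based on the Parameters and Inputs above; the wrapped layers, shapes, and lengths are illustrative assumptions:
>>> import torch
>>> import torch.nn as nn
>>> from kospeech.models.convolution import MaskCNN
>>> conv = MaskCNN(nn.Sequential(
...     nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
...     nn.Hardtanh(0, 20, inplace=True),
... ))
>>> inputs = torch.randn(4, 1, 80, 300)                  # (batch_size, channel, hidden_dim, seq_len)
>>> seq_lengths = torch.IntTensor([300, 280, 250, 200])  # actual length of each sequence (batch)
>>> output, seq_lengths = conv(inputs, seq_lengths)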
class kospeech.models.convolution.MaskConv1d(in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, padding: int = 0, dilation: int = 1, groups: int = 1, bias: bool = False)
1D convolution with masking.
- Parameters
in_channels (int) – Number of channels in the input vector
out_channels (int) – Number of channels produced by the convolution
kernel_size (int) – Size of the convolving kernel
stride (int) – Stride of the convolution. Default: 1
padding (int) – Zero-padding added to both sides of the input. Default: 0
dilation (int) – Spacing between kernel elements. Default: 1
groups (int) – Number of blocked connections from input channels to output channels. Default: 1
bias (bool) – If True, adds a learnable bias to the output. Default: False
- Inputs: inputs, seq_lengths
inputs (torch.FloatTensor): The input of size (batch, dimension, time)
seq_lengths (torch.IntTensor): The actual length of each sequence in the batch
- Returns: output, seq_lengths
output: Masked output from the conv1d
seq_lengths: Sequence length of output from the conv1d
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
inputs: (batch, dimension, time)
input_lengths: (batch)
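A minimal usage sketch based on the signature and Inputs above; channel counts, lengths, and stride are illustrative assumptions:
>>> import torch
>>> from kospeech.models.convolution import MaskConv1d
>>> conv = MaskConv1d(in_channels=80, out_channels=160, kernel_size=3, stride=2, padding=1)
>>> inputs = torch.randn(4, 80, 120)                     # (batch, dimension, time)
>>> input_lengths = torch.IntTensor([120, 110, 100, 90]) # actual length of each sequence (batch)
>>> output, output_lengths = conv(inputs, input_lengths)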
class kospeech.models.convolution.PointwiseConv1d(in_channels: int, out_channels: int, stride: int = 1, padding: int = 0, bias: bool = True)
A conv1d with kernel size 1 is termed in the literature as a pointwise convolution. This operation is often used to match dimensions.
- Parameters
in_channels (int) – Number of channels in the input
out_channels (int) – Number of channels produced by the convolution
stride (int, optional) – Stride of the convolution. Default: 1
padding (int or tuple, optional) – Zero-padding added to both sides of the input. Default: 0
bias (bool, optional) – If True, adds a learnable bias to the output. Default: True
- Inputs: inputs
inputs (batch, in_channels, time): Tensor containing input vector
- Returns: outputs
outputs (batch, out_channels, time): Tensor produced by pointwise 1-D convolution.
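A minimal usage sketch based on the signature and Inputs above; channel counts and lengths are illustrative assumptions:
>>> import torch
>>> from kospeech.models.convolution import PointwiseConv1d
>>> conv = PointwiseConv1d(in_channels=256, out_channels=512)  # kernel size 1, used here to match dimensions
>>> inputs = torch.randn(4, 256, 120)  # (batch, in_channels, time)
>>> outputs = conv(inputs)             # (batch, out_channels, time)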
class kospeech.models.convolution.VGGExtractor(input_dim: int, in_channels: int = 1, out_channels: int or tuple = (64, 128), activation: str = 'hardtanh')
VGG extractor for automatic speech recognition, described in the “Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM” paper - https://arxiv.org/pdf/1706.02737.pdf
- Parameters
input_dim (int) – dimension of input vector
in_channels (int) – number of channels in the input (default: 1)
out_channels (int or tuple) – number of channels produced by the convolutions (default: (64, 128))
activation (str) – activation function to use (default: 'hardtanh')
- Inputs: inputs, input_lengths
inputs (batch, time, dim): Tensor containing input vectors
input_lengths: Tensor containing sequence lengths
- Returns: outputs, output_lengths
outputs: Tensor produced by the convolution
output_lengths: Tensor containing sequence lengths produced by the convolution
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
inputs: torch.FloatTensor (batch, time, dimension)
input_lengths: torch.IntTensor (batch)
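A minimal usage sketch based on the signature and Inputs above; the feature dimension, batch size, and lengths are illustrative assumptions:
>>> import torch
>>> from kospeech.models.convolution import VGGExtractor
>>> extractor = VGGExtractor(input_dim=80)
>>> inputs = torch.randn(4, 400, 80)                       # (batch, time, dim)
>>> input_lengths = torch.IntTensor([400, 380, 350, 300])  # per-utterance lengths (batch)
>>> outputs, output_lengths = extractor(inputs, input_lengths)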