Deep Speech 2¶
class kospeech.models.deepspeech2.model.BNReluRNN(input_size: int, hidden_state_dim: int = 512, rnn_type: str = 'gru', bidirectional: bool = True, dropout_p: float = 0.1)¶

Recurrent neural network with a batch normalization layer and ReLU activation function.
- Parameters
input_size (int) – size of input
hidden_state_dim (int) – the number of features in the hidden state h
rnn_type (str, optional) – type of RNN cell (default: gru)
bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)
dropout_p (float, optional) – dropout probability (default: 0.1)
- Inputs: inputs, input_lengths
inputs (batch, time, dim): Tensor containing input vectors
input_lengths: Tensor containing sequence lengths
- Returns: outputs
outputs: Tensor produced by the BNReluRNN module
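For reference, below is a minimal, self-contained sketch of the batch-norm → ReLU → RNN pattern this module describes, written directly against PyTorch. The layer choices, normalization placement, and sequence-packing logic are illustrative assumptions and are not taken from kospeech's implementation.

import torch
import torch.nn as nn


class BNReluRNNSketch(nn.Module):
    """Illustrative batch-norm + ReLU + RNN block (not kospeech's implementation)."""

    def __init__(self, input_size: int, hidden_state_dim: int = 512,
                 rnn_type: str = 'gru', bidirectional: bool = True,
                 dropout_p: float = 0.1):
        super().__init__()
        rnn_cls = {'gru': nn.GRU, 'lstm': nn.LSTM, 'rnn': nn.RNN}[rnn_type]
        self.batch_norm = nn.BatchNorm1d(input_size)
        # Single-layer RNN: PyTorch ignores the dropout argument when num_layers == 1.
        self.rnn = rnn_cls(input_size=input_size, hidden_size=hidden_state_dim,
                           bidirectional=bidirectional, dropout=dropout_p,
                           batch_first=True)

    def forward(self, inputs: torch.Tensor, input_lengths: torch.Tensor) -> torch.Tensor:
        # inputs: (batch, time, dim); BatchNorm1d expects (batch, dim, time).
        outputs = torch.relu(self.batch_norm(inputs.transpose(1, 2))).transpose(1, 2)
        packed = nn.utils.rnn.pack_padded_sequence(
            outputs, input_lengths.cpu(), batch_first=True, enforce_sorted=False)
        packed, _ = self.rnn(packed)
        # outputs: (batch, time, hidden_state_dim * num_directions)
        outputs, _ = nn.utils.rnn.pad_packed_sequence(packed, batch_first=True)
        return outputs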
class kospeech.models.deepspeech2.model.DeepSpeech2(input_dim: int, num_classes: int, rnn_type='gru', num_rnn_layers: int = 5, rnn_hidden_dim: int = 512, dropout_p: float = 0.1, bidirectional: bool = True, activation: str = 'hardtanh', device: torch.device = 'cuda')¶

Deep Speech 2 model with configurable encoder and decoder. Paper: https://arxiv.org/abs/1512.02595
- Parameters
input_dim (int) – dimension of input vector
num_classes (int) – number of classification classes
rnn_type (str, optional) – type of RNN cell (default: gru)
num_rnn_layers (int, optional) – number of recurrent layers (default: 5)
rnn_hidden_dim (int) – the number of features in the hidden state h
dropout_p (float, optional) – dropout probability (default: 0.1)
bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)
activation (str) – type of activation function (default: hardtanh)
device (torch.device) – device - ‘cuda’ or ‘cpu’
- Inputs: inputs, input_lengths
inputs: padded tensor of input feature sequences, of size (batch, seq_length, dimension)
input_lengths: tensor of input sequence lengths, of size (batch)
- Returns: output
output: tensor containing the encoded features of the input sequence
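A construction example based on the constructor signature above; the feature dimension and vocabulary size are arbitrary illustrative values, not values prescribed by kospeech.

import torch
from kospeech.models.deepspeech2.model import DeepSpeech2

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = DeepSpeech2(
    input_dim=80,          # e.g. 80-dimensional filter-bank features (illustrative)
    num_classes=2000,      # output vocabulary size (illustrative)
    rnn_type='gru',
    num_rnn_layers=5,
    rnn_hidden_dim=512,
    dropout_p=0.1,
    bidirectional=True,
    activation='hardtanh',
    device=device,
).to(device)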
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]¶

Forward propagates inputs for CTC training.
- Parameters
inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically a padded FloatTensor of size (batch, seq_length, dimension).
input_lengths (torch.LongTensor) – The lengths of the input sequences, of size (batch).
- Returns
predicted_log_probs (torch.FloatTensor): Log probabilities of model predictions.
output_lengths (torch.LongTensor): The lengths of the output sequences, of size (batch).
- Return type
(Tensor, Tensor)
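The sketch below shows one CTC training step built on the forward interface documented above. The blank-token index, the (batch, time, classes) layout of the returned log probabilities, and the random inputs and targets are assumptions made for illustration only.

import torch
import torch.nn as nn
from kospeech.models.deepspeech2.model import DeepSpeech2

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = DeepSpeech2(input_dim=80, num_classes=2000, device=device).to(device)

# Dummy padded feature batch: (batch, seq_length, dimension) and per-example lengths.
inputs = torch.randn(4, 400, 80, device=device)
input_lengths = torch.tensor([400, 360, 320, 300], device=device)

# forward returns (predicted_log_probs, output_lengths) as documented above.
predicted_log_probs, output_lengths = model(inputs, input_lengths)

# Dummy targets; in practice these are token id sequences. Target lengths must not
# exceed the subsampled output lengths for CTC to be well defined.
targets = torch.randint(1, 2000, (4, 20), dtype=torch.long, device=device)
target_lengths = torch.full((4,), 20, dtype=torch.long, device=device)

# nn.CTCLoss expects (time, batch, classes); a (batch, time, classes) layout of the
# model output is assumed here, hence the transpose. Blank index assumed to be 0.
criterion = nn.CTCLoss(blank=0, zero_infinity=True)
loss = criterion(predicted_log_probs.transpose(0, 1), targets,
                 output_lengths, target_lengths)
loss.backward()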