Deep Speech 2

class kospeech.models.deepspeech2.model.BNReluRNN(input_size: int, hidden_state_dim: int = 512, rnn_type: str = 'gru', bidirectional: bool = True, dropout_p: float = 0.1)[source]

Recurrent neural network with batch normalization layer & ReLU activation function.

Parameters
  • input_size (int) – size of input

  • hidden_state_dim (int) – the number of features in the hidden state h (default: 512)

  • rnn_type (str, optional) – type of RNN cell (default: gru)

  • bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)

  • dropout_p (float, optional) – dropout probability (default: 0.1)

Inputs: inputs, input_lengths
  • inputs (batch, time, dim): Tensor containing input vectors

  • input_lengths: Tensor containing the sequence lengths of the inputs

Returns: outputs
  • outputs: Tensor produced by the BNReluRNN module
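A minimal sketch of what such a batch-norm + ReLU + RNN block could look like, following the parameters and the (batch, time, dim) input shape documented above. This is an illustrative re-implementation for clarity, not the KoSpeech source; the class name SimpleBNReluRNN and the packing details are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleBNReluRNN(nn.Module):
        """Illustrative batch-norm + ReLU + RNN block (not the KoSpeech source)."""

        supported_rnns = {'gru': nn.GRU, 'lstm': nn.LSTM, 'rnn': nn.RNN}

        def __init__(self, input_size, hidden_state_dim=512, rnn_type='gru',
                     bidirectional=True, dropout_p=0.1):
            super().__init__()
            # BatchNorm1d normalizes over the feature dimension, so inputs are
            # transposed to (batch, dim, time) before normalization.
            self.batch_norm = nn.BatchNorm1d(input_size)
            self.rnn = self.supported_rnns[rnn_type](
                input_size=input_size,
                hidden_size=hidden_state_dim,
                batch_first=True,
                bidirectional=bidirectional,
                dropout=dropout_p,  # no effect for a single layer; kept for illustration
            )

        def forward(self, inputs, input_lengths):
            # inputs: (batch, time, dim), input_lengths: (batch,)
            x = F.relu(self.batch_norm(inputs.transpose(1, 2))).transpose(1, 2)
            packed = nn.utils.rnn.pack_padded_sequence(
                x, input_lengths.cpu(), batch_first=True, enforce_sorted=False)
            outputs, _ = self.rnn(packed)
            outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True)
            return outputs  # (batch, time, hidden_state_dim * num_directions)

    # Example call with random features.
    block = SimpleBNReluRNN(input_size=80)
    out = block(torch.randn(4, 100, 80), torch.LongTensor([100, 90, 80, 70]))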

class kospeech.models.deepspeech2.model.DeepSpeech2(input_dim: int, num_classes: int, rnn_type='gru', num_rnn_layers: int = 5, rnn_hidden_dim: int = 512, dropout_p: float = 0.1, bidirectional: bool = True, activation: str = 'hardtanh', device: torch.device = 'cuda')[source]

Deep Speech2 model with configurable encoder and decoder. Paper: https://arxiv.org/abs/1512.02595

Parameters
  • input_dim (int) – dimension of input vector

  • num_classes (int) – number of classification classes

  • rnn_type (str, optional) – type of RNN cell (default: gru)

  • num_rnn_layers (int, optional) – number of recurrent layers (default: 5)

  • rnn_hidden_dim (int) – the number of features in the hidden state h (default: 512)

  • dropout_p (float, optional) – dropout probability (default: 0.1)

  • bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)

  • activation (str) – type of activation function (default: hardtanh)

  • device (torch.device) – device to run the model on, ‘cuda’ or ‘cpu’ (default: ‘cuda’)

Inputs: inputs, input_lengths
  • inputs: tensor containing the padded input sequences; typically a FloatTensor of size (batch, seq_length, dimension)

  • input_lengths: tensor containing the lengths of the input sequences

Returns: output
  • output: tensor containing the encoded features of the input sequence
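A minimal instantiation sketch, assuming the constructor signature shown above. The feature dimension (80) and vocabulary size (2000) are placeholder values, not library defaults.

    import torch
    from kospeech.models.deepspeech2.model import DeepSpeech2

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    model = DeepSpeech2(
        input_dim=80,          # e.g. 80-dimensional filterbank features (placeholder)
        num_classes=2000,      # vocabulary size (placeholder)
        rnn_type='gru',
        num_rnn_layers=5,
        rnn_hidden_dim=512,
        dropout_p=0.1,
        bidirectional=True,
        activation='hardtanh',
        device=device,
    ).to(device)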

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs for CTC training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this is a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The lengths of the input sequences, of size (batch).

Returns

  • predicted_log_probs (torch.FloatTensor): Log probabilities of model predictions.

  • output_lengths (torch.LongTensor): The lengths of the output sequences, of size (batch).

Return type

(Tensor, Tensor)
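A hedged training-step sketch combining forward() with torch.nn.CTCLoss, continuing the model instance from the sketch above. The dummy inputs, targets, the transpose to (time, batch, class), and the blank index of 0 are assumptions for illustration, not documented behavior of the library.

    import torch
    import torch.nn as nn

    # Dummy batch: 4 utterances, up to 300 frames of 80-dim features (placeholders).
    inputs = torch.randn(4, 300, 80).to(device)
    input_lengths = torch.LongTensor([300, 280, 250, 200])

    # Dummy integer targets and their lengths (placeholder vocabulary indices).
    targets = torch.randint(low=1, high=2000, size=(4, 50))
    target_lengths = torch.LongTensor([50, 45, 40, 30])

    predicted_log_probs, output_lengths = model(inputs, input_lengths)

    # nn.CTCLoss expects (time, batch, class) log probabilities; the transpose
    # assumes the model returns them in (batch, time, class) order.
    criterion = nn.CTCLoss(blank=0, zero_infinity=True)
    loss = criterion(predicted_log_probs.transpose(0, 1), targets,
                     output_lengths, target_lengths)
    loss.backward()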