Data

Augment

class kospeech.data.audio.augment.NoiseInjector(dataset_path, noiseset_size, sample_rate=16000, noise_level=0.7)[source]

Provides noise injection for noise augmentation. The noise augmentation process is as follows:

Step 1: Randomly sample audios by noiseset_size from the dataset
Step 2: Extract noise from audio_paths
Step 3: Add noise to the sound

Parameters
  • dataset_path (str) – path of dataset

  • noiseset_size (int) – size of noise dataset

  • sample_rate (int) – sampling rate

  • noise_level (float) – level of noise

Inputs: signal
  • signal: signal from pcm file

Returns: signal
  • signal: noise-added signal
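
In code, Step 3 amounts to mixing a scaled noise clip into the signal. A minimal sketch of the idea (inject_noise is an illustrative helper, not KoSpeech's exact implementation):

    import numpy as np

    def inject_noise(signal: np.ndarray, noise: np.ndarray, noise_level: float = 0.7) -> np.ndarray:
        # Tile or crop the noise clip so it matches the signal length.
        if len(noise) < len(signal):
            noise = np.tile(noise, len(signal) // len(noise) + 1)
        noise = noise[:len(signal)]
        # Scale the noise by noise_level and add it to the signal.
        return signal + noise_level * noise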

class kospeech.data.audio.augment.SpecAugment(freq_mask_para: int = 18, time_mask_num: int = 10, freq_mask_num: int = 2)[source]

Provides Spec Augment, a simple data augmentation method for speech recognition. This concept was proposed in https://arxiv.org/abs/1904.08779

Parameters
  • freq_mask_para (int) – maximum frequency masking length

  • time_mask_num (int) – how many times to apply time masking

  • freq_mask_num (int) – how many times to apply frequency masking

Inputs: feature_vector
  • feature_vector (torch.FloatTensor): feature vector from audio file.

Returns: feature_vector
  • feature_vector: masked feature vector.
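
Conceptually, both mask types zero out randomly chosen spans of the feature matrix. A minimal sketch, assuming a (time, frequency) FloatTensor layout (the exact layout and span bounds in KoSpeech may differ):

    import random
    import torch

    def spec_augment(feature: torch.Tensor, freq_mask_para: int = 18,
                     time_mask_num: int = 10, freq_mask_num: int = 2) -> torch.Tensor:
        time_axis, freq_axis = feature.size(0), feature.size(1)

        # Time masking: zero out time_mask_num random spans along the time
        # axis; the span-length bound (time_axis // 20) is illustrative.
        for _ in range(time_mask_num):
            t = random.randint(0, time_axis // 20)
            t0 = random.randint(0, time_axis - t)
            feature[t0:t0 + t, :] = 0

        # Frequency masking: zero out freq_mask_num random bands, each at
        # most freq_mask_para bins wide.
        for _ in range(freq_mask_num):
            f = random.randint(0, min(freq_mask_para, freq_axis))
            f0 = random.randint(0, freq_axis - f)
            feature[:, f0:f0 + f] = 0

        return feature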

Core

kospeech.data.audio.core.load_audio(audio_path: str, del_silence: bool = False, extension: str = 'pcm') → numpy.ndarray[source]

Load an audio file (PCM) into a sound array. If del_silence is True, eliminate all sounds below 30 dB. If an exception occurs in numpy.memmap(), return None.
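
Typical usage (the path is illustrative):

    from kospeech.data.audio.core import load_audio

    signal = load_audio('sample.pcm', del_silence=True, extension='pcm')
    if signal is None:
        pass  # numpy.memmap() failed on this file; skip it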

kospeech.data.audio.core.split(y, top_db=60, ref=<function amax>, frame_length=2048, hop_length=512)[source]

Code from https://github.com/librosa/librosa. This code fragment is used instead of importing the librosa package, because our server has a problem importing librosa.
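
split() returns an array of (start, end) sample intervals covering the non-silent regions, mirroring librosa.effects.split. A hedged sketch of how silence removal can be built on top of it:

    import numpy as np
    from kospeech.data.audio.core import load_audio, split

    signal = load_audio('sample.pcm', del_silence=False)  # illustrative path
    # Keep only the non-silent intervals and concatenate them.
    non_silence_indices = split(signal, top_db=30)
    trimmed = np.concatenate([signal[start:end] for start, end in non_silence_indices])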

Feature

class kospeech.data.audio.feature.FilterBank(sample_rate: int = 16000, n_mels: int = 80, frame_length: int = 20, frame_shift: int = 10)[source]

Create a filter bank (fbank) from a raw audio signal. This matches the input/output of Kaldi's compute-fbank-feats.

Parameters
  • sample_rate (int) – Sample rate of audio signal. (Default: 16000)

  • n_mels (int) – Number of mel filterbanks. (Default: 80)

  • frame_length (int) – frame length for spectrogram (ms) (Default: 20)

  • frame_shift (int) – Length of hop between STFT windows. (ms) (Default: 10)
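
Since the class matches Kaldi's compute-fbank-feats, torchaudio's Kaldi-compliance fbank gives a concrete picture of the expected output (a hedged equivalence, not necessarily KoSpeech's internal backend):

    import torch
    import torchaudio.compliance.kaldi as kaldi

    waveform = torch.randn(1, 16000)  # (channel, time): one second at 16 kHz
    fbank = kaldi.fbank(
        waveform,
        num_mel_bins=80,          # n_mels
        frame_length=20.0,        # ms
        frame_shift=10.0,         # ms
        sample_frequency=16000.0,
    )
    # fbank has shape (num_frames, num_mel_bins)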

class kospeech.data.audio.feature.MFCC(sample_rate: int = 16000, n_mfcc: int = 40, frame_length: int = 20, frame_shift: int = 10, feature_extract_by: str = 'librosa')[source]

Create the Mel-frequency cepstrum coefficients (MFCCs) from an audio signal.

Parameters
  • sample_rate (int) – Sample rate of audio signal. (Default: 16000)

  • n_mfcc (int) – Number of mfc coefficients to retain. (Default: 40)

  • frame_length (int) – frame length for spectrogram (ms) (Default: 20)

  • frame_shift (int) – Length of hop between STFT windows. (ms) (Default: 10)

  • feature_extract_by (str) – which library to use for feature extraction (default: librosa)
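
A hedged sketch of what the 'librosa' backend computes: the frame_length/frame_shift values in milliseconds map to n_fft/hop_length in samples at the given sample rate.

    import librosa
    import numpy as np

    sample_rate, frame_length, frame_shift = 16000, 20, 10
    signal = np.random.randn(16000).astype(np.float32)  # dummy one-second signal

    mfcc = librosa.feature.mfcc(
        y=signal,
        sr=sample_rate,
        n_mfcc=40,
        n_fft=sample_rate * frame_length // 1000,      # 320 samples
        hop_length=sample_rate * frame_shift // 1000,  # 160 samples
    )
    # mfcc has shape (n_mfcc, time)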

class kospeech.data.audio.feature.MelSpectrogram(sample_rate: int = 16000, n_mels: int = 80, frame_length: int = 20, frame_shift: int = 10, feature_extract_by: str = 'librosa')[source]

Create MelSpectrogram for a raw audio signal. This is a composition of Spectrogram and MelScale.

Parameters
  • sample_rate (int) – Sample rate of audio signal. (Default: 16000)

  • n_mels (int) – Number of mel filterbanks. (Default: 80)

  • frame_length (int) – frame length for spectrogram (ms) (Default: 20)

  • frame_shift (int) – Length of hop between STFT windows. (ms) (Default: 10)

  • feature_extract_by (str) – which library to use for feature extraction (default: librosa)
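
torchaudio exposes the same Spectrogram + MelScale composition as a single transform, which serves as a hedged reference (parameters converted from ms to samples at 16 kHz):

    import torch
    import torchaudio

    mel_transform = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000,
        n_fft=320,       # 20 ms frame_length
        hop_length=160,  # 10 ms frame_shift
        n_mels=80,
    )
    waveform = torch.randn(1, 16000)  # one second of dummy audio
    mel = mel_transform(waveform)     # shape: (1, 80, time)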

class kospeech.data.audio.feature.Spectrogram(sample_rate: int = 16000, frame_length: int = 20, frame_shift: int = 10, feature_extract_by: str = 'torch')[source]

Create a spectrogram from an audio signal.

Parameters
  • sample_rate (int) – Sample rate of audio signal. (Default: 16000)

  • frame_length (int) – frame length for spectrogram (ms) (Default: 20)

  • frame_shift (int) – Length of hop between STFT windows. (ms) (Default: 10)

  • feature_extract_by (str) – which library to use for feature extraction (default: torch)
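
A hedged sketch of what a 'torch' backend boils down to: convert the ms parameters to samples, run a short-time Fourier transform, and take the magnitude.

    import torch

    sample_rate, frame_length, frame_shift = 16000, 20, 10
    n_fft = sample_rate * frame_length // 1000       # 320 samples
    hop_length = sample_rate * frame_shift // 1000   # 160 samples

    signal = torch.randn(16000)  # dummy one-second signal
    stft = torch.stft(signal, n_fft=n_fft, hop_length=hop_length,
                      window=torch.hamming_window(n_fft), return_complex=True)
    spectrogram = stft.abs()  # magnitude, shape (n_fft // 2 + 1, time)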

Parser

class kospeech.data.audio.parser.AudioParser(dataset_path)[source]

Provides an interface for audio parsers.

Note

Do not use this class directly; use one of the subclasses.

Methods:
  • parse_audio(): abstract method; you must override this method.

  • parse_transcript(): abstract method; you must override this method.
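
A minimal subclass skeleton (the class name is hypothetical):

    from kospeech.data.audio.parser import AudioParser

    class MyParser(AudioParser):
        """Hypothetical subclass showing the two required overrides."""

        def parse_audio(self, audio_path: str, augment_method: int):
            # Load the audio file and return a feature tensor.
            raise NotImplementedError

        def parse_transcript(self, transcript: str):
            # Convert a transcript string into a sequence of label ids.
            raise NotImplementedError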

class kospeech.data.audio.parser.SpectrogramParser(feature_extract_by: str = 'librosa', sample_rate: int = 16000, n_mels: int = 80, frame_length: int = 20, frame_shift: int = 10, del_silence: bool = False, input_reverse: bool = True, normalize: bool = False, transform_method: str = 'mel', freq_mask_para: int = 12, time_mask_num: int = 2, freq_mask_num: int = 2, sos_id: int = 1, eos_id: int = 2, dataset_path: str = None, audio_extension: str = 'pcm')[source]

Parses audio file into (spectrogram / mel spectrogram / mfcc) with various options.

Parameters
  • transform_method (str) – which feature to use (default: mel)

  • sample_rate (int) – Sample rate of audio signal. (Default: 16000)

  • n_mels (int) – Number of mel filterbanks. (Default: 80)

  • frame_length (int) – frame length for spectrogram (ms) (Default: 20)

  • frame_shift (int) – Length of hop between STFT windows. (ms) (Default: 10)

  • feature_extract_by (str) – which library to use for feature extraction (default: librosa)

  • del_silence (bool) – flag indicating whether to delete silence or not (default: False)

  • input_reverse (bool) – flag indicating whether to reverse the input or not (default: True)

  • normalize (bool) – flag indicating whether to normalize the spectrum or not (default: False)

  • freq_mask_para (int) – hyperparameter that limits the maximum frequency-masking length

  • time_mask_num (int) – how many time-masked areas to make

  • freq_mask_num (int) – how many freq-masked areas to make

  • sos_id (int) – identifier of the start-of-sentence token

  • eos_id (int) – identifier of the end-of-sentence token

  • dataset_path (str) – noise dataset path

parse_audio(audio_path: str, augment_method: int) → torch.Tensor[source]

Parses audio.

Parameters
  • audio_path (str) – path of audio file

  • augment_method (int) – flag indicating which augmentation method to use.

Returns: feature_vector
  • feature_vector (torch.FloatTensor): feature from audio file.
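
Typical usage (the path is illustrative; 0 is assumed here to be the "no augmentation" flag, the actual values are defined by the library):

    from kospeech.data.audio.parser import SpectrogramParser

    parser = SpectrogramParser(transform_method='mel', sample_rate=16000,
                               n_mels=80, del_silence=True)
    feature = parser.parse_audio('sample.pcm', augment_method=0)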

DataLoader

class kospeech.data.data_loader.AudioDataLoader(dataset, queue, batch_size, thread_id, pad_id)[source]

Audio Data Loader

Parameters
  • dataset (SpectrogramDataset) – dataset for feature & transcript matching

  • queue (queue.Queue) – queue for threading

  • batch_size (int) – size of batch

  • thread_id (int) – identification of thread

run()[source]

Load data from the SpectrogramDataset

class kospeech.data.data_loader.MultiDataLoader(dataset_list, queue, batch_size, num_workers, pad_id)[source]

Multi Data Loader using Threads.

Parameters
  • dataset_list (list) – list of SpectrogramDataset

  • queue (queue.Queue) – queue for threading

  • batch_size (int) – size of batch

  • num_workers (int) – the number of CPU cores used

join()[source]

Wait for the loader threads to finish

start()[source]

Run threads
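
A hedged driver sketch, assuming trainset_list comes from split_dataset() (documented below) and that pad_id and the queue size are illustrative:

    import queue
    from kospeech.data.data_loader import MultiDataLoader

    num_workers, batch_size, pad_id = 4, 32, 0

    # trainset_list: list of SpectrogramDataset, e.g. from split_dataset()
    data_queue = queue.Queue(num_workers << 1)  # bounded queue
    loader = MultiDataLoader(trainset_list, data_queue, batch_size, num_workers, pad_id)
    loader.start()   # spawn one loader thread per dataset

    # ... the training loop consumes batches from data_queue here ...

    loader.join()    # block until every loader thread has finished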

class kospeech.data.data_loader.SpectrogramDataset(audio_paths: list, transcripts: list, sos_id: int, eos_id: int, config: omegaconf.dictconfig.DictConfig, spec_augment: bool = False, dataset_path: str = None, audio_extension: str = 'pcm')[source]

Dataset for feature & transcript matching

Parameters
  • audio_paths (list) – list of audio path

  • transcripts (list) – list of transcript

  • sos_id (int) – identifier of the <start of sequence> token

  • eos_id (int) – identifier of the <end of sequence> token

  • spec_augment (bool) – flag indicating whether to use spec-augmentation or not (default: False)

  • config (DictConfig) – set of configurations

  • dataset_path (str) – path of dataset

get_item(idx)[source]

Get the feature vector & transcript for the given index

parse_transcript(transcript)[source]

Parses transcript

shuffle()[source]

Shuffle dataset

kospeech.data.data_loader.split_dataset(config: omegaconf.dictconfig.DictConfig, transcripts_path: str, vocab: kospeech.vocabs.Vocabulary)[source]

Split the dataset into training sets and a validation set.

Parameters
  • config (DictConfig) – set of configurations

  • transcripts_path (str) – path of transcripts

  • vocab (Vocabulary) – vocabulary of the dataset

Returns: train_time_step, trainset_list, validset
  • train_time_step (int): number of time steps for training

  • trainset_list (list): list of training datasets

  • validset (data_loader.SpectrogramDataset): validation dataset
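
Typical usage (the transcript path is illustrative; config and vocab are assumed to exist already):

    from kospeech.data.data_loader import split_dataset

    train_time_step, trainset_list, validset = split_dataset(
        config, transcripts_path='data/transcripts.txt', vocab=vocab)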

LabelLoader

kospeech.data.label_loader.load_dataset(transcripts_path: str) → Tuple[list, list][source]

Provides parallel lists of audio paths and transcripts.

Parameters

transcripts_path (str) – path of transcripts

Returns: audio_paths, transcripts
  • audio_paths (list): list of audio paths

  • transcripts (list): list of transcripts
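
Matching the Tuple[list, list] signature, a typical call unpacks two parallel lists (the path is illustrative):

    from kospeech.data.label_loader import load_dataset

    audio_paths, transcripts = load_dataset('data/transcripts.txt')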