Data¶
Augment¶
class kospeech.data.audio.augment.NoiseInjector(dataset_path, noiseset_size, sample_rate=16000, noise_level=0.7)[source]¶
Provides noise injection for noise augmentation. The noise augmentation process is as follows:
Step 1: Randomly sample noiseset_size audios from the dataset
Step 2: Extract noise from audio_paths
Step 3: Add the noise to the sound
- Parameters
dataset_path (str) – path of the noise dataset
noiseset_size (int) – size of the noise dataset
sample_rate (int) – sample rate of the audio signal (Default: 16000)
noise_level (float) – level of the injected noise (Default: 0.7)
- Inputs: signal
signal: signal from a PCM file
- Returns: signal
signal: the noise-added signal
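For example, a minimal usage sketch, assuming NoiseInjector is callable on a loaded signal as the Inputs/Returns above suggest (the dataset path and sample file are placeholders):

from kospeech.data.audio.augment import NoiseInjector
from kospeech.data.audio.core import load_audio

# 'data/noise' and 'sample.pcm' are hypothetical placeholders.
injector = NoiseInjector(dataset_path='data/noise', noiseset_size=10,
                         sample_rate=16000, noise_level=0.7)

signal = load_audio('sample.pcm', del_silence=False, extension='pcm')
if signal is not None:               # load_audio returns None on failure
    noisy_signal = injector(signal)  # noise-added signal (assumed call semantics)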
class kospeech.data.audio.augment.SpecAugment(freq_mask_para: int = 18, time_mask_num: int = 10, freq_mask_num: int = 2)[source]¶
Provides SpecAugment, a simple data augmentation method for speech recognition. This concept was proposed in https://arxiv.org/abs/1904.08779
- Parameters
freq_mask_para (int) – hyperparameter for frequency masking, limiting the frequency masking length (Default: 18)
time_mask_num (int) – how many time-masked areas to make (Default: 10)
freq_mask_num (int) – how many frequency-masked areas to make (Default: 2)
- Inputs: feature_vector
feature_vector (torch.FloatTensor): feature vector from an audio file
- Returns: feature_vector
feature_vector: the masked feature vector
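A short sketch of applying SpecAugment; the random tensor stands in for a real (time, frequency) feature, and callability is assumed from the Inputs/Returns above:

import torch
from kospeech.data.audio.augment import SpecAugment

spec_augment = SpecAugment(freq_mask_para=18, time_mask_num=10, freq_mask_num=2)

feature_vector = torch.randn(400, 80)   # stand-in (time, n_mels) feature
masked = spec_augment(feature_vector)   # time/frequency bands masked out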
Core¶
kospeech.data.audio.core.load_audio(audio_path: str, del_silence: bool = False, extension: str = 'pcm') → numpy.ndarray[source]¶
Loads an audio (PCM) file into a sound array. If del_silence is True, all sounds below 30 dB are eliminated. If an exception occurs in numpy.memmap(), returns None.
kospeech.data.audio.core.split(y, top_db=60, ref=<function amax>, frame_length=2048, hop_length=512)[source]¶
Code from https://github.com/librosa/librosa. This code fragment is used instead of importing the librosa package, because our server has a problem importing librosa.
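The two functions compose naturally. This sketch assumes split() mirrors librosa.effects.split and returns [start, end] sample intervals of the non-silent regions ('sample.pcm' is a placeholder):

from kospeech.data.audio.core import load_audio, split

signal = load_audio('sample.pcm', del_silence=True, extension='pcm')

if signal is not None:                   # None if numpy.memmap() failed
    intervals = split(signal, top_db=60) # assumed: non-silent [start, end] intervals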
Feature¶
class kospeech.data.audio.feature.FilterBank(sample_rate: int = 16000, n_mels: int = 80, frame_length: int = 20, frame_shift: int = 10)[source]¶
Creates a filter bank (fbank) from a raw audio signal. This matches the input/output of Kaldi's compute-fbank-feats.
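A usage sketch for FilterBank, assuming the transform is callable on a raw signal array like the other feature classes on this page; the random signal is a stand-in for real audio:

import numpy as np
from kospeech.data.audio.feature import FilterBank

transform = FilterBank(sample_rate=16000, n_mels=80, frame_length=20, frame_shift=10)

signal = np.random.randn(16000).astype(np.float32)  # stand-in 1-second signal
fbank = transform(signal)                           # Kaldi-style fbank features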
class kospeech.data.audio.feature.MFCC(sample_rate: int = 16000, n_mfcc: int = 40, frame_length: int = 20, frame_shift: int = 10, feature_extract_by: str = 'librosa')[source]¶
Creates the Mel-frequency cepstral coefficients (MFCCs) from an audio signal.
- Parameters
sample_rate (int) – Sample rate of the audio signal. (Default: 16000)
n_mfcc (int) – Number of MFC coefficients to retain. (Default: 40)
frame_length (int) – Frame length for the spectrogram, in ms. (Default: 20)
frame_shift (int) – Length of hop between STFT windows, in ms. (Default: 10)
feature_extract_by (str) – Which library to use for feature extraction. (default: librosa)
class kospeech.data.audio.feature.MelSpectrogram(sample_rate: int = 16000, n_mels: int = 80, frame_length: int = 20, frame_shift: int = 10, feature_extract_by: str = 'librosa')[source]¶
Creates a mel spectrogram from a raw audio signal. This is a composition of Spectrogram and MelScale.
- Parameters
sample_rate (int) – Sample rate of the audio signal. (Default: 16000)
n_mels (int) – Number of mel filterbanks. (Default: 80)
frame_length (int) – Frame length for the spectrogram, in ms. (Default: 20)
frame_shift (int) – Length of hop between STFT windows, in ms. (Default: 10)
feature_extract_by (str) – Which library to use for feature extraction. (default: librosa)
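MFCC and MelSpectrogram follow the same pattern as FilterBank; this sketch again assumes the transforms are callable on a raw signal:

import numpy as np
from kospeech.data.audio.feature import MFCC, MelSpectrogram

signal = np.random.randn(16000).astype(np.float32)  # stand-in 1-second signal

mfcc = MFCC(sample_rate=16000, n_mfcc=40, feature_extract_by='librosa')(signal)
melspec = MelSpectrogram(sample_rate=16000, n_mels=80)(signal)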
Parser¶
class kospeech.data.audio.parser.AudioParser(dataset_path)[source]¶
Provides an interface for audio parsers.
Note
Do not use this class directly; use one of the subclasses.
- Methods:
parse_audio(): abstract method; you must override this method (see the sketch below)
parse_transcript(): abstract method; you must override this method
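A minimal subclass sketch; the method arguments follow SpectrogramParser below and are assumptions, and the bodies are illustrative placeholders:

from kospeech.data.audio.parser import AudioParser

class MyParser(AudioParser):
    def __init__(self, dataset_path):
        super(MyParser, self).__init__(dataset_path)

    def parse_audio(self, audio_path, augment_method):
        ...  # load the audio file and return a feature vector

    def parse_transcript(self, transcript):
        ...  # convert the transcript string into label ids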
class kospeech.data.audio.parser.SpectrogramParser(feature_extract_by: str = 'librosa', sample_rate: int = 16000, n_mels: int = 80, frame_length: int = 20, frame_shift: int = 10, del_silence: bool = False, input_reverse: bool = True, normalize: bool = False, transform_method: str = 'mel', freq_mask_para: int = 12, time_mask_num: int = 2, freq_mask_num: int = 2, sos_id: int = 1, eos_id: int = 2, dataset_path: str = None, audio_extension: str = 'pcm')[source]¶
Parses an audio file into a spectrogram, mel spectrogram, or MFCC with various options.
- Parameters
transform_method (str) – which feature to use (default: mel)
sample_rate (int) – sample rate of the audio signal (Default: 16000)
n_mels (int) – number of mel filterbanks (Default: 80)
frame_length (int) – frame length for the spectrogram, in ms (Default: 20)
frame_shift (int) – length of hop between STFT windows, in ms (Default: 10)
feature_extract_by (str) – which library to use for feature extraction (default: librosa)
del_silence (bool) – flag indicating whether to delete silence (default: False)
input_reverse (bool) – flag indicating whether to reverse the input (default: True)
normalize (bool) – flag indicating whether to normalize the spectrum (default: False)
freq_mask_para (int) – hyperparameter for frequency masking, limiting the frequency masking length
time_mask_num (int) – how many time-masked areas to make
freq_mask_num (int) – how many frequency-masked areas to make
sos_id (int) – identification of the start-of-sentence token
eos_id (int) – identification of the end-of-sentence token
dataset_path (str) – path of the noise dataset
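A construction sketch; parse_audio() is assumed to take an audio path plus an augmentation flag, matching the abstract method above, and 'sample.pcm' and the flag value are placeholders:

from kospeech.data.audio.parser import SpectrogramParser

parser = SpectrogramParser(
    feature_extract_by='librosa',
    sample_rate=16000,
    n_mels=80,
    transform_method='mel',   # spectrogram / mel spectrogram / mfcc
)

feature = parser.parse_audio('sample.pcm', augment_method=0)  # 0: no augmentation (assumed)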
DataLoader¶
class kospeech.data.data_loader.AudioDataLoader(dataset, queue, batch_size, thread_id, pad_id)[source]¶
Audio Data Loader
- Parameters
dataset (SpectrogramDataset) – dataset for feature & transcript matching
queue (Queue.queue) – queue for threading
batch_size (int) – size of a batch
thread_id (int) – identification of the thread
pad_id (int) – identification of the pad token
class kospeech.data.data_loader.MultiDataLoader(dataset_list, queue, batch_size, num_workers, pad_id)[source]¶
Multi Data Loader using threads.
- Parameters
dataset_list (list) – list of datasets, one per worker
queue (Queue.queue) – queue for threading
batch_size (int) – size of a batch
num_workers (int) – number of worker threads
pad_id (int) – identification of the pad token
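A threading sketch, assuming the AudioDataLoader workers push padded batches onto the shared queue and that MultiDataLoader exposes start()/join(); dataset construction is elided (see SpectrogramDataset below):

import queue
from kospeech.data.data_loader import MultiDataLoader

batch_queue = queue.Queue(maxsize=8)

# trainset_list: a list of SpectrogramDataset objects, one per worker (placeholder)
loader = MultiDataLoader(trainset_list, batch_queue, batch_size=32,
                         num_workers=4, pad_id=0)
loader.start()
batch = batch_queue.get()  # blocks until a worker produces a batch
loader.join()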
class kospeech.data.data_loader.SpectrogramDataset(audio_paths: list, transcripts: list, sos_id: int, eos_id: int, config: omegaconf.dictconfig.DictConfig, spec_augment: bool = False, dataset_path: str = None, audio_extension: str = 'pcm')[source]¶
Dataset for feature & transcript matching
- Parameters
audio_paths (list) – list of audio paths
transcripts (list) – list of transcripts
sos_id (int) – identification of <start of sequence>
eos_id (int) – identification of <end of sequence>
spec_augment (bool) – flag indicating whether to use spec augmentation (default: False)
config (DictConfig) – set of configurations
dataset_path (str) – path of the dataset
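A construction sketch; the config keys that SpectrogramDataset actually reads are not documented on this page, so the OmegaConf object below is only a hypothetical stand-in:

from omegaconf import OmegaConf
from kospeech.data.data_loader import SpectrogramDataset

config = OmegaConf.create({'audio': {'sample_rate': 16000}})  # hypothetical keys

dataset = SpectrogramDataset(
    audio_paths=['sample1.pcm', 'sample2.pcm'],  # placeholder paths
    transcripts=['1 2 3', '4 5 6'],              # placeholder label-id strings
    sos_id=1,
    eos_id=2,
    config=config,
    spec_augment=False,
)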
kospeech.data.data_loader.split_dataset(config: omegaconf.dictconfig.DictConfig, transcripts_path: str, vocab: kospeech.vocabs.Vocabulary)[source]¶
Splits the dataset into a training set and a validation set.
- Parameters
config (DictConfig) – set of configurations
transcripts_path (str) – path of the transcripts
vocab (Vocabulary) – vocabulary of the dataset
- Returns: train_time_step, trainset_list, validset
train_time_step (int): number of time steps for training
trainset_list (list): list of training datasets
validset (data_loader.SpectrogramDataset): validation dataset
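Putting the pieces together, a sketch of how the return values might feed the loaders above; config and vocab must come from the training setup and are not constructed here, and the transcripts path is a placeholder:

from kospeech.data.data_loader import split_dataset

# config: an omegaconf DictConfig; vocab: a kospeech.vocabs.Vocabulary
train_time_step, trainset_list, validset = split_dataset(
    config, 'data/transcripts.txt', vocab)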