Data¶
Augment¶
class kospeech.data.audio.augment.NoiseInjector(dataset_path, noiseset_size, sample_rate=16000, noise_level=0.7)[source]¶
Provides noise injection for noise augmentation. The noise augmentation process is as follows:
Step 1: Randomly sample noiseset_size audios from the dataset
Step 2: Extract noise from audio_paths
Step 3: Add the noise to the sound
- Parameters
dataset_path (str) – path of the noise dataset
noiseset_size (int) – size of the noise dataset
sample_rate (int) – sample rate of the audio signal (Default: 16000)
noise_level (float) – level of the injected noise (Default: 0.7)
- Inputs: signal
signal: signal from a PCM file
- Returns: signal
signal: the noise-added signal
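For example, a minimal usage sketch, assuming NoiseInjector is callable on a loaded signal as the Inputs/Returns above suggest (the dataset path and sample file are placeholders):

from kospeech.data.audio.augment import NoiseInjector
from kospeech.data.audio.core import load_audio

# 'data/noise' and 'sample.pcm' are hypothetical placeholders.
injector = NoiseInjector(dataset_path='data/noise', noiseset_size=10,
                         sample_rate=16000, noise_level=0.7)

signal = load_audio('sample.pcm', del_silence=False, extension='pcm')
if signal is not None:               # load_audio returns None on failure
    noisy_signal = injector(signal)  # noise-added signal (assumed call semantics)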
class kospeech.data.audio.augment.SpecAugment(freq_mask_para: int = 18, time_mask_num: int = 10, freq_mask_num: int = 2)[source]¶
Provides SpecAugment, a simple data augmentation method for speech recognition. This concept was proposed in https://arxiv.org/abs/1904.08779
- Parameters
freq_mask_para (int) – hyperparameter for frequency masking, limiting the frequency masking length (Default: 18)
time_mask_num (int) – how many time-masked areas to make (Default: 10)
freq_mask_num (int) – how many frequency-masked areas to make (Default: 2)
- Inputs: feature_vector
feature_vector (torch.FloatTensor): feature vector from an audio file
- Returns: feature_vector
feature_vector: the masked feature vector
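A short sketch of applying SpecAugment; the random tensor stands in for a real (time, frequency) feature, and callability is assumed from the Inputs/Returns above:

import torch
from kospeech.data.audio.augment import SpecAugment

spec_augment = SpecAugment(freq_mask_para=18, time_mask_num=10, freq_mask_num=2)

feature_vector = torch.randn(400, 80)   # stand-in (time, n_mels) feature
masked = spec_augment(feature_vector)   # time/frequency bands masked out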
Core¶
kospeech.data.audio.core.load_audio(audio_path: str, del_silence: bool = False, extension: str = 'pcm') → numpy.ndarray[source]¶
Loads an audio (PCM) file into a sound array. If del_silence is True, all sounds below 30 dB are eliminated. If an exception occurs in numpy.memmap(), returns None.
kospeech.data.audio.core.split(y, top_db=60, ref=<function amax>, frame_length=2048, hop_length=512)[source]¶
Code from https://github.com/librosa/librosa. This code fragment is used instead of importing the librosa package, because our server has a problem importing librosa.
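The two functions compose naturally. This sketch assumes split() mirrors librosa.effects.split and returns [start, end] sample intervals of the non-silent regions ('sample.pcm' is a placeholder):

from kospeech.data.audio.core import load_audio, split

signal = load_audio('sample.pcm', del_silence=True, extension='pcm')

if signal is not None:                   # None if numpy.memmap() failed
    intervals = split(signal, top_db=60) # assumed: non-silent [start, end] intervals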
Feature¶
class kospeech.data.audio.feature.FilterBank(sample_rate: int = 16000, n_mels: int = 80, frame_length: int = 20, frame_shift: int = 10)[source]¶
Creates a filter bank (fbank) from a raw audio signal. This matches the input/output of Kaldi's compute-fbank-feats.
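A usage sketch for FilterBank, assuming the transform is callable on a raw signal array like the other feature classes on this page; the random signal is a stand-in for real audio:

import numpy as np
from kospeech.data.audio.feature import FilterBank

transform = FilterBank(sample_rate=16000, n_mels=80, frame_length=20, frame_shift=10)

signal = np.random.randn(16000).astype(np.float32)  # stand-in 1-second signal
fbank = transform(signal)                           # Kaldi-style fbank features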
class kospeech.data.audio.feature.MFCC(sample_rate: int = 16000, n_mfcc: int = 40, frame_length: int = 20, frame_shift: int = 10, feature_extract_by: str = 'librosa')[source]¶
Creates the Mel-frequency cepstral coefficients (MFCCs) from an audio signal.
- Parameters
sample_rate (int) – Sample rate of the audio signal. (Default: 16000)
n_mfcc (int) – Number of MFC coefficients to retain. (Default: 40)
frame_length (int) – Frame length for the spectrogram, in ms. (Default: 20)
frame_shift (int) – Length of hop between STFT windows, in ms. (Default: 10)
feature_extract_by (str) – Which library to use for feature extraction. (default: librosa)
class kospeech.data.audio.feature.MelSpectrogram(sample_rate: int = 16000, n_mels: int = 80, frame_length: int = 20, frame_shift: int = 10, feature_extract_by: str = 'librosa')[source]¶
Creates a mel spectrogram from a raw audio signal. This is a composition of Spectrogram and MelScale.
- Parameters
sample_rate (int) – Sample rate of the audio signal. (Default: 16000)
n_mels (int) – Number of mel filterbanks. (Default: 80)
frame_length (int) – Frame length for the spectrogram, in ms. (Default: 20)
frame_shift (int) – Length of hop between STFT windows, in ms. (Default: 10)
feature_extract_by (str) – Which library to use for feature extraction. (default: librosa)
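MFCC and MelSpectrogram follow the same pattern as FilterBank; this sketch again assumes the transforms are callable on a raw signal:

import numpy as np
from kospeech.data.audio.feature import MFCC, MelSpectrogram

signal = np.random.randn(16000).astype(np.float32)  # stand-in 1-second signal

mfcc = MFCC(sample_rate=16000, n_mfcc=40, feature_extract_by='librosa')(signal)
melspec = MelSpectrogram(sample_rate=16000, n_mels=80)(signal)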
Parser¶
class kospeech.data.audio.parser.AudioParser(dataset_path)[source]¶
Provides an interface for audio parsers.
Note
Do not use this class directly; use one of the subclasses.
- Methods:
parse_audio(): abstract method; you must override this method (see the sketch below)
parse_transcript(): abstract method; you must override this method
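A minimal subclass sketch; the method arguments follow SpectrogramParser below and are assumptions, and the bodies are illustrative placeholders:

from kospeech.data.audio.parser import AudioParser

class MyParser(AudioParser):
    def __init__(self, dataset_path):
        super(MyParser, self).__init__(dataset_path)

    def parse_audio(self, audio_path, augment_method):
        ...  # load the audio file and return a feature vector

    def parse_transcript(self, transcript):
        ...  # convert the transcript string into label ids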
class kospeech.data.audio.parser.SpectrogramParser(feature_extract_by: str = 'librosa', sample_rate: int = 16000, n_mels: int = 80, frame_length: int = 20, frame_shift: int = 10, del_silence: bool = False, input_reverse: bool = True, normalize: bool = False, transform_method: str = 'mel', freq_mask_para: int = 12, time_mask_num: int = 2, freq_mask_num: int = 2, sos_id: int = 1, eos_id: int = 2, dataset_path: str = None, audio_extension: str = 'pcm')[source]¶
Parses an audio file into a spectrogram, mel spectrogram, or MFCC with various options.
- Parameters
transform_method (str) – which feature to use (default: mel)
sample_rate (int) – sample rate of the audio signal (Default: 16000)
n_mels (int) – number of mel filterbanks (Default: 80)
frame_length (int) – frame length for the spectrogram, in ms (Default: 20)
frame_shift (int) – length of hop between STFT windows, in ms (Default: 10)
feature_extract_by (str) – which library to use for feature extraction (default: librosa)
del_silence (bool) – flag indicating whether to delete silence (default: False)
input_reverse (bool) – flag indicating whether to reverse the input (default: True)
normalize (bool) – flag indicating whether to normalize the spectrum (default: False)
freq_mask_para (int) – hyperparameter for frequency masking, limiting the frequency masking length
time_mask_num (int) – how many time-masked areas to make
freq_mask_num (int) – how many frequency-masked areas to make
sos_id (int) – identification of the start-of-sentence token
eos_id (int) – identification of the end-of-sentence token
dataset_path (str) – path of the noise dataset
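A construction sketch; parse_audio() is assumed to take an audio path plus an augmentation flag, matching the abstract method above, and 'sample.pcm' and the flag value are placeholders:

from kospeech.data.audio.parser import SpectrogramParser

parser = SpectrogramParser(
    feature_extract_by='librosa',
    sample_rate=16000,
    n_mels=80,
    transform_method='mel',   # spectrogram / mel spectrogram / mfcc
)

feature = parser.parse_audio('sample.pcm', augment_method=0)  # 0: no augmentation (assumed)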
DataLoader¶
class kospeech.data.data_loader.AudioDataLoader(dataset, queue, batch_size, thread_id, pad_id)[source]¶
Audio Data Loader
- Parameters
dataset (SpectrogramDataset) – dataset for feature & transcript matching
queue (Queue.queue) – queue for threading
batch_size (int) – size of a batch
thread_id (int) – identification of the thread
pad_id (int) – identification of the pad token
class kospeech.data.data_loader.MultiDataLoader(dataset_list, queue, batch_size, num_workers, pad_id)[source]¶
Multi Data Loader using threads.
- Parameters
dataset_list (list) – list of datasets, one per worker
queue (Queue.queue) – queue for threading
batch_size (int) – size of a batch
num_workers (int) – number of worker threads
pad_id (int) – identification of the pad token
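A threading sketch, assuming the AudioDataLoader workers push padded batches onto the shared queue and that MultiDataLoader exposes start()/join(); dataset construction is elided (see SpectrogramDataset below):

import queue
from kospeech.data.data_loader import MultiDataLoader

batch_queue = queue.Queue(maxsize=8)

# trainset_list: a list of SpectrogramDataset objects, one per worker (placeholder)
loader = MultiDataLoader(trainset_list, batch_queue, batch_size=32,
                         num_workers=4, pad_id=0)
loader.start()
batch = batch_queue.get()  # blocks until a worker produces a batch
loader.join()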
class kospeech.data.data_loader.SpectrogramDataset(audio_paths: list, transcripts: list, sos_id: int, eos_id: int, config: omegaconf.dictconfig.DictConfig, spec_augment: bool = False, dataset_path: str = None, audio_extension: str = 'pcm')[source]¶
Dataset for feature & transcript matching
- Parameters
audio_paths (list) – list of audio paths
transcripts (list) – list of transcripts
sos_id (int) – identification of <start of sequence>
eos_id (int) – identification of <end of sequence>
spec_augment (bool) – flag indicating whether to use spec augmentation (default: False)
config (DictConfig) – set of configurations
dataset_path (str) – path of the dataset
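A construction sketch; the config keys that SpectrogramDataset actually reads are not documented on this page, so the OmegaConf object below is only a hypothetical stand-in:

from omegaconf import OmegaConf
from kospeech.data.data_loader import SpectrogramDataset

config = OmegaConf.create({'audio': {'sample_rate': 16000}})  # hypothetical keys

dataset = SpectrogramDataset(
    audio_paths=['sample1.pcm', 'sample2.pcm'],  # placeholder paths
    transcripts=['1 2 3', '4 5 6'],              # placeholder label-id strings
    sos_id=1,
    eos_id=2,
    config=config,
    spec_augment=False,
)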
kospeech.data.data_loader.split_dataset(config: omegaconf.dictconfig.DictConfig, transcripts_path: str, vocab: kospeech.vocabs.Vocabulary)[source]¶
Splits the dataset into a training set and a validation set.
- Parameters
config (DictConfig) – set of configurations
transcripts_path (str) – path of the transcripts
vocab (Vocabulary) – vocabulary of the dataset
- Returns: train_time_step, trainset_list, validset
train_time_step (int): number of time steps for training
trainset_list (list): list of training datasets
validset (data_loader.SpectrogramDataset): validation dataset
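Putting the pieces together, a sketch of how the return values might feed the loaders above; config and vocab must come from the training setup and are not constructed here, and the transcripts path is a placeholder:

from kospeech.data.data_loader import split_dataset

# config: an omegaconf DictConfig; vocab: a kospeech.vocabs.Vocabulary
train_time_step, trainset_list, validset = split_dataset(
    config, 'data/transcripts.txt', vocab)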