In this example, we will use linear regression to recover, or ‘fill in’, a completely deleted portion of an audio file!
For this, we use the FSDD, the Free Spoken Digit Dataset, an audio dataset put together by Zohar Jackson:
cleaned-up audio samples (no dead space, roughly the same length, same bitrate, same samples-per-second rate, same speaker, etc.) ready for machine learning.
You can follow along with the associated notebook on GitHub.
get the data
import os
import scipy.io.wavfile as wavfile

zero = []
directory = "../datasets/free-spoken-digit-dataset-master/recordings/"

# Collect every recording of the digit zero spoken by Jackson
for fname in os.listdir(directory):
    if fname.startswith("0_jackson"):
        fullname = os.path.join(directory, fname)
        sample_rate, data = wavfile.read(fullname)
        zero.append(data)
There are 500 recordings, 50 of each digit.
Each .wav file is actually just a bunch of numeric samples, “sampled”
from the analog signal. Sampling is a type of discretisation.
In our case the ‘audio samples’ are the actual ‘features’ of the audio file.
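For instance, you can inspect one of the clips we just loaded (a quick sketch; it assumes the loading loop above has already run, so sample_rate and zero are defined):

print(sample_rate)     # samples per second of the recording
print(zero[-1].dtype)  # int16: 16 bits per sample
print(zero[-1].shape)  # number of samples in this clip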
The goal of this notebook is to use multi-target linear regression to generate by extrapolation the missing portion of the test audio file.
Each audio sample in the missing portion will be the output of an equation which is a function of the provided portion of the audio samples:

missing_samples = f(provided_samples)
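Scikit-learn’s LinearRegression supports this multi-target setup natively: when fitted with a 2D y, it solves one ordinary-least-squares problem per output column. A minimal sketch with made-up toy shapes:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy shapes: 10 clips, 4 provided samples, 3 missing samples
X = np.random.randn(10, 4)   # provided_samples
Y = np.random.randn(10, 3)   # missing_samples

toy = LinearRegression().fit(X, Y)
print(toy.coef_.shape)       # (3, 4): one weight vector per missing sample
print(toy.intercept_.shape)  # (3,): one bias per missing sample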
prepare the data
Convert the data read from the file into a DataFrame and set the dtype to np.int16, since the input audio files are 16 bits per sample. This is important: otherwise the produced audio samples would be encoded as 64 bits per sample and the clip would be too short.
import numpy as np
import pandas as pd

zeroDF = pd.DataFrame(zero, dtype=np.int16)
zeroDF.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Columns: 6273 entries, 0 to 6272
dtypes: float64(2186), int16(4087)
memory usage: 1.2 MB
Since these audio clips are unfortunately not length-normalised, we simply hard-chop them all to the same length.
Pandas will have inserted NaNs wherever needed to make zero a perfectly rectangular [n_observed_samples, n_audio_samples] array, so we drop the NaN-containing columns (axis=1) here. Then we convert the data back into a NumPy array using .values:
if zeroDF.isnull().values.any():
    print("Preprocessing data: dropping all NaN")
    zeroDF.dropna(axis=1, inplace=True)
else:
    print("Preprocessing data: No NaN found!")

zero = zeroDF.values  # this is a NumPy array
Preprocessing data: dropping all NaN
n_audio_samples = zero.shape[1]
n_audio_samples
4087
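As a sanity check, the FSDD recordings are sampled at 8 kHz, so 4087 samples is roughly half a second of audio:

# Clip duration in seconds; sample_rate still holds the value from the loading loop
print(n_audio_samples / sample_rate)  # ~0.51 s at 8000 Hz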
split the data into training and testing sets
There are 50 takes of each clip. We want to pull out just one of them, randomly, and that one will NOT be used in the training of the model. In other words, the file we’ll be testing / scoring on will be an unseen sample, independent of the rest of the training set.
from sklearn.utils.validation import check_random_state

rng = check_random_state(7)
random_idx = rng.randint(zero.shape[0])

test = zero[random_idx]                        # the test sample
train = np.delete(zero, [random_idx], axis=0)  # everything else

print(train.shape)
print(test.shape)
(49, 4087)
(4087,)
Save the original ‘test’ clip, the one we’re about to delete half of, so that we can compare it to the ‘patched’ clip once we have generated it.
This assumes the sample rate is the same for all samples (sample_rate still holds the value read from the last file).
wavfile.write('../outputs/OriginalTestClip.wav', sample_rate, test)
You can get the audio files on GitHub.
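If you are following along in a Jupyter notebook, you can also listen to the clip inline; a small sketch using IPython’s Audio widget:

from IPython.display import Audio

# Render an inline audio player for the original test clip
Audio(test, rate=sample_rate)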
carve out the labels Y
The data will have two parts: X and y (the true labels). X is going to be the first portion of the audio file, which we will provide to the computer as input (the “chopped” audio).
The “label”, y, is going to be the remaining portion of the audio file.
In this way the computer will use linear regression to derive the missing portion of the sound file based off of the training data it has received.
Provided_Portion is the fraction of the audio file that will be provided. The remaining part of the file will be generated via linear extrapolation.
Provided_Portion = 0.5  # let's delete half of the audio

test_samples = int(Provided_Portion * n_audio_samples)
X_test = test[0:test_samples]  # first ones
You can get the audio files on GitHub.
Can you hear it? Now it’s only the first syllable, “ze” …
But we can delete even more and leave only the first quarter!
Provided_Portion = 0.25  # let's delete three quarters of the audio!

test_samples = int(Provided_Portion * n_audio_samples)
X_test = test[0:test_samples]  # first ones
Almost unrecognisable. Will the linear regression model be able to reconstruct the audio?
y_test = test[test_samples:] # remaining audio part is the label
Repeat the same process for X_train and y_train.
X_train = train[:, 0:test_samples]  # first ones: data
y_train = train[:, test_samples:]   # remaining ones: label
SciKit-Learn gets mad if you don’t supply your training data in the form of a 2D array: [n_samples, n_features]. So if you have only one sample, as is our case with X_test and y_test, calling .reshape(1, -1) turns [n_features] into [1, n_features].
X_test = X_test.reshape(1, -1)
y_test = y_test.reshape(1, -1)
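A quick sanity check on the shapes: with Provided_Portion = 0.25, test_samples is 1021 and the label covers the remaining 3066 samples.

print(X_train.shape, y_train.shape)  # (49, 1021) (49, 3066)
print(X_test.shape, y_test.shape)    # (1, 1021) (1, 3066)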
Create and train the linear regression model
from sklearn import linear_model

model = linear_model.LinearRegression()
model.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
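The fitted model is just a big weight matrix plus a bias vector, one row per missing output sample, which is exactly the f(provided_samples) sketched earlier:

print(model.coef_.shape)       # (3066, 1021): one weight vector per output sample
print(model.intercept_.shape)  # (3066,): one bias per output sample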
Use the model to predict the ‘label’ of X_test.
SciKit-Learn will use float64 to generate the predictions, so let’s take those values back to int16:
y_test_prediction = model.predict(X_test)
y_test_prediction = y_test_prediction.astype(dtype=np.int16)
Evaluate the result
score = model.score(X_test, y_test)  # score the prediction for X_test against the true labels y_test
print("Extrapolation R^2 Score: ", score)
Extrapolation R^2 Score: 0.0
Obviously, if you look only at R-squared, the result seems totally useless. Part of the story, though, is that the score is degenerate here: with a single test clip, each output sample has only one true value, so its variance is zero and scikit-learn falls back to a score of 0 in that case.
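Before listening, a rough numeric cross-check that stays meaningful for a single clip is the root-mean-square error in raw 16-bit sample units (a hypothetical extra check, not part of the original exercise):

import numpy as np

# RMSE between the true and predicted missing samples, in int16 units
rmse = np.sqrt(np.mean((y_test.astype(np.float64) - y_test_prediction) ** 2))
print("RMSE:", rmse)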
But let’s listen to the generated audio.
First, take the initial Provided_Portion of the test clip, the part you fed into your linear regression model. Then stitch that together with the abomination the predictor model generated for you, and save the completed audio clip:
completed_clip = np.hstack((X_test, y_test_prediction))
wavfile.write('../outputs/ExtrapolatedClip.wav', sample_rate, completed_clip[0])
Again, you can listen to it in the GitHub repository. Well, it is not bad!