Recover audio using linear regression

In this example, we will use linear regression to recover or ‘fill out’ a completely deleted portion of an audio file!
For this, we use the Free Spoken Digit Dataset (FSDD), an audio dataset put together by Zohar Jackson:

cleaned-up audio samples (no dead space, roughly the same length, same bitrate, same samples-per-second rate, same speaker, etc.) ready for machine learning.

You can follow along with the associated notebook in GitHub.

get the data

import os
import scipy.io.wavfile as wavfile

zero = []  # will hold the raw samples of each "zero" recording
directory = "../datasets/free-spoken-digit-dataset-master/recordings/"
for fname in os.listdir(directory):
    if fname.startswith("0_jackson"):  # only the digit "zero" spoken by Jackson
        fullname = os.path.join(directory, fname)
        sample_rate, data = wavfile.read(fullname)  # data is a 1D array of samples
        zero.append(data)

There are 500 recordings in the dataset, 50 of each digit; here we load only the 50 recordings of the digit zero spoken by Jackson.
Each .wav file is actually just a sequence of numeric samples, “sampled” from the analog signal. Sampling is a type of discretisation.
In our case the ‘audio samples’ are the actual ‘features’ of the audio file.
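
As a quick sanity check (a minimal sketch reusing the sample_rate and data variables left over from the loop above), you can inspect what one of these recordings looks like:

# sample_rate is in samples per second (Hz); data is the raw 1D array of samples
print("Sample rate:", sample_rate)
print("Samples in the last clip read:", data.shape[0])
print("Sample dtype:", data.dtype)  # 16-bit integers for these recordings
print("Clip duration in seconds:", data.shape[0] / sample_rate)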

The goal of this notebook is to use multi-target linear regression to generate, by extrapolation, the missing portion of the test audio file.

Each missing audio sample will be the output of an equation that is a function of the provided portion of the audio samples:

missing_samples = f (provided_samples)
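
To see what multi-target regression means in practice, here is a minimal sketch on toy arrays (completely unrelated to the audio data): scikit-learn’s LinearRegression natively handles a y with several columns, fitting one linear equation per output column.

import numpy as np
from sklearn.linear_model import LinearRegression

# toy data: 4 observations, 3 input features, 2 output targets
X_toy = np.array([[0, 1, 2],
                  [1, 2, 3],
                  [2, 3, 4],
                  [3, 4, 5]])
y_toy = np.array([[3, 0],
                  [4, 1],
                  [5, 2],
                  [6, 3]])

toy_model = LinearRegression().fit(X_toy, y_toy)
print(toy_model.predict([[4, 5, 6]]))  # one predicted value per output column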

prepare the data

Convert the data read from the files into a DataFrame and set the dtype to np.int16, since the input audio files use 16 bits per sample. This is important: otherwise the produced audio samples would be encoded as 64 bits per sample and the generated clip would not play back correctly.

import numpy as np
import pandas as pd

zeroDF = pd.DataFrame(zero, dtype=np.int16)
zeroDF.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Columns: 6273 entries, 0 to 6272
dtypes: float64(2186), int16(4087)
memory usage: 1.2 MB

Since these audio clips are unfortunately not length-normalised, we simply hard-chop them so that they all have the same length.
Since Pandas inserted NaNs wherever needed to make zero a perfectly rectangular [n_observed_samples, n_audio_samples] array, we drop every column that contains a NaN (dropna with axis=1). Then we convert the data back into an NDArray using .values:

if zeroDF.isnull().values.any():
    print("Preprocessing data: dropping all NaN")
    zeroDF.dropna(axis=1, inplace=True)  # drop every column containing a NaN
else:
    print("Preprocessing data: No NaN found!")

zero = zeroDF.values  # back to a NumPy array
Preprocessing data: dropping all NaN
n_audio_samples = zero.shape[1]
n_audio_samples
4087

split the data into training and testing sets

There are 50 takes of the ‘zero’ clip. We want to pull out just one of them, randomly, and that one will NOT be used in the training of the model. In other words, the one file we’ll be testing / scoring on will be an unseen sample, independent of the rest of the training set.

from sklearn.utils.validation import check_random_state

rng = check_random_state(7)              # seeded RNG for reproducibility
random_idx = rng.randint(zero.shape[0])  # index of the recording held out for testing

test  = zero[random_idx]                       # the test sample
train = np.delete(zero, [random_idx], axis=0)  # the remaining 49 recordings

print(train.shape)
print(test.shape)
(49, 4087)
(4087,)

Save the original ‘test’ clip, the one we’re about to delete half of, so that we can compare it to the ‘patched’ clip once we have generated it.
This assumes the sample rate is the same for all recordings.

wavfile.write('../outputs/OriginalTestClip.wav', sample_rate, test)

You can get the audio files in GitHub.

carve out the labels Y

The data will have two parts: X and y (the true labels). X is going to be the first portion of the audio file, which we will be providing the computer as input (the “chopped” audio).
The “label”, y, is going to be the remaining portion of the audio file.

In this way the computer will use linear regression to derive the missing portion of the sound file based on the training data it has received.
Provided_Portion is the fraction of the audio file that will be provided to the model; the remaining part of the file will be generated via linear extrapolation.

Provided_Portion = 0.5 # let's delete half of the audio

test_samples = int(Provided_Portion * n_audio_samples)
X_test = test[0:test_samples] # first ones
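
If you want to listen to the truncated clip yourself, you can write it out as a WAV file as well; here is a minimal sketch (the output file name is just an example, not necessarily the one used in the repository):

# X_test is still a 1D int16 array at this point, so wavfile.write accepts it directly
wavfile.write('../outputs/TruncatedTestClip.wav', sample_rate, X_test)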

You can get the audio files in GitHub.
Can you hear it? Now it’s only the first syllable, “ze” …
But we can delete even more and leave only the first quarter!

Provided_Portion = 0.25 # let's delete three quarters of the audio!

test_samples = int(Provided_Portion * n_audio_samples)
X_test = test[0:test_samples] # first ones

Almost unrecognisable. Will the linear regression model be able to reconstruct the audio?

y_test = test[test_samples:] # remaining audio part is the label

Repeat the same process for X_train and y_train.

X_train = train[:, 0:test_samples] # first ones: data
y_train = train[:, test_samples:]  # remaining ones: label

SciKit-Learn gets mad if you don’t supply your training data in the form of a 2D array: [n_samples, n_features]. So if you only have one sample, as is the case with X_test and y_test, calling .reshape(1, -1) turns [n_features] into [1, n_features].

X_test = X_test.reshape(1,-1)
y_test = y_test.reshape(1,-1)

Create and train the linear regression model

from sklearn import linear_model

model = linear_model.LinearRegression()
model.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Use the model to predict the ‘label’ of X_test.
SciKit-Learn will use float64 to generate the predictions, so let’s cast those values back to int16:

y_test_prediction = model.predict(X_test)
y_test_prediction = y_test_prediction.astype(dtype=np.int16)

Evaluate the result

score = model.score(X_test, y_test) # test samples X and true values for X
print ("Extrapolation R^2 Score: ", score)
Extrapolation R^2 Score: 0.0

Obviously, if you look only at the R^2 score, it seems to be a totally useless result.
But let’s listen to the generated audio.

First, take the first Provided_Portion of the test clip, the part you fed into your linear regression model. Then stitch that together with the ‘abomination’ the model generated for you, and save the completed audio clip:

completed_clip = np.hstack((X_test, y_test_prediction))
wavfile.write('../outputs/ExtrapolatedClip.wav', sample_rate, completed_clip[0])

Again, you can listen to it from the GitHub repository. Well, it is not bad!
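
If you would like a numeric comparison to go along with your ears, one option (not shown in the original notebook) is to measure the mean absolute error between the predicted samples and the true ones, assuming y_test and y_test_prediction from above:

# mean absolute difference between predicted and true samples, in raw int16 amplitude units
# (cast to int64 first to avoid overflow when subtracting int16 values)
mae = np.abs(y_test.astype(np.int64) - y_test_prediction.astype(np.int64)).mean()
print("Mean absolute error per sample:", mae)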
