Cosmin Catalin Sanda
11 December 2018
Cosmin Catalin Sanda
Data Scientist and Engineer at AudienceProject
Blogging at https://cosminsanda.com
The ability to identify a speaker based on the sound of their voice
Which of these voices was used for a given recording?
Salli
Kimberly
Kendra
Joanna
Ivy
Matthew
Justin
Joey
import random
from contextlib import closing

import boto3

client = boto3.client("polly")
voices = ["Ivy", "Joanna", "Joey", "Justin",
          "Kendra", "Kimberly", "Matthew", "Salli"]
# Synthesize the sentence with a randomly chosen voice
response = client.synthesize_speech(
    OutputFormat="mp3",
    Text="Polly wants a cracker",
    TextType="text",
    VoiceId=random.choice(voices)
)
# Stream the returned audio into a local mp3 file
with open("out.mp3", "wb") as out:
    with closing(response["AudioStream"]) as stream:
        out.write(stream.read())
[Diagram: text -> text-to-speech -> sound -> image]
[Figure: three samples each for the voices Joanna and Kimberly]
[Figure (simplification): original image -> numerical representation -> filtered output]
"Quickly" going through the data processing steps, we have the following:
- Download a list of short sentences from a publicly available dataset.
- Use Amazon Polly to get the spoken versions of the sentences in mp3 format.
- Convert the mp3 to wav files.
- Use a window interval of about 3 seconds from each file to generate the spectrogram.
- Convert the spectrogram image to a numerical representation.
- Make sure there are three stratified datasets: training, validation, and test.
- Store the training and validation numerical arrays in Python pickles.
- Upload pickles to S3.
- Set the test data aside.
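The stratified split from the list above can be sketched as follows. This is only an illustration using the standard library: the function name `stratified_split`, the 80/10/10 fractions, and the ten recordings per voice are assumptions, not values from the talk.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return three lists of sample indices: train, validation, test.

    Each label is split separately, so every voice keeps roughly the
    same share in all three sets. The fractions are assumed values.
    """
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train, validation, test = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = int(len(indices) * fractions[0])
        n_validation = int(len(indices) * fractions[1])
        train += indices[:n_train]
        validation += indices[n_train:n_train + n_validation]
        # Whatever is left over for this label goes to the test set
        test += indices[n_train + n_validation:]
    return train, validation, test


# Hypothetical example: 8 voices with 10 recordings each
voices = ["Ivy", "Joanna", "Joey", "Justin",
          "Kendra", "Kimberly", "Matthew", "Salli"]
labels = [voice for voice in voices for _ in range(10)]
train, validation, test = stratified_split(labels)
```

Splitting per label rather than over the whole dataset keeps a rare voice from ending up entirely in one of the sets.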
from mxnet.gluon.nn import MaxPool2D, Sequential
from mxnet.gluon.nn import Conv2D, Dense, Dropout
net = Sequential()
with net.name_scope():
    net.add(Conv2D(channels=32, kernel_size=(3, 3),
                   padding=0, activation="relu"))
    net.add(Conv2D(channels=32, kernel_size=(3, 3),
                   padding=0, activation="relu"))
    net.add(MaxPool2D(pool_size=(2, 2)))
    net.add(Dropout(.25))
    net.add(Dense(8))
- channels: number of filters. Used to balance between underfitting and overfitting.
- kernel_size: dimension of the convolution window.
- activation="relu": Rectified Linear Unit (y = max(x, 0)).
- MaxPool2D: reduces the spatial size of the input, which helps reduce both overfitting and computation.
- Dropout: regularization layer. A fraction of the neurons is randomly deactivated during training.
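To make the effect of the convolution and pooling layers on the spatial size concrete, here is a small worked calculation. The 128x128 input size is a hypothetical value chosen for illustration; the talk does not state the spectrogram dimensions. The pooling helper assumes the stride equals the pool size, which is the Gluon default for `MaxPool2D`.

```python
def conv_output(size, kernel, padding=0, stride=1):
    """Output length of one spatial dimension after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output(size, pool):
    """Output length after max pooling with stride equal to the pool size."""
    return (size - pool) // pool + 1


# Hypothetical input: a 128x128 spectrogram image
height = width = 128

# Two 3x3 convolutions without padding each trim 2 pixels per dimension
for _ in range(2):
    height = conv_output(height, kernel=3)
    width = conv_output(width, kernel=3)

# One 2x2 max pooling halves each dimension
height = pool_output(height, pool=2)
width = pool_output(width, pool=2)

# 32 channels are flattened before the final Dense(8) layer
flattened = 32 * height * width
print(height, width, flattened)
```

Under these assumptions the final dense layer sees 32 feature maps of 62x62 values each.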
import mxnet as mx
from mxnet.initializer import Xavier  # [Glorot and Bengio 2010]
ctx = mx.gpu()
net.collect_params().initialize(Xavier(magnitude=2.24), ctx=ctx)
from mxnet.gluon.loss import SoftmaxCrossEntropyLoss
loss = SoftmaxCrossEntropyLoss()
from mxnet.gluon import Trainer
trainer = Trainer(net.collect_params(), optimizer="adam")
- SoftmaxCrossEntropyLoss: relies on the predicted probabilities to compute a score in classification problems.
- adam: an SGD-flavoured optimizer.
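As a rough illustration of what a softmax cross-entropy loss computes for a single sample, here is a plain-Python sketch. It is not the MXNet implementation; the helper names and the example scores are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw network outputs into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(scores, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(scores)[label])


# Hypothetical outputs for the 8 voices; the correct voice has index 3
undecided = [0.0] * 8
confident = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0]

loss_undecided = softmax_cross_entropy(undecided, label=3)
loss_confident = softmax_cross_entropy(confident, label=3)
```

An undecided network that spreads its probability evenly over the 8 voices gets a loss of log(8), while a network that is confident about the correct voice gets a loss close to 0.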
from mxnet import autograd

epochs = 5
for e in range(epochs):
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(ctx)
        label = label.as_in_context(ctx)
        with autograd.record():
            output = net(data)
            loss_result = loss(output, label)
        loss_result.backward()
        trainer.step(batch_size)
- epochs: how many times I go through the whole training data.
- data: a (batch_size, channels, height, width) NDArray.
- net(data): forward pass.
- trainer.step(batch_size): update weights.
epochs = 5
for e in range(epochs):
    moving_loss = 0
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(ctx)
        label = label.as_in_context(ctx)
        with autograd.record():
            output = net(data)
            loss_result = loss(output, label)
        loss_result.backward()
        trainer.step(batch_size)
    validation_acc = measure_performance(net, ctx, validation_data)
    train_acc = measure_performance(net, ctx, train_data)
    print("Epoch {}. Train_acc {}, Test_acc {}"
          .format(e, train_acc, validation_acc))
from mxnet import nd

def measure_performance(model, ctx, data_iter):
    acc = mx.metric.Accuracy()
    for _, (data, labels) in enumerate(data_iter):
        data = data.as_in_context(ctx)
        labels = labels.as_in_context(ctx)
        output = model(data)
        # The predicted class is the index of the highest output score
        predictions = nd.argmax(output, axis=1)
        acc.update(preds=predictions, labels=labels)
    return acc.get()[1]
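The accuracy measured above boils down to taking the arg-max of each output row and counting matches with the labels. A minimal plain-Python sketch of that idea, with made-up scores and labels for three classes, could look like this:

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(outputs, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in outputs]
    correct = sum(1 for p, l in zip(predictions, labels) if p == l)
    return correct / len(labels)


# Hypothetical network outputs for four samples and three classes
outputs = [
    [0.1, 2.0, 0.3],  # predicted class 1
    [1.5, 0.2, 0.1],  # predicted class 0
    [0.0, 0.1, 0.9],  # predicted class 2
    [0.3, 0.2, 0.1],  # predicted class 0
]
labels = [1, 0, 2, 1]

result = accuracy(outputs, labels)
```

Three of the four predictions match their labels here, so `result` is 0.75.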
Epoch 0. Loss: 1.19020674213, Train_acc 0.927615951994, Test_acc 0.924924924925
Epoch 1. Loss: 0.0955917794597, Train_acc 0.910488811101, Test_acc 0.904904904905
Epoch 2. Loss: 0.0780380586131, Train_acc 0.982872859107, Test_acc 0.967967967968
Epoch 3. Loss: 0.0515212092374, Train_acc 0.987123390424, Test_acc 0.95995995996
Epoch 4. Loss: 0.0513322874282, Train_acc 0.995874484311, Test_acc 0.978978978979
===== Job Complete =====
Billable seconds: 337