Vincent Ogloblinsky - @vogloblinsky
Compodoc maintainer
Google Developer Expert on Web Technologies
Software architect / Open-Source referent
- guiding his daughter by pointing at the syllables of words with a finger
- guiding and correcting oral deciphering
- working in a professional context "dealing with voice" (Orange - Data IA)
- imagining an application built on an adapted "speech to text" engine plus a good dose of interactivity
- doing a market analysis and realizing that no such application exists
Perfect! A new technical challenge in the bag for the geek dad 😀
Reading help
Web application
🗣️
Child's voice
Machine learning
Speech to text
100% "web" technologies
- JavaScript
- WebGL and/or WebAssembly
Offline & privacy by design
- no API calls possible
- no identification of the child
1 - Awareness of spoken sounds
2 - Awareness of the link between oral and written
3 - The discovery of the alphabet composed of 26 letters
4 - Understand the association “sounds and letters”
5 - Understanding Syllabic Fusion
6 - Recognize words
7 - Understand texts
"château"
ch makes the sound "chhh"
Mental synthesis skill: blending the sound of a consonant with the sound of a vowel
/p/ and /a/ → pa
The child needs to know that language is segmented into words and also into smaller sound segments: phonemes and syllables (phoneme fusion)
a makes the sound "aa"
t makes the sound "ttt"
eau makes the sound "ooo"
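The decomposition of "château" above can be sketched as a greedy longest-match split into graphemes. This is only an illustration of the idea, not the application's code: the table holds just the graphemes shown on the slide, and treating "â" like "a" is an assumption made for this example.

```python
# Grapheme -> spoken sound, taken from the slide examples.
# "â" is an assumption: it is treated like "a" for this illustration.
GRAPHEME_SOUNDS = {
    "ch": "chhh",
    "a": "aa",
    "â": "aa",
    "t": "ttt",
    "eau": "ooo",
}


def split_graphemes(word, table=GRAPHEME_SOUNDS):
    """Greedy longest-match split of a word into known graphemes."""
    longest = max(len(g) for g in table)
    parts = []
    i = 0
    while i < len(word):
        # Try the longest candidate first so "eau" wins over "e" + "a" + "u".
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in table:
                parts.append(chunk)
                i += size
                break
        else:
            raise ValueError(f"no known grapheme at position {i} in {word!r}")
    return parts


graphemes = split_graphemes("château")
sounds = [GRAPHEME_SOUNDS[g] for g in graphemes]
```

Trying the longest grapheme first is what keeps "eau" together as a single sound instead of three separate letters.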
26 letters in the alphabet
36 phonemes
vowels: [a] (table, patte), [é] (éléphant, parler), [o] (bonnet, chaud), ...
semi-vowels: [J] (fille, rail), ...
consonants: [b] (billets, abbé), [g] (gâteau, aggraver), ...
190 graphemes
[o] : o, au, eau
[k] : c, qu (coque)
Subcategory of "Artificial Intelligence"
Algorithms discovering "patterns" in datasets
4 steps:
- select and prepare the data
- select the algorithm to apply
- train the algorithm (the result is the model)
- use the model (and improve it)
3 main types of machine learning
- supervised learning: labeled data - task driven (expensive)
- unsupervised learning: unlabeled data - data driven (autonomous search for patterns)
- reinforcement learning: the algorithm learns from its mistakes to achieve an objective
Also called "Automatic Speech Recognition (ASR)"
Service 1
Service 2
Service 3
Service 4
Service 5
Service 6
Current voice assistants are "trained" on "adult" datasets
Children's voices are acoustically different: higher-pitched, narrower vocal tract, smaller vocal cords; in short, they are still "growing up"
Richer spectrum
Few voice datasets
2 possible approaches: "from scratch" or by "transfer learning"
- from scratch:
  Advantage:
  - full control over the model
  Drawback:
  - requires a lot of data
- transfer learning:
  Advantage:
  - benefits from the initial training of the model
  Drawback:
  - less control over the model
Simpler use case than an ASR
🔘 Speech commands dataset (www.tensorflow.org/datasets/catalog/speech_commands)
- proposed by Google in 2017
- 65,000 one-second recordings of 30 short words, spoken by thousands of people
🔘 Using TensorFlow as a Machine Learning Framework
🔘 "Training" locally (Python) then "export"
Developed by Google Brain
Released in 2017 as v1.0.0 (current: 2.8.0)
Uses GPU and WebGL APIs "under the hood"
Data gathering web interface
- simple syllable set (20)
- receiving wav files
- no information collected on the child (age, gender)
Data cleaning web interface
- one sound per syllable per child
- shortening to 1s
- cleaning of parasitic sounds (uh, ...)
+ augmentation (pitch variation)
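The "shortening to 1s" and pitch-variation steps could look roughly like the sketch below. It is not the code used in the project: the 16 kHz sample rate is an assumption (the slides do not state one), the function names are hypothetical, and the pitch change is a naive resampling that also changes the duration of the sound.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed sample rate; the slides do not state one


def fit_to_one_second(samples, sample_rate=SAMPLE_RATE):
    """Trim or zero-pad a mono signal so it lasts exactly one second."""
    samples = np.asarray(samples, dtype=np.float32)
    if len(samples) >= sample_rate:
        return samples[:sample_rate]
    padding = np.zeros(sample_rate - len(samples), dtype=np.float32)
    return np.concatenate([samples, padding])


def shift_pitch(samples, factor, sample_rate=SAMPLE_RATE):
    """Naive pitch variation: resample by `factor`, then refit to one second.

    factor > 1 raises the pitch (and shortens the sound), factor < 1 lowers it.
    """
    samples = np.asarray(samples, dtype=np.float32)
    new_length = max(1, int(round(len(samples) / factor)))
    old_positions = np.arange(len(samples))
    new_positions = np.linspace(0, len(samples) - 1, new_length)
    resampled = np.interp(new_positions, old_positions, samples)
    return fit_to_one_second(resampled.astype(np.float32), sample_rate)


# Example: a 0.6 s, 440 Hz tone standing in for a recorded syllable.
t = np.arange(int(0.6 * SAMPLE_RATE)) / SAMPLE_RATE
syllable = np.sin(2 * np.pi * 440 * t).astype(np.float32)
clean = fit_to_one_second(syllable)
variants = [shift_pitch(syllable, f) for f in (0.9, 1.0, 1.1)]
```

Each recording then yields several training examples of identical length, which helps when few children's voice samples are available.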
1. Splitting the dataset
80% for training
10% for TensorFlow internal validation
10% for testing
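An 80/10/10 split like this one can be sketched with the standard library alone. The file names are hypothetical placeholders, and the fixed seed is simply a choice that makes the split reproducible.

```python
import random


def split_dataset(items, train=0.8, validation=0.1, seed=42):
    """Shuffle items and split them into train / validation / test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains goes to the test set
    )


# Hypothetical file names standing in for the recorded syllables.
files = [f"syllable_{i:03d}.wav" for i in range(200)]
train_files, val_files, test_files = split_dataset(files)
```

Shuffling before slicing matters here: recordings are often stored grouped by child or by syllable, and an unshuffled split would leave some classes out of the test set.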
2. Inspection of some spectrograms
3. Loading the base model
4. Freezing all layers of the model except the last one
for layer in model.layers[:-1]:
    layer.trainable = False
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy", metrics=["acc"])
Print layers information
5. Training: ~5 min
6. Checking the loss function
Difference between the predictions made by the neural network and the actual values of the observations used during learning
[Plot: loss against iteration]
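To make the loss concrete, here is a small NumPy sketch of what a sparse categorical cross-entropy (the loss named in the compile call) computes: the average negative log-probability that the model assigns to the correct class. The labels and probabilities are made-up numbers, and the `eps` clipping is a detail added for this sketch, not a claim about TensorFlow's internals.

```python
import numpy as np


def sparse_categorical_crossentropy(labels, probabilities, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    probabilities = np.clip(np.asarray(probabilities, dtype=np.float64), eps, 1.0)
    labels = np.asarray(labels)
    # Pick, for each sample, the probability predicted for its true class.
    true_class_probs = probabilities[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(true_class_probs)))


# Three samples, three hypothetical syllable classes (made-up numbers).
labels = [0, 2, 1]
confident = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8], [0.05, 0.9, 0.05]]
hesitant = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.4, 0.3]]

loss_confident = sparse_categorical_crossentropy(labels, confident)
loss_hesitant = sparse_categorical_crossentropy(labels, hesitant)
```

The confident predictions give a lower loss than the hesitant ones, which is the downward trend one hopes to see as training iterations go by.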
7. Accuracy check
Accuracy measures the share of predictions that match the true label, across all classes.
[Plot: accuracy against iteration]
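As a minimal illustration of the accuracy metric, the sketch below counts how many predicted labels match the true ones. The syllable labels are invented for the example.

```python
import numpy as np


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that are equal to the true label."""
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    return float(np.mean(true_labels == predicted_labels))


# Invented labels: one of the five predictions is wrong.
true_labels = ["pa", "ta", "pa", "ma", "ta"]
predicted = ["pa", "ta", "ta", "ma", "ta"]
score = accuracy(true_labels, predicted)
```

Here four predictions out of five are correct, so the score is 0.8.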
8. Confusion matrix display
9. Check with additional labelled test files
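A confusion matrix can be built in a few lines; the sketch below is an assumption about the general technique, not the notebook's code. Rows are the true syllables, columns the predicted ones, and the labels are invented.

```python
import numpy as np


def confusion_matrix(true_labels, predicted_labels, classes):
    """Count, for each true class (row), how often each class was predicted (column)."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = np.zeros((len(classes), len(classes)), dtype=int)
    for truth, guess in zip(true_labels, predicted_labels):
        matrix[index[truth], index[guess]] += 1
    return matrix


# Invented labels for three hypothetical syllable classes.
classes = ["ma", "pa", "ta"]
true_labels = ["pa", "ta", "pa", "ma", "ta", "pa"]
predicted = ["pa", "ta", "ta", "ma", "ta", "pa"]
matrix = confusion_matrix(true_labels, predicted, classes)
```

Off-diagonal cells show which syllables get mistaken for which: in this made-up data, one "pa" was heard as "ta".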
import json
import os

import tensorflowjs as tfjs

# model, commands and now (a datetime) are assumed to be defined in the earlier steps.

# Convert the model to TensorFlow.js Layers model format.
tfjs_model_dir = "./thot-model-tfjs-1"
tfjs.converters.save_keras_model(model, tfjs_model_dir)

# Create the metadata.json file.
metadata = {
    "words": list(commands),
    "frameSize": model.input_shape[-2],
    "generated_at": now.strftime("%Y-%m-%d-%H:%M:%S")
}
with open(os.path.join(tfjs_model_dir, "metadata.json"), "w") as f:
    json.dump(metadata, f)
4.1 MB
1.6 MB
@tensorflow-models/speech-commands: JavaScript package for driving the model
import * as tf from '@tensorflow/tfjs-core';
import * as tfl from '@tensorflow/tfjs-layers';
import * as speechCommands from '@tensorflow-models/speech-commands';
const recognizer = speechCommands.create(
  'BROWSER_FFT',
  null,
  'http://test.com/my-audio-model/model.json',
  'http://test.com/my-audio-model/metadata.json'
);
Continuous listening
getUserMedia API
setInterval
~ 1s
Retrieving audio frequencies
Creation of the spectrogram
Sending to the TensorFlow model
Retrieving predictions
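In the application this loop runs in the browser. The Python sketch below only illustrates two of its steps offline: turning about one second of audio into a spectrogram, and turning the model's scores into a decision. The sample rate, frame size, hop size and confidence threshold are all assumptions, and the scores are made up in place of a real model output.

```python
import numpy as np


def spectrogram(samples, frame_size=1024, hop=512):
    """Magnitude spectrogram: one FFT per overlapping, windowed frame."""
    samples = np.asarray(samples, dtype=np.float64)
    window = np.hanning(frame_size)
    starts = range(0, len(samples) - frame_size + 1, hop)
    frames = np.stack([samples[s:s + frame_size] * window for s in starts])
    # Result shape: (number of frames, frame_size // 2 + 1 frequency bins).
    return np.abs(np.fft.rfft(frames, axis=1))


def top_prediction(scores, labels, threshold=0.75):
    """Return the most likely label, or None if the model is not confident."""
    best = int(np.argmax(scores))
    return labels[best] if scores[best] >= threshold else None


sample_rate = 16_000  # assumed; the slides do not state the sample rate
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)  # 1 kHz test tone instead of a real voice
spec = spectrogram(tone)
peak_bin = int(np.argmax(spec.mean(axis=0)))
peak_hz = peak_bin * sample_rate / 1024

# Hypothetical scores, standing in for the model output.
guess = top_prediction(np.array([0.05, 0.9, 0.05]), ["ma", "pa", "ta"])
```

Returning `None` below the threshold lets the application stay silent, rather than guess, when the child's pronunciation is ambiguous.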
Model scaling with crowdsourcing
Adaptation layer on the application side: correction, guidance
Detection of phonological dyslexia
Gamification of the child's learning path
Customization of the model to the voice of the child (on-device)
A great technical adventure
Exciting and growing ML domain (OpenAI, etc.)
"Test, fail & learn" approach, perfect for this side project
Questions?
Slides: bit.ly/3uBPDYR
Picture credits: Unsplash.com