How I helped my daughter read with machine learning
📖 👨👧
Vincent Ogloblinsky - @vogloblinsky
Vincent Ogloblinsky
Compodoc maintainer
Google Developer Expert on Web Technologies
Software architect / Open-Source referent
Disclaimer
This talk is just a "technical" overview of machine learning from a developer's perspective.
I have no formal data science training. 😉
Some topics (e.g. model optimization) are not covered yet.
Agenda
1. Genesis of the idea
2. Learning to read
3. Machine learning
4. Speech to text
5. The construction of the model
6. Results and outlook
Genesis of the idea
Like any geek dad who handles the evening reading:
- guide his daughter by breaking words into syllables with a finger
- guide and correct her as she sounds the words out
- work in a professional context "dealing with voice" (Orange - Data IA)
- imagine an application built on an adapted "speech to text" engine plus a good dose of interactivity
- do a "market" analysis and realize that nothing like it exists
Perfect! A new technical challenge in the bag for the geek dad 😀
Genesis of the idea
Reading help
Web application
🗣️
Child's voice
Machine learning
Speech to text
Definition of "ready"
Let's enforce some technical constraints
100% "web" technologies
- JavaScript
- WebGL and/or WebAssembly
Offline & privacy by design
- no API calls possible
- no identification of the child
Learning to read
A 7-step process
1 - Become aware of spoken sounds
2 - Become aware of the link between spoken and written language
3 - Discover the 26-letter alphabet
4 - Understand the association between sounds and letters
5 - Understand syllabic fusion
6 - Recognize words
7 - Understand texts
Syllabic Fusion
"château"
ch makes "chhh"
a makes "aa"
t makes "ttt"
eau makes "ooo"
A mental synthesis skill: blending the sound of a consonant with the sound of a vowel
/p/ and /a/ → pa
The child needs to know that language is segmented into words and also into smaller sound units: phonemes and syllables (phoneme fusion)
Richness of the French language
26 letters in the alphabet
36 phonemes
vowels : [a] (table, patte), [é] (éléphant, parler), [o] (bonnet, chaud), ...
semi-vowels : [J] (fille, rail), ...
consonants : [b] (billets, abbé), [g] (gâteau, aggraver), ...
190 graphemes
[o] : o, au, eau
[k] : c, qu (coque)
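To make the grapheme idea concrete, here is a minimal sketch of splitting a word into graphemes by greedy longest match. The table is hypothetical and only holds the examples from these slides; mapping "â" to the same sound as "a" is an assumption made so that "château" can be processed as written.

```python
# Hypothetical, tiny grapheme-to-sound table: only the examples from the slides.
GRAPHEME_SOUNDS = {
    "ch": "chhh",
    "a": "aa",
    "â": "aa",   # assumption: treated like "a"
    "t": "ttt",
    "eau": "ooo",
    "au": "ooo",
    "o": "ooo",
}


def split_graphemes(word, table=GRAPHEME_SOUNDS):
    """Greedy longest-match segmentation of a word into known graphemes."""
    longest = max(len(grapheme) for grapheme in table)
    graphemes = []
    i = 0
    while i < len(word):
        # Try the longest candidate first so "eau" wins over "e" + "au".
        for size in range(min(longest, len(word) - i), 0, -1):
            candidate = word[i:i + size]
            if candidate in table:
                graphemes.append(candidate)
                i += size
                break
        else:
            raise ValueError(f"no known grapheme at position {i} in {word!r}")
    return graphemes


graphemes = split_graphemes("château")
sounds = [GRAPHEME_SOUNDS[g] for g in graphemes]
print(graphemes)
print(sounds)
```

A real table would need all 190 graphemes, and greedy matching alone would not resolve graphemes whose sound depends on context.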
Machine learning
AI at Orange
Machine learning
Subcategory of "Artificial Intelligence"
Algorithms discovering "patterns" in datasets
4 steps:
- select and prepare the data
- select the algorithm to apply
- train the algorithm (the result is the model)
- use the model (and improve it)
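The four steps can be sketched on a toy problem. This is only an illustration, not the approach used later in the talk: the feature vectors, the labels and the choice of a nearest-centroid algorithm are all assumptions made to keep the example small.

```python
# Step 1 - select and prepare data: hypothetical 2-D feature vectors with labels.
samples = [
    ((1.0, 1.2), "pa"),
    ((0.8, 1.0), "pa"),
    ((3.0, 3.1), "ta"),
    ((3.2, 2.9), "ta"),
]


# Step 2 - select the algorithm: nearest centroid (chosen here for brevity).
def train(labelled_samples):
    """Step 3 - training: the resulting 'model' is one mean vector per label."""
    sums, counts = {}, {}
    for features, label in labelled_samples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(value / counts[label] for value in total)
        for label, total in sums.items()
    }


def predict(model, features):
    """Step 4 - use: return the label whose centroid is closest."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


model = train(samples)
print(predict(model, (0.9, 1.1)))
print(predict(model, (2.9, 3.0)))
```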
Machine learning
3 main types of machine learning
- supervised learning: labeled data - task driven (expensive)
- unsupervised learning: unlabeled data - data driven (autonomous search for patterns)
- reinforcement learning: the algorithm learns from its mistakes to achieve an objective
Speech to text
Also called "Automatic Speech Recognition (ASR)"
Speech to text at Orange
Service 1
Service 2
Service 3
Service 4
Service 5
Service 6
"Speech to text" and children voices
Current voice assistants are "trained" on "adult" datasets
Children's voices are vocally richer: higher-pitched, with a narrower vocal tract and smaller vocal cords; in short, they are still "growing up"
Richer spectrum
Few children's voice datasets
Model building
2 possible approaches: "from scratch" or by "transfer learning"
- from scratch:
  Advantage:
  - full control over the model
  Drawback:
  - requires a lot of data
- transfer learning:
  Advantage:
  - benefits from the initial training of the model
  Drawback:
  - less control over the model
Transfer learning
Sound classification
A simpler use case than full ASR
Model building
🔘 Speech commands dataset (www.tensorflow.org/datasets/catalog/speech_commands)
- released by Google in 2017
- 65,000 one-second recordings of 30 short words spoken by thousands of people
🔘 TensorFlow as the machine learning framework
🔘 "Training" locally (Python), then "export"
TensorFlow
Developed by Google Brain
v1.0.0 released in 2017 (current version: 2.8.0)
TensorFlow
TensorFlow.js
Uses the GPU "under the hood" through WebGL APIs
Data gathering
Data gathering web interface
- a simple set of 20 syllables
- receives WAV files
- no information collected about the child (age, gender)
Data preparation
Data cleaning web interface
- one sound per syllable per child
- trimming to 1 s
- removal of stray sounds (uh, ...)
+ data augmentation (pitch variation)
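The trimming and pitch-variation steps can be sketched in plain Python on a list of audio samples. This is a simplified illustration, not the tooling used for the project: the sample rate, the pitch factors and the naive resampling method are all assumptions.

```python
SAMPLE_RATE = 16000  # assumption: samples per second of the recordings


def fit_to_one_second(samples, sample_rate=SAMPLE_RATE):
    """Trim or zero-pad a clip so that it lasts exactly one second."""
    clipped = list(samples[:sample_rate])
    return clipped + [0.0] * (sample_rate - len(clipped))


def change_pitch(samples, factor):
    """Naive pitch variation by linear-interpolation resampling.

    factor > 1 raises the pitch (and shortens the clip),
    factor < 1 lowers it (and lengthens the clip).
    """
    if len(samples) < 2:
        return list(samples)
    new_length = max(1, int(len(samples) / factor))
    result = []
    for n in range(new_length):
        position = n * factor
        left = int(position)
        right = min(left + 1, len(samples) - 1)
        fraction = position - left
        result.append(samples[left] * (1 - fraction) + samples[right] * fraction)
    return result


def augment(samples, factors=(0.9, 1.0, 1.1)):
    """Return one pitch-shifted, one-second clip per factor (factors are hypothetical)."""
    return [fit_to_one_second(change_pitch(samples, f)) for f in factors]
```

Resampling changes the duration along with the pitch, which is why each variant is fitted back to one second afterwards.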
Model training
1. Splitting the data
80% for training
10% for TensorFlow's internal validation
10% for testing
2. Inspection of some spectrograms
3. Loading the base model
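The 80/10/10 split can be sketched as follows. The file names and the fixed seed are hypothetical; the only thing taken from the slide is the ratio.

```python
import random


def split_dataset(items, train_ratio=0.8, validation_ratio=0.1, seed=42):
    """Shuffle, then split into training, validation and test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    train_end = int(len(shuffled) * train_ratio)
    validation_end = train_end + int(len(shuffled) * validation_ratio)
    return (
        shuffled[:train_end],
        shuffled[train_end:validation_end],
        shuffled[validation_end:],  # whatever remains is the test set
    )


files = [f"clip_{i:03d}.wav" for i in range(200)]  # hypothetical file names
train_files, validation_files, test_files = split_dataset(files)
print(len(train_files), len(validation_files), len(test_files))
```

Shuffling before splitting matters when the files are stored grouped by syllable or by child; otherwise a whole class could end up only in the test set.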
Model training
4. Freezing all layers of the model except the last one
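# Freeze every layer except the last one, which is the only one retrained.
for layer in model.layers[:-1]:
    layer.trainable = False
# Recompile so that the new trainable flags take effect.
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy", metrics=["acc"])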
Model training
Printing layer information
Model training
5. Training: about 5 min
Model training
6. Checking the loss function
Loss: the difference between the neural network's predictions and the actual values of the observations used during training
[Plot: loss per iteration]
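As a worked example of what this loss measures, here is a plain-Python sketch of sparse categorical cross-entropy, the loss named in the compile call earlier. It is not TensorFlow's implementation; the small clamp is an assumption added to avoid taking the logarithm of zero.

```python
import math


def sparse_categorical_crossentropy(true_labels, predicted_probabilities):
    """Mean of -log(probability assigned to the correct class)."""
    losses = [
        -math.log(max(probabilities[label], 1e-7))  # clamp avoids log(0)
        for label, probabilities in zip(true_labels, predicted_probabilities)
    ]
    return sum(losses) / len(losses)


# Two examples, three classes; the correct class is index 0, then index 2.
labels = [0, 2]
confident = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]
unsure = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]

print(sparse_categorical_crossentropy(labels, confident))
print(sparse_categorical_crossentropy(labels, unsure))
```

The first value is lower than the second: the more probability the model puts on the correct class, the smaller the loss, which is why the curve is expected to fall as training progresses.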
Model training
7. Checking accuracy
Accuracy: the share of examples, positive and negative alike, that the model classifies correctly.
[Plot: accuracy per iteration]
Model training
8. Displaying the confusion matrix
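A confusion matrix counts, for each true class, how often each class was predicted. Below is a minimal plain-Python sketch; the syllables and predictions are made up for illustration and are not results from the project.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, prediction in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[prediction]] += 1
    return matrix


def accuracy(matrix):
    """Correct predictions sit on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


classes = ["pa", "ta", "ma"]                     # hypothetical syllables
truth = ["pa", "pa", "ta", "ta", "ma", "ma"]     # made-up labels
predicted = ["pa", "ta", "ta", "ta", "ma", "pa"]  # made-up predictions

matrix = confusion_matrix(truth, predicted, classes)
for name, row in zip(classes, matrix):
    print(name, row)
print(accuracy(matrix))
```

Off-diagonal cells show which syllables get mistaken for which, something a single accuracy figure hides.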
Model training
9. Checking with additional (labelled) test files
Model export
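import json
import os
from datetime import datetime
import tensorflowjs as tfjs

# `model` and `commands` come from the training steps.
now = datetime.now()  # assumption: `now` is the export timestamp
# Convert the model to TensorFlow.js Layers model format.
tfjs_model_dir = "./thot-model-tfjs-1"
tfjs.converters.save_keras_model(model, tfjs_model_dir)
# Create the metadata.json file.
metadata = {
    "words": list(commands),
    "frameSize": model.input_shape[-2],
    "generated_at": now.strftime("%Y-%m-%d-%H:%M:%S"),
}
with open(os.path.join(tfjs_model_dir, "metadata.json"), "w") as f:
    json.dump(metadata, f)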
4.1 MB
1.6 MB
Model import in JavaScript
@tensorflow-models/speech-commands: JavaScript package for driving the model
import * as tf from '@tensorflow/tfjs-core';
import * as tfl from '@tensorflow/tfjs-layers';
import * as speechCommands from '@tensorflow-models/speech-commands';

// Create a recognizer that uses the browser's native FFT and loads the custom model.
const recognizer = speechCommands.create(
  'BROWSER_FFT',
  null,
  'http://test.com/my-audio-model/model.json',
  'http://test.com/my-audio-model/metadata.json'
);
Use of the model in JavaScript
Continuous listening:
- getUserMedia API
- setInterval (~ 1 s)
- extraction of the audio frequencies
- creation of the spectrogram
- sending to the TensorFlow model
- retrieval of the predictions
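The last step, turning the model's output into a syllable, can be sketched as below. The sketch is written in Python for consistency with the training code, although in the application this logic runs in JavaScript. It assumes the model returns one score per entry of the "words" list stored in metadata.json; the function name and the threshold value are hypothetical.

```python
def best_prediction(scores, words, threshold=0.75):
    """Return (word, score) for the highest score, or None below the threshold."""
    if len(scores) != len(words):
        raise ValueError("expected one score per word")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    if scores[best_index] < threshold:
        return None  # too uncertain: better to stay silent than to mislead the child
    return words[best_index], scores[best_index]


words = ["pa", "ta", "ma"]  # hypothetical syllable labels
print(best_prediction([0.05, 0.90, 0.05], words))
print(best_prediction([0.40, 0.35, 0.25], words))
```

Rejecting low-confidence results is a design choice for a reading aid: a wrong correction is more confusing for a beginner than no feedback at all.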
Results and outlook
Demo: syllable
Demo: word, syllable by syllable
Demo: word by word
Outlook
Model scaling with crowdsourcing
Adaptation layer on the application side: correction, guidance
Detection of phonological dyslexia
Gamification of the child's learning path
Customization of the model to the voice of the child (on-device)
Personal feedback
A great technical adventure
An exciting and growing ML field (OpenAI, etc.)
A "test, fail & learn" approach, perfect for this side project
Resources
Thank you for your attention!
Questions?
Slides: bit.ly/3uBPDYR
Picture credits: Unsplash.com