How I helped my daughter read with machine learning

📖 👨‍👧

Vincent Ogloblinsky - @vogloblinsky

Vincent Ogloblinsky

Compodoc maintainer

Google Developer Expert on Web Technologies

Software architect / Open-Source referent

Disclaimer

This talk is just a "technical" overview of machine learning from a developer's perspective.

I have no formal data-science training. 😉

Some topics (e.g. model optimization) are not yet covered.

Agenda

1. Genesis of the idea

2. Learning to read

3. Machine learning

4. Speech to text

5. The construction of the model

6. Results and outlook

Genesis of the idea

Like any geek dad who does the evening reading:

- guide his daughter by breaking words into syllables with a finger

- guide and correct her reading aloud

- work in a professional context "dealing with the voice" (Orange - Data IA)

- imagine an application built on an adapted "speech to text" engine plus a good dose of interactivity

- survey the market and realize that no such application exists

Perfect! A new technical challenge for the geek dad 😀

Genesis of the idea

Reading help

Web application

🗣️

Child's voice

Machine learning

Speech to text

Definition of "ready"

Let's enforce some technical constraints 

100% "web" technologies

- JavaScript

- WebGL and/or WebAssembly

Offline & privacy by design

- no API calls possible

- no identification of the child

Learning to read

A 7-step process

1 - Awareness of spoken sounds

2 - Awareness of the link between oral and written

3 - The discovery of the alphabet composed of 26 letters

4 - Understand the association “sounds and letters”

5 - Understanding Syllabic Fusion

6 - Recognize words

7 - Understand texts

Syllabic fusion

A mental synthesis skill: blending the sound of a consonant with the sound of a vowel

/p/ and /a/ → pa

The child needs to know that language is segmented into words and also into smaller sound units: phonemes and syllables (phoneme fusion)

"château"

ch makes "chhh"

a makes "aa"

t makes "ttt"

eau makes "ooo"

Richness of the French language

26 letters in the alphabet

36 phonemes

vowels : [a] (table, patte), [é] (éléphant, parler), [o] (bonnet, chaud), ...

semi-vowels : [J] (fille, rail), ...

consonants : [b] (billets, abbé), [g] (gâteau, aggraver), ...

190 graphemes

[o] : o, au, eau

[k] : c, qu (coque)
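To make the grapheme/phoneme relationship concrete, here is a minimal Python sketch that splits a word into graphemes with a greedy longest-match rule. The table is a hypothetical, heavily simplified subset of the roughly 190 French graphemes (for instance, it ignores that "c" can also sound like [s]), and the function name is made up for illustration.

```python
# Hypothetical, tiny subset of French graphemes mapped to a simplified phoneme.
GRAPHEME_TO_PHONEME = {
    "eau": "o",
    "au": "o",
    "o": "o",
    "qu": "k",
    "c": "k",   # simplification: "c" can also be [s] before e or i
    "ch": "ʃ",
    "â": "a",   # simplification: treated like a plain [a]
    "a": "a",
    "t": "t",
}


def split_graphemes(word):
    """Greedy longest-match split of a word into known graphemes."""
    longest = max(len(g) for g in GRAPHEME_TO_PHONEME)
    result = []
    i = 0
    while i < len(word):
        # Try the longest candidate first so "eau" wins over "e" + "au".
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in GRAPHEME_TO_PHONEME:
                result.append(chunk)
                i += size
                break
        else:
            raise ValueError(f"no known grapheme at position {i} in {word!r}")
    return result


graphemes = split_graphemes("château")
phonemes = [GRAPHEME_TO_PHONEME[g] for g in graphemes]
print(graphemes)  # ['ch', 'â', 't', 'eau']
print(phonemes)   # ['ʃ', 'a', 't', 'o']
```

Several graphemes landing on the same phoneme (o, au, eau) is exactly what makes decoding hard for a beginning reader.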

Machine learning

AI at Orange

Machine learning

Subcategory of "Artificial Intelligence"

Algorithms discovering "patterns" in datasets

4 steps:

- select and prepare the data

- select the algorithm to apply

- train the algorithm (the result is the model)

- use the model (and improve it)
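The four steps can be illustrated with a toy example. This is only a sketch: the data points, the labels and the choice of a nearest-centroid algorithm are assumptions made for illustration, not what the project uses.

```python
# 1. Select and prepare data: hypothetical 2-feature points with labels.
training_data = [
    ((1.0, 1.2), "low"),
    ((0.8, 1.0), "low"),
    ((4.0, 4.2), "high"),
    ((4.2, 3.9), "high"),
]


# 2. Select the algorithm: nearest centroid (one average point per label).
def train(samples):
    # 3. Training: the "model" is simply the centroid of each label.
    sums = {}
    for (x, y), label in samples:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(model, point):
    # 4. Use: pick the label whose centroid is closest to the new point.
    def squared_distance(centroid):
        return (centroid[0] - point[0]) ** 2 + (centroid[1] - point[1]) ** 2

    return min(model, key=lambda label: squared_distance(model[label]))


model = train(training_data)
print(predict(model, (0.9, 1.1)))  # low
print(predict(model, (4.1, 4.0)))  # high
```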

Machine learning

3 main types of machine learning

- supervised learning: labeled data - task driven (expensive)

- unsupervised learning: unlabeled data - data driven (autonomous search for patterns)

- reinforcement learning: the algorithm learns from its mistakes to achieve an objective

Speech to text

Also called "Automatic Speech Recognition (ASR)"

Speech to text at Orange

Service 1

Service 2

Service 3

Service 4

Service 5

Service 6

"Speech to text" and children voices

Current voice assistants are "trained" on "adult" datasets

Children's voices are acoustically different: higher-pitched, with a narrower vocal tract and smaller vocal cords; in short, they are still growing

Richer spectrum

Few children's voice datasets

Model building

2 possible approaches: "from scratch" or by "transfer learning"

- from scratch

Advantage:

- full model control

Drawback:

- requires a lot of data

- transfer learning

Advantage:

- benefits from initial training of the model

Drawback:

- less control over the model

Transfer learning

Sound classification

A simpler use case than full ASR

Model building

- base: the Speech Commands dataset, proposed by Google in 2017

- 65,000 one-second recordings of 30 short words spoken by thousands of people

🔘 Using Tensorflow as a Machine Learning Framework

🔘 "Training" locally (Python) then "export"

Tensorflow

Developed by Google Brain

v1.0.0 released in 2017 (current: 2.8.0)

Tensorflow

Tensorflow.js

Uses the GPU "under the hood" through WebGL APIs

Data gathering

Data gathering web interface

- simple syllable set (20)

- receiving wav files

- no information collected on the child (age, gender)

Data preparation

Data cleaning web interface

- one sound per syllable per child

- trimming to 1 s

- removal of stray sounds (uh, ...)

+ augmentation (pitch variation)
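The trimming and pitch-variation steps can be sketched in plain Python. This is a deliberately naive illustration, not the project's pipeline: the helper names are hypothetical, a 16 kHz sample rate is assumed in the usage lines, and the pitch change is done by simple resampling (which also changes the clip length, hence the final re-fit to one second).

```python
def fit_to_duration(samples, sample_rate, seconds=1.0):
    """Trim or zero-pad a list of samples to exactly `seconds` of audio."""
    target = int(sample_rate * seconds)
    trimmed = list(samples[:target])
    return trimmed + [0.0] * (target - len(trimmed))


def vary_pitch(samples, factor):
    """Naive pitch change by resampling with linear interpolation.

    factor > 1 raises the pitch (and shortens the clip),
    factor < 1 lowers it (and lengthens the clip).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if len(samples) < 2:
        return list(samples)
    new_length = max(1, int(len(samples) / factor))
    result = []
    for i in range(new_length):
        position = i * factor
        left = int(position)
        right = min(left + 1, len(samples) - 1)
        fraction = position - left
        result.append(samples[left] * (1 - fraction) + samples[right] * fraction)
    return result


# Usage: build two augmented one-second variants of a (fake) half-second clip.
SAMPLE_RATE = 16000  # assumed sample rate
clip = [0.0, 1.0, 0.0, -1.0] * 2000
higher = fit_to_duration(vary_pitch(clip, 1.1), SAMPLE_RATE)
lower = fit_to_duration(vary_pitch(clip, 0.9), SAMPLE_RATE)
print(len(higher), len(lower))  # 16000 16000
```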

Model training

1. Splitting the dataset

80% for training

10% for Tensorflow internal validation

10% for testing

2. Inspection of some spectrograms

3. Loading the base model
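The 80/10/10 split from step 1 can be sketched as follows. The function name, the fixed seed and the file names are assumptions for illustration; the point is to shuffle once and then cut the list into three disjoint parts.

```python
import random


def split_dataset(items, train=0.8, validation=0.1, seed=42):
    """Shuffle the items and split them into train / validation / test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible split
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],  # whatever remains is the test set
    )


files = [f"clip_{i}.wav" for i in range(100)]  # hypothetical file names
train_files, validation_files, test_files = split_dataset(files)
print(len(train_files), len(validation_files), len(test_files))  # 80 10 10
```

Keeping the test set untouched until the end is what makes the final accuracy figure meaningful.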

Model training

4. Freezing all layers of the model except the last one

for layer in model.layers[:-1]:
  layer.trainable = False

model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy", metrics=["acc"])

Model training

Print layers information

Model training

5. Training: ~5 min

Model training

6. Checking the loss function

Gap between the predictions made by the neural network and the actual labels of the samples used during training

Iteration

Loss
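For intuition, the "sparse_categorical_crossentropy" loss named in the compile step can be written in a few lines of plain Python: it is the average of minus the log of the probability the model gave to the correct class. This is a sketch of the formula with made-up probabilities, not TensorFlow's implementation.

```python
import math


def sparse_categorical_crossentropy(probabilities, labels):
    """Mean of -log(p[true class]) over all samples.

    `probabilities` holds one list of class probabilities per sample,
    `labels` holds the index of the correct class for each sample.
    """
    losses = [
        -math.log(max(row[label], 1e-7))  # clamp to avoid log(0)
        for row, label in zip(probabilities, labels)
    ]
    return sum(losses) / len(losses)


predictions = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # made-up model outputs
labels = [0, 1]
print(round(sparse_categorical_crossentropy(predictions, labels), 4))  # 0.2899
```

The more confident the model is in the correct class, the closer the loss gets to 0, which is why the curve should go down over the iterations.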

Model training

7. Accuracy check

Proportion of samples for which the model predicts the correct class

Iteration

Accuracy
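Accuracy itself is a one-line computation. A minimal sketch, with made-up syllable labels:

```python
def accuracy(predicted, actual):
    """Fraction of samples whose predicted label equals the true label."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("need two non-empty lists of the same length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


print(accuracy(["pa", "ta", "ma", "pa"], ["pa", "ta", "pa", "pa"]))  # 0.75
```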

Model training

8. Confusion Matrix Display
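A confusion matrix counts, for each true syllable, which syllable the model actually predicted. The sketch below builds one in plain Python; the labels and predictions are made up for illustration.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


labels = ["pa", "ta", "ma"]
actual = ["pa", "pa", "ta", "ma", "ma"]
predicted = ["pa", "ta", "ta", "ma", "pa"]
for label, row in zip(labels, confusion_matrix(actual, predicted, labels)):
    print(label, row)
# pa [1, 1, 0]
# ta [0, 1, 0]
# ma [1, 0, 1]
```

Off-diagonal cells show which syllables get mixed up, which is more actionable than a single accuracy number.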

Model training

9. Check with additional (labelled) test files

Model export

import datetime
import json
import os

import tensorflowjs as tfjs

# Convert the model to TensorFlow.js Layers model format.
# `model` and `commands` (the list of labels) come from the training steps.
tfjs_model_dir = "./thot-model-tfjs-1"
tfjs.converters.save_keras_model(model, tfjs_model_dir)

# Create the metadata.json file.
now = datetime.datetime.now()
metadata = {
    "words": list(commands),
    "frameSize": model.input_shape[-2],
    "generated_at": now.strftime("%Y-%m-%d-%H:%M:%S")
}
with open(os.path.join(tfjs_model_dir, "metadata.json"), "w") as f:
    json.dump(metadata, f)

4.1 MB

1.6 MB

Model import in JavaScript

@tensorflow-models/speech-commands: JavaScript package for driving the model

import * as tf from '@tensorflow/tfjs-core';
import * as tfl from '@tensorflow/tfjs-layers';
import * as speechCommands from '@tensorflow-models/speech-commands';

const recognizer = speechCommands.create(
    'BROWSER_FFT',
    null,
    'http://test.com/my-audio-model/model.json',
    'http://test.com/my-audio-model/metadata.json'
);

Use of the model in JavaScript

Continuous listening

getUserMedia API

setInterval

~ 1s

Extraction of audio frequencies

Creation of the spectrogram

Send to Tensorflow model

Retrieving predictions
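The last step, turning the model's scores into a decision, is language-agnostic; the application does it in JavaScript, but the logic can be sketched in Python. The function name, the syllable list and the 0.75 threshold are assumptions for illustration: take the best-scoring syllable and ignore it when the model is not confident enough.

```python
def best_prediction(scores, words, threshold=0.75):
    """Return the most probable word, or None if the model is not confident."""
    if not scores or len(scores) != len(words):
        raise ValueError("need one score per word")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best_index] < threshold:
        return None  # too uncertain: better to stay silent than to mislead
    return words[best_index]


words = ["pa", "ta", "ma"]  # hypothetical syllable set
print(best_prediction([0.05, 0.90, 0.05], words))  # ta
print(best_prediction([0.40, 0.35, 0.25], words))  # None
```

With a child reader, rejecting low-confidence predictions matters: a wrong "correction" is worse than no feedback.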

Results and outlook

Demo: syllable

Demo: word, syllable by syllable

Demo: word by word

Outlook

Model scaling with crowdsourcing

Adaptation layer on the application side: correction, guidance

Detection of phonological dyslexia

Gamification of the child's learning path

Customization of the model to the voice of the child (on-device)

Personal feedback

A great technical adventure

An exciting, fast-growing ML field (OpenAI, etc.)

A test, fail & learn approach: perfect for this side project

Resources

Thank you for your attention!

Questions?

Slides: bit.ly/3uBPDYR

Picture credits: Unsplash.com
