Fun with Speech
Things you can do with HTML5 Natural Language APIs
- find a need
- create a solution
- find some interesting technology
- do something stupid with it
Speech Input for HTML Forms
Web Speech API
Bing Translator API
Google Translate Text-to-Speech
minimal API demos
toy browser "apps"
Talkshow: shows images of the things you are talking about
Translating Telephone: multilingual video conferencing
two kinds of apps
- apps that change the world
- apps that have to wait for the world to change around them
For these two:
- underlying technology needs a few more quantum leaps forward
- until then, users must acquire special skills
- hands free
- no screen needed
- no keyboard
- magic -- action at a distance
talk is cheap
- It’s easy, almost effortless
Our preferred way of communicating
In fact, we love to do it
All the time
Some of us can’t stop
The easiest form of work
One of the first things you learn
and the last thing to go
ease of use
ergonomics -- reduce number of clicks and keystrokes
one-click is nice
voice allows zero-click interface
- No clicking, no typing, just say the magic words
- Very hard to do well
Frustrating for users
To reduce errors, we have to limit vocabulary
- The user has to know or guess the available commands
- Big penalty for getting a command wrong
- Errors in speech recognition cannot be avoided
- User errors are also unavoidable
type over the error
repeat entire command until it is recognized
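A minimal sketch of the limited-vocabulary idea above, in TypeScript. It assumes the recognizer hands back several alternative transcripts per utterance, and it accepts the first one that exactly matches a known command. The vocabulary and the names `normalize` and `matchCommand` are hypothetical, chosen only for illustration.

```typescript
// Hypothetical command vocabulary; a real app would define its own.
const VOCABULARY: string[] = ["next slide", "previous slide", "stop listening"];

// Lowercase, drop punctuation, and collapse whitespace so that
// "Next slide!" and "next  slide" compare equal.
// Assumption: English-only commands (non-ASCII letters are stripped too).
function normalize(utterance: string): string {
  return utterance
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Return the first alternative that is a known command, or null if none is.
function matchCommand(
  alternatives: string[],
  vocabulary: string[] = VOCABULARY
): string | null {
  for (const alternative of alternatives) {
    const candidate = normalize(alternative);
    if (vocabulary.indexOf(candidate) !== -1) {
      return candidate;
    }
  }
  return null;
}
```

Here `matchCommand(["necks slide", "Next slide!"])` returns `"next slide"`. A `null` result is the case where the user has to repeat the entire command.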
how to use speech
- something completely different: react to overheard conversation
Can a computer make itself useful by listening to my conversations?
speech input for HTML forms
- Speech recognition stops at the first pause
No feedback during recognition
You have to click on the microphone icon to speak
continuous speech recognition
In 2013, Google Chrome gets the Web Speech API
Continuous ASR sessions, lasting several minutes
Now we can process and respond to conversational speech
Speech meant for other humans, not computers
We can analyze conversational speech as sentences of text
- Speech is different from written text, but let’s worry about that later
We can (mis)apply standard text analytics to speech
Named Entity Recognition
Extract semantic frames
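A rough sketch of a continuous session with the Web Speech API, in TypeScript. The interfaces below are simplified stand-ins for the browser's recognition object and result events, declared here so the sketch is self-contained. The names `RecognitionLike` and `listen` are illustrative assumptions, not part of the API.

```typescript
// Simplified stand-ins for the browser's speech recognition types.
interface AlternativeLike {
  transcript: string;
}
interface ResultLike {
  isFinal: boolean;
  readonly [index: number]: AlternativeLike;
}
interface ResultEventLike {
  resultIndex: number;
  results: ArrayLike<ResultLike>;
}
interface RecognitionLike {
  continuous: boolean;
  interimResults: boolean;
  onresult: ((event: ResultEventLike) => void) | null;
  start(): void;
}

// Start a continuous session and hand each finalized stretch of speech
// to onSentence as plain text.
function listen(
  recognition: RecognitionLike,
  onSentence: (text: string) => void
): void {
  recognition.continuous = true; // keep listening across pauses
  recognition.interimResults = true; // interim results give feedback while speaking
  recognition.onresult = (event) => {
    // Only results from resultIndex onward are new or changed in this event.
    for (let i = event.resultIndex; i < event.results.length; i++) {
      const result = event.results[i];
      if (!result || !result.isFinal) continue;
      const best = result[0]; // top-ranked alternative
      if (best) onSentence(best.transcript.trim());
    }
  };
  recognition.start();
}
```

In Chrome the recognition object would come from `new webkitSpeechRecognition()`; a cast may be needed to satisfy these simplified interfaces.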
named entity recognition
Alchemy API provides online NER for 8 European languages
Free for up to:
- 1,000 daily transactions
- 5 concurrent requests
Install their PHP library on your server and call it with Ajax
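One way to stay inside the free-tier limits above is a small client-side budget that is checked before each NER request. This is only a sketch: the class `RequestBudget` and its methods are hypothetical, and it assumes the daily count resets when the caller-supplied day key changes.

```typescript
// Tracks the two free-tier limits: concurrent requests and daily transactions.
class RequestBudget {
  private readonly maxConcurrent: number;
  private readonly maxPerDay: number;
  private inFlight = 0;
  private usedToday = 0;
  private day = "";

  constructor(maxConcurrent: number = 5, maxPerDay: number = 1000) {
    this.maxConcurrent = maxConcurrent;
    this.maxPerDay = maxPerDay;
  }

  // Reserve one request slot. `today` is any string that changes once per
  // day, such as "2013-06-01". Returns false when either limit is reached.
  tryAcquire(today: string): boolean {
    if (today !== this.day) {
      this.day = today; // new day: reset the daily counter
      this.usedToday = 0;
    }
    if (this.inFlight >= this.maxConcurrent || this.usedToday >= this.maxPerDay) {
      return false;
    }
    this.inFlight += 1;
    this.usedToday += 1;
    return true;
  }

  // Call when a request finishes, whether it succeeded or failed.
  release(): void {
    if (this.inFlight > 0) this.inFlight -= 1;
  }
}
```

`new Date().toISOString().slice(0, 10)` is one way to produce the day key; whether the real quota resets at UTC midnight is an assumption. When `tryAcquire` returns false, the simplest response is to skip that stretch of speech, since the conversation will have moved on anyway.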
start with some cool technology...
We have this input:
text from continuous speech recognition
list of names that were mentioned
What can we do with it?
good / evil
Clearly this information is interesting to the intelligence community.
Metadata: create a graph of the people you talk to
- NER on content: superimpose a graph of the people you talk about
But can it be used for good, to give something of value back to the speaker or listener?
everybody gets an inset
Have you ever struggled to describe something or someone in words?
…when a picture would explain everything?
And yet it’s not worth interrupting your conversation to do an internet search?
You need Computer Aided Conversation!
- The computer eavesdrops on your conversation, listening for names
- When it hears a name, it searches for images of that thing
- And displays the images on a nearby screen
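The three steps above can be sketched as a small pipeline. The image search and the display are passed in as functions, so nothing here depends on a particular search API. The class `ConversationImages` and its method names are hypothetical. Each name is searched only once, so a name that keeps coming up does not flood the screen.

```typescript
// Injected dependencies: both are assumptions supplied by the caller.
type ImageSearch = (query: string) => Promise<string[]>; // resolves to image URLs
type ImageDisplay = (name: string, imageUrls: string[]) => void;

class ConversationImages {
  private readonly seen = new Set<string>();
  private readonly search: ImageSearch;
  private readonly display: ImageDisplay;

  constructor(search: ImageSearch, display: ImageDisplay) {
    this.search = search;
    this.display = display;
  }

  // Call with the names found in the latest stretch of speech.
  // Returns the names that triggered a new search.
  onNames(names: string[]): string[] {
    const fresh: string[] = [];
    for (const rawName of names) {
      const name = rawName.trim();
      const key = name.toLowerCase();
      if (key === "" || this.seen.has(key)) continue; // already shown
      this.seen.add(key);
      fresh.push(name);
      this.search(name)
        .then((urls) => {
          if (urls.length > 0) this.display(name, urls);
        })
        .catch(() => {
          // A failed search should never interrupt the conversation.
        });
    }
    return fresh;
  }
}
```

Swallowing search errors is a deliberate choice in this sketch: the screen is a bonus, and the speakers should never have to attend to it.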
finding related images
Microsoft Bing Search API
- Free for up to 5,000 transactions per month
the disappearing user
That could have been me talking to a friend.
I wasn’t really using the computer.
It was just there, listening, acting when it had something to offer to the conversation.
Like an attentive servant.
pictures and meaning
Picture Theory of Meaning -- Wittgenstein
- A statement is meaningful if it pictures a state of the world
- Let's take that literally -- a statement is like a picture
- A proper noun "pictures" the thing it names
- Grand project: can we turn speech into pictures on the fly?
- Deep Learning, recursive neural nets, semantic grounding
what we talk about
connect to a private database of pictures of friends, vacation spots
add last year's sales figures so they will pop up on a screen in the hallway at work during a water cooler conversation
using pictures to avoid miscommunication
the trouble with screens
The value may be small, but the effort is smaller. Close to zero.
But there's a setup problem...
We need a screen that both of us can see. If only there were a shared screen nearby…
processing that happens when you are busy doing something else
At first it will be creepy that computers are listening to us
But we will get over it
- Supermarket floors will become wall-to-wall advertisements
- Robo-calls will grow until everyone's phone is constantly ringing
web real time communication
Real-time, peer-to-peer voice and video communications through a browser without plug-ins
Click-to-call from any web page
No phone number: both parties navigate to the same page to connect
- Data Channels
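A hedged sketch of how recognized text might travel over a WebRTC data channel next to the audio and video. It assumes each utterance is sent as a small JSON message carrying its language code. `ChannelLike` is a simplified stand-in for the browser's data channel (only `readyState` and `send` are used), and `Utterance`, `sendUtterance`, and `parseUtterance` are hypothetical names.

```typescript
// Minimal view of a data channel: just what this sketch needs.
interface ChannelLike {
  readyState: string;
  send(data: string): void;
}

// Assumed wire format: the speaker's language plus the recognized text.
interface Utterance {
  lang: string;
  text: string;
}

// Send one utterance if the channel is open; report whether it was sent.
function sendUtterance(channel: ChannelLike, utterance: Utterance): boolean {
  if (channel.readyState !== "open") return false;
  channel.send(JSON.stringify(utterance));
  return true;
}

// Decode a received message; return null for anything malformed.
function parseUtterance(raw: string): Utterance | null {
  try {
    const value = JSON.parse(raw);
    if (value && typeof value.lang === "string" && typeof value.text === "string") {
      return { lang: value.lang, text: value.text };
    }
  } catch {
    // Not JSON: fall through and return null.
  }
  return null;
}
```

In a browser the channel would come from `RTCPeerConnection.createDataChannel`. Dropping utterances while the channel is not open keeps the sketch simple; queueing them is the obvious alternative.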
Trained on human-translated bilingual corpora
Therefore well-formed sentences
High bar for grammaticality
Little training data for speech
Disfluencies -- um, huh, ah, er, duh
Bing Translator API
- Free for up to 2,000,000 characters a month
Google Translate’s unofficial API
- 100 characters at a time
- use HTTPS
Web Speech is supposed to do synthesis too; it’s coming
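Because of the 100-character limit, longer text has to be split before it is sent for synthesis. A minimal sketch of one way to do that, breaking at word boundaries where possible. The function name `chunkText` is hypothetical, and no request is made here.

```typescript
// Split text into pieces of at most `limit` characters, breaking between
// words where possible. Runs of whitespace are collapsed to single spaces.
function chunkText(text: string, limit: number = 100): string[] {
  if (limit < 1) throw new RangeError("limit must be at least 1");
  const chunks: string[] = [];
  let current = "";
  const words = text.split(/\s+/).filter((word) => word.length > 0);
  for (const word of words) {
    let rest = word;
    // A single word longer than the limit has to be cut mid-word.
    while (rest.length > limit) {
      if (current !== "") {
        chunks.push(current);
        current = "";
      }
      chunks.push(rest.slice(0, limit));
      rest = rest.slice(limit);
    }
    if (current === "") {
      current = rest;
    } else if (current.length + 1 + rest.length <= limit) {
      current += " " + rest; // the word still fits in this chunk
    } else {
      chunks.push(current); // chunk is full: start a new one
      current = rest;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```

Splitting at sentence boundaries first would probably sound more natural when the pieces are played back to back; this sketch only guarantees the length limit.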
Multilingual Video Conferencing
Everyone broadcasts in their own language
Foreign language is translated locally to mine
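The "translate locally" idea can be sketched as follows: each incoming caption carries the speaker's language, and the receiving side translates it only when that language differs from its own. The translator is injected as a function, so no particular translation API is assumed. `Caption`, `Translate`, and `showIncoming` are hypothetical names, and both sides are assumed to use the same style of language code.

```typescript
// A received piece of recognized speech, tagged with the speaker's language.
interface Caption {
  lang: string;
  text: string;
}

// Injected translator: an assumption supplied by the caller.
type Translate = (text: string, from: string, to: string) => Promise<string>;

// Show an incoming caption in the local user's language.
function showIncoming(
  caption: Caption,
  myLang: string,
  translate: Translate,
  show: (text: string) => void
): void {
  if (caption.lang === myLang) {
    show(caption.text); // same language: no translation needed
    return;
  }
  translate(caption.text, caption.lang, myLang).then(
    show,
    () => show(caption.text) // translation failed: fall back to the original
  );
}
```

Translating on the receiving side means each participant pays only for the language pairs they actually need, and a speaker never has to know which languages the listeners use.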
Fun with speech
By John Dimm