Fun with Speech
business
- find a need
- create a solution
toys
- find some interesting technology
- do something stupid with it
- Speech Input for HTML Forms
- Web Speech API
- AlchemyAPI
- Bing Search
- flickr.photos.search
- WebRTC
- Bing Translator API
- Google Translate Text-to-Speech
minimal api demos
toy browser "apps"

Talkshow: shows images of the things you are talking about

Translating Telephone: multilingual video conferencing
two kinds of apps
- apps that change the world
- apps that have to wait for the world to change around them
For these two:
- underlying technology needs a few more quantum leaps forward
- until then, users must acquire special skills
speech
- natural
- hands free
- no screen needed
- no keyboard
- magic -- action at a distance
talk is cheap
- It’s easy, almost effortless
- Our preferred way of communicating
- High bandwidth
- In fact, we love to do it
- All the time
- Some of us can’t stop
- The easiest form of work
- One of the first things you learn
- and the last thing to go
ease of use
ergonomics -- reduce number of clicks and keystrokes
- one-click is nice
- voice allows zero-click interface
- effortless
- No clicking, no typing, just say the magic words
voice commands
- Very hard to do well
- Frustrating for users
- To reduce errors, we have to limit vocabulary
- The user has to know or guess the available commands
- Big penalty for getting a command wrong
- Errors in speech recognition cannot be avoided
- User errors are also unavoidable
error correction
command line
- up arrow
- arrow left
- type over the error
- submit
voice
- repeat entire command until it is recognized
how to use speech
- voice commands
- transcription
- something completely different: react to overheard conversation
Can a computer make itself useful by listening to my conversations?
speech input for HTML forms

Deficiencies:
- Speech recognition stops at the first pause
- No feedback during recognition
- You have to click on the microphone icon to speak
continuous speech recognition
- In 2013, Google Chrome gets the Web Speech API
- Continuous ASR sessions, lasting several minutes
- Now we can process and respond to conversational speech
- Speech meant for other humans, not computers
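A minimal sketch of how a continuous session might be set up. The `continuous`, `interimResults`, `lang`, `onend` and `start()` members belong to the Web Speech API recognizer; `RecognizerLike` and `startContinuous` are hypothetical names, and restarting on `onend` is only one assumed way to keep listening after the browser ends a session.

```typescript
// Minimal shape of the recognizer members used here. The real object
// comes from the browser (in Chrome: new webkitSpeechRecognition()).
interface RecognizerLike {
  continuous: boolean;
  interimResults: boolean;
  lang: string;
  onend: (() => void) | null;
  start(): void;
}

// Hypothetical helper: configure a recognizer for conversational speech
// and restart it whenever the browser ends the session.
function startContinuous(rec: RecognizerLike, lang: string): () => void {
  let listening = true;
  rec.continuous = true;     // keep going past the first pause
  rec.interimResults = true; // deliver partial hypotheses while the user speaks
  rec.lang = lang;
  rec.onend = () => {
    if (listening) rec.start(); // assumption: simply restart to keep listening
  };
  rec.start();
  // The returned function turns off the auto-restart.
  return () => {
    listening = false;
  };
}
```

In Chrome this could be called as `startContinuous(new (window as any).webkitSpeechRecognition(), "en-US")`.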
interim results
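A sketch of how interim and final text might be separated in a result handler. The `resultIndex`, `results`, `isFinal` and `transcript` fields follow the Web Speech API result event; the `*Like` interfaces and `splitTranscript` are hypothetical names used so the logic can run outside a browser.

```typescript
// Minimal shapes mirroring the fields of a Web Speech API result event.
interface AlternativeLike { transcript: string; }
interface ResultLike { isFinal: boolean; 0: AlternativeLike; }
interface ResultEventLike { resultIndex: number; results: ArrayLike<ResultLike>; }

// Collect the text of the results that changed in this event.
// Final results will not change again; interim ones may still be revised.
function splitTranscript(event: ResultEventLike): { finalText: string; interimText: string } {
  let finalText = "";
  let interimText = "";
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    if (!result) continue;
    const best = result[0].transcript; // first alternative is the top hypothesis
    if (result.isFinal) {
      finalText += best;
    } else {
      interimText += best;
    }
  }
  return { finalText, interimText };
}
```

Showing `interimText` in a lighter color while the user is still speaking gives the feedback that the HTML forms input lacked.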


text analytics
We can analyze conversational speech as sentences of text
- Speech is different from written text, but let’s worry about that later
We can (mis)apply standard text analytics to speech
- Named Entity Recognition
- Machine Translation
- Sentiment Analysis
- Domain Classification
- Fact Finding
- Summarization
- Extract semantic frames
- Normalization
- Segmentation
- Simplification
named entity recognition
AlchemyAPI provides online NER for 8 European languages
Free for up to:
- 1,000 daily transactions
- 5 concurrent requests
Install their PHP library on your server and call it with Ajax
start with some cool technology...
We have this input:
- text from continuous speech recognition
- list of names that were mentioned
good / evil
Clearly this information is interesting to the intelligence community.
- Metadata: create a graph of the people you talk to
- NER on content: superimpose a graph of the people you talk about
But can it be used for good, to give something of value back to the speaker or listener?


everybody gets an inset
Have you ever struggled to describe something or someone in words?
…when a picture would explain everything?
And yet it’s not worth interrupting your conversation to do an internet search?
You need Computer Aided Conversation!
- The computer eavesdrops on your conversation, listening for names
- When it hears a name, it searches for images of that thing
- And displays the images on a nearby screen
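The three steps above can be sketched as a small piece of glue code. `makeListener`, `NameFinder` and `ImageSearch` are hypothetical names; the name finder stands in for an NER call and the search function for an image search plus display. Skipping names that were already heard is an assumption, not something the slides specify.

```typescript
type NameFinder = (text: string) => string[];
type ImageSearch = (name: string) => void;

// Hypothetical glue: returns a function to call with each recognized
// sentence. It runs `search` once for every name not heard before.
function makeListener(findNames: NameFinder, search: ImageSearch): (text: string) => void {
  const seen: string[] = [];
  return (text: string) => {
    for (const name of findNames(text)) {
      const key = name.toLowerCase();
      if (seen.indexOf(key) !== -1) continue; // already searched, skip
      seen.push(key);
      search(name);
    }
  };
}
```

Passing the two steps in as functions keeps the conversation loop independent of whichever NER and image services are used.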
finding related images
Microsoft Bing Search API
- Free for up to 5,000 transactions per month
flickr.photos.search API
- Free
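A sketch of building a flickr.photos.search request. The REST endpoint and the `method`, `api_key`, `text`, `per_page`, `format` and `nojsoncallback` parameters are from the Flickr API; `flickrSearchUrl`, `flickrPhotoUrl` and `FlickrPhoto` are hypothetical names, and the image URL pattern is my understanding of Flickr's documented format, so check it against the current documentation.

```typescript
const FLICKR_REST = "https://api.flickr.com/services/rest/";

// Build the search URL for a spoken name. Requires your own API key.
function flickrSearchUrl(apiKey: string, text: string, perPage: number): string {
  const params: [string, string][] = [
    ["method", "flickr.photos.search"],
    ["api_key", apiKey],
    ["text", text],
    ["per_page", String(perPage)],
    ["format", "json"],
    ["nojsoncallback", "1"], // plain JSON instead of a JSONP wrapper
  ];
  const query = params.map(([k, v]) => k + "=" + encodeURIComponent(v)).join("&");
  return FLICKR_REST + "?" + query;
}

// Fields of a photo record needed to build an image URL.
interface FlickrPhoto { id: string; server: string; secret: string; }

// Assumed URL pattern for the default-size image of a photo.
function flickrPhotoUrl(p: FlickrPhoto): string {
  return "https://live.staticflickr.com/" + p.server + "/" + p.id + "_" + p.secret + ".jpg";
}
```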

the disappearing user
That could have been me talking to a friend.
I wasn’t really using the computer.
It was just there, listening, acting when it had something to offer to the conversation.
Like an attentive servant.
pictures and meaning
Picture Theory of Meaning -- Wittgenstein
- A statement is meaningful if it pictures a state of the world
- Let's take that literally -- a statement is like a picture
- A proper noun "pictures" the thing it names
- Grand project: can we turn speech into pictures on the fly?
- Deep Learning, recursive neural nets, semantic grounding
what we talk about
connect to a private database of pictures of friends, vacation spots
add last year's sales figures so they will pop up on a screen in the hallway at work during a water cooler conversation
using pictures to avoid miscommunication
the trouble with screens
The value may be small, but the effort is smaller. Close to zero.
But there's a setup problem...
We need a screen that both of us can see. If only there were a shared screen nearby…
ambient computing
processing that happens when you are busy doing something else
At first it will be creepy that computers are listening to us
But we will get over it
caveat lector
Previous predictions
- Supermarket floors will become wall-to-wall advertisements
- Robo-calls will grow until everyone's phone is constantly ringing
translating telephone

- WebRTC
- Machine Translation
- Text-to-Speech
web real time communication
- Real-time, peer-to-peer voice and video communications through a browser without plug-ins
- Click-to-call from any web page
- No phone number, navigate to the same page to connect
- Video conferencing
- Data Channels
- JavaScript control of the screen each participant sees
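One way a data channel might carry recognized text between participants, as a sketch. `RTCDataChannel.send` accepts a string; the `Caption` message format and the `sendCaption` and `parseCaption` helpers are assumptions made for illustration.

```typescript
// Hypothetical message format for recognized speech.
interface Caption { kind: "caption"; lang: string; text: string; }

// Anything with a string send() works here, including an RTCDataChannel.
interface ChannelLike { send(data: string): void; }

function sendCaption(channel: ChannelLike, lang: string, text: string): void {
  const msg: Caption = { kind: "caption", lang, text };
  channel.send(JSON.stringify(msg));
}

// Parse an incoming message; returns null for anything that is not a caption.
function parseCaption(data: string): Caption | null {
  let value: unknown;
  try {
    value = JSON.parse(data);
  } catch (err) {
    return null; // not JSON, ignore
  }
  const v = value as { kind?: unknown; lang?: unknown; text?: unknown } | null;
  if (!v || v.kind !== "caption" || typeof v.lang !== "string" || typeof v.text !== "string") {
    return null;
  }
  return { kind: "caption", lang: v.lang, text: v.text };
}
```

The receiving page would call `parseCaption` on the `data` of each message event from the channel and update its own screen with the result.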
machine translation
- Voice-challenged
- Trained on human-translated bilingual corpora
- Therefore well-formed sentences
- High bar for grammaticality
- Not speech
- Little training data for speech
- Disfluencies -- um, huh, ah, er, duh
- Stop/start
- Backtracking
- Parenthetic speech
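A naive pre-filter for the simplest of these problems, as a sketch. It removes a fixed list of filler words before the text goes to a translator. The word list and `stripFillers` are assumptions, not the talk's method, and stop/start, backtracking and parenthetic speech are not handled.

```typescript
// Whole-word fillers, optionally followed by a comma and whitespace.
const FILLERS = /\b(?:um|uh|huh|ah|er|duh)\b,?\s*/gi;

// Remove filler words, then tidy up the remaining whitespace.
function stripFillers(utterance: string): string {
  return utterance.replace(FILLERS, "").replace(/\s+/g, " ").trim();
}
```

For example, `stripFillers("um I think er we should go")` gives `"I think we should go"`.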
TRANSLATION API
speech synthesis
- Google Translate’s unofficial API
- 100 characters at a time
- use https
- The Web Speech API is supposed to do synthesis too; it’s coming
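Working within the 100-character limit means splitting longer text into pieces first. A sketch that breaks at word boundaries follows; `chunkText` is a hypothetical helper, and how each chunk is then sent to the synthesizer is left out.

```typescript
// Split text into chunks of at most maxLen characters, breaking between
// words. A single word longer than maxLen is cut into maxLen-sized pieces.
function chunkText(text: string, maxLen: number): string[] {
  if (maxLen < 1) throw new Error("maxLen must be at least 1");
  const chunks: string[] = [];
  let current = "";
  for (const word of text.split(/\s+/)) {
    if (word === "") continue;
    let rest = word;
    while (rest.length > maxLen) {
      if (current !== "") {
        chunks.push(current);
        current = "";
      }
      chunks.push(rest.slice(0, maxLen));
      rest = rest.slice(maxLen);
    }
    if (current === "") {
      current = rest;
    } else if (current.length + 1 + rest.length <= maxLen) {
      current += " " + rest; // still fits with a separating space
    } else {
      chunks.push(current);
      current = rest;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```

Calling `chunkText(sentence, 100)` and playing the chunks in order keeps each request within the limit.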
translating telephone

- Multilingual Video Conferencing
- Everyone broadcasts in their own language
- Foreign-language speech is translated locally into my language
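The receiving side of this scheme can be sketched as follows. `receiveCaption`, `primaryLanguage` and the callback-style `Translate` type are hypothetical; the translate function stands in for a call to a machine translation service, and comparing only the primary language subtag is an assumption.

```typescript
// Placeholder for a machine translation call that reports its result
// through a callback.
type Translate = (text: string, from: string, to: string, done: (translated: string) => void) => void;

// "en-US" and "en-GB" both reduce to "en".
function primaryLanguage(tag: string): string {
  return tag.toLowerCase().replace(/-.*$/, "");
}

// Show a caption as-is when the speaker uses my language; otherwise
// translate it locally first.
function receiveCaption(
  text: string,
  speakerLang: string,
  myLang: string,
  translate: Translate,
  show: (text: string) => void
): void {
  if (primaryLanguage(speakerLang) === primaryLanguage(myLang)) {
    show(text);
    return;
  }
  translate(text, speakerLang, myLang, show);
}
```

Because translation happens on the listener's side, each speaker sends only one stream, whatever the number of languages in the call.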
demo
questions?
John Dimm
jdimm@yahoo.com
http://www.johndimm.com/
https://github.com/johndimm/FunWithSpeech