Fun with Speech

Things you can do with HTML5 Natural Language APIs

John Dimm

http://www.johndimm.com/

Feb 2014

business

find a need
create a solution

toys

find some interesting technology
do something stupid with it

Speech Input for HTML Forms
Web Speech API
AlchemyAPI
Bing Search
flickr.photos.search
WebRTC
Bing Translator API
Google Translate Text-to-Speech

minimal api demos

toy browser "apps"

Talkshow: shows images of the things you are talking about

Translating Telephone: multilingual video conferencing

two kinds of apps

apps that change the world
apps that have to wait for the world to change around them

For these two:

underlying technology needs a few more quantum leaps forward
until then, users must acquire special skills

speech

natural
hands free
no screen needed
no keyboard
magic -- action at a distance

talk is cheap

It’s easy, almost effortless
Our preferred way of communicating
High bandwidth
In fact, we love to do it
All the time
Some of us can’t stop
The easiest form of work
One of the first things you learn
and the last thing to go

ease of use

ergonomics -- reduce number of clicks and keystrokes

one-click is nice
voice allows zero-click interface
effortless
No clicking, no typing, just say the magic words

voice commands

Very hard to do well
Frustrating for users
To reduce errors, we have to limit vocabulary
The user has to know or guess the available commands
Big penalty for getting a command wrong
Errors in speech recognition cannot be avoided
User errors are also unavoidable
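The limited-vocabulary trade-off can be made concrete with a small sketch. The command table and the name `matchCommand` are hypothetical, invented for illustration. The point is that anything outside the table is simply rejected, which is why the user has to know or guess the available phrases.

```typescript
// Hypothetical vocabulary: each accepted phrase maps to an action name.
const VOCABULARY: { [phrase: string]: string } = {
  "play": "play",
  "start": "play",
  "pause": "pause",
  "stop": "pause",
  "next": "next",
  "skip": "next",
};

// Normalize the recognized text, then look it up. No fuzzy matching:
// a phrase is either in the vocabulary or it is an error.
function matchCommand(transcript: string): string | null {
  const phrase = transcript
    .toLowerCase()
    .replace(/[^a-z ]/g, "") // drop punctuation and digits
    .replace(/\s+/g, " ")    // collapse runs of whitespace
    .trim();
  const known = Object.prototype.hasOwnProperty.call(VOCABULARY, phrase);
  return known ? VOCABULARY[phrase] : null;
}
```

A natural sentence such as "please stop the music" returns null here, even though a human listener would understand it at once.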

error correction

command line

up arrow
arrow left
type over the error
submit

voice

repeat entire command until it is recognized

ERROR CORRECTION

how to use speech

voice commands
transcription
something completely different: react to overheard conversation

Can a computer make itself useful by listening to my conversations?

speech input for HTML forms

Demo

Deficiencies:

Speech recognition stops at the first pause
No feedback during recognition
You have to click on the microphone icon to speak

continuous speech recognition

In 2013, Google Chrome got the Web Speech API
Continuous ASR sessions, lasting several minutes
Now we can process and respond to conversational speech
Speech meant for other humans, not computers

Demo

interim results
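A minimal sketch of a continuous session with interim results, assuming a browser that exposes the Web Speech API (`SpeechRecognition`, or `webkitSpeechRecognition` in Chrome). The `continuous`, `interimResults`, `lang`, and `onresult` members come from that API. The helper names `splitTranscripts` and `startListening` and the small structural types are assumptions made so the sketch compiles without DOM typings.

```typescript
// Minimal structural types standing in for the browser's result objects.
interface AlternativeLike { transcript: string; }
interface ResultLike { isFinal: boolean; 0: AlternativeLike; }
interface Transcripts { finalText: string; interimText: string; }

// Separate settled text from text the recognizer may still revise.
function splitTranscripts(results: ArrayLike<ResultLike>): Transcripts {
  let finalText = "";
  let interimText = "";
  for (let i = 0; i < results.length; i++) {
    const best = results[i][0].transcript; // top-ranked alternative
    if (results[i].isFinal) {
      finalText += best;
    } else {
      interimText += best;
    }
  }
  return { finalText: finalText, interimText: interimText };
}

// Browser-only wiring. Returns null where the API is missing (e.g. Node).
function startListening(onUpdate: (t: Transcripts) => void, lang: string): unknown {
  const g = globalThis as any;
  const Recognition = g.SpeechRecognition || g.webkitSpeechRecognition;
  if (!Recognition) {
    return null;
  }
  const recognition = new Recognition();
  recognition.continuous = true;     // keep listening across pauses
  recognition.interimResults = true; // deliver provisional text as it arrives
  recognition.lang = lang;
  recognition.onresult = (event: { results: ArrayLike<ResultLike> }) => {
    onUpdate(splitTranscripts(event.results));
  };
  recognition.start();
  return recognition;
}
```

Showing the interim text in a lighter color than the final text gives the speaker the feedback that the form-input version lacked.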

text analytics

We can analyze conversational speech as sentences of text

Speech is different from written text, but let’s worry about that later

We can (mis)apply standard text analytics to speech

Named Entity Recognition
Machine Translation
Sentiment Analysis
Domain Classification
Fact Finding
Summarization
Extract semantic frames
Normalization
Segmentation
Simplification

named entity recognition

AlchemyAPI provides online NER for 8 European languages

Free for up to:

1,000 daily transactions
5 concurrent requests

Install their PHP library on your server and call it from the page with Ajax
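To show the shape of the data an NER step produces (text in, list of names out), here is a crude stand-in. It is not AlchemyAPI and not real NER: it just treats runs of capitalized words as names and skips the first word of each sentence. It also assumes the recognizer capitalizes proper nouns. The name `extractNames` is hypothetical.

```typescript
// Crude heuristic, for illustration only: a run of capitalized words
// that does not start a sentence is reported as one name.
function extractNames(text: string): string[] {
  const names: string[] = [];
  const sentences = text.split(/[.!?]+\s*/);
  for (let s = 0; s < sentences.length; s++) {
    const words = sentences[s].trim().split(/\s+/);
    let run: string[] = [];
    // Start at 1: the first word is capitalized whether or not it is a name.
    // The extra iteration at words.length flushes a run that ends the sentence.
    for (let w = 1; w <= words.length; w++) {
      const word = w < words.length ? words[w].replace(/[^A-Za-z'-]/g, "") : "";
      if (/^[A-Z][a-z'-]+$/.test(word)) {
        run.push(word);
      } else {
        if (run.length > 0) {
          names.push(run.join(" "));
        }
        run = [];
      }
    }
  }
  return names;
}
```

A real service also classifies each entity (person, place, organization) and catches names this heuristic misses, such as a name at the start of a sentence.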

Demo

start with some cool technology...

We have this input:

text from continuous speech recognition
list of names that were mentioned

What can we do with it?

good / evil

Clearly this information is interesting to the intelligence community.

Metadata: create a graph of the people you talk to
NER on content: superimpose a graph of the people you talk about

But can it be used for good, to give something of value back to the speaker or listener?

everybody gets an inset

Have you ever struggled to describe something or someone in words?
…when a picture would explain everything?
And yet it’s not worth interrupting your conversation to do an internet search?

You need Computer Aided Conversation!

The computer eavesdrops on your conversation, listening for names
When it hears a name, it searches for images of that thing
And displays the images on a nearby screen
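The three steps above can be sketched as a small pipeline. The name finder, the image search, and the display are passed in as functions, so the sketch takes no position on which services sit behind them. All names here (`createEavesdropper`, `NameFinder`, `ImageSearch`, `Display`) are hypothetical. Skipping a name that has already been shown is an added assumption, not something the talk specifies.

```typescript
type NameFinder = (text: string) => string[];
type ImageSearch = (query: string, done: (urls: string[]) => void) => void;
type Display = (name: string, urls: string[]) => void;

// Returns a function to call with each new chunk of recognized speech.
function createEavesdropper(
  findNames: NameFinder,
  searchImages: ImageSearch,
  display: Display
): (text: string) => void {
  const seen: { [key: string]: boolean } = {};
  return function onSpeech(text: string): void {
    const names = findNames(text);
    for (let i = 0; i < names.length; i++) {
      const name = names[i];
      const key = "name:" + name.toLowerCase();
      if (seen[key]) {
        continue; // assumption: show each name only once per conversation
      }
      seen[key] = true;
      searchImages(name, function (urls: string[]): void {
        if (urls.length > 0) {
          display(name, urls);
        }
      });
    }
  };
}
```

The user never issues a command. The callback fires only when the conversation happens to mention something that has a picture.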

finding related images

Microsoft Bing Search API

Free for up to 5,000 transactions per month

Demo

flickr.photos.search API

Free

Demo
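A sketch of how a name becomes a Flickr image search. The REST endpoint and the `method`, `api_key`, `text`, `per_page`, `format`, and `nojsoncallback` parameters come from Flickr's public API. The helper names are hypothetical, and the API key is a placeholder you would get from Flickr. The image URL pattern is the one Flickr currently documents and may differ from what was in use in 2014.

```typescript
// Subset of the fields Flickr returns for each photo in a search response.
interface FlickrPhoto { id: string; server: string; secret: string; title: string; }

// Build the request URL for flickr.photos.search with a JSON response.
function buildFlickrSearchUrl(apiKey: string, text: string, perPage: number): string {
  const params: [string, string][] = [
    ["method", "flickr.photos.search"],
    ["api_key", apiKey],
    ["text", text],
    ["per_page", String(perPage)],
    ["format", "json"],
    ["nojsoncallback", "1"], // plain JSON instead of a JSONP wrapper
  ];
  const parts: string[] = [];
  for (let i = 0; i < params.length; i++) {
    parts.push(encodeURIComponent(params[i][0]) + "=" + encodeURIComponent(params[i][1]));
  }
  return "https://api.flickr.com/services/rest/?" + parts.join("&");
}

// Turn one search result into a displayable image URL.
function photoUrl(photo: FlickrPhoto): string {
  return "https://live.staticflickr.com/" + photo.server + "/" +
    photo.id + "_" + photo.secret + ".jpg";
}
```

The photos arrive under `photos.photo` in the JSON response, so each entry can be passed to `photoUrl` and dropped into an `img` element.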

the disappearing user

That could have been me talking to a friend.

I wasn’t really using the computer.

It was just there, listening, acting when it had something to offer to the conversation.

Like an attentive servant.

pictures and meaning

Picture Theory of Meaning -- Wittgenstein
A statement is meaningful if it pictures a state of the world
Let's take that literally -- a statement is like a picture
A proper noun "pictures" the thing it names
Grand project: can we turn speech into pictures on the fly?
Deep Learning, recursive neural nets, semantic grounding

what we talk about

connect to a private database of pictures of friends, vacation spots

add last year's sales figures so they will pop up on a screen in the hallway at work during a water cooler conversation

using pictures to avoid miscommunication

the trouble with screens

The value may be small, but the effort is smaller. Close to zero.

But there's a setup problem...

We need a screen that both of us can see. If only there were a shared screen nearby…

ambient computing

processing that happens when you are busy doing something else

At first it will be creepy that computers are listening to us

But we will get over it

caveat lector

Previous predictions

Supermarket floors will become wall-to-wall advertisements
Robo-calls will grow until everyone's phone is constantly ringing

translating telephone

WebRTC
Machine Translation
Text-to-Speech

web real time communication

Real-time, peer-to-peer voice and video communications through a browser without plug-ins
Click-to-call from any web page
No phone number, navigate to the same page to connect
Video conferencing
Data Channels
JavaScript control of the screen each participant sees
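One possible arrangement for the translating telephone, sketched under assumptions: each side recognizes its own speech, sends the text over a WebRTC data channel, and the receiver translates it and speaks it aloud. This is not necessarily how the demo was built. The message format and all names (`SpokenMessage`, `sendUtterance`, `handleIncoming`) are hypothetical, and translation and text-to-speech are passed in as functions rather than tied to a particular service. `ChannelLike` mirrors only the `send` method of a real `RTCDataChannel`.

```typescript
interface SpokenMessage { lang: string; text: string; }
interface ChannelLike { send(data: string): void; }
type Translate = (text: string, from: string, to: string,
                  done: (translated: string) => void) => void;
type Speak = (text: string, lang: string) => void;

// Sender side: ship recognized text together with its language code.
function sendUtterance(channel: ChannelLike, lang: string, text: string): void {
  const message: SpokenMessage = { lang: lang, text: text };
  channel.send(JSON.stringify(message));
}

// Receiver side: translate only when the languages differ, then speak.
function handleIncoming(data: string, myLang: string,
                        translate: Translate, speak: Speak): void {
  const message = JSON.parse(data) as SpokenMessage;
  if (message.lang === myLang) {
    speak(message.text, myLang);
    return;
  }
  translate(message.text, message.lang, myLang, function (translated: string): void {
    speak(translated, myLang);
  });
}
```

In a browser, `handleIncoming` would be called from the data channel's `onmessage` handler with `event.data`, while the audio and video travel over the same peer connection.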