Fun with Speech

Things you can do with HTML5 Natural Language APIs

John Dimm

February 2014


  1. find a need
  2. create a solution


  1. find some interesting technology
  2. do something stupid with it

  • Speech Input for HTML Forms
  • Web Speech API
  • AlchemyAPI
  • Bing Search
  • WebRTC
  • Bing Translator API
  • Google Translate Text-to-Speech

minimal api demos

toy browser "apps"

Talkshow: shows images of the things you are talking about

Translating Telephone: multilingual video conferencing

two kinds of apps

  1. apps that change the world
  2. apps that have to wait for the world to change around them 

For these two (Talkshow and Translating Telephone):

  • underlying technology needs a few more quantum leaps forward
  • until then, users must acquire special skills


  • natural
  • hands free
  • no screen needed
  • no keyboard
  • magic -- action at a distance

talk is cheap

  • It’s easy, almost effortless
  • Our preferred way of communicating
  • High bandwidth
  • In fact, we love to do it
  • All the time
  • Some of us can’t stop
  • The easiest form of work
  • One of the first things you learn
  • and the last thing to go

ease of use

ergonomics -- reduce number of clicks and keystrokes

  • one-click is nice
  • voice allows zero-click interface
  • effortless 
  • No clicking, no typing, just say the magic words

voice commands

  • Very hard to do well
  • Frustrating for users
  • To reduce errors, we have to limit vocabulary 
  • The user has to know or guess the available commands 
  • Big penalty for getting a command wrong 
  • Errors in speech recognition cannot be avoided
  • User errors are also unavoidable

    error correction

    command line

    1. up arrow
    2. arrow left
    3. type over the error
    4. submit


    1. repeat entire command until it is recognized



    how to use speech

    • voice commands
    • transcription
    • something completely different: react to overheard conversation

    Can a computer make itself useful by listening to my conversations?

    speech input for HTML forms

    • Speech recognition stops at the first pause 
    • No feedback during recognition
    • You have to click on the microphone icon to speak 

      continuous speech recognition

      • In 2013, Google Chrome gets the Web Speech API
      • Continuous ASR sessions, lasting several minutes
      • Now we can process and respond to conversational speech
      • Speech meant for other humans, not computers

      interim results
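
A minimal sketch of how interim and final results might be folded into one running transcript. The fields it reads (`resultIndex`, `results`, `isFinal`, `transcript`) are those of the Web Speech API result event; the type names and the `applyResults` function are hypothetical, introduced here so the logic can run without a browser.

```typescript
// Structural types covering only the fields this sketch reads from a
// Web Speech API result event. The type names are hypothetical.
interface AlternativeLike { transcript: string }
interface ResultLike { isFinal: boolean; 0: AlternativeLike }
interface ResultEventLike {
  resultIndex: number;             // index of the first result that changed
  results: ArrayLike<ResultLike>;  // every result so far in this session
}

interface TranscriptState {
  finalText: string;    // text the recognizer has committed to
  interimText: string;  // current best guess, may still change
}

// Fold one result event into the running transcript.
function applyResults(state: TranscriptState, event: ResultEventLike): TranscriptState {
  let finalText = state.finalText;
  let interimText = ""; // interim text is rebuilt from scratch on every event
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    if (result.isFinal) {
      finalText += result[0].transcript;
    } else {
      interimText += result[0].transcript;
    }
  }
  return { finalText, interimText };
}
```

In Chrome this would be called from the `onresult` handler of a `webkitSpeechRecognition` object with `continuous` and `interimResults` both set to `true`. Showing `interimText` in a lighter color gives the speaker feedback while recognition is still in progress.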

      text analytics

      We can analyze conversational speech as sentences of text
      • Speech is different from written text, but let’s worry about that later

      We can (mis)apply standard text analytics to speech

      • Named Entity Recognition
      • Machine Translation
      • Sentiment Analysis 
      • Domain Classification 
      • Fact Finding 
      • Summarization 
      • Extract semantic frames
      • Normalization
      • Segmentation
      • Simplification

      named entity recognition

      AlchemyAPI provides online NER for 8 European languages

      Free for up to:
      • 1,000 daily transactions
      • 5 concurrent requests

      Install their PHP library on your server and call it with Ajax

      start with some cool technology...

      We have this input: 

      • text from continuous speech recognition
      • list of names that were mentioned

      What can we do with it?

      good / evil

      Clearly this information is interesting to the intelligence community.
      • Metadata:  create a graph of the people you talk to
      • NER on content: superimpose a graph of the people you talk about

      But can it be used for good, to give something of value back to the speaker or listener?

      everybody gets an inset

      Have you ever struggled to describe something or someone in words?
      …when a picture would explain everything?
      And yet it’s not worth interrupting your conversation to do an internet search?

      You need Computer Aided Conversation!

      • The computer eavesdrops on your conversation, listening for names
      • When it hears a name, it searches for images of that thing
      • And displays the images on a nearby screen
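
The three steps above can be sketched as a small loop. This is an illustration under assumptions, not the Talkshow source: `createTalkshow`, `extractNames`, `searchImages`, and `show` are hypothetical names, and the real entity-recognition and image-search calls are passed in as functions because the slides do not give their request formats.

```typescript
// Hypothetical service signatures. Real versions would wrap a named-entity
// recognition request and an image-search request.
type ExtractNames = (text: string) => Promise<string[]>;
type SearchImages = (name: string) => Promise<string[]>; // resolves to image URLs
type Show = (name: string, imageUrls: string[]) => void;

// Returns a function to call once per finalized sentence from the recognizer.
function createTalkshow(
  extractNames: ExtractNames,
  searchImages: SearchImages,
  show: Show,
): (sentence: string) => Promise<void> {
  const seen = new Set<string>(); // names already looked up in this conversation

  return async function hear(sentence: string): Promise<void> {
    const names = await extractNames(sentence);
    for (const name of names) {
      const key = name.trim().toLowerCase();
      if (key === "" || seen.has(key)) continue; // skip repeats to save quota
      seen.add(key);
      const urls = await searchImages(name);
      if (urls.length > 0) show(name, urls);
    }
  };
}
```

Remembering which names were already searched matters because both services are metered: a name that comes up ten times in a conversation should cost one search, not ten.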

      finding related images

      Microsoft Bing Search API
      • Free for up to 5,000 transactions per month

      the disappearing user

      That could have been me talking to a friend.  
      I wasn’t really using the computer.  
      It was just there, listening, acting when it had something to offer to the conversation.  
      Like an attentive servant.

        pictures and meaning

        • Picture Theory of Meaning -- Wittgenstein

        • A statement is meaningful if it pictures a state of the world
        • Let's take that literally -- a statement is like a picture
        • A proper noun "pictures" the thing it names
        • Grand project: can we turn speech into pictures on the fly?
        • Deep Learning, recursive neural nets, semantic grounding

        what we talk about

        connect to a private database of pictures of friends, vacation spots

        add last year's sales figures so they will pop up on a screen in the hallway at work during a water cooler conversation

        using pictures to avoid miscommunication

        the trouble with screens

        The value may be small, but the effort is smaller. Close to null.

        But there's a setup problem...

        We need a screen that both of us can see.  If only there were a shared screen nearby…

        ambient computing

        processing that happens when you are busy doing something else

        At first it will be creepy that computers are listening to us

        But we will get over it

        caveat lector

        Previous predictions

        • Supermarket floors will become wall-to-wall advertisements
        • Robo-calls will grow until everyone's phone is constantly ringing 

        translating telephone

        • WebRTC
        • Machine Translation
        • Text-to-Speech

        web real time communication

        • Real-time, peer-to-peer voice and video communications through a browser without plug-ins
        • Click-to-call from any web page
        • No phone number: navigate to the same page to connect
        • Video conferencing
        • Data Channels
        • JavaScript control of the screen each participant sees

        machine translation

        • Voice-challenged
        • Trained on human-translated bilingual corpora
        • Therefore well-formed sentences
        • High bar for grammaticality
        • Not speech
        • Little training data for speech
        • Disfluencies -- um, huh, ah, er, duh
        • Stop/start
        • Backtracking
        • Parenthetic speech


        Bing Translator API
        • Free for up to 2,000,000 characters a month

        speech synthesis

        • Google Translate’s unofficial API
        • 100 characters at a time
        • use https
        • The Web Speech API is supposed to do synthesis too; it’s coming
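
With a 100-character limit per request, longer translations have to be cut into pieces before synthesis. A minimal sketch of one way to do that, breaking at spaces so words stay whole; `chunkForSpeech` is a hypothetical helper, and the request to the speech service itself is deliberately left out.

```typescript
// Split text into pieces of at most maxLen characters, breaking at
// whitespace where possible. A single word longer than maxLen is cut hard.
function chunkForSpeech(text: string, maxLen: number = 100): string[] {
  if (maxLen < 1) throw new RangeError("maxLen must be at least 1");
  const chunks: string[] = [];
  let current = "";
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  for (const word of words) {
    let rest = word;
    // A word that cannot fit in any chunk is emitted in maxLen-sized slices.
    while (rest.length > maxLen) {
      if (current !== "") {
        chunks.push(current);
        current = "";
      }
      chunks.push(rest.slice(0, maxLen));
      rest = rest.slice(maxLen);
    }
    if (current === "") {
      current = rest;
    } else if (current.length + 1 + rest.length <= maxLen) {
      current += " " + rest; // the word fits, including the joining space
    } else {
      chunks.push(current);  // the chunk is full, start a new one
      current = rest;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```

Each chunk would then be requested and played in order. Cutting at sentence punctuation first would probably sound more natural, since every chunk boundary introduces an audible pause.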

        translating telephone

        • Multilingual Video Conferencing
        • Everyone broadcasts in their own language
        • Foreign language is translated locally to mine
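
One way to read the last two bullets: each participant sends recognized text tagged with its language, and each receiver decides locally whether translation is needed. The sketch below assumes the text travels as a JSON string (for example over a WebRTC data channel); the `Utterance` envelope and the `encodeUtterance`, `decodeUtterance`, and `planDelivery` functions are hypothetical and not taken from the demo.

```typescript
// Hypothetical envelope broadcast to every participant.
interface Utterance { speaker: string; lang: string; text: string }

function encodeUtterance(u: Utterance): string {
  return JSON.stringify(u);
}

// Returns null for anything that is not a well-formed utterance.
function decodeUtterance(raw: string): Utterance | null {
  try {
    const v = JSON.parse(raw);
    if (v && typeof v.speaker === "string" && typeof v.lang === "string" && typeof v.text === "string") {
      return { speaker: v.speaker, lang: v.lang, text: v.text };
    }
  } catch {
    // Not JSON: fall through and reject the message.
  }
  return null;
}

type DeliveryPlan =
  | { action: "show"; text: string }
  | { action: "translate"; from: string; to: string; text: string };

// Compare only the primary subtag, so "es-MX" and "es-ES" count as one language.
function primarySubtag(tag: string): string {
  return tag.split("-")[0].toLowerCase();
}

// Decide on the receiving side what to do with an incoming utterance.
function planDelivery(u: Utterance, myLang: string): DeliveryPlan {
  if (primarySubtag(u.lang) === primarySubtag(myLang)) {
    return { action: "show", text: u.text };
  }
  return { action: "translate", from: u.lang, to: myLang, text: u.text };
}
```

Translating on the receiving side means a speaker never needs to know which languages the listeners use; adding a participant with a new language changes nothing for anyone else.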


