Teaching the browser how to chat
@dbozhinovski
About
- Does JavaScript for fun and profit
- Works at Virtask
- Organizes beer.js Skopje
- Likes long walks on the beach (not)
What's this about
- NLP - the boring stuff
- 1000 words
- Did I mention this is offline?
- A small example
- Takeaway
First, the boring stuff
NLP
According to Wikipedia (as is the custom):
Natural-language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to fruitfully process large amounts of natural language data.
https://en.wikipedia.org/wiki/Natural-language_processing
NLP basics in practice
- Largely statistical
- Lemmatisation
- Part of speech tagging
Example
I am learning about chatbots
Parsing
[I, am, learning, about, chatbots]
Lemmatizing
[I, be, learn, about, chatbots]
PoS tagging
[pronoun, verb, verb, preposition, noun (plural)]
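The three steps above can be sketched in a few lines. This is a toy version only: the `LEMMAS` and `POS` lookup tables are made up for this one sentence, and real libraries rely on far larger data plus statistical rules.

```javascript
// Toy lookup tables: assumptions for illustration, not real linguistic data.
const LEMMAS = { am: 'be', learning: 'learn' };
const POS = {
  i: 'pronoun',
  be: 'verb',
  learn: 'verb',
  about: 'preposition',
  chatbots: 'noun (plural)',
};

// Parsing: split the sentence into word tokens.
const tokenize = (text) => text.match(/[A-Za-z']+/g) || [];

// Lemmatizing: map each token to its dictionary form, if we know one.
const lemmatize = (tokens) => tokens.map((t) => LEMMAS[t.toLowerCase()] || t);

// PoS tagging: look up each lemma, falling back to "unknown".
const tag = (lemmas) => lemmas.map((l) => POS[l.toLowerCase()] || 'unknown');

const tokens = tokenize('I am learning about chatbots');
const lemmas = lemmatize(tokens);
const tags = tag(lemmas);

console.log(tokens); // [ 'I', 'am', 'learning', 'about', 'chatbots' ]
console.log(lemmas); // [ 'I', 'be', 'learn', 'about', 'chatbots' ]
console.log(tags);   // [ 'pronoun', 'verb', 'verb', 'preposition', 'noun (plural)' ]
```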
What can be done with the most common 1000 words
English language facts
If the 80-20 rule applies for most things, the "94-6 rule" applies when working with language - by Zipf's law:
- The top 10 words account for 25% of used language.
- The top 100 words account for 50% of used language.
- The top 1,000 words account for 80% of used language. (sweet spot)
- The top 50,000 words account for 95% of used language.
https://github.com/spencermountain/compromise/wiki/Justification
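To get a feel for those coverage numbers, here is a minimal sketch that measures what share of a text the top N words account for. The helper names (`wordCounts`, `coverage`) and the sample sentence are made up for illustration; the percentages above only show up on a large, real corpus.

```javascript
// Count how often each word occurs in a text.
const wordCounts = (text) => {
  const counts = new Map();
  for (const word of text.toLowerCase().match(/[a-z']+/g) || []) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
};

// Fraction of all word occurrences covered by the n most frequent words.
const coverage = (counts, n) => {
  const sorted = [...counts.values()].sort((a, b) => b - a);
  const total = sorted.reduce((sum, c) => sum + c, 0);
  const top = sorted.slice(0, n).reduce((sum, c) => sum + c, 0);
  return total === 0 ? 0 : top / total;
};

// Made-up sample text; paste in a larger corpus to see the curve flatten.
const sampleText = 'the cat and the dog and the bird saw the cat';
const counts = wordCounts(sampleText);

console.log(coverage(counts, 1)); // 4 of 11 words, about 0.36
console.log(coverage(counts, 3)); // 8 of 11 words, about 0.73
```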
Enter compromise.js
https://github.com/spencermountain/compromise/wiki/Justification
The process is to get some curated data, find the patterns, and list the exceptions. Bada bing, bada boom. In this way a satisfactory NLP library can be built with breathtaking lightness. Namely, it can be run right on the user's computer instead of a server.
Online and offline NLP
But first, a war story
Compromise.js
- The most used 1000 words make up ~80% of used English
- With those words plus a bit of "magic", we create a good-enough NLP that is able to work offline
- Considering it can do plugins, custom lexicons and all sorts of config, we can cover a lot
- That said, there are a ton of edge cases best left to linguistics experts and CompSci people
- Best of all - it's ~200kb (dictionary included)
const sample = `The journey took us through Rome, Madrid and finally, Paris...`;
// The only part that actually matters :)
const places = nlp(sample).places().out('array');
document.querySelector('.out').innerText = places;
// No dice - compromise has no idea what Bitola and Ohrid even mean :)
const sample = "The journey took us through Skopje, Bitola and finally, Ohrid...";
const places = nlp(sample).places().out('array');
document.querySelector('.out').innerText = places;
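One way around the missing place names is the custom lexicon mentioned earlier. The slides don't show the actual compromise call, so the sketch below stays library-free and only illustrates the idea: an extra word-to-tag table consulted during lookup. `customLexicon` and `findTagged` are hypothetical names, not part of compromise.

```javascript
// Hypothetical extra lexicon: lowercase word -> tag. The entries are our own.
const customLexicon = { skopje: 'Place', bitola: 'Place', ohrid: 'Place' };

// Return the words of `text` that the lexicon tags with `wantedTag`.
const findTagged = (text, lexicon, wantedTag) =>
  (text.match(/[\p{L}']+/gu) || []).filter(
    (word) => lexicon[word.toLowerCase()] === wantedTag
  );

const sample = 'The journey took us through Skopje, Bitola and finally, Ohrid...';
console.log(findTagged(sample, customLexicon, 'Place'));
// [ 'Skopje', 'Bitola', 'Ohrid' ]
```

A plain lookup like this has no notion of context, which is exactly the part a real NLP library adds on top.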
Stuff compromise.js can do
- Verb analysis (tense)
- Noun analysis (singular / plural, place, name, organization, unit...)
- Dates, Numbers, Values
- Tags
- Transformations
http://compromise.cool/
Stuff compromise.js sucks at
L A N G U A G E (s)
Now, for something (hopefully) cool
BeerBot
a short demo:
https://beerbot.darko.io
"The Brain"
import nlp from 'compromise';
import skills from './skills/'; // a skill - something that the bot knows how to do
import { get, set } from 'lodash';

// Assumption (not on the original slide): the shared context is restored
// from the localStorage key that the skills write to
const context = JSON.parse(localStorage.getItem('bjs-bot-context') || '{}');

const getReply = async (input) => {
  // each skill comes with match rules; we look through them and pick
  // the first skill with a rule that works with the given input
  const skillMatch = skills.find(
    s => s.matchRules.some(r => nlp(input).normalize().match(r).found)
  );

  nlp(input).debug(); // VERY useful debugging info (debug() logs by itself)

  if (skillMatch) {
    // keep some history, for more fancy stuff
    const topicHistory = get(context, 'topics') || [];
    topicHistory.push(skillMatch.ID);
    set(context, 'topics', topicHistory);

    // reply = a function that gets executed on a skill match
    const reply = await skillMatch.reply(input, context);
    return reply;
  } else {
    // Otherwise, fall back to something really basic
    return { mode: 'text', value: 'Hi there!' };
  }
};
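To poke at the dispatch logic without a browser or compromise, here is a stripped-down sketch of the same flow. The assumptions are loud ones: rules are plain `RegExp` objects instead of compromise match strings, the `beer` skill and the fallback text are invented, and the context lives in memory.

```javascript
// Stub skills: same shape as the bot's skills, but with RegExp rules
// (an assumption, so the example runs without compromise).
const skills = [
  {
    ID: 'greet',
    matchRules: [/^(hi|hello|ahoy|greetings)( bot)?\b/i],
    reply: async () => ({ mode: 'text', value: 'Hi there.' }),
  },
  {
    ID: 'beer', // hypothetical skill, only here to show topic history
    matchRules: [/\bbeer\b/i],
    reply: async (input, context) => ({
      mode: 'text',
      value: `Beer came up ${context.topics.filter((t) => t === 'beer').length} time(s).`,
    }),
  },
];

const context = { topics: [] };

const getReply = async (input) => {
  // First skill with at least one rule that matches the input
  const skillMatch = skills.find((s) => s.matchRules.some((r) => r.test(input)));
  if (!skillMatch) {
    return { mode: 'text', value: "Sorry, I didn't get that." };
  }
  context.topics.push(skillMatch.ID); // keep topic history
  return skillMatch.reply(input, context);
};

getReply('hello bot').then((r) => console.log(r.value)); // Hi there.
```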
Skills
import { set, get, random } from 'lodash';
const ID = 'greet'; // The name of the skill
const lexicon = {}; // Custom lexicon / tagging if we happen to need it
// Rule(s) to match input against
// Simple lookup, tagging, logic + full support for regex
const matchRules = [
  '(hi|hello|ahoy|greetings|#Expression) bot?'
];
// Some replies to return
const replies = [
  () => ({ mode: 'text', value: 'Hi there.' }),
  () => ({ mode: 'text', value: 'What\'s up?' }),
];
const reply = (input, context) => {
  // Store some metadata to localStorage
  const timesMatched = get(context, 'greet.matched', 0);
  set(context, 'greet.matched', timesMatched + 1);
  localStorage.setItem('bjs-bot-context', JSON.stringify(context));
  // get a random-ish reply, to avoid being very repetitive
  const replyRoll = random(0, replies.length - 1);
  return replies[replyRoll](input, context); // Return said reply
};
// Export stuff
export default { ID, lexicon, matchRules, reply };
That said, let's make the beerbot smarter!
or, how to make a weather skill
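A rough sketch of what such a skill could look like, following the shape of the greet skill (`ID`, `lexicon`, `matchRules`, `reply`). Everything specific here is an assumption: the match rule is written in the greet rule's style but untested against compromise, the city is pulled out with a plain regex, and `defaultLookup` returns canned data where a real weather API call would go. A real skill module would `export default` the object instead of keeping it in a local constant.

```javascript
const ID = 'weather'; // The name of the skill
const lexicon = {};   // No custom tagging needed for this sketch

// Same style as the greet skill's rule; untested against compromise
const matchRules = ['(weather|forecast)'];

// Pull the city out with a plain regex: whatever follows "in" or "for"
const extractCity = (input) => {
  const m = input.match(/\b(?:in|for)\s+([\p{L}][\p{L}\s'-]*)/iu);
  return m ? m[1].trim() : null;
};

// Hypothetical lookup with canned data: swap in a real weather API call here
const defaultLookup = async (city) => ({ city, summary: 'sunny', tempC: 21 });

const reply = async (input, context = {}) => {
  const city = extractCity(input);
  if (!city) {
    return { mode: 'text', value: 'Which city do you mean?' };
  }
  // Allow the lookup to be injected through the context (handy for testing)
  const lookup = context.weatherLookup || defaultLookup;
  const { summary, tempC } = await lookup(city);
  return { mode: 'text', value: `${city}: ${summary}, ${tempC}°C` };
};

const weatherSkill = { ID, lexicon, matchRules, reply };

weatherSkill.reply("What's the weather in Skopje?").then((r) => console.log(r.value));
```

Since the brain already awaits `skillMatch.reply(input, context)`, an async reply like this one should slot in without changes to the dispatch code.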
The takeaway
- Chatbots aren't rocket science
- Text-based experiences can be made better
- Offline can be good enough
- Browsers are really damn capable these days
Thanks!
- github: @dbozhinovski
- twitter: @d_bozhinovski