ToPS

Toolkit of Probabilistic Models of Sequences

for developers

Renato Cordeiro Ferreira

2017

Architecture

Lang

Config

Model

Exception

App

Model

ToPS' main component. It holds all code for probabilistic models as an independent shared library

All probabilistic models make their calculations based on a discrete numeric alphabet

Lang

ToPS' language component.
It holds the implementation of a domain specific language (DSL)
to describe probabilistic models

ToPS' DSL is based on ChaiScript, an embedded scripting language designed to be integrated with C++
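To illustrate the embedding, here is a minimal, hypothetical sketch using ChaiScript's standard API; it is independent of ToPS' actual tops::lang code, and the names uniform and sides are made up for the example:

#include <chaiscript/chaiscript.hpp>
#include <iostream>

// A C++ function we want to call from the script
double uniform(double n) { return 1.0 / n; }

int main() {
  chaiscript::ChaiScript chai;                     // the script engine
  chai.add(chaiscript::fun(&uniform), "uniform");  // expose the function
  chai.add(chaiscript::var(6.0), "sides");         // expose a variable

  // Script code can now mix its own syntax with the C++ bindings
  auto p = chai.eval<double>("uniform(sides)");
  std::cout << p << '\n';                          // prints 0.166667
}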

Config

ToPS' auxiliary layer. It holds a C++-based intermediate representation of probabilistic models

ToPS' config structures store parameters to train and define probabilistic models

Exception

ToPS' exceptions, representing all errors that can happen during the execution of ToPS

App

ToPS' command-line applications, allowing end users to execute tasks on the probabilistic models

Model

tops::model

Probabilistic Models

Model                                   Acronym   Type
Discrete and Independent Distribution   IID       Simple
Inhomogeneous Markov Chain              IMC       Simple
Maximal Dependence Decomposition        MDD       Simple
Multiple Sequential Model               MSM       Simple
Periodic Inhomogeneous Markov Chain     PIMC      Simple
Similarity Based Sequence Weighting     SBSW      Simple
Variable Length Markov Chain            VLMC      Simple
Hidden Markov Model                     HMM       Decodable
Generalized Hidden Markov Model         GHMM      Decodable

Features

Evaluate

Calculate the probability P(x) of a sequence x given the model

Generate

Draw random sequences from the model

Train

Estimate parameters of the model from a dataset

Serialize

Save parameters of the model for later reuse

Calculate

Estimate intermediate probabilities for algorithms

Label

Find the categories of each symbol of an input sequence

Calculate and Label are available for decodable models only

Architecture (SECRETARY)

Probabilistic Models
(COMPOSITE)

Trainer

Estimates parameters of the model from a dataset

Evaluator

Calculates the probability of a sequence given the model

Generator

Draws random sequences from the model

Serializer

Saves parameters of the model for later reuse

Calculator

Estimates intermediate probabilities for algorithms

Labeler

Finds the categories of each symbol of an input sequence

One boss (the probabilistic model) delegates to six secretaries, one per behavior

SECRETARY pattern explained

Boss

  • Has multiple behaviors
  • Has multiple secretaries
  • Is used indirectly by clients
  • Holds data shared among behaviors
  • Keeps all the code that implements algorithms

Secretary

  • Represents only one behavior
  • Represents only one boss
  • Interacts directly with clients
  • Holds data used only by the behavior they represent
  • Keeps no meaningful logic, forwarding calls to its boss
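To make the division of responsibilities concrete, here is a minimal sketch of the pattern with an Evaluator secretary; the names and the toy probability logic are hypothetical and do not reproduce ToPS' actual classes:

#include <cstddef>
#include <iostream>
#include <memory>
#include <vector>

class Model;  // the boss

// Secretary for the "evaluate" behavior: holds only its own data,
// interacts directly with clients, and forwards all logic to its boss
class Evaluator {
 public:
  Evaluator(std::shared_ptr<Model> boss, std::vector<int> sequence)
      : boss_(std::move(boss)), sequence_(std::move(sequence)) {}
  double evaluateSequence(std::size_t begin, std::size_t end);

 private:
  std::shared_ptr<Model> boss_;
  std::vector<int> sequence_;  // data used only by this behavior
};

// Boss: has multiple behaviors and keeps all the algorithmic code
class Model : public std::enable_shared_from_this<Model> {
 public:
  // FACTORY METHOD: clients use the boss indirectly, via secretaries
  std::shared_ptr<Evaluator> standardEvaluator(std::vector<int> sequence) {
    return std::make_shared<Evaluator>(shared_from_this(),
                                       std::move(sequence));
  }
  double evaluateSymbol(int symbol) const { return symbol == 0 ? 0.6 : 0.4; }
};

double Evaluator::evaluateSequence(std::size_t begin, std::size_t end) {
  double p = 1.0;  // no meaningful logic here: just forward to the boss
  for (auto i = begin; i < end; ++i) p *= boss_->evaluateSymbol(sequence_[i]);
  return p;
}

int main() {
  auto model = std::make_shared<Model>();
  auto evaluator = model->standardEvaluator({0, 1, 0});
  std::cout << evaluator->evaluateSequence(0, 3) << '\n';  // 0.6 * 0.4 * 0.6
}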

SECRETARY: Class diagram

SECRETARY: Sequence diagram

Example: IID for a Coin

std::vector<Sequence> training_set = {
  Sequence { 0, 0, 0, 1, 1 },
  Sequence { 0, 0, 0, 1, 0, 0, 1, 1 },
  Sequence { 0, 0, 0, 1, 1, 0, 0 }
};

// Train
auto iid_trainer = DiscreteIIDModel::standardTrainer();

iid_trainer->add_training_set(std::move(training_set));

auto iid = iid_trainer->train(
  DiscreteIIDModel::maximum_likehood_algorithm{}, 2);  // alphabet size 2

// Evaluate
auto evaluator = iid->standardEvaluator({0, 1, 0});
evaluator->evaluateSequence(0, 2);  // probability of the subsequence from 0 to 2

// Generate
auto generator = iid->standardGenerator();
generator->drawSequence(5);  // draw a random sequence of length 5

Example: HMM for a Coin (ML)

std::vector<Labeling<Sequence>> training_set = {
  Labeling<Sequence>{ {0, 0, 0, 1, 1},          {1, 1, 1, 1, 1}          },
  Labeling<Sequence>{ {0, 0, 0, 1, 0, 0, 1, 1}, {0, 1, 1, 0, 0, 0, 1, 1} },
  Labeling<Sequence>{ {0, 0, 0, 1, 1, 0, 0},    {0, 0, 0, 0, 0, 1, 0}    }
};

// Train
auto hmm_trainer = HiddenMarkovModel::labelingTrainer();

hmm_trainer->add_training_set(std::move(training_set));

// assumed order: number of symbols, number of labels, pseudocounts
auto hmm = hmm_trainer->train(
  HiddenMarkovModel::maximum_likehood_algorithm{}, 2, 2, 0.1);

// Evaluate
auto evaluator = hmm->labelingEvaluator({ {0, 0, 0}, {1, 1, 1} });
evaluator->evaluateSequence(0, 3);  // joint probability of observations and labels

// Generate
auto generator = hmm->labelingGenerator();
generator->drawSequence(5);  // draw a random labeled sequence of length 5

Example: HMM for a Coin (EM)

std::vector<Sequence> training_set = {
  Sequence{ 0, 0, 0, 1, 1 },
  Sequence{ 0, 0, 0, 1, 0, 0, 1, 1 },
  Sequence{ 0, 0, 0, 1, 1, 0, 0 }
};

// Train
auto hmm_trainer = HiddenMarkovModel::standardTrainer();

hmm_trainer->add_training_set(std::move(training_set));

// initial_hmm is a starting model defined elsewhere; run at most
// 1000 iterations or stop at a 1e-4 difference threshold
auto hmm = hmm_trainer->train(
  HiddenMarkovModel::baum_welch_algorithm{}, initial_hmm, 1000, 1e-4);

// Evaluate
auto evaluator = hmm->labelingEvaluator({ {0, 0, 0}, {1, 1, 1} });
evaluator->evaluateSequence(0, 3);

// Generate
auto generator = hmm->labelingGenerator();
generator->drawSequence(5);

Model Hierarchy: Interfaces

tops::model::ProbabilisticModel

Root of the hierarchy, implements four secretaries: Trainer, Evaluator, Generator and Serializer

tops::model::DecodableModel

Node of the hierarchy, descends directly from ProbabilisticModel and implements all of its parent's secretaries plus two: Calculator and Labeler

tops::model::ProbabilisticModelDecorator

Node of the hierarchy, descends directly from ProbabilisticModel and adds functionality around the implementation of its parent's secretaries
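The shape of the hierarchy can be sketched in C++ declarations as below; these are illustrative only and do not reproduce ToPS' exact signatures:

#include <memory>

// Root: offers the four base secretaries via factory methods
class ProbabilisticModel {
 public:
  virtual ~ProbabilisticModel() = default;
  // standardTrainer(), standardEvaluator(...), standardGenerator()
  // and a serializer factory would be declared here
};

// Adds the two extra secretaries for models that can be decoded
class DecodableModel : public ProbabilisticModel {
 public:
  // labeler(...) and calculator(...) factories would be declared here
};

// Wraps another model to add behavior around its secretaries (DECORATOR)
class ProbabilisticModelDecorator : public ProbabilisticModel {
 protected:
  std::shared_ptr<ProbabilisticModel> decorated_;  // the wrapped model
};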

Model Hierarchy: CRTP

The curiously recurring template pattern (CRTP) is an idiom in C++ in which a class X derives from a class template instantiation using itself as template argument. [...] Some use cases for this pattern are static polymorphism and other metaprogramming techniques [...]

Wikipedia, Curiously Recurring Template Pattern

tops::model::ProbabilisticModelCRTP
tops::model::DecodableModelCRTP
tops::model::ProbabilisticModelDecoratorCRTP

They implement FACTORY METHODs for secretaries, define the virtual methods that secretaries delegate to, and host code reused between subclasses
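A minimal illustration of the idiom follows; the classes are hypothetical, not ToPS' real code:

#include <iostream>

// Base class template: knows the concrete subclass at compile time,
// so shared code can call into it without virtual dispatch
template <typename Derived>
class ModelCRTP {
 public:
  // Code reused between subclasses, delegating to the derived class
  double evaluateTwice(int symbol) {
    return 2 * static_cast<Derived*>(this)->evaluate(symbol);
  }
};

// Derives from the template instantiated with itself: the CRTP
class FairCoin : public ModelCRTP<FairCoin> {
 public:
  double evaluate(int /*symbol*/) const { return 0.5; }
};

int main() {
  FairCoin coin;
  std::cout << coin.evaluateTwice(0) << '\n';  // prints 1
}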

Implementation details

  • All probabilistic models use an alias Symbol, defined as an unsigned int, to make their calculations
  • Algorithms make calculations with a class Probability, which uses operator overloading to calculate probabilities in log-space while keeping the sum / multiplication syntax (see the sketch after this list)
  • The alias Sequence is a std::vector<Symbol> and the alias Matrix is a std::vector<std::vector<Symbol>>
  • Most secretaries have two subtypes: simple and cached. The latter pre-calculates results to answer queries in O(1)
  • Other hierarchies (Duration, State, etc.) also use CRTP
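A minimal sketch of the log-space idea behind Probability; it is illustrative only, and ToPS' actual class is more complete:

#include <algorithm>
#include <cmath>
#include <iostream>

// Stores log(p) internally; operator overloading keeps the usual
// sum / multiplication syntax while avoiding numerical underflow
class Probability {
 public:
  explicit Probability(double p) : log_p_(std::log(p)) {}

  // p * q in probability space is log(p) + log(q) in log-space
  Probability operator*(const Probability& rhs) const {
    return fromLog(log_p_ + rhs.log_p_);
  }

  // p + q uses the log-sum-exp trick for numerical stability
  Probability operator+(const Probability& rhs) const {
    double m = std::max(log_p_, rhs.log_p_);
    return fromLog(m + std::log(std::exp(log_p_ - m)
                              + std::exp(rhs.log_p_ - m)));
  }

  double value() const { return std::exp(log_p_); }

 private:
  double log_p_;
  static Probability fromLog(double log_p) {
    Probability p(1.0);
    p.log_p_ = log_p;
    return p;
  }
};

int main() {
  Probability p(0.5), q(0.25);
  std::cout << (p * q).value() << '\n';  // 0.125, computed in log-space
  std::cout << (p + q).value() << '\n';  // 0.75, via log-sum-exp
}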

Development details

  • Test coverage is collected with LCOV
  • Code style is enforced using CppLint
  • Memory errors are verified with Valgrind
  • Static errors are verified with CppCheck
  • Compilation is automated with UMake

Repository details

  • Tests are run online in TravisCI
    (configurations for the test environment are set in .travis.yml)
  • Coverage statistics are collected with Coveralls
    (statistics are collected after each test in Travis)
  • Code documentation is generated with Doxygen
    (comments are put above with Javadoc-style)
  • Contribution guidelines are listed in CONTRIBUTING.md
    (pull request instructions for peer-review are listed there)

Lang

tops::lang

In computing, configuration files, or config files, configure the parameters and initial settings for some computer programs.

Wikipedia, Configuration file

Configuration files

Definition

Represents trained parameters for a given model

Training

Represents training parameters for a given model

Configuration files: Definition

Simple models (unidimensional sequences)

Model name + Observations' domain + Model-specific parameters

Decodable models (multidimensional sequences)

Simple model's options + Other observations' domains + Labels' domain

Example: IID for a Dice

// -*- mode: c++ -*-
// vim: ft=chaiscript:

model_type = "IID"

observations = [ "1", "2", "3", "4", "5", "6" ]

emission_probabilities = [
  "1" : 1.0/6,
  "2" : 1.0/6,
  "3" : 1.0/6,
  "4" : 1.0/6,
  "5" : 1.0/6,
  "6" : 1.0/6
]

Example: HMM for a Dice

// -*- mode: c++ -*-
// vim: ft=chaiscript:

model_type = "HMM"

observations = [ "1", "2", "3", "4", "5", "6" ]

labels = [ "Fair", "Loaded" ]

initial_probabilities = [
  "Fair"   : 0.5,
  "Loaded" : 0.5
]

transition_probabilities = [
  "Loaded" | "Fair"   : 0.1,
  "Fair"   | "Fair"   : 0.9,
  "Fair"   | "Loaded" : 0.1,
  "Loaded" | "Loaded" : 0.9
]

emission_probabilities = [
  "1" | "Fair"   : 1.0/6,
  "2" | "Fair"   : 1.0/6,
  "3" | "Fair"   : 1.0/6,
  "4" | "Fair"   : 1.0/6,
  "5" | "Fair"   : 1.0/6,
  "6" | "Fair"   : 1.0/6,
  "1" | "Loaded" : 1.0/2,
  "2" | "Loaded" : 1.0/10,
  "3" | "Loaded" : 1.0/10,
  "4" | "Loaded" : 1.0/10,
  "5" | "Loaded" : 1.0/10,
  "6" | "Loaded" : 1.0/10
]

Example: GHMM for a Dice

// -*- mode: c++ -*-
// vim: ft=chaiscript:

// Dishonest Cassino

model_type = "GHMM"

observations = [ "1", "2", "3", "4", "5", "6" ]

labels = [ "Fair", "Loaded" ]

states = [
  "Fair"   : [ duration: geometric(),
               emission: model("fair_dice.tops") ],
  "Loaded" : [ duration: geometric(),
               emission: model("loaded_dice.tops") ]
]

initial_probabilities = [
  "Fair"   : 0.5,
  "Loaded" : 0.5
]

transition_probabilities = [
  "Loaded" | "Fair"   : 0.1,
  "Fair"   | "Fair"   : 0.9,
  "Fair"   | "Loaded" : 0.1,
  "Loaded" | "Loaded" : 0.9
]

Configuration files: Training

Simple models

Model name + Observations' domain + Training algorithm + Algorithm-specific parameters + Training set

Decodable models

Simple model's options + Other observations' domains + Labels' domain

The training set is a TSV file with sequences in columns

Example: IID for a Dice

// -*- mode: c++ -*-
// vim: ft=chaiscript:

model_type = "IID"

observations = [ "1", "2", "3", "4", "5", "6" ]

training_algorithm = "MaximumLikehood"

training_set = dataset("data/cassino-dice.tsv")

Example: HMM for a Dice

// -*- mode: c++ -*-
// vim: ft=chaiscript:

model_type = "HMM"

observations = [ "1", "2", "3", "4", "5", "6" ]

training_algorithm = "BaumWelch"

training_set = dataset("data/cassino-dice.tsv")

initial_model = pretrained_model("hmm.tops")

maximum_iterations = 100

diff_threshold = 1.23

Example: GHMM for a Dice

// -*- mode: c++ -*-
// vim: ft=chaiscript:

model_type = "GHMM"

observations = [ "1", "2", "3", "4", "5", "6" ]

training_algorithm = "MaximumLikehood"

training_set = dataset("data/cassino-dice.tsv")

labels = [ "Fair", "Loaded" ]

states = [
  "Fair"   : [ duration: geometric(),
               emission: pretrained_model("fair_dice.tops") ],
  "Loaded" : [ duration: fixed(7),
               emission: untrained_model("loaded_dice.tops") ]
]

pseudo_counter = 1.23

Config

tops::config

Exception

tops::exception

Apps
