ToPS

Toolkit of Probabilistic Models of Sequences

for developers

Renato Cordeiro Ferreira

2017

Architecture

Lang

Config

Model

Exception

App

Model

ToPS' main component. It holds all code for probabilistic models as an independent shared library

All probabilistic models make their calculations based on a discrete numeric alphabet

Lang

ToPS' language component.
It holds the implementation of a domain specific language (DSL)
to describe probabilistic models

ToPS' DSL is based on ChaiScript, an embedded scripting language designed to be integrated with C++
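To illustrate the embedding, here is a minimal, hypothetical sketch using ChaiScript's standard API; it is independent of ToPS' actual tops::lang code, and the names uniform and sides are made up for the example:

#include <chaiscript/chaiscript.hpp>
#include <iostream>

// A C++ function we want to call from the script
double uniform(double n) { return 1.0 / n; }

int main() {
  chaiscript::ChaiScript chai;                     // the script engine
  chai.add(chaiscript::fun(&uniform), "uniform");  // expose the function
  chai.add(chaiscript::var(6.0), "sides");         // expose a variable

  // Script code can now mix its own syntax with the C++ bindings
  auto p = chai.eval<double>("uniform(sides)");
  std::cout << p << '\n';                          // prints 0.166667
}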

Config

ToPS' auxiliary layer. It holds a C++-based intermediate representation of probabilistic models

ToPS' config structures store parameters to train and define probabilistic models

Exception

ToPS' exceptions, representing all errors that can happen during the execution of ToPS

App

ToPS' command-line applications, allowing end users to execute tasks on the probabilistic models

Model

tops::model

Probabilistic Models

Model                                   Acronym   Type
Discrete and Independent Distribution   IID       Simple
Inhomogeneous Markov Chain              IMC       Simple
Maximal Dependence Decomposition        MDD       Simple
Multiple Sequential Model               MSM       Simple
Periodic Inhomogeneous Markov Chain     PIMC      Simple
Similarity Based Sequence Weighting     SBSW      Simple
Variable Length Markov Chain            VLMC      Simple
Hidden Markov Model                     HMM       Decodable
Generalized Hidden Markov Model         GHMM      Decodable

Features

Evaluate

Calculate the probability P(x) of a sequence x given the model

Generate

Draw random sequences from the model

Train

Estimate parameters of the model from a dataset

Serialize

Save parameters of the model for later reuse

Calculate

Estimate intermediate probabilities for algorithms

Label

Find the categories of each symbol of an input sequence

Calculate and Label are available for decodable models only

Architecture (SECRETARY)

Probabilistic Models
(COMPOSITE)

Trainer

Estimates parameters of the model from a dataset

Evaluator

Calculates the probability of a sequence given the model

Generator

Draws random sequences from the model

Serializer

Saves parameters of the model for later reuse

Calculator

Estimates intermediate probabilities for algorithms

Labeler

Finds the categories of each symbol of an input sequence

One boss (the probabilistic model) delegates to six secretaries, one per behavior

SECRETARY pattern explained

Boss

  • Has multiple behaviors
  • Has multiple secretaries
  • Is used indirectly by clients
  • Holds data shared among behaviors
  • Keeps all the code that implements algorithms

Secretary

  • Represents only one behavior
  • Represents only one boss
  • Interacts directly with clients
  • Holds data used only by the behavior they represent
  • Keeps no meaningful logic, forwarding calls to its boss
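To make the division of responsibilities concrete, here is a minimal sketch of the pattern with an Evaluator secretary; the names and the toy probability logic are hypothetical and do not reproduce ToPS' actual classes:

#include <cstddef>
#include <iostream>
#include <memory>
#include <vector>

class Model;  // the boss

// Secretary for the "evaluate" behavior: holds only its own data,
// interacts directly with clients, and forwards all logic to its boss
class Evaluator {
 public:
  Evaluator(std::shared_ptr<Model> boss, std::vector<int> sequence)
      : boss_(std::move(boss)), sequence_(std::move(sequence)) {}
  double evaluateSequence(std::size_t begin, std::size_t end);

 private:
  std::shared_ptr<Model> boss_;
  std::vector<int> sequence_;  // data used only by this behavior
};

// Boss: has multiple behaviors and keeps all the algorithmic code
class Model : public std::enable_shared_from_this<Model> {
 public:
  // FACTORY METHOD: clients use the boss indirectly, via secretaries
  std::shared_ptr<Evaluator> standardEvaluator(std::vector<int> sequence) {
    return std::make_shared<Evaluator>(shared_from_this(),
                                       std::move(sequence));
  }
  double evaluateSymbol(int symbol) const { return symbol == 0 ? 0.6 : 0.4; }
};

double Evaluator::evaluateSequence(std::size_t begin, std::size_t end) {
  double p = 1.0;  // no meaningful logic here: just forward to the boss
  for (auto i = begin; i < end; ++i) p *= boss_->evaluateSymbol(sequence_[i]);
  return p;
}

int main() {
  auto model = std::make_shared<Model>();
  auto evaluator = model->standardEvaluator({0, 1, 0});
  std::cout << evaluator->evaluateSequence(0, 3) << '\n';  // 0.6 * 0.4 * 0.6
}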

SECRETARY: Class diagram

SECRETARY: Sequence diagram

Example: IID for a Coin

std::vector<Sequence> training_set = {
  Sequence { 0, 0, 0, 1, 1 },
  Sequence { 0, 0, 0, 1, 0, 0, 1, 1 },
  Sequence { 0, 0, 0, 1, 1, 0, 0 }
};

// Train
auto iid_trainer = DiscreteIIDModel::standardTrainer();

iid_trainer->add_training_set(std::move(training_set));

auto iid = iid_trainer->train(
  DiscreteIIDModel::maximum_likehood_algorithm{}, 2);  // alphabet size 2

// Evaluate
auto evaluator = iid->standardEvaluator({0, 1, 0});
evaluator->evaluateSequence(0, 2);  // probability of the subsequence from 0 to 2

// Generate
auto generator = iid->standardGenerator();
generator->drawSequence(5);  // draw a random sequence of length 5

Example: HMM for a Coin (ML)

std::vector<Labeling<Sequence>> training_set = {
  Labeling<Sequence>{ {0, 0, 0, 1, 1},          {1, 1, 1, 1, 1}          },
  Labeling<Sequence>{ {0, 0, 0, 1, 0, 0, 1, 1}, {0, 1, 1, 0, 0, 0, 1, 1} },
  Labeling<Sequence>{ {0, 0, 0, 1, 1, 0, 0},    {0, 0, 0, 0, 0, 1, 0}    }
};

// Train
auto hmm_trainer = HiddenMarkovModel::labelingTrainer();

hmm_trainer->add_training_set(std::move(training_set));

// assumed order: number of symbols, number of labels, pseudocounts
auto hmm = hmm_trainer->train(
  HiddenMarkovModel::maximum_likehood_algorithm{}, 2, 2, 0.1);

// Evaluate
auto evaluator = hmm->labelingEvaluator({ {0, 0, 0}, {1, 1, 1} });
evaluator->evaluateSequence(0, 3);  // joint probability of observations and labels

// Generate
auto generator = hmm->labelingGenerator();
generator->drawSequence(5);  // draw a random labeled sequence of length 5

Example: HMM for a Coin (EM)

std::vector<Sequence> training_set = {
  Sequence{ 0, 0, 0, 1, 1 },
  Sequence{ 0, 0, 0, 1, 0, 0, 1, 1 },
  Sequence{ 0, 0, 0, 1, 1, 0, 0 }
};

// Train
auto hmm_trainer = HiddenMarkovModel::standardTrainer();

hmm_trainer->add_training_set(std::move(training_set));

// initial_hmm is a starting model defined elsewhere; run at most
// 1000 iterations or stop at a 1e-4 difference threshold
auto hmm = hmm_trainer->train(
  HiddenMarkovModel::baum_welch_algorithm{}, initial_hmm, 1000, 1e-4);

// Evaluate
auto evaluator = hmm->labelingEvaluator({ {0, 0, 0}, {1, 1, 1} });
evaluator->evaluateSequence(0, 3);

// Generate
auto generator = hmm->labelingGenerator();
generator->drawSequence(5);

Model Hierarchy: Interfaces

tops::model::ProbabilisticModel

Root of the hierarchy, implements four secretaries: Trainer, Evaluator, Generator and Serializer

tops::model::DecodableModel

Node of the hierarchy, descends directly from ProbabilisticModel and implements all of its parent's secretaries plus two: Calculator and Labeler

tops::model::ProbabilisticModelDecorator

Node of the hierarchy, descends directly from ProbabilisticModel and adds functionality around the implementation of its parent's secretaries
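The shape of the hierarchy can be sketched in C++ declarations as below; these are illustrative only and do not reproduce ToPS' exact signatures:

#include <memory>

// Root: offers the four base secretaries via factory methods
class ProbabilisticModel {
 public:
  virtual ~ProbabilisticModel() = default;
  // standardTrainer(), standardEvaluator(...), standardGenerator()
  // and a serializer factory would be declared here
};

// Adds the two extra secretaries for models that can be decoded
class DecodableModel : public ProbabilisticModel {
 public:
  // labeler(...) and calculator(...) factories would be declared here
};

// Wraps another model to add behavior around its secretaries (DECORATOR)
class ProbabilisticModelDecorator : public ProbabilisticModel {
 protected:
  std::shared_ptr<ProbabilisticModel> decorated_;  // the wrapped model
};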

Model Hierarchy: CRTP

The curiously recurring template pattern (CRTP) is an idiom in C++ in which a class X derives from a class template instantiation using itself as template argument. [...] Some use cases for this pattern are static polymorphism and other metaprogramming techniques [...]

Wikipedia, Curiously Recurring Template Pattern

tops::model::ProbabilisticModelCRTP
tops::model::DecodableModelCRTP
tops::model::ProbabilisticModelDecoratorCRTP

They implement FACTORY METHODs for secretaries, define the virtual methods that secretaries delegate to, and host code reused between subclasses
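A minimal illustration of the idiom follows; the classes are hypothetical, not ToPS' real code:

#include <iostream>

// Base class template: knows the concrete subclass at compile time,
// so shared code can call into it without virtual dispatch
template <typename Derived>
class ModelCRTP {
 public:
  // Code reused between subclasses, delegating to the derived class
  double evaluateTwice(int symbol) {
    return 2 * static_cast<Derived*>(this)->evaluate(symbol);
  }
};

// Derives from the template instantiated with itself: the CRTP
class FairCoin : public ModelCRTP<FairCoin> {
 public:
  double evaluate(int /*symbol*/) const { return 0.5; }
};

int main() {
  FairCoin coin;
  std::cout << coin.evaluateTwice(0) << '\n';  // prints 1
}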

Implementation details

  • All probabilistic models use an alias Symbol, defined as an unsigned int, to make their calculations
  • Algorithms make calculations with a class Probability, which uses operator overloading to calculate probabilities in log-space while keeping the sum / multiplication syntax (see the sketch after this list)
  • The alias Sequence is a std::vector<Symbol> and the alias Matrix is a std::vector<std::vector<Symbol>>
  • Most secretaries have two subtypes: simple and cached. The latter pre-calculates results to answer queries in O(1)
  • Other hierarchies (Duration, State, etc.) also use CRTP
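A minimal sketch of the log-space idea behind Probability; it is illustrative only, and ToPS' actual class is more complete:

#include <algorithm>
#include <cmath>
#include <iostream>

// Stores log(p) internally; operator overloading keeps the usual
// sum / multiplication syntax while avoiding numerical underflow
class Probability {
 public:
  explicit Probability(double p) : log_p_(std::log(p)) {}

  // p * q in probability space is log(p) + log(q) in log-space
  Probability operator*(const Probability& rhs) const {
    return fromLog(log_p_ + rhs.log_p_);
  }

  // p + q uses the log-sum-exp trick for numerical stability
  Probability operator+(const Probability& rhs) const {
    double m = std::max(log_p_, rhs.log_p_);
    return fromLog(m + std::log(std::exp(log_p_ - m)
                              + std::exp(rhs.log_p_ - m)));
  }

  double value() const { return std::exp(log_p_); }

 private:
  double log_p_;
  static Probability fromLog(double log_p) {
    Probability p(1.0);
    p.log_p_ = log_p;
    return p;
  }
};

int main() {
  Probability p(0.5), q(0.25);
  std::cout << (p * q).value() << '\n';  // 0.125, computed in log-space
  std::cout << (p + q).value() << '\n';  // 0.75, via log-sum-exp
}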

Development details

  • Test coverage is collected with LCOV
  • Code style is enforced using CppLint
  • Memory errors are verified with Valgrind
  • Static errors are verified with CppCheck
  • Compilation is automated with UMake

Repository details

  • Tests are run online in TravisCI
    (configurations for the test environment are set in .travis.yml)
  • Coverage statistics are collected with Coveralls
    (statistics are collected after each test in Travis)
  • Code documentation is generated with Doxygen
    (comments are put above with Javadoc-style)
  • Contribution guidelines are listed in CONTRIBUTING.md
    (pull request instructions for peer-review are listed there)

Lang

tops::lang

In computing, configuration files, or config files, configure the parameters and initial settings for some computer programs.

Wikipedia, Configuration file

Configuration files

Definition

Represents trained parameters for a given model

Training

Represents training parameters for a given model

Configuration files: Definition

Simple models (unidimensional sequences)

Model name + Observations' domain + Model-specific parameters

Decodable models (multidimensional sequences)

Simple model's options + Other observations' domains + Labels' domain

Example: IID for a Dice

// -*- mode: c++ -*-
// vim: ft=chaiscript:

model_type = "IID"

observations = [ "1", "2", "3", "4", "5", "6" ]

emission_probabilities = [
  "1" : 1.0/6,
  "2" : 1.0/6,
  "3" : 1.0/6,
  "4" : 1.0/6,
  "5" : 1.0/6,
  "6" : 1.0/6
]

Example: HMM for a Dice

// -*- mode: c++ -*-
// vim: ft=chaiscript:

model_type = "HMM"

observations = [ "1", "2", "3", "4", "5", "6" ]

labels = [ "Fair", "Loaded" ]

initial_probabilities = [
  "Fair"   : 0.5,
  "Loaded" : 0.5
]

transition_probabilities = [
  "Loaded" | "Fair"   : 0.1,
  "Fair"   | "Fair"   : 0.9,
  "Fair"   | "Loaded" : 0.1,
  "Loaded" | "Loaded" : 0.9
]

emission_probabilities = [
  "1" | "Fair"   : 1.0/6,
  "2" | "Fair"   : 1.0/6,
  "3" | "Fair"   : 1.0/6,
  "4" | "Fair"   : 1.0/6,
  "5" | "Fair"   : 1.0/6,
  "6" | "Fair"   : 1.0/6,
  "1" | "Loaded" : 1.0/2,
  "2" | "Loaded" : 1.0/10,
  "3" | "Loaded" : 1.0/10,
  "4" | "Loaded" : 1.0/10,
  "5" | "Loaded" : 1.0/10,
  "6" | "Loaded" : 1.0/10
]

Example: GHMM for a Dice

// -*- mode: c++ -*-
// vim: ft=chaiscript:

// Dishonest Cassino

model_type = "GHMM"

observations = [ "1", "2", "3", "4", "5", "6" ]

labels = [ "Fair", "Loaded" ]

states = [
  "Fair"   : [ duration: geometric(),
               emission: model("fair_dice.tops") ],
  "Loaded" : [ duration: geometric(),
               emission: model("loaded_dice.tops") ]
]

initial_probabilities = [
  "Fair"   : 0.5,
  "Loaded" : 0.5
]

transition_probabilities = [
  "Loaded" | "Fair"   : 0.1,
  "Fair"   | "Fair"   : 0.9,
  "Fair"   | "Loaded" : 0.1,
  "Loaded" | "Loaded" : 0.9
]

Configuration files: Training

Simple models

Model name + Observations' domain + Training algorithm + Algorithm-specific parameters + Training set

Decodable models

Simple model's options + Other observations' domains + Labels' domain

The training set is a TSV file with sequences in columns

Example: IID for a Dice

// -*- mode: c++ -*-
// vim: ft=chaiscript:

model_type = "IID"

observations = [ "1", "2", "3", "4", "5", "6" ]

training_algorithm = "MaximumLikehood"

training_set = dataset("data/cassino-dice.tsv")

Example: HMM for a Dice

// -*- mode: c++ -*-
// vim: ft=chaiscript:

model_type = "HMM"

observations = [ "1", "2", "3", "4", "5", "6" ]

training_algorithm = "BaumWelch"

training_set = dataset("data/cassino-dice.tsv")

initial_model = pretrained_model("hmm.tops")

maximum_iterations = 100

diff_threshold = 1.23

Example: GHMM for a Dice

// -*- mode: c++ -*-
// vim: ft=chaiscript:

model_type = "GHMM"

observations = [ "1", "2", "3", "4", "5", "6" ]

training_algorithm = "MaximumLikehood"

training_set = dataset("data/cassino-dice.tsv")

labels = [ "Fair", "Loaded" ]

states = [
  "Fair"   : [ duration: geometric(),
               emission: pretrained_model("fair_dice.tops") ],
  "Loaded" : [ duration: fixed(7),
               emission: untrained_model("loaded_dice.tops") ]
]

pseudo_counter = 1.23

Config

tops::config

Exception

tops::exception

Apps
