Renato Cordeiro Ferreira
Scientific Programmer @ JADS | PhD Candidate @ USP | Co-founder & Coordinator @CodeLab
2017
Lang
Config
Model
Exception
App
ToPS' main component. It holds all code for probabilistic models as an independent shared library
All probabilistic models make their calculations based on a discrete numeric alphabet
ToPS' language component.
It holds the implementation of a domain specific language (DSL)
to describe probabilistic models
ToPS' DSL is based on ChaiScript, an embedded script language designed to be integrated with C++
ToPS' auxiliary layer. It holds a C++-based intermediate representation of probabilistic models
ToPS' config structures store the parameters used to train and define probabilistic models
ToPS' exceptions, representing all errors that can happen during the execution of ToPS
ToPS' command-line applications, allowing end users to execute tasks on the probabilistic models
tops::model
Probabilistic Models
Model | Acronym | Type
---|---|---
Discrete and Independent Distribution | IID | Simple
Inhomogeneous Markov Chain | IMC | Simple
Maximal Dependence Decomposition | MDD | Simple
Multiple Sequential Model | MSM | Simple
Periodic Inhomogeneous Markov Chain | PIMC | Simple
Similarity Based Sequence Weighting | SBSW | Simple
Variable Length Markov Chain | VLMC | Simple
Hidden Markov Model | HMM | Decodable
Generalized Hidden Markov Model | GHMM | Decodable
Features
Evaluate: calculate the probability P(x) of a sequence given the model
Generate: draw random sequences from the model
Train: estimate parameters of the model from a dataset
Serialize: save parameters of the model for later reuse
Calculate: estimate intermediate probabilities for algorithms (decodable models only)
Label: find the categories of each symbol of an input sequence (decodable models only)
Architecture (SECRETARY)
Probabilistic Models (COMPOSITE) act as the boss; each task is delegated to a secretary:
Trainer: estimates parameters of the model from a dataset
Evaluator: calculates the probability P(x) of a sequence given the model
Generator: draws random sequences from the model
Serializer: saves parameters of the model for later reuse
Calculator: estimates intermediate probabilities for algorithms
Labeler: finds the categories of each symbol of an input sequence
SECRETARY pattern explained
SECRETARY: Class diagram
SECRETARY: Sequence diagram
Example: IID for a Coin
std::vector<Sequence> training_set = {
Sequence { 0, 0, 0, 1, 1 },
Sequence { 0, 0, 0, 1, 0, 0, 1, 1 },
Sequence { 0, 0, 0, 1, 1, 0, 0 }
};
// Train
auto iid_trainer = DiscreteIIDModel::standardTrainer();
iid_trainer->add_training_set(std::move(training_set));
auto iid = iid_trainer->train(
DiscreteIIDModel::maximum_likehood_algorithm{}, 2);
// Evaluate
auto evaluator = iid->standardEvaluator({0, 1, 0});
evaluator->evaluateSequence(0, 2);
// Generate
auto generator = iid->standardGenerator();
generator->drawSequence(5);
Example: HMM for a Coin (ML)
std::vector<Labeling<Sequence>> training_set = {
Labeling<Sequence>{ {0, 0, 0, 1, 1}, {1, 1, 1, 1, 1} },
Labeling<Sequence>{ {0, 0, 0, 1, 0, 0, 1, 1}, {0, 1, 1, 0, 0, 0, 1, 1} },
Labeling<Sequence>{ {0, 0, 0, 1, 1, 0, 0}, {0, 0, 0, 0, 0, 1, 0} }
};
// Train
auto hmm_trainer = HiddenMarkovModel::labelingTrainer();
hmm_trainer->add_training_set(std::move(training_set));
auto hmm = hmm_trainer->train(
HiddenMarkovModel::maximum_likehood_algorithm{}, 2, 2, 0.1);
// Evaluate
auto evaluator = hmm->labelingEvaluator({ {0, 0, 0}, {1, 1, 1} });
evaluator->evaluateSequence(0, 3);
// Generate
auto generator = hmm->labelingGenerator();
generator->drawSequence(5);
Example: HMM for a Coin (EM)
std::vector<Sequence> training_set = {
Sequence{ 0, 0, 0, 1, 1 },
Sequence{ 0, 0, 0, 1, 0, 0, 1, 1 },
Sequence{ 0, 0, 0, 1, 1, 0, 0 }
};
// Train
auto hmm_trainer = HiddenMarkovModel::standardTrainer();
hmm_trainer->add_training_set(std::move(training_set));
// initial_hmm: a pre-built HMM providing the starting parameters for EM
auto hmm = hmm_trainer->train(
    HiddenMarkovModel::baum_welch_algorithm{}, initial_hmm, 1000, 1e-4);
// Evaluate
auto evaluator = hmm->labelingEvaluator({ {0, 0, 0}, {1, 1, 1} });
evaluator->evaluateSequence(0, 3);
// Generate
auto generator = hmm->labelingGenerator();
generator->drawSequence(5);
Model Hierarchy: Interfaces
tops::model::ProbabilisticModel
Root of the hierarchy, implements 4 secretaries: Trainer, Evaluator, Generator and Serializer
tops::model::DecodableModel
Node of the hierarchy, descends directly from ProbabilisticModel and implements all of its parent's secretaries plus two more: Calculator and Labeler
tops::model::ProbabilisticModelDecorator
Node of the hierarchy, descends directly from ProbabilisticModel and adds functionality around its parent's secretaries
Model Hierarchy: CRTP
The curiously recurring template pattern (CRTP) is an idiom in C++ in which a class X derives from a class template instantiation using itself as template argument. [...] Some use cases for this pattern are static polymorphism and other metaprogramming techniques [...]
Wikipedia, Curiously Recurring Template Pattern
tops::model::ProbabilisticModelCRTP
tops::model::DecodableModelCRTP
tops::model::ProbabilisticModelDecoratorCRTP
Implement FACTORY METHODs for the secretaries, define the virtual methods that secretaries delegate to, and host code reused between subclasses
Implementation details
Development details
Repository details
tops::lang
In computing, configuration files, or config files, configure the parameters and initial settings for some computer programs.
Wikipedia, Configuration file
Configuration files
Definition
Represents trained parameters for a given model
Training
Represents training parameters for a given model
Configuration files: Definition
Simple models (unidimensional sequences):
Model name + Observations' domain + Model-specific parameters
Decodable models (multidimensional sequences):
Simple model's options + Other observations' domains + Labels' domain
Example: IID for a Dice
// -*- mode: c++ -*-
// vim: ft=chaiscript:
model_type = "IID"
observations = [ "1", "2", "3", "4", "5", "6" ]
emission_probabilities = [
"1" : 1.0/6,
"2" : 1.0/6,
"3" : 1.0/6,
"4" : 1.0/6,
"5" : 1.0/6,
"6" : 1.0/6
]
Example: HMM for a Dice
// -*- mode: c++ -*-
// vim: ft=chaiscript:
model_type = "HMM"
observations = [ "1", "2", "3", "4", "5", "6" ]
labels = [ "Fair", "Loaded" ]
initial_probabilities = [
"Fair" : 0.5,
"Loaded" : 0.5
]
transition_probabilities = [
"Loaded" | "Fair" : 0.1,
"Fair" | "Fair" : 0.9,
"Fair" | "Loaded" : 0.1,
"Loaded" | "Loaded" : 0.9
]
emission_probabilities = [
"1" | "Fair" : 1.0/6,
"2" | "Fair" : 1.0/6,
"3" | "Fair" : 1.0/6,
"4" | "Fair" : 1.0/6,
"5" | "Fair" : 1.0/6,
"6" | "Fair" : 1.0/6,
"1" | "Loaded" : 1.0/2,
"2" | "Loaded" : 1.0/10,
"3" | "Loaded" : 1.0/10,
"4" | "Loaded" : 1.0/10,
"5" | "Loaded" : 1.0/10,
"6" | "Loaded" : 1.0/10
]
Example: GHMM for a Dice
// -*- mode: c++ -*-
// vim: ft=chaiscript:
// Dishonest Cassino
model_type = "GHMM"
observations = [ "1", "2", "3", "4", "5", "6" ]
labels = [ "Fair", "Loaded" ]
states = [
"Fair" : [ duration: geometric(),
emission: model("fair_dice.tops") ],
"Loaded" : [ duration: geometric(),
emission: model("loaded_dice.tops") ]
]
initial_probabilities = [
"Fair" : 0.5,
"Loaded" : 0.5
]
transition_probabilities = [
"Loaded" | "Fair" : 0.1,
"Fair" | "Fair" : 0.9,
"Fair" | "Loaded" : 0.1,
"Loaded" | "Loaded" : 0.9
]
Configuration files: Training
Simple models:
Model name + Observations' domain + Training algorithm + Algorithm-specific parameters + Training set (a TSV with sequences in columns)
Decodable models:
Simple model's options + Other observations' domains + Labels' domain
Example: IID for a Dice
// -*- mode: c++ -*-
// vim: ft=chaiscript:
model_type = "IID"
observations = [ "1", "2", "3", "4", "5", "6" ]
training_algorithm = "MaximumLikehood"
training_set = dataset("data/cassino-dice.tsv")
Example: HMM for a Dice
// -*- mode: c++ -*-
// vim: ft=chaiscript:
model_type = "HMM"
observations = [ "1", "2", "3", "4", "5", "6" ]
training_algorithm = "BaumWelch"
training_set = dataset("data/cassino-dice.tsv")
initial_model = pretrained_model("hmm.tops")
maximum_iterations = 100
diff_threshold = 1.23
Example: GHMM for a Dice
// -*- mode: c++ -*-
// vim: ft=chaiscript:
model_type = "GHMM"
observations = [ "1", "2", "3", "4", "5", "6" ]
training_algorithm = "MaximumLikehood"
training_set = dataset("data/cassino-dice.tsv")
labels = [ "Fair", "Loaded" ]
states = [
"Fair" : [ duration: geometric(),
emission: pretrained_model("fair_dice.tops") ],
"Loaded" : [ duration: fixed(7),
emission: untrained_model("loaded_dice.tops") ]
]
pseudo_counter = 1.23
tops::config
tops::exception
By Renato Cordeiro Ferreira
Presentation about ToPS (Toolkit of Probabilistic Models of Sequences) for developers.