Machine Learning
Short story about learning the machine in PHP
ARKADIUSZ KONDAS
Lead Software Architect
@ Proget Sp. z o.o.
Poland
Zend Certified Engineer
Code Craftsman
Blogger
Ultra Runner
@ ArkadiuszKondas
arkadiuszkondas.com
Zend Certified Architect
PHPERS
Agenda:
- Terminology
- Ways of learning
- Types of problems
- Example applications
Why machine learning?
Examples:
- Game AI: opponent gets stronger
- Netflix: move suggestions
- Spotify: music suggestions
- Gesture recognition
- Face recognition
- Uber: booking allocation
Why now?
The name machine learning was coined in 1959
- Data availability
- Computation power
Terminology
Machine Learning
Learning is any process by which a system improves performance from experience
Samples
a sample is an item to process (e.g. classify). It can be a document, a picture, a sound, a video, a row in database or CSV file, or whatever you can describe with a fixed set of features.
Features
the number of features or distinct traits that can be used to describe each item in a quantitative manner
IBU: 45 (0 - 120) International Bittering Units
ALK: 4,7% (0% - 12%)
EXT: 12,0 (0 - 30) BLG, PLATO
EBC: 9 (0 - 80) European Brewery Convention
Feature vector
is an n-dimensional vector of numerical features that represent some object.
$beer = [45, 4.7, 12.0, 9];
Feature extraction
preparation of feature vector – transforms the data in the high-dimensional space to a space of fewer dimensions
$beer = [?, 7.0, 17.5, 9];
Training / Evolution set
Set of data to discover potentially predictive relationships.
Ways of learning
Supervised learning
Source: https://www.slideshare.net/Simplilearn/what-is-machine-learning-machine-learning-basics-machine-learning-algorithms-simplilearn
Supervised learning
Unsupervised learning
Source: https://www.slideshare.net/Simplilearn/what-is-machine-learning-machine-learning-basics-machine-learning-algorithms-simplilearn
Unsupervised learning
Reinforcement learning
Source: https://www.marutitech.com/businesses-reinforcement-learning/
Reinforcement learning
Types of problems
https://github.com/php-ai/php-ml
PHP-ML - Machine Learning library for PHP
Classification
5.1,3.8,1.6,0.2,setosa
4.6,3.2,1.4,0.2,setosa
5.3,3.7,1.5,0.2,setosa
5,3.3,1.4,0.2,setosa
7,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.9,3.1,4.9,1.5,versicolor
5.5,2.3,4,1.3,versicolor
5.9,3,5.1,1.8,virginica
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa
4.9,3.1,1.5,0.1,setosa
5.4,3.7,1.5,0.2,setosa
4.8,3.4,1.6,0.2,setosa
4.8,3,1.4,0.1,setosa
4.3,3,1.1,0.1,setosa
5.8,4,1.2,0.2,setosa
5.7,4.4,1.5,0.4,setosa
5.4,3.9,1.3,0.4,setosa
5.1,3.5,1.4,0.3,setosa
5.7,3.8,1.7,0.3,setosa
5.1,3.8,1.5,0.3,setosa
5.4,3.4,1.7,0.2,setosa
5.1,3.7,1.5,0.4,setosa
4.6,3.6,1,0.2,setosa
5.1,3.3,1.7,0.5,setosa
4.8,3.4,1.9,0.2,setosa
5,3,1.6,0.2,setosa
5,3.4,1.6,0.4,setosa
5.2,3.5,1.5,0.2,setosa
5.2,3.4,1.4,0.2,setosa
4.7,3.2,1.6,0.2,setosa
4.8,3.1,1.6,0.2,setosa
5.4,3.4,1.5,0.4,setosa
5.2,4.1,1.5,0.1,setosa
5.5,4.2,1.4,0.2,setosa
4.9,3.1,1.5,0.1,setosa
5,3.2,1.2,0.2,setosa
5.5,3.5,1.3,0.2,setosa
4.9,3.1,1.5,0.1,setosa
4.4,3,1.3,0.2,setosa
5.1,3.4,1.5,0.2,setosa
5,3.5,1.3,0.3,setosa
4.5,2.3,1.3,0.3,setosa
4.4,3.2,1.3,0.2,setosa
5,3.5,1.6,0.6,setosa
5.1,3.8,1.9,0.4,setosa
4.8,3,1.4,0.3,setosa
5.1,3.8,1.6,0.2,setosa
4.6,3.2,1.4,0.2,setosa
5.3,3.7,1.5,0.2,setosa
5,3.3,1.4,0.2,setosa
7,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.9,3.1,4.9,1.5,versicolor
5.5,2.3,4,1.3,versicolor
6.5,2.8,4.6,1.5,versicolor
5.7,2.8,4.5,1.3,versicolor
6.3,3.3,4.7,1.6,versicolor
4.9,2.4,3.3,1,versicolor
6.6,2.9,4.6,1.3,versicolor
5.2,2.7,3.9,1.4,versicolor
5,2,3.5,1,versicolor
5.9,3,4.2,1.5,versicolor
6,2.2,4,1,versicolor
6.1,2.9,4.7,1.4,versicolor
5.6,2.9,3.6,1.3,versicolor
6.7,3.1,4.4,1.4,versicolor
5.6,3,4.5,1.5,versicolor
5.8,2.7,4.1,1,versicolor
6.2,2.2,4.5,1.5,versicolor
5.6,2.5,3.9,1.1,versicolor
5.9,3.2,4.8,1.8,versicolor
6.1,2.8,4,1.3,versicolor
6.3,2.5,4.9,1.5,versicolor
6.1,2.8,4.7,1.2,versicolor
6.4,2.9,4.3,1.3,versicolor
6.6,3,4.4,1.4,versicolor
6.8,2.8,4.8,1.4,versicolor
6.7,3,5,1.7,versicolor
6,2.9,4.5,1.5,versicolor
5.7,2.6,3.5,1,versicolor
5.5,2.4,3.8,1.1,versicolor
5.5,2.4,3.7,1,versicolor
5.8,2.7,3.9,1.2,versicolor
6,2.7,5.1,1.6,versicolor
5.4,3,4.5,1.5,versicolor
6,3.4,4.5,1.6,versicolor
6.7,3.1,4.7,1.5,versicolor
6.3,2.3,4.4,1.3,versicolor
5.6,3,4.1,1.3,versicolor
5.5,2.5,4,1.3,versicolor
5.5,2.6,4.4,1.2,versicolor
6.1,3,4.6,1.4,versicolor
5.8,2.6,4,1.2,versicolor
5,2.3,3.3,1,versicolor
5.6,2.7,4.2,1.3,versicolor
5.7,3,4.2,1.2,versicolor
5.7,2.9,4.2,1.3,versicolor
6.2,2.9,4.3,1.3,versicolor
5.1,2.5,3,1.1,versicolor
5.7,2.8,4.1,1.3,versicolor
6.3,3.3,6,2.5,virginica
5.8,2.7,5.1,1.9,virginica
7.1,3,5.9,2.1,virginica
6.3,2.9,5.6,1.8,virginica
6.5,3,5.8,2.2,virginica
7.6,3,6.6,2.1,virginica
4.9,2.5,4.5,1.7,virginica
7.3,2.9,6.3,1.8,virginica
6.7,2.5,5.8,1.8,virginica
7.2,3.6,6.1,2.5,virginica
6.5,3.2,5.1,2,virginica
6.4,2.7,5.3,1.9,virginica
6.8,3,5.5,2.1,virginica
5.7,2.5,5,2,virginica
5.8,2.8,5.1,2.4,virginica
6.4,3.2,5.3,2.3,virginica
6.5,3,5.5,1.8,virginica
7.7,3.8,6.7,2.2,virginica
7.7,2.6,6.9,2.3,virginica
6,2.2,5,1.5,virginica
6.9,3.2,5.7,2.3,virginica
5.6,2.8,4.9,2,virginica
7.7,2.8,6.7,2,virginica
6.3,2.7,4.9,1.8,virginica
6.7,3.3,5.7,2.1,virginica
7.2,3.2,6,1.8,virginica
6.2,2.8,4.8,1.8,virginica
6.1,3,4.9,1.8,virginica
6.4,2.8,5.6,2.1,virginica
7.2,3,5.8,1.6,virginica
7.4,2.8,6.1,1.9,virginica
7.9,3.8,6.4,2,virginica
6.4,2.8,5.6,2.2,virginica
6.3,2.8,5.1,1.5,virginica
6.1,2.6,5.6,1.4,virginica
7.7,3,6.1,2.3,virginica
6.3,3.4,5.6,2.4,virginica
6.4,3.1,5.5,1.8,virginica
6,3,4.8,1.8,virginica
6.9,3.1,5.4,2.1,virginica
6.7,3.1,5.6,2.4,virginica
6.9,3.1,5.1,2.3,virginica
5.8,2.7,5.1,1.9,virginica
6.8,3.2,5.9,2.3,virginica
6.7,3.3,5.7,2.5,virginica
6.7,3,5.2,2.3,virginica
6.3,2.5,5,1.9,virginica
6.5,3,5.2,2,virginica
6.2,3.4,5.4,2.3,virginica
5.9,3,5.1,1.8,virginica
Classification
Classification
use Phpml\Classification\KNearestNeighbors;
use Phpml\CrossValidation\RandomSplit;
use Phpml\Dataset\Demo\IrisDataset;
use Phpml\Metric\Accuracy;
$dataset = new IrisDataset();
$split = new RandomSplit($dataset);
$classifier = new KNearestNeighbors();
$classifier->train(
$split->getTrainSamples(),
$split->getTrainLabels()
);
$predicted = $classifier->predict($split->getTestSamples());
echo sprintf("Accuracy: %s",
Accuracy::score($split->getTestLabels(), $predicted)
);
Regression
Miles Price
9300 7100
10565 15500
15000 4400
15000 4400
17764 5900
57000 4600
65940 8800
73676 2000
77006 2750
93739 2550
146088 960
153260 1025
Regression
use Phpml\Regression\LeastSquares;
$samples = [[9300], [10565], [15000], [15000], [17764], [57000], [65940], [73676], [77006], [93739], [146088], [153260]];
$targets = [7100, 15500, 4400, 4400, 5900, 4600, 8800, 2000, 2750, 2550, 960, 1025];
$regression = new LeastSquares();
$regression->train($samples, $targets);
$regression->getCoefficients();
$regression->getIntercept();
$price = $regression->predict([35000]);
Clustering
338.04200965504333, 340.96099477471597
325.29926486005047, 351.5486679994454
578.4499100418299, 337.60567937295593
636.6010916396216, 337.1655634700131
463.61495372140104, 87.30781540000328
309.68095048053567, 354.79933984182196
569.0001374385495, 330.5023290963775
304.07877801351003, 382.1851806595662
346.4318094249369, 310.88359489761524
334.2663485285672, 344.5296863511734
299.4530082502729, 372.35350555098955
316.6104171289118, 375.41602001034096
591.0246508090806, 367.94797729389813
444.34991671401906, 118.45496995943586`
433.8227728530299, 71.72222842689047
444.1053454998834, 88.83183753321407
324.12063379205586, 395.6336572036508
301.51836226006776, 386.3882932443487
290.99261735355236, 344.87568081586755
581.4049866784304, 371.3100282327115
337.3816662630702, 355.6395349632366
498.01819058795814, 108.48920265724882
633.0636222044708, 326.3110287785337
626.3134372729498, 348.71913526515345
445.42859704368556, 75.27305470137435
306.81410851018825, 378.7736686118892
439.09865975254024, 76.66971977093272
568.7282712141936, 358.41194912805474
484.31765468419354, 106.46382764011003
449.07188156546613, 124.09596458236962
436.2664927059137, 65.18430339142009
317.17960432822883, 383.0054323942884
617.9759662145998, 344.4047557583218
616.3127444692847, 370.06136729095704
321.7158651152831, 362.67159106183084
346.4608886567032, 314.51981989975513
618.2526036244853, 357.0196182265154
590.7945765999675, 324.6498992628791
449.38056945433436, 118.60287536865769
472.066687997031, 80.37719565025702
294.3118847109533, 353.4865615033949
478.97062534720897, 95.70649681953358
567.7841441303771, 320.75547779107023
299.2327209106677, 352.1922118198445
292.16634675253886, 373.83696459447344
461.85741799708296, 117.44946869574665
299.07107909731684, 335.77658215111626
319.910986973631, 384.62564615284293
471.30428556038567, 106.28818985606335
559.5279075420794, 310.54351840325637
634.3642075285027, 335.5029769601599
439.8317690837323, 99.23099273365165
596.8791693350236, 381.59392224565795
604.1516435053849, 405.67736285856677
420.9873305290532, 90.75617345698856
437.59770111677204, 102.76018890480634
458.53847739928796, 82.72050379444829
608.088322572504, 376.039913501605
307.97097959822054, 324.722286003262
312.8947110796786, 342.3285548762767
277.14935633237906, 335.3623461390265
583.5560528138124, 359.803102482816
281.5020161290845, 317.9752805728433
560.5928946888674, 391.38894054488634
588.8035684901414, 318.5353923361487
411.6735472125374, 101.13213926573758
335.27072033693, 350.83097073880435
587.7683201110738, 376.82513508927434
611.1566402943419, 357.26648853430015
427.2108675492812, 123.9355948432858
299.39380762681003, 355.28019695638045
602.0009869194577, 347.296571408476
486.3169307223569, 83.6416789424083
325.289213170922, 363.1833076023479
442.75071528370154, 87.48454115478904
330.48743133797694, 379.067400358574
472.6320292955885, 90.23629371803918
587.4838699987308, 380.34546667208446
588.1129869337481, 381.2727150846652
591.2110482590026, 356.4409869262666
576.6265527643077, 365.78582049509833
593.9074313070807, 391.83088813346376
608.1987741896169, 418.608525504521
444.1151674614539, 104.33389362734681
290.6610271195223, 335.67104011710535
589.4047360621089, 334.42201557780413
434.8858913992243, 59.69518765296425
295.2734178813325, 369.4223946120387
585.5138327920143, 371.80999860054123
329.18314608591544, 367.39749113835285
600.0886391882641, 350.76731943539863
425.5700946282092, 75.53475894151018
573.7681709045746, 386.07330136408166
300.05777637102244, 372.88029897823975
448.8305033529923, 63.695405132884275
587.6650652701011, 368.9692221019065
490.8909673624537, 96.29103599971592
330.6332057590241, 304.0525581468305
588.5894529768525, 326.3580453764175
589.8695007238415, 366.44966849044806
327.5091798778813, 404.35590896990914
Clustering
Clustering
Clustering
Clustering
Clustering
use Phpml\Clustering\KMeans;
$samples = [
[1, 1], [8, 7], [1, 2],
[7, 8], [2, 1], [8, 9]
];
$kmeans = new KMeans(2);
$clusters = $kmeans->cluster($samples);
$clusters = [
[[1, 1], [1, 2], [2, 1]],
[[8, 7], [7, 8], [8, 9]]
];
Preprocessing
use Phpml\Preprocessing\Imputer;
use Phpml\Preprocessing\Imputer\Strategy\MeanStrategy;
$data = [
[1, null, 3, 4],
[4, 3, 2, 1],
[null, 6, 7, 8],
[8, 7, null, 5],
];
$imputer = new Imputer(
null, new MeanStrategy(), Imputer::AXIS_COLUMN, $data
);
$imputer->transform($data);
$data = [
[1, 5.33, 3, 4],
[4, 3, 2, 1],
[4.33, 6, 7, 8],
[8, 7, 4, 5],
];
Feature Extraction
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WhitespaceTokenizer;
$samples = [
'Lorem ipsum dolor sit amet dolor',
'Mauris placerat ipsum dolor',
'Mauris diam eros fringilla diam',
];
$vectorizer = new TokenCountVectorizer(new WhitespaceTokenizer());
$vectorizer->fit($samples);
$vectorizer->getVocabulary()
$vectorizer->transform($samples);
$tokensCounts = [
[0 => 1, 1 => 1, 2 => 2, 3 => 1, 4 => 1, 5 => 0, 6 => 0, 7 => 0, 8 => 0, 9 => 0],
[0 => 0, 1 => 1, 2 => 1, 3 => 0, 4 => 0, 5 => 1, 6 => 1, 7 => 0, 8 => 0, 9 => 0],
[0 => 0, 1 => 0, 2 => 0, 3 => 0, 4 => 0, 5 => 1, 6 => 0, 7 => 2, 8 => 1, 9 => 1],
];
Model selection
use Phpml\CrossValidation\RandomSplit;
use Phpml\CrossValidation\StratifiedRandomSplit;
use Phpml\Dataset\ArrayDataset;
$dataset = new ArrayDataset(
$samples = [[1], [2], [3], [4]],
$labels = ['a', 'a', 'b', 'b']
);
$randomSplit = new RandomSplit($dataset, 0.5);
$dataset = new ArrayDataset(
$samples = [[1], [2], [3], [4], [5], [6], [7], [8]],
$labels = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
);
$split = new StratifiedRandomSplit($dataset, 0.5);
Workflow
$transformers = [
new Imputer(null, new MostFrequentStrategy()),
new Normalizer(),
];
$estimator = new SVC();
$samples = [
[1, -1, 2],
[2, 0, null],
[null, 1, -1],
];
$targets = [
4,
1,
4,
];
$pipeline = new Pipeline($transformers, $estimator);
$pipeline->train($samples, $targets);
$predicted = $pipeline->predict([[0, 0, 0]]);
// $predicted == 4
- Feature Selection
- Dimensionality Reduction
- Datasets
- Models Management
- Neural Network
- Metric
- Association Rule Learning
- Ensemble Algorithms
- Math
https://github.com/php-ai/php-ml
Example applications
Beer Judge
https://github.com/akondas/phpcon-2016-ml/blob/master/examples/beers.php
Beer judge
ibu,alk,ext,score,name
75,6.5,16,7,"Szalony Alchemik"
28,4.2,12.5,6,"Miss Lata"
40,4,10.5,7,"Tajemniczy Jeździec"
42,5.7,14.5,6,"Dziki Samotnik"
20,2.9,7.7,4,"Dębowa Panienka"
36,5.2,12.5,5,"Piękna Nieznajoma"
28,4.8,14.0,3,"Mała Czarna"
35,4.6,12.5,7,"Nieproszony Gość"
20,5.2,12.5,8,"Ostatni sprawiedliwy"
30,4.8,12.5,6,"Dziedzic Pruski"
75,7.5,18.0,9,"Bawidamek"
45,4.7,12.0,8,"Miś Wojtek"
20,5.2,13.0,8,"The Dancer"
30,4.7,12.0,4,"The Dealer"
120,8.9,19.0,3,"The Fighter"
85,6.4,16.0,9,"The Alchemist"
100,10.3,24,4,"The Gravedigger"
40,4.8,12.0,8,"The Teacher"
75,7.0,16.0,7,"The Butcher"
80,6.7,16.0,5,"The Miner"
Beer judge
use Phpml\Classification\SVC;
use Phpml\CrossValidation\StratifiedRandomSplit;
use Phpml\Dataset\CsvDataset;
use Phpml\Metric\Accuracy;
use Phpml\SupportVectorMachine\Kernel;
$dataset = new CsvDataset('examples/beers.csv', 3);
$split = new StratifiedRandomSplit($dataset, 0.1);
$classifier = new SVC(Kernel::RBF);
$classifier->train($split->getTrainSamples(), $split->getTrainLabels());
$predicted = $classifier->predict($split->getTestSamples());
echo sprintf("Accuracy: %s\n", Accuracy::score($split->getTestLabels(), $predicted));
$newBeer = [20, 2.5, 7];
echo sprintf("New beer score: %s\n", $classifier->predict($newBeer));
Code Review Estimator
https://github.com/akondas/code-review-estimator
Summary
ML is all about the proccess
- Define a problem
- Gather your data
- Prepare your data for ML
- Select algorithm
- Train model
- Tune parameters
- Select finale model
- Validate finale model
source: http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Summary
- the most important is question, data is secondary (but also important)
- many algorithms and techniques
- application can be very simple but also extremely sophiticated
- sometimes difficult to find correct answer
- base math skills are important
PHP 7.0
206,128 instances classified in 30 seconds
(6,871 per second)
https://github.com/syntheticminds/php7-ml-comparison
Python 2.7
106,879 instances classified in 30 seconds
(3,562 per second)
NodeJS v5.11.1
245,227 instances classified in 30 seconds
(8,174 per second)
Java 8
16,809,048 instances classified in 30 seconds
(560,301 per second)
PHP 7.1
302,931 instances classified in 30 seconds
(10,098 per second)
PHP 7.2
365,568 instances classified in 30 seconds
(12,186 per second)
Where to begin?
kaggle.com
Q&A
Thanks for listening
@ ArkadiuszKondas
https://slides.com/arkadiuszkondas
https://github.com/akondas
Short story about learning the machine in PHP
By Arkadiusz Kondas
Short story about learning the machine in PHP
The main goal of Machine Learning is to create intelligent systems that can improve and acquire new knowledge through input. In practice this translates into the use of one of hundreds of different algorithms available. Based on the PHP-ML library I want to present different classes of problems and how to use them. I will also show you how to build an entire pipeline by which we go through all the ML stages: preprocessing, choosing algorithms, and evaluating its effectiveness.
- 2,264