Machine Learning
Tools and Practices...
and Concepts...
and whatever I'll have time for in 25 minutes
So, what is Machine Learning?
First, you have data
... it's more like this, in fact
And then, you have a model that discriminates
OK, then how is Machine Learning done?
Conceptually
- Clean data
- Clean data
- and ... clean data
- ...
Conceptually
- ...
- Ask domain experts to help develop new features
- Make your model (a simple one first)
- Profit OR get a better model/new data + clean data
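The steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy dataset, the column names, and the "bmi" feature are all made up here and stand in for real data and real expert input.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical toy dataset, generated here only for illustration.
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "height": rng.normal(170, 10, 200),
    "weight": rng.normal(70, 8, 200),
})
raw["label"] = (raw["height"] + raw["weight"] > 240).astype(int)
raw.loc[::25, "weight"] = np.nan  # simulate missing values

# Step 1: clean data (drop duplicates, fill gaps with the column median).
clean = raw.drop_duplicates().copy()
clean["weight"] = clean["weight"].fillna(clean["weight"].median())

# Step 2: a feature a domain expert might suggest (assumed for this sketch).
clean["bmi"] = clean["weight"] / (clean["height"] / 100) ** 2

# Step 3: start with a simple model and measure it before trying anything fancier.
labels = clean.pop("label").values
features = clean.values
baseline = LogisticRegression(max_iter=5000)
scores = cross_val_score(baseline, features, labels, cv=5)
print(f"baseline accuracy: {scores.mean():.2f}")
```

If the baseline score is not good enough, the loop restarts: get a better model or new data, and clean the data again.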
And now the meat!!
The Holy Quaternity
Pandas
import pandas as pd
data = pd.read_csv("dataset/somedataset.csv")  # or a lot of other file formats
no_nans_data = data.fillna(0)  # replace missing values with 0
# some_data_transformation is a placeholder for your own feature-engineering function
no_nans_data["new_feature"] = some_data_transformation(no_nans_data["old_feature1"],
                                                       no_nans_data["old_feature2"])
# ... some more data wrangling here
final_data = no_nans_data  # assumption: final_data is the frame left after the wrangling
labels = final_data.pop("label").values  # pop removes the label column in place
features = final_data.values
# and a lot of other stuff, like pivot tables, sampling, plotting
# think of Pandas as Excel with a Python API... and only a Python API
# Pandas is used in the initial phase of the machine learning workflow
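A minimal sketch of two of the "other stuff" items, pivot tables and sampling. The DataFrame and its column names are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical sales records, made up for illustration.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "year": [2016, 2017, 2016, 2017],
    "sales": [10, 12, 7, 9],
})

# Pivot table: one row per city, one column per year, summed sales in the cells.
table = df.pivot_table(index="city", columns="year", values="sales", aggfunc="sum")

# Sampling: draw 2 random rows, reproducibly.
subset = df.sample(n=2, random_state=0)

print(table)
print(subset)
```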
Scikit-Learn
import numpy as np
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
pipeline = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
# Fit PCA on its own to get the explained variance for the plot
pca.fit(features)
# Candidate hyperparameter values for the grid search
n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 3)
# Parameters of pipelines can be set using '__' separated parameter names:
estimator = GridSearchCV(pipeline,
                         dict(pca__n_components=n_components,
                              logistic__C=Cs))
estimator.fit(features, labels)
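To make the '__' naming concrete, here is a small self-contained sketch. It assumes a slice of scikit-learn's bundled digits dataset as stand-in data and uses a smaller grid than the slide so that it runs quickly; neither choice comes from the talk itself.

```python
from sklearn import datasets, decomposition, linear_model
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Stand-in data (an assumption): the first 300 samples of the digits dataset.
X, y = datasets.load_digits(return_X_y=True)
X, y = X[:300], y[:300]

pipeline = Pipeline(steps=[
    ("pca", decomposition.PCA()),
    ("logistic", linear_model.LogisticRegression(max_iter=500)),
])

# "<step name>__<parameter>" reaches into a named pipeline step.
pipeline.set_params(pca__n_components=10)
print(pipeline.named_steps["pca"].n_components)

# The same naming scheme defines the search grid.
param_grid = {"pca__n_components": [5, 15], "logistic__C": [0.1, 1.0]}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)  # the winning parameter combination
print(search.best_score_)   # its mean cross-validated accuracy
```

After fitting, `best_params_` and `best_score_` report which combination won and how well it scored.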
Matplotlib
import matplotlib.pyplot as plt
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')
# Mark the number of components picked by the grid search
best_pca = estimator.best_estimator_.named_steps['pca']
plt.axvline(best_pca.n_components,
            linestyle=':',
            label='n_components chosen')
plt.legend(prop=dict(size=12))
plt.show()
NumPy (sometimes)
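A short sketch of where NumPy tends to show up in this workflow: building parameter grids and doing array math on the values pulled out of Pandas. The small `features` array below is hypothetical and stands in for `final_data.values`.

```python
import numpy as np

# The regularization grid from the Scikit-Learn slide: 3 points from 10**-4 to 10**4.
Cs = np.logspace(-4, 4, 3)

# Hypothetical feature matrix; DataFrame.values returns an ndarray like this one.
features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Column-wise standardization via broadcasting, with no explicit loops.
standardized = (features - features.mean(axis=0)) / features.std(axis=0)

print(Cs)
print(standardized)
```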
Thank you!
questions?
Links to get you started
- pandas.pydata.org/pandas-docs/stable/10min.html
- scikit-learn.org/stable/tutorial/index.html
- matplotlib.org/users/pyplot_tutorial.html
Machine Learning
By Alexandru Burlacu