Bi-Weekly Presentation
Comparer Tool
Why?
- How can we distinguish online users from one another?
- Within a specified audience there exists clusters or segments of users who behave in a certain way
- If we can determine what distinguishes some segments from one another using clickstream history, we can more accurately make marketing decisions
What Do We Want?
- Quick, simple, automated way of classifying, distinguishing, and interpreting how users are behaving on the web.
-
Because this is clickstream data there can be a lot of features to deal with and wrap our heads around.
Chi-square:
- Univariate
- No dependent variables to define and you cannot make predictions.
Logistic Regression
Dealing with a very sparse and high dimensional data set, how do we find a small number of features that separate an audience?
- Lasso
- Interpretable
- Reproducible (Utility)

Regularization
- Selecting K parameters;
- Tuning to return desired number of interpretable coefficients.
Classification
Domain visits, demographics, impressions, attributes, category, etc. for ebay shoppers and amazon shoppers
Model
Domain visits, demographics, impressions, attributes, category (without segment type)
Model
Amazon shopper
or
Ebay shopper
Lasso (least absolute shrinkage and selection operator)
- Simplification for easier interpretability
- Quick
- Reduction of overfitting
Feature Selection
Regularization
- Applying a penalty to increasing the magnitude of parameter values to reduce overfitting.
- Penalize large values of the parameters.


Output
- Within the constraints of the generalization value, the output will contain the most distinguishing features.
- leaves just the number of features we consider to be meaningful
- It will also have understandable summary values such as a confusion matrix and AUC score.
Python
Python 3 (its time)
Scikit-Learn - machine learning
SciPy - scientific computing
OpenPyXl - read and write Excel files

What?
Prepare > Setup > Run > Analyze

| Person_id | Domain | url_count | Segment |
|---|---|---|---|
| 4321 | etsy.com | 4 | amazon |
| 4321 | reuters.com | 1 | amazon |
| 4321 | slate.com | 2 | amazon |
| 5442 | yahoo.com | 1 | ebay |
| 5442 | cragslist.com | 10 | ebay |
Prepare > Setup > Run > Analyze
Metric
Value
Gather and Prepare Data
Setup and Installation
Possible sources to get the code:
- Clone repo from BitBucket
- Download file structure from shared zip file
- Hopefully soon: Pip install from git link
Prepare > Setup > Run > Analyze
One inside the pkg/directory:
>> python setup.py install
Load File and Run
import pandas as pd
from comparer import *
filepath = 'lays_comparer_input.csv'
data = pd.read_csv(filepath, low_memory=False)
data.segment = data.segment.astype(str)
lays = Comparer(data, 'domain_name', 'views', 'segment', 300)
lays.export_report("/Users/patrick.nieto/Documents/lays_comparer_output")
Prepare > Setup > Run > Analyze
Example script:
Once installed, all need to do is:
- Import the comparer class
- load in your csv file using pandas
- reference the necessary columns as Comparer parameters
- And specify how many coefficients you need
>> python comparey.py "/Users/Patrick.nieto/lays/lays_comparer_input.csvPrepare > Setup > Run > Analyze
from __future__ import division
import pandas as pd
import numpy as np
import os
import openpyxl
from openpyxl.styles import Font
from sklearn.feature_extraction import DictVectorizer
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize
from sklearn import metrics
from itertools import combinations
from scipy.sparse import csr_matrix
from collections import Counter
import pickle
import sys
from .logger import logger
class Comparer(object):
def __init__(self, df, metric, value, label, coefficients=20):
self.df = df
self.metric = metric
self.value = value
self.label = label
self.coefficients = coefficients
self.results = None
self.pair_wise = None
self.__handle_data()
self.__fit()
self.__summary_lookups()
def __summary_lookups(self):
logs = logger.logging_setup('Summary_lookups')
logs.info('creating dictionary of segment totals...')
self.segment_totals = self.df.groupby(self.label)["person_id"].nunique().to_dict()
logs.info('creating dictionary of domain_totals...')
self.url_counts = (self.df
.groupby([self.metric, self.label])['person_id']
.nunique()
.unstack(self.label)
.fillna(0)
.to_dict('index'))
if 'subcategory' in self.df.columns:
logs.info('creating dictionary of domain categories...')
self.categories = {k: v[['subcategory', 'category']].iloc[0] for k, v in self.df.groupby(self.metric)}
def __handle_data(self):
self.__drop_nulls()
self.__verify_types()
def __verify_types(self):
# Force variable name to be a string
self.df[self.metric] = self.df[self.metric].astype(str)
# Force segment to be a string
self.df[self.label] = self.df[self.label].astype(str)
def __drop_nulls(self):
if 'category' in self.df.columns:
self.df['category'].fillna('None', inplace=True)
if 'subcategory' in self.df.columns:
self.df['subcategory'].fillna('None', inplace=True)
self.df.dropna(inplace=True)
def __fit(self):
"""
params
"""
logs = logger.logging_setup('Model_fit')
df = self.df
coefficients = self.coefficients
results = dict()
subsets = self.subset_data_extract_vectors(df)
for k, i in subsets: # iterate over new dictionary to run logistic regression
# Vectorize list of dictionaries and apply logs.function to matrix
v = DictVectorizer(sparse=True)
x = v.fit_transform(i[0])
x = csr_matrix.log1p(x)
x = normalize(x, copy=False)
# Change all y values to 1's and 0's
var1 = k.split('_')[0]
y = [1 if _ == var1 else 0 for _ in i[2]]
# split features, labels, and person id's respectively into train and test
x_train, x_test, \
y_train, y_test, \
p_train, p_test = train_test_split(x, y, i[1], test_size=0.33, random_state=42)
j = [1]
count = 0
size = list()
circle = False
switch = None
while k not in results:
c = j[count]
clf = LogisticRegression(penalty='l1', C=c,
fit_intercept=False, class_weight="balanced")
clf.fit(x_train, y_train)
n = len([_ for _ in clf.coef_[0, :] if abs(_) > 0])
size.append(n)
logs.info('Fit #{}, C={}, now at {} coefs'.format(count, c, n))
temp = [clf, size[count]]
if count > 100:
results[k] = v, temp[0], temp[1], x_test, y_test
break
if coefficients*.95 <= n <= coefficients*1.05:
results[k] = v, temp[0], temp[1], x_test, y_test
break
if len(size) > 1:
current = size[-1]
previous = size[-2]
if ((current >= coefficients >= previous) or
(current <= coefficients <= previous)):
circle = True
switch = j[-2]
if circle:
new_c = (j[-1] + switch) / 2
j.append(new_c)
else:
if size[count] >= coefficients*1.1:
j.append(j[count]/2)
elif size[count] <= coefficients*.9:
j.append(j[count]*2)
else:
results[k] = v, temp[0], temp[1], x_test, y_test
elif size[count] >= coefficients*1.1:
j.append(j[count]/2)
elif size[count] <= coefficients*.9:
j.append(j[count]*2)
else:
results[k] = v, temp[0], temp[1], x_test, y_test
break
count += 1
self.results = results
def subset_data_extract_vectors(self, df):
"""
Subset data-frame for every combination of segments.
Organize features labels and people
:param df: data frame
:return: dictionary. key = 1 segment comparison
value = features, labels, people
"""
logs = logger.logging_setup('Subsets')
# iterate over all possible combinations of segments. Order doesnt matter.
possible_keys = sorted(df[self.label].unique())
if len(possible_keys) <= 1:
logs.critical("There must be more than one segment type to run comparer")
sys.exit()
for i in combinations(possible_keys, 2):
# create lists to hold features, labels and people
feature_vec = list()
y = list()
person_vec = list()
# subset dataframe for just the current segments in the loop and fill empty lists.
pair = df[(df[self.label] == i[0]) | (df[self.label] == i[1])]
grouped = pair.groupby(["person_id"]) # group on people
for key, value in grouped:
y.append(list(value[self.label])[0])
person_vec.append((key, value[self.metric].unique()))
length = len(value[self.value])
k = pd.Series(np.ones(length))
k.index = value[self.metric]
feature_vec.append(k.to_dict())
# create dict for every segment comparison with features, people and labels
logs.info('subsetting for {}_vs_{}...'.format(i[0], i[1]))
logs.info('length of y\'s: {}'.format(Counter(y)))
key_name = '{}_vs_{}'.format(i[0], i[1])
# pair_wise[key_name] = feature_vec, person_vec, y
yield key_name, [feature_vec, person_vec, y]
def report(self):
logs = logger.logging_setup('Report')
if self.results is None:
raise Exception("Something went wrong!")
models = sorted(list(self.results.keys()))
output = list()
for i, key in enumerate(models):
non_zero_params = self.results[key][2]
top_coefs = self.get_top_coefs(key)
tp = ['"{}": {}'.format(x[0], round(x[1], 2)) for x in top_coefs]
top = ", ".join(tp)
auc = self.get_auc(key)
row = "{:^10}: {:>10} non-zero coefs; AUC: {}, top 3 features are [{}]".format(
key, non_zero_params, auc, top
)
output.append(row)
return output
def get_top_coefs(self, key, n = 3):
"""
Funtion to return the top n coefficients, sorted by absolute value.
:param key: Model to investigate, identified by e.g. '1_vs_2'
:param n: The number of coefficients to return.
:return:
"""
logs = logger.logging_setup('Top_coefs')
seg1 = key.split('_')[0]
seg2 = key.split('_')[-1]
# extract feature names
feature_names = self.results[key][0].get_feature_names()
# totals = [self.segment_totals[x] for x in feature_names]
# number_visits = [self.url_counts[x] for x in feature_names]
seg1_totals = self.segment_totals[seg1]
seg2_totals = self.segment_totals[seg2]
seg1_person_count = [seg1_totals]*len(feature_names)
seg2_person_count = [seg2_totals]*len(feature_names)
domain1_totals = [self.url_counts[x][seg1] for x in feature_names]
domain2_totals = [self.url_counts[x][seg2] for x in feature_names]
person_freq1 = [y/x for x, y in zip(seg1_person_count, domain1_totals)]
person_freq2 = [y/x for x, y in zip(seg2_person_count, domain2_totals)]
if 'categories' in self.__dict__:
sub_cats = [self.categories[x]['subcategory'] for x in feature_names]
reg_cats = [self.categories[x]['category'] for x in feature_names]
else:
sub_cats = ['' for _ in feature_names]
reg_cats = ['' for _ in feature_names]
coefs = self.results[key][1].coef_
# zip together lists for data frame columns
logs.info("Zipping together lists for dataframe...")
results = zip(feature_names, coefs[0, :], domain1_totals, seg1_person_count, person_freq1,
domain2_totals, seg2_person_count, person_freq2, sub_cats, reg_cats)
logs.info("Trying to sort")
top_coefs_0 = sorted(results, key=lambda x: x[1], reverse=True)
top_coefs_0 = [x for x in top_coefs_0 if abs(x[1]) > 0]
return top_coefs_0[:n]
def get_confusion_matrix(self, key):
logs = logger.logging_setup('Confusion_matrix')
# predict y's and calculate confusion matrix
x_test = self.results[key][3]
y_test = self.results[key][4]
y_preds = self.results[key][1].predict(x_test)
cm = metrics.confusion_matrix(y_test, y_preds)
# plt.figure()
# plot_confusion_matrix(cm, 'Confusion matrix')
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
# plt.figure()
# plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')
return cm, cm_normalized
def get_auc(self, key):
"""
calculate auc against testing
:param key:
:return:
"""
logs = logger.logging_setup('AUC')
x_test = self.results[key][3]
y_test = self.results[key][4]
pos_label = self.results[key][1].classes_.max()
preds = self.results[key][1].predict_proba(x_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, preds, pos_label=pos_label)
auc = metrics.auc(fpr, tpr)
return auc
def export_model(self, key, directory):
"""
Function to save out:
1) The DictVectorizer to a pickled object for re-use.
2) The model (clf) as a picked object for re-use.
3) An Excel file, containing two tabs:
- Summary: More scores
- AUC
- Accuracy
- Confusion Table
- Coefficients: A table of non-zero coefficients.
:return:
"""
logs = logger.logging_setup('Model_export')
temp = list()
if not os.path.exists(directory):
os.makedirs(directory)
if self.results is None:
raise Exception("Something went wrong!")
logs.info("pickling vectorized dictionary...")
filename = '{}-DictVectorizor.pk'.format(key)
with open(os.path.join(directory, filename), 'wb') as handle:
pickle.dump(self.results[key][0], handle)
logs.info("pickling the model (clf)...")
filename = '{}-model.clf'.format(key)
with open(os.path.join(directory, filename), 'wb') as handle:
pickle.dump(self.results[key][1], handle)
logs.info("creating Excel file with summary scores...")
pdwriter = pd.ExcelWriter('{}/results.xlsx'.format(directory), engine="openpyxl")
for key in self.results.keys():
temp.append([key, self.get_top_coefs(key, 1)[0][0], self.get_top_coefs(key, 1)[0][1]])
data = pd.DataFrame(temp, columns=['seg', 'feature', 'value'])
with pdwriter as the_file:
data.to_excel(the_file, sheet_name="model_summary", index=False)
wb = openpyxl.load_workbook('results.xlsx')
bold_style = Font(bold=True)
italic_style = Font(italic=True)
summary = wb["model_summary"]
summary["A1"].font = bold_style
summary["B1"].font = bold_style
summary["C1"].font = bold_style
for i in range(2, 12):
summary["A{}".format(i)].font = italic_style
def export_report(self, directory, filename = "comparison_results_general"):
"""
A function to produce an Excel file containing one tab for each comparison, including
1) A table of the non-zero coefficients.
2) A summary table to the side containing the confusion matrix
3) The AUC score
:return
:
"""
logs = logger.logging_setup('Report_export')
if not os.path.exists(directory):
os.makedirs(directory)
logs.info("creating Excel file with summary scores...")
file_path = '{}/{}.xlsx'.format(directory, filename)
pdwriter = pd.ExcelWriter(file_path, engine="openpyxl")
for key in self.results.keys():
var1 = key.split('_')[0]
var2 = key.split('_')[-1]
data = pd.DataFrame(self.get_top_coefs(key, -1), columns=['feature', 'coefficient',
'{} person count'.format(var1),
'{} visit total'.format(var1),
'{} person frequency'.format(var1),
'{} person count'.format(var2),
'{} visit total'.format(var2),
'{} person frequency'.format(var2),
'subcategory', 'category'])
cm = self.get_confusion_matrix(key)[1] # 1 is for normalized matrix
auc = self.get_auc(key)
table = pd.DataFrame(cm, columns=[var2, var1], index=[var2, var1])
table2 = pd.DataFrame([auc], ['auc'], columns=None)
sheet_name = "{}".format(key)
with pdwriter as the_file:
data.to_excel(the_file, sheet_name=sheet_name, index=False, startrow=0)
# data2.to_excel(the_file, sheet_name=sheet_name, index=False, startcol=4, startrow=2)
table.to_excel(the_file, sheet_name=sheet_name, startcol=12, startrow=7)
table2.to_excel(the_file, sheet_name=sheet_name, startcol=12, startrow=11)
def execute_simple(filename):
logs = logger.logging_setup('Execute_simple')
data = pd.read_csv(filename)
required_fields = {'domain_name', 'person_id', 'url_count', 'segment', 'category', 'subcategory'}
missing = required_fields.difference(set(data.columns))
if len(missing) > 0:
missing_list = ", ".join(missing)
logs.critical("Missing required fields. We require: {}".format(missing_list))
sys.exit(0)
c = Comparer(data, 'domain_name', 'url_count', 'segment', 300)
c.export_report("output")
if __name__ == "__main__":
logs = logger.logging_setup('root')
if len(sys.argv) > 1:
filename = sys.argv[1]
execute_simple(filename)
else:
logs.critical("Please pass a filename to the script to run with default column names.")
sys.exit(0)
Future Considerations
- Optimizing based on other parameters besides number of coefficients (Regularization Penalty value)
- Incorporating a python wrapper for the initial data pull
deck
By Patrick Nieto
deck
- 332