Learning to classify a subject-line quality for email marketing using Data Mining techniques
MASTER IN INFORMATICS AND COMPUTING ENGINEERING


Supervisor: Vera Lucia Miguéis Oliveira e Silva
Second supervisor: Ivo Pereira
Maria João dos Santos Aguiar e Mira Paulo
Background
EMAIL MARKETING
One of the most preferred methods of contact by firms.
- Easy recovery of investments.
- Supports an easy way of measuring campaign success.
- The most cost-effective way of reaching customers.
Problem
OPEN RATES
EMAIL OVERLOAD
MARKETING STRATEGIES FAIL
Customers have neither the time nor the interest to read all the content they receive.

Goal
The only factors recipients see at first glance when opening the mailbox are:
SENDER NAME & SUBJECT-LINE
Using Data Mining techniques, develop a model capable of classifying the quality of a given subject-line on a scale from 1 to 5 stars.

Related Work
Literature Review
- Studies using Data Mining techniques are restricted to a limited set of techniques;
- Studies using secondary data do not evaluate the impact of using personalized messages, nor the impact of the country and business sector to which the campaign is sent;
- Secondary studies are not applied to specific business problems.
Related Tools
- Several websites seek to help customers choose the best subject-line, but they contain merely descriptive guidelines;
- MailChimp, the market leader in marketing automation, offers an insufficient tool;
- Solutions in the market continue to be scarce and easily become outdated.
Implementation
Goals
1. To predict subject quality taking into account not only structural but also content features, which could impact the overall subject quality.
2. To compare the performance of distinct data mining techniques.
3. To increase E-goi customers' engagement with the E-goi platform through an innovative and helpful service capable of analyzing the quality of a subject-line.
DATA RESTRICTIONS
- Comprised of 140,000 email campaigns;
- Sent to at least 100 subscribers;
- Sent more than one week before data collection;
- PT, EN or ES languages.
VARIABLES
- Subject-line;
- Open Rate;
- Country;
- Sector.
Data Understanding
Most relevant languages
Classes Creation

Equal width binning technique.
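As an illustration of the equal-width binning named above, the sketch below maps an open rate to one of five star classes. It assumes the open rate is a fraction and that the binning range is fixed at 0 to 1; the actual range used in the thesis is not stated here, and the function name, parameters and sample values are hypothetical.

```python
def equal_width_bins(values, n_bins=5, lo=0.0, hi=1.0):
    """Assign each value to one of n_bins equal-width intervals, labelled 1..n_bins.

    Assumption: the range [lo, hi] is fixed up front (0 to 1 for open rates).
    """
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        # A value exactly at the top edge falls into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        labels.append(idx + 1)
    return labels


# Hypothetical open rates, not taken from the thesis data.
open_rates = [0.0, 0.05, 0.19, 0.21, 0.45, 0.61, 0.99, 1.0]
stars = equal_width_bins(open_rates)
print(stars)
```

With equal-width bins every class covers the same open-rate interval, so the number of campaigns per class can be very uneven when the open-rate distribution is skewed.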
Open Rate Distribution
Feature Construction
SUBJECT LINE
- STRUCTURAL FEATURES
- CONTENT FEATURES: PAST PERFORMANCE or BAG OF WORDS
Structural Features
FEATURES
- Number of words
- Number of characters
- Upper Case Percentage
- Punctuation
- Prefixes
- Emojis
- Personalization
- Special characters
- Numbers
- Currency
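A minimal sketch of how most of the structural features listed above could be computed from a raw subject-line. The prefix list, the `{{name}}` personalization tag syntax, the emoji code-point ranges and the currency symbols are all assumptions made for illustration, not the definitions used in the thesis.

```python
import re
import string

CURRENCY = set("€$£")                          # assumed currency symbols
PREFIXES = ("re:", "fw:", "fwd:")              # assumed prefix list
MERGE_TAG = re.compile(r"\{\{\s*\w+\s*\}\}")   # hypothetical personalization tag


def is_emoji(ch):
    """Rough emoji test based on two common Unicode ranges (an approximation)."""
    cp = ord(ch)
    return 0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF


def structural_features(subject):
    letters = [c for c in subject if c.isalpha()]
    upper = sum(1 for c in letters if c.isupper())
    return {
        "n_words": len(subject.split()),
        "n_chars": len(subject),
        "upper_pct": 100.0 * upper / len(letters) if letters else 0.0,
        "n_punct": sum(1 for c in subject
                       if c in string.punctuation and c not in CURRENCY),
        "has_prefix": subject.lower().lstrip().startswith(PREFIXES),
        "n_emojis": sum(1 for c in subject if is_emoji(c)),
        "personalized": bool(MERGE_TAG.search(subject)),
        "n_numbers": len(re.findall(r"\d+", subject)),
        "has_currency": any(c in CURRENCY for c in subject),
    }


# Hypothetical subject-line; \U0001F389 is the party-popper emoji.
example = structural_features(
    "RE: SALE today, 50% off for {{name}} \U0001F389 only 10€!"
)
print(example)
```

Each feature is a plain count, percentage or flag, so the resulting dictionary can be joined directly with the Country and Sector variables.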
Content Features
1st Approach: Past Performance
FEATURES
- Number of lemmas
- Lemmas Past Performance
  - Average
  - Weighted Average
  - Maximum
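One possible reading of the past-performance features, sketched under stated assumptions: each lemma's past performance is taken as the mean open rate of earlier campaigns containing it, and the weighted average weights each lemma by the number of such campaigns. The slides do not define these quantities, so the definitions, function names and data below are hypothetical. Lemmatization is assumed to have been done beforehand.

```python
from collections import defaultdict


def build_lemma_history(campaigns):
    """campaigns: iterable of (lemmas, open_rate).

    Returns lemma -> (mean open rate, number of past campaigns).
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for lemmas, rate in campaigns:
        for lemma in set(lemmas):
            sums[lemma] += rate
            counts[lemma] += 1
    return {lemma: (sums[lemma] / counts[lemma], counts[lemma]) for lemma in sums}


def past_performance_features(lemmas, history):
    unique = set(lemmas)
    known = [history[lemma] for lemma in unique if lemma in history]
    features = {"n_lemmas": len(unique), "average": None,
                "weighted_average": None, "maximum": None}
    if known:
        means = [mean for mean, _ in known]
        total = sum(count for _, count in known)
        features["average"] = sum(means) / len(means)
        # Assumed weighting: lemmas seen in more campaigns count more.
        features["weighted_average"] = sum(m * c for m, c in known) / total
        features["maximum"] = max(means)
    return features


# Hypothetical past campaigns as (lemmas, open rate).
past = [(["sale", "summer"], 0.25), (["sale", "offer"], 0.75), (["newsletter"], 0.5)]
history = build_lemma_history(past)
feats = past_performance_features(["summer", "sale", "new"], history)
print(feats)
```

Lemmas never seen before (here "new") are simply ignored; how unseen lemmas are handled in the actual model is not specified in the slides.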
Content Features
2nd Approach: Bag of Words
FEATURES
- List of TF-IDF (TERM FREQUENCY–INVERSE DOCUMENT FREQUENCY) values (relevance)
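To make the TF-IDF values concrete, here is a small sketch using the plain textbook variant: term frequency is the term count divided by the document length, and inverse document frequency is the natural log of the number of documents over the number of documents containing the term. Libraries often apply smoothed or normalized variants, and the slides do not say which one was used, so treat this formula and the sample subject-lines as assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: tf-idf} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical tokenized subject-lines.
docs = [["summer", "sale", "today"], ["winter", "sale"], ["newsletter", "today"]]
vectors = tf_idf(docs)
print(vectors[0])
```

A term that appears in few subject-lines ("summer") gets a higher value than one that appears in many ("sale"), which is the sense of relevance the slide refers to.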

Data Exploration
Feature: Number of words
Feature: Currency
Feature: Lemmas Past Performance
Modeling
5 DISTINCT EXPERIMENTS
1. Sector + Country + Structural Features
2. Sector + Country + Structural Features + Past Performance Features
   2.1 Weighted Average
   2.2 Average
   2.3 Maximum
3. Sector + Country + Structural Features + Bag of Words Features
6 ALGORITHMS
- Naive Bayes
- Support Vector Machine
- Random Forest
- Decision Tree
- Gradient Boosting
- Neural Network
Nested Cross Validation
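A compact sketch of the nested cross-validation idea: an inner loop picks the hyperparameter using only the outer-training data, and an outer loop scores the tuned model on data never used for tuning. The toy 1-D nearest-neighbour model, the fold counts, the parameter grid and the synthetic data are all stand-ins chosen for illustration; none of them come from the thesis, and no stratification is applied.

```python
import random


def k_fold(indices, k):
    """Split a list of indices into k disjoint folds of near-equal size."""
    return [indices[i::k] for i in range(k)]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def knn_predict(train_X, train_y, test_X, k):
    """Toy 1-D k-nearest-neighbour classifier, used only as a stand-in model."""
    preds = []
    for x in test_X:
        nearest = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - x))[:k]
        votes = [train_y[i] for i in nearest]
        preds.append(max(sorted(set(votes)), key=votes.count))
    return preds


def evaluate(X, y, train_idx, test_idx, k):
    preds = knn_predict([X[i] for i in train_idx], [y[i] for i in train_idx],
                        [X[i] for i in test_idx], k)
    return accuracy([y[i] for i in test_idx], preds)


def rest(folds, held_out):
    """All indices except those in the held-out fold."""
    return [j for f, fold in enumerate(folds) if f != held_out for j in fold]


def nested_cv(X, y, param_grid, outer_k=5, inner_k=3, seed=0):
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    outer_folds = k_fold(indices, outer_k)
    outer_scores = []
    for i, test_idx in enumerate(outer_folds):
        train_idx = rest(outer_folds, i)
        inner_folds = k_fold(train_idx, inner_k)

        def inner_score(param):
            # Inner loop: mean validation accuracy on the outer-training data only.
            scores = [evaluate(X, y, rest(inner_folds, m), val_idx, param)
                      for m, val_idx in enumerate(inner_folds)]
            return sum(scores) / len(scores)

        best = max(param_grid, key=inner_score)
        # Outer loop: score the tuned model on the held-out outer fold.
        outer_scores.append(evaluate(X, y, train_idx, test_idx, best))
    return outer_scores


# Synthetic, well-separated data (not thesis data).
X = [i / 10 for i in range(30)]
y = [0 if x < 1.5 else 1 for x in X]
scores = nested_cv(X, y, param_grid=[1, 3, 5])
print(scores)
```

Because the outer test folds never influence the hyperparameter choice, the mean of `scores` is a less optimistic performance estimate than a single cross-validation loop used for both tuning and evaluation.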

Evaluation
What is the most accurate model we can get regarding the business challenge?
ACCURACY
Percentage of correctly labeled subjects out of the total number of subjects.
F1 SCORE
Seeks a balance between Precision and Recall, avoiding false negatives and false positives, for each of the five classes.
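The two metrics can be sketched as follows. Accuracy is the share of correct labels; F1 is computed per class from Precision and Recall and then averaged. The unweighted (macro) average over the five classes is an assumption, since the slides do not say how the per-class scores are combined, and the labels below are made up.

```python
def accuracy(y_true, y_pred):
    """Share of subjects whose predicted class equals the actual class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical star classes: two subjects are confused between classes 4 and 5.
y_true = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
y_pred = [1, 2, 3, 4, 5, 1, 2, 3, 5, 4]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, labels=[1, 2, 3, 4, 5])
print(acc, f1)
```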
[Figure: Perfect Recall, Perfect Precision]
Results
| EXPERIMENT | ALGORITHM | ACCURACY | F1 SCORE |
|---|---|---|---|
| Exp. 1 | Random Forest | 60.4% | 60.2% |
| Exp. 2.1 | Random Forest | 61.7% | 62.1% |
| Exp. 2.2 | Random Forest | 62.2% | 62.4% |
| Exp. 2.3 | Random Forest | 61.5% | 61.8% |
| Exp. 3 | Random Forest | 60.3% | 60.6% |

[Figure: Confusion matrix, Experiment 2.2 (axes: ACTUAL, PREDICTED)]
Deployment
Conclusions
1. New agnostic tool for supporting customers in creating engaging and relevant subject-lines, contributing to more emails opened and, thus, more successful marketing campaigns.
2. First tool embedded into a marketing automation platform and prepared to adjust itself to the natural evolution of trends and patterns over time.
3. Effective prediction of the subject quality: 62.4% in terms of Accuracy and 62.2% in terms of F1 Score (RANDOM: 20%).
4. Research paper submitted to International Journal of Production Economics.
Difficulties
- Time and day of the week the email campaign was delivered;
- Sender recognition and reputation;
- Customer database quality;
- Subjectivity in judging what makes an email campaign interesting.
Thank you
Maria João Mira Paulo