A Comparison of Automatic Extractive Text Summarization Techniques

Alex Day, Soo Kim

Computer Information Science Department

Clarion University of Pennsylvania

Introduction
Techniques of Summarization
- Term Frequency-Inverse Document Frequency
- TextRank
Implementation
Results
Conclusions

Introduction

Motivation: 6 in 10 Americans only read the headline of a news article
Goal: To programmatically determine a summary of a document

Related Works

Hans Peter Luhn, "The automatic creation of literature abstracts"
Dipanjan Das and Andre FT Martins, "A survey on automatic text summarization"
Samir Bajaj, "Shakespeare in 100 Words"

What is a Summary?

According to Google Dictionary: "a brief statement or account of the main points of something."
Representative set of information that is, ideally, shorter than the original document
An Automatic Summary is one generated algorithmically

Types of Automatic Summary Algorithms

Abstractive
- Understand document content and produce new sentences
Extractive
- Select words/phrases already in the document

Term Frequency-Inverse Document Frequency

Assigns a numerical score to a term (word) within a document in a corpus
This score represents the importance of the term within the document

TF-IDF for Summarization

Generate an importance/relevance score for the entire sentence
Rank the sentences based on the importance scores
Higher score = More likely to be a representative sentence

tf(t, d) = f_{t,d}
idf(t, D) = \log\frac{|D|}{|\{d \in D : t \in d\}|}
tfidf(t, d, D) = tf(t, d) \times idf(t, D)
\frac{1}{|S|} \times \sum\limits_{w \in S} tfidf(w, d, D)

TF-IDF Formulas
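The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each document is a list of lowercase tokens, uses the raw count as tf, and returns an idf of 0 for a term that appears in no document. The toy corpus and the helper name `sentence_score` are hypothetical.

```python
import math
from collections import Counter


def tf(term, document):
    """Raw count of `term` in `document` (a list of tokens)."""
    return Counter(document)[term]


def idf(term, corpus):
    """log(|D| / number of documents in the corpus that contain `term`)."""
    containing = sum(1 for doc in corpus if term in doc)
    # Assumption: a term found in no document gets idf 0 (avoids dividing by zero).
    return math.log(len(corpus) / containing) if containing else 0.0


def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)


def sentence_score(sentence, document, corpus):
    """Mean TF-IDF of the words in `sentence` (a non-empty list of tokens)."""
    return sum(tf_idf(w, document, corpus) for w in sentence) / len(sentence)


# Hypothetical toy corpus of three tokenized documents.
corpus = [
    ["the", "whale", "swam"],
    ["the", "ship", "sailed"],
    ["the", "whale", "sank", "the", "ship"],
]
document = corpus[2]

common = tf_idf("the", document, corpus)   # "the" is in every document, so idf is log(1) = 0
rare = tf_idf("sank", document, corpus)    # "sank" is in one of three documents
```

A word that appears in every document scores zero no matter how often it occurs, while a word unique to one document scores highest, which is why the sentence average favors sentences with distinctive vocabulary.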

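def summarize(corpus, document):
    # `document` is a list of sentence strings; `tf_idf` is a helper defined elsewhere
    tfidf_scores = {}

    for sentence in document:
        words = sentence.split()
        if not words:
            continue  # skip empty sentences to avoid dividing by zero

        # Average the TF-IDF scores of the words in the sentence
        tfidf_sum = 0
        for word in words:
            tfidf_sum += tf_idf(word, document, corpus)

        tfidf_scores[sentence] = tfidf_sum / len(words)

    # Highest-scoring (most representative) sentences first
    return sorted(tfidf_scores.items(), key=lambda x: x[1], reverse=True)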

TF-IDF Algorithm

TextRank

Idea: Google's PageRank search result ranking algorithm
Ranks sentences based on their importance to each other
Uses a graph to determine the order of importance

Sent2Vec

Idea: Word2Vec
A way of turning sentences into long vectors
Embeds the semantic meaning of the sentences
Requires training on a related corpus to make sense

Cosine Similarity

A way of comparing the similarity of two vectors
Measures the cosine of the angle between the vectors from the origin

\frac{\sum^{n}_{i=1}A_i B_i}{\sqrt{\sum^{n}_{i=1}A_{i}^2}\times\sqrt{\sum^{n}_{i=1}B_{i}^2}}
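As a rough sketch, the formula above translates directly into plain Python. The function below assumes both vectors are equal-length sequences of numbers, and returning 0.0 for a zero-length vector is a choice made here, not something the formula defines.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors `a` and `b`."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat an all-zero vector as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


orthogonal = cosine_similarity([1, 0], [0, 1])   # vectors at a right angle
parallel = cosine_similarity([1, 2], [2, 4])     # same direction, different length
```

Because the dot product is divided by both magnitudes, two vectors that point the same way score close to 1 regardless of length, and perpendicular vectors score 0.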

TextRank for Summarization

Put the document into the TextRank graph data structure
Generate the similarity scores between all of the sentences
Rate the sentences based on the sum total of all of their similarity scores
Higher sum similarity = More likely to be a representative sentence
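The ranking step can be illustrated on its own, separate from how the similarity scores are produced. The sketch below assumes the pairwise scores are already available as a symmetric matrix; the function name `rank_by_total_similarity` and the numbers are made up for illustration.

```python
def rank_by_total_similarity(sentences, similarity):
    """Return (sentence, total) pairs sorted from highest total to lowest.

    `similarity[i][j]` is the similarity between sentence i and sentence j.
    A sentence's similarity to itself is left out of its total.
    """
    totals = []
    for i, sentence in enumerate(sentences):
        total = sum(similarity[i][j] for j in range(len(sentences)) if j != i)
        totals.append((sentence, total))
    return sorted(totals, key=lambda pair: pair[1], reverse=True)


# Hypothetical similarity scores for three sentences.
sentences = ["S_1", "S_2", "S_3"]
similarity = [
    [1.0, 0.5, 0.25],
    [0.5, 1.0, 0.125],
    [0.25, 0.125, 1.0],
]

ranked = rank_by_total_similarity(sentences, similarity)
```

Here S_1 ranks first because it is the most similar to the other two sentences combined, which is the sense in which a high sum marks a representative sentence.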

[Figure: TextRank graph with five sentence nodes, S_1 through S_5, and edges labeled with similarity scores 0.33, 0.25, 0.01, and 0.5]

\Sigma = 0.33 + 0.01 + 0.5 + 0.25 = 1.09

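import itertools
import networkx as nx

def textrank(document):
    # `sent2vec` and `cosine_similarity` are helpers defined elsewhere
    g = nx.Graph()
    g.add_nodes_from(document)

    # Embed each sentence once, then weight every pair by cosine similarity
    vectors = {sentence: sent2vec(sentence) for sentence in g.nodes}
    for sent1, sent2 in itertools.combinations(g.nodes, 2):
        g.add_edge(sent1, sent2, weight=cosine_similarity(vectors[sent1], vectors[sent2]))

    # Rank by summed edge weight, most representative sentences first
    return sorted(g.nodes, key=lambda s: g.degree(s, weight="weight"), reverse=True)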

TextRank Algorithm

Implementation

Python 3.6
- NLTK - Reuters and Moby-Dick Corpora
- NetworkX - Graph library
- PRAW - Reddit corpus
Jupyter Notebook
https://github.com/AlexanderDavid/AutomaticExtractiveSummarization

Results

Moby-Dick
Reuters
Reddit (/r/legaladvice)

Moby Dick - TFIDF

- On the contrary, passengers themselves must pay.

- Whaling voyage by one ishmael.”

- For to go as a passenger you must needs have a purse, and a purse is but a rag unless you have something in it.

- Why upon your first voyage as a passenger, did you yourself feel such a mystical vibration, when first told that you and your ship were now out of sight of land?

- Right and left, the streets take you waterward.

Moby Dick - TextRank

- Deep into distant woodlands winds a mazy way, reaching to overlapping spurs of mountains bathed in their hill-side blue.

- Why did the poor poet of tennessee, upon suddenly receiving two handfuls of silver, deliberate whether to buy him a coat, which he sadly needed, or invest his money in a pedestrian trip to rockaway beach?

- Well, then, however the old sea-captains may order me about – however they may thump and punch me about, I have the satisfaction of knowing that it is all right; that everybody else is one way or other served in much the same way

Reuters - TFIDF

- New crop sales were also light and all to open ports with June/July going at 1,850 and 1,880 dlrs and at 35 and 45 dlrs under New York july, Aug/Sept at 1,870, 1,875 and 1,880 dlrs per tonne FOB.

- March/April sold at 4,340, 4,345 and 4,350 dlrs.

- April/May butter went at 2.27 times New York May, June/July at 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at 2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and 2.27 times New York Dec, Comissaria Smith said.

- Cake sales were registered at 785 to 995 dlrs for March/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times New York Dec for Oct/Dec.

Reuters - TextRank

- There are doubts as to how much of this cocoa would be fit for export as shippers are now experiencing difficulties in obtaining +Bahia superior+ certificates.

- March/April sold at 4,340, 4,345 and 4,350 dlrs.

- April/May butter went at 2.27 times New York May, June/July at 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at 2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and 2.27 times New York Dec, Comissaria Smith said.

- Cake sales were registered at 785 to 995 dlrs for March/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times New York Dec for Oct/Dec.

Reddit - TFIDF

- I will just be saying MY home or MY driveway so I don’t have to keep typing “my parent’s driveway” or “my parent’s home” over and over.

- My parent’s neighbor’s kid (a very immature 20 year old) has a beater he leaves parked in front of my parent’s front yard.

- Now the parents are retaliating too. They finally moved the beater, but only to move their cars from the driveway to taking up the two spaces in front of our yard adjacent to their driveway.

- The one car parked just enough to have the front poking into our driveway.

- TLDR - neighbors parked all their cars in front of my parents home and wont move them, only rearrange them.

Reddit - TextRank

- These neighbors moved in a year or two ago and have made life so uncomfortable for my parents they are actually talking about selling their home to move, their marriage home with ALL the memories.

- He has it parked in the middle so that it takes up ALL the space and no one can park on either side of it without blocking a driveway.

- I assumed mom would just tell ’Billy, go move your dang car’ or something and it would be taken care of.

- I get the dad yelling at me to go away and get off his property as the mom (from another room) starts bellowing about how I did NOT just tell her how to parent and he can do whatever he wants and to heck with me and my parents.

Conclusions

TextRank and TF-IDF both work well on focused documents
TextRank tends to select longer sentences; further investigation is needed to determine why

Future Research

Automatic evaluation of the summaries
Using Long Short-Term Memory neural networks for summarization
Abstractive summaries may be better able to capture an overarching story
