A Comparison of Automatic Extractive Text Summarization Techniques

Alex Day, Soo Kim

Computer Information Science Department

Clarion University of Pennsylvania

Table of Contents

  • Introduction
  • Techniques of Summarization
    • Term Frequency-Inverse Document Frequency
    • Text Rank
  • Implementation
  • Results
  • Conclusions

Introduction

  • Motivation: 6 in 10 Americans read only the headline of a news article
  • Goal: To programmatically determine a summary of a document

Related Works

  • Hans Peter Luhn, "The automatic creation of literature abstracts"
  • Dipanjan Das and Andre FT Martins, "A survey on automatic text summarization"
  • Samir Bajaj, "Shakespeare in 100 Words"

What is a Summary?

  • According to Google Dictionary: "a brief statement or account of the main points of something."
  • A representative set of information that is, ideally, shorter than the original document
  • An Automatic Summary is one generated algorithmically

Types of Automatic Summary Algorithms

  • Abstractive
    • Understands the document content and produces new sentences
  • Extractive
    • Selects words/phrases already in the document

Term Frequency-Inverse Document Frequency

  • Assigns a numerical score to a term (word) within a document in a corpus
  • This score represents the importance of the term within the document

TF-IDF for Summarization

  • Generate an importance/relevance score for the entire sentence
  • Rank the sentences based on the importance scores
  • Higher score = More likely to be a representative sentence

TF-IDF Formulas

tf(t, d) = f_{t,d}
idf(t, D) = \log\frac{|D|}{|\{d \in D : t \in d\}|}
tfidf(t, d, D) = tf(t, d) \times idf(t, D)
score(S, d, D) = \frac{1}{|S|} \sum_{w \in S} tfidf(w, d, D)
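
The TF-IDF formulas can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each document is already tokenized into a list of words, and it assumes an IDF of 0.0 for a term that appears in no document (the formula itself is undefined in that case). The toy corpus is made up for the example.

```python
import math
from collections import Counter


def tf(term, document):
    """Raw count f_{t,d} of `term` in a tokenized document (a list of words)."""
    return Counter(document)[term]


def idf(term, corpus):
    """log(|D| / |{d in D : t in d}|); assumed 0.0 when no document has the term."""
    containing = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / containing) if containing else 0.0


def tfidf(term, document, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, document) * idf(term, corpus)


# Toy corpus of three tokenized documents (illustrative data only).
corpus = [
    ["the", "whale", "swims"],
    ["the", "ship", "sails"],
    ["the", "whale", "sinks", "the", "ship"],
]

common = tfidf("the", corpus[2], corpus)   # "the" occurs in every document
rarer = tfidf("sinks", corpus[2], corpus)  # "sinks" occurs in one document
```

A word that appears in every document gets an IDF of log(1) = 0, so `common` is zero no matter how often the word occurs, while `rarer` equals log(3).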

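TF-IDF Algorithm

def summarize(corpus, document):
    # document: a list of sentence strings; corpus: the collection D of documents.
    # tf_idf(word, document, corpus) is assumed to implement the TF-IDF formulas.
    tfidf_scores = {}

    for sentence in document:
        words = sentence.split()
        if not words:
            continue  # skip empty sentences to avoid dividing by zero
        tfidf_sum = 0
        for word in words:
            tfidf_sum += tf_idf(word, document, corpus)

        # Average TF-IDF over the words in the sentence
        tfidf_scores[sentence] = tfidf_sum / len(words)

    # Highest-scoring sentences first
    return sorted(tfidf_scores.items(), key=lambda x: x[1], reverse=True)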
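
For readers who want to try the sentence-scoring step end to end, here is a self-contained sketch. Several details are assumptions rather than anything stated in the slides: the `tokenize` helper (lowercasing plus a simple regular expression), the `k` parameter for how many sentences to return, and the two toy documents.

```python
import math
import re


def tokenize(text):
    """Lowercase word tokenizer (an assumption; the slides do not specify one)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tf_idf(word, document_words, corpus_word_sets):
    """Raw count in the document times log(|D| / number of documents with the word)."""
    tf = document_words.count(word)
    df = sum(1 for words in corpus_word_sets if word in words)
    return tf * math.log(len(corpus_word_sets) / df) if df else 0.0


def summarize(corpus, document, k=1):
    """Return the k sentences of `document` with the highest mean TF-IDF.

    `document` is a list of sentence strings; `corpus` is a list of such documents.
    """
    document_words = [w for s in document for w in tokenize(s)]
    corpus_word_sets = [{w for s in doc for w in tokenize(s)} for doc in corpus]

    scored = []
    for sentence in document:
        words = tokenize(sentence)
        if not words:
            continue  # nothing to score
        total = sum(tf_idf(w, document_words, corpus_word_sets) for w in words)
        scored.append((sentence, total / len(words)))

    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return [sentence for sentence, _ in scored[:k]]


# Two toy documents (illustrative data only).
doc_a = ["The whale sank the ship.", "The crew was on the ship."]
doc_b = ["The market opened higher.", "The traders were on the floor."]
top = summarize([doc_a, doc_b], doc_a, k=1)
```

Because the score is a mean over the words of a sentence, words shared across the corpus (such as "the" and "on") pull a sentence's score down, which is why the first sentence of `doc_a` ranks above the second here.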


Text Rank

  • Idea: based on Google's PageRank algorithm for ranking search results
  • Ranks sentences based on their importance to each other
  • Uses a graph to determine the order of importance

Sent2Vec

  • Idea: based on Word2Vec
  • A way of turning sentences into long vectors
  • Embeds the semantic meaning of the sentences
  • Requires training on a related corpus to make sense

Cosine Similarity

  • A way of comparing the similarity of two vectors
  • Measures the cosine of the angle between the vectors from the origin
\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}
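
The cosine similarity formula translates directly into code. The sketch below uses plain Python lists as vectors; returning 0.0 for a zero-length vector is an assumption made here to avoid division by zero, not something the formula defines.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: a zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])            # perpendicular vectors
```

Parallel vectors score approximately 1 regardless of their lengths, and perpendicular vectors score 0, which is why the measure suits comparing sentence vectors of different magnitudes.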

TextRank for Summarization

  • Put the document into the TextRank graph data structure
  • Generate the similarity scores between all of the sentences
  • Rate the sentences based on the sum total of all of their similarity scores
  • Higher sum similarity = More likely to be a representative sentence

[Figure: TextRank graph over sentences S_1 to S_5. In the example, the four edges of one sentence have weights 0.33, 0.01, 0.5, and 0.25, so its score is \Sigma = 0.33 + 0.01 + 0.5 + 0.25 = 1.09.]

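TextRank Algorithm

def textrank(document):
    # document: a list of sentence strings.
    # sent2vec(sentence) is assumed to return the sentence's embedding vector and
    # cosine_similarity(a, b) the cosine of the angle between two vectors.
    vectors = [sent2vec(sentence) for sentence in document]  # embed each sentence once

    # Fully connected graph: weights[i][j] is the similarity of sentences i and j
    weights = {i: {} for i in range(len(document))}
    for i in range(len(document)):
        for j in range(len(document)):
            if i != j:
                weights[i][j] = cosine_similarity(vectors[i], vectors[j])

    # Rank sentences by the sum of their edge weights, highest first
    ranked = sorted(weights, key=lambda i: sum(weights[i].values()), reverse=True)
    return [document[i] for i in ranked]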
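
The ranking step can be tried without a trained Sent2Vec model. The sketch below swaps in a bag-of-words count vector as the sentence embedding; that substitution is an assumption made for illustration only, and it captures word overlap rather than semantic meaning. The scoring follows the slides (sum of a sentence's similarities to all other sentences) rather than the iterative PageRank computation. The example sentences are made up.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Stand-in for Sent2Vec: a bag-of-words count vector (an assumption)."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def textrank(document):
    """Rank sentences by the sum of their similarities to every other sentence."""
    vectors = [embed(sentence) for sentence in document]
    totals = []
    for i, vector in enumerate(vectors):
        total = sum(cosine_similarity(vector, other)
                    for j, other in enumerate(vectors) if i != j)
        totals.append(total)
    order = sorted(range(len(document)), key=lambda i: totals[i], reverse=True)
    return [document[i] for i in order]


# Illustrative sentences only.
document = [
    "The car blocks the driveway.",
    "The neighbors parked a car.",
    "The neighbors moved the car from the driveway.",
    "It rained on Tuesday.",
]
ranked = textrank(document)
```

The third sentence shares words with both of the first two, so it accumulates the largest similarity sum and ranks first, while the unrelated sentence about rain has no overlap with the others and ranks last.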


Implementation

Results

  • Moby-Dick
  • Reuters
  • Reddit (/r/legaladvice)

Moby Dick - TFIDF

- On the contrary, passengers themselves must pay.

- Whaling voyage by one ishmael.”

- For to go as a passenger you must needs have a purse, and a purse is but a rag unless you have something in it.

- Why upon your first voyage as a passenger, did you yourself feel such a mystical vibration, when first told that you and your ship were now out of sight of land?

- Right and left, the streets take you waterward.

Moby Dick - TextRank

- Deep into distant woodlands winds a mazy way, reaching to overlapping spurs of mountains bathed in their hill-side blue.

- Why did the poor poet of tennessee, upon suddenly receiving two handfuls of silver, deliberate whether to buy him a coat, which he sadly needed, or invest his money in a pedestrian trip to rockaway beach?

- Well, then, however the old sea-captains may order me about – however they may thump and punch me about, I have the satisfaction of knowing that it is all right; that everybody else is one way or other served in much the same way

Reuters - TFIDF

- New crop sales were also light and all to open ports with June/July going at 1,850 and 1,880 dlrs and at 35 and 45 dlrs under New York july, Aug/Sept at 1,870, 1,875 and 1,880 dlrs per tonne FOB.

- March/April sold at 4,340, 4,345 and 4,350 dlrs.

- April/May butter went at 2.27 times New York May, June/July at 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at 2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and 2.27 times New York Dec, Comissaria Smith said.

- Cake sales were registered at 785 to 995 dlrs for March/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times New York Dec for Oct/Dec.

Reuters - TextRank

- There are doubts as to how much of this cocoa would be fit for export as shippers are now experiencing difficulties in obtaining +Bahia superior+ certificates.

- March/April sold at 4,340, 4,345 and 4,350 dlrs.

- April/May butter went at 2.27 times New York May, June/July at 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at 2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and 2.27 times New York Dec, Comissaria Smith said.

- Cake sales were registered at 785 to 995 dlrs for March/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times New York Dec for Oct/Dec.

Reddit - TFIDF

- I will just be saying MY home or MY driveway so I don’t have to keep typing ”my parent’s driveway” or ”my parent’s home” over and over.

- My parent’s neighbor’s kid (a very immature 20 year old) has a beater he leaves parked in front of my parent’s front yard.

- Now the parents are retaliating too. They finally moved the beater, but only to move their cars from the driveway to taking up the two spaces in front of our yard adjacent to their driveway.

- The one car parked just enough to have the front poking into our driveway.

- TLDR - neighbors parked all their cars in front of my parents home and wont move them, only rearrange them. 

Reddit - TextRank

- These neighbors moved in a year or two ago and have made life so uncomfortable for my parents they are actually talking about selling their home to move, their marriage home with ALL the memories.

- He has it parked in the middle so that it takes up ALL the space and no one can park on either side of it without blocking a driveway.

- I assumed mom would just tell ’Billy, go move your dang car’ or something and it would be taken care of.

- I get the dad yelling at me to go away and get off his property as the mom (from another room) starts bellowing about how I did NOT just tell her how to parent and he can do whatever he wants and to heck with me and my parents.

Conclusions

  • TextRank and TF-IDF are good for focused documents
  • TextRank produces longer sentences; further investigation is needed to determine why

Future Research

  • Automatic evaluation of the summaries
  • Using Long Short-Term Memory neural networks for summarization
  • Abstractive summaries may be able to better get a handle on an overarching story

Questions?
