Automatic Text Summarization: A Solid Base

Martijn B. Wieling,
Rijksuniversiteit Groningen

Outline

Why should we bother at all? (a.k.a. Introduction)
A frequency based ATS [Luhn, 1958]
An ATS based on multiple features [Edmundson, 1969]
Automatically combining the features (1) [Kupiec et al., 1995]
Automatically combining the features (2) [Teufel & Moens, 1997]
Why should we still bother? (a.k.a. Conclusion)

Why should we bother at all?

Time saving
Large scale application possible, e.g.

‘Google-xtract’
Extract translation

Abstracts will be consistent and objective

And in the beginning there was …

Hans Peter Luhn (“father of Information Retrieval”): The Automatic Creation of Literature Abstracts - 1958

Luhn’s method: basic idea

Target documents: technical literature
The method is based on the following assumptions:

Frequency of word occurrence in an article is a useful measurement of word significance
Relative position of these significant words within a sentence is also a useful measurement of word significance

Based on limited capabilities of machines (IBM 704) → no semantic information

Why word frequency?

Important words are repeated throughout the text

examples are given in favor of a certain principle
arguments are given for a certain principle
Technical literature → one word: one notion

Simple and straightforward algorithm → cheap to implement (processing time is costly)

Note that different forms of the same word are counted as the same word

When significant?

Words with too low a frequency are not significant
Words with too high a frequency are not significant either (e.g. “the”, “and”)
Removing low-frequency words is easy

set a minimum frequency threshold

Removing common (high-frequency) words:

Setting a maximum frequency threshold (statistically obtained)
Comparing to a common-word list
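
A minimal sketch of this filtering step, assuming a pre-tokenized, lowercased text. The function name significant_words, the min_freq value and the tiny common-word list are illustrative assumptions and not taken from Luhn's paper; merging different forms of the same word is left out.

```python
from collections import Counter

def significant_words(tokens, min_freq=2, common_words=frozenset()):
    """Keep words that reach the minimum frequency and are not common words.

    min_freq and common_words are hypothetical settings for illustration.
    """
    counts = Counter(tokens)
    return {w for w, c in counts.items()
            if c >= min_freq and w not in common_words}

tokens = "the cell divides and the cell grows and the membrane splits".split()
print(significant_words(tokens, min_freq=2, common_words={"the", "and"}))
```

Here the common-word list handles the high-frequency side; a statistically obtained maximum threshold would be the alternative named above.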

Using relative position

Where the greatest number of high-frequency words are found closest together → the probability is very high that representative information is given
Based on the observation that an idea is explained by words that occur close together (e.g. within sentences, paragraphs, chapters)

The significance factor

The “significance factor” of a sentence reflects the number of occurrences of significant words within a sentence and the linear distance between them due to non-significant words in between
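Only consider the portion of a sentence bracketed by significant words with a maximum of 5 non-significant words in between, e.g. “ (*) - - - [ * - * * - - * - - * ] - - (*) ”
Significance factor formula: (Σ[*])² / |[.]|, i.e. the squared number of significant words inside the brackets divided by the total number of words inside the brackets
(5² / 10 = 2.5 in the above example)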
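
The computation can be sketched as follows, assuming each sentence is given as a list of booleans (True = significant word). The function name, the choice to return the best-scoring cluster in a sentence, and the handling of max_gap are assumptions made for illustration; only the bracketed portion of the slide example is checked.

```python
def significance_factor(flags, max_gap=5):
    """Best (significant count)^2 / span length over all clusters in a sentence.

    A cluster is a run that starts and ends with a significant word and has
    at most max_gap non-significant words between neighbouring significant words.
    """
    best = 0.0
    start = last = None   # first and most recent significant word of the cluster
    count = 0             # significant words in the current cluster
    for i, sig in enumerate(flags):
        if not sig:
            continue
        if last is not None and i - last - 1 > max_gap:
            # Gap too large: close the current cluster and start a new one.
            best = max(best, count ** 2 / (last - start + 1))
            start, count = i, 0
        if start is None:
            start = i
        count += 1
        last = i
    if count:
        best = max(best, count ** 2 / (last - start + 1))
    return best

# Bracketed portion of the example: * - * * - - * - - *
bracketed = [True, False, True, True, False, False, True, False, False, True]
print(significance_factor(bracketed))  # 5 significant words over 10 positions
```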

Generating the abstract

For every sentence the significance factor is calculated
The sentences with a significance factor higher than a certain cut-off value are returned (alternatively the N highest-valued sentences can be returned)
For large texts, it can also be applied to subdivisions of the text
The journal paper presents no evaluation of the results!
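
Both selection variants (cut-off value or N highest-valued sentences) can be sketched in a few lines. The function name and the decision to return sentence indices in document order are assumptions for illustration.

```python
def select_sentences(scores, cutoff=None, top_n=None):
    """Return indices of the selected sentences in original document order.

    Either pass cutoff (keep scores strictly above it) or top_n
    (keep the top_n highest-scoring sentences).
    """
    if cutoff is not None:
        return [i for i, s in enumerate(scores) if s > cutoff]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_n])

scores = [0.5, 2.5, 1.0, 3.0]   # hypothetical significance factors
print(select_sentences(scores, cutoff=1.0))
print(select_sentences(scores, top_n=3))
```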

A new method by Edmundson

H.P. Edmundson: New methods in Automatic Extracting - 1969

Four methods for weighting

Weighting methods:

Cue Method
Key Method
Title Method
Location Method

The weight of a sentence is a linear combination of the weights obtained with the above four methods
The highest-weighted sentences are included in the abstract
Target documents: technical literature

Cue Method

Based on the hypothesis that the probable relevance of a sentence is affected by the presence of pragmatic words (e.g. “Significant”, “Greatest”, “Impossible”, “Hardly”)
Three types of Cue words:

Bonus words: positively affecting the relevance of a sentence (e.g. “Significant”, “Greatest”)
Stigma words: negatively affecting the relevance of a sentence (e.g. “Impossible”, “Hardly”)
Null words: irrelevant

Obtaining Cue words

The lists were obtained by statistical analyses of 100 documents:

Dispersion (λ): number of documents in which the word occurred
Selection ratio (η): ratio of number of occurrences in extractor-selected sentences to number of occurrences in all sentences

Bonus words: η > t_high
Stigma words: η < t_low
Null words: λ > t_λ and t_low < η < t_high (t_high and t_low are thresholds on η, t_λ is a threshold on λ)
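
The three rules can be sketched as a small classifier. The function name and all threshold values in the usage lines are hypothetical; the slides do not give the thresholds that were actually used.

```python
def classify_cue_word(dispersion, selection_ratio, t_high, t_low, t_lambda):
    """Assign a word to the Bonus, Stigma or Null list, or to none of them.

    dispersion is λ (number of documents containing the word),
    selection_ratio is η; the thresholds are assumed inputs.
    """
    if selection_ratio > t_high:
        return "bonus"
    if selection_ratio < t_low:
        return "stigma"
    if dispersion > t_lambda and t_low < selection_ratio < t_high:
        return "null"
    return None  # the word ends up on none of the three Cue lists

# Hypothetical thresholds, for illustration only
print(classify_cue_word(50, 0.8, t_high=0.6, t_low=0.2, t_lambda=30))
print(classify_cue_word(10, 0.4, t_high=0.6, t_low=0.2, t_lambda=30))
```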

Resulting Cue lists

Bonus list (783): comparatives, superlatives, adverbs of conclusion, value terms, etc.
Stigma list (73): anaphoric expressions, belittling expressions, etc.
Null list (139): ordinals, cardinals, the verb “to be”, prepositions, pronouns, etc.

Cue weight of sentence

Tag all Bonus words with weight b > 0, all Stigma words with weight s < 0, all Null words with weight n = 0
Cue weight of sentence: Σ (Cue weight of each word in sentence)

Key Method

Principle based on [Luhn], counting the frequency of words.
Algorithm differs:

Create key glossary of all non-Cue words in the document which have a frequency larger than a certain threshold
Weight of each key word in the key glossary is set to the frequency it occurs in the document
Assign key weight to each word which can be found in the key glossary
If word is not in key glossary, key weight: 0
No relative position is used (unlike [Luhn])

Key weight of sentence: Σ (Key weight of each word in sentence)
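
A minimal sketch of the Key Method as described above, assuming tokenized input. The function names, the example Cue words and the threshold value are illustrative assumptions.

```python
from collections import Counter

def key_glossary(tokens, cue_words, threshold):
    """Map each non-Cue word with frequency above threshold to its frequency."""
    counts = Counter(t for t in tokens if t not in cue_words)
    return {w: c for w, c in counts.items() if c > threshold}

def key_weight(sentence_tokens, glossary):
    """Sum the key weights of the words in a sentence (0 if not in the glossary)."""
    return sum(glossary.get(t, 0) for t in sentence_tokens)

doc = "parser builds tree parser reads tree parser is significant".split()
glossary = key_glossary(doc, cue_words={"is", "significant"}, threshold=1)
print(glossary)
print(key_weight(["parser", "reads", "tree"], glossary))
```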

Title Method

Based on the hypothesis that an author conceives title as circumscribing the subject matter of the document (similarly for headings vs. paragraphs)
Create title glossary consisting of all non-Null words in the title, subtitle and headings of the document
Words are given a positive title weight if they appear in this glossary
Title words are given a larger weight than heading words
Title weight of sentence: Σ (Title weight of each word in sentence)

Location Method

Based on the hypothesis that:

Sentences occurring under certain headings are positively relevant
Topic sentences tend to occur very early or very late in a document and its paragraphs

Global idea:

Give each sentence below its heading the same weight as the heading itself (note that this is independent from the Title Method) – Heading weight
Give each sentence a certain weight based on its position - Ordinal weight
Location weight of sentence: Ordinal weight of sentence + Heading weight of sentence

Location Method: Heading weight

Compare each word in a heading with the pre-stored Heading dictionary
If the word occurs in this dictionary, assign it a weight equal to the weight it has in the dictionary
Heading weight of a heading: Σ (heading weight of each word in heading)
Heading weight of a sentence = Heading weight of its heading

Creating the Heading dictionary

The Heading dictionary was created by listing all words in the headings of 120 documents and calculating the selection ratio for each word:

Selection ratio (η): ratio of number of occurrences in extractor-selected sentences to number of occurrences in all headings

Deletions from this list were made on the basis of low frequency and unrelatedness to the desired information types (subject, purpose, conclusion, etc.)
Weights were given to the words in the Heading dictionary proportional to the selection ratio
The resulting Heading dictionary contained 90 words

Location Method: Ordinal weight

Sentences of the first paragraph are tagged with weight O1
Sentences of the last paragraph are tagged with weight O2
The first sentence of a paragraph is tagged with weight O3
The last sentence of a paragraph is tagged with weight O4
Ordinal weight of sentence: O1 + O2 + O3 + O4, where each term is included only if its condition applies to the sentence
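
A sketch of the ordinal weight, assuming that only those of O1 to O4 whose condition holds for a sentence are summed (the slide lists the four terms without spelling this out). The function name, the zero-based indices and the weight values are assumptions for illustration.

```python
def ordinal_weight(par_index, n_pars, sent_index, n_sents, o1, o2, o3, o4):
    """Ordinal weight of one sentence from its position (zero-based indices).

    par_index / n_pars: paragraph position in the document,
    sent_index / n_sents: sentence position within its paragraph.
    """
    weight = 0
    if par_index == 0:            # sentence in the first paragraph
        weight += o1
    if par_index == n_pars - 1:   # sentence in the last paragraph
        weight += o2
    if sent_index == 0:           # first sentence of its paragraph
        weight += o3
    if sent_index == n_sents - 1: # last sentence of its paragraph
        weight += o4
    return weight

# First sentence of the first paragraph, with hypothetical weights
print(ordinal_weight(0, 5, 0, 3, o1=4, o2=3, o3=2, o4=1))
```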

Generating the abstract

Calculate the weight of a sentence: aC + bK + cT + dL, with a,b,c,d constant positive integers, C: Cue Weight, K: Key weight, T: Title weight, L: Location weight
The values of a, b, c and d were obtained by manually comparing the generated automatic abstracts with the desired (human made) abstract
Return the highest N sentences under their proper headings as the abstract (including title)

N is calculated by taking a percentage of the size of the original documents, in this journal paper 25% is used
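
The weighting and selection step can be sketched as below. Measuring the 25% in number of sentences and rounding up are assumptions, as are the function name and the example feature values.

```python
import math

def edmundson_extract(features, a, b, c, d, fraction=0.25):
    """Select the highest-weighted sentences, returned in document order.

    features: one (C, K, T, L) tuple per sentence, in document order.
    """
    weights = [a * C + b * K + c * T + d * L for C, K, T, L in features]
    n = max(1, math.ceil(fraction * len(features)))
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:n])

# Hypothetical per-sentence weights (C, K, T, L)
features = [(1, 0, 2, 3), (0, 5, 0, 0), (-1, 1, 0, 1), (2, 2, 2, 2),
            (0, 0, 0, 0), (0, 1, 1, 0), (1, 1, 1, 1), (0, 0, 0, 5)]
print(edmundson_extract(features, a=1, b=1, c=1, d=1))
```

Setting b = 0 drops the Key method from the combination without changing the rest of the code.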

Which combination is best?

All combinations of C, K, T and L were tried to see which result had (on average) the most overlap with the handmade extract
The best abstracts were obtained when the Key method was omitted and only C, T and L were used
Surprising result! (Luhn used only keywords to create the abstract)

Evaluation

Evaluation was done on unseen data (40 technical documents), comparison with handmade abstracts

Result: 44% of the sentences co-selected, 66% similarity between abstracts (human judge)
Random ‘abstract’: 25% of the sentences co-selected, 34% similarity between abstracts

Another evaluation criterion: ‘extract-worthiness’

Result: 84% of the selected sentences are extract-worthy
Therefore: for one document many possible abstracts (differing in length and content)

Comments

[Goldstein et al., 1999]: Not good to base length of abstract on length of document

Summary length is independent of document length
The longer the document, the smaller the compression ratio ( |abstract| / |doc.| )
Better to use constant summary length

[Rath et al., 1961]: Human selection of sentences in abstracts is very variable

6 abstracts of 20 sentences each: only 32% overlap among 5 subjects (among all 6: 8%)
Abstracting the same document twice by the same person with 8 weeks in between: only 55% overlap (average for 6 subjects)

Perhaps the Key Method algorithm used here is not that good (Luhn’s algorithm could be better)

Time and cost of this system 

Speed of extracting: 7800 words/minute
Cost: $0.015 / word

Including keypunching costs: $0.01 / word
Used corpus of 29,500 words → $442.50 total cost
In 2003 dollars (CPI-adjusted): $2,798.00 total cost

A jump in time

1969: First man on the moon
1972: Watergate scandal
1980: John Lennon killed
1981: First identification of AIDS & Birth of me 
1986: Space Shuttle Challenger explodes after launch
1989: Fall of Berlin Wall
1990: Start Gulf War & Introduction WWW
1991: Soviet Union breaks up
1992: Formal end of Cold War
1993: Creation of European Union (“Verdrag van Maastricht”)
1994: Nelson Mandela president of South Africa

1995: Trained summarization

Julian Kupiec, Jan Pedersen and Francine Chen: A Trainable Document Summarizer - 1995

Trained weighting

Edmundson used subjective weighting of the features (Cue, Key, Title, Location) to create an abstract
In this journal paper generating the abstract is approached as a statistical classification problem

Given a training set of documents with handmade abstracts:
Develop a classification function that estimates the probability a given sentence is included in the abstract

This requires a training corpus of documents with abstracts
Target documents: technical literature

Features

Five features were used:

Sentence Length Cut-off Feature
Fixed Phrase Feature
Paragraph Feature
Thematic Word Feature
Uppercase Word Feature

The above features were chosen by experimentation

Sentence Length Cut-off Feature

Based on the principle that short sentences are often not included in abstracts
Given a threshold (e.g. 5 words):

SLC-value is true for sentences longer than the threshold
SLC-value is false otherwise

Note that this feature is not similar to any of the features Edmundson used

Fixed-Phrase Feature

Based on the hypothesis that:

sentences containing any of a list of fixed phrases (mostly 2 words long) are likely to be in the abstract (e.g. “in conclusion”, “this result” – total: 26 elements)
Sentences following a heading containing a certain keyword are more likely to be in the abstract (e.g., “conclusions”, “results”, “summary”)

FP-value is true for sentences in the above situations, false otherwise
Note that this feature is a combination of Edmundson’s Location Method and Cue Method, though in reduced form

Paragraph Feature

Each sentence in the first ten and last five paragraphs is tagged based on its location

Paragraph-initial
Paragraph-final (|P| > 1 sentence)
Paragraph-medial (|P| > 2 sentences)

Note that this feature is a reduced form of Edmundson’s Location Method

Thematic Word Feature

The most frequent words in a document are defined as thematic words
A small number of thematic words is selected and each sentence is scored as a function of frequency of these thematic words
TW-value is true if it is one of the highest scoring sentences
TW-value is false otherwise
Note that this feature is an adapted version of Edmundson’s Key Method

Uppercase Word Feature

Based on the hypothesis that proper names are often important, as is the explanatory text for acronyms (e.g. “… the ISO (International Standards Organization) …”)
Count the frequency of each proper name

Constraint: the uppercase thematic word is not sentence initial and begins with a capital letter
The word must occur several times and may not be an abbreviated measurement unit

Score each sentence based on the number of frequent proper names in each sentence

A sentence in which a frequent proper name appears for the first time scores twice as high as sentences with later occurrences

UW-value is true if it is one of the highest scoring sentences, false otherwise
Note that this feature is a bit similar to Edmundson’s Key Method

Classification

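For each sentence s the probability P is calculated that it will be included in the summary S given the k features F1, …, Fk (Bayes’ rule):
P(s ∈ S | F1, …, Fk) = P(F1, …, Fk | s ∈ S) · P(s ∈ S) / P(F1, …, Fk)
Assuming statistical independence of the features:
P(s ∈ S | F1, …, Fk) = P(s ∈ S) · Πj P(Fj | s ∈ S) / Πj P(Fj)
P(s ∈ S) is constant, and P(Fj | s ∈ S) and P(Fj) can be estimated directly from the training set by counting occurrences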
This function assigns for each s a score which can be used to select sentences for inclusion in the abstract
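
A minimal sketch of such a classifier: Bayes' rule with independent features, with every probability estimated by counting. The function names, the dictionary layout of the model and the toy training data are assumptions; no smoothing is applied, so an unseen feature value gives a score of 0.

```python
from collections import Counter

def train(examples):
    """examples: list of (feature_tuple, in_summary) pairs from the training set."""
    k = len(examples[0][0])
    all_counts = [Counter() for _ in range(k)]   # counts for P(Fj)
    pos_counts = [Counter() for _ in range(k)]   # counts for P(Fj | s in S)
    n_pos = 0
    for feats, in_summary in examples:
        n_pos += in_summary
        for j, value in enumerate(feats):
            all_counts[j][value] += 1
            if in_summary:
                pos_counts[j][value] += 1
    return {"n": len(examples), "n_pos": n_pos,
            "all": all_counts, "pos": pos_counts}

def score(model, feats):
    """P(s in S) * prod_j P(Fj | s in S) / prod_j P(Fj)."""
    result = model["n_pos"] / model["n"]
    for j, value in enumerate(feats):
        p_f = model["all"][j][value] / model["n"]
        if p_f == 0:
            return 0.0   # feature value never seen in training
        result *= (model["pos"][j][value] / model["n_pos"]) / p_f
    return result

# Toy data: (fixed_phrase, paragraph_initial) -> was the sentence in the abstract?
examples = [((True, True), True), ((True, False), True),
            ((False, True), False), ((False, False), False),
            ((True, False), False), ((False, False), False)]
model = train(examples)
print(score(model, (True, True)), score(model, (True, False)))
```

Because of the independence assumption the score is only an approximation of the probability; it is used for ranking sentences.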

The training material

188 documents with professionally created abstracts from the scientific/technical domain, the average length of the abstracts is 3 sentences (3.5% of the total size of the document)
Sentences from the abstract were matched to the original document:

79% direct sentence matches
3% direct joins (2 sentences combined)
18% no direct match or join possible

Therefore the maximum performance of the automatic system is 82%

Evaluation (1)

Too little material → Cross-validation used to evaluate
Two evaluation measures

Fraction of manually selected sentences which were reproduced correctly: average result: 35%
Fraction of the matchable selected sentences which were reproduced correctly: average result: 42%

Performance of features (2nd measure):

Evaluation (2)

Best combination is: Paragraph + Fixed Phrase + Length Cut-off (44% performance)
Addition of frequency keyword features results in a slight decrease of performance (44% → 42%)

Note that Edmundson in this case also reports a decrease in performance

In final implementation frequency keyword features are retained in favor of robustness
Baseline used in this experiment: Selecting N sentences from the beginning (Length Cut-off, thus positively biased)
Full feature set has an improvement of 74% over baseline (24% → 42%)

Evaluation (3)

If the size of the generated abstract is increased to 25%, the performance improves to 84%
Edmundson ‘only’ had a performance of 44%

Comments

The features used in this paper were chosen by experimentation

No results/discussions of these experiments are given in the paper, so the reasons for the choices remain unclear…

The comparison to Edmundson is not very fair

Handmade reference abstracts of Edmundson had a size of 25% (here 3.5%)

Also the comments which were given about [Edmundson] apply here:

Not good to base length of abstract on length of document
Human selection of sentences in abstracts is very variable
Perhaps the Key Method algorithm used here is too simple (Luhn’s algorithm could be better)

Revisited: [Kupiec et al., 1995]

Simone Teufel and Marc Moens: Sentence extraction as a classification task - 1997

Main research questions

Could the methodology of Kupiec et al. (training a model with a corpus) be used for another evaluation criterion?
What was the difference in extracting performance between the two evaluation criteria for different types of documents?
Note that a different set of features is used here than in Kupiec et al.

Another evaluation method

Kupiec et al. used the ‘match sentences’ evaluation criterion
Here the training and test set abstracts are created by the authors themselves (as opposed to Kupiec et al.)
Hence fewer alignable sentences are available in the document

32% on average vs. 79% in Kupiec et al.

This does not mean there are fewer ‘extract-worthy’ sentences in the document → another evaluation method is chosen
Evaluation: ask human to identify abstract-worthy non-matchable sentences in the original document

Features

The features used here are different from Kupiec et al.

Cue Phrase Method (1670 cue phrases)
Location Method
Sentence Length Method
Thematic Word Method
Title Method

Cue Phrase Method

Similar to Edmundson, with some differences:

A 5-point scale (-1 … +3) is used instead of 3 classes (Bonus, Null, Stigma)
Cue phrases are used instead of Cue words
If a phrase was entered into the list, syntactically and semantically similar phrases were also manually included in the list
A sentence gets the score of its maximum-scored Cue phrase; if no Cue phrases are present it gets a score of 0

The list was manually created by inspecting extracted sentences

Also based on relative frequency in abstract and relative frequency in document

Sentences occurring directly after headings like ‘Introduction’ or ‘Conclusion’ are given a prior score of +2 (in Edmundson this is part of the Location Method)
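
The maximum-score rule can be sketched as follows. The phrase list, the scores and the plain lowercase substring matching are assumptions for illustration; the prior score for sentences after certain headings is not modelled here.

```python
def cue_phrase_score(sentence, phrase_scores):
    """Score of the highest-scored cue phrase found in the sentence, else 0.

    phrase_scores maps lowercase cue phrases to scores on the -1 … +3 scale.
    """
    text = sentence.lower()
    found = [score for phrase, score in phrase_scores.items() if phrase in text]
    return max(found) if found else 0

# Hypothetical cue phrases and scores
phrase_scores = {"in this paper": 3, "we conclude": 2, "for example": -1}
print(cue_phrase_score("In this paper we conclude that parsing helps.", phrase_scores))
print(cue_phrase_score("For example, consider a short sentence.", phrase_scores))
```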

Location Method

As in Edmundson, with the exception of the sentences directly after headings previously mentioned
Sensitive to certain headings (e.g. “Introduction”); if such headings cannot be found: only the sentences of the first 7 and last 3 paragraphs are tagged (initial, medial, final)

Sentence Length Method

As in Kupiec et al.
The threshold is set to 15 tokens (including punctuation)

Thematic Word Method

As in Kupiec et al., with a few differences:
Selecting (non-Cue) words which occur frequently in this document, but rarely in the overall collection of documents
For each (non-Cue) word the term-frequency * inverse-document-frequency value is calculated:
score(w) = f_loc * log (100 * N / f_glob)

with N: total number of documents, f_loc: frequency of word w in document, f_glob: number of documents containing word w

Top 10 scoring words are defined as thematic words
Top 40 sentences based on the frequency of thematic words (normalized by sentence length) are given a TW-value of 1, all others 0
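
A sketch of the thematic word selection using the score formula above. The natural logarithm is an assumption (the slide does not state the base), as are the function names and the example counts.

```python
import math

def tfidf_score(f_loc, f_glob, n_docs):
    """score(w) = f_loc * log(100 * N / f_glob), natural log assumed."""
    return f_loc * math.log(100 * n_docs / f_glob)

def thematic_words(doc_counts, doc_freqs, n_docs, top_k=10):
    """Return the top_k words of one document, ranked by tf*idf score.

    doc_counts: word -> frequency in this document (f_loc)
    doc_freqs:  word -> number of documents containing the word (f_glob)
    """
    scores = {w: tfidf_score(c, doc_freqs[w], n_docs)
              for w, c in doc_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical counts for one document in a collection of 100 documents
doc_counts = {"parser": 6, "result": 4, "chart": 3}
doc_freqs = {"parser": 5, "result": 90, "chart": 2}
print(thematic_words(doc_counts, doc_freqs, n_docs=100, top_k=2))
```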

Title Method

As in Edmundson, with the difference that:
The Title score of the sentence is the mean frequency of Title word occurrences in the sentence (in Edmundson each Title word was given the same score and the scores were summed)
Headings are not taken into account here (by experimentation)
The 18 top-scoring sentences receive a Title-value of 1, the others 0
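
A sketch of this Title score, assuming that "mean frequency" means the number of Title word occurrences divided by the sentence length in tokens. The function names and example words are illustrative; ties at the cut-off are broken by document order.

```python
def title_score(sentence_tokens, title_words):
    """Title word occurrences in the sentence divided by sentence length."""
    if not sentence_tokens:
        return 0.0
    hits = sum(1 for t in sentence_tokens if t in title_words)
    return hits / len(sentence_tokens)

def binarize_top(scores, top_n=18):
    """Give the top_n highest-scoring sentences the value 1, all others 0."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:top_n])
    return [1 if i in top else 0 for i in range(len(scores))]

title_words = {"automatic", "summarization"}
print(title_score(["automatic", "text", "summarization", "works"], title_words))
print(binarize_top([0.1, 0.5, 0.3], top_n=2))
```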

The experiment

Training set: a corpus of 124 documents from different areas of computational linguistics with summaries written by the authors
A human judge marked additional abstract-worthy sentences in each document
32% alignable sentences in the abstracts
Two evaluation methods (‘alignable’ and ‘abstract-worthy’) which were also combined

Summary of results

Baseline: 28% (obtained in a similar fashion as Kupiec et al.)
The poor performance of 31.6% for alignability can be explained by the smaller number of alignable sentences to train on
Short abstracts were generated (2 – 5% of the size of the original document)
If the abstract size were increased to 25%, performance would increase to:

‘Alignability’: 96% (Kupiec et al.: 84%)
‘Abstract-worthy’: 98%
Combined: 97.3%

Therefore compression makes the difference, not the evaluation criterion

Conclusions of this experiment

The method proposed by Kupiec et al. of classificatory sentence selection is not restricted to texts which have high-quality handmade abstracts
A higher alignability of the handmade abstract is therefore not necessary for the purpose of sentence extraction – compression rate is the factor which influences the result
However, if more flexible abstracts should be generated, the addition of other training and evaluation criteria is useful
Increased training did not improve results; improvement can be obtained in the extraction methods themselves

Comments

The features used in this paper were different from Kupiec et al.

No motivation was given why, for instance, the Uppercase Word feature was omitted, or why adapted versions of Edmundson’s methods were chosen instead of the versions Kupiec et al. used

Also comments which were given about [Edmundson] apply here:

Not good to base length of abstract on length of document
Human selection of ‘abstract-worthy’ sentences in abstracts is very variable

Why should we still bother …

In the discussed methods no attention is given to:

Cohesion of the abstract: filtering anaphors out of an abstract (e.g. ‘it’, ‘that’)
Filtering out repetition in the abstract
The semantics of the document

Cohesion: an attempt is made by using Lexical Chains
Repetition: an attempt is made by using Maximum Marginal Relevance
Semantics: this can still not be done for the general case, but an attempt is made by using Rhetorical Tree Structures
Interested in these problems?
Wicher will explain extraction methods which will address repetition and semantics problems in his presentation
Terrence will explain Lexical Chains in his presentation

References

The Automatic Creation of Literature Abstracts, H.P. Luhn, 1958
New Methods in Automatic Extracting, H.P. Edmundson, 1969
A Trainable Document Summarizer, J. Kupiec et al., 1995
Sentence Extraction as a Classification Task, S. Teufel and M. Moens, 1997
The Formation of Abstracts by the Selection of Sentences, G.J. Rath et al., 1961
Constructing Literature Abstracts by Computer: Techniques and Prospects, C.D. Paice, 1990
Summarizing Text Documents: Sentence Selection and Evaluation Metrics, Goldstein et al., 1999

Any questions?
