Madhavi Ganapathiraju Graduate Student Language Technologies Institute Carnegie Mellon University
In this talk…
What do we mean by Summarization
Expectations
Intuitive guesses on “how to”
Current approaches
One specific method in detail
Reduce the length of the document
But preserve:
- Key information
- Style of writing
Expected Qualities:
How can we say a summary is good or bad?
Be able to answer questions
Compression ratio
No redundancy
How is it done manually?
Read the document
Identify important “phrases”
Identify the chronology of events, if any
How to do it automatically:
Edmundson’s method
His work at IBM
- In 1969!!
- Forms major component even in today’s systems!!!!
Edmundson’s method
Scoring schemes derived
Keyword-occurrence
Title-keyword
Location heuristic
Indicative phrases
- “this report …”, “in conclusion…”
Short-length cutoff
Upper-case word feature
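The scoring schemes listed above can be combined into a single sentence score. The sketch below is only an illustration under stated assumptions, not the original procedure: the equal weights, the tiny stopword list, the `min_words` cutoff and the helper names `tokenize` and `score_sentences` are all hypothetical, and the location heuristic is simplified to "first or last sentence".

```python
import re
from collections import Counter

# Cue phrases taken from the slide; everything else here is an assumption.
CUE_PHRASES = ("this report", "in conclusion")
STOPWORDS = {"the", "a", "an", "of", "is", "in", "on", "this", "that",
             "was", "and", "to"}


def tokenize(text):
    """Lower-case the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def score_sentences(sentences, title, weights=(1.0, 1.0, 1.0, 1.0), min_words=3):
    """Score each sentence as a weighted sum of four heuristic features."""
    w_key, w_title, w_loc, w_cue = weights
    title_words = set(tokenize(title)) - STOPWORDS
    # Keyword-occurrence: how often each content word appears in the document.
    counts = Counter(w for s in sentences for w in tokenize(s)
                     if w not in STOPWORDS)
    last = len(sentences) - 1
    scores = []
    for i, sentence in enumerate(sentences):
        tokens = tokenize(sentence)
        # Short-length cutoff: very short sentences are never selected.
        if len(tokens) < min_words:
            scores.append(0.0)
            continue
        content = [w for w in tokens if w not in STOPWORDS]
        key = sum(counts[w] for w in content) / max(len(content), 1)
        title_overlap = len(title_words & set(content))
        location = 1.0 if i in (0, last) else 0.0
        cue = 1.0 if any(p in sentence.lower() for p in CUE_PHRASES) else 0.0
        scores.append(w_key * key + w_title * title_overlap
                      + w_loc * location + w_cue * cue)
    return scores
```

With this sketch, a sentence that opens the document, repeats title words and contains a phrase such as “this report” scores higher than a mid-document sentence with none of these features. In practice the weights would be tuned on documents with known good summaries.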
Graph theoretic method
How to put key information together?
Synthesize new sentences?
- Too difficult… to synthesize accurately
- Systems exist
- Undesirable
- Original style of writing lost
- Subtle information like tone of presentation lost
Take the top-scoring sentences
Arrange them by descending scores
Preserve chronology if it exists
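The selection step can be sketched as follows. This is a minimal illustration, assuming sentence scores are already available; the function name `select_summary` and the `preserve_order` flag are hypothetical. Sentences are ranked by descending score, the top `k` are kept, and, if the document has a chronology worth keeping, they are put back in their original order.

```python
def select_summary(sentences, scores, k, preserve_order=True):
    """Pick the k highest-scoring sentences for the summary."""
    # Rank sentence indices by descending score.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:k]
    if preserve_order:
        # Restore document order so any chronology is kept.
        chosen.sort()
    return [sentences[i] for i in chosen]
```

For example, with scores `[0.8, 0.1, 0.3, 0.9]` and `k=2`, the first and fourth sentences are chosen; with `preserve_order=False` the fourth comes first because it scored higher.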
Redundancy
Edmundson’s procedure:
Novel methods to avoid redundancy
- Maximal Marginal Relevance (MMR)
Similarity between sentences
Semester begins tomorrow
New semester is beginning on Monday
S1 = [Semester(1) begin(1) tomorrow(1)]
S2 = [New(1) semester(1) begin(1) Monday(1)]
Similarity
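One common way to compare the two term vectors above is cosine similarity. The slide does not say which measure is meant, so the choice of cosine here is an assumption, as is the lower-casing of terms and the function name `cosine_similarity`. The stemmed counts are copied from S1 and S2.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Term vectors from the slide, lower-cased (an assumption).
s1 = Counter({"semester": 1, "begin": 1, "tomorrow": 1})
s2 = Counter({"new": 1, "semester": 1, "begin": 1, "monday": 1})

print(round(cosine_similarity(s1, s2), 3))  # prints 0.577
```

The two sentences share two terms ("semester" and "begin"), so the dot product is 2; the vector lengths are √3 and 2, giving 2 / (2√3) ≈ 0.577.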
Clusters of sentences
Candidature of a sentence to be in summary:
- Similarity to query.
- Coverage of the passage
- Content in the passage, e.g., proper nouns, dates, etc.
- Time Sequence: more recent ones
Undesirable features in sentences:
- Similarity to passages already included in the summary
- Belonging to the cluster/document that has already contributed a sentence to the summary
MMR algorithm
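In its basic form, MMR builds the summary greedily: at each step it picks the candidate that is most similar to the query and least similar to what has already been selected, with a parameter λ trading off the two. The sketch below illustrates that greedy loop under several assumptions: sentences are bag-of-words count vectors, both similarities are cosine, and the names `cosine`, `mmr_select` and the default `lam=0.7` are hypothetical. It does not include the cluster, coverage or time-sequence features listed earlier.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_select(candidates, query, k, lam=0.7):
    """Greedily choose up to k candidate indices by marginal relevance."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            # Reward similarity to the query.
            relevance = cosine(candidates[i], query)
            # Penalize similarity to anything already in the summary.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` the redundancy term vanishes and the selection reduces to plain relevance ranking; lower values of `lam` push the summary toward sentences that add new information.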
Future presentations on Summarization & Contact persons for research in this area: Nikesh Garera (ng+@cs.cmu.edu) Learning Methods Ravindra G. (ravi@mmsl.serc.iisc.ernet.in)