Novelty Detection in Repeated MEAD Summarization

Richard Murphy: murphyr@umich.edu / EECS 597 - Language & Information, Fall 2002 / Final Project

Contents

  1. What is MEAD?
  2. Why was this project undertaken?
  3. How does the novelty module work?
  4. How do I use this module?
  5. What other materials are available on this site?
  6. What is the future of this project?

What is MEAD?

MEAD is an system created by the CLAIR group at the University of Michigan to create multi-document extractive text summaries. The goal of an extractive text summary is to present the reader with the maximum amount of information possible in a smaller form than the original collection of documents. NewsInEssence is an web application powered by MEAD to provide extractive summaries of current news articles, which are automatically collected and categorized from a number of sources.

As a quick example of the type of inputs and output MEAD can take, look at this cluster of articles and a 20% summary produced from the cluster by MEAD. While one of the sentences in the summary is not important to the story ([5]) and two of the sentences are nearly identical ([7] and [12]), the summary presents a readable presentation of the main points of the article cluster.

[top]

Why was this project undertaken?

The problem with MEAD occurs when a reader attempts to view a second summary of a cluster.  Since most summarized documents are news articles, and news articles often cover on-going events.  As events unfold or further information is found, the body of documents relevant to a given topic will grow.  A reader may return to a story some hours, days, or weeks after viewing a summary of that story in hopes of viewing fresh information.  In this case, the reader would request a new summary drawn from the expanded cluster.

Unfortunately, the current implementation of MEAD has no way to remember that the user has already seen a summary of the cluster.  Since the information presented in the first summary will still most likely be among the most important information available in the cluster, the information will be repeated at the expense of presenting the reader with fresh information.  While this new summary would be useful to a reader who does not yet know anything about the topic, the returning reader will find the new summary to be of low value.

As an example of this problem, consider the two summaries presented below.  The summary on the left is a 20% extract of the earliest two articles available in a cluster, and the summary on the right is a 20% extract of the same cluster, now with two additional articles.  Italicized sentences denote those which appear in both summaries.

[1]  CNN.com - Plane hits skyscraper in Milan - April 18, 2002
[2]  CNNenEspanol.com A small plane has hit a skyscraper in central Milan, setting the top floors of the 30-story building on fire, an Italian journalist told CNN.
[3]  The crash by the Piper tourist plane into the 26th floor occurred at 5:50 p.m. (1450 GMT) on Thursday, said journalist Desideria Cavina.
[4]  Several storeys of the building were engulfed in fire, she said.
[5]  Italian TV says the crash put a hole in the 25th floor of the Pirelli building, and that smoke is pouring from the opening.
[6]  U.N. envoy horror at Jenin camp U.S. bombing kills Canadians Chinese missiles concern U.S. 2002 Cable News Network LP, LLLP.
[7]  The building houses government offices and is next to the city's central train station.

[1]  CNN.com - Plane hits skyscraper in Milan - April 18, 2002
[2]  The crash by the Piper tourist plane into the 26th floor occurred at 5:50 p.m. (1450 GMT) on Thursday, said journalist Desideria Cavina.
[3]  The building houses government offices and is next to the city's central train station.
[4]  Italian TV says the crash put a hole in the 25th floor of the Pirelli building, and that smoke is pouring from the opening.
[5]  U.N. envoy horror at Jenin camp U.S. bombing kills Canadians Chinese missiles concern U.S. 2002 Cable News Network LP, LLLP.
[6]  The Pirelli Building in Milan, Italy, was hit by a small plane.
[7]  (ABCNEWS.com)   8212; A small plane crashed into a skyscraper in downtown Milan today, setting several floors of the 30-story building on fire.
[8]  The plane crashed into the 25th floor of the Pirelli building in downtown Milan.
[9]  A small airplane crashed into a government building in heart of Milan, setting the top floors on fire, Italian police reported.
[10]  WITNESSES REPORTED hearing a loud explosion from the 30-story office building, which houses the administrative offices of the local Lombardy region and sits next to the city s central train
station.
[11]  Italian state television said the crash put a hole in the 25th floor of the Pirelli building.
[12]  CNNenEspanol.com A small plane has hit a skyscraper in central Milan, setting the top floors of the 30-story building on fire, an Italian journalist told CNN.


Notice that, despite the fact that around 80% of the first two articles and none of the second two articles have been seen yet, all but one of the sentences chosen for the first summary have been chosen for the second summary.  While the second summary might be a good extract for someone who has not yet seen any information on the story, it involves a lot of repetition for a read who has seen the first extract, and thus presents significantly less new information than a summary of this size could.

My goal was to add to MEAD some awareness of the fact that a returning reader has already seen some of the information that MEAD has available, thereby allowing MEAD to generate a summary which maximizes the total information presented to the user.

[top]

How does the novelty module work?

To implement an awareness of seen summaries, I created a new reranker module for the MEAD system, based on the default reranker module.

The new reranker takes the same input as all MEAD rerankers--a reranker-info file containing compression information (sentences vs. words, percent of total vs. absolute, and a percent or absolute number), cluster information (so that the reranker may inspect the sentences), and sentence scores, as computer by the classifier.  Rerankers also take command line options: the default reranker takes the name of a similarity function, a similarity cutoff, and an IDF database.  The new reranker takes these same options, followed by an optional user name, a demotion increment, and a second similarity function and cutoff.

The new reranker first eliminates overly redundant sentences within the new summary, using the first similarity function passed in on the command line.  At the time of writing, the only sim-function supported by MEAD is MEAD-cosine.  The reranker selects the highest-scored sentence for inclusion in the summary, then compares every lower-scored sentence in the sentence set to the selected sentence; any sentence more similar than the first passed-in cutoff is struck from the set.  This process is repeated for each sentence not already struck from the set.  In the default reranker, this only continues until enough unstruck sentences have been found to provide a summary of the desired size; since the novelty-concerned reranker has to be prepared to lose some of the internally-novel sentences upon comparison with seen summaries, it must continue until it has seen every sentence in the set.

If no user is passed in, or if the user named has not previously seen any summaries of this cluster, the novelty-detecting reranker simply selects the topmost sentences for inclusion in the summary until the desired size has been reached, since all of the sentences available are novel.  If a user is passed, and that user has seen previous summaries, then each of the archived summaries is opened and the seen sentences are compared to the sentences available for the new summary.  If a second similarity function and cutoff are not specified, the reranker will use the same ones specified for the first phase of reranking.  Each sentence in the available set is compared to each sentence that has already been seen.  Whenever the available sentence is found to be more similar to a seen sentence than the specified cutoff, the available sentence's score is demoted by the amount specified by the decrement amount.  If no decrement is provided, 0.1 will be used by default.  After all demotion has been done, the highest scored sentences are selected for inclusion in the summary.  After sentences are selected, an archive file is written out containing these sentences.

When the new reranker has finished processing the sentences, MEAD will continue as it would after any reranker.

[top]

How do I use this module?

In order to use this novelty-aware reranker, you must first install MEAD.  Download and install the latest version and read the documentation available on the MEAD homepage.

Once you have a functioning version of MEAD, download the new reranker: murphyr-reranker.tar.gz.  Unzip and place the file "murphyr-reranker.pl" into the bin directory of your MEAD installation.

MEAD can be instructed to use this reranker either with a "reranker" line in your .meadrc file or with the "-reranker" command line option.  Either way, you should provide the command line for the reranker in the form:
murphyr-reranker.pl simfunction_1 simcutoff_1 IDF [userid [decrement [simfunction_2 simcutoff_2]]]
Where simfunction_1, simcutoff_1, and IDF are the arguments passed to the first phase of the reranking (eliminating internal redundancy), userid specifies the user requesting a summary, decrement is the amount by which a non-novel sentence is to be demoted, and simfunction_2 and cutoff_2 are used (with the same IDF specified before) are used in the second phase (reducing redundancy across multiple summaries).

Examples:
murphyr-reranker.pl MEAD-cosine 0.7 enidf
will use the MEAD-cosine function to compare sentences within the summary, eliminating those with a cosine similarity of more than 0.7 with already included sentences, using the enidf database.  No comparison will be made with past summaries.
murphyr-reranker.pl MEAD-cosine 0.7 enidf Murph
will eliminate sentences within the summary as above, but will now test the sentences available against all of the sentences that the user Murph has already seen.  The comparison with already seen sentences will be made with the MEAD-cosine function, with sentences having a similarity to seen sentences higher than 0.7 being demoted by 0.1 points.
murphyr-reranker.pl MEAD-cosine 0.7 enidf Murph 0.3
will work as above, but demoting sentences similar to seen sentences by 0.3 points.
murphyr-reranker.pl MEAD-cosine 0.7 enidf Murph 0.3 MEAD-cosine 0.6
will work as above, but will demote sentences with similarity to seen sentences of 0.6 or higher.

Note that setting a demotion similarity cutoff of 1 will only demote those sentences which have already been seen, and not sentences which are very similar (for example, one has a CNN tagline and the other an MSNBC tagline at the beginning).  Setting a demotion similarity cutoff of 0 will demote all available sentences for each seen sentence, such that every available sentence will be demoted the same amount and the resulting summary will be identical to one with no demotion done.

[top]

What other materials are available on this site?

  1. Project proposal, 28 October 2002 (MS Word)
  2. Class presentation, 06 December 2002 (MS PowerPoint)
  3. Test data, including sample clusters, summaries, and evaluation data
[top]

What is the future of this project?

Several things could happen to this project after the course it was designed for has finished.  All of them are purely dependant on my free time or that of others, and are not included in the scope of the course project.
[top]