Text Summarization Techniques : A Brief Survey


Hi! Here is a short and interesting survey on text summarization. These techniques are very useful in daily life and work; I see them in many applications: search engines, semantic filters, presentation layers...

From Nice-to-Have to Effective Daily Use

Typically, when you do research you have to read many articles. Most come with a catchy abstract, but you still need a first quick read to decide whether to keep each one. If you keep it, you do a more complete read, and if it is really interesting you dig further. A good text summarizer can handle that first stage automatically, and combined with a crawler performing text mining it can push papers worth reading directly to you. You still need to feed the animal with examples of what you consider worth reading, but it can then help you based on your own criteria rather than on what Google thinks is worthwhile.

Easier to Integrate

What is very interesting about all the hype around Artificial Intelligence is the creation of stable frameworks and libraries that make it possible to develop and integrate such features. You do not need to understand the dirty details to use them, but it is important to have a little background on how they work.

Text Summarization Techniques: A Brief Survey

by Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, Krys Kochut

Source: https://arxiv.org/abs/1707.02268

This survey is about "extractive summarization": a summary is built from a subset of the sentences in the original text. This is done in 3 phases:

  • Build an intermediate representation of the text showing its main aspects.
  • Score sentences using such representation.
  • Build a summary from a subset of such sentences.
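The three phases above can be sketched in a few lines of Python. This is only a minimal illustration, not the survey's method: the naive sentence splitter, the use of raw word counts as the intermediate representation, and the names `split_sentences`, `tokenize` and `summarize` are all assumptions made for the example.

```python
import re
from collections import Counter

def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Lowercase words only; punctuation and digits are ignored.
    return re.findall(r"[a-z']+", sentence.lower())

def summarize(text, max_sentences=2):
    sentences = split_sentences(text)

    # Phase 1: intermediate representation (here: raw word counts).
    counts = Counter(w for s in sentences for w in tokenize(s))

    # Phase 2: score each sentence with that representation
    # (here: the average count of its words).
    def score(sentence):
        words = tokenize(sentence)
        return sum(counts[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)

    # Phase 3: keep the best sentences, restored to document order.
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]

if __name__ == "__main__":
    print(summarize("Cats sleep a lot. Cats eat fish. Dogs bark."))
```

Any of the representations described below could replace the word counts in phase 1 without changing phases 2 and 3.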

Intermediate Representation

There are 2 types of representation: topic-based and indicator-based. Topic-based representations include, for instance, frequency-driven approaches, topic words, latent semantic analysis and Bayesian topic models. Indicator-based representations describe each sentence as a set of features: length, position in the document...

Topic Representation

Explanatory words, called "topic signatures", are identified. The importance of a sentence is then computed either as the number of topic signatures it contains (which favors long sentences) or as the proportion of topic signatures in it (which measures density).
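To make the difference between the two scores concrete, here is a small sketch. It assumes the topic signatures are already known and passed in as a set; finding them is the hard part and is not shown. The function names and the example words are hypothetical.

```python
import re

def tokenize(sentence):
    return re.findall(r"[a-z']+", sentence.lower())

def signature_count(sentence, signatures):
    # Number of words that are topic signatures (repeats are counted).
    # Longer sentences tend to get higher values.
    return sum(1 for w in tokenize(sentence) if w in signatures)

def signature_density(sentence, signatures):
    # Proportion of the sentence's words that are topic signatures.
    words = tokenize(sentence)
    return signature_count(sentence, signatures) / len(words) if words else 0.0

if __name__ == "__main__":
    signatures = {"summarization", "sentence"}  # hand-picked for the example
    long_s = "Summarization scores a sentence and then another sentence from the long original text"
    short_s = "Summarization scores every sentence"
    for s in (long_s, short_s):
        print(signature_count(s, signatures), round(signature_density(s, signatures), 2))
```

The long sentence wins on the raw count, while the short one wins on density.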

Frequency-driven

The most common techniques are word probability and TF-IDF (Term Frequency-Inverse Document Frequency). Word probability is simply the number of occurrences of a word divided by the total number of words. When using word probability you need a list of stop words, whereas TF-IDF identifies such very common words by itself and gives them a low weight. TF-IDF is widely used. More advanced techniques are summarized in the survey.
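Both measures fit in a few lines. The sketch below uses one common TF-IDF variant (relative term frequency multiplied by log(N / document frequency)); other weighting variants exist, and the survey does not prescribe this exact formula. The function names are illustrative.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def word_probabilities(text):
    # Occurrences of each word divided by the total number of words.
    words = tokenize(text)
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()}

def tf_idf(documents):
    """Return one {word: weight} dict per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(w for words in tokenized for w in set(words))
    weights = []
    for words in tokenized:
        tf = Counter(words)
        weights.append({
            w: (c / len(words)) * math.log(n_docs / doc_freq[w])
            for w, c in tf.items()
        })
    return weights

if __name__ == "__main__":
    docs = ["the cat sat", "the dog ran", "the bird flew"]
    print(word_probabilities("the cat and the dog"))
    print(tf_idf(docs)[0])
```

In this example "the" appears in every document, so its TF-IDF weight is zero without any stop-word list, while "cat" keeps a positive weight.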

Latent Semantic Analysis

LSA is an unsupervised technique that extracts a representation of the semantics of the text from the words it contains. First it builds a matrix of words by sentences. The weight of each word is computed using TF-IDF; when a word is not in a sentence, its weight is set to 0. This matrix is then decomposed into a few matrices (using singular value decomposition) that make it possible to choose sentences for each topic. The technique has a few drawbacks and extensions cited in the survey.
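The idea can be illustrated with a deliberately simplified sketch. It departs from the description above in three labeled ways: it uses raw counts instead of TF-IDF weights, it recovers only the dominant topic, and it uses power iteration on AᵀA rather than a full singular value decomposition (a library such as NumPy would normally be used for that). All function names are hypothetical.

```python
import re

def tokenize(sentence):
    return re.findall(r"[a-z']+", sentence.lower())

def term_sentence_matrix(sentences):
    vocab = sorted({w for s in sentences for w in tokenize(s)})
    # Rows are words, columns are sentences; absent words stay at 0.
    # Simplification: raw counts instead of TF-IDF weights.
    return [[tokenize(s).count(w) for s in sentences] for w in vocab]

def dominant_topic_weights(matrix, iterations=100):
    """Power iteration on A^T A: approximates the first right singular
    vector of A, i.e. one weight per sentence for the dominant topic."""
    n = len(matrix[0])
    v = [1.0] * n
    for _ in range(iterations):
        # u = A v (one value per word), then v = A^T u (one value per sentence).
        u = [sum(row[j] * v[j] for j in range(n)) for row in matrix]
        v = [sum(matrix[i][j] * u[i] for i in range(len(matrix))) for j in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        if norm == 0:
            return v
        v = [x / norm for x in v]
    return v

def top_sentence(sentences):
    # Index of the sentence with the largest weight in the dominant topic.
    weights = dominant_topic_weights(term_sentence_matrix(sentences))
    return max(range(len(sentences)), key=lambda j: weights[j])

if __name__ == "__main__":
    sentences = ["cats eat fish", "cats eat mice",
                 "cats eat fish and mice", "stocks fell"]
    print(sentences[top_sentence(sentences)])
```

A full LSA summarizer would repeat the selection for each of the leading topics rather than stopping at the first one.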

Bayesian

Bayesian topic models represent conditional probabilities, which matters because sentences are typically not independent of each other. There are different techniques, Latent Dirichlet Allocation in particular, that use probability distributions to achieve higher-quality summarization than the previous techniques.

Indicator Representation

Indicators are used to model the text as a set of features instead of topics. There are graph-based methods and machine learning based methods.
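As a small illustration of what "a set of features" can look like, the sketch below computes two indicators for each sentence: its length in words and its relative position in the document. The feature names and the 0-to-1 position scale are assumptions for the example; real systems use many more indicators.

```python
def sentence_features(sentences):
    """Return one feature dict per sentence."""
    n = len(sentences)
    features = []
    for i, sentence in enumerate(sentences):
        features.append({
            # Length in whitespace-separated words.
            "length": len(sentence.split()),
            # Relative position: 0.0 for the first sentence, 1.0 for the last.
            "position": i / (n - 1) if n > 1 else 0.0,
        })
    return features

if __name__ == "__main__":
    for f in sentence_features(["One two three.", "Four five.", "Six."]):
        print(f)
```

Such feature vectors are what the graph and machine learning methods below work with, instead of topics.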

Graph Methods

Graph methods are influenced by the well-known Google PageRank algorithm and represent the text as a graph. Sentences are the vertices, and edges carry the similarity between sentences. Sub-graphs are topics. Important sentences are the ones having many edges, the "central" ones. A main limitation is that this approach does not take syntactic and semantic information into account, which is why Google is so interested in pushing Machine Learning forward.
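Here is a minimal sketch of this idea, in the spirit of PageRank rather than a faithful copy of any published system. The choice of cosine similarity over word counts as edge weight, the damping factor of 0.85 and the fixed number of iterations are all assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    return re.findall(r"[a-z']+", sentence.lower())

def cosine(a, b):
    # Cosine similarity between two word-count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_sentences(sentences, damping=0.85, iterations=50):
    vectors = [Counter(tokenize(s)) for s in sentences]
    n = len(sentences)
    # Edge weight = similarity between two different sentences.
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on its score in proportion to the edge weight.
            incoming = sum(scores[j] * sim[j][i] / out_weight[j]
                           for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

if __name__ == "__main__":
    sentences = ["the cat sat on the mat", "the cat ate the fish",
                 "a cat sat near the fish", "stock prices fell sharply"]
    for score, s in sorted(zip(rank_sentences(sentences), sentences), reverse=True):
        print(round(score, 3), s)
```

The last sentence shares no word with the others, so it has no edges and ends up with the lowest score.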

Machine Learning Methods

Most Machine Learning approaches treat summarization as a classification problem. There are supervised and semi-supervised learning methods, which are discussed in the survey. Such techniques are very successful and allow finer tuning according to the kind of summary expected.
