Automatic Keyword Extraction for Text Summarization: A Survey

in #text, 6 years ago

Hi! I continue my tour of Text Summarization techniques. This time the topic is keyword extraction, with this survey written by Santosh Kumar Bharti and Korra Sathya Babu.

Source: https://arxiv.org/abs/1704.03242

Automatic Keyword Extraction is fundamental to providing a good representation of a document for summarization.

It is not a descriptive survey that details and explains each technique. It is more of a web, giving a view of the current state of the research and citing several papers. I also like this approach: it is very neutral, and you can quickly pick what interests you and dive in. This post will be like the survey, minimalist on the descriptive side.

If you want a more descriptive post, you can have a look at this one: https://steemit.com/text/@boucaron/text-summarization-techniques-a-brief-survey

AA.png

The figure above shows the main categories of Automatic Keyword Extraction. The statistical approaches use non-linguistic features of the document, word occurrences for instance. The linguistic approaches extend the statistical ones with lexical, syntactic, and discourse analysis. Machine Learning can also be used: keyword extraction is then treated as a learning problem, which needs a dataset for training and can be tuned for a specific field or task.
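To make the statistical category concrete, here is a minimal sketch that ranks words purely by how often they occur. It is not taken from the survey: the function name `extract_keywords`, the tiny stop-word list and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for", "on", "it", "are"}


def extract_keywords(text, top_n=3):
    """Return the top_n most frequent words of text, ignoring stop words."""
    # Lowercase the text and keep alphabetic runs only: no linguistic analysis at all.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]


sample = ("Keyword extraction is the first step of summarization. "
          "Good keyword extraction gives a good summary.")
print(extract_keywords(sample, top_n=3))
```

On this sample the three selected words are the ones that occur twice ("keyword", "extraction", "good"). A linguistic or Machine Learning approach would replace the raw count with richer features.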

AB.png

This figure shows an overview of the different kinds of Text Summarization: single-document or multi-document; query-based, where only the subset of interest is extracted; extractive, which builds a summary from parts of the source text; abstractive, where the "idea" is extracted using linguistic methods to identify concepts and generate a short abstract; and supervised, based on training with datasets.

AC.png
AD.png

The previous figures show a taxonomy of the different summarization approaches. This earlier post describes in more detail the Statistical-based (TF-IDF), Graph-based (GPR) and Bayes Machine Learning based (NB) approaches: https://steemit.com/text/@boucaron/text-summarization-techniques-a-brief-survey. The Coherent-based approach uses words, lexical and grammatical information to establish the meaning. The Algebraic approach covers all the matrix-related techniques that try to produce a set of concepts from the text through different techniques: indexing, clustering, classification, and so on.
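As a reminder of what the statistical TF-IDF weighting looks like, here is a small pure-Python sketch. It uses the plain textbook variant (term frequency times log of inverse document frequency), which is an assumption and not necessarily the exact formula used in the papers the survey cites. The helper names `tokenize` and `tf_idf` and the toy documents are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: score} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        scores.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
scores = tf_idf(docs)
best_term = max(scores[0], key=scores[0].get)
print(best_term)
```

A word such as "the" that appears in every document gets a score of zero, while a word that is specific to one document ("mat" in the first one) gets the highest score there.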

AE.png

This last figure shows how the performance of text summarization can be evaluated with different metrics.
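This post does not name the metrics, so the sketch below is only an assumption about what such an evaluation can look like: it compares a candidate summary with a reference summary by counting shared words, in the spirit of unigram-overlap measures such as ROUGE-1. The function name `overlap_scores` and the two sentences are hypothetical.

```python
import re
from collections import Counter


def overlap_scores(candidate, reference):
    """Return (precision, recall, f1) based on word overlap between two texts."""
    cand = Counter(re.findall(r"[a-z]+", candidate.lower()))
    ref = Counter(re.findall(r"[a-z]+", reference.lower()))
    # Counter intersection keeps, for each word, the smaller of the two counts.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = overlap_scores("the cat sat", "the cat sat on the mat")
print(precision, recall, round(f1, 3))
```

Here every word of the candidate is in the reference (precision 1.0), but the candidate covers only half of the reference words (recall 0.5).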

Even if this survey is not descriptive enough and lacks many definitions, it provides a good big picture of the domain, which makes it easier to dig further.


I find your posts are very helpful and honest, thank you for sharing your information boucaron, following you.

Thank you, it means a lot to me.
