9 Natural Language Generation, Summarization, & Translation
In this chapter, we overview three tasks that involve the production of well-formed natural language: text generation, text summarization, and automated translation of natural language. These three tasks all focus on the synthesis of language from either structured or unstructured content, rather than the analysis of language, which we considered in prior chapters. Some applications that use generation are: creating reports and letters, presenting information retrieval results as snippets, and writing captions for tables or images. (An example of report generation would be describing the results of a database query when the number of results is large. For example, one might present the results of a query of “cats” on the online database Wikidata as “There are between 140 and 150 different cats mentioned in Wikidata; a few examples are Meuzza, Orangey, and Mrs. Chippy.”) Summarization and translation both involve sequence to sequence transduction, which means that the input and output are both sequences. Text summarization starts with a long sequence of words and produces a shorter one; machine translation starts with sequences of words in one language and creates sequences with the same general meaning in another language. We will consider each of these three tasks, after we first consider a more detailed example of a real-world application of natural language generation.
9.1 An Application of Natural Language Generation
Recommender systems are programs that help people decide among several choices (e.g., different models or brands of an unfamiliar product)[1]. Suppose that we wish to create a system that will provide a summary of comments from a range of customers, rather than the results of our own testing. To get the comments, we could extract product reviews from an online marketplace, which typically include text, an overall rating, and sometimes a rating of “usefulness” provided by other shoppers. The output might include a well-formed paragraph describing what features purchasers liked or disliked. The objective would be that these automatically written reviews be as informative and fluid as reviews published in professional publications, such as “Tom’s Guide” or “Consumer Reports”.
To see what professionals write, consider the example in Figure 9.1, which includes sentences from the opening paragraphs of a review written for “Tom’s Guide”[2] based on the author’s personal opinion and testing of the “Roborock S4 Max”, an autonomous vacuum that entered the market in late 2020. The review starts with engaging generalities, mentions some specific features, and then continues afterwards with several labelled subsections on: price and availability; design; cleaning performance; setup, map, app, and mapping; and verdict. These subsections include both text and tables.
While there are plenty of budget-busting robot vacuums ready to do your bidding, finding one like the Roborock S4 Max, which combines performance and affordability, is rare. It gets the job done smartly and efficiently – without cleaning out your wallet. … In our Roborock S4 Max review, we found a vacuum that works well and has useful, modern features. With fast mapping, single room cleaning, and automatic carpet detection, the $429 S4 Max strikes the right balance of performance, features, and cost. All of that has earned a spot at the top of our best robot vacuums list.
At the same time, if you had searched online, you would have found customer reviews as shown in Figure 9.2. In the positive review, the features mentioned were: mopping, camera based object avoidance, (quality of) cleaning, WiFi setup, laser navigation, battery life, (degree of) quiet, (speed of) mopping, (accuracy of) map, virtual walls, no-go function. In the negative review, the features mentioned were: (accuracy of) map, (quality of) cleaning (expressed as “There was a lot of debris left after two cycles on max mode”), (quality of) suction, (accuracy of) mapping, expressed as “It is currently in my master bathroom running into the cabinet although it was set to clean the kitchen”; and (quality of) object avoidance (expressed as “running into walls” and “stuck under the dishwasher”).
[Figure 9.2: Negative review (1 star) and positive review (5 star). Review content not reproduced.]
A general approach to creating a summary, similar to past approaches[3],[4],[5], would be to first extract the features of the product that each review mentions, along with the value reported for each feature, which might be yes-no, qualitative (fast, quiet), or quantitative (150 minutes). Then, the system could collect the information for each feature across the set of reviews and either generate a sentence for each feature or create sentences for all the liked features and for all the disliked features.
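The collect-then-generate step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the extraction step is not shown, the `reviews` data is hypothetical, and reducing each opinion to a +1 or -1 vote is a simplification that is not taken from the cited systems.

```python
from collections import defaultdict

# Hypothetical output of a feature extractor: (feature, polarity) pairs per review.
# Polarity is +1 for liked and -1 for disliked; the extraction itself is not shown.
reviews = [
    [("cleaning", +1), ("battery life", +1), ("map", +1)],
    [("cleaning", -1), ("suction", -1), ("map", -1)],
    [("cleaning", +1), ("battery life", +1)],
]

def aggregate(reviews):
    """Sum the polarity votes for each feature across all reviews."""
    totals = defaultdict(int)
    for review in reviews:
        for feature, polarity in review:
            totals[feature] += polarity
    return dict(totals)

def join_list(items):
    """Join items as 'a', 'a and b', or 'a, b, and c'."""
    if len(items) <= 1:
        return "".join(items)
    if len(items) == 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]

def summarize(reviews):
    """Generate one sentence for the liked features and one for the disliked ones."""
    totals = aggregate(reviews)
    liked = sorted(f for f, score in totals.items() if score > 0)
    disliked = sorted(f for f, score in totals.items() if score < 0)
    sentences = []
    if liked:
        sentences.append(f"Purchasers generally liked the {join_list(liked)}.")
    if disliked:
        sentences.append(f"Purchasers generally disliked the {join_list(disliked)}.")
    return " ".join(sentences)

print(summarize(reviews))
```

Features with a net score of zero (here, the map) are left out of both sentences; a fuller system might report them as contested instead.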
Some recent automated approaches to creating aggregated reviews do not produce actual text; that is, they do not summarize or generate text, because they treat the task as more of a classification problem. For example, one approach, called Abstractive Opinion Tagging[6], analyzed reviews of Hot-Pot restaurants to produce a ranked list of the top five items as: [hospitable service (223), delicious food (165), value for money (104), comfortable environment (65), served quickly (14)]. The tag “delicious food” would apply to sentences with phrases like “I was pleasantly surprised about how yummy the dish and the lamb were”, “The shrimp was fresh and the pork mixture was tasty”, and “Food is delicious”. While tagging features is useful for supporting certain kinds of searches, it does not address natural language generation or summarization.
9.2 Natural Language Generation
The task of a natural language generation (NLG) system is to create a text that will achieve a specified communicative goal. There are three steps to this: first, deciding what to say at the conceptual level (which maps a broad goal, like “respond” onto specific subgoals); second, deciding how to organize the information into sentences; and third, creating output as a sequence of words. These tasks are known as content selection, sentence planning, and realization, respectively. Some natural language generation tasks do not require planning or realization, because the target output is mostly fixed, and thus selecting the output form can be handled as a classification task. The goal of separating sentence planning and realization as a general service is to minimize the amount of linguistic knowledge that systems must encapsulate, so that they can focus on manipulating information at the task level.
Content selection involves creating a description of the content to be expressed, and possibly also the reason for expressing it, such as “to support quantitative comparison” (e.g., to allow one to compare the size or cost of two alternatives)[7],[8]. Selecting content is normally a function within an application. To make it easier to use standardized components for the other steps, however, applications might represent the selected content in a standardized form, such as a set of relational triples, a record structure with multiple slots, or a table with labelled columns.
Sentence planning involves grouping content into a non-redundant set of sentences that will be easy to understand. (It is not a good idea to put every concept in a separate sentence because it becomes unnaturally repetitive.) Thus, in sentence planning, a system should keep track of what has already been conveyed and select appropriate referring expressions, including pronouns and shortened descriptions. Coherence is improved by adding cue phrases and discourse markers to indicate discourse relations or rhetorical structure (as discussed in Chapter 7). Cue phrases are expressions like “for example” or “The second phase”. Sentence planning might also determine that entire clauses should be excluded because they are already part of the context of the interaction. To leave them in would create the mistaken inference that it was new information or that the speaker believes that the hearer has some defect in their hearing or understanding.
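As a rough illustration of sentence planning, the sketch below groups content that shares a subject into a single sentence and switches to a pronoun once the subject has been mentioned. The triples, the function names, and the naive realizer are all hypothetical and chosen only for illustration; a real planner would also handle cue phrases, gender and number of pronouns, and discourse context.

```python
# Content selected by the application, as (subject, relation, value) triples.
# The triples and the relation wording are hypothetical illustrations.
triples = [
    ("the S4 Max", "has", "fast mapping"),
    ("the S4 Max", "has", "automatic carpet detection"),
    ("the S4 Max", "costs", "$429"),
]

def plan_sentences(triples):
    """Group triples that share a subject and relation, and use a pronoun
    for the subject once it has already been mentioned."""
    grouped = {}  # (subject, relation) -> list of values; dicts keep insertion order
    for subject, relation, value in triples:
        grouped.setdefault((subject, relation), []).append(value)

    mentioned = set()
    plans = []
    for (subject, relation), values in grouped.items():
        referring = "it" if subject in mentioned else subject
        mentioned.add(subject)
        plans.append({"subject": referring, "relation": relation, "values": values})
    return plans

def realize(plan):
    """A deliberately naive realizer, just enough to show the effect of the plan."""
    text = f"{plan['subject']} {plan['relation']} {' and '.join(plan['values'])}."
    return text[0].upper() + text[1:]

print(" ".join(realize(p) for p in plan_sentences(triples)))
```

Three triples become two sentences rather than three, which avoids the repetitive effect of putting every concept in its own sentence.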
Sentence realization produces well-formed sequences of words in a target language. The input will be a sentence plan, which might be a semantic representation or a list of slot-filler pairs. There are several ways the plan might be mapped onto text. The simplest approach is to use canned text, which is any text that has been entirely pre-written, and to provide the mapping explicitly, using a form or table. This approach is how most chatbots and IVR systems produce their output. (Canned text can supply a broader range of outputs, but only by hand-coding collections of alternative sentences that achieve the same intent.) Rule-based approaches can specify patterns that include designated variables that are instantiated from a database or discourse context. A few dialog frameworks support these outputs, and automated form letter generators have always worked this way. When these patterns are more sophisticated, they are called templates, and may include functions for assuring that the sentences are all grammatical without forcing the application to know all derivations of a root form[9],[10]. The most flexible approach to realization uses a fine-grained grammar to do realization, similar to reversing the action of a rule-based parser, typically one that relies on feature unification. Unification, with a grammar that includes precise specification of grammatical features, would be best for offline applications where output quality is more important than speed, such as professional reports[11].
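The difference between a plain pattern and a template with grammatical helper functions can be sketched as follows. This is a minimal illustration, not the design of any of the cited template systems: the helper names are hypothetical, the pluralization rule covers regular nouns only, and the count of 145 is an invented value used for the example.

```python
def pluralize(noun, count):
    """Return the noun form that agrees with count (regular nouns only)."""
    return noun if count == 1 else noun + "s"

def be_verb(count):
    """Return the form of 'to be' that agrees with count."""
    return "is" if count == 1 else "are"

def realize_count(template, noun, count, source):
    """Fill the template's slots, letting the helpers handle agreement."""
    return template.format(
        verb=be_verb(count),
        count=count,
        noun=pluralize(noun, count),
        source=source,
    )

TEMPLATE = "There {verb} {count} {noun} mentioned in {source}."

print(realize_count(TEMPLATE, "cat", 1, "Wikidata"))
print(realize_count(TEMPLATE, "cat", 145, "Wikidata"))
```

The application supplies only the root form of the noun and a number; the agreement between verb, count, and noun is handled inside the template functions.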
9.3 Text Summarization
Summarization maps an input text to a shorter one that preserves its essential meaning, where what is essential may depend on the task. For example, for information retrieval, one might want to make it clear to the user how a document uses the keywords in the query, so that they will understand the relevance. Other applications of summarization include automatically providing an abstract of a particular length for a website or providing a summary of news stories gathered across multiple documents (such as news.google.com). Summarization is most often extractive, which means that the summaries comprise selected complete sentences from the original. The alternative is abstractive summarization, which means the summaries comprise entirely new sentences that express the desired content, which would be akin to translation, where the source and target languages are the same.
Traditional methods for extractive summarization traverse the entire text and rank each sentence using a hand-built scoring function. One popular choice first computes the tf-idf score for each word (as used for the Vector Space Model discussed in Chapter 8) and keeps the words whose scores exceed a given threshold. These words and their scores form a vector called the “centroid”, which represents an average over all the sentences of the unit, and each sentence is then scored by its similarity to the centroid. A similar, but more sophisticated, idea is to score each sentence based on its similarity to a semantic vector computed for the entire document, treated as a single “sentence”. These vectors are known as “universal sentence embeddings”[12]. Figure 9.3 shows the original text of a journal abstract by de Wilde et al. (2018)[13], along with extractive summaries by two systems, provided online by SMRZR.io and DeepAI.org. (The highlighting shows which sentences each summarizer selected.) The summary on the top right was produced by the tool SMRZR.io, which reports using a technique based on the BERT deep learning architecture. BERT-based summarizers create an embedding for the entire document, and then sentences are compared against this vector. The summary on the bottom right in Figure 9.3 was produced by DeepAI.org. (Unfortunately, no information is provided about their approach.) This summary is reasonable; it is less readable than the other as a summary, but includes more technical content. Unsupervised approaches can also be trained to select a set of semantically related sentences (to form a more cohesive text)[14]. An alternative to such unsupervised approaches would be to train a supervised machine learning model using data where each sentence is labeled with the class INCLUDED or NOT-INCLUDED[15], but few such data sets exist.
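[Figure 9.3: Journal abstract by de Wilde et al. (2018), with extractive summaries by SMRZR.io (top right) and DeepAI.org (bottom right). Figure content not reproduced.]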
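The centroid-based scoring described above can be sketched in plain Python. This is one possible reading of the method rather than a reference implementation: each sentence is treated as a “document” for the idf computation, the threshold value is arbitrary, and the tokenizer and function names are assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())

def centroid_summary(sentences, threshold=0.1, k=1):
    """Return the k sentences most similar to the tf-idf centroid, in original order."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    # Document frequency: the number of sentences that contain each word.
    df = Counter(word for doc in docs for word in doc)
    idf = {word: math.log(n / df[word]) for word in df}

    # Centroid: average tf-idf of each word over all sentences,
    # keeping only the words whose average exceeds the threshold.
    centroid = {}
    for word in df:
        avg = sum(doc[word] * idf[word] for doc in docs) / n
        if avg > threshold:
            centroid[word] = avg
    centroid_norm = math.sqrt(sum(v * v for v in centroid.values()))

    def cosine(doc):
        """Cosine similarity between a sentence's tf-idf vector and the centroid."""
        vec = {w: c * idf[w] for w, c in doc.items()}
        dot = sum(v * centroid.get(w, 0.0) for w, v in vec.items())
        norm = math.sqrt(sum(v * v for v in vec.values())) * centroid_norm
        return dot / norm if norm else 0.0

    ranked = sorted(range(n), key=lambda i: cosine(docs[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]

sentences = [
    "Robot vacuums clean floors.",
    "Some robot vacuums also mop floors.",
    "Battery life varies by model.",
    "Maps help robot vacuums plan routes.",
]
print(centroid_summary(sentences, k=2))
```

On very short inputs such as this one, rare words receive high idf weights and can dominate the ranking, so the sketch is more informative on documents with many sentences.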
An abstractive approach to summarization might identify and rank concepts rather than sentences (for example, by mapping a text onto a set of relational triples) and then use a standard natural language generation pipeline. (This approach treats summarization as akin to machine translation, where the source and target just happen to be the same language.) Of the two, extractive summarization has been the more commonly used, because it is easier to do. However, there has been increasing interest in abstractive summarization, especially by applying recent work on semantic textual similarity.
Other promising approaches to summarization, which could be either extractive or abstractive, make use of ranking methods developed for information retrieval, such as TextRank to select relevant sentences or concepts[16], [17].
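The graph-ranking idea behind TextRank can be sketched as follows: sentences are nodes, edge weights reflect how similar two sentences are, and a PageRank-style iteration gives higher scores to sentences that are similar to many other well-connected sentences. Note the assumptions: this sketch uses Jaccard word overlap as the similarity, which is a simplification and not the similarity function of the original paper, and the damping factor and iteration count are conventional defaults rather than tuned values.

```python
import re

def tokenize(sentence):
    """Return the set of lowercase alphabetic tokens in the sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by weighted PageRank over a sentence-similarity graph."""
    words = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Weighted adjacency matrix (Jaccard overlap); no self-loops.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                union = words[i] | words[j]
                if union:
                    weights[i][j] = len(words[i] & words[j]) / len(union)
    out_sum = [sum(row) for row in weights]

    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on its score in proportion to the edge weight.
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores
    return scores

sentences = [
    "The cat sleeps on the mat.",
    "The cat stands on the mat.",
    "Stock prices fell sharply today.",
]
for sentence, score in zip(sentences, textrank(sentences)):
    print(f"{score:.2f}  {sentence}")
```

An extractive summarizer would then select the highest-scoring sentences; the third sentence here shares no words with the others and so receives only the baseline score.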
The quality of a text summary can be evaluated in several ways[18]. The expectation is that the summary will be similar to the reference document from which it was created, so measures of general document similarity used in information retrieval (such as cosine similarity or BM25) are an option. Other methods count topics, which are sets of co-occurring words derived using clustering algorithms, such as Latent Semantic Analysis. Another method, created specifically for summarization and translation, is known as ROUGE, for “Recall-Oriented Understudy for Gisting Evaluation”[19]. The ROUGE method counts and compares surface units, such as the number of overlapping n-grams, between summaries created automatically and summaries created by human experts. It thus requires a data set that includes expert summaries. Going forward, new methods are being devised that make use of meaning representations created via machine learning to measure semantic similarity between the summary and either the reference text or a hand-built summary[20].
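The n-gram counting at the core of ROUGE can be illustrated with a short sketch of ROUGE-N recall against a single expert summary. This is a simplified illustration, not the ROUGE package itself: it uses whitespace tokenization, one reference, and no stemming or stopword handling, and the function names are chosen for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of the reference's n-grams that also appear in the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match so a repeated n-gram is not counted more often than it occurs.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

reference = "the cat is sleeping on the mat"   # expert summary
candidate = "the cat sleeps on the mat"        # automatic summary
print(rouge_n_recall(candidate, reference, n=1))  # 5 of 7 unigrams matched
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 6 bigrams matched
```

Because the measure is recall-oriented, a candidate that simply copies more of the source text can score higher, which is one reason summary length is usually constrained during evaluation.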
9.4 Machine Translation
Machine translation systems translate text from a source language into one or more target languages, or assist human translators in that task (known as machine-aided translation). The primary goal is to preserve the meaning of the original while observing the language conventions of the target language. Literary translation systems may have the added goal of preserving stylistic aspects of the original, including preserving the intended effects (such as amusement or suspense)[21], [22]. While the idea of machine translation is almost as old as computer science itself, it did not become practical until the development of methods based on statistical language modelling[23].
The standard current approach for developing large-scale machine translation systems is to train paired language models that map syntactic structures (phrases) from a source language onto a single target language, based on a collection of translated texts. These paired multilingual datasets form what are known as parallel corpora[24]. An example of translated sentence pairs is shown below in Figure 9.4, where translations from English to French were created using the online version of Google Translate[25]. From this collection, a model might learn to translate some words and phrases correctly, e.g. “the cat” -> “le chat”, “the mat” -> “le tapis”, “on” -> “sur”, “under” -> “sous”. It likely could not correctly learn “is sleeping”, “is standing”, or “stands” because of the variation in the words used for these expressions.
English | French |
The cat is sleeping on the mat. | Le chat dort sur le tapis. |
The cat is standing on the mat. | Le chat est debout sur le tapis. |
The cat stands on the mat. | Le chat se tient sur le tapis. |
On the mat, the cat is standing. | Sur le tapis, le chat est debout. |
The cat is sleeping under the bed. | Le chat dort sous le lit. |
Under the bed, the cat is sleeping. | Sous le lit, le chat dort. |
Some of the most recent machine translation modelling approaches employ neural networks[26], although methods based on phrase-based statistical modelling still outperform them for some language pairs[27]. With a trained model associating phrases from the two languages, an efficient search algorithm or classifier can find the highest-probability translation by combining phrases previously seen in the target language.
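The search for a highest-probability translation can be sketched with a toy phrase table and dynamic programming. Everything here is a hypothetical illustration: the table entries echo the phrase pairs mentioned earlier, the probabilities are invented, and the decoder is monotone (it never reorders phrases) and uses no target-side language model, both of which a real phrase-based system would include.

```python
import math

# Hypothetical phrase table: source phrase -> list of (target phrase, probability).
PHRASE_TABLE = {
    "the cat": [("le chat", 0.9)],
    "the mat": [("le tapis", 0.8)],
    "the": [("le", 0.5), ("la", 0.4)],
    "cat": [("chat", 0.9)],
    "is sleeping": [("dort", 0.7)],
    "on": [("sur", 0.8)],
    "under": [("sous", 0.9)],
}

def translate(sentence, table=PHRASE_TABLE, max_len=3):
    """Monotone phrase-based decoding by dynamic programming.

    best[i] holds the highest log-probability translation of the first i words.
    Returns None if the table cannot cover the whole sentence.
    """
    words = sentence.lower().split()
    best = [(0.0, [])] + [None] * len(words)
    for end in range(1, len(words) + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] is None:
                continue  # no translation covers the first `start` words
            phrase = " ".join(words[start:end])
            for target, prob in table.get(phrase, []):
                score = best[start][0] + math.log(prob)
                if best[end] is None or score > best[end][0]:
                    best[end] = (score, best[start][1] + [target])
    return " ".join(best[-1][1]) if best[-1] else None

print(translate("the cat is sleeping on the mat"))
```

The decoder prefers the two-word entry “the cat” over translating “the” and “cat” separately because the single phrase has the higher probability, which is the main advantage of phrase-level over word-level models.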
Earlier approaches to machine translation based on rules sometimes translated from a single source to multiple targets at once (multilingual translation) by means of an intermediate representation called an interlingua. This approach has been used in commercial settings where translation into hundreds of target languages must be completed quickly and the domain is rather small (e.g., repair manuals for farm machinery). Use of an interlingua to support neural network-based multilingual translation has shown both promise and challenges[28],[29].
Evaluation of machine translation systems most often uses the same metrics that are used for evaluating natural language generation, such as ROUGE or BLEU. One metric developed specifically for translation is called METEOR, which its originators describe as follows:
[It scores] “machine translation hypotheses by aligning them to one or more reference translations. Alignments are based on exact, stem, synonym, and paraphrase matches between words and phrases. Segment and system level metric scores are calculated based on the alignments between hypothesis-reference pairs. The metric includes several free parameters that are tuned to emulate various human judgment tasks” [30].
A version of the Meteor scoring metric has been implemented in NLTK V3.5[31].
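METEOR's stem, synonym, and paraphrase matching depends on external language resources, so the simpler n-gram idea behind BLEU, mentioned above, is easier to sketch in a self-contained way. The code below is a simplified sentence-level variant under stated assumptions: a single reference, whitespace tokenization, n-grams up to length two, uniform weights, and no smoothing. Standard toolkits differ in these details, so scores from this sketch are not comparable to published BLEU scores.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of the modified
    n-gram precisions, multiplied by a brevity penalty."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Penalize candidates that are shorter than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "le chat dort sur le tapis"
candidate = "le chat est sur le tapis"
print(round(bleu(candidate, reference), 4))
```

For this pair, five of six unigrams and three of five bigrams match, and the lengths are equal, so the score is the square root of 0.5. Unlike ROUGE recall, the counts are normalized by the length of the candidate, and the brevity penalty compensates for the resulting bias toward short outputs.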
Commercial providers (such as Google) provide APIs that support high-quality translation for a wide range of language pairs. For less common languages and dialects or specialized domains, open source tools, such as MOSES, are available to create machine translation systems by training them with a parallel corpus[32]. MOSES uses a statistical approach. There are several open source toolkits for creating neural network based machine translation systems, including OpenNMT[33], Sockeye[34] and MarianNMT[35].
9.5 Summary
This chapter considered three related tasks: text generation, text summarization, and automated translation of natural language text. All three involve the output of well-formed natural language, rather than just the analysis of language data. The output for each of these tasks can be specified as a set of concepts using a representation of meaning, or as a sequence-to-sequence operation that will maximize some objective function, such as semantic similarity, while at the same time creating output in the target language and at the target length. The lack of parallel input-output data sets has led to many approaches that rely on hand-built rules. Some systems are experimenting with machine learning based methods, where datasets might already exist (e.g., because of government requirements to create documents in multiple languages) or can be created using crowdsourcing.
[1] Resnick, P., and Varian, H. R. (1997). Recommender Systems. Communications of the ACM, 40(3), 56-58.
[2] McDonough, M. (2020). Roborock S4 Max robot vacuum review, November 18, 2020, Tom’s Guide, Future US, Inc. URL: https://www.tomsguide.com/reviews/roborock-s4-max-robot-vacuum Accessed Jan 2021
[3] Carenini, G., Ng, R., and Pauls, A. (2006). Multi-document Summarization of Evaluative Text. In Proceedings of EACL 2006.
[4] Tata, S. and Di Eugenio, B. (2010). Generating Fine-Grained Reviews of Songs from Album Reviews. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10). Association for Computational Linguistics, USA, pp. 1376–1385.
[5] Gerani, S., Mehdad, Y., Carenini, G., Ng, R., and Nejat, B. (2014). Abstractive Summarization of Product Reviews Using Discourse Structure. In Proceedings of EMNLP 2014. 1602–1613.
[6] Li, Q., Li, P., Li, X., Ren, Z., Chen, Z., & de Rijke, M. (2021). Abstractive Opinion Tagging. In Proceedings of the Fourteenth ACM International Conference on Web Search and Data Mining (WSDM ’21), March 8–12, 2021, Virtual Event, Israel. ACM, New York, NY, USA, 9 pages. Link to a github site with their review data: https://github.com/qtli/AOT.
[7] Mittal, V.O., Moore, J.D., Carenini, G. and Roth, S. (1998). Describing Complex Charts in Natural Language: A Caption Generation System. Computational Linguistics, 24(3), pp.431-467.
[8] Baaj, I., Poli, J.P. and Ouerdane, W. (2019). Some Insights Towards a Unified Semantic Representation of Explanation for eXplainable Artificial Intelligence. In Proceedings of the 1st Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence (NL4XAI 2019), pp. 14-19.
[9] McRoy, S.W., Channarukul, S. and Ali, S.S. (2000). YAG: A Template-Based Generator for Real-Time Systems. In Proceedings of the First International Conference on Natural Language Generation. pp. 264-267.
[10] Channarukul, S., McRoy, S.W. and Ali, S.S. (2001). YAG: A Template-Based Text Realization System for Dialog. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 9(06), pp.649-659.
[11] Elhadad, M. and Robin, J. (1997). SURGE: A Comprehensive Plug-in Syntactic Realization Component for Text Generation. Computational Linguistics, 99(4).
[12] Lamsiyah, S., El Mahdaouy, A., El Alaoui, S.O. and Espinasse, B. (2019). A Supervised Method for Extractive Single Document Summarization based on Sentence Embeddings and Neural Networks. In Proceedings of the International Conference on Advanced Intelligent Systems for Sustainable Development, pp. 75-88. Springer, Cham.
[13] de Wilde, A. H., Snijder, E. J., Kikkert, M., & van Hemert, M. J. (2018). Host Factors in Coronavirus Replication. Current Topics in Microbiology and Immunology, 419, 1–42. https://doi.org/10.1007/82_2017_25
[14] Joshi, A., Fidalgo, E., Alegre, E. and Fernández-Robles, L. (2019). SummCoder: An Unsupervised Framework for Extractive Text Summarization based on Deep Auto-Encoders. Expert Systems with Applications, 129, pp.200-215.
[15] Yao, J.G., Wan, X. and Xiao, J. (2017). Recent Advances in Document Summarization. Knowledge and Information Systems, 53(2), pp.297-336.
[16] Mihalcea, R., and Tarau, P. (2004). Textrank: Bringing Order into Text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (pp. 404-411).
[17] Wang, W. M., See-To, E. W. K., Lin, H. T., and Li, Z. (2018, October). Comparison of Automatic Extraction of Research Highlights and Abstracts of Journal Articles. In Proceedings of the 2nd International Conference on Computer Science and Application Engineering (pp. 1-5).
[18] Steinberger, J. and Ježek, K. (2012). Evaluation Measures for Text Summarization. Computing and Informatics, 28(2), pp.251-275.
[19] Lin, C.Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pp. 74-81.
[20] Campr, M. and Ježek, K. (2015). Comparing Semantic Models for Evaluating Automatic Document Summarization. In Proceedings of the International Conference on Text, Speech, and Dialogue, pp. 252-260. Springer, Cham.
[21] Toral, A., and Way, A. (2015). Machine-Assisted Translation of Literary text: A Case Study. Translation Spaces, 4(2), 240-267.
[22] Toral, A., and Way, A. (2018). What level of quality can neural machine translation attain on literary text?. In Translation Quality Assessment (pp. 263-287). Springer, Cham.
[23] Brown, P.F., Cocke, J., Della Pietra, S.A., Della Pietra, V.J., Jelinek, F., Lafferty, J., Mercer, R.L. and Roossin, P.S. (1990). A Statistical Approach to Machine Translation. Computational Linguistics, 16(2), pp.79-85.
[24] Williams, P., Sennrich, R., Post, M. and Koehn, P. (2016). Syntax-Based Statistical Machine Translation. Synthesis Lectures on Human Language Technologies, 9(4), pp.1-208.
[25] Google.com (2021) Google Translate website URL: https://translate.google.com/ Accessed July 2021.
[26] Bahdanau, D., Cho, K. and Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.
[27] Sánchez-Martínez, F., Pérez-Ortiz, J.A. and Carrasco, R.C. (2020). Learning Synchronous Context-Free Grammars with Multiple Specialised Non-terminals for Hierarchical Phrase-Based Translation. arXiv preprint arXiv:2004.01422.
[28] Escolano, C., Costa-jussà, M.R. and Fonollosa, J.A. (2019). Towards Interlingua Neural Machine Translation. arXiv preprint arXiv:1905.06831.
[29] Arivazhagan, N., Bapna, A., Firat, O., Lepikhin, D., Johnson, M., Krikun, M., Chen, M.X., Cao, Y., Foster, G., Cherry, C., Macherey, W., Chen, Z., and Wu, Y. (2019). Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges. arXiv preprint arXiv:1907.05019.
[30] Denkowski, M. and Lavie, A. (2014). Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.
[31] NLTK.org (2020) NLTK 3.5 Documentation Source Code for NLTK.meteor_score, Web page URL: https://www.nltk.org/_modules/nltk/translate/meteor_score.html
[32] Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., and Herbst, E. (2007). MOSES: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pp. 177-180.
[33] Klein, G., Kim, Y., Deng, Y., Nguyen, V., Senellart, J. and Rush, A.M. (2018). OpenNMT: Neural Machine Translation Toolkit. arXiv preprint arXiv:1805.11462. Link to OpenNMT https://opennmt.net/
[34] Hieber, F., Domhan, T., Denkowski, M., Vilar, D., Sokolov, A., Clifton, A. and Post, M. (2017). Sockeye: A Toolkit for Neural Machine Translation. arXiv preprint arXiv:1712.05690.
[35] Junczys-Dowmunt, M., Grundkiewicz, R., Dwojak, T., Hoang, H., Heafield, K., Neckermann, T., Seide, F., Germann, U., Aji, A.F., Bogoychev, N., Martins, A.F., and Birch, A. (2018). Marian: Fast Neural Machine Translation in C++. arXiv preprint arXiv:1804.00344. Link to MarianNMT: https://marian-nmt.github.io/