Benchmark Tasks for Language Modelling

Susan McRoy

6 Benchmark Tasks for Language Modelling

This chapter assesses the progress of Natural Language Processing as a scientific discipline. We do so by considering applications of NLP that test the generality of a system or language model. The field of NLP comprises both practical applications and basic science.

Basic science refers to investigations where the primary goal is to predict phenomena or to understand nature. Basic science outcomes help us to develop technologies that alter events or outcomes. Early qualitative work in NLP involved researchers looking at small examples of language phenomena and using them to better understand how human language processing works. This research has looked at what people do well and where they make mistakes, and at the conditions under which computational models exhibit similar behavior. Inspiration for computational approaches comes from research in other disciplines, primarily psycholinguistics. Early work involved asking people to judge whether a sentence is grammatical. More modern approaches use biosensors to track movements of the eyes or event-related potentials of the brain as people process ambiguous sentences or resolve long-distance dependencies, such as referring expressions^[1].

Basic science for NLP also includes developing models of human language. Language modeling involves learning parameters for complex mathematical expressions that measure or predict the likelihood of occurrences of a word or sentence for a given context. These models are used to implement methods for solving low-level NLP tasks that form the basis of more complex software applications. Four important examples of low-level NLP tasks include:

Labelling language units or pairs of units according to their grammaticality,
Labelling units with the type of sentiment or emotion they express,
Labelling pairs of units as to whether they are semantically similar, and
Labelling pairs of units as to their logical relationship.

They are sometimes called “benchmark” tasks for language processing because they can be defined independently of any specific computational approach and thus offer a way to compare different approaches.^[2] For example, having a method of correctly classifying sentences as ungrammatical might be useful for helping writers, teaching second language learners, or providing input to a classification-based parser. To date, there is no one modelling approach that can handle all these benchmark tasks well, which is a sign that the field of NLP has not yet solved the problem of creating general-purpose models of human language. In this chapter, we will consider each of the four tasks.

6.1 Grammaticality Analysis

Grammaticality refers to whether an expression obeys the accepted syntactic conventions of a language, independent of whether those sentences make “sense”. A sentence can be both grammatical and meaningless; a classic example is “Colorless green ideas sleep furiously.”^[3]^, ^[4]. Grammaticality judgements are binary (yes or no). Since grammars are descriptions developed by observing the language of native speakers, decisions about what counts as grammatical must also come from native speakers of a language or expert observations. Grammaticality analysis can be used directly, such as for tools to support writing or to grade the writing of language learners, or it can be used to generate features for other tasks, such as assessing some cognitive disorders (e.g., the frequency of ungrammaticality has been associated with particular subtypes of autism spectrum disorder^[5]).

Approaches to grammaticality involve either comparing a given sentence against some model of perfect grammar or comparing a given sentence against some model of common errors, or a combination^[6]. Models of perfect grammar can consist of rules that enforce syntactic constraints at the sentence level. These constraints include that the subject and verb must agree (e.g., both be plural or singular); the pronoun must be in the correct form (case) for its role in the sentence (e.g., the subject “I” versus the object “me”); the noun must agree with its determiner (e.g., “This book” but not “These book”); the verb must be in the correct tense given its role (e.g., the present form of “be” expects the present participle, as in “am/is/are walking”); conjoined phrases should all have the same (parallel) structure; and words that take arguments, that is, words that create requirements for the co-occurrence of other constructs, must occur with the correct pattern of arguments. There are also rules about the location of the apostrophe when forming the possessive. Models of common errors can be rules that describe examples of violations of syntactic constraints and also rules for detecting problems that are primarily semantic, such as dangling modifiers, which are constructions where the entity being modified is either missing or ambiguous.

There are also specialized models of common errors, for different types of learners, such as children or people who are learning a second language. Children often have difficulty with homophones (such as “there”, “their” and “they’re”) or near-homophones (such as “then” and “than”) which sound similar, but are spelled differently. Second language learners have errors that occur when they mistakenly apply grammatical constraints from their first language that do not hold in the second language. For example, native speakers of Arabic sometimes omit the present form of “be” before an adjective or an indefinite article before a noun, because they are not used in these constructions in Arabic.

One of the first automated grammar checkers was EPISTLE^[7]. It takes what we would now consider to be a standard relaxation approach. The algorithm first attempts to parse a sentence while enforcing grammaticality constraints expressed as rules written by experts and then selectively relaxes some of the constraints, while noting the type and location of each violated condition. (The reason to look for known types of errors is to be able to decide whether a parser failed to derive a sentence because it lacks coverage or failed because the sentence being parsed has an error.) An alternative rule-based approach extends a rule-based grammar to include explicit “malrules” that cover known types of errors and are marked with extra features that flag them as errors^[8]^, ^[9]. For example, a malrule might say that an erroneous noun phrase is one with a plural determiner and a singular head noun.
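The malrule example above can be sketched in a few lines. This is only an illustration under stated assumptions: the tiny lexicon, the function name `find_malrule_matches`, and the error label are hypothetical, and a real system would apply such rules inside a full parser rather than over adjacent tokens.

```python
# Toy lexicon with number features (hypothetical; a real grammar is far larger).
DETERMINERS = {"this": "sg", "that": "sg", "these": "pl", "those": "pl"}
NOUNS = {"book": "sg", "books": "pl", "cat": "sg", "cats": "pl"}


def find_malrule_matches(tokens):
    """Apply one malrule: plural determiner followed by a singular head noun."""
    matches = []
    for position in range(len(tokens) - 1):
        determiner = tokens[position].lower()
        noun = tokens[position + 1].lower()
        if DETERMINERS.get(determiner) == "pl" and NOUNS.get(noun) == "sg":
            # Record the type and location of the violation, as a checker would.
            matches.append({
                "error": "plural-determiner-singular-noun",
                "position": position,
                "text": f"{tokens[position]} {tokens[position + 1]}",
            })
    return matches


print(find_malrule_matches("These book is long".split()))
print(find_malrule_matches("This book is long".split()))
```

Because the malrule is flagged as an error rule, a match tells the system that the sentence was understood but is ill-formed, rather than simply outside the grammar's coverage.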

Both relaxation and malrules rely on search to find a rule that matches the suspect sentence. More recent approaches to grammaticality analysis use classification rather than search to determine the existence and type of the error. One approach is to train models using a large corpus of grammatical text, then score unseen examples with a measure of how likely they are to resemble the training sentences, and apply thresholds of similarity to determine if a writer’s usage is correct “enough” (that is, close enough to a predicted example) given the context. Other approaches use a corpus of erroneous examples to train models and then classify unseen sentences as instances of one of the known error types. Figure 6.1 shows some examples of errors found in writing assignments submitted by learners of English.
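The train, score, and threshold idea can be illustrated with a very small bigram language model. This is a minimal sketch, not the method of any system cited here: the four-sentence corpus, add-one smoothing, and the threshold of -1.8 are all assumptions chosen for the example.

```python
import math
from collections import Counter

# Tiny stand-in for "a large corpus of grammatical text" (illustrative only).
CORPUS = ["the cat sleeps", "the dog sleeps", "a cat eats", "the dog eats"]


def train_bigram_model(sentences):
    """Count contexts and bigrams, padding each sentence with <s> and </s>."""
    context_counts, bigram_counts = Counter(), Counter()
    vocab = {"</s>", "<unk>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens[1:])
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, len(vocab)


def average_log_prob(sentence, model):
    """Average log-probability per bigram under the smoothed model."""
    context_counts, bigram_counts, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    pairs = list(zip(tokens, tokens[1:]))
    total = 0.0
    for previous, word in pairs:
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        prob = (bigram_counts[(previous, word)] + 1) / (
            context_counts[previous] + vocab_size
        )
        total += math.log(prob)
    return total / len(pairs)


def is_acceptable(sentence, model, threshold=-1.8):
    """Treat a sentence as correct "enough" if its score clears the threshold."""
    return average_log_prob(sentence, model) >= threshold


model = train_bigram_model(CORPUS)
for candidate in ["the cat eats", "eats the cat"]:
    print(candidate, round(average_log_prob(candidate, model), 3),
          is_acceptable(candidate, model))
```

In practice the threshold would be tuned on held-out data, and a low score may reflect rare but correct usage rather than an error.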

Figure 6.1 Grammatical errors annotated in NUCLE with ERRANT, from Ng et al. (2014)
Type	Description	Example
Vt	Verb tense	Medical technology during that time [is→was] not advanced enough to cure him.
Vm	Verb modal	Although the problem [would→may] not be serious, people [would→might] still be afraid.
V0	Missing verb	However, there are also a great number of people [who → who are] against this technology.
Vform	Verb form	A study in 2010 [shown → showed] that patients recover faster when surrounded by family members.
SVA	Subject-verb agreement	The benefits of disclosing genetic risk information [outweighs → outweigh] the costs.
ArtOrDet	Article or determiner	It is obvious to see that [internet → the internet] saves people time and also connects people globally.
Nn	Noun number	A carrier may consider not having any [child → children] after getting married.
Npos	Noun possessive	Someone should tell the [carriers → carrier’s] relatives about the genetic problem.
Pform	Pronoun form	A couple should run a few tests to see if [their → they] have any genetic diseases beforehand.
Pref	Pronoun reference	It is everyone’s duty to ensure that [he or she → they] undergo regular health checks.
Prep	Preposition	This essay will [discuss about → discuss] whether a carrier should tell his relatives or not.
Wci	Wrong collocation/idiom	Early examination is [healthy → advisable] and will cast away unwanted doubts.
Wa	Acronyms	After [WOWII → World War II], the population of China decreased rapidly.
Wform	Word form	The sense of [guilty → guilt] can be more than expected.
Wtone	Tone (formal/informal)	[It’s → It is] our family and relatives that bring us up.
Srun	Run-on sentences, comma splices	The issue is highly [debatable, a → debatable. A] genetic risk could come from either side of the family.
Smod	Dangling modifiers	[Undeniable, → It is undeniable that] it becomes addictive when we spend more time socializing virtually.
Spar	Parallelism	We must pay attention to this information and [assisting → assist] those who are at risk.
Sfrag	Sentence fragment	However, from the ethical point of view.
Ssub	Subordinate clause	This is an issue [needs → that needs] to be addressed.
WOinc	Incorrect word order	[Someone having what kind of disease → What kind of disease someone has] is a matter of their own privacy.
WOadv	Incorrect adjective/ adverb order	In conclusion, [personally I → I personally] feel that it is important to tell one’s family members.
Trans	Linking words/phrases	It is sometimes hard to find [out → out if] one has this disease.
Mec	Spelling, punctuation, capitalization, etc.	This knowledge [maybe relavant → may be relevant] to them.
Rloc−	Redundancy	It is up to the [patient’s own choice → patient] to disclose information.
Cit	Citation	Poor citation practice.
Others	Other errors	An error that does not fit into any other category but can still be corrected.
Um	Unclear meaning	Genetic disease has a close relationship with the born gene. (i.e., no correction possible without further clarification.)

Resources for building models of correct grammar from data include syntactic tree banks (which are collections of sentences annotated with parse trees that have been vetted by experts)^[10]. Resources for building models of errors include sentences collected from published articles by trained scholars of linguistics that include examples of grammatical or ungrammatical sentences. One such collection is the Corpus of Linguistic Acceptability^[11]. Other collections of erroneous sentences include the NUS Corpus of Learner English (NUCLE)^[12], the Cambridge English Write & Improve (W&I) corpus, and the LOCNESS corpus, which are collections of essays written by second language learners that have been annotated. Another resource is ERRANT, a grammatical ERRor ANnotation Toolkit designed to automatically extract edits from parallel original and corrected sentences and classify them according to the type of error. Figure 6.1 shows 28 error types along with examples for each from the NUCLE corpus that have been annotated using ERRANT^[13].
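The idea of extracting edits from parallel original and corrected sentences can be sketched with Python's standard `difflib` module. This is not ERRANT's API or algorithm; it is a simplified stand-in that aligns whitespace tokens and reports only the operation (replace, insert, delete), without assigning linguistic error types. The function name `extract_edits` is hypothetical.

```python
import difflib


def extract_edits(original, corrected):
    """Return (operation, original span, corrected span) for each difference."""
    original_tokens = original.split()
    corrected_tokens = corrected.split()
    matcher = difflib.SequenceMatcher(
        a=original_tokens, b=corrected_tokens, autojunk=False
    )
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((
                tag,
                " ".join(original_tokens[i1:i2]),
                " ".join(corrected_tokens[j1:j2]),
            ))
    return edits


print(extract_edits(
    "Medical technology during that time is not advanced enough to cure him.",
    "Medical technology during that time was not advanced enough to cure him.",
))
print(extract_edits(
    "This is an issue needs to be addressed.",
    "This is an issue that needs to be addressed.",
))
```

A type classifier would then label each extracted edit, for example mapping the first edit above to a verb tense error.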

6.2 Sentiment Analysis and Emotion Recognition

Sentiment Analysis attempts to capture the emotional aspects of language including opinions and evaluation. It originated from work that classified sentences as subjective or objective based on the particular words that they contain^[14]^,^[15]^,^[16]^,^[17]. Particular words express the polarity of the opinions in a language unit, which might be positive (e.g., “liking”), negative (e.g., “not liking”), or neutral; polarity is often expressed using adjectives, such as “beautiful” vs “ugly”. Sentences are considered “objective” if they do not include any expressed opinions. Figure 6.2 shows examples of subjective and objective sentences. The sentence on the left is considered subjective because the word “fascinating” is considered positive (in contrast to saying “boring” or “trite”); the sentence on the right is objective because “increase” is considered factual (without implying the increase was too much or too little).

Figure 6.2 Examples of subjective versus objective sentences
Subjective sentence	Objective sentence
At several different layers, it’s a fascinating tale.	Bell Industries Inc. increased its quarterly to 10 cents from 7 cents a share.
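A word-based subjectivity test of the kind described above can be sketched as a lexicon lookup. The lexicon here contains only the example words mentioned in this section and is purely illustrative; the function name `classify_sentence` and the tie-breaking rule are assumptions.

```python
import re

# Tiny hand-made opinion lexicon (hypothetical); real lexicons are far larger.
POSITIVE_WORDS = {"fascinating", "beautiful"}
NEGATIVE_WORDS = {"boring", "trite", "ugly"}


def classify_sentence(text):
    """Return (subjectivity, polarity) based only on lexicon matches."""
    tokens = re.findall(r"[a-z]+", text.lower())
    positives = sum(token in POSITIVE_WORDS for token in tokens)
    negatives = sum(token in NEGATIVE_WORDS for token in tokens)
    if positives == 0 and negatives == 0:
        # No opinion words found, so the sentence is treated as objective.
        return ("objective", "neutral")
    if positives > negatives:
        return ("subjective", "positive")
    if negatives > positives:
        return ("subjective", "negative")
    return ("subjective", "neutral")


print(classify_sentence("At several different layers, it's a fascinating tale."))
print(classify_sentence(
    "Bell Industries Inc. increased its quarterly to 10 cents from 7 cents a share."
))
```

A lookup like this ignores negation and context ("not fascinating"), which is one reason later work moved to learned classifiers.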

Sentiment analysis often involves just labelling the general polarity, which might be positive, negative, or neutral. More fine-grained approaches classify particular levels of sentiment. For example, the Stanford Sentiment Treebank uses continuous values ranging from 1 to 25, where 1 is the most negative and 25 is the most positive. Figure 6.3 below shows some examples (taken from movie reviews) and the scores that the sentences received from the treebank authors’ algorithm.

Figure 6.3 Stanford Sentiment Treebank example score values
Example	Score (rounded)
The performances are an absolute joy.	21
Yet the act is still charming here	18
A big fat pain	5
Something must have been lost in the translation	7

Beyond sentiment, there are also approaches that try to capture various types of emotion, such as anger, excitement, fear, joy, sadness, etc. These tasks are not included in the benchmarks, but might be in the future. Training data would be more difficult to create as labelling examples with such categories requires specialized expertise. However, associating words with emotions has long been part of the qualitative analysis of human language performed by some psychologists^[18]^,^[19] and incorporated into some tools, such as Linguistic Inquiry and Word Count (LIWC)^[20].

Freely available tools for sentiment analysis include VADER, TextBlob, and SentiStrength. Some of the most recent open source tools use Deep Learning (e.g., from the Open Data Hub). Resources for building sentiment analysis include movie and product reviews in the public domain, which typically include an explicit rating that can be mapped to a sentiment polarity value. Examples of public standardized corpora include the Stanford Sentiment Treebank^[21], which was derived from a sentence-level corpus of movie reviews^[22]. The LIWC tools mentioned above are available for a licensing fee.
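Mapping an explicit review rating to a polarity label can be as simple as comparing it with the midpoint of the rating scale. The sketch below assumes a scale that starts at 1; the cut points, the function name `rating_to_polarity`, and the sample reviews are hypothetical, and published datasets may use different conventions (for example, discarding mid-scale reviews).

```python
def rating_to_polarity(rating, scale_max=5):
    """Map a rating on a 1..scale_max scale to a polarity label."""
    if not 1 <= rating <= scale_max:
        raise ValueError(f"rating {rating} is outside 1..{scale_max}")
    midpoint = (1 + scale_max) / 2
    if rating > midpoint:
        return "positive"
    if rating < midpoint:
        return "negative"
    return "neutral"


# Hypothetical reviews paired with star ratings, turned into training labels.
reviews = [("Loved every minute.", 5), ("It was fine.", 3), ("A waste of time.", 1)]
labelled = [(text, rating_to_polarity(stars)) for text, stars in reviews]
print(labelled)
```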

6.3 Semantic Similarity

Semantic similarity, also called semantic textual similarity, is the notion that two expressions mean approximately the same thing (e.g., they are paraphrases of each other). Similarity is thus a symmetric relationship – when comparing two units neither would be more general or specific than the other. Similarity can arise at the word level, through synonyms, or at the sentence level, where one might reorder the parts of a conjunction or substitute an active construction for a passive one. Being able to detect when two expressions mean “nearly” the same thing is useful for assessing whether a student has answered a test question correctly, or when trying to determine the intent of a question or command without requiring the designer to list every possible expression explicitly. Systems that require one to say things in exactly one way make it difficult for users to learn or remember the required phrasing.

Systems that map expressions onto sentences in a formal logic to test subsumption (which they refer to as classification inference) are performing a type of semantic similarity analysis. A rule-based approach would create a deep semantic representation (such as an expression in first order logic or description logic) and then test whether a pair of concepts (A, B) are equivalent, that is, both A ⊆ B and B ⊆ A hold. Today, semantic similarity is a task that can be learned from data that includes pairs of expressions that have been previously deemed to be similar. Figure 6.4 includes some examples of similarity judgements given by staff at Microsoft.
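The equivalence test (both A ⊆ B and B ⊆ A) can be illustrated by modelling each concept as the set of individuals it covers. This is a deliberate simplification: a description logic reasoner works over concept definitions rather than enumerated instances, and the concept names and members below are hypothetical.

```python
def subsumes(general, specific):
    """True if every instance of `specific` is also an instance of `general`."""
    return specific <= general


def equivalent(concept_a, concept_b):
    """Two concepts are equivalent when each subsumes the other."""
    return subsumes(concept_a, concept_b) and subsumes(concept_b, concept_a)


# Hypothetical concept extensions.
physicians = {"dr_lee", "dr_cho"}
medical_doctors = {"dr_cho", "dr_lee"}
clinicians = {"dr_lee", "dr_cho", "nurse_kim"}

print(equivalent(physicians, medical_doctors))  # same members in both directions
print(subsumes(clinicians, physicians))         # clinicians is the broader concept
print(equivalent(physicians, clinicians))       # subsumption holds one way only
```

The last case shows why similarity is treated as symmetric: one-directional subsumption means one expression is more general than the other, which is not equivalence.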

Figure 6.4 Examples of similarity judgements from the Microsoft Research Paraphrase Corpus (2005)
Sentence 1	Sentence 2	Similarity class
Amrozi accused his brother, whom he called “the witness”, of deliberately distorting his evidence.	Referring to him as only “the witness”, Amrozi accused his brother of deliberately distorting his evidence.	Yes
Yucaipa owned Dominick’s before selling the chain to Safeway in 1998 for $2.5 billion.	Yucaipa bought Dominick’s in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998.	No
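As a point of comparison for learned models, sentence pairs like those in Figure 6.4 can be scored with a simple word-overlap measure. The sketch below uses Jaccard similarity over lowercased tokens; it is a baseline chosen for illustration, not the method used to build the corpus, and the 0.5 threshold is an assumption.

```python
import re


def token_set(text):
    """Lowercase the text and collect its alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_similarity(sentence_1, sentence_2):
    """Size of the shared vocabulary divided by the size of the combined one."""
    tokens_1, tokens_2 = token_set(sentence_1), token_set(sentence_2)
    if not tokens_1 and not tokens_2:
        return 1.0
    return len(tokens_1 & tokens_2) / len(tokens_1 | tokens_2)


def is_paraphrase(sentence_1, sentence_2, threshold=0.5):
    return jaccard_similarity(sentence_1, sentence_2) >= threshold


pair = (
    "Amrozi accused his brother, whom he called “the witness”, "
    "of deliberately distorting his evidence.",
    "Referring to him as only “the witness”, Amrozi accused his brother "
    "of deliberately distorting his evidence.",
)
print(round(jaccard_similarity(*pair), 3), is_paraphrase(*pair))
```

Word overlap cannot tell whether shared words play the same roles, so pairs that reuse vocabulary to say different things can fool it; that gap is what trained similarity models aim to close.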

Three resources for semantic similarity that have been proposed as a standard for evaluating work on similarity are: The Microsoft Research Paraphrase Corpus^[23], the Quora Question Pairs dataset^[24]^, ^[25], and the Semantic Similarity Benchmark corpus from SemEval 2017^[26]. The first two of these corpora include text collected from online news and questions posted to a community question answering website where equivalence judgements were provided by staff at the respective organizations (e.g., Microsoft and Quora). The SemEval data includes video and image captions created via crowdsourcing and judged by organizers and participants of the conference. One newer dataset that may prove useful is the multi-domain question rewriting dataset, which was created from Stack Exchange question edit histories by researchers at Google, University of Chicago, and Toyota Technological Institute^[27].

6.4 Textual Entailment

Textual entailment, also known as natural language inference, is an approximation of real inference that has been formalized to allow for using classification as a solution. There are three categories: a given text either entails another text, contradicts another text, or neither, which is classified as “neutral”. (There are examples in Figure 6.5.) By convention, the entailing text is known as “the text” or “the premise” and the other text is known as “the hypothesis”. A search-based system might create a semantics using first order logic and apply a theorem prover. A more efficient approach would use a graph-based representation, such as a subsumption hierarchy. Recent work by Young et al. (2014) explores combining a statistical measure (conditional probability or mutual information) and a graph structure, to form a hybrid structure that they call a Denotation Graph^[28].

Classification-based approaches for Textual Entailment have also been devised, using a training set where items have been both generated and then labelled by hand for different entailment relations, typically by crowdsourcing, using the Amazon Mechanical Turk platform. The conventionally used entailment categories for this approach are defined as follows:

Entailment: the hypothesis states something that is definitely correct about the situation or event in the premise.

Neutral: the hypothesis states something that might be correct about the situation or event in the premise.

Contradiction: the hypothesis states something that is definitely incorrect about the situation or event in the premise.

The collections of sentence pairs that have been created are designed to exemplify several known sources of entailment, from low-level word meanings and sentence structure to high-level reasoning and application of world knowledge. They capture four broad categories of phenomena: lexical semantics, predicate-argument structure, logic, and world and common sense knowledge, where each category may have several subtypes. We will now consider the general idea for each.

Lexical semantics or word meaning includes entailments based on concept abstraction or exclusivity, such as that a cat is a type of mammal and a cat is not a dog. It also includes morphological negation (e.g. agree vs. disagree, happy vs. unhappy, etc). It also includes entailments based on verbs, that in normal usage warrant certain inferences, such as saying “I stopped doing X” entails one was doing X previously and saying “I recognize that X” entails X. Also, saying “The cat is fat.” entails that the cat exists. (Linguists sometimes refer to these various types of expressions as presuppositions or factives.)
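Entailments based on concept abstraction and exclusivity can be sketched with a small hypernym hierarchy. The taxonomy below is hypothetical and tree-shaped, and the sketch assumes that concepts on different branches are mutually exclusive, which holds for this toy example but not for every real lexical resource.

```python
# Each concept points to its immediate, more general concept (toy taxonomy).
HYPERNYM = {
    "cat": "feline",
    "feline": "mammal",
    "dog": "canine",
    "canine": "mammal",
    "mammal": "animal",
}
KNOWN_CONCEPTS = set(HYPERNYM) | set(HYPERNYM.values())


def ancestors(concept):
    """All concepts more general than `concept`, nearest first."""
    chain = []
    while concept in HYPERNYM:
        concept = HYPERNYM[concept]
        chain.append(concept)
    return chain


def lexical_relation(premise_concept, hypothesis_concept):
    """Label the step from "X is a <premise>" to "X is a <hypothesis>"."""
    if premise_concept not in KNOWN_CONCEPTS or hypothesis_concept not in KNOWN_CONCEPTS:
        return "neutral"  # nothing is known about the word
    if premise_concept == hypothesis_concept:
        return "entailment"
    if hypothesis_concept in ancestors(premise_concept):
        return "entailment"  # a cat is a type of mammal
    if premise_concept in ancestors(hypothesis_concept):
        return "neutral"  # a mammal might or might not be a cat
    return "contradiction"  # different branches: a cat is not a dog


print(lexical_relation("cat", "mammal"))
print(lexical_relation("cat", "dog"))
print(lexical_relation("mammal", "cat"))
```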

Predicate-argument structure includes a verb and its subject and object, which may tell you who did something and what was acted upon (their semantic role). The order of the roles depends on the syntax. For example both “The cat broke the vase.” and “The vase was broken by the cat.” entail that the vase broke.

Logic includes entailments that may arise because of connectives (conjunction, disjunction), negation, double negation, conditionals, and quantifiers. Logic also includes entailments based on specific types of entities, such as numbers and intervals of time, which have an associated magnitude and sequential order, and operators defined on them, such as less than or greater than, and before or after. Logic-based entailments (mostly) follow the semantics of mathematical logic. For example, “the cat sat on the mat and slept” entails “the cat slept” and “the cat slept” entails “the cat slept or the cat ate”, but “the cat slept or ate” is neutral about “the cat slept”. Conditionals include sentences like “if it is raining, the grass will be wet”, which would not entail that “the grass is wet” or “it is raining” or even “it is raining and the grass is wet”; however, a complex conditional such as “It is raining and if it is raining, then the grass will be wet” would entail that “the grass is wet.” Quantifiers of logic include the universal (all) and the existential (some) which create entailments based on their semantics, e.g., “All cats have fur.” entails “my cat has fur” and “my cat likes fish” entails that “some cats like fish”. Natural language includes additional quantifiers such as “most” or “most X in the Y”. Determining the entailments of these quantifiers requires judgements that combine an understanding of their general meaning and world or common sense knowledge.
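The connective and conditional examples above can be checked mechanically with a truth table: a premise entails a hypothesis when the hypothesis is true in every assignment that makes the premise true. This is a minimal sketch for propositional logic only; the proposition names are hypothetical labels for the example sentences, and quantifiers are out of scope.

```python
from itertools import product


def entails(premise, hypothesis, variables):
    """True if `hypothesis` holds in every world where `premise` holds."""
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        if premise(world) and not hypothesis(world):
            return False
    return True


def nli_label(premise, hypothesis, variables):
    """Map logical entailment onto the three textual entailment classes."""
    if entails(premise, hypothesis, variables):
        return "entailment"
    if entails(premise, lambda world: not hypothesis(world), variables):
        return "contradiction"
    return "neutral"


CAT = ["sat", "slept", "ate"]
sat_and_slept = lambda w: w["sat"] and w["slept"]
slept = lambda w: w["slept"]
slept_or_ate = lambda w: w["slept"] or w["ate"]

WEATHER = ["raining", "wet"]
if_raining_then_wet = lambda w: (not w["raining"]) or w["wet"]
raining_and_rule = lambda w: w["raining"] and if_raining_then_wet(w)
wet = lambda w: w["wet"]

print(nli_label(sat_and_slept, slept, CAT))        # conjunction to one conjunct
print(nli_label(slept, slept_or_ate, CAT))         # statement to a disjunction
print(nli_label(slept_or_ate, slept, CAT))         # disjunction to one disjunct
print(nli_label(if_raining_then_wet, wet, WEATHER))
print(nli_label(raining_and_rule, wet, WEATHER))
```

Enumerating assignments grows exponentially with the number of propositions, which is one reason search-based systems turn to theorem provers or graph-based representations.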

World and common sense knowledge includes entailment relations based on knowledge of history, geography, law, politics, science, technology, culture, and other aspects of human experience. An example might be that “Milwaukee has some beautiful parks.” entails that “Wisconsin has some beautiful parks.” Common sense includes entailment relations that are not exactly factual, but do not depend on either cultural or educational background. For example, “A girl was making snow angels” entails “a girl is playing in the snow” and “the grass is wet” entails “it might have rained”. Some examples from the Stanford Natural Language Inference dataset, showing the Text-Hypothesis pairs and the crowdsourced inference type, are shown in Figure 6.5.

Figure 6.5 Examples of textual entailment types
Text	Hypothesis	Class label
A man inspects the uniform of a figure in some East Asian country.	The man is sleeping	contradiction
An older and younger man smiling.	Two men are smiling and laughing at the cats playing on the floor.	neutral
A black race car starts up in front of a crowd of people.	A man is driving down a lonely road.	contradiction
A soccer game with multiple males playing.	Some men are playing a sport.	entailment
A smiling costumed woman is holding an umbrella.	A happy woman in a fairy costume holds an umbrella.	neutral
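Because each pair carries one of three labels, a system's predictions can be scored by simple agreement with the crowdsourced labels. The sketch below computes accuracy and, as an assumed point of comparison, a baseline that always predicts the most frequent label; the five gold labels are the ones in Figure 6.5, which is far too small a sample for a real evaluation.

```python
from collections import Counter

# Class labels from the five rows of Figure 6.5, in order.
GOLD_LABELS = ["contradiction", "neutral", "contradiction", "entailment", "neutral"]


def accuracy(predicted, gold):
    """Fraction of items whose predicted label matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def majority_baseline(training_labels, count):
    """Predict the most frequent training label for every item."""
    most_common_label = Counter(training_labels).most_common(1)[0][0]
    return [most_common_label] * count


baseline_predictions = majority_baseline(GOLD_LABELS, len(GOLD_LABELS))
print(accuracy(baseline_predictions, GOLD_LABELS))
```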

Textual entailment recognition is a classification task that can support several different software applications including text summarization, question answering, information retrieval, and machine translation. For example, summaries can be shortened by removing any sentences that are already entailed by other sentences in the summary. In question answering, any acceptable candidate answer found in a source document must entail the ideal answer, which we are presumed to already know. In information retrieval, search criteria might be given in terms of sentences that the desired documents must entail. Because it is independent of any particular application, textual entailment has also been used as a benchmark task for natural language processing, one that could be used to evaluate and compare the effectiveness of natural language models across a variety of applications. Currently there are a number of software tools and datasets available for creating and evaluating systems for textual entailment. Functions for performing textual entailment are included in (or available for) AllenNLP and spaCy.
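The summarization use case can be sketched as a filter that drops any sentence already entailed by another sentence that is kept. The `entails` argument stands for whatever entailment recognizer is available; the `toy_entails` function below is a crude word-subset stand-in used only so the example runs, not a real entailment model.

```python
import re


def toy_entails(premise, hypothesis):
    """Crude stand-in: every word of the hypothesis appears in the premise."""
    premise_words = set(re.findall(r"[a-z]+", premise.lower()))
    hypothesis_words = set(re.findall(r"[a-z]+", hypothesis.lower()))
    return hypothesis_words <= premise_words


def prune_entailed(sentences, entails):
    """Keep only sentences that no other kept sentence entails."""
    kept = []
    for sentence in sentences:
        if any(entails(other, sentence) for other in kept):
            continue  # redundant: an earlier kept sentence already covers it
        # Drop earlier sentences that the new, more informative one entails.
        kept = [other for other in kept if not entails(sentence, other)]
        kept.append(sentence)
    return kept


summary = [
    "The cat slept.",
    "The cat sat on the mat and slept.",
    "The dog barked.",
]
print(prune_entailed(summary, toy_entails))
```

Passing the recognizer in as a parameter keeps the pruning logic independent of how entailment is decided, so a trained classifier could be substituted without changing the filter.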

Some resources that have been proposed as a standard for evaluation include: the Multi-Genre Natural Language Inference Corpus, which includes examples drawn from transcribed speech, popular fiction, and government reports which were labelled via crowdsourcing^[29]; the Stanford Natural Language Inference corpus^[30]; the Stanford Question Answering Dataset^[31]; the Recognizing Textual Entailment datasets, which come from a series of annual challenges, first held in 2004^[32]; and the data from the Winograd Schema Challenge^[33].

6.5 Summary

Four tasks have emerged as benchmark tasks for language modelling. These tasks are grammaticality analysis, sentiment analysis, semantic textual similarity, and recognizing textual entailment. Language models can address these tasks by training on data that represents the set of class types for that task. Much of this data was created “naturally”, for example extracted from collections of student writing graded by instructors, from online reviews of products, or from captions posted to online image databases (e.g., Flickr). When examples of entailments did not otherwise exist, the data has been crowdsourced. These benchmark tasks, while not always useful on their own, are useful for many downstream applications. For example, a training system might evaluate the similarity or entailment relation of the response to an expected answer. A question-answering system might use textual similarity to group together related questions. A combination of grammaticality and textual similarity would be helpful for assessing or improving writing, for example to identify grammatical mistakes, excessive repetition, or potential plagiarism. Single or cross-language similarity classification might be used as an objective function for training systems that summarize texts or that translate from one language to another.

In addition to supporting better applications, benchmark tasks serve as a guiding force for advancing the basic science of NLP. Language models and the architectures for constructing them are becoming more powerful and more complex every day. To compare alternatives or to measure progress over time, common datasets and tasks like the ones discussed in this chapter are often used. Because these tasks do not depend on a particular domain, such as medicine or manufacturing, they are also used to test generality. However, some have already become concerned that these tasks are not sufficiently general or appropriate^[34]. To increase generality, additional benchmarks have been proposed, e.g., SuperGLUE^[35]. Concerns about how to address the ethics of creating language models have yet to be resolved, and so each NLP project must be careful to avoid or reduce potentially harmful consequences.

Barkley, C., Kluender, R. and Kutas, M. (2015). Referential Processing in the Human Brain: An Event-Related Potential (ERP) Study. Brain Research, 1629, pp. 143-159. (Online at: http://kutaslab.ucsd.edu/people/kutas/pdfs/2015.BR.143.pdf)
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O. and Bowman, S.R. (2018). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. ArXiv preprint arXiv:1804.07461.
Chomsky, N. (1957). Syntactic Structures (The Hague: Mouton, 1957). Review of Verbal Behavior by BF Skinner, Language, 35, 26-58.
Erard, M. (2010). The Life and Times of "Colorless Green Ideas Sleep Furiously". Southwest Review, 95(3), 418-425. Accessed July 2021, from http://www.jstor.org/stable/43473072
Wittke, K., Mastergeorge, A.M., Ozonoff, S., Rogers, S.J. and Naigles, L.R. (2017). Grammatical Language Impairment in Autism Spectrum Disorder: Exploring Language Phenotypes Beyond Standardized Testing. Frontiers in Psychology, 8:532. DOI: 10.3389/fpsyg.2017.00532.
Dale, R. (2011). Automated Writing Assistance: Grammar Checking and Beyond. (Manuscript; online at: http://web.science.mq.edu.au/~rdale/teaching/icon2011/iconbib.pdf)
Heidorn, G.E., Jensen, K., Miller, L.A., Byrd, R.J., and Chodorow, M. (1982). The EPISTLE Text-Critiquing System. IBM Systems Journal, 21, pp. 305-326.
Sleeman, D. (1982). Inferring (mal) Rules from Pupil’s Protocols. In Proceedings of the 1982 European Conference on Artificial Intelligence (ECAI-82), Orsay, France, pp. 160–164.
Schneider, D., & McCoy, K. F. (1998). Recognizing Syntactic Errors in the Writing of Second Language Learners. In COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics. Also online at: arXiv.org/abs/cmp-lg/9805012.
Taylor, A., Marcus, M., and Santorini, B. (2003). The Penn Treebank: An Overview. Treebanks: Building and Using Parsed Corpora, Abeillé, A. (Ed.), pp. 5-22.
Warstadt, A., Singh, A. and Bowman, S.R. (2019). Neural Network Acceptability Judgments. Transactions of the Association for Computational Linguistics, 7, pp. 625-641.
Dahlmeier, D., Ng, H.T. and Wu, S.M. (2013). Building a Large Annotated Corpus of Learner English: The NUS Corpus of Learner English. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 22-31.
Ng, H.T., Wu, S.M., Briscoe, T., Hadiwinoto, C., Susanto, R.H. and Bryant, C. (2014). The CoNLL-2014 Shared Task on Grammatical Error Correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pp. 1-14.
Hatzivassiloglou, V., and McKeown, K. (1997). Predicting the Semantic Orientation of Adjectives. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics (ACL-EACL 1997), pp. 174–181.
Wiebe, J., Bruce, R., and O’Hara, T. (1999). Development and Use of a Gold Standard Data Set for Subjectivity Classifications. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), 246–253. University of Maryland: ACL.
Wiebe, J. (2000). Learning Subjective Adjectives from Corpora. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence. AAAI Press, 735–740.
Wiebe, J., Wilson, T., Bruce, R., Bell, M. and Martin, M. (2004). Learning Subjective Language. Computational Linguistics, 30(3), pp. 277-308.
Gottschalk, L. A., and Gleser, G. C. (1969). The Measurement of Psychological States through the Content Analysis of Verbal Behavior. CA: University of California Press.
Weintraub, W. (1989). Verbal Behavior in Everyday Life. NY: Springer.
Pennebaker, J.W., Boyd, R.L., Jordan, K., and Blackburn, K. (2015). The Development and Psychometric Properties of LIWC2015. Austin, TX: University of Texas at Austin.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A.Y. and Potts, C. (2013). Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631-1642.
Pang, B., Lee, L. and Vaithyanathan, S. (2002). Thumbs Up?: Sentiment Classification using Machine Learning Techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, Volume 10, pp. 79-86. Association for Computational Linguistics.
Dolan, W.B. and Brockett, C. (2005). Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
Csernai, K. (2017). Data @ Quora: First Quora Dataset Release: Question Pairs. URL: https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs (Accessed March 2021).
An archived copy of the Quora Question Pairs dataset, as a 55.5MB tsv file, can be found here: http://web.archive.org/web/20181122142727/http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I. and Specia, L. (2017). SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Cross-Lingual Focused Evaluation. arXiv preprint arXiv:1708.00055.
Chu, Z., Chen, M., Chen, J., Wang, M., Gimpel, K., Faruqui, M., & Si, X. (2020). How to Ask Better Questions? A Large-Scale Multi-Domain Dataset for Rewriting Ill-Formed Questions. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05), 7586-7593. https://doi.org/10.1609/aaai.v34i05.6258
Young, P., Lai, A., Hodosh, M. and Hockenmaier, J. (2014). From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions. Transactions of the Association for Computational Linguistics, 2, pp. 67-78.
Williams, A., Nangia, N. and Bowman, S.R. (2017). A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. ArXiv preprint arXiv:1704.05426.
Bowman, S., Angeli, G., Potts, C., and Manning, C. (2015). A Large Annotated Corpus for Learning Natural Language Inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632–642. Association for Computational Linguistics.
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392. Association for Computational Linguistics.
Dagan, I., Glickman, O., and Magnini, B. (2006). The PASCAL Recognising Textual Entailment Challenge. In Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pp. 177–190. Springer.
Davis, E., Morgenstern, L., and Ortiz Jr., C. (2017). The First Winograd Schema Challenge at IJCAI-16. AI Magazine, 38(3), 97-98. (Link to data: https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WSCollection.html)
Floridi, L. and Chiriatti, M. (2020). GPT-3: Its Nature, Scope, Limits, and Consequences. Minds & Machines, 30, 681–694. https://doi.org/10.1007/s11023-020-09548-1
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O. and Bowman, S.R. (2019). SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 3266-3280. Also available as: arXiv preprint arXiv:1905.00537.

License

Creative Commons Attribution-NonCommercial 4.0 International License