
Improving Reading Comprehension Question Generation with Data Augmentation and Overgenerate-and-rank

Nischal Ashok Kumar1, Nigel Fernandez1, Zichao Wang2, Andrew Lan1

questions or question-answer pairs to meet the demand for a large pool of relevant questions (Kurdi et al., 2020; Yao et al., 2022). These advances can potentially facilitate the development of artificial intelligence (AI)-supported learning platforms to help students develop reading comprehension skills (Zhang et al., 2022).

Prior work on question generation in educational applications can be broadly classified into two categories: answer-aware, which is the focus of our current work, and answer-unaware (see Dugan et al. (2022) for a feasibility study), depending on whether the desired answer is given or not. For answer-aware question generation, the goal is to build an AI-based system to generate a question given both the context and the answer (Wang et al., 2018). The context can be any text segment, from a few sentences to a possibly long document, that provides the background information in which the question is grounded. The answer is a short span of text that is either part of the context (explicit) or not part of the context but inferable from it (implicit). More specifically, in answer-aware question generation, the question generation system is trained using the context-answer pairs as input and the question as the output (Yao et al.,

A key challenge in answer-aware question generation is that there are often multiple relevant questions for a given context-answer pair. Existing question generation systems are limited in identifying which questions human educators would prefer from multiple relevant ones. Table 1 shows an example context-answer pair from the FairytaleQA dataset (Xu et al., 2022b) with four relevant questions that can be answered by “a lovely dinner”, the given answer. The first and second questions focus on describing the setting of the context framed using the object (table) and the subject (Tom and Hunca), respectively. The third question adds a causal element inquiring about the cause of Tom and Hunca’s emotion. The fourth question is predictive in nature, asking about an event which can be inferred from the context.


	a lovely dinner
Questions	1. What was laid upon the table?

1.1	Contributions

(Chung et al., 2022) fine-tuning backbone, our con-

• We propose a data augmentation method to augment the training set with synthetically generated diverse and relevant questions. Specifically, we prompt a larger language model, OpenAI Codex (Chen et al., 2021), to first generate a diverse question pool and then filter out questions that are inconsistent with the given context-answer pair using a question-answering model.

• We propose an overgenerate-and-rank method to rank multiple generated question candidates for the given context-answer pair. Specifically, we fine-tune a separate BERT-based model by optimizing a distribution matching objective to learn which questions human educators prefer, and use the model to rank them.

• We conduct extensive experiments to validate the effectiveness of our methods. Our best method achieves a 5% absolute increase in the ROUGE-L score over the best existing baseline (Xu et al., 2022b). We also observe that 1) the data augmentation method can be

1 The code for the paper can be found at:

There are several works on question generation for reading comprehension. Stasaski et al. (2021) and Zou et al. (2022) propose question generation methods based on causal relations and unsupervised learning, respectively. However, their methods are focused on very specific questions and are thus not generalizable. In contrast, our work focuses on a broad variety of questions covering different narrative elements in reading comprehension. Rathod et al. (2022) propose to generate multiple semantically similar but lexically diverse questions for a given answer. However, their work is limited to generating only two questions per answer. In contrast, our approach is capable of generating multiple diverse and relevant questions, along with a ranking method to select the best question aligned with human educator preferences. Recent work on the FairytaleQA dataset develops event-based question generation methods (Zhao et al., 2022; Xu et al., 2022a). However, their results are reported on only a small subset of attributes: action, causal relationship, and outcome resolution. In contrast, we report our results over all attributes on the complete FairytaleQA dataset and compare with the current state-of-the-art baseline. Yuan et al. (2022) propose a prompt-based question generation method that leverages large language models (LMs) like GPT-3. However, these black-box LMs offer only limited, API-only access. In contrast, our method uses open-source language models to achieve competitive results. The FairytaleQA dataset paper (Xu et al., 2022b) proposes the current state-of-the-art question generation method by fine-tuning the BART (Lewis et al., 2020) LM to generate the ground truth question given the input context-answer pair. Improving upon LM fine-tuning, we propose two question generation methods for increased robustness, data augmentation and overgenerate-and-rank, which both generate diverse and valid question candidates and accurately rank and select the top question aligned with human educator preference.

questions, followed by our overgenerate-and-rank method to select the top question from the diverse question candidates generated.

We first describe our LM fine-tuning approach for question generation. We use a pre-trained Flan-T5 (Chung et al., 2022) model as our base LM for question generation. We also tried using vanilla T5 (Raffel et al., 2020) and GPT-2 (Radford et al., 2019) as our base LM, which gave comparable but lower performance, possibly because Flan-T5

	Generate question given
	Context:	$c_i$

as the target context-answer-question triplet to aug-

where $q_{i,t}$ is the $t$-th token of question $q_i$ and $q_{i,<t}$ refers to all tokens preceding the $t$-th token. Our fine-tuning objective is the sum of this loss across all training questions.
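The loss equation itself is not reproduced in this excerpt. Assuming it is the standard token-level negative log-likelihood implied by the symbol definitions above, it would take the following form. The symbols $\theta$ (LM parameters), $T_i$ (number of tokens in $q_i$), and $N$ (number of training questions) are notational assumptions introduced here, not taken from the paper.

```latex
\mathcal{L}_i(\theta) = -\sum_{t=1}^{T_i} \log P_\theta\left(q_{i,t} \mid q_{i,<t},\, c_i,\, a_i\right),
\qquad
\mathcal{L}(\theta) = \sum_{i=1}^{N} \mathcal{L}_i(\theta)
```

The second expression corresponds to the statement that the fine-tuning objective sums the per-question loss over all training questions.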

3.3 Data Augmentation

To obtain the answer of a generated question, we again use Codex in an in-context prompting fashion with a subtle change in the prompt. We use the same five in-context examples of context-answer-question triplets taken from the same attribute as the target context-answer-question being augmented. However, we change the earlier context-answer-question pattern suitable for question generation and reformulate it in the order of context-question-answer appropriate for question answering. We denote the answer to the generated question $\hat{q}_{i,j}$ as $\hat{a}_{i,j}$. We use greedy decoding since we need the single best answer. We observe that comparing the similarity of this obtained answer generated by Codex to the ground truth answer $a_i$ written by human education experts can sometimes exclude consistent synthetic questions incorrectly. We alleviate this issue by obtaining another reference answer to compare with; we prompt Codex in an in-context fashion to obtain the answer to the ground truth question $q_i$, which we denote as $\bar{a}_i$. Note that $\bar{a}_i$ could be different from the ground truth answer $a_i$ as shown in an example in Table 6 in the Supplementary Material.
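The reordering of in-context examples from context-answer-question to context-question-answer can be sketched as follows. This is only an illustration: the field labels ("Context:", "Question:", "Answer:"), the `Triplet` class, and the `build_qa_prompt` helper are assumptions, the example texts are made up, and the paper's exact prompt wording is not shown in this excerpt. One in-context example is used here for brevity where the paper uses five.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Triplet:
    """One in-context example (hypothetical container, not from the paper)."""
    context: str
    answer: str
    question: str


def build_qa_prompt(examples: List[Triplet], context: str, question: str) -> str:
    """Format examples in context-question-answer order for question answering.

    The final block leaves the answer empty so the prompted model completes it.
    """
    blocks = [
        f"Context: {ex.context}\nQuestion: {ex.question}\nAnswer: {ex.answer}"
        for ex in examples
    ]
    blocks.append(f"Context: {context}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Illustrative inputs only.
examples = [
    Triplet(
        context="A table was set for two in the doll's house.",
        answer="a lovely dinner",
        question="What was laid upon the table?",
    )
]
prompt = build_qa_prompt(
    examples,
    context="The rat laid a linen thread in the youth's hand.",
    question="What did the rat give the youth?",
)
print(prompt)
```

With greedy decoding, the completion after the final "Answer:" would play the role of $\hat{a}_{i,j}$ (or of $\bar{a}_i$ when the ground truth question is supplied).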

To check consistency, we measure the similarity between $\hat{a}_{i,j}$ and both $a_i$ and $\bar{a}_i$ using the ROUGE-1 F1 score (Lin, 2004). If either similarity is greater than a threshold of 0.5, we include the context-answer-synthetic question triplet $(c_i, a_i, \hat{q}_{i,j})$ in our augmented training set. We outline our method in Figure 1 and also in Algorithm 1 in the Supplementary Material.

[Figure 1 labels: QA Model, Consistency Matching]

generating the question under that language model.

the lowest perplexity as the best question for the given context-answer pair.

[Figure 2 labels: Ranking Model, Model Predicted Scores, Human Preferred Scores, ROUGE-L, minimize]

matching-based ranking.

model to rank the overgenerated question candidates by predicting scores over these generated questions with a similar distribution to the ROUGE-L scores between the generated questions and the ground truth question. This distribution matching objective encourages the ranking language model to associate higher scores with questions similar to the ground truth question written by human education experts. We select the question with the
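For the perplexity-based ranking criterion mentioned above, a minimal sketch of the selection step is given here. It assumes that per-token natural-log probabilities for each candidate question are already available from the fine-tuned LM; the function names and the numbers are illustrative and not taken from the paper.

```python
import math
from typing import Dict, List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity = exp of the negative mean per-token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_lowest_perplexity(candidates: Dict[str, List[float]]) -> str:
    """Return the candidate question whose tokens the LM finds most likely."""
    return min(candidates, key=lambda q: perplexity(candidates[q]))


# Made-up log-probabilities for two candidate questions.
candidates = {
    "What was laid upon the table?": [-0.2, -0.1, -0.3, -0.2],
    "What table the upon laid?": [-1.5, -2.0, -1.2, -1.8],
}
best = pick_lowest_perplexity(candidates)
print(best)
```

Averaging over tokens before exponentiating keeps the score comparable across candidates of different lengths.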

Our method, inspired by Shi et al. (2023), trains a ranking language model to minimize the KL divergence (Joyce, 2011) between the distribution of the model-predicted scores over the generated questions and the distribution of ROUGE-L scores measuring the similarity of the generated questions to the human educator-written ground truth question. We outline the training process of the ranking model in Figure 2.
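The distribution matching objective described above can be illustrated with a small numerical sketch. Several details are assumptions rather than the paper's stated choices: both score vectors are turned into distributions with a softmax, `alpha` is a hypothetical scaling hyperparameter, the KL direction is taken as KL(ROUGE-L distribution || model distribution), and all scores are made up.

```python
import math
from typing import List


def softmax(scores: List[float], alpha: float = 1.0) -> List[float]:
    """Softmax over alpha-scaled scores, shifted by the max for stability."""
    scaled = [alpha * s for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same candidates."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


# ROUGE-L of K = 3 generated questions against the ground truth question,
# and the ranking model's raw scores for the same questions (both made up).
rouge_scores = [0.80, 0.35, 0.50]
model_scores = [2.1, -0.3, 0.4]

target = softmax(rouge_scores, alpha=5.0)   # human-preferred distribution
predicted = softmax(model_scores)           # model-predicted distribution
loss = kl_divergence(target, predicted)     # quantity minimized in training

# At inference time, the candidate with the highest model score is selected.
best_index = max(range(len(model_scores)), key=model_scores.__getitem__)
print(round(loss, 4), best_index)
```

In an actual training loop the model scores would come from the ranking model's output layer and the loss would be backpropagated through them; the sketch only shows the forward computation.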

More specifically, we use a pre-trained ConvBERT (Jiang et al., 2020) model as our ranking language model. We use a combination of the given context-answer pair and the generated question to rank as input to the model. We feed the [CLS] embedding vector to a learnable linear layer during fine-tuning. For the $i$-th training question, $P_\phi(\hat{q}_i) \in [0, 1]^K$ denotes the probability distribution of the model-predicted scores for generated questions and $R(\hat{q}_i, q_i) \in [0, 1]^K$ denotes the probability distribution of the ROUGE-L scores

$$[R(\hat{q}_i, q_i)]_j = \frac{\exp\left(\alpha_R \, r(\hat{q}_{i,j}, q_i)\right)}{\sum_{j'} \exp\left(\alpha_R \, r(\hat{q}_{i,j'}, q_i)\right)}. \tag{4}$$

Experimental Evaluation

We use a pre-trained Flan-T5-Large model (Chung et al., 2022) with 770M parameters as our base LM for question generation; all implementation was done using the HuggingFace (Wolf et al., 2020) transformers library. We fine-tune the base LM for 10 epochs with early stopping on the validation loss using the AdamW (Loshchilov and Hutter, 2017) optimizer with a learning rate of 3e-4 and a batch size of 8. Each epoch takes 20 minutes on a single NVIDIA A100 GPU.
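The reported hyperparameters and the early-stopping rule can be summarized in a short, framework-free sketch. The `patience` value and the helper name are assumptions, since the paper excerpt does not state how early stopping is configured, and the validation losses are made up.

```python
from typing import List, Tuple

# Hyperparameters as reported in the text above.
config = {
    "base_model": "Flan-T5-Large",
    "max_epochs": 10,
    "optimizer": "AdamW",
    "learning_rate": 3e-4,
    "batch_size": 8,
}


def best_epoch_with_early_stopping(
    val_losses: List[float], patience: int = 2
) -> Tuple[int, float]:
    """Return (epoch, loss) of the best checkpoint seen before stopping.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs (patience is a hypothetical setting).
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_loss


# Made-up validation losses for illustration.
losses = [1.0, 0.8, 0.7, 0.75, 0.9, 0.1]
print(best_epoch_with_early_stopping(losses[: config["max_epochs"]]))
```

In this toy run, training halts after the fifth epoch, so the late drop to 0.1 is never observed; that is the usual trade-off of a small patience value.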

                                Questions
Method                          All       Explicit  Implicit
BART (Xu et al., 2022b)         0.5270    -         -
[...]                           0.5639    0.5998    0.4571
[...]                           0.5664    0.5994    0.4682
[...] Ranking                   0.5689    0.6057    0.4591
[...] Matching-based [...]      [...]     0.6107    0.4798

Table 2:

Data Augmentation Variants. We report ROUGE-L scores for several variants of our data augmentation method in Table 4 in the Supplementary Material.

FairytaleQA is imbalanced

balance the training set. Moreover, fine-tuning

the current state-of-the-art BART baseline. This significant improvement shows that our data augmentation and overgenerate-and-rank methods are effective at making question-generation systems more robust, which results in better questions being generated. We also experiment with combining our data augmentation and overgenerate-and-rank methods. However, perhaps surprisingly, this combination does not lead to significant improvement in performance. We think that this result is possibly due to synthetic questions being too diverse in many cases with respect to the ground truth question. Therefore, controlling the diversity of synthetic questions for better alignment with those written by human educators is an important direction for future work.

Performance Stratified by Question Category. To gain more insight into the performance of our question generation methods, we also report the average ROUGE-L over test questions in the explicit and implicit categories. For the harder implicit questions with answers not explicitly included in the context as text spans, our data augmentation and distribution matching-based ranking methods improve performance by 1.2% and 2.3% over fine-tuning Flan-T5, respectively. This significant performance improvement shows that our data augmentation and overgenerate-and-rank methods are well-suited for harder question generation tasks, especially when given an answer that needs to be inferred from the context, for which the ground-truth questions are already highly diverse.
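Since all results above are reported in ROUGE-L, a minimal sketch of a sentence-level ROUGE-L F1 computation based on the longest common subsequence (LCS) may help make the metric concrete. It assumes lowercased whitespace tokenization and an unweighted F1; the evaluation package actually used for the paper may tokenize, stem, or weight precision and recall differently.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using two DP rows."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between a generated question and a reference question."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "What did the man tell dullhead to do?",
    "What will dullhead need to do?",
)
print(round(score, 4))
```

Averaging this score over the explicit and implicit test questions separately would give stratified numbers of the kind discussed above.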

Context	...and when they had finished the little grey old man said to the dullhead: “Now I will bring you luck, because you have a kind heart and are willing to share what you have with others.

	What did the man tell dullhead to do?
Flan-T5

Perplexity-based Ranking	4. What will dullhead do to find something? 5. What will dullhead do when he meets the grey old man?
	4. What did the little man tell dullhead to do because he wanted to find something? 5. What will dullhead need to do?

Table 3: Qualitative analysis with an example input context-answer-question from the FairytaleQA dataset and questions generated by our methods. Both data augmentation and overgenerate-and-rank improve diversity among the generated questions, which makes question generation more robust.

The first two error types are beyond our control, but the third type suggests that our methods have plenty of room for improvement. Errors of type character coreference resolution can occur when an input context has multiple characters and coreferences. In the first example, “self” is used as a complex coreference and confuses the question generation method. Errors of type out-of-context ground-truth questions can occur for ground-truth questions that use information outside the context the model sees as input. These ground-truth questions are human errors, often referring to named entities present in other sections of the same story but not included in the input context. In the second example, the ground truth question refers to the character “Ian”, who is not present in the context; the generated question uses the reference “fisher’s son” that it has access to in the given context. Errors of type multiple evidence angles can occur when the input context discusses different aspects of an answer. In the third example, the event of “Norseman invasion” in the answer could have questions related to either its cause, “people being wicked”, or its timeline, “happening after the two Countesses fled to Scotland”. As a result, among the top decoder output questions, none discusses the latter, which is contained in the ground-truth question. Therefore, it is important to develop methods that can take all possible question angles into account during decoding.

question candidates and ranking them to align with human educator preferences. First, we proposed a data augmentation method that augments the training dataset with diverse questions obtained from a larger language model. Second, we proposed an overgenerate-and-rank method with two choices of ranking criterion, perplexity-based ranking and distribution matching-based ranking. The latter learns to rank the generated candidate questions to select ones that are closer to human-written questions. We conducted extensive experiments on the FairytaleQA dataset to validate the effectiveness of our methods, showing that our best method provides an absolute improvement of 5% in ROUGE-L over the current state-of-the-art on this dataset. We also showed that our methods are significantly better than baselines in generating harder questions whose answers are not directly present in the context as text spans and have to be inferred.

There are several directions for future work. First, we can experiment with other data augmentation methods, e.g., by fine-tuning the base language model while weighting synthetically generated questions according to their ROUGE-L scores with respect to the ground truth question. Second, we can explore the use of chain-of-thought (Wei et al., 2022) or self-ask (Press et al., 2022) to prompt the large language model in our data augmentation method. Third, we can experiment with other ranking objectives, such as ones using the Bradley-Terry model (Bradley and Terry, 1952) or ones using the reinforcement learning with human feedback framework (Ziegler et al., 2019), to select the best questions that are aligned with human preference. Fourth, we can apply our methods to other question generation scenarios that require reasoning, such as logistical questions in online course discussion forums (Zylich et al., 2020), to help instructors anticipate common student questions.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.

Zi-Hang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, and Shuicheng Yan. 2020. Convbert: Improving bert with span-based dynamic convolution. Advances in Neural Information Processing Systems, 33:12837–12848.

3https://www.thequestchallenge.org

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text trans-former. The Journal of Machine Learning Research, 21(1):5485–5551.

Manav Rathod, Tony Tu, and Katherine Stasaski. 2022. Educational multi-question generation for reading comprehension. In Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022), pages 216–223.

Zichao Wang, Andrew Lan, and Richard Baraniuk. 2021. al Methods in Natural Language Processing, pages 5986–5999, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Zichao Wang, Andrew S Lan, Weili Nie, Andrew E Waters, Phillip J Grimaldi, and Richard G Baraniuk. 2018. Qg-net: a data-driven question generation model for educational content. In Proceedings of the fifth annual ACM conference on learning at scale, pages 1–10.

Ying Xu, Dakuo Wang, Mo Yu, Daniel Ritchie, Bingsheng Yao, Tongshuang Wu, Zheng Zhang, Toby Li, Nora Bradford, Branda Sun, Tran Hoang, Yisi Sang, Yufang Hou, Xiaojuan Ma, Diyi Yang, Nanyun Peng, Zhou Yu, and Mark Warschauer. 2022b. n Association for Computational Linguistics (Volume 1: Long Papers), pages 447–460, Dublin, Ireland. Association for Computational Linguistics.

Bingsheng Yao, Dakuo Wang, Tongshuang Wu, Zheng Zhang, Toby Li, Mo Yu, and Ying Xu. 2022. Association for Computational Linguistics (Volume 1: Long Papers), pages 731–744, Dublin, Ireland. Association for Computational Linguistics.

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.

Bowei Zou, Pengfei Li, Liangming Pan, and Aiti Aw. 2022. Automatic true/false question generation for

$\bar{a}_i$ ← GenAnsCodex(($c_i$, $q_i$));
for j ← 1 to M do
    $\hat{a}_{i,j}$ ← GenAnsCodex(($c_i$, $\hat{q}_{i,j}$));
    if ROUGE($\hat{a}_{i,j}$, $a_i$) > 0.5 or ROUGE($\hat{a}_{i,j}$, $\bar{a}_i$) > 0.5 then
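The consistency-matching loop in the pseudocode fragment above can be sketched in runnable form as follows. This is an illustration under stated assumptions: `answer_fn` is a hypothetical stand-in for the prompted QA model (GenAnsCodex), ROUGE-1 F1 is approximated by lowercased whitespace-token unigram overlap, and kept questions are paired with the ground truth answer as described in the main text.

```python
from collections import Counter
from typing import Callable, List, Tuple


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 (a simplified ROUGE-1, whitespace tokenization)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def filter_consistent(
    context: str,
    answer: str,
    gt_question: str,
    synthetic_questions: List[str],
    answer_fn: Callable[[str, str], str],
    threshold: float = 0.5,
) -> List[Tuple[str, str, str]]:
    """Keep synthetic questions whose predicted answer matches either reference."""
    ref_answer = answer_fn(context, gt_question)  # plays the role of a-bar_i
    kept = []
    for question in synthetic_questions:
        pred = answer_fn(context, question)       # plays the role of a-hat_{i,j}
        if (rouge1_f1(pred, answer) > threshold
                or rouge1_f1(pred, ref_answer) > threshold):
            kept.append((context, answer, question))
    return kept


# Toy QA model backed by a lookup table (illustrative only).
toy_answers = {
    "What did Tom and Hunca see?": "a dinner",
    "What was laid upon the table?": "a lovely dinner",
    "Who lived in the house?": "two dolls",
}
kept = filter_consistent(
    context="(context omitted)",
    answer="a lovely dinner",
    gt_question="What did Tom and Hunca see?",
    synthetic_questions=["What was laid upon the table?", "Who lived in the house?"],
    answer_fn=lambda ctx, q: toy_answers[q],
)
print(kept)
```

Comparing against both reference answers means a synthetic question is discarded only when its predicted answer disagrees with the human-written answer and with the model's own answer to the ground truth question.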

Columns: (1) Decoding Type; (2) Perplexity-based Ranking; (3) Distribution Matching-based Ranking with Nucleus Sampling (0.95, 1, 10); (4) Distribution Matching-[...] with [...] Search (4, 0.6, 10)

Decoding Type                        (2)       (3)       (4)
Greedy (No [...]
Nucleus Sampling [...]               0.5664    0.5778    0.5657
Nucleus Sampling (0.95, [...] 10)    0.5618
Nucleus [...] (0.95, [...]           0.5671    0.5766
Contrastive Search (4, [...]         0.5689

Table 5:	Experimental results on the FairytaleQA

Data Augmentation Method Variant		ROUGE-L
No Augmentation		0.5639


Minority Questions + λ Weighting		0.5664
Table 4: Experimental results on the FairytaleQA dataset in ROUGE-L (higher is better) comparing different variants of our data augmentation method.

Context: ...and with that the rat laid a linen thread in the youth’s hand. “Heaven be praised!”, said the youth when he was up above once more. “I’ll not go down there again in a hurry.” But he held the thread in his hand and danced and sang as usual ...
Ground Truth Answer: excited
Ground Truth [...]: How did the youth feel when the rat allowed him to [...]
[...] Question: How did the youth feel when he had [...] linen thread in his hand?
Generated Answer of Ground Truth Question: [...]

Columns: Context; [...]; [...] Truth Question; Generated Question; Error Type

Context: "What is your name?" asked the girl from underground. "Self is my name," said the woman. That seemed a curious name to the girl, and she once more began to pull the fire apart. Then the woman grew angry and began to scold, and built it all up again. Thus they went on for a good while; but at last, while they were in the midst of their pulling apart and building up of the fire, the woman upset the tar-barrel on the girl from underground. Then the latter screamed and ran away, crying: "Father, father! Self burned me!" "Nonsense, if self did it, then self must suffer for it!" came the answer from below the hill.
[...] Truth Question: [...] did the girl’s father think burned the [...]
Generated Question: [...] screamed [...] ran [...]
Error Type: Character coreference

Context: So the gallows was built upon a high platform, and the fisher’s son mounted the steps up to it, and turned at the top to make the speech that was expected from every doomed man, innocent or guilt. As he spoke he happened to raise his arm, and the king’s daughter, who was there at her father’s side, saw the name which she had written under it. With a shriek she sprang from her seat, and the eyes of the spectators were turned towards her. ’Stop! stop!’ she cried, hardly knowing what she said. ’If that man is hanged there is not a soul in the kingdom but shall die also.’ And running up to where the fisher’s son was standing, she took him by the hand, saying, ’Father, this is no robber or murderer, but the victor in the three races, and he loosed the spells that were
[...]: The king’s [...] saw the [...] which she had written under it.
[...] Truth Question: How did the princess [...]
Generated Question: [...] when the fisher’s son raised his arm?
Error Type: [...] context ground-truth [...]

Context: His vengeance was baulked, however, for in the panic and confusion that followed Harold’s death, the two Countesses slipped out of the Palace and fled to the coast, and took boat in haste to Scotland, where they had great possessions, and where they were much looked up to, and where no one would believe a word against them. But retribution fell on them in the end, as it always does fall, sooner or later, on everyone who is wicked, or selfish, or cruel; for the Norsemen invaded the land, and their Castle was set on fire, and they perished miserably in the flames. When Earl Paul found that they had escaped, he set out in hot haste for the Island of Hoy, for he was determined that the Dwarf, at least, should not escape. But when he came to the Dwarfie Stone he found it silent and deserted, all trace of its uncanny occupants
[...]: Norsemen [...] the land, and their Castle was set on fire, and they perished [...] in the [...]
[...] Truth Question: What happened after the two Countesses fled to [...]
Generated Question: [...] because [...] is wicked, or [...] or cruel?
Error Type: Multiple evidence angles in context