The Advantages of Human Evaluation of Sociomedical Question Answering Systems

Victoria Firsanova


The paper presents a study on the evaluation of question answering systems. The purpose of the study is to determine whether human evaluation is necessary to measure the performance of a sociomedical dialogue system qualitatively. The study is based on data from several natural language processing experiments conducted with a question answering dataset for the inclusion of people with autism spectrum disorder and state-of-the-art models based on the Transformer architecture. The study describes model-centric experiments on generative and extractive question answering and data-centric experiments on dataset tuning; the purpose of both approaches is to reach the highest F1-Score. Although F1-Score and Exact Match are well-known automated evaluation metrics for question answering, their reliability in measuring the performance of sociomedical systems, whose outputs should be not only consistent but also psychologically safe, is questionable. With this in mind, the author experimented with human evaluation of a dialogue system for inclusion developed in the previous phase of the work. The result of the study is an analysis of the advantages and disadvantages of automated and human approaches to evaluating conversational artificial intelligence systems in which the psychological safety of the user is essential.
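For reference, the F1-Score and Exact Match metrics mentioned in the abstract are conventionally computed per question over normalized answer strings, in the style of the SQuAD evaluation script. The sketch below is illustrative only (function names and normalization details are assumptions, not taken from the paper):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(reference))

def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

As the abstract argues, such token-overlap scores say nothing about whether an answer is psychologically safe for the user, which is what motivates the paper's human evaluation study.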







ISSN: 2307-8162