Human evaluation has always been considered the most reliable form of evaluation in Natural Language Processing (NLP), but recent research has revealed a number of concerning issues, both in the design (Belz et al., 2020; Howcroft et al., 2020) and in the execution (Thomson et al., 2024) of human evaluation experiments. Standardisation and comparability across different experiments are low, as is reproducibility: repeat runs of the same evaluation often fail to support the same main conclusions, let alone produce similar scores.
The current situation is likely due in part to how human evaluation is viewed in NLP: not as something that needs to be studied and learnt before venturing into conducting an evaluation experiment, but as something that anyone can throw together without prior knowledge by pulling in a couple of students from the lab next door.
Our aim with this tutorial is primarily to inform participants about the range of options available and the choices that need to be made when creating human evaluation experiments, and about the implications of different decisions. Moreover, we will present best-practice principles and practical tools that help researchers design scientifically rigorous, informative and reliable experiments.
As can be seen from the schedule below, the tutorial is structured as a series of presentations and brief exercises, followed by a practical session at the end in which participants will be supported in carrying out some of the steps of developing an evaluation experiment, and in analysing its results, using tools and other resources provided by the tutorial team.
We aim to address all aspects of human evaluation of system outputs in a research setting, equipping participants with the knowledge, tools, resources and hands-on experience needed to design and execute rigorous and reliable human evaluation experiments. Take-home materials and online resources will continue to support participants in conducting experiments after the tutorial.
| Time | Duration | Unit # | Topic |
|---|---|---|---|
| 09:30—10:00 | 30 mins | Unit 1 | Introduction |
| 10:00—10:30 | 30 mins | Unit 2 | Development and Components of Human Evaluations |
| 10:30—10:45 | 15 mins | Break | |
| 10:45—11:45 | 60 mins | Unit 3 | Quality Criteria and Evaluation Modes |
| 11:45—12:30 | 45 mins | Unit 4 | Experiment Design |
| 12:30—14:00 | 90 mins | Lunch | |
| 14:00—15:15 | 75 mins | Unit 5 | Statistical Analysis of Results |
| 15:15—15:30 | 15 mins | Break | |
| 15:30—16:15 | 45 mins | Unit 6 | Experiment Implementation |
| 16:15—16:40 | 25 mins | Unit 7 | Experiment Execution |
| 16:40—16:55 | 15 mins | Break | |
| 16:55—18:30 | 95 mins | Unit 8 | Practical Session |