Human evaluation has always been considered the most reliable form of evaluation in Natural Language Processing (NLP), but recent research has revealed a number of concerning issues, both in the design (Belz et al., 2020; Howcroft et al., 2020) and in the execution (Thomson et al., 2024) of human evaluation experiments. Standardisation and comparability across different experiments are low, as is reproducibility: repeat runs of the same evaluation often fail to support the same main conclusions, let alone produce similar scores.
The current situation is likely due in part to how human evaluation is viewed in NLP: not as something that needs to be studied and learnt before venturing into conducting an evaluation experiment, but as something that anyone can throw together without prior knowledge by pulling in a couple of students from the lab next door.
Our aim with this tutorial is primarily to inform participants about the range of options available and the choices that need to be made when creating human evaluation experiments, and about the implications of different decisions. Moreover, we will present best-practice principles and practical tools that help researchers design scientifically rigorous, informative and reliable experiments.
As can be seen from the schedule below, the tutorial is structured as a series of presentations and brief exercises, followed by a practical session at the end in which participants will be supported in carrying out some of the steps of developing an evaluation experiment, and in analysing its results, using tools and other resources provided by the tutorial team.
We aim to address all aspects of human evaluation of system outputs in a research setting, equipping participants with the knowledge, tools, resources and hands-on experience needed to design and execute rigorous and reliable human evaluation experiments. Take-home materials and online resources will continue to support participants in conducting experiments after the tutorial.
| Time | Duration | Unit # | Topic |
|---|---|---|---|
| 09:30—10:00 | 30 mins | Unit 1 | Introduction |
| 10:00—10:30 | 30 mins | Unit 2 | Development and Components of Human Evaluations |
| 10:30—10:45 | 15 mins | Break | |
| 10:45—11:45 | 60 mins | Unit 3 | Quality Criteria and Evaluation Modes |
| 11:45—12:30 | 45 mins | Unit 4 | Experiment Design |
| 12:30—14:00 | 90 mins | Lunch | |
| 14:00—15:15 | 75 mins | Unit 5 | Statistical Analysis of Results |
| 15:15—15:30 | 15 mins | Break | |
| 15:30—16:15 | 45 mins | Unit 6 | Experiment Implementation |
| 16:15—16:40 | 25 mins | Unit 7 | Experiment Execution |
| 16:40—16:55 | 15 mins | Break | |
| 16:55—18:30 | 95 mins | Unit 8 | Practical Session |