INLG 2024 Tutorial:
Human Evaluation of NLP System Quality

ADAPT Research Centre, Dublin City University, Ireland
New York University, USA


Our tutorial will be held on 24th September 2024 as part of INLG 2024 in Tokyo.

The tutorial will be available to both in-person and remote attendees. Registration for INLG is now open!

We will release all slides and Colab notebooks from the tutorial.

Tutorial Description

Human evaluation has always been considered the most reliable form of evaluation in Natural Language Processing (NLP), but recent research has surfaced a number of concerning issues in both the design (Belz et al., 2020; Howcroft et al., 2020) and execution (Thomson et al., 2024) of human evaluation experiments. Standardisation and comparability across different experiments are low, as is reproducibility, in the sense that repeat runs of the same evaluation often fail to support the same main conclusions, let alone produce similar scores.

The current situation is likely due in part to how human evaluation is viewed in NLP: not as something that needs to be studied and learnt before venturing into conducting an evaluation experiment, but as something that anyone can throw together without prior knowledge by pulling in a couple of students from the lab next door.

Our aim with this tutorial is primarily to inform participants about the range of options available, the choices that need to be made when creating human evaluation experiments, and the implications of different decisions. We will also present best-practice principles and practical tools that help researchers design scientifically rigorous, informative and reliable experiments.

As the schedule below shows, the tutorial is structured as a series of presentations and brief exercises, followed by a practical session at the end in which participants will be supported in working through some of the steps of developing an evaluation experiment, and in analysing its results, using tools and other resources provided by the tutorial team.

We aim to address all aspects of human evaluation of system outputs in a research setting, equipping participants with the knowledge, tools, resources and hands-on experience needed to design and execute rigorous and reliable human evaluation experiments. Take-home materials and online resources will continue to support participants in conducting experiments after the tutorial.
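To give a flavour of the kind of analysis covered in Unit 5 (Statistical Analysis of Results), the sketch below compares hypothetical human ratings collected for two systems. The data and the choice of test are illustrative assumptions, not material from the tutorial; the tutorial itself discusses a range of analysis options.

import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical 1-7 Likert-style fluency ratings, one per evaluated output.
# In practice these would come from an evaluation interface or spreadsheet.
ratings_system_a = np.array([6, 5, 7, 6, 4, 6, 5, 7, 6, 5])
ratings_system_b = np.array([4, 5, 3, 5, 4, 4, 6, 3, 5, 4])

# Ordinal ratings do not meet the assumptions of parametric tests, so a
# non-parametric test such as Mann-Whitney U is one common choice.
statistic, p_value = mannwhitneyu(ratings_system_a, ratings_system_b,
                                  alternative="two-sided")
print(f"U = {statistic:.1f}, p = {p_value:.4f}")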

Schedule

Time Duration Unit # Topic
09:30—10:00 30 mins Unit 1 Introduction
10:00—10:30 30 mins Unit 2 Development and Components of Human Evaluations
10:30—10:45 15 mins Break
10:45—11:45 60 mins Unit 3 Quality Criteria and Evaluation Modes
11:45—12:30 45 mins Unit 4 Experiment Design
12:30—14:00 90 mins Lunch
14:00—15:15 75 mins Unit 5 Statistical Analysis of Results
15:15—15:30 15 mins Break
15:30—16:15 45 mins Unit 6 Experiment Implementation
16:15—16:40 25 mins Unit 7 Experiment Execution
16:40—16:55 15 mins Break
16:55—18:30 95 mins Unit 8 Practical Session

Reading List