The rise of large language models (LLMs) has had far-reaching effects across multiple fields, requiring evaluation strategies to assess their impact. In contrast to the quantitative, benchmark-based evaluations typical of AI conferences, evaluating LLMs for human-computer interaction (HCI) requires more nuanced consideration, as LLM "performance" in this arena is inherently human-centered and often bespoke to the experiential context. This paper distills a set of insights from a survey of 23 papers recently published at CHI and suggests a lens through which to view HCI LLM evaluation strategies. We discuss the challenges of evaluating LLMs in HCI and offer suggestions to help increase interdisciplinary rigor.