YANG JANET LIU

📷 in Dubrovnik, Croatia 🇭🇷

May 2023 🌇 sunset lover 🎧

I am an Assistant Professor in the Department of Linguistics at the University of Pittsburgh. Previously, I was a Postdoctoral Researcher at the MaiNLP research lab at the Center for Information and Language Processing (CIS) at LMU Munich, led by Prof. Dr. Barbara Plank. I was also affiliated with the Munich Center for Machine Learning (MCML).

I obtained my Ph.D. in Computational Linguistics from the Department of Linguistics at Georgetown University, where I was advised by Amir Zeldes, Ph.D. and was a member of Corpling@GU and Computational Linguistics @ Georgetown (GUCL). I was also a student research affiliate of NERT, directed by Nathan Schneider, Ph.D.

research interests involve:

tackling text variation in NLP (broadly construed)
studying model internals for discourse-level linguistic phenomena and generalization
discourse-level linguistic phenomena across genres using computational, statistical, and corpus-based methods
NLP applications involving discourse structure and understanding (e.g. summarization for genre-diverse texts)
cross-framework discourse understanding and unifying discourse resources (co-organizer of the DISRPT shared task)
multilingual annotation projects involving discourse-level phenomena

📧 jal787 [@] pitt [dot] edu

news

Oct 10, 2025	🎉 successfully organized the First Workshop on Bridging NLP and Public Opinion Research with my amazing co-organizers at COLM 2025 in Montreal, Canada!
Aug 25, 2025	🛎️ 1 paper accepted to EMNLP 2025 (main) & 1 paper accepted to INLG 2025
Jun 23, 2025	🛎️ new preprint examining the role and impact of reference set choice on summarization metrics!
Jun 03, 2025	🛎️ presented our ACL 2025 paper on discourse generalization at the CIS PhD seminar at LMU Munich

selected publications

INLG

References Matter: Investigating the Impact of Reference Set Variation on Summarization Evaluation

Silvia Casola^*, Yang Janet Liu^*, Siyao Peng^*, Oliver Kraus, Albert Gatt, and 1 more author

In Proceedings of the 18th International Natural Language Generation Conference, Oct 2025

(*equal contribution)

Human language production exhibits remarkable richness and variation, reflecting diverse communication styles and intents. However, this variation is often overlooked in summarization evaluation. While having multiple reference summaries is known to improve correlation with human judgments, the impact of the reference set on reference-based metrics has not been systematically investigated. This work examines the sensitivity of widely used reference-based metrics in relation to the choice of reference sets, analyzing three diverse multi-reference summarization datasets: SummEval, GUMSum, and DUC2004. We demonstrate that many popular metrics exhibit significant instability. This instability is particularly concerning for n-gram-based metrics like ROUGE, where model rankings vary depending on the reference sets, undermining the reliability of model comparisons. We also collect human judgments on LLM outputs for genre-diverse data and examine their correlation with metrics to supplement existing findings beyond newswire summaries, finding weak-to-no correlation. Taken together, we recommend incorporating reference set variation into summarization evaluation to enhance consistency alongside correlation with human judgments, especially when evaluating LLMs.

ACL

Probing LLMs for Multilingual Discourse Generalization Through a Unified Label Set

Florian Eichin^*, Yang Janet Liu^*, Barbara Plank, and Michael A. Hedderich

In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2025

(*equal contribution)

Discourse understanding is essential for many NLP tasks, yet most existing work remains constrained by framework-dependent discourse representations. This work investigates whether large language models (LLMs) capture discourse knowledge that generalizes across languages and frameworks. We address this question along two dimensions: (1) developing a unified discourse relation label set to facilitate cross-lingual and cross-framework discourse analysis, and (2) probing LLMs to assess whether they encode generalizable discourse abstractions. Using multilingual discourse relation classification as a testbed, we examine a comprehensive set of 23 LLMs of varying sizes and multilingual capabilities. Our results show that LLMs, especially those with multilingual training corpora, can generalize discourse information across languages and frameworks. Further layer-wise analyses reveal that language generalization at the discourse level is most salient in the intermediate layers. Lastly, our error analysis provides an account of challenging relation classes.

EMNLP

GDTB: Genre Diverse Data for English Shallow Discourse Parsing across Modalities, Text Types, and Domains

Yang Janet Liu^*, Tatsuya Aoyama^*, Wesley Scivetti^*, Yilun Zhu^*, Shabnam Behzad, and 4 more authors

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024

(*equal contribution)

Work on shallow discourse parsing in English has focused on the Wall Street Journal corpus, the only large-scale dataset for the language in the PDTB framework. However, the data is not openly available, is restricted to the news domain, and is by now 35 years old. In this paper, we present and evaluate a new open-access, multi-genre benchmark for PDTB-style shallow discourse parsing, based on the existing UD English GUM corpus, for which discourse relation annotations in other frameworks already exist. In a series of experiments on cross-domain relation classification, we show that while our dataset is compatible with PDTB, substantial out-of-domain degradation is observed, which can be alleviated by joint training on both datasets.

SIGDIAL

What’s Hard in RST Parsing? Predictive Models for Error Analysis

Yang Janet Liu, Tatsuya Aoyama, and Amir Zeldes

In Proceedings of the 24th Meeting of the Special Interest Group on Discourse and Dialogue, Sep 2023

Despite recent advances in Natural Language Processing (NLP), hierarchical discourse parsing in the framework of Rhetorical Structure Theory remains challenging, and our understanding of the reasons for this are as yet limited. In this paper, we examine and model some of the factors associated with parsing difficulties in previous work: the existence of implicit discourse relations, challenges in identifying long-distance relations, out-of-vocabulary items, and more. In order to assess the relative importance of these variables, we also release two annotated English test-sets with explicit correct and distracting discourse markers associated with gold standard RST relations. Our results show that as in shallow discourse parsing, the explicit/implicit distinction plays a role, but that long-distance dependencies are the main challenge, while lack of lexical overlap is less of a problem, at least for in-domain parsing. Our final model is able to predict where errors will occur with an accuracy of 76.3% for the bottom-up parser and 76.6% for the top-down parser.

Findings

GUMSum: Multi-Genre Data and Evaluation for English Abstractive Summarization

Yang Janet Liu and Amir Zeldes

In Findings of the Association for Computational Linguistics: ACL 2023, Jul 2023

Automatic summarization with pre-trained language models has led to impressively fluent results, but is prone to ‘hallucinations’, low performance on non-news genres, and outputs which are not exactly summaries. Targeting ACL 2023’s ‘Reality Check’ theme, we present GUMSum, a small but carefully crafted dataset of English summaries in 12 written and spoken genres for evaluation of abstractive summarization. Summaries are highly constrained, focusing on substitutive potential, factuality, and faithfulness. We present guidelines and evaluate human agreement as well as subjective judgments on recent system outputs, comparing general-domain untuned approaches, a fine-tuned one, and a prompt-based approach, to human performance. Results show that while GPT3 achieves impressive scores, it still underperforms humans, with varying quality across genres. Human judgments reveal different types of errors in supervised, prompted, and human-generated summaries, shedding light on the challenges of producing a good summary.

EACL

Why Can’t Discourse Parsing Generalize? A Thorough Investigation of the Impact of Data Diversity

Yang Janet Liu and Amir Zeldes

In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, May 2023

Recent advances in discourse parsing performance create the impression that, as in other NLP tasks, performance for high-resource languages such as English is finally becoming reliable. In this paper we demonstrate that this is not the case, and thoroughly investigate the impact of data diversity on RST parsing stability. We show that state-of-the-art architectures trained on the standard English newswire benchmark do not generalize well, even within the news domain. Using the two largest RST corpora of English with text from multiple genres, we quantify the impact of genre diversity in training data for achieving generalization to text types unseen during training. Our results show that a heterogeneous training regime is critical for stable and generalizable models, across parser architectures. We also provide error analyses of model outputs and out-of-domain performance. To our knowledge, this study is the first to fully evaluate cross-corpus RST parsing generalizability on complete trees, examine between-genre degradation within an RST corpus, and investigate the impact of genre diversity in training data composition.