Publications
Reviewing the Reviewer: Elevating Peer Review Quality through LLM-Guided Feedback
Peer review is central to scientific quality, yet reliance on simple heuristics -- lazy thinking -- has lowered standards. Prior work treats lazy thinking detection as a single-label task, but review segments may exhibit multiple issues, including broader clarity or specificity problems. Turning detection into actionable improvements requires guideline-aware feedback, which is currently missing. We introduce an LLM-driven framework that decomposes reviews into argumentative segments, identifies issues via a neurosymbolic module combining LLM features with traditional classifiers, and generates targeted feedback using issue-specific templates refined by a genetic algorithm. Experiments show our method outperforms zero-shot LLM baselines and improves review quality by up to 92.4%. We also release LazyReviewPlus, a dataset of 1,309 sentences labeled for lazy thinking and specificity.
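A minimal sketch of the detect-then-improve pipeline described above; all names (segment_review, FEEDBACK_TEMPLATES, the toy classifier) are hypothetical illustrations rather than the released implementation, and the genetic-algorithm template refinement is omitted.

```python
# Hypothetical sketch: segment a review, detect issues (multi-label),
# and fill issue-specific feedback templates.

FEEDBACK_TEMPLATES = {
    "lazy_thinking": "This point relies on a heuristic ('{segment}'). "
                     "Please justify it with evidence from the paper.",
    "specificity": "This comment is too vague ('{segment}'). "
                   "Please point to concrete sections or examples.",
}

def segment_review(review: str) -> list[str]:
    # placeholder segmentation: one argumentative segment per sentence
    return [s.strip() for s in review.split(".") if s.strip()]

def generate_feedback(review: str, classify) -> list[str]:
    """classify(segment) -> list of issue labels, e.g. ['lazy_thinking']."""
    feedback = []
    for segment in segment_review(review):
        for label in classify(segment):  # multi-label: several issues per segment
            feedback.append(FEEDBACK_TEMPLATES[label].format(segment=segment))
    return feedback

# toy usage with a trivial classifier standing in for the neurosymbolic module
print(generate_feedback(
    "The paper is not novel. The experiments are unclear",
    classify=lambda s: ["lazy_thinking"] if "novel" in s else ["specificity"],
))
```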
TempViz: On the Evaluation of Temporal Knowledge in Text-to-Image Models
arXiv.org
Time alters the visual appearance of entities in our world, like objects, places, and animals. Thus, for accurately generating contextually-relevant images, knowledge and reasoning about time can be crucial (e.g., for generating a landscape in spring vs. in winter). Yet, although substantial work exists on understanding and improving temporal knowledge in natural language processing, research on how temporal phenomena appear and are handled in text-to-image (T2I) models remains scarce. We address this gap with TempViz, the first dataset to holistically evaluate temporal knowledge in image generation, consisting of 7.9k prompts and more than 600 reference images. Using TempViz, we study the capabilities of five T2I models across five temporal knowledge categories. Human evaluation shows that temporal competence is generally weak, with no model exceeding 75% accuracy across categories. Towards larger-scale studies, we also examine automated evaluation methods, comparing several established approaches against human judgments. However, none of these approaches provides a reliable assessment of temporal cues - further indicating the pressing need for future research on temporal knowledge in T2I.
SoS: Analysis of Surface over Semantics in Multilingual Text-To-Image Generation
Text-to-image (T2I) models are increasingly employed by users worldwide. However, prior research has pointed to the high sensitivity of T2I models to particular input languages - when faced with languages other than English (i.e., different surface forms of the same prompt), T2I models often produce culturally stereotypical depictions, prioritizing the surface over the prompt's semantics. Yet a comprehensive analysis of this behavior, which we dub Surface-over-Semantics (SoS), is missing. We present the first analysis of T2I models' SoS tendencies. To this end, we create a set of prompts covering 171 cultural identities, translated into 14 languages, and use it to prompt seven T2I models. To quantify SoS tendencies across models, languages, and cultures, we introduce a novel measure and analyze how the tendencies we identify manifest visually. We show that all but one model exhibit a strong surface-level tendency in at least two languages, with this effect intensifying across the layers of T2I text encoders. Moreover, these surface tendencies frequently correlate with stereotypical visual depictions.
Around the World in 24 Hours: Probing LLM Knowledge of Time and Place
Annual Meeting of the Association for Computational Linguistics
Reasoning over time and space is essential for understanding our world. However, the abilities of language models in this area are largely unexplored as previous work has tested their abilities for logical reasoning in terms of time and space in isolation or only in simple or artificial environments. In this paper, we present the first evaluation of the ability of language models to jointly reason over time and space. To enable our analysis, we create GeoTemp, a dataset of 320k prompts covering 289 cities in 217 countries and 37 time zones. Using GeoTemp, we evaluate eight open chat models of three different model families for different combinations of temporal and geographic knowledge. We find that most models perform well on reasoning tasks involving only temporal knowledge and that overall performance improves with scale. However, performance remains constrained in tasks that require connecting temporal and geographical information. We do not find clear correlations of performance with specific geographic regions. Instead, we find a significant performance increase for location names with low model perplexity, suggesting their repeated occurrence during model training. We further demonstrate that their performance is heavily influenced by prompt formulation - a direct injection of geographical knowledge leads to performance gains, whereas, surprisingly, techniques like chain-of-thought prompting decrease performance on simpler tasks.
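The joint time-and-place questions in GeoTemp can, in principle, be answered programmatically; below is a minimal sketch using Python's standard zoneinfo module, under the assumption that a question pairs a reference time with a target time zone (the dataset's actual construction may differ).

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

def time_elsewhere(ref: datetime, target_tz: str) -> datetime:
    """Convert a timezone-aware reference time into the target time zone."""
    return ref.astimezone(ZoneInfo(target_tz))

# If it is 09:00 in London on 1 June, what time is it in Tokyo?
ref = datetime(2024, 6, 1, 9, 0, tzinfo=ZoneInfo("Europe/London"))
print(time_elsewhere(ref, "Asia/Tokyo").strftime("%H:%M"))  # 17:00
```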
LazyReview: A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews
Annual Meeting of the Association for Computational Linguistics
Agents of Discovery
arXiv.org
The substantial data volumes encountered in modern particle physics and other domains of fundamental physics research allow (and require) the use of increasingly complex data analysis tools and workflows. While the use of machine learning (ML) tools for data analysis has recently proliferated, these tools are typically special-purpose algorithms that rely, for example, on encoded physics knowledge to reach optimal performance. In this work, we investigate a new and orthogonal direction: Using recent progress in large language models (LLMs) to create a team of agents -- instances of LLMs with specific subtasks -- that jointly solve data analysis-based research problems in a way similar to how a human researcher might: by creating code to operate standard tools and libraries (including ML systems) and by building on results of previous iterations. If successful, such agent-based systems could be deployed to automate routine analysis components to counteract the increasing complexity of modern tool chains. To investigate the capabilities of current-generation commercial LLMs, we consider the task of anomaly detection via the publicly available and highly-studied LHC Olympics dataset. Several current models by OpenAI (GPT-4o, o4-mini, GPT-4.1, and GPT-5) are investigated and their stability tested. Overall, we observe the capacity of the agent-based system to solve this data analysis problem. The best agent-created solutions mirror the performance of human state-of-the-art results.
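A hypothetical sketch of the iterate-on-results agent pattern described above; ask_llm and run_in_sandbox stand in for an LLM API call and a sandboxed code executor, neither of which is specified here.

```python
# Hypothetical sketch of an agent that proposes analysis code, executes it,
# and builds on the results of previous iterations.

def analysis_agent(task: str, ask_llm, run_in_sandbox, max_rounds: int = 5):
    history = []
    for round_ in range(max_rounds):
        prompt = (f"Task: {task}\n"
                  f"Previous attempts and results: {history}\n"
                  "Write Python code for the next step.")
        code = ask_llm(prompt)           # agent proposes analysis code
        result = run_in_sandbox(code)    # execute and capture metrics/errors
        history.append({"round": round_, "code": code, "result": result})
        if result.get("ok"):             # stop once the analysis succeeds
            break
    return history
```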
Detecting Hallucinations in Authentic LLM-Human Interactions
arXiv.org
As large language models (LLMs) are increasingly applied in sensitive domains such as medicine and law, hallucination detection has become a critical task. Although numerous benchmarks have been proposed to advance research in this area, most of them are artificially constructed--either through deliberate hallucination induction or simulated interactions--rather than derived from genuine LLM-human dialogues. Consequently, these benchmarks fail to fully capture the characteristics of hallucinations that occur in real-world usage. To address this limitation, we introduce AuthenHallu, the first hallucination detection benchmark built entirely from authentic LLM-human interactions. For AuthenHallu, we select and annotate samples from genuine LLM-human dialogues, thereby providing a faithful reflection of how LLMs hallucinate in everyday user interactions. Statistical analysis shows that hallucinations occur in 31.4% of the query-response pairs in our benchmark, and this proportion increases dramatically to 60.0% in challenging domains such as Math&Number Problems. Furthermore, we explore the potential of using vanilla LLMs themselves as hallucination detectors and find that, despite some promise, their current performance remains insufficient in real-world scenarios.
Large Language Models Discriminate Against Speakers of German Dialects
Conference on Empirical Methods in Natural Language Processing
Dialects represent a significant component of human culture and are found across all regions of the world. In Germany, more than 40% of the population speaks a regional dialect (Adler and Hansen, 2022). However, despite cultural importance, individuals speaking dialects often face negative societal stereotypes. We examine whether such stereotypes are mirrored by large language models (LLMs). We draw on the sociolinguistic literature on dialect perception to analyze traits commonly associated with dialect speakers. Based on these traits, we assess the dialect naming bias and dialect usage bias expressed by LLMs in two tasks: an association task and a decision task. To assess a model's dialect usage bias, we construct a novel evaluation corpus that pairs sentences from seven regional German dialects (e.g., Alemannic and Bavarian) with their standard German counterparts. We find that: (1) in the association task, all evaluated LLMs exhibit significant dialect naming and dialect usage bias against German dialect speakers, reflected in negative adjective associations; (2) all models reproduce these dialect naming and dialect usage biases in their decision making; and (3) contrary to prior work showing minimal bias with explicit demographic mentions, we find that explicitly labeling linguistic demographics--German dialect speakers--amplifies bias more than implicit cues like dialect usage.
How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild
arXiv.org
Mind the Inclusivity Gap: Multilingual Gender-Neutral Translation Evaluation with mGeNTE
Conference on Empirical Methods in Natural Language Processing
Avoiding the propagation of undue (binary) gender inferences and default masculine language remains a key challenge towards inclusive multilingual technologies, particularly when translating into languages with extensive gendered morphology. Gender-neutral translation (GNT) represents a linguistic strategy towards fairer communication across languages. However, research on GNT is limited to a few resources and language pairs. To address this gap, we introduce mGeNTE, an expert-curated resource, and use it to conduct the first systematic multilingual evaluation of inclusive translation with state-of-the-art instruction-following language models (LMs). Experiments on en-es/de/it/el reveal that while models can recognize when neutrality is appropriate, they cannot consistently produce neutral translations, limiting their usability. To probe this behavior, we enrich our evaluation with interpretability analyses that identify task-relevant features and offer initial insights into the internal dynamics of LM-based GNT.
SHADES: Towards a Multilingual Assessment of Stereotypes in Large Language Models
North American Chapter of the Association for Computational Linguistics
Large Language Models (LLMs), the bedrock of many “artificial intelligence” (AI) applications, are known to reproduce social biases present in their training data. Yet resources to measure and control this issue are limited. Research on identifying and mitigating stereotype biases has primarily concentrated on English, lagging behind the rapid advancement of LLMs in multilingual settings. To help further advance the ability to address stereotype bias in AI systems, we introduce a new multilingual dataset: SHADES. Designed for examining culturally-specific stereotypes that may be learned by LLMs, SHADES includes over 300 stereotypes from 37 regions, translated across 16 languages and annotated with multiple features to aid multilingual stereotype analysis. All statements in all languages are paired with templates, to serve as a resource for unlimited
LazyReview: A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews
arXiv.org
Peer review is a cornerstone of quality control in scientific publishing. With the increasing workload, the unintended use of 'quick' heuristics, referred to as lazy thinking, has emerged as a recurring issue compromising review quality. Automated methods to detect such heuristics can help improve the peer-reviewing process. However, there is limited NLP research on this issue, and no real-world dataset exists to support the development of detection tools. This work introduces LazyReview, a dataset of peer-review sentences annotated with fine-grained lazy thinking categories. Our analysis reveals that Large Language Models (LLMs) struggle to detect these instances in a zero-shot setting. However, instruction-based fine-tuning on our dataset significantly boosts performance by 10-20 performance points, highlighting the importance of high-quality training data. Furthermore, a controlled experiment demonstrates that reviews revised with lazy thinking feedback are more comprehensive and actionable than those written without such feedback. We will release our dataset and the enhanced guidelines that can be used to train junior reviewers in the community. (Code available here: https://github.com/UKPLab/acl2025-lazy-review)
Decision-Making with Deliberation: Meta-reviewing as a Document-grounded Dialogue
arXiv.org
Meta-reviewing is a pivotal stage in the peer-review process, serving as the final step in determining whether a paper is recommended for acceptance. Prior research on meta-reviewing has treated this as a summarization problem over review reports. However, complementary to this perspective, meta-reviewing is a decision-making process that requires weighing reviewer arguments and placing them within a broader context. Prior research has demonstrated that decision-makers can be effectively assisted in such scenarios via dialogue agents. In line with this framing, we explore the practical challenges of realizing dialogue agents that can effectively assist meta-reviewers. Concretely, we first address the issue of data scarcity for training dialogue agents by generating synthetic data using Large Language Models (LLMs) based on a self-refinement strategy to improve the relevance of these dialogues to expert domains. Our experiments demonstrate that this method produces higher-quality synthetic data and can serve as a valuable resource towards training meta-reviewing assistants. Subsequently, we utilize this data to train dialogue agents tailored for meta-reviewing and find that these agents outperform off-the-shelf LLM-based assistants for this task. Finally, we apply our agents in real-world meta-reviewing scenarios and confirm their effectiveness in enhancing the efficiency of meta-reviewing. (Code available at: https://github.com/UKPLab/eacl2026-meta-review-as-dialog)
Aligned Probing: Relating Toxic Behavior and Model Internals
arXiv.org
We introduce aligned probing, a novel interpretability framework that aligns the behavior of language models (LMs), based on their outputs, and their internal representations (internals). Using this framework, we examine over 20 OLMo, Llama, and Mistral models, bridging behavioral and internal perspectives for toxicity for the first time. Our results show that LMs strongly encode information about the toxicity level of inputs and subsequent outputs, particularly in lower layers. Focusing on how unique LMs differ offers both correlative and causal evidence that they generate less toxic output when strongly encoding information about the input toxicity. We also highlight the heterogeneity of toxicity, as model behavior and internals vary across unique attributes such as Threat. Finally, four case studies analyzing detoxification, multi-prompt evaluations, model quantization, and pre-training dynamics underline the practical impact of aligned probing with further concrete insights. Our findings contribute to a more holistic understanding of LMs, both within and beyond the context of toxicity.
How Much Do LLMs Hallucinate across Languages? On Realistic Multilingual Estimation of LLM Hallucination
Conference on Empirical Methods in Natural Language Processing
In the age of misinformation, hallucination - the tendency of Large Language Models (LLMs) to generate non-factual or unfaithful responses - represents the main risk for their global utility. Despite LLMs becoming increasingly multilingual, the vast majority of research on detecting and quantifying LLM hallucination is (a) English-centric and (b) focused on machine translation (MT) and summarization, tasks that are less common in realistic settings than open information seeking. In contrast, we aim to quantify the extent of LLM hallucination across languages in knowledge-intensive long-form question answering (LFQA). To this end, we train a multilingual hallucination detection model and conduct a large-scale study across 30 languages and 6 open-source LLM families. We start from an English hallucination detection dataset and rely on MT to translate-train a detection model. We also manually annotate gold data for five high-resource languages; we then demonstrate, for these languages, that the estimates of hallucination rates are similar between silver (LLM-generated) and gold test sets, validating the use of silver data for estimating hallucination rates for other languages. For the final rate estimation, we build an open-domain QA dataset for 30 languages with LLM-generated prompts and Wikipedia articles as references. Our analysis shows that LLMs, in absolute terms, hallucinate more tokens in high-resource languages due to longer responses, but that the actual hallucination rates (i.e., normalized for length) seem uncorrelated with the sizes of languages' digital footprints. We also find that smaller LLMs hallucinate more and, significantly, that LLMs with broader language support display higher hallucination rates.
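The distinction between absolute hallucinated tokens and length-normalized rates can be made concrete with a small sketch, assuming token-level binary hallucination labels (an assumption for illustration, not the paper's detection model).

```python
def hallucination_stats(token_labels: list[int]) -> tuple[int, float]:
    """token_labels: 1 if a token is marked hallucinated, else 0 (assumed format).
    Returns (absolute count, length-normalized rate)."""
    count = sum(token_labels)
    return count, count / len(token_labels)

# A longer response can contain more hallucinated tokens in absolute terms...
long_resp = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
short_resp = [0, 1, 0, 1]
print(hallucination_stats(long_resp))   # (3, 0.25)
print(hallucination_stats(short_resp))  # (2, 0.5) ...but a lower rate
```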
GIMMICK: Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking
arXiv.org
Large Vision-Language Models (LVLMs) have recently gained attention due to their distinctive performance and broad applicability. While it has been previously shown that their efficacy in usage scenarios involving non-Western contexts falls short, existing studies are limited in scope, covering just a narrow range of cultures, focusing exclusively on a small number of cultural aspects, or evaluating a limited selection of models on a single task only. Towards globally inclusive LVLM research, we introduce GIMMICK, an extensive multimodal benchmark designed to assess a broad spectrum of cultural knowledge across 144 countries representing six global macro-regions. GIMMICK comprises six tasks built upon three new datasets that span 728 unique cultural events or facets on which we evaluated 20 LVLMs and 11 LLMs, including five proprietary and 26 open-weight models of all sizes. We systematically examine (1) regional cultural biases, (2) the influence of model size, (3) input modalities, and (4) external cues. Our analyses reveal strong biases toward Western cultures across models and tasks and highlight strong correlations between model size and performance, as well as the effectiveness of multimodal input and external geographic cues. We further find that models have more knowledge of tangible than intangible aspects (e.g., food vs. rituals) and that they excel in recognizing broad cultural origins but struggle with a more nuanced understanding.
Towards Ethical Multi-Agent Systems of Large Language Models: A Mechanistic Interpretability Perspective
arXiv.org
Large language models (LLMs) have been widely deployed in various applications, often functioning as autonomous agents that interact with each other in multi-agent systems. While these systems have shown promise in enhancing capabilities and enabling complex tasks, they also pose significant ethical challenges. This position paper outlines a research agenda aimed at ensuring the ethical behavior of multi-agent systems of LLMs (MALMs) from the perspective of mechanistic interpretability. We identify three key research challenges: (i) developing comprehensive evaluation frameworks to assess ethical behavior at individual, interactional, and systemic levels; (ii) elucidating the internal mechanisms that give rise to emergent behaviors through mechanistic interpretability; and (iii) implementing targeted parameter-efficient alignment techniques to steer MALMs towards ethical behaviors without compromising their performance.
Glitter: A Multi-Sentence, Multi-Reference Benchmark for Gender-Fair German Machine Translation
Conference on Empirical Methods in Natural Language Processing
The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers
arXiv.org
Large language models (LLMs) can correctly answer "When was Einstein born?" yet fail to provide the same date when writing about Einstein's life, revealing a fundamental inconsistency in how models access factual knowledge across task complexities. While models display impressive accuracy on factual question-answering benchmarks, the reliability gap between simple and complex queries remains poorly understood, eroding their trustworthiness. In this work, we introduce Short-Long Form Alignment for Factual Question Answering (SLAQ), a controlled evaluation framework that compares LLMs' answers to the same factual questions asked (a) in isolation (short) vs. (b) integrated into complex queries (long). Looking at 16 LLMs across 600 queries, we find a systematic misalignment of answers to the corresponding short and long queries. We further uncover position-dependent accuracy loss and momentum effects where consecutive correct or incorrect answers create self-reinforcing patterns. Through mechanistic analysis, we find that aligned facts activate overlapping model internals, and that metrics based on mechanistic similarity can predict short-long answer alignment with up to 78% accuracy. Our work establishes factual consistency over query complexity as an important aspect of LLMs' trustworthiness and challenges current evaluation practices, which implicitly assume that good performance on simple factual queries implies reliability in more complex knowledge-seeking tasks too.
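A minimal sketch of the short-vs-long alignment check; the normalization and exact-match criterion below are placeholder assumptions, not the SLAQ implementation.

```python
def normalize(ans: str) -> str:
    return " ".join(ans.lower().strip().rstrip(".").split())

def aligned(short_answer: str, long_form_answer: str) -> bool:
    """True if the fact stated in isolation matches the one embedded in the
    long-form response (here: exact match after normalization)."""
    return normalize(short_answer) == normalize(long_form_answer)

def misalignment_rate(pairs: list[tuple[str, str]]) -> float:
    return sum(not aligned(s, l) for s, l in pairs) / len(pairs)

print(misalignment_rate([("14 March 1879", "14 march 1879"),
                         ("14 March 1879", "1875")]))  # 0.5
```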
Centurio: On Drivers of Multilingual Ability of Large Vision-Language Models
Annual Meeting of the Association for Computational Linguistics
Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data, which makes them struggle to understand non-English input and fail to generate output in the desired target language. Existing efforts mitigate these issues by adding multilingual training data, but do so in a largely ad-hoc manner, lacking insight into how different training mixes tip the scale for different groups of languages. In this work, we present a comprehensive investigation into the training strategies for massively multilingual LVLMs. First, we conduct a series of multi-stage experiments spanning 13 downstream vision-language tasks and 43 languages, systematically examining: (1) the number of training languages that can be included without degrading English performance and (2) optimal language distributions of pre-training as well as (3) instruction-tuning data. Further, we (4) investigate how to improve multilingual text-in-image understanding, and introduce a new benchmark for the task. Surprisingly, our analysis reveals that one can (i) include as many as 100 training languages simultaneously, (ii) with as little as 25-50% of non-English data, to greatly improve multilingual performance while retaining strong English performance. We further find that (iii) including non-English OCR data in pre-training and instruction-tuning is paramount for improving multilingual text-in-image understanding. Finally, we put all our findings together and train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.
Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation
arXiv.org
With the advent of large multimodal language models, science is now at a threshold of an AI-based technological transformation. Recently, a plethora of new AI models and tools has been proposed, promising to empower researchers and academics worldwide to conduct their research more effectively and efficiently. This includes all aspects of the research cycle, especially (1) searching for relevant literature; (2) generating research ideas and conducting experimentation; generating (3) text-based and (4) multimodal content (e.g., scientific figures and diagrams); and (5) AI-based automatic peer review. In this survey, we provide an in-depth overview of these exciting recent developments, which promise to fundamentally alter the scientific research process for good. Our survey covers the five aspects outlined above, indicating relevant datasets, methods and results (including evaluation) as well as limitations and scope for future research. Ethical concerns regarding shortcomings of these tools and potential for misuse (fake science, plagiarism, harms to research integrity) take a particularly prominent place in our discussion. We hope that our survey will not only become a reference guide for newcomers to the field but also a catalyst for new AI-based initiatives in the area of "AI4Science".
FineCite: A Novel Approach For Fine-Grained Citation Context Analysis
Annual Meeting of the Association for Computational Linguistics
GIMMICK: Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking
Annual Meeting of the Association for Computational Linguistics
The Echoes of Multilinguality: Tracing Cultural Value Shifts during Language Model Fine-tuning
Annual Meeting of the Association for Computational Linguistics
Decoding Multilingual Moral Preferences: Unveiling LLM's Biases Through the Moral Machine Experiment
AAAI/ACM Conference on AI, Ethics, and Society
Large language models (LLMs) increasingly find their way into the most diverse areas of our everyday lives. They indirectly influence people's decisions or opinions through their daily use. Therefore, understanding how and which moral judgements these LLMs make is crucial. However, morality is not universal and depends on the cultural background. This raises the question of whether these cultural preferences are also reflected in LLMs when prompted in different languages or whether moral decision-making is consistent across different languages. So far, most research has focused on investigating the inherent values of LLMs in English. While a few works conduct analyses of moral bias in LLMs in a multilingual setting, these analyses do not go beyond atomic actions. To the best of our knowledge, a multilingual analysis of moral bias in dilemmas has not yet been conducted. To address this, our paper builds on the moral machine experiment (MME) to investigate the moral preferences of five LLMs, Falcon, Gemini, Llama, GPT, and MPT, in a multilingual setting and compares them with the preferences collected from humans belonging to different cultures. To accomplish this, we generate 6500 scenarios of the MME and prompt the models in ten languages on which action to take. Our analysis reveals that all LLMs exhibit different moral biases to some degree and that their preferences not only differ from human preferences but also vary across languages within the models themselves. Moreover, we find that almost all models, particularly Llama 3, diverge greatly from human values and, for instance, prefer saving fewer people over saving more.
Multi³Hate: Multimodal, Multilingual, and Multicultural Hate Speech Detection with Vision-Language Models
North American Chapter of the Association for Computational Linguistics
Warning: this paper contains content that may be offensive or upsetting. Hate speech moderation on global platforms poses unique challenges due to the multimodal and multilingual nature of content, along with varying cultural perceptions. How well do current vision-language models (VLMs) navigate these nuances? To investigate this, we create the first multimodal and multilingual parallel hate speech dataset, annotated by a multicultural set of annotators, called Multi3Hate. It contains 300 parallel meme samples across 5 languages: English, German, Spanish, Hindi, and Mandarin. We demonstrate that cultural background significantly affects multimodal hate speech annotation in our dataset. The average pairwise agreement among countries is just 74%, significantly lower than that of randomly selected annotator groups. Our qualitative analysis indicates that the lowest pairwise label agreement - only 67%, between the USA and India - can be attributed to cultural factors. We then conduct experiments with 5 large VLMs in a zero-shot setting, finding that these models align more closely with annotations from the US than with those from other cultures, even when the memes and prompts are presented in the dominant language of the other culture. Code and dataset are available at https://github.com/MinhDucBui/Multi3Hate.
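The average pairwise agreement statistic can be sketched as follows, assuming one label vector per country over the same items (a simplified reading, not the paper's exact computation).

```python
from itertools import combinations

def avg_pairwise_agreement(labels_by_country: dict[str, list[int]]) -> float:
    """Average fraction of items on which two countries' annotations agree."""
    scores = []
    for a, b in combinations(sorted(labels_by_country), 2):
        la, lb = labels_by_country[a], labels_by_country[b]
        scores.append(sum(x == y for x, y in zip(la, lb)) / len(la))
    return sum(scores) / len(scores)

print(avg_pairwise_agreement({
    "USA":     [1, 0, 1, 1],
    "India":   [1, 1, 0, 1],  # agrees with USA on 2 of 4 items
    "Germany": [1, 0, 1, 0],
}))  # 0.5
```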
Ethical Concern Identification in NLP: A Corpus of ACL Anthology Ethics Statements
North American Chapter of the Association for Computational Linguistics
What ethical concerns, if any, do LLM researchers have? We introduce EthiCon, a corpus of 1,580 ethical concern statements extracted from scientific papers published in the ACL Anthology. We extract ethical concern keywords from the statements and show promising results in automating the concern identification process. Through a survey, we compare the ethical concerns of the corpus to the concerns listed by the general public and professionals in the field. Finally, we compare our retrieved ethical concerns with existing taxonomies pointing to gaps and future research directions.
Why do LLaVA Vision-Language Models Reply to Images in English?
Conference on Empirical Methods in Natural Language Processing
We uncover a surprising multilingual bias occurring in a popular class of multimodal vision-language models (VLMs). Including an image in the query to a LLaVA-style VLM significantly increases the likelihood of the model returning an English response, regardless of the language of the query. This paper investigates the causes of this loss with a two-pronged approach that combines extensive ablation of the design space with a mechanistic analysis of the models' internal representations of image and text inputs. Both approaches indicate that the issue stems from the language modelling component of the LLaVA model. Statistically, we find that switching the language backbone for a bilingual language model has the strongest effect on reducing this error. Mechanistically, we provide compelling evidence that visual inputs are not mapped to a similar space as text ones, and that intervening on intermediary attention layers can reduce this bias. Our findings provide important insights to researchers and engineers seeking to understand the crossover between multimodal and multilingual spaces, and contribute to the goal of developing capable and inclusive VLMs for non-English contexts.
Local Contrastive Editing of Gender Stereotypes
Conference on Empirical Methods in Natural Language Processing
Stereotypical bias encoded in language models (LMs) poses a threat to safe language technology, yet our understanding of how bias manifests in the parameters of LMs remains incomplete. We introduce local contrastive editing that enables the localization and editing of a subset of weights in a target model in relation to a reference model. We deploy this approach to identify and modify subsets of weights that are associated with gender stereotypes in LMs. Through a series of experiments we demonstrate that local contrastive editing can precisely localize and control a small subset (< 0.5%) of weights that encode gender bias. Our work (i) advances our understanding of how stereotypical biases can manifest in the parameter space of LMs and (ii) opens up new avenues for developing parameter-efficient strategies for controlling model properties in a contrastive manner.
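A sketch of how localization via weight differences between a target and a reference model might look; the selection criterion and the revert-to-reference edit below are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def localize_contrastive_weights(target_sd, reference_sd, frac=0.005):
    """Mark the top `frac` of weights with the largest absolute difference
    between a target and a reference model's state dicts."""
    diffs = {k: (target_sd[k] - reference_sd[k]).abs() for k in target_sd}
    flat = torch.cat([d.flatten() for d in diffs.values()])
    k = max(1, int(frac * flat.numel()))
    threshold = flat.topk(k).values.min()
    return {name: d >= threshold for name, d in diffs.items()}

def edit_to_reference(target_sd, reference_sd, masks):
    # one possible edit: revert the localized weights to the reference values
    return {k: torch.where(masks[k], reference_sd[k], target_sd[k])
            for k in target_sd}
```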
WinoPron: Revisiting English Winogender Schemas for Consistency, Coverage, and Grammatical Case
CRAC
While measuring bias and robustness in coreference resolution are important goals, such measurements are only as good as the tools we use to measure them. Winogender Schemas (Rudinger et al., 2018) are an influential dataset proposed to evaluate gender bias in coreference resolution, but a closer look reveals issues with the data that compromise its use for reliable evaluation, including treating different pronominal forms as equivalent, violations of template constraints, and typographical errors. We identify these issues and fix them, contributing a new dataset: WinoPron. Using WinoPron, we evaluate two state-of-the-art supervised coreference resolution systems, SpanBERT and five sizes of FLAN-T5, and demonstrate that accusative pronouns are harder to resolve for all models. We also propose a new method to evaluate pronominal bias in coreference resolution that goes beyond the binary. With this method, we also show that bias characteristics vary not just across pronoun sets (e.g., he vs. she), but also across surface forms of those sets (e.g., him vs. his).
GeFMT: Gender-Fair Language in German Machine Translation
European Association for Machine Translation Conferences/Workshops
Stop! In the Name of Flaws: Disentangling Personal Names and Sociodemographic Attributes in NLP
GEBNLP
Personal names simultaneously differentiate individuals and categorize them in ways that are important in a given society. While the natural language processing community has thus associated personal names with sociodemographic characteristics in a variety of tasks, researchers have engaged to varying degrees with the established methodological problems in doing so. To guide future work that uses names and sociodemographic characteristics, we provide an overview of relevant research: first, we present an interdisciplinary background on names and naming. We then survey the issues inherent to associating names with sociodemographic attributes, covering problems of validity (e.g., systematic error, construct validity), as well as ethical concerns (e.g., harms, differential impact, cultural insensitivity). Finally, we provide guiding questions along with normative recommendations to avoid validity and ethical pitfalls when dealing with names and sociodemographic characteristics in natural language processing.
What Can Natural Language Processing Do for Peer Review?
arXiv.org
The number of scientific articles produced every year is growing rapidly. Providing quality control over them is crucial for scientists and, ultimately, for the public good. In modern science, this process is largely delegated to peer review -- a distributed procedure in which each submission is evaluated by several independent experts in the field. Peer review is widely used, yet it is hard, time-consuming, and prone to error. Since the artifacts involved in peer review -- manuscripts, reviews, discussions -- are largely text-based, Natural Language Processing has great potential to improve reviewing. As the emergence of large language models (LLMs) has enabled NLP assistance for many new tasks, the discussion on machine-assisted peer review is picking up the pace. Yet, where exactly is help needed, where can NLP help, and where should it stand aside? The goal of our paper is to provide a foundation for the future efforts in NLP for peer-reviewing assistance. We discuss peer review as a general process, exemplified by reviewing at AI conferences. We detail each step of the process from manuscript submission to camera-ready revision, and discuss the associated challenges and opportunities for NLP assistance, illustrated by existing work. We then turn to the big challenges in NLP for peer review as a whole, including data acquisition and licensing, operationalization and experimentation, and ethical issues. To help consolidate community efforts, we create a companion repository that aggregates key datasets pertaining to peer review. Finally, we issue a detailed call for action for the scientific community, NLP and AI researchers, policymakers, and funding bodies to help bring the research in NLP for peer review forward. We hope that our work will help set the agenda for research in machine-assisted scientific quality control in the age of AI, within the NLP community and beyond.
The Lou Dataset - Exploring the Impact of Gender-Fair Language in German Text Classification
Conference on Empirical Methods in Natural Language Processing
Gender-fair language, an evolving linguistic variation in German, fosters inclusion by addressing all genders or using neutral forms. However, there is a notable lack of resources to assess the impact of this language shift on language models (LMs), which might not have been trained on examples of this variation. Addressing this gap, we present Lou, the first dataset providing high-quality reformulations for German text classification covering seven tasks, like stance detection and toxicity classification. We evaluate 16 mono- and multi-lingual LMs and find substantial label flips, reduced prediction certainty, and significantly altered attention patterns. However, existing evaluations remain valid, as LM rankings are consistent across original and reformulated instances. Our study provides initial insights into the impact of gender-fair language on classification for German. However, these findings are likely transferable to other languages, as we found consistent patterns in multi-lingual and English LMs.
Robust Pronoun Fidelity with English LLMs: Are they Reasoning, Repeating, or Just Biased?
Transactions of the Association for Computational Linguistics
Abstract Robust, faithful, and harm-free pronoun use for individuals is an important goal for language model development as their use increases, but prior work tends to study only one or two of these characteristics at a time. To measure progress towards the combined goal, we introduce the task of pronoun fidelity: Given a context introducing a co-referring entity and pronoun, the task is to reuse the correct pronoun later. We present RUFF, a carefully designed dataset of over 5 million instances to measure robust pronoun fidelity in English, and we evaluate 37 model variants from nine popular families, across architectures (encoder-only, decoder-only, and encoder-decoder) and scales (11M-70B parameters). When an individual is introduced with a pronoun, models can mostly faithfully reuse this pronoun in the next sentence, but they are significantly worse with she/her/her, singular they, and neopronouns. Moreover, models are easily distracted by non-adversarial sentences discussing other people; even one sentence with a distractor pronoun causes accuracy to drop on average by 34 percentage points. Our results show that pronoun fidelity is not robust, in a simple, naturalistic setting where humans achieve nearly 100% accuracy. We encourage researchers to bridge the gaps we find and to carefully evaluate reasoning in settings where superficial repetition might inflate perceptions of model performance.
SoK: Towards Security and Safety of Edge AI
arXiv.org
Advanced AI applications have become increasingly available to a broad audience, e.g., as centrally managed large language models (LLMs). Such centralization is both a risk and a performance bottleneck - Edge AI promises to be a solution to these problems. However, its decentralized approach raises additional challenges regarding security and safety. In this paper, we argue that both of these aspects are critical for Edge AI, and even more so, their integration. Concretely, we survey security and safety threats, summarize existing countermeasures, and collect open challenges as a call for more research in this area.
Argument Quality Assessment in the Age of Instruction-Following Large Language Models
International Conference on Language Resources and Evaluation
The computational treatment of arguments on controversial issues has been subject to extensive NLP research, due to its envisioned impact on opinion formation, decision making, writing education, and the like. A critical task in any such application is the assessment of an argument’s quality - but it is also particularly challenging. In this position paper, we start from a brief survey of argument quality research, where we identify the diversity of quality notions and the subjectiveness of their perception as the main hurdles towards substantial progress on argument quality assessment. We argue that the capabilities of instruction-following large language models (LLMs) to leverage knowledge across contexts enable a much more reliable assessment. Rather than just fine-tuning LLMs towards leaderboard chasing on assessment tasks, they need to be instructed systematically with argumentation theories and scenarios as well as with ways to solve argument-related problems. We discuss the real-world opportunities and ethical issues emerging thereby.
What the Weight?! A Unified Framework for Zero-Shot Knowledge Composition
Findings
The knowledge encapsulated in a model is the core factor determining its final performance on downstream tasks. Much research in NLP has focused on efficient methods for storing and adapting different types of knowledge, e.g., in dedicated modularized structures, and on how to effectively combine these, e.g., by learning additional parameters. However, given the many possible options, a thorough understanding of the mechanisms involved in these compositions is missing, and hence it remains unclear which strategies to utilize. To address this research gap, we propose a novel framework for zero-shot module composition, which encompasses existing and some novel variations for selecting, weighting, and combining parameter modules under a single unified notion. Focusing on the scenario of domain knowledge and adapter layers, our framework provides a systematic unification of concepts, allowing us to conduct the first comprehensive benchmarking study of various zero-shot knowledge composition strategies. In particular, we test two module combination methods and five selection and weighting strategies for their effectiveness and efficiency in an extensive experimental setup. Our results highlight the efficacy of ensembling but also hint at the power of simple though often-ignored weighting methods. Further in-depth analyses allow us to understand the role of weighting vs. top-k selection, and show that, to a certain extent, the performance of adapter composition can even be predicted.
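A minimal sketch of zero-shot output composition under the framework's selecting/weighting/combining view; the relevance scores are assumed given (e.g., from some domain-similarity heuristic), since no additional parameters are learned in the zero-shot setting.

```python
import torch

def compose_modules(module_outputs: list[torch.Tensor],
                    scores: list[float]) -> torch.Tensor:
    """Zero-shot output ensembling: combine adapter-layer outputs using
    weights derived from (hypothetical) relevance scores, no training needed."""
    weights = torch.softmax(torch.tensor(scores), dim=0)
    return sum(w * h for w, h in zip(weights, module_outputs))

# e.g. three domain adapters' hidden states for one token, weighted by
# some domain-similarity score for the target input
h = [torch.randn(768) for _ in range(3)]
print(compose_modules(h, scores=[0.7, 0.2, 0.1]).shape)  # torch.Size([768])
```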
Large language models for human-machine collaborative particle accelerator tuning through natural language
Science Advances
Autonomous tuning of particle accelerators is an active and challenging research field with the goal of enabling advanced accelerator technologies and cutting-edge high-impact applications, such as physics discovery, cancer research, and material sciences. A challenge with autonomous accelerator tuning remains that the most capable algorithms require experts in optimization and machine learning to implement them for every new tuning task. Here, we propose the use of large language models (LLMs) to tune particle accelerators. We demonstrate on a proof-of-principle example the ability of LLMs to tune an accelerator subsystem based on only a natural language prompt from the operator, and compare their performance to state-of-the-art optimization algorithms, such as Bayesian optimization and reinforcement learning–trained optimization. In doing so, we also show how LLMs can perform numerical optimization of a nonlinear real-world objective. Ultimately, this work represents another complex task that LLMs can solve and promises to help accelerate the deployment of autonomous tuning algorithms to day-to-day particle accelerator operations.
The Echoes of Multilinguality: Tracing Cultural Value Shifts during LM Fine-tuning
arXiv.org
Texts written in different languages reflect different culturally-dependent beliefs of their writers. Thus, we expect multilingual LMs (MLMs), which are jointly trained on a concatenation of text in multiple languages, to encode different cultural values for each language. Yet, as the 'multilinguality' of these LMs is driven by cross-lingual sharing, we also have reason to believe that cultural values bleed over from one language into another. This limits the use of MLMs in practice, as apart from being proficient in generating text in multiple languages, creating language technology that can serve a community also requires the output of LMs to be sensitive to their biases (Naous et al., 2023). Yet, little is known about how cultural values emerge and evolve in MLMs (Hershcovich et al., 2022a). We are the first to study how languages can exert influence on the cultural values encoded for different test languages, by studying how such values are revised during fine-tuning. Focusing on the fine-tuning stage allows us to study the interplay between value shifts when exposed to new linguistic experience from different data sources and languages. Lastly, we use a training data attribution method to find patterns in the fine-tuning examples, and the languages that they come from, that tend to instigate value shifts.
Sparks of Fairness: Preliminary Evidence of Commercial Machine Translation as English-to-German Gender-Fair Dictionaries
GITT
Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ
Annual Meeting of the Association for Computational Linguistics
Large language models (LLMs) need to serve everyone, including a global majority of non-English speakers. However, most LLMs today, and open LLMs in particular, are often intended for use in just English (e.g. Llama2, Mistral) or a small handful of high-resource languages (e.g. Mixtral, Qwen). Recent research shows that, despite limits in their intended use, people prompt LLMs in many different languages. Therefore, in this paper, we investigate the basic multilingual capabilities of state-of-the-art open LLMs beyond their intended use. For this purpose, we introduce MultiQ, a new silver standard benchmark for basic open-ended question answering with 27.4k test questions across a typologically diverse set of 137 languages. With MultiQ, we evaluate language fidelity, i.e. whether models respond in the prompted language, and question answering accuracy. All LLMs we test respond faithfully and/or accurately for at least some languages beyond their intended use. Most models are more accurate when they respond faithfully. However, differences across models are large, and there is a long tail of languages where models are neither accurate nor faithful. We explore differences in tokenization as a potential explanation for our findings, identifying possible correlations that warrant further investigation.
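Language fidelity can be sketched with an off-the-shelf language identifier; langdetect here is an illustrative stand-in, not necessarily the tool used for MultiQ, and the substring accuracy check is a deliberate simplification.

```python
from langdetect import detect  # third-party: pip install langdetect

def fidelity_and_accuracy(prompt_lang: str, response: str,
                          gold_answer: str) -> tuple[bool, bool]:
    """Language fidelity: does the model answer in the prompted language?
    Accuracy here is a naive substring check, purely for illustration."""
    faithful = detect(response) == prompt_lang
    accurate = gold_answer.lower() in response.lower()
    return faithful, accurate

print(fidelity_and_accuracy(
    "de", "Die Hauptstadt von Frankreich ist Paris.", "Paris"))  # (True, True)
```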
Exploring Jiu-Jitsu Argumentation for Writing Peer Review Rebuttals
Conference on Empirical Methods in Natural Language Processing
In many domains of argumentation, people's arguments are driven by so-called attitude roots, i.e., underlying beliefs and world views, and their corresponding attitude themes. Given the strength of these latent drivers of arguments, recent work in psychology suggests that instead of directly countering surface-level reasoning (e.g., falsifying given premises), one should follow an argumentation style inspired by the Jiu-Jitsu 'soft' combat system (Hornsey and Fielding, 2017): first, identify an arguer's attitude roots and themes, and then choose a prototypical rebuttal that is aligned with those drivers instead of invalidating those. In this work, we are the first to explore Jiu-Jitsu argumentation for peer review by proposing the novel task of attitude and theme-guided rebuttal generation. To this end, we enrich an existing dataset for discourse structure in peer reviews with attitude roots, attitude themes, and canonical rebuttals. To facilitate this process, we recast established annotation concepts from the domain of peer reviews (e.g., aspects a review sentence is relating to) and train domain-specific models. We then propose strong rebuttal generation strategies, which we benchmark on our novel dataset for the task of end-to-end attitude and theme-guided rebuttal generation and two subtasks.
Values, Ethics, Morals? On the Use of Moral Concepts in NLP Research
Conference on Empirical Methods in Natural Language Processing
With language technology increasingly affecting individuals' lives, many recent works have investigated the ethical aspects of NLP. Among other topics, researchers focused on the notion of morality, investigating, for example, which moral judgements language models make. However, there has been little to no discussion of the terminology and the theories underpinning those efforts and their implications. This lack is highly problematic, as it hides the works' underlying assumptions and hinders a thorough and targeted scientific debate of morality in NLP. In this work, we address this research gap by (a) providing an overview of some important ethical concepts stemming from philosophy and (b) systematically surveying the existing literature on moral NLP w.r.t. their philosophical foundation, terminology, and data basis. For instance, we analyse what ethical theory an approach is based on, how this decision is justified, and what implications it entails. Our findings surveying 92 papers show that, for instance, most papers neither provide a clear definition of the terms they use nor adhere to definitions from philosophy. Finally, (c) we give three recommendations for future research in the field. We hope our work will lead to a more informed, careful, and sound discussion of morality in language technology.
Stereotypes and Smut: The (Mis)representation of Non-cisgender Identities by Text-to-Image Models
Annual Meeting of the Association for Computational Linguistics
Cutting-edge image generation has been praised for producing high-quality images, suggesting a ubiquitous future in a variety of applications. However, initial studies have pointed to the potential for harm due to predictive bias, reflecting and potentially reinforcing cultural stereotypes. In this work, we are the first to investigate how multimodal models handle diverse gender identities. Concretely, we conduct a thorough analysis in which we compare the output of three image generation models for prompts containing cisgender vs. non-cisgender identity terms. Our findings demonstrate that certain non-cisgender identities are consistently (mis)represented as less human, more stereotyped and more sexualised. We complement our experimental analysis with (a) a survey among non-cisgender individuals and (b) a series of interviews, to establish which harms affected individuals anticipate, and how they would like to be represented. We find respondents are particularly concerned about misrepresentation, and the potential to drive harmful behaviours and beliefs. Simple heuristics to limit offensive content are widely rejected, and instead respondents call for community involvement, curated training data and the ability to customise. These improvements could pave the way for a future where change is led by the affected community, and technology is used to positively "[portray] queerness in ways that we haven't even thought of" rather than reproducing stale, offensive stereotypes.
A Tale of Pronouns: Interpretability Informs Gender Bias Mitigation for Fairer Instruction-Tuned Machine Translation
Conference on Empirical Methods in Natural Language Processing
Recent instruction fine-tuned models can solve multiple NLP tasks when prompted to do so, with machine translation (MT) being a prominent use case. However, current research often focuses on standard performance benchmarks, leaving compelling fairness and ethical considerations behind. In MT, this might lead to misgendered translations, resulting, among other harms, in the perpetuation of stereotypes and prejudices. In this work, we address this gap by investigating whether and to what extent such models exhibit gender bias in machine translation and how we can mitigate it. Concretely, we compute established gender bias metrics on the WinoMT corpus from English to German and Spanish. We discover that IFT models default to male-inflected translations, even disregarding female occupational stereotypes. Next, using interpretability methods, we unveil that models systematically overlook the pronoun indicating the gender of a target occupation in misgendered translations. Finally, based on this finding, we propose an easy-to-implement and effective bias mitigation solution based on few-shot learning that leads to significantly fairer translations.
AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ
International Conference on Learning Representations
Generating bitmap graphics from text has gained considerable attention, yet for scientific figures, vector graphics are often preferred. Given that vector graphics are typically encoded using low-level graphics primitives, generating them directly is difficult. To address this, we propose the use of TikZ, a well-known abstract graphics language that can be compiled to vector graphics, as an intermediate representation of scientific figures. TikZ offers human-oriented, high-level commands, thereby facilitating conditional language modeling with any large language model. To this end, we introduce DaTikZ, the first large-scale TikZ dataset consisting of 120k TikZ drawings aligned with captions. We fine-tune LLaMA on DaTikZ, as well as our new model CLiMA, which augments LLaMA with multimodal CLIP embeddings. In both human and automatic evaluation, CLiMA and LLaMA outperform commercial GPT-4 and Claude 2 in terms of similarity to human-created figures, with CLiMA additionally improving text-image alignment. Our detailed analysis shows that all models generalize well and are not susceptible to memorization. GPT-4 and Claude 2, however, tend to generate more simplistic figures compared to both humans and our models. We make our framework, AutomaTikZ, along with model weights and datasets, publicly available.
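For readers unfamiliar with TikZ, here is a tiny example (not taken from DaTikZ) of the kind of human-oriented, high-level commands that make it a convenient intermediate representation for language models.

```latex
% A minimal standalone TikZ figure: a few readable commands that
% compile down to vector graphics.
\documentclass[tikz]{standalone}
\begin{document}
\begin{tikzpicture}
  \draw[->] (0,0) -- (3,0) node[right] {$x$};
  \draw[->] (0,0) -- (0,2) node[above] {$y$};
  \draw[domain=0:2.8, smooth, thick] plot (\x, {0.2*\x*\x});
\end{tikzpicture}
\end{document}
```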
ScaLearn: Simple and Highly Parameter-Efficient Task Transfer by Learning to Scale
Annual Meeting of the Association for Computational Linguistics
Multi-task learning (MTL) has shown considerable practical benefits, particularly when using language models (LMs). While this is commonly achieved by learning n tasks under a joint optimization procedure, some methods, such as AdapterFusion, divide the problem into two stages: (i) task learning, where knowledge specific to a task is encapsulated within sets of parameters (e.g., adapters), and (ii) transfer, where this already learned knowledge is leveraged for a target task. This separation of concerns provides numerous benefits (e.g., promoting reusability). However, current two-stage MTL introduces a substantial number of additional parameters. We address this issue by leveraging the usefulness of linearly scaling the output representations of source adapters for transfer learning. We introduce ScaLearn, a simple and highly parameter-efficient two-stage MTL method that capitalizes on the knowledge of the source tasks by learning a minimal set of scaling parameters that enable effective transfer to a target task. Our experiments on three benchmarks (GLUE, SuperGLUE, and HumSet) and two encoder LMs show that ScaLearn consistently outperforms strong baselines with a small number of transfer parameters (~0.35% of those of AdapterFusion). Remarkably, we observe that ScaLearn maintains its strong abilities even when further reducing parameters, achieving competitive results with only 8 transfer parameters per target task. Our proposed approach thus demonstrates the power of simple scaling as a promise for more efficient task transfer.
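A minimal sketch of the core idea, learning only scaling parameters over frozen source-adapter outputs; the module below is an illustrative reconstruction, not the released ScaLearn code.

```python
import torch
import torch.nn as nn

class ScaLearnCombiner(nn.Module):
    """Sketch of two-stage transfer via learned scaling: the source adapters
    stay frozen; only one scalar per source task is trained."""
    def __init__(self, n_sources: int):
        super().__init__()
        self.scales = nn.Parameter(torch.full((n_sources,), 1.0 / n_sources))

    def forward(self, source_outputs):
        # source_outputs: list of [batch, dim] tensors from frozen task adapters
        return sum(s * h for s, h in zip(self.scales, source_outputs))

combiner = ScaLearnCombiner(n_sources=3)
outs = [torch.randn(2, 768) for _ in range(3)]
print(combiner(outs).shape)                            # torch.Size([2, 768])
print(sum(p.numel() for p in combiner.parameters()))   # 3 transfer parameters
```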
Sensitivity, Performance, Robustness: Deconstructing the Effect of Sociodemographic Prompting
Conference of the European Chapter of the Association for Computational Linguistics
Annotators’ sociodemographic backgrounds (i.e., the individual compositions of their gender, age, educational background, etc.) have a strong impact on their decisions when working on subjective NLP tasks, such as toxic language detection. Often, heterogeneous backgrounds result in high disagreements. To model this variation, recent work has explored sociodemographic prompting, a technique that steers the output of prompt-based models towards answers that humans with specific sociodemographic profiles would give. However, the available NLP literature disagrees on the efficacy of this technique — it remains unclear for which tasks and scenarios it can help, and the role of the individual factors in sociodemographic prompting is still unexplored. We address this research gap by presenting the largest and most comprehensive study of sociodemographic prompting to date. We use it to analyze its influence on model sensitivity, performance and robustness across seven datasets and six instruction-tuned model families. We show that sociodemographic information affects model predictions and can be beneficial for improving zero-shot learning in subjective NLP tasks. However, its outcomes largely vary for different model types, sizes, and datasets, and are subject to large variance with regards to prompt formulations. Most importantly, our results show that sociodemographic prompting should be used with care for data annotation or when studying LLM alignment.
What about “em”? How Commercial Machine Translation Fails to Handle (Neo-)Pronouns
Annual Meeting of the Association for Computational Linguistics
As 3rd-person pronoun usage shifts to include novel forms, e.g., neopronouns, we need more research on identity-inclusive NLP. Exclusion is particularly harmful in one of the most popular NLP applications, machine translation (MT). Wrong pronoun translations can discriminate against marginalized groups, e.g., non-binary individuals (Dev et al., 2021). In this “reality check”, we study how three commercial MT systems translate 3rd-person pronouns. Concretely, we compare the translations of gendered vs. gender-neutral pronouns from English to five other languages (Danish, Farsi, French, German, Italian), and vice versa, from Danish to English. Our error analysis shows that the presence of a gender-neutral pronoun often leads to grammatical and semantic translation errors. Similarly, gender neutrality is often not preserved. By surveying the opinions of affected native speakers from diverse languages, we provide recommendations to address the issue in future MT research.
SocioProbe: What, When, and Where Language Models Learn about Sociodemographics
Conference on Empirical Methods in Natural Language Processing
Pre-trained language models (PLMs) have outperformed other NLP models on a wide range of tasks. Aiming for a more thorough understanding of their capabilities and inner workings, researchers have established the extent to which they capture lower-level knowledge like grammaticality, and mid-level semantic knowledge like factual understanding. However, there is still little understanding of their knowledge of higher-level aspects of language. In particular, despite the importance of sociodemographic aspects in shaping our language, the questions of whether, where, and how PLMs encode these aspects, e.g., gender or age, remain unexplored. We address this research gap by probing the sociodemographic knowledge of different single-GPU PLMs on multiple English data sets via traditional classifier probing and information-theoretic minimum description length probing. Our results show that PLMs do encode these sociodemographics, and that this knowledge is sometimes spread across the layers of some of the tested PLMs. We further conduct a multilingual analysis and investigate the effect of supplementary training to further explore to what extent, where, and with what amount of pre-training data the knowledge is encoded. Our overall results indicate that sociodemographic knowledge is still a major challenge for NLP: PLMs require large amounts of pre-training data to acquire the knowledge, and models that excel in general language understanding do not seem to possess more knowledge about these aspects.
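A minimal sketch of the traditional classifier-probing setup over frozen PLM representations is shown below (the MDL variant would instead report an online-coding description length); the encoder checkpoint, pooling strategy, and probe classifier are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch of a classifier probe: freeze the PLM, extract layer-wise
# representations, and fit a lightweight classifier on a sociodemographic label.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

@torch.no_grad()
def embed(texts: list[str], layer: int = -1) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch, output_hidden_states=True).hidden_states[layer]
    return hidden.mean(dim=1)  # mean-pool token representations

def probe(train_texts, train_labels, test_texts, test_labels, layer=-1) -> float:
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embed(train_texts, layer).numpy(), train_labels)
    return clf.score(embed(test_texts, layer).numpy(), test_labels)
```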
Fair and Argumentative Language Modeling for Computational Argumentation
Annual Meeting of the Association for Computational Linguistics
Although much work in NLP has focused on measuring and mitigating stereotypical bias in semantic spaces, research addressing bias in computational argumentation is still in its infancy. In this paper, we address this research gap and conduct a thorough investigation of bias in argumentative language models. To this end, we introduce ABBA, a novel resource for bias measurement specifically tailored to argumentation. We employ our resource to assess the effect of argumentative fine-tuning and debiasing on the intrinsic bias found in transformer-based language models using a lightweight adapter-based approach that is more sustainable and parameter-efficient than full fine-tuning. Finally, we analyze the potential impact of language model debiasing on the performance in argument quality prediction, a downstream task of computational argumentation. Our results show that we are able to successfully and sustainably remove bias in general and argumentative language models while preserving (and sometimes improving) model performance in downstream tasks. We make all experimental code and data available at https://github.com/umanlp/FairArgumentativeLM.
Can Demographic Factors Improve Text Classification? Revisiting Demographic Adaptation in the Age of Transformers
Findings
Demographic factors (e.g., gender or age) shape our language. Previous work showed that incorporating demographic factors can consistently improve performance for various NLP tasks with traditional NLP models. In this work, we investigate whether these previous findings still hold with state-of-the-art pretrained Transformer-based language models (PLMs). We use three common specialization methods proven effective for incorporating external knowledge into pretrained Transformers (e.g., domain-specific or geographic knowledge). We adapt the language representations for the demographic dimensions of gender and age, using continuous language modeling and dynamic multi-task learning for adaptation, where we couple language modeling objectives with the prediction of demographic classes. Our results, when employing a multilingual PLM, show substantial gains in task performance across four languages (English, German, French, and Danish), which is consistent with the results of previous work. However, controlling for confounding factors – primarily domain and language proficiency of Transformer-based PLMs – shows that downstream performance gains from our demographic adaptation do not actually stem from demographic knowledge. Our results indicate that demographic specialization of PLMs, while holding promise for positive societal impact, still represents an unsolved problem for (modern) NLP.
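A minimal sketch of the coupled objective follows, assuming a fixed interpolation weight and [CLS]-style pooling; the paper's dynamic multi-task learning would instead adapt the weighting during training, so these details are illustrative assumptions.

```python
# Minimal sketch of demographic adaptation: a masked-LM loss coupled with
# demographic class prediction on the same batch.
import torch
import torch.nn as nn
from transformers import AutoModelForMaskedLM

class DemographicAdapter(nn.Module):
    def __init__(self, model_name: str, n_classes: int, alpha: float = 0.5):
        super().__init__()
        self.lm = AutoModelForMaskedLM.from_pretrained(
            model_name, output_hidden_states=True
        )
        self.classifier = nn.Linear(self.lm.config.hidden_size, n_classes)
        self.alpha = alpha  # static interpolation; dynamic MTL would adapt this

    def forward(self, input_ids, attention_mask, mlm_labels, demo_labels):
        out = self.lm(input_ids=input_ids, attention_mask=attention_mask,
                      labels=mlm_labels)
        pooled = out.hidden_states[-1][:, 0]  # first-token ([CLS]-style) pooling
        cls_loss = nn.functional.cross_entropy(self.classifier(pooled), demo_labels)
        return self.alpha * out.loss + (1 - self.alpha) * cls_loss
```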
Back to the Future: On Potential Histories in NLP
arXiv.org
Machine learning and NLP require the construction of datasets to train and fine-tune models. In this context, previous work has demonstrated the sensitivity of these datasets. For instance, potential societal biases in this data are likely to be encoded and to be amplified in the models we deploy. In this work, we draw from developments in the field of history and take a novel perspective on these problems: considering datasets and models through the lens of historical fiction surfaces their political nature, and affords re-configuring how we view the past, such that marginalized discourses are surfaced. Building on such insights, we argue that contemporary methods for machine learning are prejudiced towards dominant and hegemonic histories. Employing the example of neopronouns, we show that by surfacing marginalized histories within contemporary conditions, we can create models that better represent the lived realities of traditionally marginalized and excluded communities.
FineDeb: A Debiased Finetuning Approach for Language Models
Welcome to the Modern World of Pronouns: Identity-Inclusive Natural Language Processing beyond Gender
International Conference on Computational Linguistics
The world of pronouns is changing – from a closed word class with few members to an open set of terms to reflect identities. However, Natural Language Processing (NLP) barely reflects this linguistic shift, resulting in the possible exclusion of non-binary users, even though recent work outlined the harms of gender-exclusive language technology. The current modeling of 3rd person pronouns is particularly problematic. It largely ignores various phenomena like neopronouns, i.e., novel pronoun sets that are not (yet) widely established. This omission contributes to the discrimination of marginalized and underrepresented groups, e.g., non-binary individuals. It thus prevents gender equality, one of the UN’s sustainable development goals (goal 5). Further, other identity expressions beyond gender are ignored by current NLP technology. This paper provides an overview of 3rd person pronoun issues for NLP. Based on our observations and ethical considerations, we define a series of five desiderata for modeling pronouns in language technology, which we validate through a survey. We evaluate existing and novel modeling approaches w.r.t. these desiderata qualitatively and quantify the impact of a more discrimination-free approach on an established benchmark dataset.
Privacy, Interpretability, and Fairness in the Multilingual Space
Multi-task Citation Content Analysis for Clinical Research Publications
Bridging Fairness and Environmental Sustainability in Natural Language Processing
Conference on Empirical Methods in Natural Language Processing
Fairness and environmental impact are important research directions for the sustainable development of artificial intelligence. However, while each topic is an active research area in natural language processing (NLP), there is a surprising lack of research on the interplay between the two fields. This lacuna is highly problematic, since there is increasing evidence that an exclusive focus on fairness can actually hinder environmental sustainability, and vice versa. In this work, we shed light on this crucial intersection in NLP by (1) investigating the efficiency of current fairness approaches through surveying example methods for reducing unfair stereotypical bias from the literature, and (2) evaluating a common technique to reduce energy consumption (and thus environmental impact) of English NLP models, knowledge distillation (KD), for its impact on fairness. In this case study, we evaluate the effect of important KD factors, including layer and dimensionality reduction, with respect to: (a) performance on the distillation task (natural language inference and semantic similarity prediction), and (b) multiple measures and dimensions of stereotypical bias (e.g., gender bias measured via the Word Embedding Association Test). Our results lead us to clarify current assumptions regarding the effect of KD on unfair bias: contrary to other findings, we show that KD can actually decrease model fairness.
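For reference, the standard knowledge-distillation objective evaluated in this case study can be sketched as follows: a temperature-scaled KL term between teacher and student logits, interpolated with the regular task loss. Temperature and interpolation values are illustrative.

```python
# Minimal sketch of the knowledge-distillation loss: soften both teacher and
# student distributions with temperature T, then interpolate with the hard loss.
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes are independent of T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```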
On the Limitations of Sociodemographic Adaptation with Transformers
arXiv.org
Sociodemographic factors (e.g., gender or age) shape our language. Previous work showed that incorporating specific sociodemographic factors can consistently improve performance for various NLP tasks in traditional NLP models. We investigate whether these previous findings still hold with state-of-the-art pretrained Transformers. We use three common specialization methods proven effective for incorporating external knowledge into pretrained Transformers (e.g., domain-specific or geographic knowledge). We adapt the language representations for the sociodemographic dimensions of gender and age, using continuous language modeling and dynamic multi-task learning for adaptation, where we couple language modeling with the prediction of a sociodemographic class. Our results, when employing a multilingual model, show substantial performance gains across four languages (English, German, French, and Danish). These findings are in line with the results of previous work and hold promise for successful sociodemographic specialization. However, controlling for confounding factors like domain and language shows that, while sociodemographic adaptation does improve downstream performance, the gains do not always solely stem from sociodemographic knowledge. Our results indicate that sociodemographic specialization, while very important, is still an unresolved problem in NLP.
Multi2WOZ: A Robust Multilingual Dataset and Conversational Pretraining for Task-Oriented Dialog
North American Chapter of the Association for Computational Linguistics
Research on (multi-domain) task-oriented dialog (TOD) has predominantly focused on the English language, primarily due to the shortage of robust TOD datasets in other languages, preventing the systematic investigation of cross-lingual transfer for this crucial NLP application area. In this work, we introduce Multi2WOZ, a new multilingual multi-domain TOD dataset, derived from the well-established English dataset MultiWOZ, that spans four typologically diverse languages: Chinese, German, Arabic, and Russian. In contrast to concurrent efforts, Multi2WOZ contains gold-standard dialogs in target languages that are directly comparable with development and test portions of the English dataset, enabling reliable and comparative estimates of cross-lingual transfer performance for TOD. We then introduce a new framework for multilingual conversational specialization of pretrained language models (PrLMs) that aims to facilitate cross-lingual transfer for arbitrary downstream TOD tasks. Using such conversational PrLMs specialized for concrete target languages, we systematically benchmark a number of zero-shot and few-shot cross-lingual transfer approaches on two standard TOD tasks: Dialog State Tracking and Response Retrieval. Our experiments show that, in most setups, the best performance entails the combination of (i) conversational specialization in the target language and (ii) few-shot transfer for the concrete TOD task. Most importantly, we show that our conversational specialization in the target language allows for an exceptionally sample-efficient few-shot transfer for downstream TOD tasks.
Measuring Harmful Sentence Completion in Language Models for LGBTQIA+ Individuals
LTEDI
Current language technology is ubiquitous and directly influences individuals’ lives worldwide. Given the recent trend in AI of training and constantly releasing new and powerful large language models (LLMs), there is a need to assess their biases and potential concrete consequences. While some studies have highlighted the shortcomings of these models, there is only little work on the negative impact of LLMs on LGBTQIA+ individuals. In this paper, we investigate a state-of-the-art template-based approach for measuring the harmfulness of English LLMs’ sentence completions when the subjects belong to the LGBTQIA+ community. Our findings show that, on average, the most likely LLM-generated completion is an identity attack 13% of the time. Our results raise serious concerns about the applicability of these models in production environments.
DS-TOD: Efficient Domain Specialization for Task-Oriented Dialog
Findings
Recent work has shown that self-supervised dialog-specific pretraining on large conversational datasets yields substantial gains over traditional language modeling (LM) pretraining in downstream task-oriented dialog (TOD). These approaches, however, exploit general dialogic corpora (e.g., Reddit) and thus presumably fail to reliably embed domain-specific knowledge useful for concrete downstream TOD domains. In this work, we investigate the effects of domain specialization of pretrained language models (PLMs) for TOD. Within our DS-TOD framework, we first automatically extract salient domain-specific terms, and then use them to construct DomainCC and DomainReddit – resources that we leverage for domain-specific pretraining, based on (i) masked language modeling (MLM) and (ii) response selection (RS) objectives, respectively. We further propose a resource-efficient and modular domain specialization by means of domain adapters – additional parameter-light layers in which we encode the domain knowledge. Our experiments with prominent TOD tasks – dialog state tracking (DST) and response retrieval (RR) – encompassing five domains from the MultiWOZ benchmark demonstrate the effectiveness of DS-TOD. Moreover, we show that the light-weight adapter-based specialization (1) performs comparably to full fine-tuning in single domain setups and (2) is particularly suitable for multi-domain specialization, where besides advantageous computational footprint, it can offer better TOD performance.
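As an illustration of the first step, here is a minimal sketch of salient domain-term extraction via a smoothed frequency ratio against a background corpus; DS-TOD's actual extraction procedure may differ, so treat this as an illustrative baseline.

```python
# Minimal sketch: rank terms by how much more frequent they are in the domain
# corpus than in a generic background corpus (smoothed frequency ratio).
from collections import Counter

def salient_terms(domain_docs: list[str], background_docs: list[str], top_k: int = 50):
    domain = Counter(w for d in domain_docs for w in d.lower().split())
    background = Counter(w for d in background_docs for w in d.lower().split())
    n_dom, n_bg = sum(domain.values()), sum(background.values())

    def score(w: str) -> float:
        # Add-one smoothing keeps unseen background terms from dividing by zero.
        return (domain[w] / n_dom) / ((background[w] + 1) / n_bg)

    return sorted(domain, key=score, reverse=True)[:top_k]
```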
Sustainable Modular Debiasing of Language Models
Conference on Empirical Methods in Natural Language Processing
Unfair stereotypical biases (e.g., gender, racial, or religious biases) encoded in modern pretrained language models (PLMs) have negative ethical implications for widespread adoption of state-of-the-art language technology. To remedy this, a wide range of debiasing techniques have recently been introduced to remove such stereotypical biases from PLMs. Existing debiasing methods, however, directly modify all of the PLM's parameters, which -- besides being computationally expensive -- comes with the inherent risk of (catastrophic) forgetting of useful language knowledge acquired in pretraining. In this work, we propose a more sustainable modular debiasing approach based on dedicated debiasing adapters, dubbed ADELE. Concretely, we (1) inject adapter modules into the original PLM layers and (2) update only the adapters (i.e., we keep the original PLM parameters frozen) via language modeling training on a counterfactually augmented corpus. We showcase ADELE in gender debiasing of BERT: our extensive evaluation, encompassing three intrinsic and two extrinsic bias measures, renders ADELE very effective in bias mitigation. We further show that -- due to its modular nature -- ADELE, coupled with task adapters, retains fairness even after large-scale downstream training. Finally, by means of multilingual BERT, we successfully transfer ADELE to six target languages.
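A minimal sketch of the counterfactual data augmentation underlying this debiasing training follows; the word-pair list is a tiny illustrative subset, and real CDA needs part-of-speech disambiguation for ambiguous forms like "her".

```python
# Minimal sketch of gender counterfactual data augmentation (CDA): swap gendered
# terms to build the corpus on which only the debiasing adapter is trained.
GENDER_PAIRS = [("he", "she"), ("him", "her"), ("his", "her"),
                ("man", "woman"), ("men", "women")]
SWAP: dict[str, str] = {}
for a, b in GENDER_PAIRS:
    SWAP[a] = b
    SWAP.setdefault(b, a)  # "her" maps back to "him"; real CDA disambiguates POS

def counterfactual(sentence: str) -> str:
    return " ".join(SWAP.get(tok, tok) for tok in sentence.lower().split())

def augment(corpus: list[str]) -> list[str]:
    # Keep each original sentence and add its gender-swapped counterfactual;
    # the adapter is then trained with an LM objective on this corpus while
    # all original PLM parameters stay frozen.
    return [s for sent in corpus for s in (sent, counterfactual(sent))]
```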
RedditBias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models
Annual Meeting of the Association for Computational Linguistics
Text representation models are prone to exhibit a range of societal biases, reflecting the non-controlled and biased nature of the underlying pretraining data, which consequently leads to severe ethical issues and even bias amplification. Recent work has predominantly focused on measuring and mitigating bias in pretrained language models. Surprisingly, the landscape of bias measurements and mitigation resources and methods for conversational language models is still very scarce: it is limited to only a few types of bias, artificially constructed resources, and completely ignores the impact that debiasing methods may have on the final performance in dialog tasks, e.g., conversational response generation. In this work, we present RedditBias, the first conversational data set grounded in the actual human conversations from Reddit, allowing for bias measurement and mitigation across four important bias dimensions: gender, race, religion, and queerness. Further, we develop an evaluation framework which simultaneously (1) measures bias on the developed RedditBias resource, and (2) evaluates model capability in dialog tasks after model debiasing. We use the evaluation framework to benchmark the widely used conversational DialoGPT model along with the adaptations of four debiasing methods. Our results indicate that DialoGPT is biased with respect to religious groups and that some debiasing techniques can remove this bias while preserving downstream task performance.
Linguistic Diversity Scores for NLP Data Sets
Diachronic Analysis of German Parliamentary Proceedings: Ideological Shifts through the Lens of Political Biases
ACM/IEEE Joint Conference on Digital Libraries
We analyze bias in historical corpora as encoded in diachronic distributional semantic models by focusing on two specific forms of bias, namely a political one (anti-communism) and a racist one (antisemitism). For this, we use a new corpus of German parliamentary proceedings, DeuParl, spanning the period 1867–2020. We complement this analysis of historical biases in diachronic word embeddings with a novel measure of bias on the basis of term co-occurrences and graph-based label propagation. The results of our bias measurements align with commonly perceived historical trends of antisemitic and anticommunist biases in German politics in different time periods, thus indicating the viability of analyzing historical bias trends using semantic spaces induced from historical corpora.
DebIE: A Platform for Implicit and Explicit Debiasing of Word Embedding Spaces
Conference of the European Chapter of the Association for Computational Linguistics
Recent research efforts in NLP have demonstrated that distributional word vector spaces often encode stereotypical human biases, such as racism and sexism. With word representations ubiquitously used in NLP models and pipelines, this raises ethical issues and jeopardizes the fairness of language technologies. While there exists a large body of work on bias measures and debiasing methods, to date, there is no platform that would unify these research efforts and make bias measuring and debiasing of representation spaces widely accessible. In this work, we present DebIE, the first integrated platform for (1) measuring and (2) mitigating bias in word embeddings. Given an (i) embedding space (users can choose between the predefined spaces or upload their own) and (ii) a bias specification (users can choose between existing bias specifications or create their own), DebIE can (1) compute several measures of implicit and explicit bias and (2) modify the embedding space by executing two (mutually composable) debiasing models. DebIE’s functionality can be accessed through four different interfaces: (a) a web application, (b) a desktop application, (c) a RESTful API, and (d) a command-line application. DebIE is available at: debie.informatik.uni-mannheim.de.
Review for "A Meta-analysis of Semantic Classification of Citations"
Scientia Potentia Est—On the Role of Knowledge in Computational Argumentation
Transactions of the Association for Computational Linguistics
Despite extensive research efforts in recent years, computational argumentation (CA) remains one of the most challenging areas of natural language processing. The reason for this is the inherent complexity of the cognitive processes behind human argumentation, which integrate a plethora of different types of knowledge, ranging from topic-specific facts and common sense to rhetorical knowledge. The integration of knowledge from such a wide range in CA requires modeling capabilities far beyond many other natural language understanding tasks. Existing research on mining, assessing, reasoning over, and generating arguments largely acknowledges that much more knowledge is needed to accurately model argumentation computationally. However, a systematic overview of the types of knowledge introduced in existing CA models is missing, hindering targeted progress in the field. Adopting the operational definition of knowledge as any task-relevant normative information not provided as input, the survey paper at hand fills this gap by (1) proposing a taxonomy of types of knowledge required in CA tasks, (2) systematizing the large body of CA work according to the reliance on and exploitation of these knowledge types for the four main research areas in CA, and (3) outlining and discussing directions for future research efforts in CA.
Visual Summary Identification From Scientific Publications via Self-Supervised Learning
Frontiers in Research Metrics and Analytics
The exponential growth of scientific literature yields the need to support users to both effectively and efficiently analyze and understand this growing body of research work. This exploratory process can be facilitated by providing graphical abstracts – a visual summary of a scientific publication. Accordingly, previous work recently presented an initial study on automatic identification of a central figure in a scientific publication, to be used as the publication’s visual summary. This study, however, has been limited to a single (biomedical) domain. This is primarily because the current state-of-the-art relies on supervised machine learning, which typically requires large amounts of labeled data: until now, the only existing annotated data set covered only biomedical publications. In this work, we build a novel benchmark data set for visual summary identification from scientific publications, which consists of papers presented at conferences from several areas of computer science. We couple this contribution with a new self-supervised learning approach that learns to heuristically match in-text references to figures with figure captions. Our self-supervised pre-training, executed on a large unlabeled collection of publications, attenuates the need for large annotated data sets for visual summary identification and facilitates domain transfer for this task. We evaluate our self-supervised pretraining for visual summary identification on both the existing biomedical and our newly presented computer science data set. The experimental results suggest that the proposed method is able to outperform the previous state-of-the-art without any task-specific annotations.
MultiCite: Modeling realistic citations requires moving beyond the single-sentence single-label setting
North American Chapter of the Association for Computational Linguistics
Citation context analysis (CCA) is an important task in natural language processing that studies how and why scholars discuss each other's work. Despite decades of study, computational methods for CCA have largely relied on overly-simplistic assumptions of how authors cite, which ignore several important phenomena. For instance, scholarly papers often contain rich discussions of cited work that span multiple sentences and express multiple intents concurrently. Yet, in recent work, CCA is often approached as a single-sentence, single-label classification task, and thus many datasets used to develop modern computational approaches fail to capture this interesting discourse. To address this research gap, we highlight three understudied phenomena for CCA and release MULTICITE, a new dataset of 12.6K citation contexts from 1.2K computational linguistics papers that fully models these phenomena. Not only is it the largest collection of expert-annotated citation contexts to date, MULTICITE contains multi-sentence, multi-label citation contexts annotated throughout entire full paper texts. We demonstrate how MULTICITE can enable the development of new computational methods on three important CCA tasks. We release our code and dataset at https://github.com/allenai/multicite.
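A minimal sketch of the multi-label setting the paper advocates: encode a (possibly multi-sentence) citation context once and predict each intent independently. The label set and encoder checkpoint are illustrative assumptions, not MULTICITE's exact schema.

```python
# Minimal sketch of multi-sentence, multi-label citation-intent classification:
# a sigmoid head over a pooled encoder representation, one output per intent.
import torch
import torch.nn as nn
from transformers import AutoModel

LABELS = ["background", "motivation", "uses", "extends", "compares", "future_work"]

class CitationIntentClassifier(nn.Module):
    def __init__(self, model_name: str = "allenai/scibert_scivocab_uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, len(LABELS))

    def forward(self, input_ids, attention_mask):
        # Encode the full citation context (which may span several sentences)
        # as one sequence; sigmoid makes each label an independent decision.
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state[:, 0]
        return torch.sigmoid(self.head(h))
```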
From Zero to Hero: On the Limitations of Zero-Shot Cross-Lingual Transfer with Multilingual Transformers
arXiv.org
Massively multilingual transformers pretrained with language modeling objectives (e.g., mBERT, XLM-R) have become a de facto default transfer paradigm for zero-shot cross-lingual transfer in NLP, offering unmatched transfer performance. Current downstream evaluations, however, verify their efficacy predominantly in transfer settings involving languages with sufficient amounts of pretraining data, and with lexically and typologically close languages. In this work, we analyze their limitations and show that cross-lingual transfer via massively multilingual transformers, much like transfer via cross-lingual word embeddings, is substantially less effective in resource-lean scenarios and for distant languages. Our experiments, encompassing three lower-level tasks (POS tagging, dependency parsing, NER), as well as two high-level semantic tasks (NLI, QA), empirically correlate transfer performance with linguistic similarity between the source and target languages, but also with the size of pretraining corpora of target languages. We also demonstrate a surprising effectiveness of inexpensive few-shot transfer (i.e., fine-tuning on a few target-language instances after fine-tuning in the source) across the board. This suggests that additional research efforts should be invested to reach beyond the limiting zero-shot conditions.
Creating a Domain-diverse Corpus for Theory-based Argument Quality Assessment
Workshop on Argument Mining
Computational models of argument quality (AQ) have focused primarily on assessing the overall quality or just one specific characteristic of an argument, such as its convincingness or its clarity. Previous work has claimed, however, that assessment based on theoretical dimensions of argumentation could benefit writers, yet developing such models has been limited by the lack of annotated data. In this work, we describe GAQCorpus, the first large, domain-diverse annotated corpus of theory-based AQ. We discuss how we designed the annotation task to reliably collect a large number of judgments with crowdsourcing, formulating theory-based guidelines that helped make subjective judgments of AQ more objective. We demonstrate how to identify arguments and adapt the annotation task for three diverse domains. Our work will inform research on theory-based argumentation annotation and enable the creation of more diverse corpora to support computational AQ assessment.
Entities as Topic Labels: Combining Entity Linking and Labeled LDA to Improve Topic Interpretability and Evaluability
From Zero to Hero: On the Limitations of Zero-Shot Language Transfer with Multilingual Transformers
Conference on Empirical Methods in Natural Language Processing
Massively multilingual transformers (MMTs) pretrained via language modeling (e.g., mBERT, XLM-R) have become a default paradigm for zero-shot language transfer in NLP, offering unmatched transfer performance. Current evaluations, however, verify their efficacy in transfers (a) to languages with sufficiently large pretraining corpora, and (b) between close languages. In this work, we analyze the limitations of downstream language transfer with MMTs, showing that, much like cross-lingual word embeddings, they are substantially less effective in resource-lean scenarios and for distant languages. Our experiments, encompassing three lower-level tasks (POS tagging, dependency parsing, NER) and two high-level tasks (NLI, QA), empirically correlate transfer performance with linguistic proximity between source and target languages, but also with the size of target language corpora used in MMT pretraining. Most importantly, we demonstrate that the inexpensive few-shot transfer (i.e., additional fine-tuning on a few target-language instances) is surprisingly effective across the board, warranting more research efforts reaching beyond the limiting zero-shot conditions.
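The few-shot recipe can be sketched as two standard fine-tuning stages; the dataset objects, epoch counts, and learning rate below are assumptions for illustration, not the paper's exact training configuration.

```python
# Minimal sketch of few-shot cross-lingual transfer: fine-tune on the source
# language, then continue briefly on a handful of target-language instances.
from transformers import Trainer, TrainingArguments

def few_shot_transfer(model, source_ds, target_few_shot_ds, out: str = "ckpt"):
    # Stage 1: standard fine-tuning on the (large) source-language training set.
    Trainer(model=model,
            args=TrainingArguments(out, num_train_epochs=3),
            train_dataset=source_ds).train()
    # Stage 2: a few extra epochs on k target-language examples (k is small).
    Trainer(model=model,
            args=TrainingArguments(out, num_train_epochs=10, learning_rate=1e-5),
            train_dataset=target_few_shot_ds).train()
    return model
```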
Common Sense or World Knowledge? Investigating Adapter-Based Knowledge Injection into Pretrained Transformers
Workshop on Knowledge Extraction and Integration for Deep Learning Architectures; Deep Learning Inside Out
Following the major success of neural language models (LMs) such as BERT or GPT-2 on a variety of language understanding tasks, recent work focused on injecting (structured) knowledge from external resources into these models. While on the one hand, joint pre-training (i.e., training from scratch, adding objectives based on external knowledge to the primary LM objective) may be prohibitively computationally expensive, post-hoc fine-tuning on external knowledge, on the other hand, may lead to the catastrophic forgetting of distributional knowledge. In this work, we investigate models for complementing the distributional knowledge of BERT with conceptual knowledge from ConceptNet and its corresponding Open Mind Common Sense (OMCS) corpus, respectively, using adapter training. While overall results on the GLUE benchmark paint an inconclusive picture, a deeper analysis reveals that our adapter-based models substantially outperform BERT (up to 15-20 performance points) on inference tasks that require the type of conceptual knowledge explicitly present in ConceptNet and OMCS. We also open source all our experiments and relevant code under: https://github.com/wluper/retrograph.
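A minimal sketch of the bottleneck adapter used for this kind of knowledge injection follows; the hidden and bottleneck dimensions and the activation are illustrative.

```python
# Minimal sketch of a bottleneck adapter: a small down-/up-projection with a
# residual connection, trained on the external resource while the transformer
# weights stay fixed.
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # Residual bottleneck: the adapter learns only the small update needed
        # to inject external (e.g., ConceptNet/OMCS) knowledge into the frozen
        # representations.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```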
AraWEAT: Multidimensional Analysis of Biases in Arabic Word Embeddings
Workshop on Arabic Natural Language Processing
Recent work has shown that distributional word vector spaces often encode human biases like sexism or racism. In this work, we conduct an extensive analysis of biases in Arabic word embeddings by applying a range of recently introduced bias tests on a variety of embedding spaces induced from corpora in Arabic. We measure the presence of biases across several dimensions, namely: embedding models (Skip-Gram, CBOW, and FastText) and vector sizes, types of text (encyclopedic text, and news vs. user-generated content), dialects (Egyptian Arabic vs. Modern Standard Arabic), and time (diachronic analyses over corpora from different time periods). Our analysis yields several interesting findings, e.g., that implicit gender bias in embeddings trained on Arabic news corpora steadily increases over time (between 2007 and 2017). We make the Arabic bias specifications (AraWEAT) publicly available.
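For reference, the WEAT effect size at the core of these bias tests can be sketched as follows, with `emb` an assumed word-to-vector mapping (e.g., a dict over the vocabulary).

```python
# Minimal sketch of the WEAT effect size: the differential association of two
# target word sets (X, Y) with two attribute sets (A, B) under cosine similarity.
import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def assoc(w, A, B, emb):
    # Mean similarity to attribute set A minus mean similarity to attribute set B.
    return (np.mean([cos(emb[w], emb[a]) for a in A])
            - np.mean([cos(emb[w], emb[b]) for b in B]))

def weat_effect_size(X, Y, A, B, emb):
    x = np.array([assoc(w, A, B, emb) for w in X])
    y = np.array([assoc(w, A, B, emb) for w in Y])
    return (x.mean() - y.mean()) / np.concatenate([x, y]).std(ddof=0)
```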
The OpenCitations Data Model
International Workshop on the Semantic Web
A variety of schemas and ontologies are currently used for the machine-readable description of bibliographic entities and citations. This diversity, and the reuse of the same ontology terms with different nuances, generates inconsistencies in data. Adoption of a single data model would facilitate data integration tasks regardless of the data supplier or application context. In this paper we present the OpenCitations Data Model (OCDM), a generic data model for describing bibliographic entities and citations, developed using Semantic Web technologies. We also evaluate the effective reusability of OCDM according to ontology evaluation practices, mention existing users of OCDM, and discuss the use and impact of OCDM in the wider open science community.
