Publications

2026

arXiv
FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

Zeynel A. Uluşan, Burak S. Akbudak, Can S. Erer, and 1 more author

2026

Recent neural theorem provers use reinforcement learning with verifiable rewards (RLVR), where proof assistants provide binary correctness signals. While verifiable rewards are cheap and scalable without reward hacking issues, they suffer from sparse credit assignment: models receive no learning signal from difficult problems where partial progress goes unrewarded. This motivates learned reward models that can evaluate proof quality beyond binary verification. However, comparing reward models is challenging since it typically requires expensive RL training ablations. To address this, we introduce FormalRewardBench, the first benchmark for evaluating reward models in formal theorem proving with Lean 4. Our benchmark consists of 250 preference pairs where correct proofs are paired with incorrect variants generated through five expert-curated error injection strategies: forced mistakes, minimal single-point variations, verbose incorrect proofs, natural language justification, and Python code injection. We evaluate frontier LLMs (e.g., Claude Opus 4.5), judge LLMs (e.g., CompassJudger-1-14B), general-purpose LLMs (e.g., Qwen2.5-72B-Instruct), and specialized theorem proving models (e.g., DeepSeek-Prover-V2-7B). Our results reveal that frontier LLMs achieve the highest performance (59.8%) while specialized theorem provers perform the worst (24.4%), suggesting that theorem proving ability does not transfer to proof evaluation. We provide further insights on various error injection mechanisms, highlighting the challenging nature of most injection mechanisms. We release FormalRewardBench publicly to encourage more research on developing reward models in formal mathematics.
@misc{ulusan2026formalrewardbench, title = {FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models}, author = {Uluşan, Zeynel A. and Akbudak, Burak S. and Erer, Can S. and Şahin, Gözde Gül}, year = {2026}, eprint = {2605.10141}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, }
arXiv
Clipping-Free Policy Optimization for Large Language Models

Ömer Veysel Çağatan, Barış Akgün, Gözde Gül Şahin, and 1 more author

2026

Reinforcement learning has become central to post-training large language models, yet dominant algorithms rely on clipping mechanisms that introduce optimization issues at scale, including zero-gradient regions, reward hacking, and training instability. We propose Clipping-Free Policy Optimization (CFPO), which replaces heuristic clipping with a convex quadratic penalty derived from Total Variation divergence constraints, yielding an everywhere-differentiable objective that enforces stable policy updates without hard boundaries. We evaluate CFPO across both reasoning and alignment settings. In reasoning, CFPO matches clipping-based methods on downstream benchmarks while extending the stable training regime. In alignment, CFPO mitigates verbosity exploitation and reduces capability degradation, while achieving competitive instruction-following performance. CFPO requires only a one-line code change and no additional hyperparameters. Our results suggest that CFPO is a promising drop-in alternative to clipping-based methods for LLM post-training.
@misc{ccaugatan2026clipping, title = {Clipping-Free Policy Optimization for Large Language Models}, author = {Çağatan, Ömer Veysel and Akgün, Barış and Şahin, Gözde Gül and Zhao, Xuandong}, year = {2026}, eprint = {2601.22801}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, }
arXiv
Are Non-English Papers Reviewed Fairly? Language-of-Study Bias in NLP Peer Reviews

Ehsan Barkhordar, Abdulfattah Safa, Verena Blaschke, and 3 more authors

2026

Peer review plays a central role in the NLP publication process, but is susceptible to various biases. Here, we study language-of-study (LoS) bias: the tendency for reviewers to evaluate a paper differently based on the language(s) it studies, rather than its scientific merit. Despite being explicitly flagged in reviewing guidelines, such biases are poorly understood. Prior work treats such comments as part of broader categories of weak or unconstructive reviews without defining them as a distinct form of bias. We present the first systematic characterization of LoS bias, distinguishing negative and positive forms, and introduce the human-annotated dataset LOBSTER (Language-Of-study Bias in ScienTific pEer Review) and a method achieving 87.37 macro F1 for detection. We analyze 15,645 reviews to estimate how negative and positive biases differ with respect to the LoS, and find that non-English papers face substantially higher bias rates than English-only ones, with negative bias consistently outweighing positive bias. Finally, we identify four subcategories of negative bias, and find that demanding unjustified cross-lingual generalization is the most dominant form. We publicly release all resources to support work on fairer reviewing practices in NLP and beyond.
@misc{barkhordar2026nonenglishpapersreviewedfairly, title = {Are Non-English Papers Reviewed Fairly? Language-of-Study Bias in NLP Peer Reviews}, author = {Barkhordar, Ehsan and Safa, Abdulfattah and Blaschke, Verena and Lombart, Erika and de Marneffe, Marie-Catherine and Şahin, Gözde Gül}, year = {2026}, eprint = {2604.07119}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, }
EACL
CETVEL: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish

Yakup Abrek Er, Ilker Kesen, Gözde Gül Şahin, and 1 more author

In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Mar 2026

We introduce CETVEL, a comprehensive benchmark designed to evaluate large language models (LLMs) in Turkish. Existing Turkish benchmarks often lack either task diversity or culturally relevant content, or both. CETVEL addresses these gaps by combining a broad range of both discriminative and generative tasks, ensuring content that reflects the linguistic and cultural richness of the Turkish language. CETVEL covers 23 tasks grouped into seven categories, including tasks such as grammatical error correction, machine translation, and question answering rooted in Turkish history and idiomatic language. We evaluate 33 open-weight LLMs (up to 70B parameters) covering different model families and instruction paradigms. Our experiments reveal that Turkish-centric instruction-tuned models generally underperform relative to multilingual or general-purpose models (e.g., Llama 3 and Mistral), despite being tailored for the language. Moreover, we show that tasks such as grammatical error correction and extractive question answering are particularly discriminative in differentiating model capabilities. CETVEL offers a comprehensive and culturally grounded evaluation suite for advancing the development and assessment of LLMs in Turkish.
@inproceedings{er2025cetvel, title = {CETVEL: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish}, author = {Er, Yakup Abrek and Kesen, Ilker and Şahin, Gözde Gül and Erdem, Aykut}, year = {2026}, booktitle = {Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL)}, month = mar, address = {Rabat, Morocco} }
COLI
Instructional Text Across Disciplines: A Survey of Representations, Downstream Tasks, and Open Challenges Toward Capable AI Agents

Abdulfattah Safa, Tamte Kapanadze, Arda Uzunoğlu, and 1 more author

Computational Linguistics, Mar 2026

Recent advances in large language models have demonstrated promising capabilities in following simple instructions through instruction tuning. However, real-world tasks often involve complex, multi-step instructions that remain challenging for current NLP systems. Robust understanding of such instructions is essential for deploying LLMs as general-purpose agents that can be programmed in natural language to perform complex, real-world tasks across domains like robotics, business automation, and interactive systems. Despite growing interest in this area, there is a lack of a comprehensive survey that systematically analyzes the landscape of complex instruction understanding and processing. Through a systematic review of the literature, we analyze available resources, representation schemes, and downstream tasks related to instructional text. Our study examines 181 papers, identifying trends, challenges, and opportunities in this emerging field. We provide AI/NLP researchers with essential background knowledge and a unified view of various approaches to complex instruction understanding, bridging gaps between different research directions and highlighting future research opportunities.
@article{10.1162/COLI.a.616, author = {Safa, Abdulfattah and Kapanadze, Tamte and Uzunoğlu, Arda and Şahin, Gözde Gül}, title = {Instructional Text Across Disciplines: A Survey of Representations, Downstream Tasks, and Open Challenges Toward Capable AI Agents}, journal = {Computational Linguistics}, pages = {1--66}, year = {2026}, month = mar, issn = {0891-2017}, doi = {10.1162/COLI.a.616}, url = {https://doi.org/10.1162/COLI.a.616}, projects = {ALfaLFas} }
EACL
Overview of the SIGTURK 2026 Shared Task: Terminology-Aware Machine Translation for English–Turkish Scientific Texts

Ali Gebeşçe, Abdulfattah Safa, Ege Uğur Amasya, and 1 more author

In Proceedings of the Second Workshop Natural Language Processing for Turkic Languages (SIGTURK 2026), Mar 2026

This paper presents an overview of the SIGTURK 2026 Shared Task on Terminology-Aware Machine Translation for English-Turkish Scientific Texts. We address the critical challenge of terminological accuracy in low-resource settings by constructing the first terminology-rich English-Turkish parallel corpus, comprising 3,300 sentence pairs from STEM domains with 10,157 expert-validated term pairs. The shared task consists of three subtasks: term detection, expert-guided correction, and end-to-end post-editing. We evaluate state-of-the-art baselines (including GPT-5.2 and Claude Sonnet 4.5) alongside participant systems employing diverse strategies from fine-tuning to Retrieval-Augmented Generation (RAG). Our results highlight that while massive generalist models dominate zero-shot detection, smaller, domain-adapted models using Supervised Fine-Tuning and Reinforcement Learning can significantly outperform them in end-to-end post-editing. Furthermore, we find that rigid retrieval pipelines often disrupt fluency, whereas Chain-of-Thought prompting allows models to integrate terminology more naturally. Despite these advances, a significant gap remains between automated systems and human expert performance in strict terminology correction.
@inproceedings{gebesce-etal-2026-overview, title = {Overview of the {SIGTURK} 2026 Shared Task: Terminology-Aware Machine Translation for {E}nglish{--}{T}urkish Scientific Texts}, author = {Gebe{\c{s}}{\c{c}}e, Ali and Safa, Abdulfattah and Amasya, Ege U{\u{g}}ur and Şahin, Gözde Gül}, editor = {Oflazer, Kemal and K{\"o}ksal, Abdullatif and Varol, Onur}, booktitle = {Proceedings of the Second Workshop Natural Language Processing for {T}urkic Languages ({SIGTURK} 2026)}, month = mar, year = {2026}, address = {Rabat, Morocco}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2026.sigturk-1.20/}, doi = {10.18653/v1/2026.sigturk-1.20}, pages = {236--247}, isbn = {979-8-89176-370-8}, projects = {terminology} }

2025

CHI
Quantifying Divergence for Human-AI Collaboration and Cognitive Trust

Ali Gebeşçe, Müge Kural, Tilek Chubakov, and 1 more author

Apr 2025

Predicting collaboration likelihood and measuring cognitive trust in AI systems are more important than ever. To do so, previous research mostly focuses solely on model features (e.g., accuracy, confidence) and ignores the human factor. To address that, we propose several decision-making similarity measures based on divergence metrics (e.g., KL, JSD) calculated over the labels acquired from humans and a wide range of models. We conduct a user study on a textual entailment task, where the users are provided with soft labels from various models and asked to pick the closest option to them. The users are then shown the similarities/differences to their most similar model and are surveyed for their likelihood of collaboration and cognitive trust in the selected system. Finally, we qualitatively and quantitatively analyze the relation between the proposed decision-making similarity measures and the survey results. We find that people tend to collaborate with their most similar models – measured via JSD – yet this collaboration does not necessarily imply a similar level of cognitive trust. We release all resources related to the user study (e.g., design, outputs), models, and metrics at our repo.
@inproceedings{gebeşçe2024quantifying, title = {Quantifying Divergence for Human-AI Collaboration and Cognitive Trust}, author = {Gebeşçe, Ali and Kural, Müge and Chubakov, Tilek and Şahin, Gözde Gül}, year = {2025}, isbn = {9798400713958}, url = {https://doi.org/10.1145/3706599.3720105}, doi = {10.1145/3706599.3720105}, publisher = {Association for Computing Machinery}, month = apr, address = {New York, NY, USA}, projects = {ALfaLFas} }
NAACL
A Zero-Shot Open-Vocabulary Pipeline for Dialogue Understanding

Abdulfattah Safa and Gözde Gül Şahin

In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Apr 2025

Dialogue State Tracking (DST) is crucial for understanding user needs and executing appropriate system actions in task-oriented dialogues. The majority of existing DST methods are designed to work within predefined ontologies and assume the availability of gold domain labels, struggling to adapt to new slot values. While Large Language Model (LLM)-based systems show promising zero-shot DST performance, they either require extensive computational resources or underperform existing fully-trained systems, limiting their practicality. To address these limitations, we propose a zero-shot, open-vocabulary system that integrates domain classification and DST in a single pipeline. Our approach includes reformulating DST as a question-answering task for less capable models and employing self-refining prompts for more adaptable ones. Our system does not rely on fixed slot values defined in the ontology, allowing the system to adapt dynamically. We compare our approach with existing SOTA, and show that it provides up to 20% better Joint Goal Accuracy (JGA) over previous methods on datasets like Multi-WOZ 2.1, with up to 90% fewer requests to the LLM API.
@inproceedings{safa2024zeroshotopenvocabularypipelinedialogue, title = {A Zero-Shot Open-Vocabulary Pipeline for Dialogue Understanding}, author = {Safa, Abdulfattah and Şahin, Gözde Gül}, booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)}, month = apr, year = {2025}, address = {Albuquerque, New Mexico}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.naacl-long.387/}, pages = {7562--7579}, isbn = {979-8-89176-189-6}, projects = {ALfaLFas} }
COLING
GECTurk WEB: An Explainable Online Platform for Turkish Grammatical Error Detection and Correction

Ali Gebeşçe and Gözde Gül Şahin

In Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations, Jan 2025

Sophisticated grammatical error detection/correction tools are available for a small set of languages such as English and Chinese. However, it is not straightforward – if not impossible – to adapt them to morphologically rich languages with complex writing rules like Turkish, which has more than 80 million speakers. Even though several tools exist for Turkish, they primarily focus on spelling errors rather than grammatical errors and lack features such as web interfaces, error explanations and feedback mechanisms. To fill this gap, we introduce GECTurk WEB, a light, open-source, and flexible web-based system that can detect and correct the most common forms of Turkish writing errors, such as the misuse of diacritics, compound and foreign words, pronouns, and light verbs, along with spelling mistakes. Our system provides native speakers and second language learners with an easily accessible tool to detect/correct such mistakes and also to learn from their mistakes by showing the explanation for the violated rule(s). The proposed system achieves a system usability score of 88.3, and is shown to help learn/remember a grammatical rule (confirmed by 80% of the participants).
@inproceedings{gebeşçe2024gecturkwebexplainableonline, title = {GECTurk WEB: An Explainable Online Platform for Turkish Grammatical Error Detection and Correction}, author = {Gebeşçe, Ali and Şahin, Gözde Gül}, booktitle = {Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations}, month = jan, year = {2025}, address = {Abu Dhabi, UAE}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.coling-demos.16/}, pages = {163--173}, projects = {ALfaLFas} }
Aperta
TWiST: Turkish-English Wikipedia & Thesis STEM Terminology Dataset

Ali Gebeşçe, Gözde Gül Şahin, and Ege Uğur Amasya

Jan 2025

2024

arXiv
Linguistically-Informed Multilingual Instruction Tuning: Is There an Optimal Set of Languages to Tune?

Gürkan Soykan and Gözde Gül Şahin

Jan 2024

Multilingual language models often perform unevenly across different languages due to limited generalization capabilities for some languages. This issue is significant because of the growing interest in making universal language models that work well for all languages. Instruction tuning with multilingual instruction-response pairs has been used to improve model performance across various languages. However, this approach is challenged by high computational costs, a lack of quality tuning data for all languages, and the "curse of multilinguality" – the performance drop per language after adding many languages. Recent studies have found that working with datasets with few languages and a smaller number of instances can be beneficial. Yet, there exists no systematic investigation into how choosing different languages affects multilingual instruction tuning. Our study proposes a method to select languages for instruction tuning in a linguistically informed way, aiming to boost model performance across languages and tasks. We use a simple algorithm to choose diverse languages and test their effectiveness on various benchmarks and open-ended questions. Our results show that this careful selection generally leads to better outcomes than choosing languages at random. We suggest a new and simple way of enhancing multilingual models by selecting diverse languages based on linguistic features that could help develop better multilingual systems and guide dataset creation efforts.
@misc{soykan2024linguisticallyinformedmultilingualinstructiontuning, title = {Linguistically-Informed Multilingual Instruction Tuning: Is There an Optimal Set of Languages to Tune?}, author = {Soykan, Gürkan and Şahin, Gözde Gül}, year = {2024}, eprint = {2410.07809}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, projects = {ALfaLFas} }
ACL
PARADISE: Evaluating Implicit Planning Skills of Language Models with Procedural Warnings and Tips Dataset

Arda Uzunoğlu, Abdulfattah Safa, and Gözde Gül Şahin

In Findings of the Association for Computational Linguistics: ACL 2024, Aug 2024

Recently, there has been growing interest within the community regarding whether large language models are capable of planning or executing plans. However, most prior studies use LLMs to generate high-level plans for simplified scenarios lacking linguistic complexity and domain diversity, limiting analysis of their planning abilities. These setups constrain evaluation methods (e.g., predefined action space), architectural choices (e.g., only generative models), and overlook the linguistic nuances essential for realistic analysis. To tackle this, we present PARADISE, an abductive reasoning task using a Q&A format on practical procedural text sourced from wikiHow. It involves tip and warning inference tasks directly associated with goals, excluding intermediary steps, with the aim of testing the ability of the models to infer implicit knowledge of the plan solely from the given goal. Our experiments, utilizing fine-tuned language models and zero-shot prompting, reveal the effectiveness of task-specific small models over large language models in most scenarios. Despite advancements, all models fall short of human performance. Notably, our analysis uncovers intriguing insights, such as variations in model behavior with dropped keywords, struggles of BERT-family and GPT-4 with physical and abstract goals, and the proposed tasks offering valuable prior knowledge for other unseen procedural tasks. The PARADISE dataset and associated resources are publicly available for further research exploration at https://anonymous.4open.science/r/paradise-53BD/README.md.
@inproceedings{uzunoglu-etal-2024-paradise, title = {{PARADISE}: Evaluating Implicit Planning Skills of Language Models with Procedural Warnings and Tips Dataset}, author = {Uzuno{\u{g}}lu, Arda and Safa, Abdulfattah and {\c{S}}ahin, G{\"o}zde G{\"u}l}, editor = {Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek}, booktitle = {Findings of the Association for Computational Linguistics: ACL 2024}, month = aug, year = {2024}, address = {Bangkok, Thailand}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2024.findings-acl.599/}, doi = {10.18653/v1/2024.findings-acl.599}, pages = {10085--10102}, projects = {ALfaLFas} }

2023

IJCNLP-AACL
Benchmarking Procedural Language Understanding for Low-Resource Languages: A Case Study on Turkish

Arda Uzunoglu and Gözde Şahin

In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, Nov 2023

Understanding procedural natural language (e.g., step-by-step instructions) is a crucial step to execution and planning. However, while there are ample corpora and downstream tasks available in English, the field lacks such resources for most languages. To address this gap, we conduct a case study on Turkish procedural texts. We first expand the number of tutorials in Turkish wikiHow from 2,000 to 52,000 using automated translation tools, where the translation quality and faithfulness to the original meaning are validated by a team of experts on a random set. Then, we generate several downstream tasks on the corpus, such as linking actions, goal inference, and summarization. To tackle these tasks, we implement strong baseline models via fine-tuning large language-specific models such as TR-BART and BERTurk, as well as multilingual models such as mBART, mT5, and XLM. We find that language-specific models consistently outperform their multilingual counterparts by a significant margin across most procedural language understanding (PLU) tasks.
@inproceedings{uzunoglu-ahin:2023:ijcnlp, author = {Uzunoglu, Arda and Şahin, Gözde}, title = {Benchmarking Procedural Language Understanding for Low-Resource Languages: A Case Study on Turkish}, booktitle = {Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics}, month = nov, year = {2023}, address = {Nusa Dua, Bali}, publisher = {Association for Computational Linguistics}, pages = {804--819}, url = {https://aclanthology.org/2023.ijcnlp-long.52}, projects = {ALfaLFas} }
IJCNLP-AACL
GECTurk: Grammatical Error Correction and Detection Dataset for Turkish

Atakan Kara, Farrin Marouf Sofian, Andrew Bond, and 1 more author

In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, Nov 2023

Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners. Developing such tools requires a large amount of parallel, annotated data, which is unavailable for most languages. Synthetic data generation is a common practice to overcome the scarcity of such data. However, it is not straightforward for morphologically rich languages like Turkish due to complex writing rules that require phonological, morphological, and syntactic information. In this work, we present a flexible and extensible synthetic data generation pipeline for Turkish covering more than 20 expert-curated grammar and spelling rules (a.k.a., writing rules) implemented through complex transformation functions. Using the pipeline, we derive 130,000 high-quality parallel sentences from professionally edited articles. Additionally, we create a more realistic test set by manually annotating a set of movie reviews. We implement three baselines formulating the task as i) neural machine translation, ii) sequence tagging, and iii) few-shot learning with prefix tuning, achieving strong results. Then we perform a zero-shot evaluation of our pretrained models on the coarse-grained “BOUN -de/-da” and fine-grained expert annotated dataset. Our results suggest that our corpus, GECTurk, is high-quality and allows knowledge transfer for the out-of-domain setting. To encourage further research on Turkish GEC, we release our dataset, baseline models, and synthetic data generation pipeline at https://anonymous.4open.science/r/tr-gec-17D6/.
@inproceedings{kara-EtAl:2023:findings, author = {Kara, Atakan and Marouf Sofian, Farrin and Bond, Andrew and Şahin, Gözde}, title = {GECTurk: Grammatical Error Correction and Detection Dataset for Turkish}, booktitle = {Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics}, month = nov, year = {2023}, address = {Nusa Dua, Bali}, publisher = {Association for Computational Linguistics}, pages = {278--290}, url = {https://aclanthology.org/2023.findings-ijcnlp.26}, projects = {ALfaLFas} }
INLG
Metric-Based In-context Learning: A Case Study in Text Simplification

Subha Vadlamannati and Gözde Gül Şahin

In Proceedings of the 16th International Natural Language Generation Conference, Sep 2023

In-context learning (ICL) for large language models has proven to be a powerful approach for many natural language processing tasks. However, determining the best method to select examples for ICL is nontrivial as the results can vary greatly depending on the quality, quantity, and order of examples used. In this paper, we conduct a case study on text simplification (TS) to investigate how to select the best and most robust examples for ICL. We propose the Metric-Based In-context Learning (MBL) method, which utilizes commonly used TS metrics such as SARI, compression ratio, and BERT-Precision for selection. Through an extensive set of experiments with various-sized GPT models on standard TS benchmarks such as TurkCorpus and ASSET, we show that examples selected by the top SARI scores perform the best on larger models such as GPT-175B, while the compression ratio generally performs better on smaller models such as GPT-13B and GPT-6.7B. Furthermore, we demonstrate that MBL is generally robust to example orderings and out-of-domain test sets, and outperforms strong baselines and state-of-the-art finetuned language models. Finally, we show that the behaviour of large GPT models can be implicitly controlled by the chosen metric. Our research provides a new framework for selecting examples in ICL, and demonstrates its effectiveness in text simplification tasks, breaking new ground for more accurate and efficient NLG systems.
@inproceedings{mbl-subha23, author = {Vadlamannati, Subha and Şahin, Gözde Gül}, title = {Metric-Based In-context Learning: {A} Case Study in Text Simplification}, booktitle = {Proceedings of the 16th International Natural Language Generation Conference}, month = sep, year = {2023}, address = {Prague, Czech Republic}, publisher = {Association for Computational Linguistics}, projects = {ALfaLFas} }
EACL
Lessons Learned from a Citizen Science Project for Natural Language Processing

Jan-Christoph Klie, Ji-Ung Lee, Kevin Stowe, and 6 more authors

In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, May 2023

Many Natural Language Processing (NLP) systems use annotated corpora for training and evaluation. However, labeled data is often costly to obtain and scaling annotation projects is difficult, which is why annotation tasks are often outsourced to paid crowdworkers. Citizen Science is an alternative to crowdsourcing that is relatively unexplored in the context of NLP. To investigate whether and how well Citizen Science can be applied in this setting, we conduct an exploratory study into engaging different groups of volunteers in Citizen Science for NLP by re-annotating parts of a pre-existing crowdsourced dataset. Our results show that this can yield high-quality annotations and attract motivated volunteers, but also requires considering factors such as scalability, participation over time, and legal and ethical issues. We summarize lessons learned in the form of guidelines and provide our code and data to aid future work on Citizen Science.
@inproceedings{klie-etal-2023-lessons, title = {Lessons Learned from a Citizen Science Project for Natural Language Processing}, author = {Klie, Jan-Christoph and Lee, Ji-Ung and Stowe, Kevin and Şahin, Gözde Gül and Moosavi, Nafise Sadat and Bates, Luke and Petrak, Dominic and Eckart De Castilho, Richard and Gurevych, Iryna}, booktitle = {Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics}, month = may, year = {2023}, address = {Dubrovnik, Croatia}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.eacl-main.261}, pages = {3594--3608} }
EACL
MetaQA: Combining Expert Agents for Multi-Skill Question Answering

Haritz Puerto, Gözde Gül Sahin, and Iryna Gurevych

In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6, 2023, May 2023

The recent explosion of question answering (QA) datasets and models has increased the interest in the generalization of models across multiple domains and formats by either training on multiple datasets or by combining multiple models. Despite the promising results of multi-dataset models, some domains or QA formats may require specific architectures, and thus the adaptability of these models might be limited. In addition, current approaches for combining models disregard cues such as question-answer compatibility. In this work, we propose to combine expert agents with a novel, flexible, and training-efficient architecture that considers questions, answer predictions, and answer-prediction confidence scores to select the best answer among a list of answer candidates. Through quantitative and qualitative experiments we show that our model i) creates a collaboration between agents that outperforms previous multi-agent and multi-dataset approaches in both in-domain and out-of-domain scenarios, ii) is highly data-efficient to train, and iii) can be adapted to any QA format. We release our code and a dataset of answer predictions from expert agents for 16 QA datasets to foster future developments of multi-agent systems on https://github.com/UKPLab/MetaQA.
@inproceedings{DBLP:conf/eacl/PuertoSG23, author = {Puerto, Haritz and Sahin, G{\"{o}}zde G{\"{u}}l and Gurevych, Iryna}, editor = {Vlachos, Andreas and Augenstein, Isabelle}, title = {MetaQA: Combining Expert Agents for Multi-Skill Question Answering}, booktitle = {Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, {EACL} 2023, Dubrovnik, Croatia, May 2-6, 2023}, pages = {3548--3562}, publisher = {Association for Computational Linguistics}, year = {2023}, month = may, url = {https://aclanthology.org/2023.eacl-main.259}, timestamp = {Thu, 11 May 2023 17:08:21 +0200}, biburl = {https://dblp.org/rec/conf/eacl/PuertoSG23.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }