Automatic Learning oF ProcedurAl Language From NAtural Language InStructions for Intelligent Assistance
Motivation
Despite the many studies that report exceptionally high scores on downstream natural language processing (NLP) tasks (Devlin et al., 2019), a growing number of studies discuss the gap between performance on such tasks and performance on real-world tasks that require “understanding” (Bender & Koller, 2020). Some of the major causes for this are (i) neural models not being able to generalize to out-of-domain data, (ii) downstream tasks not containing the challenges of real-world scenarios, and (iii) the lack of suitable evaluation measures.
In order to bring performance on real-life and downstream tasks closer together, this project proposes a novel task for understanding natural language utterances within a more realistic and challenging scope: understanding human-written instructions. Giving step-by-step instructions is one of the primary ways humans teach someone a new topic or task. The research plan envisions a future where people would also be able to instruct machines with such step-by-step instructions, and this project aims to take the first step towards that goal by developing the tools necessary to parse natural language utterances into a sequence of procedures. The advantage of such a representation is twofold: the resulting statements can be reduced (executed) automatically by an off-the-shelf interpreter, and the final model can be enriched with domain-specific knowledge/rules that cannot easily be learned from data.
Semantics, the subfield of linguistics that studies meaning, has researched how to best represent meaning for decades. The theories that have been introduced are too many to list here; however, most of them underline common challenges such as quantifiers and negation scope, and propose ways to address them in their representation schemes. Most of these representations, though, are designed for single sentences, ignoring inter-sentential connectives (e.g., after, then, because) and co-reference with pronouns (e.g., “Sue is sick. She (Sue) won’t work tomorrow.”). Another prominent problem that occurs specifically in instructional text is zero anaphora. Consider the two-step procedure “Mix the macaroni and cheese. Bake for 10 minutes.” Here, what to bake and where to bake it are not explicitly stated but inferred from the context. Finally, reference is often made to an implicit object that does not exist in the text: if we said “Bake it for 10 minutes”, “it” would refer to the product of the mixing action. Therefore, the first central research question is: “What is the best way to represent the meaning of step-by-step instructions?”
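To make this concrete, below is a minimal, purely illustrative sketch (not the project's actual representation scheme) of how the two-step procedure above could be encoded as a sequence of predicate-argument frames, with the implicit arguments of the zero-anaphoric “Bake” step made explicit. The frame schema, field names, and the RESULT(step_1) placeholder are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One procedural step as a predicate-argument frame (hypothetical schema)."""
    predicate: str                      # the main action verb
    arguments: dict = field(default_factory=dict)

# "Mix the macaroni and cheese. Bake for 10 minutes."
mix = Step("mix", {"theme": ["macaroni", "cheese"]})

# The zero-anaphoric arguments of "Bake" are filled from context:
# the theme is the (implicit) product of the previous mixing action.
bake = Step("bake", {
    "theme": "RESULT(step_1)",          # implicit object: output of the mix step
    "duration": "10 minutes",
    "location": "oven",                 # inferred, never stated in the text
})

procedure = [mix, bake]
for i, step in enumerate(procedure, 1):
    print(f"step_{i}: {step.predicate} {step.arguments}")
```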
Another major challenge arises when we want to build models that parse instructions from various domains into executable procedures. We want a model trained on car repair instructions that contain some mixing actions (e.g., mixing paints) to correctly parse a cooking instruction that contains a mixing action. Even though this is fairly easy for humans, thanks to our conceptual reasoning abilities, machines tend to overfit and fail to generalize to unseen domains and, similarly, to unseen instructions. One way to tackle this challenge is to develop new models, or improve existing ones, with conceptual reasoning abilities. Extracting concepts, however, requires the ability to identify and remember recurring, important patterns and to abstract them from their surface forms, i.e., to symbolize them. Hence the second central research question we pose in this project is: “How can we build generalizable models for processing raw text into well-defined procedures?”
The final obstacle in the way is finding the right evaluation measures to track progress towards generalizable natural language processing models. The task we define already sets up a realistic measure; however, measuring progress with one single score has been shown to be problematic, for the following reasons. First, neural models are not interpretable, so their strengths and weaknesses cannot be analyzed simply by looking at a single score. Second, a single score only shows how well a model performs on a specific test set, rather than how well it masters the skills required by the task (e.g., logical deduction, mathematical reasoning). As mentioned, this project hypothesizes that improving long-range reasoning abilities would yield more generalizable models. That brings us to the final research questions: “Which reasoning/logical skills are required for processing instructions?” and “How can we measure these skills adequately?”
Goals
The project investigates three major research directions to answer the aforementioned RQs:
Large, Structured Dataset of Procedural Information Spanning Multiple Domains: In order to conduct research on data-driven and generalizable models for procedural language understanding, the field requires large amounts of annotated corpora of goal-oriented instructions and procedures from a variety of domains.
Evaluation Benchmark for Distinct Cognitive Abilities: The task of interpreting procedures encompasses various linguistic and reasoning challenges. Nonetheless, researchers are inclined to evaluate only the end result using a single score.
Neural/Hybrid Models with Long-Range Reasoning Abilities for Procedural Text: We will contribute by i) investigating the generalizability of existing techniques on procedural text, and ii) developing novel models inspired by existing cognitive architectures.
This project (ID: 1109B322100424) is funded within the scope of the TÜBİTAK 2232B International Fellowship for Outstanding Researchers funding scheme. (Funding Period: 09.2022-09.2025)
Predicting the collaboration likelihood and measuring cognitive trust in AI systems are more important than ever. To do that, previous research mostly focuses solely on model features (e.g., accuracy, confidence) and ignores the human factor. To address that, we propose several decision-making similarity measures based on divergence metrics (e.g., KL, JSD) calculated over the labels acquired from humans and a wide range of models. We conduct a user study on a textual entailment task, where the users are provided with soft labels from various models and asked to pick the closest option to them. The users are then shown the similarities/differences to their most similar model and are surveyed for their likelihood of collaboration and cognitive trust in the selected system. Finally, we qualitatively and quantitatively analyze the relation between the proposed decision-making similarity measures and the survey results. We find that people tend to collaborate with their most similar models – measured via JSD – yet this collaboration does not necessarily imply a similar level of cognitive trust. We release all resources related to the user study (e.g., design, outputs), models, and metrics at our repo.
@inproceedings{gebeşçe2024quantifying,title={Quantifying Divergence for Human-AI Collaboration and Cognitive Trust},author={Gebeşçe, Ali and Kural, Müge and Chubakov, Tilek and Şahin, Gözde Gül},year={2025},isbn={9798400713958},url={https://doi.org/10.1145/3706599.3720105},doi={10.1145/3706599.3720105},publisher={Association for Computing Machinery},month=apr,address={New York, NY, USA},projects={ALfaLFas}}
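To illustrate the divergence-based similarity described in the abstract above, here is a minimal sketch, assuming made-up soft label distributions over NLI classes; it computes the Jensen-Shannon divergence between a human's distribution and each model's, and picks the most similar model. It is not the paper's released code.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Soft label distributions over {entailment, neutral, contradiction}
# for one textual-entailment item (values are made up for illustration).
human = np.array([0.7, 0.2, 0.1])
models = {
    "model_a": np.array([0.6, 0.3, 0.1]),
    "model_b": np.array([0.1, 0.2, 0.7]),
}

# scipy returns the JS *distance* (the square root of the divergence);
# smaller divergence means the model's decision-making is closer to the human's.
divergence = {
    name: jensenshannon(human, dist, base=2) ** 2
    for name, dist in models.items()
}
closest = min(divergence, key=divergence.get)
print(divergence, "-> most similar:", closest)
```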
Recent advances in large language models have demonstrated promising capabilities in following simple instructions through instruction tuning. However, real-world tasks often involve complex, multi-step instructions that remain challenging for current NLP systems. Despite growing interest in this area, the field lacks a comprehensive survey that systematically analyzes the landscape of complex instruction understanding and processing. Through a systematic review of the literature, we analyze available resources, representation schemes, and downstream tasks related to instructional text. Our study examines 177 papers, identifying trends, challenges, and opportunities in this emerging field. We provide AI/NLP researchers with essential background knowledge and a unified view of various approaches to complex instruction understanding, bridging gaps between different research directions and highlighting future research opportunities.
@misc{safa2024systematicsurveyinstructionaltext,title={A Systematic Survey on Instructional Text: From Representation Formats to Downstream NLP Tasks},author={Safa, Abdulfattah and Kapanadze, Tamta and Uzunoğlu, Arda and Şahin, Gözde Gül},year={2024},eprint={2410.18529},archiveprefix={arXiv},primaryclass={cs.CL},projects={ALfaLFas}}
Multilingual language models often perform unevenly across different languages due to limited generalization capabilities for some languages. This issue is significant because of the growing interest in making universal language models that work well for all languages. Instruction tuning with multilingual instruction-response pairs has been used to improve model performance across various languages. However, this approach is challenged by high computational costs, a lack of quality tuning data for all languages, and the "curse of multilinguality" – the performance drop per language after adding many languages. Recent studies have found that working with datasets with few languages and a smaller number of instances can be beneficial. Yet, there exists no systematic investigation into how choosing different languages affects multilingual instruction tuning. Our study proposes a method to select languages for instruction tuning in a linguistically informed way, aiming to boost model performance across languages and tasks. We use a simple algorithm to choose diverse languages and test their effectiveness on various benchmarks and open-ended questions. Our results show that this careful selection generally leads to better outcomes than choosing languages at random. We suggest a new and simple way of enhancing multilingual models by selecting diverse languages based on linguistic features that could help develop better multilingual systems and guide dataset creation efforts.
@misc{soykan2024linguisticallyinformedmultilingualinstructiontuning,title={Linguistically-Informed Multilingual Instruction Tuning: Is There an Optimal Set of Languages to Tune?},author={Soykan, Gürkan and Şahin, Gözde Gül},year={2024},eprint={2410.07809},archiveprefix={arXiv},primaryclass={cs.CL},projects={ALfaLFas}}
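The paper's selection algorithm is not reproduced here; as a rough illustration of linguistically informed language selection, the sketch below greedily picks a maximally diverse subset of languages based on hypothetical typological feature vectors (in practice such vectors could come from a resource like lang2vec). The feature values, the Hamming distance, and the greedy max-min criterion are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical binary typological feature vectors per language.
features = {
    "eng": np.array([1, 0, 1, 0, 1]),
    "tur": np.array([0, 1, 0, 1, 1]),
    "zho": np.array([1, 1, 0, 0, 0]),
    "fin": np.array([0, 1, 1, 1, 0]),
    "spa": np.array([1, 0, 1, 1, 1]),
}

def hamming(a, b):
    """Number of differing typological features between two languages."""
    return np.sum(a != b)

def select_diverse(features, k, seed="eng"):
    """Greedy max-min selection: repeatedly add the language farthest
    from the already-selected set (one simple notion of diversity)."""
    selected = [seed]
    while len(selected) < k:
        rest = [lang for lang in features if lang not in selected]
        best = max(rest, key=lambda lang: min(
            hamming(features[lang], features[s]) for s in selected))
        selected.append(best)
    return selected

print(select_diverse(features, k=3))
```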
A Zero-Shot Open-Vocabulary Pipeline for Dialogue Understanding
Abdulfattah Safa and Gözde Gül Şahin
In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Apr 2025
Dialogue State Tracking (DST) is crucial for understanding user needs and executing appropriate system actions in task-oriented dialogues. The majority of existing DST methods are designed to work within predefined ontologies and assume the availability of gold domain labels, struggling to adapt to new slot values. While Large Language Model (LLM)-based systems show promising zero-shot DST performance, they either require extensive computational resources or underperform existing fully-trained systems, limiting their practicality. To address these limitations, we propose a zero-shot, open-vocabulary system that integrates domain classification and DST in a single pipeline. Our approach includes reformulating DST as a question-answering task for less capable models and employing self-refining prompts for more adaptable ones. Our system does not rely on fixed slot values defined in the ontology, allowing the system to adapt dynamically. We compare our approach with the existing SOTA and show that it provides up to 20% better Joint Goal Accuracy (JGA) than previous methods on datasets like Multi-WOZ 2.1, with up to 90% fewer requests to the LLM API.
@inproceedings{safa2024zeroshotopenvocabularypipelinedialogue,title={A Zero-Shot Open-Vocabulary Pipeline for Dialogue Understanding},author={Safa, Abdulfattah and Şahin, Gözde Gül},booktitle={Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},month=apr,year={2025},address={Albuquerque, New Mexico},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2025.naacl-long.387/},pages={7562--7579},isbn={979-8-89176-189-6},projects={ALfaLFas}}
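To make the question-answering reformulation of DST in the abstract above concrete, here is a minimal, hypothetical sketch of how a slot could be filled by posing a natural-language question about the dialogue history. The prompt template, slot names, and the commented-out ask_llm helper are illustrative assumptions, not the paper's actual pipeline.

```python
def build_slot_question(domain: str, slot: str, dialogue_history: str) -> str:
    """Reformulate slot filling as question answering (illustrative template)."""
    return (
        f"Dialogue:\n{dialogue_history}\n\n"
        f"Question: In the {domain} domain, what value did the user request "
        f"for the slot '{slot}'? Answer 'none' if it was not mentioned."
    )

history = (
    "User: I need a cheap restaurant in the centre.\n"
    "System: Sure, any cuisine preference?\n"
    "User: Italian, please."
)

for slot in ["price range", "area", "food"]:
    prompt = build_slot_question("restaurant", slot, history)
    # answer = ask_llm(prompt)   # hypothetical call to an LLM API
    print(prompt, end="\n---\n")
```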
Sophisticated grammatical error detection/correction tools are available for a small set of languages such as English and Chinese. However, it is not straightforward – if not impossible – to adapt them to morphologically rich languages with complex writing rules like Turkish, which has more than 80 million speakers. Even though several tools exist for Turkish, they primarily focus on spelling errors rather than grammatical errors and lack features such as web interfaces, error explanations, and feedback mechanisms. To fill this gap, we introduce GECTurk WEB, a light, open-source, and flexible web-based system that can detect and correct the most common forms of Turkish writing errors, such as the misuse of diacritics, compound and foreign words, pronouns, and light verbs, along with spelling mistakes. Our system provides native speakers and second language learners with an easily accessible tool to detect/correct such mistakes and to learn from them by showing the explanation for the violated rule(s). The proposed system achieves a system usability score of 88.3 and is shown to help users learn/remember a grammatical rule (confirmed by 80% of the participants).
@inproceedings{gebeşçe2024gecturkwebexplainableonline,title={GECTurk WEB: An Explainable Online Platform for Turkish Grammatical Error Detection and Correction},author={Gebeşçe, Ali and Şahin, Gözde Gül},booktitle={Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations},month=jan,year={2025},address={Abu Dhabi, UAE},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2025.coling-demos.16/},pages={163--173},projects={ALfaLFas}}
PARADISE: Evaluating Implicit Planning Skills of Language Models with Procedural Warnings and Tips Dataset
Arda Uzunoğlu, Abdulfattah Safa, and Gözde Gül Şahin
In Findings of the Association for Computational Linguistics: ACL 2024, Aug 2024
Recently, there has been growing interest within the community regarding whether large language models are capable of planning or executing plans. However, most prior studies use LLMs to generate high-level plans for simplified scenarios lacking linguistic complexity and domain diversity, limiting analysis of their planning abilities. These setups constrain evaluation methods (e.g., predefined action space), architectural choices (e.g., only generative models), and overlook the linguistic nuances essential for realistic analysis. To tackle this, we present PARADISE, an abductive reasoning task using Q&A format on practical procedural text sourced from wikiHow. It involves tip and warning inference tasks directly associated with goals, excluding intermediary steps, with the aim of testing the ability of the models to infer implicit knowledge of the plan solely from the given goal. Our experiments, utilizing fine-tuned language models and zero-shot prompting, reveal the effectiveness of task-specific small models over large language models in most scenarios. Despite advancements, all models fall short of human performance. Notably, our analysis uncovers intriguing insights, such as variations in model behavior with dropped keywords, struggles of BERT-family and GPT-4 with physical and abstract goals, and the proposed tasks offering valuable prior knowledge for other unseen procedural tasks. The PARADISE dataset and associated resources are publicly available for further research exploration at https://anonymous.4open.science/r/paradise-53BD/README.md.
@inproceedings{uzunoglu-etal-2024-paradise,title={{PARADISE}: Evaluating Implicit Planning Skills of Language Models with Procedural Warnings and Tips Dataset},author={Uzuno{\u{g}}lu, Arda and Safa, Abdulfattah and {\c{S}}ahin, G{\"o}zde G{\"u}l},editor={Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek},booktitle={Findings of the Association for Computational Linguistics: ACL 2024},month=aug,year={2024},address={Bangkok, Thailand},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2024.findings-acl.599/},doi={10.18653/v1/2024.findings-acl.599},pages={10085--10102},projects={ALfaLFas}}
Benchmarking Procedural Language Understanding for Low-Resource Languages: A Case Study on Turkish
Arda Uzunoglu and Gözde Şahin
In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, Nov 2023
Understanding procedural natural language (e.g., step-by-step instructions) is a crucial step to execution and planning. However, while there are ample corpora and downstream tasks available in English, the field lacks such resources for most languages. To address this gap, we conduct a case study on Turkish procedural texts. We first expand the number of tutorials in Turkish wikiHow from 2,000 to 52,000 using automated translation tools, where the translation quality and loyalty to the original meaning are validated by a team of experts on a random set. Then, we generate several downstream tasks on the corpus, such as linking actions, goal inference, and summarization. To tackle these tasks, we implement strong baseline models via fine-tuning large language-specific models such as TR-BART and BERTurk, as well as multilingual models such as mBART, mT5, and XLM. We find that language-specific models consistently outperform their multilingual counterparts by a significant margin across most procedural language understanding (PLU) tasks.
@inproceedings{uzunoglu-ahin:2023:ijcnlp,author={Uzunoglu, Arda and Şahin, Gözde},title={Benchmarking Procedural Language Understanding for Low-Resource Languages: A Case Study on Turkish},booktitle={Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics},month=nov,year={2023},address={Nusa Dua, Bali},publisher={Association for Computational Linguistics},pages={804--819},url={https://aclanthology.org/2023.ijcnlp-long.52},projects={ALfaLFas}}
GECTurk: Grammatical Error Correction and Detection Dataset for Turkish
Atakan Kara, Farrin Marouf Sofian, Andrew Bond, and Gözde Şahin
In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, Nov 2023
Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners. Developing such tools requires a large amount of parallel, annotated data, which is unavailable for most languages. Synthetic data generation is a common practice to overcome the scarcity of such data. However, it is not straightforward for morphologically rich languages like Turkish due to complex writing rules that require phonological, morphological, and syntactic information. In this work, we present a flexible and extensible synthetic data generation pipeline for Turkish covering more than 20 expert-curated grammar and spelling rules (a.k.a., writing rules) implemented through complex transformation functions. Using the pipeline, we derive 130,000 high-quality parallel sentences from professionally edited articles. Additionally, we create a more realistic test set by manually annotating a set of movie reviews. We implement three baselines formulating the task as i) neural machine translation, ii) sequence tagging, and iii) few-shot learning with prefix tuning, achieving strong results. Then we perform a zero-shot evaluation of our pretrained models on the coarse-grained “BOUN -de/-da” and fine-grained expert annotated dataset. Our results suggest that our corpus, GECTurk, is high-quality and allows knowledge transfer for the out-of-domain setting. To encourage further research on Turkish GEC, we release our dataset, baseline models, and synthetic data generation pipeline at https://anonymous.4open.science/r/tr-gec-17D6/.
@inproceedings{kara-EtAl:2023:findings,author={Kara, Atakan and Marouf Sofian, Farrin and Bond, Andrew and Şahin, Gözde},title={GECTurk: Grammatical Error Correction and Detection Dataset for Turkish},booktitle={Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics},month=nov,year={2023},address={Nusa Dua, Bali},publisher={Association for Computational Linguistics},pages={278--290},url={https://aclanthology.org/2023.findings-ijcnlp.26},projects={ALfaLFas}}
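The transformation functions themselves are not shown in the abstract above; as a rough illustration of the synthetic data generation idea, the sketch below corrupts a grammatically correct Turkish sentence by wrongly attaching the separately written conjunction "de/da" to the preceding word, yielding an (erroneous, correct) parallel pair. The choice of rule, the regular expression, and the example sentence are simplifications for illustration, not the paper's actual pipeline.

```python
import re

def corrupt_clitic_de_da(sentence: str) -> str:
    """Synthetically inject a common Turkish writing error (illustrative rule):
    the conjunction 'de/da' must be written separately, so attaching it to the
    preceding word turns a correct sentence into an erroneous one."""
    # Attach a standalone 'de'/'da' to the word before it.
    return re.sub(r"(\w+) (de|da)\b", r"\1\2", sentence, count=1)

correct = "Ben de yarın sinemaya gideceğim."
wrong = corrupt_clitic_de_da(correct)

# (erroneous, correct) parallel pair for training a GEC model
print((wrong, correct))  # ('Bende yarın sinemaya gideceğim.', 'Ben de ...')
```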
In-context learning (ICL) for large language models has proven to be a powerful approach for many natural language processing tasks. However, determining the best method to select examples for ICL is nontrivial as the results can vary greatly depending on the quality, quantity, and order of examples used. In this paper, we conduct a case study on text simplification (TS) to investigate how to select the best and most robust examples for ICL. We propose the Metric-Based In-context Learning (MBL) method, which utilizes commonly used TS metrics such as SARI, compression ratio, and BERT-Precision for selection. Through an extensive set of experiments with various-sized GPT models on standard TS benchmarks such as TurkCorpus and ASSET, we show that examples selected by the top SARI scores perform the best on larger models such as GPT-175B, while the compression ratio generally performs better on smaller models such as GPT-13B and GPT-6.7B. Furthermore, we demonstrate that MBL is generally robust to example orderings and out-of-domain test sets, and outperforms strong baselines and state-of-the-art finetuned language models. Finally, we show that the behaviour of large GPT models can be implicitly controlled by the chosen metric. Our research provides a new framework for selecting examples in ICL, and demonstrates its effectiveness in text simplification tasks, breaking new ground for more accurate and efficient NLG systems.
@inproceedings{mbl-subha23,author={Vadlamannati, Subha and Şahin, Gözde Gül},title={Metric-Based In-context Learning: {A} Case Study in Text Simplification},booktitle={Proceedings of the 16th International Natural Language Generation Conference},month=sep,year={2023},address={Prague, Czech Republic},publisher={Association for Computational Linguistics},projects={ALfaLFas}}
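As a minimal illustration of metric-based example selection described in the abstract above, the sketch below ranks candidate (complex, simple) pairs by compression ratio and keeps the top-k as in-context examples; any other TS metric, such as SARI or BERT-Precision, could replace the scorer. The candidate pool, the value of k, the ranking direction, and the prompt assembly are assumptions, not the paper's exact setup.

```python
def compression_ratio(complex_sent: str, simple_sent: str) -> float:
    """Character-level compression ratio of a candidate (complex, simple) pair;
    lower means the simplification compresses the source more."""
    return len(simple_sent) / max(len(complex_sent), 1)

# Hypothetical candidate pool of in-context examples.
candidates = [
    ("The committee reached a unanimous decision after lengthy deliberations.",
     "The committee all agreed after long talks."),
    ("He perambulated around the perimeter of the establishment.",
     "He walked around the building."),
    ("The precipitation intensified throughout the afternoon.",
     "It rained harder in the afternoon."),
]

# Metric-based selection: rank candidates by the chosen metric and keep top-k.
k = 2
selected = sorted(candidates, key=lambda pair: compression_ratio(*pair))[:k]

# Assemble the selected examples into an ICL prompt for a new input.
prompt = "\n\n".join(f"Complex: {c}\nSimple: {s}" for c, s in selected)
prompt += "\n\nComplex: The municipality promulgated new regulations.\nSimple:"
print(prompt)
```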