Terminology-Grounded Translation

Bridging the Gap Between Wikipedians and Scientists with Terminology-Aware Translation: A Case Study in Turkish


This project addresses the gap between the growing volume of English-to-Turkish Wikipedia translations and the insufficient number of contributors, particularly in technical domains. Building on academics’ collaborative terminology dictionary effort, we propose a pipeline system to improve translation quality. We focus on bridging the academic and Wikipedia communities, creating datasets, and developing NLP models for terminology identification, terminology retrieval, and terminology-aware translation. The aim is to foster sustained contributions and improve the overall quality of Turkish Wikipedia articles.

Figure: The pipeline of our proposed project.

Goals

The project will focus on the following tasks:

  • High-quality parallel corpora for terminology-aware translation: We aim to produce 3,000 English-Turkish parallel sentences, each containing: i) English text annotated with its technical terms, ii) links to the correct terminology entries in the database, and iii) an edited translation that uses the correct Turkish terms.

  • Term Identification: Build models to identify the technical terms in a multilingual setup.

  • Term Linking: Build models to ground the identified terms in a terminology database where a matching entry exists. If the database does not contain the term, a notification system will alert the domain experts.

  • Terminology-Aware Translation: We will build post-editing and translation systems that are constrained by the terminology database.

  • Build an Effective Communication Channel: We will survey both communities (Wikipedians and scientists) to identify best practices for building bridges between them and the ways the two communities can help each other sustainably. We will publish reports, best practices, and guidelines.
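To make the three technical tasks above concrete, here is a minimal sketch in Python. It is not the project's implementation: the project plans to build models for these steps, whereas this sketch substitutes plain dictionary matching. The names TermEntry, find_terms, missing_terms, and terminology_violations are hypothetical, and the glossary entries and sentences are toy data written for this example, not taken from the project's terminology database.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TermEntry:
    entry_id: str  # hypothetical database key
    english: str
    turkish: str


# Toy glossary for illustration only; not the project's database.
GLOSSARY = [
    TermEntry("T1", "neural network", "sinir ağı"),
    TermEntry("T2", "network", "ağ"),
    TermEntry("T3", "gradient descent", "gradyan inişi"),
]


def find_terms(text, glossary):
    """Identify glossary terms in English text and link each to its entry.

    Longer terms are tried first, so "neural network" is preferred
    over the shorter entry "network" at the same position.
    """
    if not glossary:
        return []
    by_surface = {e.english.lower(): e for e in glossary}
    alternatives = sorted(by_surface, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, alternatives)) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0), by_surface[m.group(0).lower()])
            for m in pattern.finditer(text)]


def missing_terms(candidates, glossary):
    """Return candidate terms with no glossary entry.

    In the described pipeline these would be reported to domain experts.
    """
    known = {e.english.lower() for e in glossary}
    return [c for c in candidates if c.lower() not in known]


def terminology_violations(source, translation, glossary):
    """Return linked entries whose Turkish term is absent from the translation.

    Naive substring check: it tolerates simple suffixation but ignores
    Turkish-specific casing and stem changes.
    """
    target = translation.lower()
    return [entry for _, entry in find_terms(source, glossary)
            if entry.turkish.lower() not in target]


source = "Gradient descent trains a neural network."
print(find_terms(source, GLOSSARY))
print(missing_terms(["neural network", "backpropagation"], GLOSSARY))
print(terminology_violations(source, "Gradyan inişi bir nöral ağı eğitir.", GLOSSARY))
```

The violation check is the part a post-editing system would act on: a flagged entry marks a spot where the translation should be revised to use the Turkish term from the database. A real system would need morphological analysis rather than substring matching, since Turkish terms usually appear inflected.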

Team

  • PI: Gözde Gül Şahin
  • Graduate student(s): Ali Gebeşçe
  • Interns: Ege Uğur Amasya, Mina Durhasan
  • Duration: 01.06.2024 - 31.05.2025

Funding

This project is funded by the Wikimedia Research Fund. The official URL for the funded project is here.