TurkMedNLI: A Turkish Medical Natural Language Inference Dataset Through Large Language Model-Based Translation

Date

2025

Publisher

PeerJ Inc.

Open Access Color

GOLD

Green Open Access

Yes

Publicly Funded

No
Impulse: Average
Influence: Average
Popularity: Average

Abstract

Natural language inference (NLI) is a subfield of natural language processing (NLP) that aims to identify the contextual relationship between a premise sentence and a hypothesis sentence. While high-resource languages such as English benefit from rich and robust NLI datasets, creating similar datasets for low-resource languages is challenging because manual annotation is costly and complex. Translating existing datasets offers a practical solution, but direct translation of domain-specific datasets presents its own challenges, particularly in handling abbreviations, metric conversions, and cultural alignment. This study introduces a pipeline for translating a medical NLI dataset into Turkish, a low-resource language. Our approach fine-tunes the Llama-3.1 model on selected samples from the Medical Abbreviation dataset (MeDAL) to extract and resolve medical abbreviations. The NLI pairs are then refined with the extracted abbreviations and subjected to metric correction. Finally, the processed sentences are translated using Facebook's No Language Left Behind (NLLB) translation model. To ensure quality, we conducted comprehensive evaluations using both machine learning models and medical expert review. Our results show that BERTurk achieved 75.17% accuracy on the TurkMedNLI test data and 76.30% on the normalized test set, while BioBERTurk demonstrated comparable performance, with 75.59% accuracy on the test data and 72.29% on the normalized dataset. Medical experts further validated the translations through manual assessment of sampled sentences. This work demonstrates the effectiveness of large language models in adapting domain-specific datasets for low-resource languages and establishes a foundation for future research in multilingual biomedical NLP.
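
The order of the preprocessing steps described in the abstract (abbreviation resolution, then metric correction, then translation) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abbreviation table is a hypothetical stand-in for the fine-tuned Llama-3.1 model, the pounds-to-kilograms rule is only one assumed example of what "metric correction" might cover, and `translate` is a placeholder callable where an NLLB model would be plugged in. All function names are invented for illustration.

```python
import re
from typing import Callable, Dict

# Hypothetical lookup table. The paper resolves abbreviations with a
# fine-tuned Llama-3.1 model; a dictionary is used here only as a stand-in.
ABBREVIATIONS: Dict[str, str] = {
    "SOB": "shortness of breath",
    "HTN": "hypertension",
}

LB_TO_KG = 0.45359237  # kilograms per avoirdupois pound


def expand_abbreviations(text: str, table: Dict[str, str] = ABBREVIATIONS) -> str:
    """Replace known upper-case abbreviations with their long forms."""
    def repl(match: re.Match) -> str:
        token = match.group(0)
        return table.get(token, token)  # unknown tokens are left unchanged
    return re.sub(r"\b[A-Z]{2,}\b", repl, text)


def convert_pounds_to_kg(text: str) -> str:
    """Assumed example of metric correction: rewrite pound values as kilograms."""
    def repl(match: re.Match) -> str:
        kilograms = float(match.group(1)) * LB_TO_KG
        return f"{kilograms:.1f} kg"
    return re.sub(r"(\d+(?:\.\d+)?)\s*(?:lbs?|pounds?)\b", repl, text)


def preprocess_and_translate(text: str, translate: Callable[[str], str]) -> str:
    """Apply the steps in the order the abstract gives, then translate."""
    normalized = convert_pounds_to_kg(expand_abbreviations(text))
    return translate(normalized)
```

Passing the translator in as a callable keeps the sketch independent of any particular model or library. For a dry run, `preprocess_and_translate(sentence, lambda s: s)` returns the normalized English sentence without translating it.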

Description

Soygazi, Fatih/0000-0001-8426-2283; Ergenc Bostanoglu, Belgin/0000-0001-6193-9853

Keywords

MedNLI, NLLB, BERT, Natural Language Inference, Natural Language Processing, Language Translation, LLM, Llama, Artificial Intelligence, Electronic computers. Computer science, QA75.5-76.95

WoS Q

Q2

Scopus Q

Q1

Source

PeerJ Computer Science

Volume

11

PlumX Metrics
Citations

Scopus : 2

PubMed : 1

Captures

Mendeley Readers : 8

SCOPUS™ Citations

2

checked on Apr 27, 2026

Page Views

20

checked on Apr 27, 2026

Downloads

1

checked on Apr 27, 2026

OpenAlex FWCI
9.63949029
