TurkMedNLI: a Turkish Medical Natural Language Inference Dataset Through Large Language Model Based Translation

Ogul, Iskender Ulgen; Soygazi, Fatih; Bostanoglu, Belgin Ergenc

doi:10.7717/peerj-cs.2662

Files

Primary peerj-cs-2662.pdf (2.63 MB)

Date

2025

Authors

Ogul, Iskender Ulgen

Soygazi, Fatih

Bostanoglu, Belgin Ergenc

Publisher

PeerJ Inc.

Open Access Color

GOLD

Green Open Access

Yes

Publicly Funded

No

Impulse

Average

Influence

Average

Popularity

Average

Abstract

Natural language inference (NLI) is a subfield of natural language processing (NLP) that aims to identify the contextual relationship between premise and hypothesis sentences. While high-resource languages like English benefit from robust and rich NLI datasets, creating similar datasets for low-resource languages is challenging due to the cost and complexity of manual annotation. Although translating existing datasets offers a practical solution, direct translation of domain-specific datasets presents unique challenges, particularly in handling abbreviations, metric conversions, and cultural alignment. This study introduces a pipeline for translating a medical NLI dataset into Turkish, a low-resource language. Our approach fine-tunes the Llama-3.1 model on selected samples from the Medical Abbreviation dataset (MeDAL) to extract and resolve medical abbreviations. The NLI pairs are then refined with the resolved abbreviations and subjected to metric correction. Finally, the processed sentences are translated using Facebook's No Language Left Behind (NLLB) translation model. To ensure quality, we conducted comprehensive evaluations using both machine learning models and medical expert review. Our results show that BERTurk achieved 75.17% accuracy on the TurkMedNLI test data and 76.30% on the normalized test set, while BioBERTurk demonstrated comparable performance with 75.59% accuracy on the test data and 72.29% on the normalized dataset. Medical experts further validated the translations through manual assessment of sampled sentences. This work demonstrates the effectiveness of large language models in adapting domain-specific datasets for low-resource languages, establishing a foundation for future research in multilingual biomedical NLP.
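The pre-translation normalization described in the abstract (resolving abbreviations, then correcting units) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper resolves abbreviations with a fine-tuned Llama-3.1 model, whereas the sketch below substitutes a small hypothetical lookup table, and it assumes that "metric correction" includes converting imperial weights to metric. The names `ABBREVIATIONS`, `expand_abbreviations`, `pounds_to_kilograms`, and `normalize` are illustrative only, and the NLLB translation step is omitted.

```python
import re

# Hypothetical abbreviation table (an assumption for illustration).
# The paper resolves abbreviations with a fine-tuned Llama-3.1 model,
# not a fixed dictionary.
ABBREVIATIONS = {
    "HTN": "hypertension",
    "SOB": "shortness of breath",
}

# One international pound is defined as exactly 0.45359237 kg.
LB_TO_KG = 0.45359237

# Match any known abbreviation as a whole word only.
_ABBREV_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b"
)

# Match a number followed by a pound unit, e.g. "200 lbs" or "3.5 pounds".
_POUNDS_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(?:lbs?|pounds?)\b")


def expand_abbreviations(text: str) -> str:
    """Replace whole-word abbreviations with their expansions."""
    return _ABBREV_PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


def pounds_to_kilograms(text: str) -> str:
    """Rewrite weights given in pounds as kilograms, to one decimal place."""
    def convert(match: re.Match) -> str:
        kilograms = float(match.group(1)) * LB_TO_KG
        return f"{kilograms:.1f} kg"
    return _POUNDS_PATTERN.sub(convert, text)


def normalize(text: str) -> str:
    """Apply abbreviation expansion, then unit conversion."""
    return pounds_to_kilograms(expand_abbreviations(text))


if __name__ == "__main__":
    print(normalize("Patient with HTN weighs 200 lbs"))
```

In a full pipeline, each normalized premise and hypothesis would then be passed to a translation model; a dictionary lookup like the one above cannot disambiguate abbreviations by context, which is presumably why the study uses a language model for that step.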

Description

Soygazi, Fatih/0000-0001-8426-2283; Ergenc Bostanoglu, Belgin/0000-0001-6193-9853

ORCID

Soygazi, Fatih

Ergenc Bostanoglu, Belgin

Keywords

MedNLI, NLLB, BERT, Natural Language Inference, Natural Language Processing, Language Translation, LLM, Llama, Artificial Intelligence, Electronic computers. Computer science, QA75.5-76.95

WoS Q

Q2

Scopus Q

Q1

OpenCitations Citation Count

N/A

Source

PeerJ Computer Science

Volume

11

URI

https://doi.org/10.7717/peerj-cs.2662
https://hdl.handle.net/11147/15426

Collections

WoS İndeksli Yayınlar Koleksiyonu / WoS Indexed Publications Collection
PubMed İndeksli Yayınlar Koleksiyonu / PubMed Indexed Publications Collection
Scopus İndeksli Yayınlar Koleksiyonu / Scopus Indexed Publications Collection

PlumX Metrics

Citations

Scopus : 2

PubMed : 1

Captures

Mendeley Readers : 8

SCOPUS™ Citations

2

checked on Apr 27, 2026

Page Views

20

checked on Apr 27, 2026

Downloads

1

checked on Apr 27, 2026


OpenAlex FWCI

9.63949029