TurkMedNLI: A Turkish Medical Natural Language Inference Dataset Through Large Language Model Based Translation

dc.contributor.author Ogul, Iskender Ulgen
dc.contributor.author Soygazi, Fatih
dc.contributor.author Bostanoglu, Belgin Ergenc
dc.date.accessioned 2025-03-25T22:55:22Z
dc.date.available 2025-03-25T22:55:22Z
dc.date.issued 2025
dc.description Soygazi, Fatih/0000-0001-8426-2283; Ergenc Bostanoglu, Belgin/0000-0001-6193-9853 en_US
dc.description.abstract Natural language inference (NLI) is a subfield of natural language processing (NLP) that aims to identify the contextual relationship between premise and hypothesis sentences. While high-resource languages like English benefit from robust and rich NLI datasets, creating similar datasets for low-resource languages is challenging due to the cost and complexity of manual annotation. Although translation of existing datasets offers a practical solution, direct translation of domain-specific datasets presents unique challenges, particularly in handling abbreviations, metric conversions, and cultural alignment. This study introduces a pipeline for translating a medical NLI dataset into Turkish, a low-resource language. Our approach fine-tunes the Llama-3.1 model on selected samples from the Medical Abbreviation dataset (MeDAL) to extract and resolve medical abbreviations. The NLI pairs are then refined with the resolved abbreviations and subjected to metric correction, after which the processed sentences are translated using Facebook's No Language Left Behind (NLLB) translation model. To ensure quality, we conducted comprehensive evaluations using both machine learning models and medical expert review. Our results show that BERTurk achieved 75.17% accuracy on TurkMedNLI test data and 76.30% on the normalized test set, while BioBERTurk demonstrated comparable performance with 75.59% accuracy on test data and 72.29% on the normalized dataset. Medical experts further validated the translations through manual assessment of sampled sentences. This work demonstrates the effectiveness of large language models in adapting domain-specific datasets for low-resource languages, establishing a foundation for future research in multilingual biomedical NLP. en_US
dc.identifier.doi 10.7717/peerj-cs.2662
dc.identifier.issn 2376-5992
dc.identifier.scopus 2-s2.0-85219134639
dc.identifier.uri https://doi.org/10.7717/peerj-cs.2662
dc.identifier.uri https://hdl.handle.net/11147/15426
dc.language.iso en en_US
dc.publisher PeerJ Inc. en_US
dc.relation.ispartof PeerJ Computer Science
dc.rights info:eu-repo/semantics/openAccess en_US
dc.subject MedNLI en_US
dc.subject NLLB en_US
dc.subject BERT en_US
dc.subject Natural Language Inference en_US
dc.subject Natural Language Processing en_US
dc.subject Language Translation en_US
dc.subject LLM en_US
dc.subject Llama en_US
dc.title TurkMedNLI: A Turkish Medical Natural Language Inference Dataset Through Large Language Model Based Translation en_US
dc.type Article en_US
dspace.entity.type Publication
gdc.author.id Soygazi, Fatih/0000-0001-8426-2283
gdc.author.id Ergenc Bostanoglu, Belgin/0000-0001-6193-9853
gdc.author.scopusid 57220960947
gdc.author.scopusid 57195222455
gdc.author.scopusid 24478565000
gdc.author.wosid Soygazi, Fatih/Abn-0409-2022
gdc.author.wosid Ogul, Iskender Ulgen/Ncv-6682-2025
gdc.author.wosid Ergenc Bostanoglu, Belgin/O-2529-2015
gdc.bip.impulseclass C5
gdc.bip.influenceclass C5
gdc.bip.popularityclass C5
gdc.coar.access open access
gdc.coar.type text::journal::journal article
gdc.collaboration.industrial false
gdc.description.department İzmir Institute of Technology en_US
gdc.description.departmenttemp [Ogul, Iskender Ulgen; Bostanoglu, Belgin Ergenc] Izmir Inst Technol, Comp Engn, Izmir, Turkiye; [Soygazi, Fatih] Adnan Menderes Univ, Comp Engn, Aydin, Turkiye en_US
gdc.description.publicationcategory Makale - Uluslararası Hakemli Dergi - Kurum Öğretim Elemanı en_US
gdc.description.scopusquality Q1
gdc.description.volume 11 en_US
gdc.description.woscitationindex Science Citation Index Expanded
gdc.description.wosquality Q2
gdc.identifier.openalex W4406998844
gdc.identifier.pmid 40062299
gdc.identifier.wos WOS:001480000500001
gdc.index.type WoS
gdc.index.type Scopus
gdc.index.type PubMed
gdc.oaire.accesstype GOLD
gdc.oaire.diamondjournal false
gdc.oaire.impulse 1.0
gdc.oaire.influence 2.8232872E-9
gdc.oaire.isgreen true
gdc.oaire.keywords MedNLI
gdc.oaire.keywords Artificial Intelligence
gdc.oaire.keywords Natural language processing
gdc.oaire.keywords Electronic computers. Computer science
gdc.oaire.keywords Language translation
gdc.oaire.keywords QA75.5-76.95
gdc.oaire.keywords NLLB
gdc.oaire.keywords Natural language inference
gdc.oaire.keywords BERT
gdc.oaire.popularity 2.1693905E-10
gdc.oaire.publicfunded false
gdc.openalex.collaboration National
gdc.openalex.fwci 9.63949029
gdc.openalex.normalizedpercentile 0.96
gdc.openalex.toppercent TOP 10%
gdc.opencitations.count 0
gdc.plumx.mendeley 8
gdc.plumx.newscount 1
gdc.plumx.pubmedcites 1
gdc.plumx.scopuscites 2
gdc.scopus.citedcount 2
gdc.wos.citedcount 0
relation.isAuthorOfPublication.latestForDiscovery 3b51d444-157d-4dff-a209-e28543a80dcd
relation.isOrgUnitOfPublication.latestForDiscovery 9af2b05f-28ac-4014-8abe-a4dfe192da5e

Files

Original bundle

Name: peerj-cs-2662.pdf
Size: 2.63 MB
Format: Adobe Portable Document Format
Description: article