TurkMedNLI: A Turkish Medical Natural Language Inference Dataset Through Large Language Model Based Translation

dc.contributor.author Ogul, Iskender Ulgen
dc.contributor.author Soygazi, Fatih
dc.contributor.author Bostanoglu, Belgin Ergenc
dc.date.accessioned 2025-03-25T22:55:22Z
dc.date.available 2025-03-25T22:55:22Z
dc.date.issued 2025
dc.description Soygazi, Fatih/0000-0001-8426-2283; Ergenc Bostanoglu, Belgin/0000-0001-6193-9853 en_US
dc.description.abstract Natural language inference (NLI) is a subfield of natural language processing (NLP) that aims to identify the contextual relationship between premise and hypothesis sentences. While high-resource languages like English benefit from robust and rich NLI datasets, creating similar datasets for low-resource languages is challenging due to the cost and complexity of manual annotation. Although translation of existing datasets offers a practical solution, direct translation of domain-specific datasets presents unique challenges, particularly in handling abbreviations, metric conversions, and cultural alignment. This study introduces a pipeline for translating a medical NLI dataset into Turkish, a low-resource language. Our approach fine-tunes the Llama-3.1 model on selected samples from the Medical Abbreviation dataset (MeDAL) to extract and resolve medical abbreviations. The NLI pairs are then refined with the resolved abbreviations and subjected to metric correction, after which the processed sentences are translated using Facebook's No Language Left Behind (NLLB) translation model. To ensure quality, we conducted comprehensive evaluations using both machine learning models and medical expert review. Our results show that BERTurk achieved 75.17% accuracy on TurkMedNLI test data and 76.30% on the normalized test set, while BioBERTurk demonstrated comparable performance with 75.59% accuracy on test data and 72.29% on the normalized dataset. Medical experts further validated the translations through manual assessment of sampled sentences. This work demonstrates the effectiveness of large language models in adapting domain-specific datasets for low-resource languages, establishing a foundation for future research in multilingual biomedical NLP. en_US
dc.identifier.doi 10.7717/peerj-cs.2662
dc.identifier.issn 2376-5992
dc.identifier.scopus 2-s2.0-85219134639
dc.identifier.uri https://doi.org/10.7717/peerj-cs.2662
dc.identifier.uri https://hdl.handle.net/11147/15426
dc.language.iso en en_US
dc.publisher PeerJ Inc. en_US
dc.relation.ispartof PeerJ Computer Science
dc.rights info:eu-repo/semantics/openAccess en_US
dc.subject MedNLI en_US
dc.subject NLLB en_US
dc.subject BERT en_US
dc.subject Natural Language Inference en_US
dc.subject Natural Language Processing en_US
dc.subject Language Translation en_US
dc.subject LLM en_US
dc.subject Llama en_US
dc.title TurkMedNLI: A Turkish Medical Natural Language Inference Dataset Through Large Language Model Based Translation en_US
dc.type Article en_US
dspace.entity.type Publication
gdc.author.id Soygazi, Fatih/0000-0001-8426-2283
gdc.author.id Ergenc Bostanoglu, Belgin/0000-0001-6193-9853
gdc.author.scopusid 57220960947
gdc.author.scopusid 57195222455
gdc.author.scopusid 24478565000
gdc.author.wosid Soygazi, Fatih/Abn-0409-2022
gdc.author.wosid Ogul, Iskender Ulgen/Ncv-6682-2025
gdc.author.wosid Ergenc Bostanoglu, Belgin/O-2529-2015
gdc.bip.impulseclass C5
gdc.bip.influenceclass C5
gdc.bip.popularityclass C5
gdc.coar.access open access
gdc.coar.type text::journal::journal article
gdc.collaboration.industrial false
gdc.description.department İzmir Institute of Technology en_US
gdc.description.departmenttemp [Ogul, Iskender Ulgen; Bostanoglu, Belgin Ergenc] Izmir Inst Technol, Comp Engn, Izmir, Turkiye; [Soygazi, Fatih] Adnan Menderes Univ, Comp Engn, Aydin, Turkiye en_US
gdc.description.publicationcategory Makale - Uluslararası Hakemli Dergi - Kurum Öğretim Elemanı en_US
gdc.description.scopusquality Q1
gdc.description.volume 11 en_US
gdc.description.woscitationindex Science Citation Index Expanded
gdc.description.wosquality Q2
gdc.identifier.openalex W4406998844
gdc.identifier.pmid 40062299
gdc.identifier.wos WOS:001480000500001
gdc.index.type WoS
gdc.index.type Scopus
gdc.index.type PubMed
gdc.oaire.accesstype GOLD
gdc.oaire.diamondjournal false
gdc.oaire.impulse 1.0
gdc.oaire.influence 2.8232872E-9
gdc.oaire.isgreen true
gdc.oaire.keywords MedNLI
gdc.oaire.keywords Artificial Intelligence
gdc.oaire.keywords Natural language processing
gdc.oaire.keywords Electronic computers. Computer science
gdc.oaire.keywords Language translation
gdc.oaire.keywords QA75.5-76.95
gdc.oaire.keywords NLLB
gdc.oaire.keywords Natural language inference
gdc.oaire.keywords BERT
gdc.oaire.popularity 2.1693905E-10
gdc.oaire.publicfunded false
gdc.openalex.collaboration National
gdc.openalex.fwci 9.63949029
gdc.openalex.normalizedpercentile 0.96
gdc.openalex.toppercent TOP 10%
gdc.opencitations.count 0
gdc.plumx.mendeley 8
gdc.plumx.newscount 1
gdc.plumx.pubmedcites 1
gdc.plumx.scopuscites 2
gdc.scopus.citedcount 2
gdc.wos.citedcount 0
relation.isAuthorOfPublication.latestForDiscovery 3b51d444-157d-4dff-a209-e28543a80dcd
relation.isOrgUnitOfPublication.latestForDiscovery 9af2b05f-28ac-4014-8abe-a4dfe192da5e

Files

Original bundle

Name: peerj-cs-2662.pdf
Size: 2.63 MB
Format: Adobe Portable Document Format
Description: article