Comparison of Document Classification Approaches for Turkish Texts

Çobanoğlu, Özlem Ece; Aslan, Burak Galip

dc.contributor.advisor	Aslan, Burak Galip
dc.contributor.author	Çobanoğlu, Özlem Ece
dc.contributor.author	Aslan, Burak Galip
dc.contributor.other	03.04. Department of Computer Engineering
dc.contributor.other	03. Faculty of Engineering
dc.contributor.other	01. Izmir Institute of Technology
dc.date.accessioned	2016-01-04T08:53:22Z
dc.date.available	2016-01-04T08:53:22Z
dc.date.issued	2015
dc.description	Thesis (Master)--Izmir Institute of Technology, Computer Engineering, Izmir, 2015	en_US
dc.description	Full text release delayed at author's request until 2018.08.14	en_US
dc.description	Includes bibliographical references (leaves: 55-58)	en_US
dc.description	Text in English; Abstract: Turkish and English	en_US
dc.description	xi, 71 leaves	en_US
dc.description.abstract	Internet usage is growing rapidly, and with it the number of electronic documents produced every day. The sheer volume of documents makes it difficult to access necessary and relevant information. Because the documents lack logical organization, retrieving and processing the desired information from them by human effort is a complex and time-consuming task. Document classification is therefore an important task for managing and processing documents. In this thesis, the performance of different classification approaches built from several algorithms is evaluated. The main goal of the thesis is to determine the best combination of document preprocessing steps and classification algorithms. Different feature weighting, construction and selection methods are tested on Turkish documents. Stemmed and original words, together with their bi-gram and tri-gram forms, are used to construct the features that represent the documents. The effects of several weighting algorithms, and of combinations of feature selection and weighting algorithms, on 3 different classification algorithms are interpreted. The performance of 216 different classification process combinations is analyzed. Experimental results show that the C4.5 (C4.5 Decision Tree) classification algorithm has the highest accuracy in 95% of the results. The SVM (Support Vector Machine) algorithm produces the results closest to C4.5 and provides the highest accuracy in 5% of the results. The NB (Naive Bayes) algorithm always has the lowest accuracy of the 3 classification algorithms.	en_US
dc.description.abstract	With Internet usage becoming more widespread day by day, the number of electronic documents is increasing rapidly. Because most documents have no logical structure, reaching the desired information within these piles of documents by human effort is a complex and time-consuming task; document classification is therefore an important operation for organizing, managing and processing documents quickly. In this thesis, the performance of multiple classification approaches, obtained by using different algorithms on Turkish documents, is evaluated. The main goal of the thesis is to determine the best combination of document preprocessing steps and classification algorithms. The words occurring in a document themselves, their stems, and their bi-gram and tri-gram forms are used to construct the features that represent the documents. By applying different weighting, selection and classification algorithms to these feature sets, 216 experimental results are obtained. According to the experimental results, the C4.5 (C4.5 Decision Tree) classification algorithm has the highest accuracy in 95% of the results. The SVM (Support Vector Machine) algorithm produces the results closest to C4.5 and gives the highest accuracy in 5% of the results. The NB (Naive Bayes) algorithm was observed to always have the lowest accuracy among these 3 classification algorithms.	en_US
dc.identifier.citation	Çobanoğlu, Ö. E. (2015). Comparison of document classification approaches for Turkish texts. Unpublished master's thesis, İzmir Institute of Technology, İzmir, Turkey	en_US
dc.identifier.uri	https://hdl.handle.net/11147/4461
dc.language.iso	en	en_US
dc.publisher	Izmir Institute of Technology	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Computer algorithms	en_US
dc.subject	Classification	en_US
dc.subject	Text processing (Computer science)	en_US
dc.subject	C4.5 Decision Tree	en_US
dc.subject	Support Vector Machine	en_US
dc.subject	Naive Bayes method	en_US
dc.title	Comparison of Document Classification Approaches for Turkish Texts	en_US
dc.title.alternative	Türkçe Metinler için Doküman Sınıflandırma Yaklaşımlarının Karşılaştırılması	en_US
dc.type	Master Thesis	en_US
dspace.entity.type	Publication
gdc.author.institutional	Çobanoğlu, Özlem Ece
gdc.coar.access	open access
gdc.coar.type	text::thesis::master thesis
gdc.description.department	Thesis (Master)--İzmir Institute of Technology, Computer Engineering	en_US
gdc.description.publicationcategory	Thesis	en_US
gdc.description.scopusquality	N/A
gdc.description.wosquality	N/A
relation.isAuthorOfPublication	97fb9193-a4c3-487d-b86f-5dd85d8cb27e
relation.isAuthorOfPublication.latestForDiscovery	97fb9193-a4c3-487d-b86f-5dd85d8cb27e
relation.isOrgUnitOfPublication	9af2b05f-28ac-4014-8abe-a4dfe192da5e
relation.isOrgUnitOfPublication	9af2b05f-28ac-4004-8abe-a4dfe192da5e
relation.isOrgUnitOfPublication	9af2b05f-28ac-4003-8abe-a4dfe192da5e
relation.isOrgUnitOfPublication.latestForDiscovery	9af2b05f-28ac-4014-8abe-a4dfe192da5e

Files

Original bundle

Name: T001394.pdf
Size: 1.85 MB
Format: Adobe Portable Document Format
Description: MasterThesis

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission
Description:

Collections

Master Degree / Yüksek Lisans Tezleri