Master Degree / Yüksek Lisans Tezleri
Permanent URI for this collection: https://hdl.handle.net/11147/3008
Master Thesis
Comparison of Classification Algorithms in Pitch Type Prediction Problem
(Izmir Institute of Technology, 2020) Türkmen, Fatih; Ergenç Bostanoğlu, Belgin
The dramatic increase in the use of IoT devices has produced a huge amount of valuable data waiting to be explored. Extracting knowledge from such data requires an organized, scientific set of processes, which underlines the importance of data mining applications. As a major data mining application, classification is a supervised learning technique that requires a feature set and a target class for training. The key point in training is determining the appropriate feature set for the classification algorithm. Improvements in technologies such as high-resolution camera systems have made it possible to extract insights about the next pitch. Consequently, pitch type prediction has stood out as an important research topic. To predict the next pitch type, existing research mostly uses the pitcher profile, the batter profile, and previous pitch data in the feature set. No study has analyzed the effect of zone information on the prediction of the next pitch type. Therefore, this study analyzes the contribution of zone information to pitch type prediction. Our approach aims to reveal the contribution of zones with high strike and low bat rates to the pitch type decision in pitcher and batter match-ups.
This aim directed us to analyze the pitch type prediction problem with both zone-based and non-zone-based approaches, so that we can show how much zone information contributes to the problem across different classification algorithms.

Master Thesis
Tag Based Storage and Retrieval System for Organization Related News
(Izmir Institute of Technology, 2019) Parkın, Kübra; Tuğlular, Tuğkan
For corporate organizations, it is becoming more and more important to gather information about competitors or partners, or any kind of information that may relate to the organization. In a rapidly changing world, staying competitive and making consistent strategic decisions are becoming increasingly difficult. Gathering news about the business has an undeniable effect on company decisions. It is essential to keep up with this race in order not to fall out of it. Therefore, corporate companies need a retrieval system that collects and evaluates information relevant to the organization. However, it can be difficult to make use of large amounts of information. What is needed is to store that information based on a pattern and make it easy to analyze.

Master Thesis
Analyzing Social Media Data by Frequent Pattern Mining Methods
(Izmir Institute of Technology, 2018) Güvenoğlu, Büşra; Ergenç Bostanoğlu, Belgin
Data mining is a popular research area that focuses on finding unforeseen and important information in large datasets. Social media data is among the most popular large, heterogeneous data, collected from social networking sites, microblogs, and photo or video sharing sites. Social media represents entities and their relations. One of the popular data structures used to represent large heterogeneous data in data mining is the graph. The nodes of a graph represent entities and the edges represent the relations between the entities.
So, graph mining is one of the most popular subdivisions of data mining. A frequent pattern is a pattern that occurs in a dataset more often than a user-defined threshold. Frequent patterns can give important information about a dataset. Using this information, data can be classified or clustered. Frequent patterns can provide different perspectives on social media data with respect to sociology, consumer behaviour, marketing, and communities. In this thesis, popular frequent pattern mining algorithms have been examined, and it has been observed that most of them are not suitable for large datasets. Since today's data, especially from social networks, is very large, the existing pattern mining algorithms are not suitable for it. The aim of this thesis is to implement an existing frequent pattern mining algorithm in a parallel manner and to find frequent patterns in social media data.

Master Thesis
Spatio-Temporal Modeling of Documents
(Izmir Institute of Technology, 2017) Yaşar, Damla; Tekir, Selma
Temporal and geographic information are important aspects of text documents. Such information occurs frequently in many types of text documents in the form of temporal and geographic expressions. Spatio-temporal expressions can be normalized so that their meaning is unambiguous and they can be placed on a timeline or pinpointed on a map. A general text document can contain many spatio-temporal expressions that are unrelated to its content. In this thesis, we propose estimating the focus time and focus place of documents, defined as the time and place that the document's content refers to. We utilize statistical knowledge from English Wikipedia to calculate association scores that are used to estimate the focus time and place of a document. We implement two different association score calculation methodologies and compare their accuracy.
The effectiveness of our methods is evaluated on three different time-tagged datasets of documents about historical events spanning a total time frame of 4000 years. Our methods achieve an average error of less than 15 years. Our methods are also able to estimate the focus place of each document correctly.

Master Thesis
Development of Framework for Frequent Itemset Mining Under Multiple Support Thresholds
(Izmir Institute of Technology, 2016) Darrab, Sadeq Hussein Saleh; Ergenç Bostanoğlu, Belgin
Frequent pattern mining is an essential data mining method used to extract interesting patterns from massive databases. Traditional methods use a single minimum support threshold to find the complete set of frequent patterns. However, in real-world applications, a single minimum support threshold is not adequate, since it does not reflect the nature of each item and causes the so-called rare item problem. Recently, several methods have been studied to tackle this problem by avoiding a single minimum support threshold. The nature of each item is taken into account by assigning different items different minimum support thresholds. In this way, the complete set of frequent patterns is generated without creating uninteresting patterns or losing substantial patterns. In this thesis, we propose an efficient method, the Multiple Item Support Frequent Pattern growth algorithm (MISFP-growth), to mine the complete set of frequent patterns with multiple item support thresholds. In this method, a Multiple Item Support Frequent Pattern tree (MISFP-Tree) is constructed to store all the crucial information needed to mine frequent patterns. Since the MISFP-Tree is constructed with respect to the minimum of the Multiple Itemset Support values, pruning and reconstruction phases are not required. To show the efficiency of the proposed method, it is compared with a recent tree-based algorithm, CFP-growth++.
To evaluate the performance of the proposed algorithm, various experiments are conducted on both real and synthetic datasets. Experimental results reveal that MISFP-growth outperforms the previous algorithm in terms of execution time, memory space, and scalability.

Master Thesis
Sales History-Based Demand Prediction by Using Generalized Linear Models
(Izmir Institute of Technology, 2016) Özenboy, Başar; Tekir, Selma
Improved data collection and storage capabilities make vast amounts of data available in appropriate formats. Commercial enterprises store their sales data. It is vital for companies to predict demand accurately from the existing sales data. Such predictive analytics is a crucial part of their decision support systems for increasing profitability. In predictive data analytics, regression modeling is commonly used to predict a numerical response variable such as sales amount. In recent years, generalized linear models have provided a generalization that better addresses the specificities of the problem at hand. To begin with, they relax the assumption of normally distributed error terms. Moreover, the relationship between the predictor variables and the response variable can be represented by a choice of link functions rather than only the identity function. This thesis models the sales amount prediction problem through generalized linear models. Unique company sales data are explored and fitted with a suitable distribution for the response variable along with an appropriate link function. The experimental results are compared with other regression models, classification algorithms, and time series models.
Model selection is performed using the MSE and AIC metrics.

Master Thesis
Development of an Application for Dynamic Itemset Mining Under Multiple Support Thresholds
(Izmir Institute of Technology, 2016) Abuzayed, Nourhan; Ergenç Bostanoğlu, Belgin
Handling the dynamic aspect of databases and the multiple support threshold requirement of items are two important challenges for frequent itemset mining algorithms. Frequent itemsets should be updated when the database is updated, without re-running the mining algorithm. A frequent itemset mining algorithm should also consider different support thresholds in order to avoid the rare item problem. Existing dynamic itemset mining algorithms are devised for a single support threshold, whereas multiple support threshold algorithms are static. This thesis focuses on the dynamic update problem of frequent itemsets under multiple support thresholds and introduces the Dynamic MIS1 and Dynamic MIS2 algorithms. They i) are tree based and scan the database once, ii) consider multiple support thresholds, and iii) handle increments of additions, additions with new items, and deletions.
The proposed algorithms are compared to CFP-Growth++, with the following findings. In static databases: 1) Dynamic MIS1 achieves up to 5 times speed-up against CFP-Growth++, since it does not require tree pruning and merging; 2) the execution times of Dynamic MIS2 and CFP-Growth++ are similar; 3) the memory usage of Dynamic MIS1 is higher than that of CFP-Growth++, since it keeps the whole tree in memory. In dynamic databases: 1) Dynamic MIS1 and Dynamic MIS2 perform better than CFP-Growth++, since they run only on increments; 2) Dynamic MIS1 can achieve a speed-up of 56 times against CFP-Growth++, whereas the speed-up of Dynamic MIS2 cannot exceed 2 times; 3) Dynamic MIS2 is slightly better than CFP-Growth++ as long as the increment size is less than 85% when the database is large and sparse, and less than 25% when the database is small and dense.

Master Thesis
An Exact Approach With Minimum Side-Effects for Association Rule Hiding
(Izmir Institute of Technology, 2014) Leloğlu, Engin; Ayav, Tolga
Concealing sensitive relationships before sharing a database is of utmost importance in many circumstances. This implies hiding the frequent itemsets corresponding to sensitive association rules by removing some items from the database. Research efforts generally aim at finding more effective methods in terms of convenience, execution time, and side effects. This paper presents a practical approach for hiding sensitive patterns while retaining as many nonsensitive patterns as possible in the sanitized database. We model the itemset hiding problem as an integer program in which the objective coefficients allow finding a solution with minimum loss of nonsensitive itemsets. We evaluate our method using three real datasets from the FIMI repository and compare the results with a previous exact solution and with the heuristic study whose procedures are imposed by the new approach.
The results show that information loss is greatly reduced without requiring an excessive number of modifications to the databases.

Master Thesis
Parallelization of a Novel Frequent Itemset Hiding Algorithm on a CPU-GPU Platform
(Izmir Institute of Technology, 2014) Heye, Samuel Bacha; Ayav, Tolga
Data mining is used to extract useful information from large data. However, the organizations that mine the data might not be its owners. So, before owners make their data accessible for data mining, they want to make sure that no sensitive information whose discovery by others might harm them can be mined from the released data. Itemset hiding is one mechanism to prevent the disclosure of sensitive itemsets. In this thesis, a new integer programming based itemset hiding algorithm was developed, and a mechanism to reduce the computation time of its implementation was proposed using parallel computation on Graphics Processing Units (GPUs).

Master Thesis
Comparison of Different Algorithms for Exploiting the Hidden Trends in Data Sources
(Izmir Institute of Technology, 2003) Özsevim, Emrah; Püskülcü, Halis
The growth of large-scale transactional databases, time-series databases, and other kinds of databases has given rise to the development of several efficient algorithms that cope with the computationally expensive task of association rule mining. In this study, three algorithms, Apriori, FP-tree, and CHARM, for exploiting hidden trends such as frequent itemsets, frequent patterns, and closed frequent itemsets, respectively, were discussed and their performances were evaluated. The performances of the algorithms were measured at different support levels, and the algorithms were tested on different data sets (both synthetic and real).
The algorithms were compared according to their data preparation performance, mining performance, run time performance, and knowledge extraction capabilities. The Apriori algorithm is the most prevalent association rule mining algorithm; it makes multiple passes over the database to find the set of frequent itemsets at each level. The FP-Tree algorithm is a scalable algorithm that finds the crucial information regarding the complete set of prefix paths, conditional pattern bases, and frequent patterns by using a compact FP-Tree based mining method. CHARM is a novel algorithm that brings remarkable improvements over existing association rule mining algorithms by showing that mining the set of closed frequent itemsets is sufficient, instead of mining the set of all frequent itemsets. Based on our experimental results, we conclude that the Apriori algorithm performs well on sparse data sets. The FP-tree algorithm extracts fewer associations than Apriori; however, it is a completely feasible solution that facilitates mining dense data sets at low support levels. On the other hand, the CHARM algorithm is appropriate for mining closed frequent itemsets (a substantial portion of frequent itemsets) on both sparse and dense data sets, even at low levels of support.
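
As an illustration of the level-wise behaviour described for Apriori above (one pass over the database per itemset size), the following is a minimal sketch and not the implementation evaluated in the thesis. The function name `apriori`, the use of an absolute support count for `min_support`, and the toy transactions are all assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset.

    min_support is an absolute count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: one pass over the database to count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join frequent (k-1)-itemsets into k-item candidates and keep only
        # those whose (k-1)-subsets are all frequent (the Apriori property).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # One further pass over the database to count the candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        current = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1

    return frequent


# Hypothetical toy database of five transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
result = apriori(baskets, min_support=3)
```

With a support count of 3, the four single items and the pair {bread, milk} are frequent in this toy database. Each iteration of the `while` loop corresponds to one additional database pass, which is the repeated scanning that the abstract attributes to Apriori.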
