Molecular Biology and Genetics / Moleküler Biyoloji ve Genetik

Permanent URI for this collectionhttps://hdl.handle.net/11147/9

Browse

Search Results

Now showing 1 - 9 of 9

Citation - WoS: 4
Citation - Scopus: 4
Improving the Quality of Positive Datasets for the Establishment of Machine Learning Models for Pre-Microrna Detection
(Informationsmanagement in der Biotechnologie e.V. (IMBio e.V.), 2017) Saçar Demirci, Müşerref Duygu; Allmer, Jens; Allmer, Jens; 04.03. Department of Molecular Biology and Genetics; 04. Faculty of Science; 01. Izmir Institute of Technology
MicroRNAs (miRNAs) are involved in the post-transcriptional regulation of protein abundance and thus have a great impact on the resulting phenotype. It is, therefore, no wonder that they have been implicated in many diseases ranging from virus infections to cancer. This impact on the phenotype leads to a great interest in establishing the miRNAs of an organism. Experimental methods are complicated which led to the development of computational methods for pre-miRNA detection. Such methods generally employ machine learning to establish models for the discrimination between miRNAs and other sequences. Positive training data for model establishment, for the most part, stems from miRBase, the miRNA registry. The quality of the entries in miRBase has been questioned, though. This unknown quality led to the development of filtering strategies in attempts to produce high quality positive datasets which can lead to a scarcity of positive data. To analyze the quality of filtered data we developed a machine learning model and found it is well able to establish data quality based on intrinsic measures. Additionally, we analyzed which features describing pre-miRNAs could discriminate between low and high quality data. Both models are applicable to data from miRBase and can be used for establishing high quality positive data. This will facilitate the development of better miRNA detection tools which will make the prediction of miRNAs in disease states more accurate. Finally, we applied both models to all miRBase data and provide the list of high quality hairpins.
Citation - WoS: 11
Citation - Scopus: 14
Categorization of Species Based on Their Micrornas Employing Sequence Motifs, Information-Theoretic Sequence Feature Extraction, and K-Mers
(Springer Verlag, 2017) Yousef, Malik; Allmer, Jens; Nigatu, Dawit; Levy, Dalit; Allmer, Jens; Henkel, Werner; 04.03. Department of Molecular Biology and Genetics; 04. Faculty of Science; 01. Izmir Institute of Technology
Background: Diseases like cancer can manifest themselves through changes in protein abundance, and microRNAs (miRNAs) play a key role in the modulation of protein quantity. MicroRNAs are used throughout all kingdoms and have been shown to be exploited by viruses to modulate their host environment. Since the experimental detection of miRNAs is difficult, computational methods have been developed. Many such tools employ machine learning for pre-miRNA detection, and many features for miRNA parameterization have been proposed. To train machine learning models, negative data is of importance yet hard to come by; therefore, we recently started to employ pre-miRNAs from one species as positive data versus another species’ pre-miRNAs as negative examples based on sequence motifs and k-mers. Here, we introduce the additional usage of information-theoretic (IT) features. Results: Pre-miRNAs from one species were used as positive and another species’ pre-miRNAs as negative training data for machine learning. The categorization capability of IT and k-mer features was investigated. Both feature sets and their combinations yielded a very high accuracy, which is as good as the previously suggested sequence motif and k-mer based method. However, for obtaining a high performance, a sufficiently large phylogenetic distance between the species and sufficiently high number of pre-miRNAs in the training set is required. To examine the contribution of the IT and k-mer features, an information gain-based feature ranking was performed. Although the top 3 are IT features, 80% of the top 100 features are k-mers. The comparison of all three individual approaches (motifs, IT, and k-mers) shows that the distinction of species based on their pre-miRNAs k-mers are sufficient. Conclusions: IT sequence feature extraction enables the distinction among species and is less computationally expensive than motif calculations. However, since IT features need larger amounts of data to have enough statistics for producing highly accurate results, future categorization into species can be effectively done using k-mers only. The biological reasoning for this is the existence of a codon bias between species which can, at least, be observed in exonic miRNAs. Future work in this direction will be the ab initio detection of pre-miRNA. In addition, prediction of pre-miRNA from RNA-seq can be done.
Citation - WoS: 37
Citation - Scopus: 45
On the Performance of Pre-Microrna Detection Algorithms
(Nature Publishing Group, 2017) Saçar Demirci, Müşerref Duygu; Baumbach, Jan; Allmer, Jens; 04.03. Department of Molecular Biology and Genetics; 04. Faculty of Science; 01. Izmir Institute of Technology
MicroRNAs are crucial for post-transcriptional gene regulation, and their dysregulation has been associated with diseases like cancer and, therefore, their analysis has become popular. The experimental discovery of miRNAs is cumbersome and, thus, many computational tools have been proposed. Here we assess 13 ab initio pre-miRNA detection approaches using all relevant, published, and novel data sets while judging algorithm performance based on ten intrinsic performance measures. We present an extensible framework, izMiR, which allows for the unbiased comparison of existing algorithms, adding new ones, and combining multiple approaches into ensemble methods. In an exhaustive attempt, we condense the results of millions of computations and show that no method is clearly superior; however, we provide a guideline for biomedical researchers to select a tool. Finally, we demonstrate that combining all of the methods into one ensemble approach, for the first time, allows reliable purely computational pre-miRNA detection in large eukaryotic genomes.
Citation - WoS: 20
Citation - Scopus: 25
Microrna Categorization Using Sequence Motifs and K-Mers
(BioMed Central Ltd., 2017) Yousef, Malik; Khalifa, Waleed; Allmer, Jens; Acar, İlhan Erkin; 01. Izmir Institute of Technology; 04.03. Department of Molecular Biology and Genetics; 04. Faculty of Science
Background: Post-transcriptional gene dysregulation can be a hallmark of diseases like cancer and microRNAs (miRNAs) play a key role in the modulation of translation efficiency. Known pre-miRNAs are listed in miRBase, and they have been discovered in a variety of organisms ranging from viruses and microbes to eukaryotic organisms. The computational detection of pre-miRNAs is of great interest, and such approaches usually employ machine learning to discriminate between miRNAs and other sequences. Many features have been proposed describing pre-miRNAs, and we have previously introduced the use of sequence motifs and k-mers as useful ones. There have been reports of xeno-miRNAs detected via next generation sequencing. However, they may be contaminations and to aid that important decision-making process, we aimed to establish a means to differentiate pre-miRNAs from different species. Results: To achieve distinction into species, we used one species' pre-miRNAs as the positive and another species' pre-miRNAs as the negative training and test data for the establishment of machine learned models based on sequence motifs and k-mers as features. This approach resulted in higher accuracy values between distantly related species while species with closer relation produced lower accuracy values. Conclusions: We were able to differentiate among species with increasing success when the evolutionary distance increases. This conclusion is supported by previous reports of fast evolutionary changes in miRNAs since even in relatively closely related species a fairly good discrimination was possible.
Citation - WoS: 14
Citation - Scopus: 12
The Impact of Feature Selection on One and Two-Class Classification Performance for Plant Micrornas
(PeerJ Inc., 2016) Khalifa, Waleed; Yousef, Malik; Allmer, Jens; Allmer, Jens; 04.03. Department of Molecular Biology and Genetics; 04. Faculty of Science; 01. Izmir Institute of Technology
MicroRNAs (miRNAs) are short nucleotide sequences that form a typical hairpin structure which is recognized by a complex enzyme machinery. It ultimately leads to the incorporation of 18-24 nt long mature miRNAs into RISC where they act as recognition keys to aid in regulation of target mRNAs. It is involved to determine miRNAs experimentally and, therefore, machine learning is used to complement such endeavors. The success of machine learning mostly depends on proper input data and appropriate features for parameterization of the data. Although, in general, two-class classification (TCC) is used in the field; because negative examples are hard to come by, one-class classification (OCC) has been tried for pre-miRNA detection. Since both positive and negative examples are currently somewhat limited, feature selection can prove to be vital for furthering the field of pre-miRNA detection. In this study, we compare the performance of OCC and TCC using eight feature selection methods and seven different plant species providing positive pre-miRNA examples. Feature selection was very successful for OCC where the best feature selection method achieved an average accuracy of 95.6%, thereby being ~29% better than the worst method which achieved 66.9% accuracy. While the performance is comparable to TCC, which performs up to 3% better than OCC, TCC is much less affected by feature selection and its largest performance gap is ~13% which only occurs for two of the feature selection methodologies. We conclude that feature selection is crucially important for OCC and that it can perform on par with TCC given the proper set of features.
Citation - Scopus: 13
Feature Selection for Microrna Target Prediction Comparison of One-Class Feature Selection Methodologies
(Hindawi Publishing Corporation, 2016) Yousef, Malik; Allmer, Jens; Khalifa, Waleed; 04.03. Department of Molecular Biology and Genetics; 04. Faculty of Science; 01. Izmir Institute of Technology
Traditionally, machine learning algorithms build classification models from positive and negative examples. Recently, one-class classification (OCC) receives increasing attention in machine learning for problems where the negative class cannot be defined unambiguously. This is specifically problematic in bioinformatics since for some important biological problems the target class (positive class) is easy to obtain while the negative one cannot be measured. Artificially generating the negative class data can be based on unreliable assumptions. Several studies have applied two-class machine learning to predict microRNAs (miRNAs) and their target. Different approaches for the generation of an artificial negative class have been applied, but may lead to a biased performance estimate. Feature selection has been well studied for the two-class classification problem, while fewer methods are available for feature selection in respect to OCC. In this study, we present a feature selection approach for applying one-class classification to the prediction of miRNA targets. A comparison between one-class and two-class approaches is presented to highlight that their performance are similar while one-class classification is not based on questionable artificial data for training and performance evaluation. We further show that the feature selection method we tried works to a degree, but needs improvement in the future. Perhaps it could be combined with other approaches.
Citation - Scopus: 19
Feature Selection Has a Large Impact on One-Class Classification Accuracy for Micrornas in Plants
(Hindawi Publishing Corporation, 2016) Yousef, Malik; Demirci, Müşerref Duygu Saçar; Allmer, Jens; Allmer, Jens; 04.03. Department of Molecular Biology and Genetics; 04. Faculty of Science; 01. Izmir Institute of Technology
MicroRNAs (miRNAs) are short RNA sequences involved in posttranscriptional gene regulation. Their experimental analysis is complicated and, therefore, needs to be supplemented with computational miRNA detection. Currently computational miRNA detection is mainly performed using machine learning and in particular two-class classification. For machine learning, the miRNAs need to be parametrized and more than 700 features have been described. Positive training examples for machine learning are readily available, but negative data is hard to come by. Therefore, it seems prerogative to use one-class classification instead of two-class classification. Previously, we were able to almost reach two-class classification accuracy using one-class classifiers. In this work, we employ feature selection procedures in conjunction with one-class classification and show that there is up to 36% difference in accuracy among these feature selection methods. The best feature set allowed the training of a one-class classifier which achieved an average accuracy of 95.6% thereby outperforming previous two-class-based plant miRNA detection approaches by about 0.5%. We believe that this can be improved upon in the future by rigorous filtering of the positive training examples and by improving current feature clustering algorithms to better target pre-miRNA feature selection.
Citation - WoS: 30
Machine Learning Methods for Microrna Gene Prediction
(Humana Press, 2014) Saçar, Müşerref Duygu; Allmer, Jens; Allmer, Jens; 04.03. Department of Molecular Biology and Genetics; 04. Faculty of Science; 01. Izmir Institute of Technology
MicroRNAs (miRNAs) are single-stranded, small, noncoding RNAs of about 22 nucleotides in length, which control gene expression at the posttranscriptional level through translational inhibition, degradation, adenylation, or destabilization of their target mRNAs. Although hundreds of miRNAs have been identified in various species, many more may still remain unknown. Therefore, discovery of new miRNA genes is an important step for understanding miRNA-mediated posttranscriptional regulation mechanisms. It seems that biological approaches to identify miRNA genes might be limited in their ability to detect rare miRNAs and are further limited to the tissues examined and the developmental stage of the organism under examination. These limitations have led to the development of sophisticated computational approaches attempting to identify possible miRNAs in silico. In this chapter, we discuss computational problems in miRNA prediction studies and review some of the many machine learning methods that have been tried to address the issues.
Citation - Scopus: 19
Data Mining for Microrna Gene Prediction: on the Impact of Class Imbalance and Feature Number for Microrna Gene Prediction
(Institute of Electrical and Electronics Engineers Inc., 2013) Saçar, Müşerref Duygu; Allmer, Jens; Allmer, Jens; 04.03. Department of Molecular Biology and Genetics; 04. Faculty of Science; 01. Izmir Institute of Technology
MicroRNAs (miRNAs) are small, non-coding RNAs which are involved in the posttranscriptional modulation of gene expression. Their short (18-24) single stranded mature sequences are involved in targeting specific genes. It turns out that experimental methods are limited and that it is difficult, if not impossible, to establish all miRNAs and their targets experimentally. Therefore, many tools for the prediction of miRNA genes and miRNA targets have been proposed. Most of these tools are based on machine learning methods and within that area mostly two-class classification is employed. Unfortunately, truly negative data is impossible to attain and only approximations of negative data are currently available. Also, we recently showed that the available positive data is not flawless. Here we investigate the impact of class imbalance on the learner accuracy and find that there is a difference of up to 50% between the best and worst precision and recall values. In addition, we looked at increasing number of features and found a curve maximizing at 0.97 recall and 0.91 precision with quickly decaying performance after inclusion of more than 100 features. © 2013 IEEE.

Molecular Biology and Genetics / Moleküler Biyoloji ve Genetik

Browse

Filters

Settings

Sort By

Results per page

Search Results