Categorization of Species Based on Their MicroRNAs Employing Sequence Motifs, Information-Theoretic Sequence Feature Extraction, and K-Mers

dc.contributor.author Yousef, Malik
dc.contributor.author Nigatu, Dawit
dc.contributor.author Levy, Dalit
dc.contributor.author Allmer, Jens
dc.contributor.author Henkel, Werner
dc.coverage.doi 10.1186/s13634-017-0506-8
dc.date.accessioned 2018-01-30T08:10:12Z
dc.date.available 2018-01-30T08:10:12Z
dc.date.issued 2017
dc.description.abstract Background: Diseases like cancer can manifest themselves through changes in protein abundance, and microRNAs (miRNAs) play a key role in the modulation of protein quantity. MicroRNAs are used throughout all kingdoms and have been shown to be exploited by viruses to modulate their host environment. Since the experimental detection of miRNAs is difficult, computational methods have been developed. Many such tools employ machine learning for pre-miRNA detection, and many features for miRNA parameterization have been proposed. Negative data are important for training machine learning models yet hard to come by; therefore, we recently started to employ pre-miRNAs from one species as positive data versus another species’ pre-miRNAs as negative examples, based on sequence motifs and k-mers. Here, we introduce the additional usage of information-theoretic (IT) features. Results: Pre-miRNAs from one species were used as positive and another species’ pre-miRNAs as negative training data for machine learning. The categorization capability of IT and k-mer features was investigated. Both feature sets and their combinations yielded a very high accuracy, which is as good as that of the previously suggested sequence motif and k-mer based method. However, obtaining a high performance requires a sufficiently large phylogenetic distance between the species and a sufficiently large number of pre-miRNAs in the training set. To examine the contribution of the IT and k-mer features, an information gain-based feature ranking was performed. Although the top three are IT features, 80% of the top 100 features are k-mers. The comparison of all three individual approaches (motifs, IT, and k-mers) shows that k-mers alone are sufficient to distinguish species based on their pre-miRNAs. Conclusions: IT sequence feature extraction enables the distinction among species and is less computationally expensive than motif calculations. However, since IT features need larger amounts of data to have enough statistics for producing highly accurate results, future categorization into species can be effectively done using k-mers only. The biological reasoning for this is the existence of a codon bias between species which can, at least, be observed in exonic miRNAs. Future work in this direction will address the ab initio detection of pre-miRNAs. In addition, pre-miRNAs could be predicted from RNA-seq data. en_US
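The k-mer features mentioned in the abstract can be illustrated with a minimal sketch. It assumes normalized frequencies of overlapping k-mers over the RNA alphabet; the function name `kmer_frequencies`, the default `k=2`, and the normalization are illustrative assumptions, not the authors' published implementation, and the abstract does not state which k-mer lengths were used.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=2, alphabet="ACGU"):
    """Return the relative frequency of every possible k-mer in a sequence.

    Hypothetical helper for illustration only; the k values and the
    normalization used in the paper are not given in the abstract.
    """
    # Treat DNA-style input as RNA so that T and U are counted together.
    seq = sequence.upper().replace("T", "U")
    allowed = set(alphabet)
    # Slide a window of width k over the sequence (overlapping k-mers).
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    # Skip windows containing ambiguous symbols such as N.
    counts = Counter(w for w in windows if set(w) <= allowed)
    total = sum(counts.values())
    # Emit a fixed-length feature vector: one entry per possible k-mer.
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product(alphabet, repeat=k)
    }
```

Computing such a vector for each pre-miRNA of two species gives fixed-length inputs that a standard classifier could be trained on, with one species labeled positive and the other negative, as the abstract describes.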
dc.description.sponsorship Scientific and Technological Research Council of Turkey (113E326); Zefat Academic College; German Research Foundation (DFG) en_US
dc.identifier.citation Yousef, M., Nigatu, D., Levy, D., Allmer, J., and Henkel, W. (2017). Categorization of species based on their microRNAs employing sequence motifs, information-theoretic sequence feature extraction, and k-mers. Eurasip Journal on Advances in Signal Processing, 2017(1). doi:10.1186/s13634-017-0506-8 en_US
dc.identifier.doi 10.1186/s13634-017-0506-8 en_US
dc.identifier.issn 1687-6180
dc.identifier.scopus 2-s2.0-85032857843
dc.identifier.uri http://doi.org/10.1186/s13634-017-0506-8
dc.identifier.uri http://hdl.handle.net/11147/6764
dc.language.iso en en_US
dc.publisher Springer Verlag en_US
dc.relation info:eu-repo/grantAgreement/TUBITAK/EEEAG/113E326 en_US
dc.relation.ispartof Eurasip Journal on Advances in Signal Processing en_US
dc.rights info:eu-repo/semantics/openAccess en_US
dc.subject Information theory en_US
dc.subject MicroRNAs en_US
dc.subject Machine learning en_US
dc.subject Sequence motifs en_US
dc.subject RNA en_US
dc.title Categorization of Species Based on Their MicroRNAs Employing Sequence Motifs, Information-Theoretic Sequence Feature Extraction, and K-Mers en_US
dc.type Article en_US
dspace.entity.type Publication
gdc.author.institutional Allmer, Jens
gdc.bip.impulseclass C4
gdc.bip.influenceclass C5
gdc.bip.popularityclass C4
gdc.coar.access open access
gdc.coar.type text::journal::journal article
gdc.collaboration.industrial false
gdc.description.department İzmir Institute of Technology. Molecular Biology and Genetics en_US
gdc.description.issue 1 en_US
gdc.description.publicationcategory Makale - Uluslararası Hakemli Dergi - Kurum Öğretim Elemanı en_US
gdc.description.scopusquality Q2
gdc.description.volume 2017 en_US
gdc.description.wosquality Q3
gdc.identifier.openalex W2760930003
gdc.identifier.wos WOS:000412913000001
gdc.index.type WoS
gdc.index.type Scopus
gdc.oaire.accesstype GOLD
gdc.oaire.diamondjournal false
gdc.oaire.impulse 8.0
gdc.oaire.influence 3.1990697E-9
gdc.oaire.isgreen true
gdc.oaire.keywords Information theory
gdc.oaire.keywords TK7800-8360
gdc.oaire.keywords k-mer
gdc.oaire.keywords MicroRNA
gdc.oaire.keywords TK5101-6720
gdc.oaire.keywords Differentiate miRNAs among species
gdc.oaire.keywords miRNA categorization
gdc.oaire.keywords MicroRNAs
gdc.oaire.keywords Sequence motifs
gdc.oaire.keywords Pre-microRNA
gdc.oaire.keywords Machine learning
gdc.oaire.keywords Telecommunication
gdc.oaire.keywords RNA
gdc.oaire.keywords Electronics
gdc.oaire.popularity 5.2167577E-9
gdc.oaire.publicfunded false
gdc.oaire.sciencefields 0301 basic medicine
gdc.oaire.sciencefields 0206 medical engineering
gdc.oaire.sciencefields 02 engineering and technology
gdc.oaire.sciencefields 03 medical and health sciences
gdc.openalex.collaboration International
gdc.openalex.fwci 1.20253239
gdc.openalex.normalizedpercentile 0.75
gdc.opencitations.count 11
gdc.plumx.crossrefcites 11
gdc.plumx.mendeley 17
gdc.plumx.scopuscites 14
gdc.scopus.citedcount 14
gdc.wos.citedcount 11
relation.isAuthorOfPublication.latestForDiscovery bf9f97a4-6d62-49cd-a7c8-1bc8463d14d2
relation.isOrgUnitOfPublication.latestForDiscovery 9af2b05f-28ac-4013-8abe-a4dfe192da5e

Files

Original bundle

Name:
6764.pdf
Size:
1.6 MB
Format:
Adobe Portable Document Format
Description:
Makale

License bundle

Name:
license.txt
Size:
1.71 KB
Format:
Item-specific license agreed upon to submission
Description: