Automatic Identification of Highly Conserved Family Regions and Relationships in Genome Wide Datasets Including Remote Protein Sequences

dc.contributor.author Doğan, Tunca
dc.contributor.author Karaçalı, Bilge
dc.coverage.doi 10.1371/journal.pone.0075458
dc.date.accessioned 2017-04-10T12:55:50Z
dc.date.available 2017-04-10T12:55:50Z
dc.date.issued 2013
dc.description.abstract Identifying shared sequence segments along amino acid sequences generally requires a collection of closely related proteins, most often curated manually from the sequence datasets to suit the purpose at hand. Currently developed statistical methods are strained, however, when the collection contains remote sequences with poor alignment to the rest, or sequences containing multiple domains. In this paper, we propose a completely unsupervised and automated method to identify the shared sequence segments observed in a diverse collection of protein sequences including those present in a smaller fraction of the sequences in the collection, using a combination of sequence alignment, residue conservation scoring and graph-theoretical approaches. Since shared sequence fragments often imply conserved functional or structural attributes, the method produces a table of associations between the sequences and the identified conserved regions that can reveal previously unknown protein families as well as new members to existing ones. We evaluated the biological relevance of the method by clustering the proteins in gold standard datasets and assessing the clustering performance in comparison with previous methods from the literature. We have then applied the proposed method to a genome wide dataset of 17793 human proteins and generated a global association map to each of the 4753 identified conserved regions. Investigations on the major conserved regions revealed that they corresponded strongly to annotated structural domains. This suggests that the method can be useful in predicting novel domains on protein sequences. en_US
dc.identifier.citation Doğan, T., and Karaçalı, B. (2013). Automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences. PLoS One, 8(9). doi:10.1371/journal.pone.0075458 en_US
dc.identifier.doi 10.1371/journal.pone.0075458
dc.identifier.issn 1932-6203
dc.identifier.scopus 2-s2.0-84884176982
dc.identifier.uri http://doi.org/10.1371/journal.pone.0075458
dc.identifier.uri https://hdl.handle.net/11147/5277
dc.language.iso en en_US
dc.publisher Public Library of Science en_US
dc.relation.ispartof PLoS One en_US
dc.rights info:eu-repo/semantics/openAccess en_US
dc.subject Sequence analysis en_US
dc.subject Proteins en_US
dc.subject Genome analysis en_US
dc.subject Genetic database en_US
dc.subject Receiver operating characteristic en_US
dc.title Automatic Identification of Highly Conserved Family Regions and Relationships in Genome Wide Datasets Including Remote Protein Sequences en_US
dc.type Article en_US
dspace.entity.type Publication
gdc.author.institutional Doğan, Tunca
gdc.author.institutional Karaçalı, Bilge
gdc.author.yokid 11527
gdc.bip.impulseclass C5
gdc.bip.influenceclass C5
gdc.bip.popularityclass C4
gdc.coar.access open access
gdc.coar.type text::journal::journal article
gdc.collaboration.industrial false
gdc.description.department İzmir Institute of Technology. Electrical and Electronics Engineering en_US
gdc.description.issue 9 en_US
gdc.description.publicationcategory Article - International Refereed Journal - Institutional Faculty Member en_US
gdc.description.scopusquality Q1
gdc.description.volume 8 en_US
gdc.description.wosquality Q2
gdc.identifier.openalex W2026660542
gdc.identifier.pmid 24069417
gdc.identifier.wos WOS:000326240100126
gdc.index.type WoS
gdc.index.type Scopus
gdc.index.type PubMed
gdc.oaire.accesstype GOLD
gdc.oaire.diamondjournal false
gdc.oaire.impulse 1.0
gdc.oaire.influence 2.9643228E-9
gdc.oaire.isgreen true
gdc.oaire.keywords Science
gdc.oaire.keywords Q
gdc.oaire.keywords Sequence analysis
gdc.oaire.keywords R
gdc.oaire.keywords Genetic database
gdc.oaire.keywords Computational Biology
gdc.oaire.keywords Proteins
gdc.oaire.keywords Reproducibility of Results
gdc.oaire.keywords Receiver operating characteristic
gdc.oaire.keywords Genome analysis
gdc.oaire.keywords Genomics
gdc.oaire.keywords Evolution, Molecular
gdc.oaire.keywords ROC Curve
gdc.oaire.keywords Medicine
gdc.oaire.keywords Cluster Analysis
gdc.oaire.keywords Humans
gdc.oaire.keywords Protein Interaction Domains and Motifs
gdc.oaire.keywords Databases, Protein
gdc.oaire.keywords Algorithms
gdc.oaire.keywords Conserved Sequence
gdc.oaire.keywords Research Article
gdc.oaire.popularity 5.815974E-9
gdc.oaire.publicfunded false
gdc.oaire.sciencefields 0301 basic medicine
gdc.oaire.sciencefields 0303 health sciences
gdc.oaire.sciencefields 03 medical and health sciences
gdc.openalex.collaboration National
gdc.openalex.fwci 0.14178895
gdc.openalex.normalizedpercentile 0.58
gdc.opencitations.count 8
gdc.plumx.crossrefcites 1
gdc.plumx.mendeley 19
gdc.plumx.pubmedcites 4
gdc.plumx.scopuscites 9
gdc.scopus.citedcount 9
gdc.wos.citedcount 7
relation.isAuthorOfPublication.latestForDiscovery a081f8c3-cd7b-40d5-a9ca-74707d1b4dc7
relation.isOrgUnitOfPublication.latestForDiscovery 9af2b05f-28ac-4018-8abe-a4dfe192da5e

Files

Original bundle

Name:
5277.PDF
Size:
2.47 MB
Format:
Adobe Portable Document Format
Description:
Article

License bundle

Name:
license.txt
Size:
1.71 KB
Format:
Item-specific license agreed to upon submission