Automatic Identification of Highly Conserved Family Regions and Relationships in Genome Wide Datasets Including Remote Protein Sequences

Doğan, Tunca; Karaçalı, Bilge

doi:10.1371/journal.pone.0075458

dc.contributor.author	Doğan, Tunca
dc.contributor.author	Karaçalı, Bilge
dc.coverage.doi	10.1371/journal.pone.0075458
dc.date.accessioned	2017-04-10T12:55:50Z
dc.date.available	2017-04-10T12:55:50Z
dc.date.issued	2013
dc.description.abstract	Identifying shared sequence segments along amino acid sequences generally requires a collection of closely related proteins, most often curated manually from the sequence datasets to suit the purpose at hand. Currently developed statistical methods are strained, however, when the collection contains remote sequences with poor alignment to the rest, or sequences containing multiple domains. In this paper, we propose a completely unsupervised and automated method to identify the shared sequence segments observed in a diverse collection of protein sequences including those present in a smaller fraction of the sequences in the collection, using a combination of sequence alignment, residue conservation scoring and graph-theoretical approaches. Since shared sequence fragments often imply conserved functional or structural attributes, the method produces a table of associations between the sequences and the identified conserved regions that can reveal previously unknown protein families as well as new members to existing ones. We evaluated the biological relevance of the method by clustering the proteins in gold standard datasets and assessing the clustering performance in comparison with previous methods from the literature. We have then applied the proposed method to a genome wide dataset of 17793 human proteins and generated a global association map to each of the 4753 identified conserved regions. Investigations on the major conserved regions revealed that they corresponded strongly to annotated structural domains. This suggests that the method can be useful in predicting novel domains on protein sequences.	en_US
dc.identifier.citation	Doğan, T., and Karaçalı, B. (2013). Automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences. PLoS One, 8(9). doi:10.1371/journal.pone.0075458	en_US
dc.identifier.doi	10.1371/journal.pone.0075458
dc.identifier.issn	1932-6203
dc.identifier.scopus	2-s2.0-84884176982
dc.identifier.uri	http://doi.org/10.1371/journal.pone.0075458
dc.identifier.uri	https://hdl.handle.net/11147/5277
dc.language.iso	en	en_US
dc.publisher	Public Library of Science	en_US
dc.relation.ispartof	PLoS One	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Sequence analysis	en_US
dc.subject	Proteins	en_US
dc.subject	Genome analysis	en_US
dc.subject	Genetic database	en_US
dc.subject	Receiver operating characteristic	en_US
dc.title	Automatic Identification of Highly Conserved Family Regions and Relationships in Genome Wide Datasets Including Remote Protein Sequences	en_US
dc.type	Article	en_US
dspace.entity.type	Publication
gdc.author.institutional	Doğan, Tunca
gdc.author.institutional	Karaçalı, Bilge
gdc.author.yokid	11527
gdc.bip.impulseclass	C5
gdc.bip.influenceclass	C5
gdc.bip.popularityclass	C4
gdc.coar.access	open access
gdc.coar.type	text::journal::journal article
gdc.collaboration.industrial	false
gdc.description.department	İzmir Institute of Technology. Electrical and Electronics Engineering	en_US
gdc.description.issue	9	en_US
gdc.description.publicationcategory	Article - International Refereed Journal - Institutional Faculty Member	en_US
gdc.description.scopusquality	Q1
gdc.description.volume	8	en_US
gdc.description.wosquality	Q2
gdc.identifier.openalex	W2026660542
gdc.identifier.pmid	24069417
gdc.identifier.wos	WOS:000326240100126
gdc.index.type	WoS
gdc.index.type	Scopus
gdc.index.type	PubMed
gdc.oaire.accesstype	GOLD
gdc.oaire.diamondjournal	false
gdc.oaire.impulse	1.0
gdc.oaire.influence	2.9643228E-9
gdc.oaire.isgreen	true
gdc.oaire.keywords	Science
gdc.oaire.keywords	Q
gdc.oaire.keywords	Sequence analysis
gdc.oaire.keywords	R
gdc.oaire.keywords	Genetic database
gdc.oaire.keywords	Computational Biology
gdc.oaire.keywords	Proteins
gdc.oaire.keywords	Reproducibility of Results
gdc.oaire.keywords	Receiver operating characteristic
gdc.oaire.keywords	Genome analysis
gdc.oaire.keywords	Genomics
gdc.oaire.keywords	Evolution, Molecular
gdc.oaire.keywords	ROC Curve
gdc.oaire.keywords	Medicine
gdc.oaire.keywords	Cluster Analysis
gdc.oaire.keywords	Humans
gdc.oaire.keywords	Protein Interaction Domains and Motifs
gdc.oaire.keywords	Databases, Protein
gdc.oaire.keywords	Algorithms
gdc.oaire.keywords	Conserved Sequence
gdc.oaire.keywords	Research Article
gdc.oaire.popularity	5.815974E-9
gdc.oaire.publicfunded	false
gdc.oaire.sciencefields	0301 basic medicine
gdc.oaire.sciencefields	0303 health sciences
gdc.oaire.sciencefields	03 medical and health sciences
gdc.openalex.collaboration	National
gdc.openalex.fwci	0.14178895
gdc.openalex.normalizedpercentile	0.58
gdc.opencitations.count	8
gdc.plumx.crossrefcites	1
gdc.plumx.mendeley	19
gdc.plumx.pubmedcites	4
gdc.plumx.scopuscites	9
gdc.scopus.citedcount	9
gdc.wos.citedcount	7
relation.isAuthorOfPublication.latestForDiscovery	a081f8c3-cd7b-40d5-a9ca-74707d1b4dc7
relation.isOrgUnitOfPublication.latestForDiscovery	9af2b05f-28ac-4018-8abe-a4dfe192da5e
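
The abstract in this record describes combining residue conservation scoring with graph-theoretical grouping of sequences. The following is a minimal illustrative sketch of those two ideas, not the authors' published method. It assumes a pre-computed, gapped multiple alignment, uses normalized Shannon entropy as a stand-in conservation score, and treats proteins that share any conserved region as connected. The function names, the threshold of 0.8, and the minimum region length of 3 are hypothetical choices made only for this example.

```python
import math
from collections import Counter


def column_conservation(column):
    """Score one alignment column as 1 - normalized Shannon entropy (gaps ignored)."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    # Assumption: normalize by the entropy of a uniform mix of 20 amino acids.
    return 1.0 - entropy / math.log2(20)


def conserved_regions(alignment, threshold=0.8, min_len=3):
    """Return (start, end) half-open column ranges whose scores stay >= threshold."""
    scores = [column_conservation(col) for col in zip(*alignment)]
    regions, start = [], None
    # The trailing 0.0 is a sentinel that closes a run reaching the last column.
    for i, score in enumerate(scores + [0.0]):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions


def group_by_shared_regions(associations):
    """Group sequences into connected components of the sequence-region graph.

    associations maps a sequence id to the set of region ids found on it.
    """
    region_to_seqs = {}
    for seq, regions in associations.items():
        for region in regions:
            region_to_seqs.setdefault(region, set()).add(seq)

    seen, groups = set(), []
    for seq in associations:
        if seq in seen:
            continue
        stack, component = [seq], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            for region in associations[current]:
                stack.extend(region_to_seqs[region] - component)
        seen |= component
        groups.append(component)
    return groups
```

On a toy alignment such as `["MKVLAAG", "MKVLSTG", "MKVLQ-G"]`, `conserved_regions` reports the fully conserved `MKVL` prefix as one region. `group_by_shared_regions` then turns a sequence-to-region association table into candidate families. A real pipeline would need a proper alignment step and a statistically grounded conservation score.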

Files

Original bundle

Name: 5277.PDF
Size: 2.47 MB
Format: Adobe Portable Document Format
Description: Article

Collections

Electrical - Electronic Engineering
PubMed Indexed Publications Collection
Scopus Indexed Publications Collection
WoS Indexed Publications Collection