Automatic Identification of Highly Conserved Family Regions and Relationships in Genome Wide Datasets Including Remote Protein Sequences

Doğan, Tunca; Karaçalı, Bilge

doi:10.1371/journal.pone.0075458

Files

5277.PDF (2.47 MB)

Date

2013

Authors

Doğan, Tunca

Karaçalı, Bilge

Publisher

Public Library of Science

Open Access Color

GOLD

Green Open Access

Yes

Publicly Funded

No

Impulse

Average

Influence

Average

Popularity

Top 10%

Abstract

Identifying shared sequence segments along amino acid sequences generally requires a collection of closely related proteins, most often curated manually from the sequence datasets to suit the purpose at hand. Existing statistical methods struggle, however, when the collection contains remote sequences that align poorly with the rest, or sequences containing multiple domains. In this paper, we propose a completely unsupervised and automated method to identify the shared sequence segments observed in a diverse collection of protein sequences, including segments present in only a small fraction of the sequences in the collection, using a combination of sequence alignment, residue conservation scoring and graph-theoretical approaches. Since shared sequence fragments often imply conserved functional or structural attributes, the method produces a table of associations between the sequences and the identified conserved regions that can reveal previously unknown protein families as well as new members of existing ones. We evaluated the biological relevance of the method by clustering the proteins in gold standard datasets and comparing the clustering performance with that of previous methods from the literature. We then applied the proposed method to a genome wide dataset of 17793 human proteins and generated a global map of associations between the proteins and each of the 4753 identified conserved regions. Investigation of the major conserved regions revealed that they corresponded strongly to annotated structural domains. This suggests that the method can be useful in predicting novel domains on protein sequences.
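
To make the ideas of residue conservation scoring and a sequence-to-region association table more concrete, here is a minimal Python sketch. It is not the authors' algorithm: the entropy-based score, the threshold values, the toy alignment, and all function names are assumptions chosen for illustration only, and the graph-theoretical step mentioned in the abstract is left out.

```python
import math
from collections import Counter

# Illustrative sketch only: the scoring rule and thresholds below are
# assumptions, not the method described in the paper.

def column_conservation(alignment, gap="-"):
    """Score each alignment column as 1 - normalized Shannon entropy,
    computed over the non-gap residues only. Columns with fewer than
    two residues get a score of 0."""
    max_entropy = math.log2(20)  # 20 standard amino acids
    scores = []
    for j in range(len(alignment[0])):
        residues = [seq[j] for seq in alignment if seq[j] != gap]
        if len(residues) < 2:
            scores.append(0.0)
            continue
        total = len(residues)
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in Counter(residues).values())
        scores.append(1.0 - entropy / max_entropy)
    return scores

def conserved_regions(scores, threshold=0.8, min_len=3):
    """Return half-open (start, end) column ranges where the score stays
    at or above the threshold for at least min_len columns."""
    regions, start = [], None
    for j, s in enumerate(scores):
        if s >= threshold:
            if start is None:
                start = j
        else:
            if start is not None and j - start >= min_len:
                regions.append((start, j))
            start = None
    if start is not None and len(scores) - start >= min_len:
        regions.append((start, len(scores)))
    return regions

def association_table(names, alignment, regions, gap="-"):
    """Map each sequence name to a list of booleans, one per region,
    marking whether the sequence is ungapped across that region."""
    return {
        name: [all(seq[j] != gap for j in range(a, b)) for a, b in regions]
        for name, seq in zip(names, alignment)
    }

# Toy pre-aligned sequences (hypothetical data): one region is shared by
# a, b and c, the other only by c and d.
names = ["a", "b", "c", "d"]
alignment = [
    "MKVLQ-----",
    "MKVLE-----",
    "MKVLDWWHHC",
    "-----WWHHC",
]
regions = conserved_regions(column_conservation(alignment))
table = association_table(names, alignment, regions)
```

Because the score is computed over the non-gap residues of each column, a segment shared by only two of the four sequences can still be reported as a region, which mirrors the abstract's emphasis on segments present in a small fraction of the collection.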

Fields of Science

0301 basic medicine, 0303 health sciences, 03 medical and health sciences

Citation

Doğan, T., and Karaçalı, B. (2013). Automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences. PLoS One, 8(9). doi:10.1371/journal.pone.0075458

WoS Q

Q2

Scopus Q

Q1

OpenCitations Citation Count

8

Source

PLoS One

Volume

8

Issue

9

URI

http://doi.org/10.1371/journal.pone.0075458
https://hdl.handle.net/11147/5277

Collections

Electrical - Electronic Engineering / Elektrik - Elektronik Mühendisliği
PubMed İndeksli Yayınlar Koleksiyonu / PubMed Indexed Publications Collection
Scopus İndeksli Yayınlar Koleksiyonu / Scopus Indexed Publications Collection
WoS İndeksli Yayınlar Koleksiyonu / WoS Indexed Publications Collection

PlumX Metrics

Citations

CrossRef : 1

Scopus : 9

PubMed : 4

Captures

Mendeley Readers : 19

SCOPUS™ Citations

9

checked on Jun 12, 2026

Web of Science™ Citations

7

checked on Jun 12, 2026

Page Views

1115

checked on Jun 12, 2026

Downloads

487

checked on Jun 12, 2026

OpenAlex FWCI

0.14178895

Sustainable Development Goals

9

Industry, Innovation and Infrastructure