Electrical - Electronic Engineering / Elektrik - Elektronik Mühendisliği
Permanent URI for this collection: https://hdl.handle.net/11147/11
3 results
Search Results
Conference Object
Citation - Scopus: 1
Evolutionary Relationships of Gene Sequences Under Non-Linear Embedding Techniques (Doğrusal Olmayan Gömme Teknikleri Altında Gen Dizilerinin Evrimsel İlişkileri)
(IEEE, 2010) Doğan, Tunca; Karaçalı, Bilge
We present an error analysis of non-linear embedding applied to pairwise evolutionary distances inferred over a collection of genetic sequences following multiple sequence alignment. To this end, we generated gene sequences evolved by random substitutions along three different evolutionary pathways, with known evolutionary distances between every sequence pair. We compared the discrepancy between the inferred and the true evolutionary distances before and after non-linear embedding into a low-dimensional vector space. The results indicate that non-linear embedding achieves a significant reduction in the error of the estimated evolutionary distances. Consequently, non-linear embedding of evolutionary distances can provide more reliable inferences on the evolutionary relationships between genetic sequences. ©2010 IEEE.

Conference Object
2-D Thresholding of the Connectivity Map Following the Multiple Sequence Alignments of Diverse Datasets
(ACTA Press, 2013) Doğan, Tunca; Karaçalı, Bilge
Multiple sequence alignment (MSA) is a widely used method to uncover the relationships between biomolecular sequences. An essential prerequisite for applying this procedure is a considerable amount of similarity among the test sequences. It is usually not possible to obtain reliable results from the multiple alignments of large and diverse datasets. Here we propose a method to obtain sequence clusters with significant intragroup similarities and to make sense of multiple alignments containing remote sequences. This is achieved by thresholding the pairwise connectivity map over two parameters. The first is the inferred pairwise evolutionary distance, and the second is the number of gapless positions in the pairwise comparisons of the alignment.
Threshold curves are generated from the statistical parameter values obtained from a shuffled dataset, and probability distribution techniques are employed to select an optimum threshold curve that eliminates as many of the unreliable connectivities as possible while keeping the reliable ones. We applied the method to a large and diverse dataset composed of nearly 18000 human proteins and measured the biological relevance of the recovered connectivities. Our precision (0.981) was nearly 20% higher than that of the connectivities left after a classical thresholding procedure, a significant improvement. Finally, we employed the method for the functional clustering of protein sequences in a gold standard dataset and measured its performance, obtaining a higher F-measure (0.882) than a conventional clustering operation (0.827).

Article
Citation - WoS: 7
Citation - Scopus: 9
Automatic Identification of Highly Conserved Family Regions and Relationships in Genome Wide Datasets Including Remote Protein Sequences
(Public Library of Science, 2013) Doğan, Tunca; Karaçalı, Bilge
Identifying shared sequence segments along amino acid sequences generally requires a collection of closely related proteins, most often curated manually from the sequence datasets to suit the purpose at hand. Currently developed statistical methods are strained, however, when the collection contains remote sequences with poor alignment to the rest, or sequences containing multiple domains. In this paper, we propose a completely unsupervised and automated method to identify the shared sequence segments observed in a diverse collection of protein sequences, including those present in only a smaller fraction of the sequences in the collection, using a combination of sequence alignment, residue conservation scoring and graph-theoretical approaches.
Since shared sequence fragments often imply conserved functional or structural attributes, the method produces a table of associations between the sequences and the identified conserved regions that can reveal previously unknown protein families as well as new members of existing ones. We evaluated the biological relevance of the method by clustering the proteins in gold standard datasets and comparing the clustering performance with that of previous methods from the literature. We then applied the proposed method to a genome-wide dataset of 17793 human proteins and generated a global association map to each of the 4753 identified conserved regions. Investigations of the major conserved regions revealed that they corresponded strongly to annotated structural domains. This suggests that the method can be useful in predicting novel domains in protein sequences.
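
The abstract above names residue conservation scoring as one ingredient of the method but does not say which scoring function is used. As a rough illustration only, the sketch below assumes a Shannon-entropy score per alignment column and then collects runs of consecutive high-scoring columns as candidate conserved regions. The function names, the threshold of 0.8, the minimum run length of 3, and the toy alignment are all hypothetical and are not taken from the article.

```python
import math
from collections import Counter


def column_conservation(alignment):
    """Return one score in [0, 1] per alignment column (1.0 = fully conserved).

    Assumption: conservation is 1 minus the normalised Shannon entropy of the
    residues in the column, ignoring gaps. The article may use another score.
    """
    n_cols = len(alignment[0])
    max_entropy = math.log2(20)  # 20 standard amino acids
    scores = []
    for j in range(n_cols):
        residues = [seq[j] for seq in alignment if seq[j] != "-"]
        if len(residues) < 2:
            # Too few residues to call the column conserved.
            scores.append(0.0)
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


def conserved_regions(scores, threshold=0.8, min_len=3):
    """Return half-open (start, end) column ranges whose scores stay >= threshold."""
    regions = []
    start = None
    # A sentinel below any threshold closes a run that reaches the last column.
    for j, score in enumerate(scores + [float("-inf")]):
        if score >= threshold and start is None:
            start = j
        elif score < threshold and start is not None:
            if j - start >= min_len:
                regions.append((start, j))
            start = None
    return regions


# Toy alignment, invented for illustration; "-" marks a gap.
toy_alignment = [
    "MKVLAAGT-",
    "MKVLSTGTA",
    "MKVLQ-GTC",
    "MKVLPEGTD",
]
scores = column_conservation(toy_alignment)
regions = conserved_regions(scores)
print(regions)
```

In this toy case the first four columns are identical in every sequence and form the only run long enough to be reported. Lowering `min_len` to 2 would also report the two identical columns near the end. Associating sequences with regions and the graph-theoretical steps mentioned in the abstract are outside the scope of this sketch.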
