Geodesic Distances for Web Document Clustering

Open Access Color

Green Open Access

Yes

Publicly Funded

No
Impulse
Average
Influence
Average
Popularity
Average

Abstract

While traditional distance measures are often capable of properly describing similarity between objects, in some application areas there is still potential to fine-tune these measures with additional information provided in the data sets. In this work we combine such traditional distance measures for document analysis with link information between documents to improve clustering results. In particular, we test the effectiveness of geodesic distances as similarity measures under the space assumption of spherical geometry in a 0-sphere. Our proposed distance measure is thus a combination of the cosine distance of the term-document matrix and some curvature values in the geodesic distance formula. To estimate these curvature values, we calculate clustering coefficient values for every document from the link graph of the data set and increase their distinctiveness by means of a heuristic as these clustering coefficient values are rough estimates of the curvatures. To evaluate our work, we perform clustering tests with the k-means algorithm on the English Wikipedia hyperlinked data set with both traditional cosine distance and our proposed geodesic distance. The effectiveness of our approach is measured by computing micro-precision values of the clusters based on the provided categorical information of each article. © 2011 IEEE.
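
The building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the undirected toy link graph, and the micro-precision definition used here (the fraction of documents that carry the majority category of their cluster) are assumptions made for illustration. The geodesic distance formula itself is not reproduced, because the abstract does not state it.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: an all-zero vector is treated as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def clustering_coefficient(adj, node):
    """Local clustering coefficient of `node` in an undirected link graph.

    `adj` maps each node to the set of its neighbours (hypothetical layout).
    """
    neighbours = adj[node] - {node}
    k = len(neighbours)
    if k < 2:
        return 0.0
    # Count each link between two neighbours once.
    links = sum(1 for a in neighbours for b in neighbours
                if a < b and b in adj[a])
    return 2.0 * links / (k * (k - 1))


def micro_precision(categories, clusters):
    """Share of documents whose category is the majority one in their cluster."""
    per_cluster = {}
    for category, cluster in zip(categories, clusters):
        per_cluster.setdefault(cluster, Counter())[category] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(categories)


# Toy data, invented for this example only.
links = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
coefficients = {n: clustering_coefficient(links, n) for n in links}
score = micro_precision(["a", "a", "b", "b", "b"], [0, 0, 0, 1, 1])
```

In this sketch the per-document clustering coefficients would play the role of the rough curvature estimates described above, and `micro_precision` would score a k-means assignment against the article categories.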

Description

Symposium Series on Computational Intelligence, IEEE SSCI2011 - 2011 IEEE Symposium on Computational Intelligence and Data Mining, CIDM 2011; Paris; France; 11 April 2011 through 15 April 2011

Keywords

Cluster analysis, Geodesic distances, Wikipedia, User interfaces, Web document clustering, info:eu-repo/classification/ddc/004

Fields of Science

0202 electrical engineering, electronic engineering, information engineering; 02 engineering and technology

Citation

Tekir, S., Mansmann, F., and Keim, D. (2011, April 11-15). Geodesic distances for web document clustering. Paper presented at the IEEE Symposium on Computational Intelligence and Data Mining, CIDM 2011. doi:10.1109/CIDM.2011.5949449

Start Page

15

End Page

21
PlumX Metrics
Citations

Scopus : 6

Captures

Mendeley Readers : 3

OpenAlex FWCI
0.93196449
