A STUDY OF N-GRAM REPRESENTATION IN THE HMRF-KMEANS ALGORITHM FOR DOCUMENT CLUSTERING

The HMRF-KMeans algorithm does not satisfy some requirements for document clustering, including a data representation that preserves the sequential relationship between words in documents and meaningful cluster descriptions. These requirements can be met by processing documents in a way that preserves their sequential aspect. The development of the HMRF-KMeans algorithm for document clustering includes using an n-gram representation, dimension reduction, the cosine distance measure, and cluster labelling by the IGm method. Experimental results show that the difference in clustering accuracy among the 1-gram, 2-gram, 3-gram, and 4-gram representations is not significant, because 1-grams dominate the other n-grams. Clustering with constraints produces more accurate clusters than clustering without constraints, and cluster labels from the 1-gram representation have the best quality, followed by the 2-gram, 3-gram, and 4-gram representations. In the HMRF-KMeans algorithm, the 1-gram representation provides the most accurate clusters and the best-quality cluster labels among the n-gram representations tested.
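As a rough illustration of two of the ingredients named above, the sketch below builds word n-gram count vectors and compares two documents with the cosine distance. It is not the implementation used in the study. The function names, the whitespace tokenizer, and the choice to let an n-gram representation also include all shorter n-grams are assumptions made for this example. Dimension reduction, constraints, and IGm labelling are left out.

```python
from collections import Counter
from math import sqrt


def ngrams(text, n):
    """Return the word n-grams of text in order (hypothetical helper)."""
    words = text.lower().split()  # assumption: simple whitespace tokenizer
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_vector(text, max_n):
    """Count all 1-grams up to max_n-grams in one sparse vector.

    Assumption: a k-gram representation also keeps the shorter n-grams.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(text, n))
    return counts


def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b) between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for empty documents
    return 1.0 - dot / (norm_a * norm_b)


doc_a = "document clustering with constraints"
doc_b = "constraints with clustering document"

# Same words in a different order: the 1-gram vectors are identical,
# while adding 2-grams makes the word order visible to the distance.
print(cosine_distance(ngram_vector(doc_a, 1), ngram_vector(doc_b, 1)))
print(cosine_distance(ngram_vector(doc_a, 2), ngram_vector(doc_b, 2)))
```

In this toy pair the two documents share every word, so only the representation that includes 2-grams can tell them apart. Because the shared 1-gram counts remain in the vector, they still account for most of the similarity.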

Keywords: constraint, document clustering, HMRF-KMeans algorithm, n-gram



3 thoughts on “A STUDY OF N-GRAM REPRESENTATION IN THE HMRF-KMEANS ALGORITHM FOR DOCUMENT CLUSTERING”

  1. Excuse me, ma'am. I am Nita Anissa, an undergraduate informatics student at ITTelkom Bandung. I am trying to implement HMRF K-Means semi-supervised clustering, and I would like to ask how to compute the cosine distance parameterized by a matrix: what kind of matrix is A? Thank you very much in advance.


  2. I am Edi. I would like to ask which algorithm Google uses, and which algorithm its PageRank uses. Does it perform better than the algorithm of Baidu (China)? In your opinion, is there a search engine system that could outperform the PageRank system?
