The HMRF-KMeans algorithm does not satisfy some requirements for document clustering, including a data representation that preserves the sequential relationship between words in documents and meaningful cluster descriptions. These requirements can be satisfied by processing documents in a way that preserves their sequential aspect. The HMRF-KMeans algorithm is developed for document clustering by using an n-gram representation, dimension reduction, the cosine distance measure, and cluster labelling by the IGm method. Experimental results show that the difference in clustering accuracy among the 1-gram, 2-gram, 3-gram, and 4-gram representations is not significant, because 1-grams dominate the other n-grams. Clustering with constraints yields more accurate clusters than clustering without constraints, and cluster labels from the 1-gram representation have the best quality, followed by the 2-gram, 3-gram, and 4-gram representations. With the HMRF-KMeans algorithm, the 1-gram representation provides the most accurate clusters and the best cluster-label quality among the n-gram representations.
Keywords: constraint, document clustering, HMRF-KMeans algorithm, n-gram
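As an illustration of the n-gram representation and the cosine distance measure named in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes word-level n-grams, raw counts as term weights, and whitespace tokenization; the abstract specifies none of these, and the function names `word_ngrams`, `ngram_vector`, and `cosine_distance` are hypothetical.

```python
from collections import Counter
from math import sqrt


def word_ngrams(text, n):
    """Return the word n-grams of text as space-joined strings (assumed tokenization)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_vector(text, max_n=2):
    """Build a sparse count vector over all word n-grams for n = 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(text, n))
    return counts


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # Treat an empty document as maximally distant (an assumption).
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


doc_a = ngram_vector("semi supervised document clustering")
doc_b = ngram_vector("document clustering with constraints")
distance = cosine_distance(doc_a, doc_b)
```

Because every 2-gram, 3-gram, or 4-gram match implies matches on its constituent 1-grams, single-word counts contribute heavily to the dot product in a combined vector like this one, which is consistent with the reported dominance of 1-grams.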