The HMRF-KMeans algorithm does not satisfy some requirements of document clustering, including a data representation that preserves the sequential relationship between words in documents and meaningful cluster descriptions. These requirements can be satisfied by processing documents in a way that preserves their sequential aspect. The development of the HMRF-KMeans algorithm for document clustering presented here includes the use of an n-gram representation, dimension reduction, a cosine distance measure, and cluster labelling by the IGm method. Experimental results show that the difference in clustering accuracy among the 1-gram, 2-gram, 3-gram, and 4-gram representations is not significant, because 1-grams dominate the other n-grams. Clustering with constraints produces more accurate clusters than clustering without constraints, and cluster labels from the 1-gram representation have the best quality, followed by those from the 2-gram, 3-gram, and 4-gram representations. In the HMRF-KMeans algorithm, the 1-gram representation provides the most accurate clusters and the best cluster-label quality among the n-gram representations.

Keywords: constraint, document clustering, HMRF-KMeans algorithm, n-gram
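
To make the n-gram representation and the cosine distance measure mentioned in the abstract concrete, the following is a minimal Python sketch, not the implementation used in this work. It assumes whitespace tokenization, raw term counts, and that a k-gram representation also contains all shorter n-grams (one possible reading of the remark that 1-grams dominate). The function names `word_ngrams`, `ngram_vector`, and `cosine_distance` are hypothetical; dimension reduction, constraints, and IGm labelling are not shown.

```python
from collections import Counter
from math import sqrt


def word_ngrams(text, n):
    """Return the word n-grams of `text`, in order of appearance."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_vector(text, max_n=2):
    """Count all 1-grams up to max_n-grams (assumed cumulative representation)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(text, n))
    return counts


def cosine_distance(u, v):
    """Cosine distance, 1 - cos(u, v), between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention chosen here for empty documents
    return 1.0 - dot / (norm_u * norm_v)


doc_a = ngram_vector("document clustering with constraints")
doc_b = ngram_vector("clustering documents without constraints")
print(cosine_distance(doc_a, doc_b))
```

Because every document contributes more 1-grams than any longer n-gram, and 1-grams are far more likely to be shared between documents, the 1-gram terms tend to contribute most of the dot product in such a combined vector.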


  1. Excuse me, ma'am. I am Nita Anissa, an undergraduate (S1) informatics student at ITTelkom Bandung. I am trying to implement HMRF K-Means semi-supervised clustering, and I would like to ask how to compute the cosine distance that is parameterized by a matrix. What kind of matrix is matrix A? Thank you very much in advance, ma'am.


  2. I am Edi. I would like to ask which algorithm Google uses, and which algorithm its PageRank uses. Is its performance better than that of the Baidu (China) algorithm? In your opinion, is there a search engine system that can outperform the PageRank system?


