ABSTRACT: Many clustering techniques have been developed for text documents. A common approach uses a single-word item representation, treating a text document as a "bag of words", that is, as an unordered collection of words. This representation ignores the order of words and sentences, because each word is treated as standing alone with no relationship to the others, and this leads to inaccurate labels for the resulting clusters.
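As a small illustration of why word order matters, the sketch below builds a bag-of-words representation for two short texts. The function name `bag_of_words` and the two sample sentences are assumptions made for this example and do not come from the work itself.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts for a text, ignoring word order."""
    return Counter(text.lower().split())


# Two hypothetical sentences with the same words in a different order.
doc_a = "dog bites man"
doc_b = "man bites dog"

# The bags are identical even though the sentences mean different things.
same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)

# Comparing the word sequences keeps the order and tells them apart.
same_sequence = doc_a.split() == doc_b.split()
```

Here `same_bag` is true while `same_sequence` is false, which is the loss of ordering information that the bag-of-words representation implies.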
These problems can be addressed with Clustering Based On Frequent Word Sequences (CFWS). High dimensionality is handled by removing terms that are not frequent. Clusters are labeled by tracing the "word sequences" in each document.
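The two ideas above, dropping infrequent terms and looking for word sequences shared across documents, can be sketched roughly as follows. This is a simplified illustration and not the CFWS implementation: it only considers sequences of two adjacent words in the reduced documents, and the helper names, the sample documents, and the `min_support` threshold (a minimum number of documents) are all assumptions made for this example.

```python
from collections import Counter


def document_frequency(docs):
    """Count in how many documents each word occurs."""
    df = Counter()
    for words in docs:
        df.update(set(words))
    return df


def reduce_documents(docs, min_support):
    """Drop every word that occurs in fewer than min_support documents."""
    df = document_frequency(docs)
    return [[w for w in words if df[w] >= min_support] for words in docs]


def frequent_pairs(docs, min_support):
    """Return ordered adjacent word pairs found in at least min_support documents."""
    counts = Counter()
    for words in docs:
        # Use a set so a pair is counted once per document.
        counts.update(set(zip(words, words[1:])))
    return {pair for pair, n in counts.items() if n >= min_support}


# Hypothetical toy corpus, tokenized by whitespace.
raw = [
    "young boys like to play basketball",
    "half of young boys play football",
    "almost all boys play basketball",
]
docs = [text.split() for text in raw]

# Step 1: remove infrequent words, which shrinks the vocabulary.
reduced = reduce_documents(docs, min_support=2)

# Step 2: collect ordered word pairs shared by enough documents.
pairs = frequent_pairs(reduced, min_support=2)
```

In this toy corpus the vocabulary shrinks to four words, and the surviving ordered pairs such as `("boys", "play")` could serve as candidate cluster labels, since they preserve the word order that a bag of words discards.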
The clustering result of this algorithm is visualized hierarchically as a tree. In the experiments, the clusters generated by the CFWS algorithm have descriptions that are representative of the news content.
Keywords: clustering, frequent word sequences, CFWS, F-Measure, purity.