ABSTRAKSI: Email filtering separates harmful email from harmless email. It is a Text Categorization (TC) problem with only two classes: spam and legitimate (ham/nonspam). Nonspam mail is email that does not harm the recipient. Spam (Stupid Pointless Annoying Messages), by contrast, is harmful email: besides taking up a lot of memory space on the computer, it can lead underage recipients to access sites they should not visit. One method for handling spam mail is Statistical Filtering.
One model that applies statistical filtering is the Markov Random Field (MRF), which takes into account not only words but also phrases. The relationships between words are considered, and words and phrases are formed according to the neighborhood size, using the Sparse Binary Polynomial Hashing (SBPH) technique. The resulting words and phrases are commonly called features. Each feature is weighted using the Exponential Weighting Sequences, often called Exponential Series (ES), and Minimum Weighting Sequences (MWS) weighting techniques. The MRF model is applied by considering the neighborhood relationships between adjacent features. Features that belong to the same neighborhood are called cliques.
The neighborhood sizes that yield the highest accuracy are window sizes 5 and 6 with ES weighting, which reach 86.67% at a threshold of 0.9090. The parameters measured besides accuracy are spam precision, nonspam precision, spam recall, nonspam recall, and f-measure. Based on the test results, MRF is shown to improve classification results and to produce good accuracy.
Keywords: email filtering, statistical filtering, feature, Markov Random Field (MRF), SBPH, ES, MWS, neighborhood, cliques.
ABSTRACT: Email filtering is a way to separate legitimate email from unwanted email. It is a Text Categorization (TC) problem with only two classes: spam and legitimate (ham/nonspam). Nonspam email is important email that does not harm the person who receives it, whereas spam (Stupid Pointless Annoying Messages) is unimportant, disruptive email: it takes up a lot of memory space on the computer, and it can lead children who receive it to access inappropriate content such as pornography. One method for handling spam email is Statistical Filtering. This method must first be trained on two email collections, one of spam email and one of nonspam email. For every new email, Statistical Filtering then predicts the probability that it is spam, based on the words that usually occur in the spam collection or in the legitimate/nonspam collection.
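The training step described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the whitespace tokenizer, the add-one smoothing, and the equal class priors are all assumptions made for the example.

```python
from collections import Counter

def train(spam_docs, ham_docs):
    """Count, per collection, in how many messages each word occurs."""
    spam_counts = Counter(w for d in spam_docs for w in set(d.lower().split()))
    ham_counts = Counter(w for d in ham_docs for w in set(d.lower().split()))
    return spam_counts, ham_counts, len(spam_docs), len(ham_docs)

def word_spam_probability(word, model):
    """Estimate P(spam | word), assuming equal priors and add-one smoothing."""
    spam_counts, ham_counts, n_spam, n_ham = model
    p_word_given_spam = (spam_counts[word] + 1) / (n_spam + 2)
    p_word_given_ham = (ham_counts[word] + 1) / (n_ham + 2)
    return p_word_given_spam / (p_word_given_spam + p_word_given_ham)

# Toy collections (invented for illustration only).
model = train(["buy cheap pills", "cheap offer now"],
              ["meeting at noon", "lunch offer today"])
print(word_spam_probability("cheap", model))
```

A word seen only in the spam collection gets a probability above 0.5, and a word seen equally often in both collections gets exactly 0.5.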
The Markov Random Field (MRF) is one model that applies statistical filtering, taking into account not only words but also phrases. The relations among words matter: words and phrases are formed according to the neighborhood size, using the Sparse Binary Polynomial Hashing (SBPH) method. These words and phrases are called features. Each feature is weighted using Exponential Weighting Sequences, also called Exponential Series (ES), and Minimum Weighting Sequences (MWS). The model also considers the neighborhood relations among the features in an email. Features that belong to the same neighborhood are called cliques.
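The feature formation and weighting described above can be sketched roughly as follows. The abstract gives no formulas, so the details are assumptions: the `<skip>` marker, the function names, and the rule that a feature's weight grows as `base ** (order - 1)` with the number of words it contains (base 4) are illustrative choices, not the exact scheme evaluated here.

```python
from itertools import product

SKIP = "<skip>"  # hypothetical placeholder for a skipped word

def sbph_features(tokens, window=5):
    """Yield (feature, order) pairs: every in-order combination of the words
    in a sliding window that keeps the newest word; gaps become SKIP."""
    for end in range(len(tokens)):
        start = max(0, end - window + 1)
        older, newest = tokens[start:end], tokens[end]
        for mask in product((False, True), repeat=len(older)):
            kept = [w if keep else SKIP for w, keep in zip(older, mask)]
            # Strip leading skips so a phrase reads the same at any position.
            while kept and kept[0] == SKIP:
                kept.pop(0)
            order = sum(mask) + 1  # number of real words in the feature
            yield " ".join(kept + [newest]), order

def exponential_weight(order, base=4):
    """Assumed exponential weighting: longer features count for more."""
    return base ** (order - 1)

for feature, order in sbph_features(["click", "here", "now"], window=3):
    print(feature, exponential_weight(order))
```

With a full window of size N, each new word contributes 2^(N-1) features, so the window size directly controls how many phrases, and how long, the classifier sees.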
The neighborhood sizes that give the best accuracy are 5 and 6 with ES weighting, which reach 86.67% accuracy at a threshold of 0.9090. The parameters tested besides accuracy are spam precision, nonspam precision, spam recall, nonspam recall, and f-measure. According to the experimental results, MRF is shown to improve the classification results and to give good accuracy.
Keywords: email filtering, statistical filtering, feature, Markov Random Field (MRF), SBPH, ES, MWS, neighborhood, cliques.
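The evaluation parameters named in the abstract follow the standard definitions, which can be sketched as below. The function name, the `>=` comparison against the threshold, and the choice to compute the f-measure on the spam class are assumptions; the scores and labels in the usage lines are invented and are not the experimental data.

```python
def evaluate(scores, labels, threshold=0.9090):
    """scores: predicted spam probability per email; labels: True for spam."""
    predictions = [s >= threshold for s in scores]
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p and y)          # spam caught
    fp = sum(1 for p, y in pairs if p and not y)      # nonspam flagged
    fn = sum(1 for p, y in pairs if not p and y)      # spam missed
    tn = sum(1 for p, y in pairs if not p and not y)  # nonspam passed

    def ratio(num, den):
        return num / den if den else 0.0

    spam_precision = ratio(tp, tp + fp)
    spam_recall = ratio(tp, tp + fn)
    return {
        "spam_precision": spam_precision,
        "spam_recall": spam_recall,
        "nonspam_precision": ratio(tn, tn + fn),
        "nonspam_recall": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
        "f_measure": ratio(2 * spam_precision * spam_recall,
                           spam_precision + spam_recall),
    }

# Invented example: five emails, three of them spam.
print(evaluate([0.95, 0.92, 0.50, 0.10, 0.97],
               [True, False, True, False, True]))
```

Raising the threshold generally trades spam recall for nonspam recall, which is why accuracy is reported together with the threshold at which it was measured.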