Comparison of Statistical and Linguistic Feature in K-Nearest Neighbors (KNN) & Neural Network Algorithms for SMS Spam Classification

Muhammad Raihan Rizal; Muhammad Fadhli; Yasir Muin

doi:10.11594/nstp.2025.4812

Authors

Muhammad Raihan Rizal, Department of Informatics, Faculty of Engineering, Khairun University, Ternate, North Maluku, 97719, Indonesia
Muhammad Fadhli, Department of Informatics, Faculty of Engineering, Khairun University, Ternate, North Maluku, 97719, Indonesia
Yasir Muin, Department of Informatics, Faculty of Engineering, Khairun University, Ternate, North Maluku, 97719, Indonesia

DOI:

https://doi.org/10.11594/nstp.2025.4812

Keywords:

Linguistic features, neural network, SMS Spam

Abstract

SMS remains widely used, but spam SMS has become a serious problem. According to the 2020 Truecaller Insights Report, Indonesia recorded the highest number of spam messages in Asia, with a significant contribution from the financial services sector. This study compares the influence of statistical and linguistic features on SMS spam classification using the K-Nearest Neighbors (KNN) and Neural Network (NN) algorithms. The methodology comprises problem identification, planning, modeling, model evaluation, model implementation, and testing. The data are represented with statistical features (TF-IDF) and with linguistic features before being passed to the KNN and NN models, and model performance is evaluated using precision, recall, F1-score, and accuracy. The NN model with statistical features achieves an accuracy of 98%, KNN with statistical features 95%, NN with linguistic features 85%, and KNN with linguistic features 82%. NN thus outperforms KNN on both feature types. From this evaluation, the study concludes that statistical features are more effective than linguistic features and that the NN method is superior to the KNN method.
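To make the statistical-feature side of the comparison concrete, the following is a minimal sketch of TF-IDF weighting, a KNN vote over cosine similarity, and the four reported metrics. It is not the authors' implementation: the toy messages, the smoothed IDF formula, the choice of k = 3, and all function names are assumptions made for illustration, and the NN model and the linguistic features are left out because the abstract does not specify them.

```python
import math
from collections import Counter

# Toy messages invented for illustration only; not taken from the study's dataset.
TRAIN = [
    ("win a free prize now", "spam"),
    ("free loan offer click now", "spam"),
    ("claim your free cash prize", "spam"),
    ("are we meeting for lunch today", "ham"),
    ("please call me when you arrive", "ham"),
    ("see you at the office today", "ham"),
]
TEST = [("free prize click now", "spam"), ("call me about lunch", "ham")]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every term seen in training."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf(text, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(text, train_vecs, idf, k=3):
    """Majority label among the k training messages most similar to `text`."""
    query = tfidf(text, idf)
    ranked = sorted(train_vecs, key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def evaluate(y_true, y_pred, positive="spam"):
    """Precision, recall, and F1 for the positive class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


idf = fit_idf([text for text, _ in TRAIN])
train_vecs = [(tfidf(text, idf), label) for text, label in TRAIN]
predictions = [knn_predict(text, train_vecs, idf) for text, _ in TEST]
scores = evaluate([label for _, label in TEST], predictions)
print(predictions, scores)
```

With a real corpus, the same `evaluate` function could be applied to the predictions of each model and feature combination so that the four configurations are compared on identical metrics.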

References

Abayomi-Alli, O., Misra, S., & Abayomi-Alli, A. (2022). A deep learning method for automatic SMS spam classification: Performance of learning algorithms on indigenous dataset. Concurrency and Computation: Practice and Experience, 11(1).

Al-Jumaili, A. S. A., & Tayyeh, H. K. (2020). A hybrid method of linguistic and statistical features for Arabic sentiment analysis. Baghdad Science Journal, 17(1), 385–390. https://doi.org/10.21123/BSJ.2020.17.1(SUPPL.).0385

Dwiyansaputra, R., Nugraha, G. S., Bimantoro, F., & Aranta, A. (2021). Indonesian SMS spam detection using TF-IDF and stochastic gradient descent. Jurnal Teknologi Informasi, Komputer, dan Aplikasi, 3(2), 200–207.

Firmansyah, M. R., Ilyas, R., & Kasyidi, F. (2020). Scientific sentence classification using recurrent neural network. Proceedings of The 11th Industrial Research Workshop and National Seminar, 11(1).

Giri, S., Das, S., Das, S. B., & Banerjee, S. (2023). SMS spam classification–Simple deep learning models with higher accuracy using BUNOW and GloVe word embedding. Journal of Applied Science and Engineering, 26(10).

Herwanto, H., Chusna, N. L., & Arif, M. S. (2021). Classification of spam SMS in Indonesian using Naïve Bayes multinomial algorithm. Jurnal Media Informatika Budidarma, 5(4), 1316–1325. https://doi.org/10.30865/mib.v5i4.3119

Laksono, E. P., Basuki, A., & Bachtiar, F. A. (2020). K-value optimization on KNN algorithm for spam and email ham classification. Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi), 4(2), 377–383.

Ramadhan, A., Lindawati, L., & Rose, M. M. (2023). Comparison of neural network and K-nearest neighbor algorithms in detecting Android malware. Building of Informatics Technology and Science (BITS), 5(1).

Reviantika, F., Azhar, Y., & Marthasari, G. I. (2021). SMS spam classification analysis using logistic regression. Jurnal Sistem Cerdas, 4(3).

Rumlaklak, N. D., Fanggidae, A., & Polly, Y. T. (2022). Zone status classification in Kupang City using Naïve Bayes classifier algorithm. Jurnal Komputer dan Informatika, 10(1).

Runimeirati, M., Muis, A., & Muhammad, F. (2023). Text mining training using the Python programming language. Abdimas Langkanae, 3(1).

Setifani, N. A., Fitriana, D. N., & Yusuf, A. (2020). Comparison of Naïve Bayes, SVM, and decision tree algorithms for SMS spam classification. Jurnal Sistem Informasi Musirawas, 5(2).

Truecaller. (2020). Truecaller Insights Report 2020. https://www.truecaller.com/blog/insights/truecaller-insights-2020

Truecaller. (2021). Truecaller Insights Report 2021. https://www.truecaller.com/blog/insights/truecaller-insights-2021

Wahid, A., Baharulloh, M., Kahfiansyah, R., Abrilianto, T., Saifudin, A., & Mulyati, S. (2021). SMS spam identification using Naïve Bayes method. Jurnal Informatika Universitas Pamulang, 6(3).

Wijaya, D. P., Murti, L. D., & Rachman, M. R. (2022). Precision and recall in the Online Public Access Catalog (OPAC) of the Archives and Library Office of Bandung City. VISI PUSTAKA: Bulletin of the Interlibrary Information Network, 24(1).

Zhong, Z., Liu, Z., Tegmark, M., & Andreas, J. (2023). The clock and the pizza: Two stories in mechanistic explanation of neural networks. Advances in Neural Information Processing Systems, 36.