Comparison of Statistical and Linguistic Feature in K-Nearest Neighbors (KNN) & Neural Network Algorithms for SMS Spam Classification
DOI:
https://doi.org/10.11594/nstp.2025.4812
Keywords:
Linguistic features, neural network, SMS spam
Abstract
SMS remains widely used, but spam SMS has become a serious problem. According to the 2020 Truecaller Insights Report, Indonesia recorded the highest number of spam messages in Asia, with a significant contribution from the financial services sector. This study compares the effect of statistical and linguistic features on SMS spam classification using the K-Nearest Neighbors (KNN) and Neural Network (NN) algorithms. The methodology comprises problem identification, planning, modeling, model evaluation, model implementation, and testing. The data are represented with statistical features (TF-IDF) and with linguistic features before being passed to the KNN and NN models. Model performance is evaluated using precision, recall, F1-score, and accuracy. The results show accuracies of 98% for NN with statistical features, 95% for KNN with statistical features, 85% for NN with linguistic features, and 82% for KNN with linguistic features. NN therefore outperforms KNN with both feature types, and statistical features are more effective than linguistic features with both algorithms.
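The statistical-feature pipeline described in the abstract (TF-IDF features, a KNN classifier, and evaluation by precision, recall, F1-score, and accuracy) can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, cosine similarity as the neighbor measure, k = 3, and the six toy messages are all assumptions made for illustration, and the NN model is not covered.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; the study's preprocessing is not stated.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    # Sparse, L2-normalized TF-IDF vector; terms unseen in training are dropped.
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(v * b.get(t, 0.0) for t, v in a.items())


def knn_predict(text, train_vecs, train_labels, idf, k=3):
    # Majority vote among the k most similar training messages.
    query = tfidf_vector(text, idf)
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def binary_metrics(y_true, y_pred, positive="spam"):
    # Accuracy, precision, recall, and F1 with "spam" as the positive class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy data, invented for this sketch (not the study's dataset).
train_texts = [
    "win a free prize now",
    "free cash prize claim now",
    "claim your free loan offer",
    "are we meeting for lunch today",
    "see you at the office today",
    "call me when you get home",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
test_texts = ["free prize claim", "lunch at the office"]
test_labels = ["spam", "ham"]

idf = fit_idf(train_texts)
train_vecs = [tfidf_vector(t, idf) for t in train_texts]
predictions = [knn_predict(t, train_vecs, train_labels, idf) for t in test_texts]
print(predictions)
print(binary_metrics(test_labels, predictions))
```

On a real corpus the same four metrics would be computed on a held-out test split, and the value of k would normally be tuned rather than fixed.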
References
Abayomi-Alli, O., Misra, S., & Abayomi-Alli, A. (2022). A deep learning method for automatic SMS spam classification: Performance of learning algorithms on indigenous dataset. Concurrency and Computation: Practice and Experience, 11(1).
Al-Jumaili, A. S. A., & Tayyeh, H. K. (2020). A hybrid method of linguistic and statistical features for Arabic sentiment analysis. Baghdad Science Journal, 17(1), 385–390. https://doi.org/10.21123/BSJ.2020.17.1(SUPPL.).0385
Dwiyansaputra, R., Nugraha, G. S., Bimantoro, F., & Aranta, A. (2021). Indonesian SMS spam detection using TF-IDF and stochastic gradient descent. Jurnal Teknologi Informasi, Komputer, dan Aplikasi, 3(2), 200–207.
Firmansyah, M. R., Ilyas, R., & Kasyidi, F. (2020). Scientific sentence classification using recurrent neural network. Proceedings of The 11th Industrial Research Workshop and National Seminar, 11(1).
Giri, S., Das, S., Das, S. B., & Banerjee, S. (2023). SMS spam classification–Simple deep learning models with higher accuracy using BUNOW and GloVe word embedding. Journal of Applied Science and Engineering, 26(10).
Herwanto, H., Chusna, N. L., & Arif, M. S. (2021). Classification of spam SMS in Indonesian using Naïve Bayes multinomial algorithm. Jurnal Media Informatika Budidarma, 5(4), 1316–1325. https://doi.org/10.30865/mib.v5i4.3119
Laksono, E. P., Basuki, A., & Bachtiar, F. A. (2020). K-value optimization on KNN algorithm for spam and email ham classification. Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi), 4(2), 377–383.
Ramadhan, A., Lindawati, L., & Rose, M. M. (2023). Comparison of neural network and K-nearest neighbor algorithms in detecting Android malware. Building of Informatics, Technology and Science (BITS), 5(1).
Reviantika, F., Azhar, Y., & Marthasari, G. I. (2021). SMS spam classification analysis using logistic regression. Jurnal Sistem Cerdas, 4(3).
Rumlaklak, N. D., Fanggidae, A., & Polly, Y. T. (2022). Zone status classification in Kupang City using Naïve Bayes classifier algorithm. Jurnal Komputer dan Informatika, 10(1).
Runimeirati, M., Muis, A., & Muhammad, F. (2023). Text mining training using the Python programming language. Abdimas Langkanae, 3(1).
Setifani, N. A., Fitriana, D. N., & Yusuf, A. (2020). Comparison of Naïve Bayes, SVM, and decision tree algorithms for SMS spam classification. Jurnal Sistem Informasi Musirawas, 5(2).
Truecaller. (2020). Truecaller Insights Report 2020. https://www.truecaller.com/blog/insights/truecaller-insights-2020
Truecaller. (2021). Truecaller Insights Report 2021. https://www.truecaller.com/blog/insights/truecaller-insights-2021
Wahid, A., Baharulloh, M., Kahfiansyah, R., Abrilianto, T., Saifudin, A., & Mulyati, S. (2021). SMS spam identification using Naïve Bayes method. Jurnal Informatika Universitas Pamulang, 6(3).
Wijaya, D. P., Murti, L. D., & Rachman, M. R. (2022). Precision and recall in the Online Public Access Catalog (OPAC) of the Archives and Library Office of Bandung City. VISI PUSTAKA: Bulletin of the Interlibrary Information Network, 24(1).
Zhong, Z., Liu, Z., Tegmark, M., & Andreas, J. (2023). The clock and the pizza: Two stories in mechanistic explanation of neural networks. Advances in Neural Information Processing Systems, 36.
License
Copyright (c) 2025 Yasir Muin

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this proceedings agree to the following terms:
Authors retain copyright and grant the Nusantara Science and Technology Proceedings right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this proceeding.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the proceedings published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this proceeding.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See the Effect of Open Access).