Comparison of Normalization of Indonesian Slang Words Using the FastText & Word2vec Model with the Natural Language Processing Approach

Rifqah Nur Surayya M. Jen; Syarifuddin N. Kapita; Muhammad Fhadli

doi:10.11594/nstp.2025.4805

Authors

Rifqah Nur Surayya M. Jen, Informatics Engineering Study Program, Faculty of Engineering, Khairun University, Ternate, North Maluku, 97716, Indonesia
Syarifuddin N. Kapita, Informatics Engineering Study Program, Faculty of Engineering, Khairun University, Ternate, North Maluku, 97716, Indonesia
Muhammad Fhadli, Informatics Engineering Study Program, Faculty of Engineering, Khairun University, Ternate, North Maluku, 97716, Indonesia

Keywords:

Communication, FastText, Natural Language Processing, slang words, Twitter, Word2Vec

Abstract

Slang is widely used on social media such as Twitter, but it can be hard to understand out of context, which makes communication less effective for readers who are unfamiliar with it. A word normalization approach is therefore needed to convert slang words into their formal equivalents so that they are more widely understood. Natural Language Processing (NLP) is a set of computational techniques for analyzing and representing text or spoken language with the aim of human-like language processing. This research compares two feature extraction techniques, FastText and Word2Vec, which map words to numerical vectors. In tests on slang words, FastText achieved a highest similarity of 0.9934859978 and a lowest of 0.8928895496, while Word2Vec achieved a highest similarity of 0.9977979123 and a lowest of 0.0975351095. FastText required 0.432 seconds for training and 0.016 seconds for normalization, while Word2Vec required 0.027 seconds for training and 0.006 seconds for normalization.
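
To illustrate how similarity between word vectors can drive normalization, the following is a minimal sketch and not the authors' implementation. It assumes the reported similarity is cosine similarity, which the abstract does not state. The three-dimensional vectors, the function names `cosine_similarity` and `normalize_slang`, and the small vocabulary are all hypothetical; in practice the vectors would come from a trained FastText or Word2Vec model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def normalize_slang(slang_word, embeddings, formal_vocab):
    """Return (formal_word, similarity) for the closest formal candidate."""
    query = embeddings[slang_word]
    scored = [(w, cosine_similarity(query, embeddings[w])) for w in formal_vocab]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy vectors, invented for illustration only.
toy_embeddings = {
    "gue": [0.9, 0.1, 0.0],    # slang for "saya" (I)
    "saya": [0.8, 0.2, 0.1],
    "tidak": [0.0, 0.9, 0.3],
    "makan": [0.1, 0.1, 0.9],
}
formal_vocab = ["saya", "tidak", "makan"]

best_word, score = normalize_slang("gue", toy_embeddings, formal_vocab)
print(best_word, round(score, 4))
```

With these toy vectors the slang word "gue" is mapped to "saya" because that candidate has the largest cosine similarity. A lookup like this fails for a word that has no vector at all, which is one reason subword-based models such as FastText are often considered for slang.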
