APPLICATION OF MULTILINGUAL BERT FOR CLASSIFYING INDONESIAN AND/OR MALAYSIAN MALAY IN SHORT SOCIAL MEDIA TEXTS

Authors

  • Moch. Chaidar Chanif Universitas Islam Sultan Agung Semarang
  • Imam Much Ibnu Subroto Universitas Islam Sultan Agung Semarang

DOI:

https://doi.org/10.70248/jrsit.v3i3.3411

Abstract

This study aims to develop an automatic classification system that distinguishes Indonesian from Malaysian Malay in short social media texts. The research method covers data collection from Twitter via web scraping, text selection and preprocessing, data labeling, splitting the dataset into training and test sets, and fine-tuning a Multilingual BERT (mBERT) model, with performance evaluated using accuracy, precision, recall, and F1-score. The results show that the mBERT model classifies the texts with an accuracy of 95.81% and an average F1-score of 0.96, performing well on both languages, although some errors remain on texts whose vocabularies are highly similar. The study concludes that mBERT is effective and promising for classifying closely related languages in short social media texts.

Keywords: Multilingual BERT, Language Classification, Indonesian, Malaysian Malay, Short Text
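
The pipeline summarized in the abstract (Twitter scraping, preprocessing, labeling, a train/test split, mBERT fine-tuning, and metric-based evaluation) can be illustrated with a short, self-contained sketch. The snippet below is not the authors' code: the checkpoint name, hyperparameters, and the tiny in-memory example sentences are assumptions made purely for illustration; the actual study trains and evaluates on its scraped, labeled Twitter dataset.

# Minimal sketch (not the authors' released code): fine-tuning multilingual BERT
# (bert-base-multilingual-cased) as a binary Indonesian (0) vs. Malaysian Malay (1)
# classifier for short texts, then scoring it with the four metrics named in the abstract.
# The tiny in-memory dataset below is illustrative only.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

MODEL_NAME = "bert-base-multilingual-cased"  # public mBERT checkpoint (assumed)
texts = [
    "Aku lagi nggak bisa keluar rumah hari ini",  # Indonesian (example, made up)
    "Saya tak boleh keluar rumah hari ini",       # Malaysian Malay (example, made up)
    "Besok kita ketemu di kampus ya",             # Indonesian
    "Esok kita jumpa di kampus ya",               # Malaysian Malay
]
labels = torch.tensor([0, 1, 0, 1])

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Tokenize the short posts with padding/truncation.
enc = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], labels),
                    batch_size=2, shuffle=True)

# Standard fine-tuning loop: the model returns a cross-entropy loss when labels are passed.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):
    for input_ids, attention_mask, y in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Evaluation: accuracy, precision, recall, F1 (macro average), mirroring the reported metrics.
model.eval()
with torch.no_grad():
    logits = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"]).logits
preds = logits.argmax(dim=-1).numpy()
acc = accuracy_score(labels.numpy(), preds)
prec, rec, f1, _ = precision_recall_fscore_support(labels.numpy(), preds, average="macro")
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")

In practice the metrics would be computed on the held-out test split described in the abstract rather than on the training sentences; the reported 95.81% accuracy and 0.96 F1-score come from the authors' full dataset, not from a toy sample like this one.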

 


Published

2026-02-05

How to Cite

Moch. Chaidar Chanif, & Imam Much Ibnu Subroto. (2026). PENERAPAN MULTILINGUAL BERT UNTUK KLASIFIKASI BAHASA INDONESIA DAN ATAU BAHASA MALAYSIA PADA TEKS PENDEK MEDIA SOSIAL. Jurnal Rekayasa Sistem Informasi Dan Teknologi, 3(3), 446–457. https://doi.org/10.70248/jrsit.v3i3.3411

Section

Articles