APPLICATION OF MULTILINGUAL BERT FOR CLASSIFYING INDONESIAN AND/OR MALAYSIAN MALAY IN SHORT SOCIAL MEDIA TEXTS
DOI: https://doi.org/10.70248/jrsit.v3i3.3411
Abstract
This study aims to develop an automatic classification system that distinguishes Indonesian from Malaysian Malay in short social media texts. The research method comprises collecting data from Twitter via web scraping, text selection and preprocessing, data labeling, splitting the dataset into training and test sets, and applying the Multilingual BERT (mBERT) model with fine-tuning, with performance evaluated using accuracy, precision, recall, and F1-score. The results show that the mBERT model classifies texts with 95.81% accuracy and an average F1-score of 0.96, performing well on both languages, although some errors occur on texts whose vocabularies are highly similar. The study concludes that mBERT is effective and promising for classifying closely related languages in short social media texts.
Keywords: Multilingual BERT, Language Classification, Indonesian, Malaysian Malay, Short Text
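The evaluation step described in the abstract (accuracy, precision, recall, F1-score for a binary Indonesian-vs-Malaysian classifier) can be sketched in plain Python. This is an illustrative example only: the `evaluate` function, the label strings `"ID"`/`"MS"`, and the sample predictions are assumptions, not the authors' actual code or data.

```python
# Hypothetical sketch of the evaluation metrics named in the abstract,
# for a binary Indonesian ("ID") vs. Malaysian Malay ("MS") classifier.

def evaluate(y_true, y_pred, positive="ID"):
    """Compute accuracy, precision, recall, and F1 treating `positive`
    as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative labels and predictions (not from the paper's dataset):
metrics = evaluate(["ID", "ID", "MS", "MS"], ["ID", "MS", "MS", "MS"])
```

In a full pipeline these metrics would be computed per class and averaged, which is how the reported average F1-score of 0.96 would be obtained.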


















