Multimodal Emotion Recognition Using Hybrid Large Language Models and Metaheuristic Algorithms
Emotion recognition is a vital component of human–computer interaction and intelligent systems, yet robust multimodal emotion recognition remains challenging due to high-dimensional input spaces, noisy features, and the complexity of integrating heterogeneous modalities. This study proposes a novel hybrid multimodal framework that improves both accuracy and computational efficiency by combining the semantic representation capability of Large Language Models (LLMs) with the optimization strengths of metaheuristic algorithms. In the proposed approach, an LLM extracts high-level contextual features from the text and audio streams, while the Binary Artificial Hummingbird Algorithm (BAHA) performs feature selection to remove redundant attributes. Subsequently, the GOOSE Algorithm (GA) optimizes the classifier hyperparameters, and the Komodo Mlipir Algorithm (KMA) carries out late fusion of the per-modality outputs. Experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, evaluated over six emotion categories, show that the hybrid approach captures subtle affective cues and surpasses state-of-the-art baselines, achieving an accuracy of 87.5%. Integrating LLMs with multiple specialized metaheuristics therefore yields a substantially more robust emotion recognition pipeline and represents a promising direction toward more emotionally intelligent systems.
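To make the four-stage design concrete, the sketch below mirrors its control flow in Python. It is a minimal illustration under stated assumptions: the synthetic embeddings, the logistic-regression fitness function, and the random/grid searches standing in for BAHA, GOOSE, and KMA are hypothetical placeholders, not the authors' implementation.

```python
# Illustrative sketch of the four-stage pipeline (NOT the paper's code).
# The data, fitness function, and placeholder searches are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)

# Stage 1 (assumed): stand-ins for LLM-derived contextual embeddings.
n, d_text, d_audio, n_classes = 200, 64, 48, 6
X_text = rng.normal(size=(n, d_text))
X_audio = rng.normal(size=(n, d_audio))
y = rng.integers(0, n_classes, size=n)

def fitness(X, y, C=1.0):
    # Cross-validated accuracy of a simple classifier: the objective
    # each metaheuristic would maximize.
    return cross_val_score(LogisticRegression(C=C, max_iter=500), X, y, cv=3).mean()

# Stage 2 (assumed): binary feature selection in the spirit of BAHA,
# approximated here by random search over binary feature masks.
X_all = np.hstack([X_text, X_audio])
best_mask, best_fit = np.ones(X_all.shape[1], dtype=bool), fitness(X_all, y)
for _ in range(20):
    mask = rng.random(X_all.shape[1]) > 0.5
    if mask.any() and (f := fitness(X_all[:, mask], y)) > best_fit:
        best_mask, best_fit = mask, f

# Stage 3 (assumed): hyperparameter tuning in the spirit of GOOSE,
# approximated by random search over the regularization strength C.
best_C = max((rng.uniform(0.01, 10.0) for _ in range(10)),
             key=lambda C: fitness(X_all[:, best_mask], y, C))

# Stage 4 (assumed): KMA-style late fusion, approximated by grid-searching
# a convex weight over the two per-modality probability outputs.
Xtr_t, Xte_t, Xtr_a, Xte_a, ytr, yte = train_test_split(
    X_text, X_audio, y, test_size=0.3, random_state=0)
p_t = LogisticRegression(max_iter=500).fit(Xtr_t, ytr).predict_proba(Xte_t)
p_a = LogisticRegression(max_iter=500).fit(Xtr_a, ytr).predict_proba(Xte_a)
best_w = max(np.linspace(0.0, 1.0, 11),
             key=lambda w: ((w * p_t + (1 - w) * p_a).argmax(axis=1) == yte).mean())

print(f"kept {best_mask.sum()} features, C={best_C:.2f}, fusion weight={best_w:.1f}")
```

The point of the sketch is the interface rather than the optimizers: each metaheuristic only needs a fitness function to maximize (cross-validated accuracy here), so BAHA, GOOSE, and KMA implementations could be dropped in wherever the placeholder searches appear.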