Enhancing Multimodal Retrieval-Augmented Generation Using a Multimodal Knowledge Graph

Keywords: Multimodal Retrieval-Augmented Generation, Multimodal Knowledge Graph, Multimodal Large Language Models, Cross-Modality Alignment

Large Language Models (LLMs) have shown impressive capabilities in natural language understanding and generation tasks. However, their reliance on text-only input limits their ability to handle tasks that require multimodal reasoning. To overcome this, Multimodal Large Language Models (MLLMs) have been introduced, enabling them to process inputs such as images, text, video, and audio. While MLLMs address some of these limitations, they often suffer from hallucinations due to over-reliance on internal knowledge and incur high computational costs. Traditional vector-based multimodal RAG systems attempt to mitigate these issues by retrieving supporting information, but they often suffer from cross-modal misalignment, where independently retrieved text and image content fail to align meaningfully. Motivated by the structured retrieval capabilities of text-based knowledge graph RAG, this paper proposes VisGraphRAG, which addresses this challenge by modelling structured relationships between images and text within a unified Multimodal Knowledge Graph (MMKG). This structure enables more accurate retrieval and better alignment across modalities, resulting in more relevant and complete responses. Experimental results show that VisGraphRAG significantly outperforms a vector database-based baseline RAG, achieving higher answer accuracy (0.7629 vs. 0.6743). Beyond accuracy, VisGraphRAG also performs better on key RAGAS metrics such as multimodal relevance (0.8802 vs. 0.7912), demonstrating a stronger ability to retrieve relevant information across modalities. These results underscore the effectiveness of the proposed MMKG-based method in enhancing cross-modal alignment and supporting more accurate, context-aware generation in complex multimodal tasks.
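
To illustrate the structured retrieval idea described above, the sketch below builds a toy MMKG in which image and text nodes are explicitly linked by relations, and retrieves the neighbourhood around a query's entities so that images and their associated text are returned together. The graph library, node names, triple format, and one-hop heuristic are assumptions chosen for exposition, not the paper's actual implementation.

```python
# Minimal sketch of MMKG-based retrieval, assuming a simple
# (head, relation, tail) triple format with per-node modality tags.
# Names and the hop-based heuristic are illustrative only.
import networkx as nx


def build_mmkg(triples):
    """Build a multimodal knowledge graph from annotated triples.

    Each node carries a 'modality' attribute ('text' or 'image') so that
    retrieval can return aligned cross-modal neighbourhoods.
    """
    g = nx.MultiDiGraph()
    for head, relation, tail, head_mod, tail_mod in triples:
        g.add_node(head, modality=head_mod)
        g.add_node(tail, modality=tail_mod)
        g.add_edge(head, tail, relation=relation)
    return g


def retrieve_subgraph(g, seed_entities, hops=1):
    """Return the subgraph within `hops` of the query's seed entities.

    Because images and their textual descriptions are linked explicitly,
    the retrieved context stays aligned across modalities.
    """
    undirected = g.to_undirected()
    nodes = set()
    for seed in seed_entities:
        if seed in undirected:
            nodes |= set(nx.ego_graph(undirected, seed, radius=hops).nodes)
    return g.subgraph(nodes)


# Toy usage: an image node linked to the entity it depicts, plus a text fact.
triples = [
    ("img_001.png", "depicts", "Eiffel Tower", "image", "text"),
    ("Eiffel Tower", "located_in", "Paris", "text", "text"),
]
mmkg = build_mmkg(triples)
context = retrieve_subgraph(mmkg, ["Eiffel Tower"])
print(list(context.nodes(data=True)))
```

In this toy example, retrieving around the entity "Eiffel Tower" returns both the image node and the related text node in a single connected context, which is the kind of cross-modal alignment the abstract attributes to the MMKG structure; the full system would additionally handle entity linking from the query and feed the retrieved subgraph to the generator.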