Enhance Multimodal Retrieval-Augmented Generation Using Multimodal Knowledge Graph
Large Language Models (LLMs) have shown impressive capabilities in natural language understanding and generation. However, their reliance on text-only input limits their ability to handle tasks that require multimodal reasoning. To overcome this, Multimodal Large Language Models (MLLMs) have been introduced, accepting inputs such as images, text, video, and audio. While MLLMs address some of these limitations, they often hallucinate due to over-reliance on internal knowledge and incur high computational costs. Traditional vector-based multimodal Retrieval-Augmented Generation (RAG) systems attempt to mitigate these issues by retrieving supporting information, but they frequently suffer from cross-modal misalignment, where independently retrieved text and image content fail to align meaningfully. Motivated by the structured retrieval capabilities of text-based knowledge graph RAG, this paper proposes VisGraphRAG, which addresses this challenge by modelling structured relationships between images and text within a unified Multimodal Knowledge Graph (MMKG). This structure enables more accurate retrieval and better alignment across modalities, resulting in more relevant and complete responses. Experimental results show that VisGraphRAG significantly outperforms the vector database-based baseline RAG, achieving a higher answer accuracy of 0.7629 compared to 0.6743. Beyond accuracy, VisGraphRAG also performs better on key RAGAS metrics such as multimodal relevance (0.8802 vs. 0.7912), demonstrating its stronger ability to retrieve relevant information across modalities. These results underscore the effectiveness of the proposed MMKG-based method in enhancing cross-modal alignment and supporting more accurate, context-aware generation in complex multimodal tasks.
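To make the core idea concrete, the sketch below illustrates MMKG-style retrieval in miniature: text and image nodes are linked by explicit relations, a query is matched against text nodes, and linked image nodes are collected by following graph edges, so the two modalities are aligned by construction rather than retrieved independently. This is a minimal conceptual sketch, not the paper's implementation: the node names, relation labels, toy content, and the placeholder `embed()` encoder are illustrative assumptions; a real system would use a shared multimodal encoder (e.g., CLIP) and the paper's actual MMKG construction pipeline.

```python
# Minimal sketch of MMKG-backed multimodal retrieval (illustrative only).
# Assumptions: embed() stands in for a shared text/image encoder (e.g., CLIP);
# node names, edge relations, and the toy corpus are hypothetical.

import networkx as nx
import numpy as np

def embed(content: str) -> np.ndarray:
    """Placeholder encoder: replace with a real multimodal encoder (e.g., CLIP)."""
    rng = np.random.default_rng(abs(hash(content)) % (2**32))
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b)  # vectors are already L2-normalised

# 1) Build a tiny multimodal knowledge graph: text and image nodes linked by relations.
kg = nx.Graph()
kg.add_node("txt:chest_xray_report", modality="text",
            content="Report describing an opacity in the left lower lobe.")
kg.add_node("img:chest_xray_001", modality="image", content="chest_xray_001.png")
kg.add_edge("txt:chest_xray_report", "img:chest_xray_001", relation="describes")

# Attach an embedding to every node so queries can be matched by similarity.
for _, data in kg.nodes(data=True):
    data["emb"] = embed(data["content"])

# 2) Retrieve: rank text nodes against the query, then follow edges to linked images,
#    so textual and visual evidence stay aligned instead of being fetched separately.
def retrieve(query: str, top_k: int = 1):
    q = embed(query)
    text_nodes = [n for n, d in kg.nodes(data=True) if d["modality"] == "text"]
    ranked = sorted(text_nodes, key=lambda n: cosine(q, kg.nodes[n]["emb"]), reverse=True)
    results = []
    for n in ranked[:top_k]:
        linked_images = [m for m in kg.neighbors(n) if kg.nodes[m]["modality"] == "image"]
        results.append({"text": kg.nodes[n]["content"], "images": linked_images})
    return results

print(retrieve("Where is the opacity located?"))
```

The design point the abstract attributes to the MMKG is visible here: because image evidence is reached by traversing an explicit relation from the matched text node, the retrieved context is cross-modally consistent by construction, which a pair of independent vector searches over text and image stores cannot guarantee.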