Enhancing Multimodal Retrieval-Augmented Generation Using a Multimodal Knowledge Graph

Keywords: Multimodal Retrieval-Augmented Generation, Multimodal Knowledge Graph, Multimodal Large Language Models, Cross-Modality Alignment

Large Language Models (LLMs) have shown impressive capabilities in natural language understanding and generation tasks. However, their reliance on text-only input limits their ability to handle tasks that require multimodal reasoning. To overcome this, Multimodal Large Language Models (MLLMs) have been introduced, enabling them to process inputs such as images, text, video, and audio. While MLLMs address some of these limitations, they often suffer from hallucinations due to over-reliance on internal knowledge and incur high computational costs. Traditional vector-based multimodal RAG systems attempt to mitigate these issues by retrieving supporting information, but they often suffer from cross-modal misalignment, where independently retrieved text and image content fail to align meaningfully. Motivated by the structured retrieval capabilities of text-based knowledge graph RAG, this paper proposes VisGraphRAG, which addresses this challenge by modelling structured relationships between images and text within a unified Multimodal Knowledge Graph (MMKG). This structure enables more accurate retrieval and better alignment across modalities, resulting in more relevant and complete responses. Experimental results show that VisGraphRAG significantly outperforms a vector database-based baseline RAG, achieving higher answer accuracy (0.7629 vs. 0.6743). Beyond accuracy, VisGraphRAG also performs better on key RAGAS metrics such as multimodal relevance (0.8802 vs. 0.7912), demonstrating a stronger ability to retrieve relevant information across modalities. These results underscore the effectiveness of the proposed MMKG-based method in enhancing cross-modal alignment and supporting more accurate, context-aware generation in complex multimodal tasks.
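
To illustrate the structured retrieval idea described above, the sketch below builds a toy MMKG in which image and text nodes are explicitly linked by relations, and retrieves the neighbourhood around a query's entities so that images and their associated text are returned together. The graph library, node names, triple format, and one-hop heuristic are assumptions chosen for exposition, not the paper's actual implementation.

```python
# Minimal sketch of MMKG-based retrieval, assuming a simple
# (head, relation, tail) triple format with per-node modality tags.
# Names and the hop-based heuristic are illustrative only.
import networkx as nx


def build_mmkg(triples):
    """Build a multimodal knowledge graph from annotated triples.

    Each node carries a 'modality' attribute ('text' or 'image') so that
    retrieval can return aligned cross-modal neighbourhoods.
    """
    g = nx.MultiDiGraph()
    for head, relation, tail, head_mod, tail_mod in triples:
        g.add_node(head, modality=head_mod)
        g.add_node(tail, modality=tail_mod)
        g.add_edge(head, tail, relation=relation)
    return g


def retrieve_subgraph(g, seed_entities, hops=1):
    """Return the subgraph within `hops` of the query's seed entities.

    Because images and their textual descriptions are linked explicitly,
    the retrieved context stays aligned across modalities.
    """
    undirected = g.to_undirected()
    nodes = set()
    for seed in seed_entities:
        if seed in undirected:
            nodes |= set(nx.ego_graph(undirected, seed, radius=hops).nodes)
    return g.subgraph(nodes)


# Toy usage: an image node linked to the entity it depicts, plus a text fact.
triples = [
    ("img_001.png", "depicts", "Eiffel Tower", "image", "text"),
    ("Eiffel Tower", "located_in", "Paris", "text", "text"),
]
mmkg = build_mmkg(triples)
context = retrieve_subgraph(mmkg, ["Eiffel Tower"])
print(list(context.nodes(data=True)))
```

In this toy example, retrieving around the entity "Eiffel Tower" returns both the image node and the related text node in a single connected context, which is the kind of cross-modal alignment the abstract attributes to the MMKG structure; the full system would additionally handle entity linking from the query and feed the retrieved subgraph to the generator.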