Structure-Aware Chunking for Complex Tables in Retrieval-Augmented Generation Systems

Keywords: Retrieval-Augmented Generation, Large Language Model, Chunking, Table Parsing, Complex Table Layout

Authors

  • Xin-Kuang Koay 1) Faculty of Information Science and Technology, Multimedia University, Malacca 75450, Malaysia. 2) Centre for Advanced Analytics, CoE for Artificial Intelligence, Faculty of Information Science and Technology, Multimedia University, Melaka 75450, Malaysia
  • Lee-Yeng Ong
    lyong@mmu.edu.my
    1) Faculty of Information Science and Technology, Multimedia University, Malacca 75450, Malaysia. 2) Centre for Advanced Analytics, CoE for Artificial Intelligence, Faculty of Information Science and Technology, Multimedia University, Melaka 75450, Malaysia https://orcid.org/0000-0003-4749-3490
  • Pey-Yun Goh 1) Faculty of Information Science and Technology, Multimedia University, Malacca 75450, Malaysia. 2) Centre for Advanced Analytics, CoE for Artificial Intelligence, Faculty of Information Science and Technology, Multimedia University, Melaka 75450, Malaysia https://orcid.org/0000-0003-2060-3223

Retrieval-Augmented Generation (RAG) is a hybrid method that combines information retrieval with large language models to generate context-aware, factually grounded responses. However, a RAG system relies heavily on well-structured input data to produce accurate and relevant answers. Documents with complex table layouts pose a significant challenge, because most chunking strategies are text-centric and handle multi-column, multi-row tables poorly. This study therefore proposes a customized structure-aware chunking framework designed for university course documents that contain multi-column, multi-row tables with nested headers. The framework uses Camelot for high-fidelity table extraction, followed by customized logic that builds semantically coherent chunks, each preserving the academic term, subject name, credit hours, and category of a course. Keeping these fields together prevents semantic fragmentation during retrieval. The proposed method is evaluated with the RAGAS framework and compared against several baselines that use standard parsing and chunking techniques. The proposed approach achieves the highest answer accuracy, 0.73, and substantially improves retrieval relevance and contextual precision. These findings demonstrate the framework’s effectiveness in handling structure-dependent academic queries. The study shows that both parsing quality and chunking strategy are essential to retaining semantic relationships in table-rich documents, and it offers a practical improvement for RAG systems in structurally complex scenarios.
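
The chunk-construction idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the column names, the sample rows, the `CONTEXT_COLUMNS` tuple, and the `build_chunks` helper are all hypothetical, and the rows are written by hand in the shape a table extractor such as Camelot might return, with merged cells left blank after their first row. The sketch carries the last seen term and category forward so that each course row becomes one self-contained chunk.

```python
from typing import Dict, List

# Hypothetical extracted table: the first row is the header, and cells that
# were merged in the source table are blank after their first occurrence.
ROWS: List[List[str]] = [
    ["Term", "Category", "Subject", "Credit Hours"],
    ["Year 1 Trimester 1", "Core", "Programming Fundamentals", "4"],
    ["", "", "Discrete Mathematics", "3"],
    ["", "Elective", "Introduction to Multimedia", "3"],
    ["Year 1 Trimester 2", "Core", "Data Structures", "4"],
]

# Assumption: only these columns span several rows and need to be carried down.
CONTEXT_COLUMNS = ("Term", "Category")


def build_chunks(rows: List[List[str]]) -> List[str]:
    """Turn each course row into one chunk that restates its full context."""
    header, *body = rows
    carried: Dict[str, str] = {}  # last non-blank value seen per context column
    chunks: List[str] = []
    for row in body:
        record: Dict[str, str] = {}
        for name, cell in zip(header, row):
            value = cell.strip()
            if name in CONTEXT_COLUMNS:
                if value:
                    carried[name] = value
                # Blank context cells inherit the value from the rows above.
                value = carried.get(name, "")
            record[name] = value
        chunks.append("; ".join(f"{name}: {value}" for name, value in record.items()))
    return chunks


if __name__ == "__main__":
    for chunk in build_chunks(ROWS):
        print(chunk)
```

Because every chunk restates its term and category, a query about one subject can be answered from a single retrieved chunk instead of depending on a neighbouring row that may not be retrieved. A real pipeline would also need to flatten nested headers before this step, which the sketch does not attempt.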