Structure-Aware Chunking for Complex Tables in Retrieval-Augmented Generation Systems
Retrieval-Augmented Generation (RAG) is a hybrid method that combines information retrieval with large language models to generate context-aware, factually grounded responses. However, RAG systems rely heavily on well-structured input data to produce accurate and contextually relevant answers. Documents with complex table layouts pose significant challenges, as most chunking strategies are text-centric and often mishandle table-rich documents containing multi-column and multi-row structures. Hence, this study proposes a customized structure-aware chunking framework designed for university course documents containing multi-column, multi-row tables with nested headers. The framework employs Camelot for high-fidelity table extraction, followed by customized logic that constructs semantically coherent chunks preserving the academic term, subject name, credit hours, and category of each course entry, which prevents semantic fragmentation during retrieval. The proposed method is evaluated with the RAGAS framework and compared against several baselines built on standard parsing and chunking techniques. Results show that the proposed approach achieves the highest answer accuracy (0.73) and substantially improves retrieval relevance and contextual precision, demonstrating the framework’s effectiveness on structure-dependent academic queries. The study highlights that both parsing quality and chunking strategy are essential for retaining semantic relationships in table-rich documents, offering a practical improvement for RAG systems in structurally complex scenarios.
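The abstract does not spell out the chunking logic, but a minimal sketch of the general idea, row-level chunks that inherit the column headers and the enclosing academic term, might look like the following. The single-cell term-banner heuristic and the field labels are illustrative assumptions, not the authors' implementation; only Camelot's `read_pdf` and `table.df` are the library's actual API.

```python
import camelot  # pip install "camelot-py[cv]"

def table_to_chunks(pdf_path: str, pages: str = "all") -> list[str]:
    """Extract tables with Camelot and emit one self-contained chunk per course row."""
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor="lattice")
    chunks = []
    for table in tables:
        df = table.df                                # pandas DataFrame of raw cell strings
        headers = [h.strip() for h in df.iloc[0]]    # assumption: row 0 holds column headers
        current_term = None
        for _, row in df.iloc[1:].iterrows():
            cells = [c.strip() for c in row]
            non_empty = [c for c in cells if c]
            # Assumed heuristic: a row with exactly one populated cell is a
            # term banner (e.g. "Semester 1"); carry it into every following
            # chunk so the term context survives retrieval.
            if len(non_empty) == 1:
                current_term = non_empty[0]
                continue
            fields = "; ".join(f"{h}: {c}" for h, c in zip(headers, cells) if c)
            prefix = f"Term: {current_term}; " if current_term else ""
            chunks.append(prefix + fields)
    return chunks
```

Flattening each row into explicit "header: value" statements keeps every chunk self-describing, so the retriever never surfaces a bare cell such as "3" detached from the subject and term it belongs to.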
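Similarly, the RAGAS evaluation mentioned above is typically wired up as in the sketch below. The question, answer, and contexts are invented placeholders, the column names follow RAGAS 0.1-style datasets and may differ in other versions, and mapping the paper's "answer accuracy" to RAGAS's `answer_correctness` metric is an assumption.

```python
# pip install ragas datasets
# RAGAS needs a judge LLM to score metrics (an OpenAI API key by default).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, context_precision

# Invented placeholder sample illustrating the expected dataset shape.
sample = Dataset.from_dict({
    "question": ["How many credit hours is Data Structures in Semester 2?"],
    "answer": ["Data Structures carries 3 credit hours in Semester 2."],
    "contexts": [[
        "Term: Semester 2; Subject: Data Structures; Credit Hours: 3; Category: Core"
    ]],
    "ground_truth": ["Data Structures is a 3-credit-hour course in Semester 2."],
})

result = evaluate(sample, metrics=[answer_correctness, context_precision])
print(result)  # per-metric scores over the evaluation set
```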
- This work (including HTML and PDF Files) is licensed under a Creative Commons Attribution 4.0 International License.



















