Structure-Aware Chunking for Complex Tables in Retrieval-Augmented Generation Systems
Retrieval-Augmented Generation (RAG) is a hybrid method that combines information retrieval with large language models to generate context-aware, factually grounded responses. However, RAG systems rely heavily on well-structured input data to produce accurate and contextually relevant answers. Documents with complex table layouts pose significant challenges, as most chunking strategies are text-centric and often mishandle table-rich documents containing multi-column and multi-row structures. Hence, this study proposes a customized structure-aware chunking framework designed for university course documents containing multi-column, multi-row tables with nested headers. The framework employs Camelot for high-fidelity table extraction, followed by customized logic that constructs semantically coherent chunks preserving the academic term, subject name, credit hours, and category of each course entry, which prevents semantic fragmentation during retrieval. The proposed method is evaluated with the RAGAS framework and compared against several baselines built on standard parsing and chunking techniques. Results show that the proposed approach achieves the highest answer accuracy (0.73) and substantially improves retrieval relevance and contextual precision, demonstrating the framework’s effectiveness on structure-dependent academic queries. The study highlights that both parsing quality and chunking strategy are essential for retaining semantic relationships in table-rich documents, offering a practical improvement for RAG systems in structurally complex scenarios.
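The abstract does not spell out the chunking logic, but a minimal sketch of the general idea, row-level chunks that inherit the column headers and the enclosing academic term, might look like the following. The single-cell term-banner heuristic and the field labels are illustrative assumptions, not the authors' implementation; only Camelot's `read_pdf` and `table.df` are the library's actual API.

```python
import camelot  # pip install "camelot-py[cv]"

def table_to_chunks(pdf_path: str, pages: str = "all") -> list[str]:
    """Extract tables with Camelot and emit one self-contained chunk per course row."""
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor="lattice")
    chunks = []
    for table in tables:
        df = table.df                                # pandas DataFrame of raw cell strings
        headers = [h.strip() for h in df.iloc[0]]    # assumption: row 0 holds column headers
        current_term = None
        for _, row in df.iloc[1:].iterrows():
            cells = [c.strip() for c in row]
            non_empty = [c for c in cells if c]
            # Assumed heuristic: a row with exactly one populated cell is a
            # term banner (e.g. "Semester 1"); carry it into every following
            # chunk so the term context survives retrieval.
            if len(non_empty) == 1:
                current_term = non_empty[0]
                continue
            fields = "; ".join(f"{h}: {c}" for h, c in zip(headers, cells) if c)
            prefix = f"Term: {current_term}; " if current_term else ""
            chunks.append(prefix + fields)
    return chunks
```

Flattening each row into explicit "header: value" statements keeps every chunk self-describing, so the retriever never surfaces a bare cell such as "3" detached from the subject and term it belongs to.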
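Similarly, the RAGAS evaluation mentioned above is typically wired up as in the sketch below. The question, answer, and contexts are invented placeholders, the column names follow RAGAS 0.1-style datasets and may differ in other versions, and mapping the paper's "answer accuracy" to RAGAS's `answer_correctness` metric is an assumption.

```python
# pip install ragas datasets
# RAGAS needs a judge LLM to score metrics (an OpenAI API key by default).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, context_precision

# Invented placeholder sample illustrating the expected dataset shape.
sample = Dataset.from_dict({
    "question": ["How many credit hours is Data Structures in Semester 2?"],
    "answer": ["Data Structures carries 3 credit hours in Semester 2."],
    "contexts": [[
        "Term: Semester 2; Subject: Data Structures; Credit Hours: 3; Category: Core"
    ]],
    "ground_truth": ["Data Structures is a 3-credit-hour course in Semester 2."],
})

result = evaluate(sample, metrics=[answer_correctness, context_precision])
print(result)  # per-metric scores over the evaluation set
```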
- This work (including HTML and PDF Files) is licensed under a Creative Commons Attribution 4.0 International License.



















