Developing an Integrated Genomic Profile for Cancer Patients with the Use of NGS Data

Alexandra Kosvyra, C. Maramis, I. Chouvarda


Next Generation Sequencing (NGS) technologies have revolutionized genomics research by enabling high-throughput sequencing of genetic material through different assays, such as Whole Exome Sequencing (WES) and RNA Sequencing (RNAseq). Exploiting and integrating this wealth of heterogeneous sequencing data remains a major challenge. There is a clear need for approaches that process and combine these sources into an integrated patient profile that gives a more complete picture of a disease. This work introduces such an integrated profile, using Chronic Lymphocytic Leukemia (CLL) as the exemplary cancer type. The approach links the various NGS sources with the patients’ clinical data. The resulting profile summarizes the large-scale datasets, links the results with the patient’s clinical profile, and correlates indicators arising from the different data types. These indicators, combined with the clinical information, served as the feature pool for classification, and state-of-the-art machine learning techniques were used to build efficient predictive models. To ensure reproducibility, only open data were used in the classification assessment. The final goal is to design a complete genomic profile of a cancer patient. The profile includes a summary and visualization of the WES and RNAseq analysis results (specific variants and significantly expressed genes, respectively) and of the clinical profile, an integration and comparison of these results, and a prediction regarding the disease trajectory. In conclusion, this work produces a comprehensive clinico-genetic profile of a patient by integrating heterogeneous data sources. The proposed profile can contribute to medical research by providing new possibilities in personalized medicine and prognosis.
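
The idea of a shared feature pool can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_feature_vector`, the feature prefixes, the gene identifiers `GENE_A` and `GENE_B`, and the toy input values are all hypothetical and chosen only for illustration. The sketch assumes that each patient is described by a set of genes carrying a WES variant, a set of genes flagged by RNAseq analysis, and a small dictionary of numeric clinical values.

```python
from typing import Dict, List, Set


def build_feature_vector(
    variant_genes: Set[str],
    expressed_genes: Set[str],
    clinical: Dict[str, float],
    gene_panel: List[str],
) -> Dict[str, float]:
    """Merge WES, RNAseq and clinical indicators for one patient (illustrative)."""
    features: Dict[str, float] = {}
    for gene in gene_panel:
        has_variant = gene in variant_genes
        is_expressed = gene in expressed_genes
        features[f"wes_{gene}"] = float(has_variant)
        features[f"rna_{gene}"] = float(is_expressed)
        # Cross-source indicator: the gene is flagged by both assays.
        features[f"both_{gene}"] = float(has_variant and is_expressed)
    # Clinical values are appended under their own prefix.
    for name, value in clinical.items():
        features[f"clin_{name}"] = float(value)
    return features


# Toy, made-up inputs; they do not come from the study.
profile = build_feature_vector(
    variant_genes={"GENE_A"},
    expressed_genes={"GENE_A", "GENE_B"},
    clinical={"age": 63},
    gene_panel=["GENE_A", "GENE_B"],
)
```

One such dictionary per patient could then be stacked into a table and passed to a classifier of choice. Keeping the source in the feature name (`wes_`, `rna_`, `clin_`) makes it easy to trace which data type a selected feature came from.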


Bioinformatics; Sequencing Analysis; High-Throughput Sequencing; Data Mining.




DOI: 10.28991/esj-2019-01178



Copyright (c) 2019 Alexandra Kosvyra