Feature Transformation on Big Data for Species Classification in Machine Learning
Classification of bacterial species, particularly closely related taxa, remains a major challenge in areas such as public health and the food industry. The difficulty stems mainly from overlapping genetic features among organisms and from data complexity. This study introduces a bacterial taxonomic identification framework that integrates genome-derived motif sequences with machine learning. Two hundred and forty genome sequences from Salmonella enterica, representing six subspecies and ten serovars, were used for modelling. Sequence motifs were predicted from single-copy orthologous core genes of the downloaded genomes. Single nucleotide polymorphisms (SNPs) within these motifs were extracted and numerically encoded as machine learning features. The 20 most informative predictors identified by feature selection were used to train Random Forest and Support Vector Machine models. Across the analyses compared, the Random Forest model achieved the highest accuracy, 97.92%, demonstrating reliable differentiation of Salmonella at both the subspecies and serovar levels. This research presents two key innovations: i) the use of sequence motifs as molecular signatures for bacterial classification; ii) a novel feature engineering method that transforms genome-derived data into machine learning-readable features. The proposed framework offers a practical and scalable solution for fine-level bacterial classification and could potentially be applied to other microbial taxa.
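To make the feature engineering step more concrete, the following is a minimal sketch of how SNPs within aligned motif sequences might be extracted and numerically encoded. It is not the authors' implementation: the abstract does not state the encoding scheme, so the integer codes in `BASE_CODES`, the function names `snp_columns` and `encode_snps`, and the toy motif sequences are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical integer code per base; the source does not specify its encoding.
BASE_CODES: Dict[str, int] = {"A": 1, "C": 2, "G": 3, "T": 4}
UNKNOWN_CODE = 0  # used for gaps, N, or any other symbol


def snp_columns(aligned: List[str]) -> List[int]:
    """Return indices of alignment columns where more than one base occurs."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all motif sequences must be aligned to the same length")
    cols = []
    for i in range(length):
        # Only count recognised bases, so a gap alone does not create a SNP.
        bases = {seq[i].upper() for seq in aligned if seq[i].upper() in BASE_CODES}
        if len(bases) > 1:
            cols.append(i)
    return cols


def encode_snps(aligned: List[str]) -> Tuple[List[int], List[List[int]]]:
    """Build a genome-by-SNP feature matrix from aligned motif sequences."""
    cols = snp_columns(aligned)
    matrix = [
        [BASE_CODES.get(seq[i].upper(), UNKNOWN_CODE) for i in cols]
        for seq in aligned
    ]
    return cols, matrix


# Toy example: one motif aligned across four hypothetical genomes.
motifs = ["ACGTAC", "ACGTAC", "ATGTAA", "ATGTAC"]
positions, features = encode_snps(motifs)
print(positions)  # polymorphic column indices
print(features)   # one row of integer-coded SNPs per genome
```

Each row of `features` corresponds to one genome and each column to one polymorphic motif position, which is the shape a feature selector and a classifier such as a Random Forest would typically expect. Integer codes impose an arbitrary ordering on the bases; one-hot encoding is a common alternative when that ordering is a concern.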
This work (including HTML and PDF Files) is licensed under a Creative Commons Attribution 4.0 International License.