Feature Transformation on Big Data for Species Classification in Machine Learning

Keywords: Big Data, Bioinformatics, Feature Selection, Machine Learning, Sequence Motifs

Classification of bacterial species, particularly closely related taxa, remains a major challenge in areas such as public health and the food industry. The difficulty stems mainly from overlapping genetic features among organisms and from the complexity of the data. This study introduces a bacterial taxonomic identification framework that integrates genome-derived sequence motifs with machine learning. Two hundred and forty genome sequences of Salmonella enterica, representing six subspecies and ten serovars, were used for modelling. Sequence motifs were predicted from the single-copy orthologous core genes of the downloaded genomes. Single nucleotide polymorphisms (SNPs) within these motifs were extracted and numerically encoded as machine learning features. The 20 most informative predictors identified by feature selection were used to train Random Forest and Support Vector Machine models. Across the analyses compared, the Random Forest model achieved the highest accuracy, 97.92%, demonstrating reliable differentiation of Salmonella at both the subspecies and serovar levels. This research presents two key innovations: i) the use of sequence motifs as molecular signatures for bacterial classification; ii) a novel feature engineering method that transforms genome-derived data into machine learning-readable features. The proposed framework offers a practical and scalable solution for fine-level bacterial classification and has high potential for application to other microbial taxa.
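The feature engineering step described above (SNPs in motifs, numeric encoding, selection of the most informative predictors) can be illustrated with a minimal sketch. The abstract does not state the encoding scheme or the feature selection criterion, so everything here is an assumption for illustration: the ordinal base codes, the use of information gain as the ranking score, the function names, and the toy sequences and labels are all hypothetical and are not the authors' method.

```python
from collections import Counter
from math import log2

# Assumed ordinal code for each base; anything else (gap, N) maps to 0.
BASE_CODES = {"A": 1, "C": 2, "G": 3, "T": 4}


def snp_positions(aligned):
    """Return the column indices where the aligned sequences differ."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i in range(length) if len({seq[i] for seq in aligned}) > 1]


def encode_snps(aligned, positions):
    """Encode each sequence as a list of integers, one per SNP position."""
    return [[BASE_CODES.get(seq[i].upper(), 0) for i in positions]
            for seq in aligned]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def information_gain(column, labels):
    """Reduction in label entropy after splitting on one encoded SNP column."""
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g)
                    for g in groups.values())
    return entropy(labels) - remainder


def top_k_features(matrix, labels, k):
    """Rank feature columns by information gain; return the k best indices."""
    n_features = len(matrix[0])
    gains = [information_gain([row[j] for row in matrix], labels)
             for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: gains[j], reverse=True)[:k]


# Toy data: four aligned motif instances from two hypothetical groups.
motifs = ["ACGTAC", "ACGTAT", "ATGTAC", "ATGTAT"]
labels = ["groupA", "groupA", "groupB", "groupB"]

positions = snp_positions(motifs)           # variable columns only
features = encode_snps(motifs, positions)   # numeric feature matrix
best = top_k_features(features, labels, k=1)
```

In this toy case only the first SNP column separates the two groups, so it is the one retained. In a pipeline like the one described, the selected columns (20 in the study) would then be passed to a classifier such as a Random Forest or Support Vector Machine.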