Diagnosis of Covid-19 Via Patient Breath Data Using Artificial Intelligence

Using machine learning algorithms for the rapid detection and diagnosis of COVID-19 and isolating patients from crowded environments are very important for controlling the epidemic. This study aims to develop a point-of-care testing (POCT) system that can detect COVID-19 by detecting volatile organic compounds (VOCs) in a patient's exhaled breath using the Gradient Boosted Trees Learner Algorithm. 294 breath samples were collected from 142 patients at Istanbul Medipol Mega Hospital between December 2020 and March 2021. 84 of the 142 cases were negative and 58 were positive. All breath samples were converted into numeric values through five air sensors. 10% of the data were held out for validating the model. Of the remaining 90%, 75% were used for training an AI model to predict the presence of the coronavirus and 25% were used for testing. The SMOTE oversampling method was used to increase the training set size and reduce the imbalance between the negative and positive classes in the training and test data. Different machine learning algorithms were also tried for developing the e-nose model. The test results suggest that the Gradient Boosting algorithm created the best model. The Gradient Boosting model provides 95% recall when predicting COVID-19 positive patients and 96% accuracy when predicting COVID-19 negative patients.

Studies in the literature show that deep learning can be used effectively in the detection and diagnosis of COVID-19, particularly through radiology modalities [5][6][7]. The goal of this study is to develop a POCT system that can detect COVID-19 by accurately decomposing volatile organic compounds (VOCs) in a patient's breath. For this purpose, a hand-held electronic nose (e-nose) device was designed and built. The device contains a tube that patients blow into, and it can accurately detect the existence of a SARS-CoV-2 infection in just a few seconds. Operating the device does not require any special training, and it is designed to be used in public areas such as stadiums, airports, restaurants, and shopping malls.

2-1-E-Nose Structure
The e-noses built for this study employ five different gas sensors: MQ2, MQ3, MQ7, MQ8, and MQ135. Each sensor is sensitive to a different gas compound in human breath and detects the presence of the gas within a range of 0-1000 ppm (parts per million) [8].
The sensors used in the e-noses can be summarized as below:
• MQ2 is a combustible gas sensor. It has high sensitivity to LPG, propane, methane, and other combustible gases.
• MQ3 is a gas sensor suitable for alcohol, gasoline, CH4, hexane, LPG, and carbon monoxide detection.
• MQ7 is a sensor that is very sensitive to carbon monoxide.
• MQ8 is a sensor used for detecting high concentrations of hydrogen gas.
• MQ135 is an air quality sensor that can detect the presence of many gases, especially NH3, benzene, alcohol, and carbon dioxide [9].

2-2-Application
For this study, a handheld e-nose device was built for data collection and breath analysis, as shown in Figure 1. E-noses were used for coronavirus detection in 142 patient cases at Medipol Mega Hospital between December 2020 and March 2021. For each patient, two or all of the following methods were used for coronavirus testing:
• Breath analysis with e-noses in specialized cabins;
• Nasal and throat swabs;
• PCR tests.

Figure 1. E-nose device model
A total of 294 breath samples were collected from 142 patients; 84 cases were negative and 58 cases were positive. The collected data is stored in a database, which is then used for creating an accurate artificial intelligence (AI) model for disease detection. Figure 2 presents the workflow of the approach used in this study.

2-2-1-Data Preparation
While the PCR test result data based on nasal and throat swabs is binary (positive or negative), the sensors in the e-nose generate non-discrete numeric values. This allows the analysis of results from combinations of multiple sensors. For example, while results from the MQ2 sensor alone can have a strong impact on the coronavirus test results, a combination of MQ2 and MQ3 may provide a stronger association with the coronavirus test results. For this purpose, this study also considers the following metrics and their squares:
The resulting dataset used in this study has 27 attributes and 294 records. 90% of this data (264 records) is used for training an AI model to predict the presence of the coronavirus, and the remaining 10% (30 records) is used for validating the results. Of the 264 records, 75% were used for training the model and 25% were used for testing. Only after satisfactory test results were achieved was the model validated on the held-out 10% of the data. Figure 3 shows the details of the data preparation step.
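The two-stage split described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the helper name `split_indices`, the seeds, and the use of a plain random shuffle are assumptions, since the paper does not state how records were assigned to each subset.

```python
import random

def split_indices(n_records, first_fraction, seed=0):
    """Shuffle record indices and split them into two disjoint lists."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    cut = int(n_records * first_fraction)  # floor, so 90% of 294 gives 264
    return indices[:cut], indices[cut:]

# Stage 1: 294 records -> 264 for model development, 30 held out for validation
development, validation = split_indices(294, 0.90)

# Stage 2: 264 development records -> 198 for training, 66 for testing
train_pos, test_pos = split_indices(len(development), 0.75, seed=1)
train = [development[i] for i in train_pos]
test = [development[i] for i in test_pos]

print(len(train), len(test), len(validation))
```

Keeping the validation records out of both the training and the testing subsets is what allows them to serve as an independent final check.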
Standard classifiers give results biased toward the larger class when the dataset is unevenly distributed. The dataset used in this study is imbalanced with a 4:1 ratio, so standard classifiers may give incorrect results, and this issue must be addressed before training a model. Figure 4 shows two types of data sampling: oversampling and undersampling. In undersampling, the majority class (blue data points) is reduced to the same size as the minority class (red data points). In oversampling, the minority class (red data points) is increased to the same size as the majority class (blue data points) [10,11].
The next step in data preparation is to synthetically increase the training set size and reduce the imbalance between the negative and positive class sizes. An increased training set size offers more accurate results, and balancing the dataset reduces overfitting (i.e., learning majority cases only). For this purpose, the SMOTE (Synthetic Minority Oversampling Technique) oversampling method was used. SMOTE is a popular method that generates synthetic data for the minority classes. Because the majority of the collected breath data is COVID-negative (i.e., does not contain any trace of the SARS-CoV-2 virus), SMOTE helped balance the negative and positive cases in the dataset. SMOTE is a sampling algorithm built on the k-nearest neighbor (KNN) approach: for a minority sample, the algorithm selects its K nearest minority-class neighbors and generates synthetic data by combining the sample with one of these neighbors (Figure 5).
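The core SMOTE step can be illustrated with a short sketch. It assumes the commonly described form of SMOTE, in which a synthetic point is placed at a random position on the line segment between a minority sample and one of its k nearest minority neighbors. The function name `smote`, the two-value records, and the class sizes are hypothetical and are not taken from the study's data or tooling.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical two-sensor readings: 8 negative records vs 3 positive records
negatives = [(0.1 * i, 0.2 * i) for i in range(8)]
positives = [(5.0, 5.0), (6.0, 5.5), (5.5, 6.5)]
new_points = smote(positives, len(negatives) - len(positives), k=2)
balanced_positives = positives + new_points

print(len(negatives), len(balanced_positives))
```

Because every synthetic point lies between two real minority samples, the oversampled class stays inside the region already occupied by the minority data.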

Figure 5. Over Sampling Algorithms based on SMOTE [11]
After the balancing phase, the data was augmented to improve training, which increased the dataset size to 1,254 records. Table 1 shows the number of records per class in the dataset used in the study. After balancing, we trained a gradient boosting model to check its efficiency. The impact of balancing the dataset can be seen in the initial test results given in Table 2. For model evaluation, we used the recall and precision values derived from the confusion matrix.
The original dataset was first used to train an imbalanced model with the gradient boosting algorithm. This dataset was imbalanced with a ratio of 4:1, meaning that for every 4 negative cases there is 1 positive case. The imbalanced model provides 68% precision: out of 100 patients that it predicts as COVID-19 positive, 68 are actually positive (true positives) and 32 are actually negative (false positives). It also yields 70% recall, which means that out of 100 COVID-19 positive patients, it predicts 70 as positive (true positives) and 30 as negative (false negatives). Table 2 shows that precision, recall, and accuracy all improved greatly after balancing the dataset and retraining the model. The balanced model achieves 95% recall, 96% precision, and 96% accuracy.
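Using the standard confusion-matrix definitions, precision, recall, and accuracy can be computed as in the sketch below. The labels are hypothetical (1 for COVID-19 positive, 0 for negative) and do not reproduce the figures in Table 2.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification result."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn

def precision(tp, fp):
    return tp / (tp + fp)  # share of predicted positives that are truly positive

def recall(tp, fn):
    return tp / (tp + fn)  # share of actual positives that the model finds

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical labels: 10 positive and 10 negative patients
actual = [1] * 10 + [0] * 10
predicted = [1] * 7 + [0] * 3 + [1] * 2 + [0] * 8

tp, fp, fn, tn = confusion_counts(actual, predicted)
print(precision(tp, fp), recall(tp, fn), accuracy(tp, fp, fn, tn))
```

In this example the model misses 3 of the 10 positive patients (false negatives) and wrongly flags 2 of the 10 negative patients (false positives).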

2-2-2-Gradient Boosted Trees Learner Algorithm
This study uses the KNIME platform [12] to visualize data and create learning models. The learning algorithm chosen for this study is Gradient Boosted Trees. In the gradient boosted decision tree algorithm, the weak learners are decision trees, which on their own are prone to overfitting. To reduce this risk, a model combining multiple decision trees is used in this study. Random forests, by comparison, use a method called bagging to combine many decision trees into a single ensemble. At each iteration, a random forest picks a random subset of the features and creates a decision tree accordingly. The results of the decision trees are aggregated based on the voting principle. This random selection of features mitigates the overfitting problem that is present in individual decision trees [13].
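The gradient boosted trees learner algorithm can be described as follows. On a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ with features $x_i$ and targets $y_i$, the loss function $L$ is calculated from the squared residuals:
$$L(\mathrm{Obs}, \mathrm{Pred}) = \tfrac{1}{2}\,(\mathrm{Obs} - \mathrm{Pred})^2$$
where Obs and Pred denote the observed and predicted values, respectively. The function $L$ is differentiable:
$$\frac{\partial L}{\partial\, \mathrm{Pred}} = -(\mathrm{Obs} - \mathrm{Pred})$$
• Initialization: The algorithm chooses the best constant prediction by minimizing the $L$ function (squared residuals), $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$. Deriving the optimal value shows that the initial prediction is the average of the observed values of the samples.
• For each iteration $m = 1 \dots M$:
o Compute the residuals $r_{im} = y_i - F_{m-1}(x_i)$, which are the negative gradients of $L$ at the current predictions.
o Fit a regression tree to the residuals, which yields the leaves $R_{jm}$, $j = 1 \dots J_m$.
o For each leaf $j = 1 \dots J_m$, compute the output value $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)$ that minimizes the sum of squared residuals (SSR). This output value is predicted for all samples stored in that leaf.
o Make a new prediction for each sample by updating according to a learning rate $lr \in (0,1)$: $F_m(x) = F_{m-1}(x) + lr \sum_{j=1}^{J_m} \gamma_{jm}\, \mathbb{1}(x \in R_{jm})$. The new value is computed by summing the previous prediction and the scaled output value of the leaf into which the sample falls [14].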
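The procedure above can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the KNIME learner used in the study: it assumes a single numeric feature, depth-1 trees (stumps) as weak learners, and squared-residual loss. The function names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold split that minimises the SSR of the residuals."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # keeps both leaves non-empty
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_value = sum(left) / len(left)      # leaf output = mean residual
        right_value = sum(right) / len(right)
        ssr = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or ssr < best[0]:
            best = (ssr, threshold, left_value, right_value)
    return best[1:]

def predict(model, x):
    """Initial prediction plus the scaled leaf outputs of every stump."""
    initial, learning_rate, stumps = model
    value = initial
    for threshold, left_value, right_value in stumps:
        value += learning_rate * (left_value if x < threshold else right_value)
    return value

def fit_gradient_boosting(xs, ys, n_trees=50, learning_rate=0.1):
    initial = sum(ys) / len(ys)  # mean minimises the squared residuals
    model = (initial, learning_rate, [])
    for _ in range(n_trees):
        residuals = [y - predict(model, x) for x, y in zip(xs, ys)]
        model[2].append(fit_stump(xs, residuals))
    return model

# Hypothetical one-sensor readings with a 0/1 target
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]
model = fit_gradient_boosting(xs, ys)
print(round(predict(model, 2.0), 3), round(predict(model, 5.0), 3))
```

Each added stump corrects only a fraction (the learning rate) of the remaining residual, so the predictions move toward the targets gradually rather than in one step.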

3-Results and Discussion
In this study, besides the gradient boosting machine learning (ML) algorithm, other ML algorithms such as logistic regression, random forest, support vector machine, KNN, decision tree, and Naïve Bayes were also used for comparison. As seen in Table 3, gradient boosting surpasses all the others in terms of precision, recall, and accuracy. It is well known that, depending on the composition of the dataset, other algorithms may also achieve good predictions, but in our case gradient boosting provided the best results. Finally, the performance of Gradient Boosting was evaluated, as seen in Table 4. Statistics such as specificity, positive predictive value (PPV), and negative predictive value (NPV), together with the receiver operating characteristic (ROC) curve, suggest that the model can be used for predictions. That means the e-nose may be used in place of PCR tests.
Table 4 shows that the trained model can very accurately detect true negatives (specificity). In other words, a person who does not carry the SARS-CoV-2 virus has a less than 4% probability of being flagged as infected by the e-nose. The same table also shows that both PPV and NPV are higher than 85%, and accuracy is almost 93%. ROC curves for training and validation are given in Figures 6 and 7, respectively. The ROC curves are plotted against the PCR test results, and the area under the curve (AUC) shows that the trained models have the same separation level as the PCR tests.
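For reference, the diagnostic statistics named above can be computed from confusion-matrix counts and classifier scores as in the following sketch. The counts and scores are hypothetical and do not reproduce Table 4. The AUC is computed with the rank-based (Mann-Whitney) formulation, which is an assumption about method, since the paper does not state how its AUC was obtained.

```python
def diagnostic_stats(tp, fp, fn, tn):
    """Standard diagnostic statistics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # same quantity as recall
        "specificity": tn / (tn + fp),  # share of true negatives detected
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

def auc(scores_positive, scores_negative):
    """Probability that a random positive outscores a random negative."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(scores_positive) * len(scores_negative))

# Hypothetical counts for 30 validation records
stats = diagnostic_stats(tp=9, fp=1, fn=1, tn=19)

# Hypothetical model scores for positive and negative patients
area = auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2, 0.1])
print(stats, area)
```

Specificity describes how often truly negative patients are cleared, while NPV describes how often a negative result is correct; the two answer different questions and generally differ in value.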

3-1-E-Nose Device
For this study, a handheld device for breath analysis was built, as seen in Figure 8 (patent pending). The device is equipped with the gas sensors listed in Section 2-1. It allows patients to blow through a reusable silicone tube and generates a diagnostic output for COVID-19 in a few seconds. The e-nose device also has wireless and Bluetooth capabilities for easy data transfer to computers. In addition, newly trained models can be easily loaded onto the device through its communication port.

4-Conclusion
Gradient boosting is one of the popular artificial intelligence / machine learning algorithms used in the literature. In this study, we used a Gradient Boosting AI model on artificially balanced and augmented breath data taken from 142 patients through e-noses that were also designed and created for this study. All the patients who participated in the study showed symptoms of COVID-19 in one way or another.
The dataset was divided into two sets: 10% for validation and 90% for AI training. The AI training dataset was further divided into two portions: 75% for training and 25% for testing the results. The AI training data was balanced in terms of the class variable (i.e., COVID positive or negative) and augmented with the KNN algorithm. In fact, the study combines different algorithms, namely SMOTE, KNN, and Gradient Boosting, and analyzes their performance on the collected breath data. The class variable is derived from PCR test results; therefore, the study relies on PCR tests. Our results show that the e-noses can predict COVID-19 with 96% accuracy with respect to the PCR test results. In contrast to the PCR tests, which require at least a few hours to produce results, the e-noses can decide the COVID-19 outcome in a matter of seconds. Therefore, the e-noses can be used in place of the PCR tests as a quick and cheap alternative.

5-2-Data Availability Statement
The data presented in this study are available on request from the corresponding author.

5-3-Funding
The authors received no financial support for the research, authorship, and/or publication of this article.