Comparisons of SVM Kernels for Insurance Data Clustering

This paper studies insurance data clustering using Support Vector Machine (SVM) approaches. It investigates the optimum condition using the three most popular SVM kernels, i.e., the linear, polynomial, and radial basis function (RBF) kernels. To explore the sum insured datasets, the kernels are compared in terms of Root Mean Square Error (RMSE) and density analysis, and are then used to classify the sum insured data. The objective of this research is to demonstrate to industrial researchers that data grouping can be carried out in an organized, accurate, and efficient manner using R programming and the SVM approach. In this study, we first check the sum insured data with statistical methods, namely Model Performance Evaluation (MPE), Receiver Operating Characteristics (ROC), Area Under Curve (AUC), partial AUC (pAUC), smoothing, confidence intervals, and thresholds. The sum insured data are then classified using SVM kernels. The paper presents new ideas for evaluating insurance data using the SVM approach with multiple kernels, emphasizing statistical analysis methods for insurance data and using the SVM method for more accurate data classification. To the best of our knowledge, no previous research has addressed this subject. The research was conducted using sum insured data from the Office of the Insurance Commission (OIC) in Thailand, an independent insurance institution that provides actual data.

Insurance is highly significant from a statistical point of view because it can cover high health care costs. If someone is affected by a disaster or an infectious disease, paying for health care will be less burdensome [5][6][7]. The government's goal is to minimize the gap in access to the best health care among citizens, so citizens continue to be encouraged to take out insurance. James et al. (2013), Shah et al. (2017), Sritart et al. (2021) and Rojjananukulpong et al. (2021) [8][9][10][11] undertook research to improve the insurance system, aiming to develop a direct and context-specific technique for measuring inequality in remote areas of Thailand, in the hope of supporting more efficient use of mobile health clinics. Buchanna et al. (2022) and Marshall et al. (2021) [7,12] examined the utilization of urban and rural health services to identify the differences between the two. The biggest obstacle to emergency transport is Bangkok traffic, where cars do not usually clear the way for an ambulance. Although insurance provides peace of mind for Universal Health Coverage (UHC) cardholders, some conditions are still difficult to diagnose and require specialized treatment, which can often have devastating financial consequences [13,14]. Treatments such as major organ transplants and surgeries cost a great deal of money and require multiple long hospital stays. In such situations, patients should increase their coverage through primary and health insurance to obtain additional financial protection at an affordable cost. Given the rising cost of health care in Thailand, a medical emergency can quickly deplete one's savings. Most insurance companies quote six to ten times the annual salary as an appropriate amount for life insurance. Health insurance aims to provide financial protection in the event of illness and thereby protect savings [1,2,15].
Income inequality is a problem for individuals in every region of Thailand. At the same time, there is great demand to boost efficiency and streamline processes in the highly competitive insurance industry. Most carriers suffer from manual process stages, including repetitive duties such as reviewing a claim entry for completeness and seeking missing information, for example a police report from a traffic accident. Handling an underwriting request or claim requires matching the request with customer information, which is typically stored in legacy systems. Finding critical data, such as customer status and the related right to reimbursement, is frequently a barrier to end-to-end automation of the claims process. The convergence of policyholders, insurance personnel, and technology presents tremendous prospects for improving claims processing. This research offers an innovation in insurance data management using the Support Vector Machine (SVM) method. The study classifies sum insured data with the SVM method after first analyzing the data using statistical methods. Finally, we report which of the chosen kernels performs best. The research methodology is a literature study, namely, studying material from various related sources such as journals, books, and the internet. The research steps are: (1) collecting the sum insured data using Microsoft Excel; (2) analyzing the data with several statistical methods in R; (3) testing the sum insured data using the SVM method with linear, polynomial, and RBF kernels; (4) labeling the classification results as 0 for regional groups with a high sum insured contribution (pay more) and 1 for regional groups with a low sum insured contribution (pay less); and (5) identifying the most accurate kernel for classification with the SVM method, which turns out to be the RBF kernel. The flowchart in Figure 1 illustrates our research methodology.

Figure 1. Research steps in classification with SVM kernels
The primary motivation is to assess data quality by reducing errors in insurance data classification. The study offers new applications for categorizing insured data across numerous provinces in Thailand. The methods used for the analysis, namely Model Performance Evaluation (MPE), Receiver Operating Characteristics (ROC), Area Under Curve (AUC), partial AUC (pAUC), smoothing, confidence intervals, and thresholds, are explained in Yiengprugsawan et al. (2009) [16] and Puenpatom & Rosenman (2008) [17]. At the outset of the discussion, we provide comparative sum insured statistics from several different years. We then attempt to reclassify the provinces using the insurance amount from each province in Thailand. Bangkok is presumed to be an affluent location since residents of the capital have a comparatively high income. This insurance research uses datasets from Thailand's Insurance Commission, R programming to categorize the datasets in simulations, and the most commonly used approach, SVM [18][19][20]. The paper is organized as follows. Section 1 gives an introduction with basic facts about the subject, the motivation for the study, and the arrangement of the work. Section 2 validates the sum insured data by comparing statistics for the sum insured from different years. Section 3 simulates SVM using three kernels to find the optimal kernel. These three kernels were selected because they have distinctive qualities, suit approximately normally distributed data, and are widely used by academics in SVM analysis. Section 4 presents the classification results for the sum insured data in tabular form. A province with output 0 (poor) must pay a high premium, whereas a province with output 1 (good) pays a much lower premium. Finally, Section 5 gives conclusions and future work.

2-Validating Data
This section applies statistical analysis to the sum insured data used for SVM classification [6,8]. The data validation process runs from the model performance evaluation to the statistical comparison of the selected data samples. The study includes data from 2011, 2012, and 2014. Continuing our study of data from each area in Thailand, and using models from 2011, 2012, and 2014, we reproduce statistical comparisons between the sum insured data, as shown in Figure 2. R programming was used to run several data checks, with the data separated into four major groups before being validated using additional scientific computations. The R code for testing the statistical comparisons follows Puenpatom & Rosenman (2008) [17]. With the three kernels, only minimal errors were identified in the simulations. We conclude that 0 denotes a poor area, representing a high insurance cost, and 1 denotes a good area, i.e., an area with reduced insurance prices.
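Since the validation step relies on ROC and AUC, the following minimal sketch shows how an AUC value can be computed in base R through the rank-sum (Mann-Whitney) identity. It is an illustration only and not the code used in this study. The function name `auc_rank`, the vectors `scores` and `labels`, and the choice of level 1 as the positive class are assumptions made for the example.

```r
# Hedged sketch: AUC via the rank-sum (Mann-Whitney) identity, base R only.
# `auc_rank`, `scores`, and `labels` are hypothetical names, not from the paper.
auc_rank <- function(scores, labels) {
  stopifnot(length(scores) == length(labels), all(labels %in% c(0, 1)))
  r <- rank(scores)                 # average ranks, so ties are handled
  n_pos <- sum(labels == 1)         # level 1 is treated as the positive class
  n_neg <- sum(labels == 0)
  # Sum of positive ranks minus its minimum possible value,
  # divided by the number of positive/negative pairs
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Illustrative (made-up) classifier scores and true levels
scores <- c(0.10, 0.35, 0.40, 0.80, 0.65, 0.90)
labels <- c(0, 0, 1, 1, 0, 1)
auc_example <- auc_rank(scores, labels)
print(auc_example)
```

An AUC of 1 means every level-1 case is scored above every level-0 case, while 0.5 corresponds to random ordering.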

Figure 2. Statistical Comparison
The following sections examine how the SVM kernels and their R programs are applied to the sum insured data.

3-SVM Kernels
The SVM linear kernel, the SVM polynomial kernel, and the SVM RBF kernel are employed in the simulations. The simulations use the collected sum insured data for 2011, 2012, and 2014. For instance, the 2011 data comprise the total sum insured from every province in Thailand. Thailand is split into four regions: the central, the north, the northeast, and the south. The 77 provinces were grouped by geographical location, with 26 in the central region, 17 in the north, 20 in the northeast, and 14 in the south. The same technique is used to allocate the 2012 and 2014 data. By evaluating the data using SVM with three alternative kernels, we intend to discover the optimal kernel for categorizing insured data and to provide information for future study. We assume that level 0 denotes an area obliged to pay a higher sum insured, while level 1, considered better than level 0, denotes an area where a lower sum insured can be paid. The simulation begins by building a sum insured database in Microsoft Excel, which is then used in R for SVM classification with the caret package, covering the linear, polynomial, and RBF kernels. All numerical experiments were run on Windows 10 Pro with an Intel Core i5-7500 CPU, 8 GB RAM, 64 bits. The following is the R code for SVM linear testing:

    # SVM Linear
    library(caret)
    # 10-fold cross-validation, repeated 3 times
    trctrl1 <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
    # Linear-kernel SVM with centered and scaled predictors
    svm_Linear1 <- train(Lv ~ ., data = training1, method = "svmLinear",
                         trControl = trctrl1,
                         preProcess = c("center", "scale"))

Figure 3 shows the result for the sum insured data in Thailand, with 90% accuracy and the tuning parameter C, also known as the cost, which controls the penalty for misclassifications. It imposes a penalty on the model for making a mistake, i.e., the greater the value of C, the less likely the SVM linear kernel algorithm is to misclassify a training point [21]. In this scenario, caret generates an SVM linear classifier with C equal to 1.
Several C values can be supplied in order to find the value that optimizes the model's cross-validation accuracy. The following is the R code for SVM polynomial testing:

    set.seed(123)
    # Polynomial-kernel SVM tuned over the candidate values in grid1
    svm_Poly_Grid1 <- train(Lv ~ ., data = training1, method = "svmPoly",
                            trControl = trctrl1, tuneGrid = grid1,
                            preProcess = c("scale"))
    sel.poly
    svm_Poly_Grid1
    # Cross-validated accuracy across the tuning grid
    plot(svm_Poly_Grid1, lwd = 2, cex = 2.5, col = "red", bg = "black", pch = 21)

Thus, SVM classifiers using non-linear kernels, i.e., polynomial or radial basis functions, can be produced using the code above. The caret package can quickly compute both the polynomial and the radial basis function non-linear SVM models (Figure 4).
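To make the difference between the three kernels concrete, the sketch below writes them as plain base R functions. The parameterization (degree, scale, and offset for the polynomial kernel, and sigma for the RBF kernel) follows what we understand to be the convention of the kernlab package that caret calls for these models; this is an assumption, and the function names `k_linear`, `k_poly`, and `k_rbf` are hypothetical.

```r
# Hedged sketch of the three kernel functions in base R.
# Function names are hypothetical; the parameterization is assumed to match
# kernlab's vanilladot, polydot, and rbfdot kernels.
k_linear <- function(x, y) sum(x * y)

k_poly <- function(x, y, degree = 1, scale = 1, offset = 1) {
  (scale * sum(x * y) + offset)^degree
}

k_rbf <- function(x, y, sigma = 1) {
  exp(-sigma * sum((x - y)^2))
}

# Two illustrative feature vectors (made-up values)
x <- c(1, 2)
y <- c(3, 4)

k_linear(x, y)                                   # inner product
k_poly(x, y, degree = 2, scale = 1, offset = 1)  # squared shifted inner product
k_rbf(x, y, sigma = 0.5)                         # decays with squared distance
```

With degree equal to 1, the polynomial kernel is only a scaled and shifted version of the linear kernel, which is why the two models can behave similarly when tuning selects degree 1.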

Figure 4. SVM Polynomial Kernel
The caret package automatically selects the model tuning parameters that optimize model accuracy. In these trials, the scales are 2, 4, 8, 10 with degrees 1, 2, 3, 4, and the final values after executing the program are degree equal to 1, scale equal to 0.1, and C equal to 0.25, with 92% accuracy [21]. The R code for SVM RBF testing is presented as follows:

    # Cross-validated accuracy across the tuning grid
    plot(svm_Radial_Grid1, lwd = 2, cex = 2.5, col = "red", bg = "black", pch = 21)
    # Best tuning parameters found
    svm_Radial_Grid1$bestTune
    # Predict the level of the held-out test data
    test_pred_grid1 <- predict(svm_Radial_Grid1, newdata = testing1)
    test_pred_grid1
    # Compare predictions with the true levels
    confusionMatrix(table(test_pred_grid1, testing1$Lv))

The caret package is used throughout the SVM analysis with the different kernels. The program appears efficient, since it categorizes the data in a way that reduces the error and improves the accuracy.
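The tuning grids themselves are not listed in the text. As a hedged illustration, a grid for the polynomial kernel could be built with `expand.grid` from the scale and degree values quoted above. The C values, the whole RBF grid, and the object names `grid_poly` and `grid_rbf` are assumptions made for this sketch. The column names are those we understand caret to expect for the "svmPoly" and "svmRadial" methods.

```r
# Hedged sketch: candidate tuning grids built in base R.
# `grid_poly` and `grid_rbf` are hypothetical names.
# The C values and all RBF values are assumptions for illustration.
grid_poly <- expand.grid(
  degree = c(1, 2, 3, 4),   # degrees quoted in the text
  scale  = c(2, 4, 8, 10),  # scales quoted in the text
  C      = c(0.25, 0.5, 1)  # assumed cost values
)

grid_rbf <- expand.grid(
  sigma = c(0.1, 0.25, 0.5),  # assumed kernel widths
  C     = 2^(-2:4)            # assumed cost values from 0.25 to 16
)

nrow(grid_poly)  # number of polynomial candidates to cross-validate
nrow(grid_rbf)   # number of RBF candidates to cross-validate
head(grid_poly)
```

Each row is one parameter combination that would be evaluated by cross-validation, so the grid size multiplies directly into training time.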
Figure 5 shows that the best model is obtained with the tuning parameter sigma equal to 0.2714456 and C = 16, giving 96% accuracy. Histograms and stem-and-leaf plots both help to provide an overview of the central tendency and symmetry of observational data. The code for RMSE testing is presented as follows:

    # Summary of the resampling results for the three kernels
    summary(resamps)
    cols <- list(col = c("blue", "green", "red"), pch = c(2, 4, 8), lwd = 2)
    # Box-and-whisker plot of RMSE for each kernel
    bwplot(resamps, metric = "RMSE",
           par.settings = list(plot.symbol = cols, box.rectangle = cols,
                               box.dot = cols, box.umbrella = cols))

Box-and-whisker plots, also known as box plots, are another graphical presentation that can summarize more specific information about the distribution of observed data values. As the name implies, a box-and-whisker plot consists of a box and whiskers. A box plot of the Root Mean Square Error (RMSE) for each SVM kernel is shown in Figure 6.
A box plot graphically summarizes a sample distribution and can characterize the shape of the data distribution (skewness), its central tendency, and the spread (diversity) of the observational data. The results for Mean Absolute Error (MAE), RMSE, and R-squared are obtained from Figure 6.
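For reference, the three metrics named above can be written as short base R functions. This is a minimal sketch on made-up values, not the study's data. The names `rmse`, `mae`, `r_squared`, `obs`, and `pred` are hypothetical, and R-squared is taken here as the squared correlation between observed and predicted values, which is an assumption about how it was computed.

```r
# Hedged sketch of the error metrics in base R (hypothetical names and data).
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))  # root mean square error
mae  <- function(obs, pred) mean(abs(obs - pred))       # mean absolute error
# R-squared as squared Pearson correlation (an assumed definition)
r_squared <- function(obs, pred) cor(obs, pred)^2

# Made-up observed levels and predicted scores
obs  <- c(1, 0, 1, 1, 0)
pred <- c(0.9, 0.2, 0.7, 0.6, 0.1)

rmse_value <- rmse(obs, pred)
mae_value  <- mae(obs, pred)
r2_value   <- r_squared(obs, pred)
print(c(RMSE = rmse_value, MAE = mae_value, Rsquared = r2_value))
```

RMSE penalizes large errors more strongly than MAE because the errors are squared before averaging, so RMSE is never smaller than MAE on the same data.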

Figure 6. BoxPlot of RMSE for each of the Kernels
The code for checking the kernel comparisons is presented as follows:

    # RMSE vs. Density
    densityplot(resamps, metric = "RMSE",
                par.settings = list(superpose.line = list(col = c("blue", "green", "red"), lwd = 2)),
                plot.points = FALSE, auto.key = TRUE)

The RMSE density simulation is then studied using Table 1 and the comparison between kernels shown in Figure 7. The comparison of the three kernels in Figure 7, based on RMSE and density, indicates that the radial kernel is superior to the other kernels and that the SVM RBF kernel is the simulation approach with the highest accuracy.

4-Insurance Data Clustering
Classification using SVM with multiple related kernels is performed because SVM classification is a well-suited method for in-depth research that aims to reduce errors with current data. Accordingly, all of the data verification carried out through the simulations serves to ensure care and correctness. Further analysis may be performed by introducing premium data or modifying the categorization algorithm. The significant findings of this investigation are presented in Table 2. Insurance firms can use the findings of this categorization to determine which provinces pay more or less of the sum insured. Based on Table 2, for the Chiang Mai (0) area, residents whose original domicile is Chiang Mai must make high insurance payments. Meanwhile, residents with Lampang (1) identity cards pay low insurance premiums. The average used to decide whether an area pays high or low contributions is flexible and is set by the insurance company that issues the policy for the sum insured.
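The labeling rule described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the province names and sum insured figures are invented placeholders, the threshold is taken to be the overall mean, and provinces above the threshold are assigned level 0 (pay more) while the rest are assigned level 1 (pay less). An insurer could substitute any other threshold.

```r
# Hedged sketch: assigning level 0 / level 1 from a flexible threshold.
# Province names and sum insured values are invented placeholders.
sum_insured <- c(ProvinceA = 120, ProvinceB = 45, ProvinceC = 80, ProvinceD = 30)

# Assumed rule: the overall mean is the cut-off chosen by the insurer
threshold <- mean(sum_insured)

# Level 0 = above the threshold (pay more), level 1 = otherwise (pay less)
level <- ifelse(sum_insured > threshold, 0L, 1L)

# Factor response, analogous to the Lv column used in the SVM models
Lv <- factor(level, levels = c(0, 1))
print(data.frame(sum_insured, Lv))
```

Changing `threshold` shifts provinces between the two levels, which is the flexibility the text attributes to the issuing insurance company.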

5-Conclusion
Errors arose when the sum insured in a specific location was not calculated accurately. In this case, the SVM method provided a more accurate and effective alternative for classifying the sum insured, following a thorough examination of all three prominent SVM kernels. We conclude that future research using the SVM approach on such data should use the radial kernel. The essential point was that the radial kernel had a low RMSE and a high density, so the output of the data analysis was close to the actual situation in each zone. Table 2 gave the results of the data analysis for each region in Thailand. In the payment of the sum insured premium, whether a person paid more or less was based on the address on that person's identity card. Even if a person traveled to work in another area of Thailand, the contribution was still decided by the original address of residence. For example, a student from Pattani (Thailand), an area at level 1, would have small insurance coverage. If the student moved to Bangkok, an area at level 0, to study, the sum insured charged would remain at level 1 because the original address on the identity card was in Pattani. However, if the student changed residential address permanently, the sum insured charged would be at level 0, in line with the Bangkok area. This new study offered a detailed and accurate classification of the insurance data, motivated by the remaining imperfections in the classification of data in Thailand. By presenting these findings, it sought new methods that are more effective and efficient for data clustering.
Future work could extend this study by comparing the SVM method with other classification methods or by replacing the sum insured data with other insurance data, such as an analysis of the insurance stock market.

6-2-Data Availability Statement
The data were obtained from the Office of the Insurance Commission (OIC) in Thailand and are available on request from the corresponding author.