A Systematic Review on Emotion Recognition System Using Physiological Signals: Data Acquisition and Methodology

Emotion recognition systems (ERS) have become a popular research field that contributes to human-machine interaction in different areas. Different kinds of ERS applications can serve different purposes. Artificial intelligence (AI) and the internet of things (IoT) are the technologies behind such applications. The main objective of this study is to enable researchers and developers to search for the most suitable options to develop an emotional state recognition system. More specifically, this paper presents work on ERS built using physiological signals extracted from biosensors. It also presents details of how the extracted physiological signals are used to identify the user's emotional state. In this review, the sensors are categorized based on their modality: contact-based sensors and contactless sensors. Next, the ERS process is presented together with the reported results for each described technique. Articles from four different research databases were reviewed, of which 147 articles from 2009 to 2021 related to ERS using physiological signals were referred to. This paper should be significant for researchers developing systems that integrate human emotion recognition capability. The findings reported here can guide them in choosing suitable methods for their systems.

2-3-Limitations of the Search
Only a limited number of databases are available, and even within these, only the abstracts of many research papers are accessible due to subscription issues. Therefore, only open-access articles and articles from subscribed sources could be properly analysed. This is the main challenge faced in this study.

2-4-Common Limitations of Existing Systematic Reviews
Most systematic reviews focused on a single emotion induction method (such as music [10]) or a single signal retrieval option (such as EEG [11,12]). Hence, the existing works did not compare different index tests.

3-Emotion Models
RQ1 is answered in this section, where a detailed description of the emotional models commonly used among researchers is presented. The emotional states must be distinct and assessed quantitatively for effective emotion recognition. The primary emotions were defined decades ago by psychologists. Researchers have widely adopted two models of emotions: a) discrete and b) multi-dimensional emotional state.
In order to build an ERS, emotional data need to be generated: a group of participants is usually presented with a sequence of emotionally evocative materials to induce their emotional state. These materials can be pictures, movies, music, situational simulation, computer games, or recollection. The collected data are later used to train and build an ERS model.

3-1-Discrete Emotion Models
The discrete emotion models provide fundamental emotion theories as stated by Ekman (1992) [14], Izard (2007) [15], and Plutchik (2007) [16]. No one emotional state is more essential than another state. Two or more emotional states might be triggered together in some situations. For instance, the emotional state of admiration and joy can be activated as a person falls in love, whereas sadness can be felt once a person is hurt, and hurt might trigger anger. Hence, it is possible to get a mixture of emotional feelings.
In 1992, Ekman [14] stated six standard emotional states (i.e., happy, sad, anger, fear, surprise, and disgust). He also described other emotional states as responses to and mixtures of these standard emotional states. In 1980, Plutchik [16] proposed a wheel model as a discrete emotion model. The wheel of emotions has the primary emotional states of anger, anticipation, joy, trust, sadness, surprise, fear, and disgust (Figure 2). It also represents weaker and stronger intensities of emotion: the center holds the more vigorous intensities, and the outer petals hold the weaker feelings.

Figure 2. Plutchik's Wheel of Emotions [17]
Izard (2007) [15] also defined a set of basic emotions. According to the author, basic emotions are part of the course of human evolution, and every essential emotional state is related to simple brain activity. Moreover, he categorized ten primary emotional states (i.e., interest, joy, surprise, sadness, fear, shyness, guilt, anger, disgust, and contempt). Plutchik (2001) [16] stated 24 pairs, i.e., feelings that combine two different emotional states. The pairs on the wheel of emotions fall into four classes, as follows:
Primary pair: e.g., Alarm = Fear + Surprise (one petal apart)
Secondary pair: e.g., Envy = Sadness + Anger (two petals apart)
Tertiary pair: e.g., Delight = Joy + Surprise (three petals apart)
Opposite pair: e.g., Conflict = Distraction + Interest (opposite petals)
Another similar emotion model, based on Robert Plutchik's model, is the Hourglass of Emotions (shown in Figure 3). The Plutchik and Hourglass models use a similar method and differ in the number of variables; there are more axes along which the emotion is modeled. In addition, the authors explicitly assigned specific labels to certain regions, providing some compatibility with categorical models. These models are more advanced in terms of complexity and the ability to express complex emotions. Despite their complexity, the mentioned models do not take into account the context and personal typology of the reader or author.

3-2-Multidimensional Emotion Space Model
An emotional state may have different intensities; for instance, sadness can be at different levels, such as very sad or moderately sad. Therefore, psychologists proposed multi-dimensional emotion space models to differentiate the levels of emotions. Cambria (2012) [18] suggested that these emotional levels can be described in a 2D space with two dimensions, i.e., valence and arousal. In this theory, negative valence indicates an unpleasant feeling and positive valence a pleasant one, while low arousal indicates a passive state and high arousal an active one. For example, in Russell's Circumplex Model (presented in Figure 4), 'happy' has positive valence and high arousal, whereas 'sad' has negative valence and low arousal. Although positive and negative emotional states can be easily differentiated using Russell's Circumplex Model, it fails to distinguish some individual emotional states. To overcome this issue, a 3D model has been proposed (see Figure 5). Dominance is the additional axis; it ranges from submissive to dominant. Dominance helps pin down the specific emotional state and makes the outcome more exact and clear. For example, anger and fear, which both fall in the distress quadrant of the 2D model, can now be easily differentiated along the dominance axis, where anger is dominant and fear is submissive.
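The role of the dominance axis can be illustrated with a minimal sketch. It assumes that valence, arousal, and dominance are each scaled to [-1, 1] with 0 as the neutral point; the function names and quadrant labels are hypothetical and are not taken from the reviewed works.

```python
def quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) pair to one of four 2D quadrants.

    Assumption: both values lie in [-1, 1] and 0 separates the halves.
    """
    if valence >= 0:
        return "pleasant-active" if arousal >= 0 else "pleasant-passive"
    return "unpleasant-active" if arousal >= 0 else "unpleasant-passive"


def split_distress(valence: float, arousal: float, dominance: float) -> str:
    """Use dominance to separate anger from fear in the distress quadrant.

    Outside the unpleasant-active quadrant the 2D label is returned unchanged.
    """
    label = quadrant(valence, arousal)
    if label != "unpleasant-active":
        return label
    return "anger" if dominance >= 0 else "fear"


# Same valence and arousal, different dominance.
print(split_distress(-0.7, 0.8, 0.6))   # dominant side of the axis
print(split_distress(-0.7, 0.8, -0.6))  # submissive side of the axis
```

The two calls share the same 2D coordinates and differ only in the third axis, which is the situation the 3D model is meant to resolve.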

4-Modality of Physiological Based Emotions Recognition System
In this section, RQ2 is answered. Physiological measurements are produced by the central and autonomic nervous systems. These measurements can be obtained by positioning the biosensors as indicated in Figure 6.

4-1-Electroencephalogram (EEG)
An EEG signal arises from electrical activity in the brain and can be extracted using electrodes attached to the scalp. It is one of the best inputs for detecting emotions with high precision. The standard methods use 8, 16, or 32 pairs of electrodes, positioned with reference to four specific places on the head (i.e., nasion, inion, and right and left preauricular points) and covering the frontal, temporal, parietal, and occipital lobes (as shown in Figure 7-a) [23]. These electrodes are typically attached to the scalp using headsets or electrically conductive adhesive [24]. An example of electrode placement on the scalp for recording EEG signals, proposed by Heng et al. (2020) [25], is presented in Figure 7-b; the signals were obtained and represented in a graph. However, EEG-based systems are best suited to clinical use, as it is very time-consuming to set up the electrodes and to overcome the signal's sensitivity to noise.

4-2-Electrocardiogram (ECG)
Electrocardiogram (ECG) signals are frequently used for the evaluation of heart functionality. Like EEG, ECG measures electrical activity, in this case that of the heart, which arises from its muscular contraction and relaxation. The signal is obtained using electrodes placed on a subject's body [27]. The heart is a vital organ in the human body, and a person's feelings influence its rhythm. Therefore, ECG signals can be beneficial for ERS [28].
There are various methods to obtain ECG; one standard method is the 12-lead ECG technique. Typically, ten electrodes are used to record 12-lead ECG signals. The ten electrodes are placed as limb and chest electrodes, as shown in Figure 8-a. Here, the four limb sensors are placed on flat areas of the inner lower legs and inner lower forearms, or of the inner upper arms and inner upper thighs, or of the inner upper arms and the abdomen. Figure 8-b shows the placement of the sensors on the chest area as presented by Tada et al. (2015) [29].

4-3-Electromyogram (EMG)
Electromyography (EMG) is a method for evaluating and recording the electrical potential generated by muscle cells. These signals can be used for various purposes in different areas. In the ERS case, the signals can be used to infer perceived emotions from physiological reactions. Ekman and Friesen first stated the connection between actions, muscles, and emotions in 1978. The selection of muscles depends on the type of analysis; the most commonly used muscles are the corrugator supercilii, occipitofrontalis, levator labii superioris, orbicularis oculi, and zygomaticus major [31][32][33]. Figure 9 shows the electrode placement used to capture emotional expression with EMG. The responses from four selected muscles are recorded using Ag/AgCl miniature surface electrodes with electrolyte gel. Here, emotional expression can be measured through the activity of the zygomaticus major for smiling, which is connected to happiness; the corrugator supercilii for wrinkling of the forehead, which is related to fear, sadness, anger, etc.; the levator labii superioris for movement of the upper lip, which is connected to feeling worried, nervous, upset, etc.; and the lateral frontalis for eyebrow raising, which is related to fear, surprise, etc. Other possible EMG measurements for emotional expression are presented in Table 2.

4-4-Heart Rate Variability (HRV)
Heart rate (HR) is the average number of heartbeats per minute, whereas heart rate variability (HRV) records the individual variations in time (or variability) between successive heartbeats. HRV measurement provides data about heart variability, which can be used to predict emotional states. HRV data can be examined in terms of the timing of every heartbeat cycle and its regularity [37]. An unusual heartbeat can be due to changes in emotional level. The sympathetic and parasympathetic branches of the autonomic nervous system regulate variability in HR: the parasympathetic nerves slow the heartbeat, and the sympathetic nerves accelerate it. Variations in emotional levels, such as anxiety and physical pressure, influence HR [38,39]. A low HRV could indicate relaxation, whereas a high HRV indicates frustration.
Furthermore, HRV measurements are also dependent on age, gender, physical condition, mental stress, eating habits, weight, blood pressure, and glucose level. Inherited genetic factors also affect HRV.
The standard method to measure HRV is by using an ECG [37], which reads the primary electrobiological signal of heart activity. HRV can be calculated using the RR intervals of the ECG signal. Photoplethysmography (PPG) is another method to obtain HRV readings. PPG can be obtained by placing the probes on the brachial artery, radial or ulnar artery, and tibial artery for both hemibodies [40]. Two sensors are used to obtain PPG: the first sensor (emitter) emits light onto the skin, and the second sensor (detector) detects the light reflected back from the arterial pulse wave. The difference between the light emitted and the light reflected is due to the person's blood volume (BV). Changes in BV are produced by capillary dilation and constriction and can be used to estimate HR. Moreover, the inter-beat interval (IBI) from the PPG signal is calculated to obtain the HRV reading. The electrode/probe placement to obtain HRV is presented in Figure 10-a. Figure 10-b illustrates ECG and PPG signals. In Figure 11-a, a more detailed operation of PPG sensors is presented. In the image, when the reflection is from narrow arteries, the pressure is lower and the reflection is greater, giving the diastolic value. If the reflection is from wider arteries, more light is absorbed, which indicates higher pressure and gives the systolic value (shown in Figure 10).
Remote photoplethysmography (rPPG) makes it possible to measure the cardiovascular pulse wave via a contactless approach by recording different back-scattered light levels remotely using ambient light and vision systems [41]. The rPPG measurements read human cardiac activity using a video camera. The process is similar to PPG, but it is contactless. Figure 11-b shows rPPG obtaining readings from different light channels (i.e., red, green, blue) reflected from the skin. This approach was used by Benezeth et al. (2018) [39] to detect emotions without any device touching the person.
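As an illustration of how HRV can be derived from RR or inter-beat intervals, the following minimal sketch computes three common time-domain quantities. The choice of SDNN and RMSSD as metrics, the function name, and the example interval values are assumptions made here for illustration; the reviewed works may use other HRV measures.

```python
import numpy as np


def hrv_time_domain(rr_ms):
    """Compute simple time-domain HRV metrics from RR (or IBI) intervals.

    rr_ms: sequence of successive beat-to-beat intervals in milliseconds.
    Returns mean heart rate (bpm), SDNN (ms) and RMSSD (ms).
    """
    rr = np.asarray(rr_ms, dtype=float)
    successive_diffs = np.diff(rr)
    return {
        # 60000 ms per minute divided by the mean interval length
        "mean_hr_bpm": 60000.0 / rr.mean(),
        # sample standard deviation of all intervals
        "sdnn_ms": rr.std(ddof=1),
        # root mean square of successive differences
        "rmssd_ms": np.sqrt(np.mean(successive_diffs ** 2)),
    }


# Hypothetical intervals, for illustration only.
rr_example = [800, 810, 790, 805, 795]
metrics = hrv_time_domain(rr_example)
print(metrics)
```

The same function applies to IBI values taken from a PPG or rPPG signal, since only the sequence of interval lengths is needed.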

4-5-Electrooculography (EOG)
EOG is one of the standard eye movement measurement techniques. It works by assessing corneo-retinal polarity: the cornea, at the front of the human eye, has a positive polarity, while the retina, at the back of the eye, has a negative polarity. Furman and Wuyts (2012) [23] presented a primary implementation of EOG for ophthalmological diagnosis. For measuring eye movement, electrode pairs are typically placed either to the left and right of the eye (horizontal) or above and below the eye (vertical), as shown in Figure 12-a. When the eyeball moves from the centre towards one of the electrodes, a voltage spike is produced (Figure 12-b) [42]. The basic concept of using EOG for ERS is to detect eye blinking, which is beneficial for detecting emotional states such as surprise or stress [43]. Besides emotions, EOG is also helpful in detecting fatigue, concentration, and drowsiness [44]. EOG extraction can be via contact or contactless methods. Contact measurements can be implemented with instruments similar to those used for EMG signal retrieval. On the other hand, the contactless technique can be accomplished using video oculography (VOG) camera systems or infrared oculography (IROG) cameras [42].

4-6-Galvanic Skin Response (GSR)/ Electrodermal Activity (EDA) / Skin Conductance (SC)
GSR/EDA/SC is a continuous raw measurement of the electrical properties of the human skin. Here, skin conductance is taken as the main factor: the sweat response produces different amounts of salt on the skin and consequently alters the skin's electrical resistance [45]. Sweat causes moisture on the skin's surface and shifts the balance of positive and negative ions at the electrodes [47]. Sweat is produced by the activation of the sweat glands. It is an unconscious human reaction [9] and reflects changes in the sympathetic nervous system [48]. Some emotional responses cause sweat reactions, mostly on the palms, fingers, and soles of the feet. The authors of [49] tested emotional sweating on 13 parts of the body and found that attaching sensors to the fingers, feet, forehead, shoulders, neck, and chest gives the highest SC responsiveness to changes in emotional state. These positions for obtaining SC measurements to recognize emotional states are indicated in Figure 13.

4-7-Skin Temperature (SKT)
SKT is one of the best parameters for automatic ERS. SKT is the unconscious reaction of a human related to SC and also HR. SKT measurement is determined by the thermal radiation of the skin's surface, and it is a valuable indicator of emotional states, which is reflected by the Autonomic Nervous System (ANS) activity.
Similar to SC, SKT is obtained from palmar and plantar surfaces, as well as sites such as the nose tip and fingertip, as it is influenced by endothelial cells. Basically, blood vessels near the skin's surface raise the temperature when the person is relaxing and lower it in a state of stress or anxiety [50].
Contactless measurement is possible using the electromagnetic radiation emitted from the skin's surface, which allows SKT to be measured from a distance. However, there are challenges related to the sensor's accuracy and its coverage. A contactless system was developed by Kosonogov et al. (2017) [50] to obtain SKT measurements. The measurement is obtained using a FLIR infrared thermal imaging camera; an example output is presented in Figure 14. SKT data can also be obtained using contact-based SKT sensors on the finger. Ayata et al. (2017) [51] used a fingertip temperature (FTT) sensor for emotional state monitoring and recognition. When the person is relaxed, the fingertip is warmer than in a tense emotional state, because in a relaxed mood the vessels are dilated.

4-8-Respiration (RSP)/ Respiratory Belt (RB)
An RSP biosensor or RB is commonly a stretchable band made of latex rubber; it is used for recording human breathing activity. The RSP sensor is usually worn over the abdomen. The stretch of the elastic band is measured as a change in voltage level. The standard recordings are RSP rate and breathing depth. In Figure 15, a pressure sensor (EMFit) is shown, where the sensor is attached to the skin using a belt. Here, changes in ribcage volume due to respiration compress the attached sensor [52].
RSP measurement is closely related to other cardiac measures; a deep breath can affect RSP, EMG, and SC measurements. The RSP rate typically decreases while relaxing, whereas stress results in momentary RSP interruption. In terms of emotion level, negative emotions contribute to abnormal RSP patterns.
However, irregular RSP cycles can also be caused by talking [53]; hence, it is essential to keep this in mind while classifying RSP signals. Other methods to obtain RSP are measuring the carbon dioxide (CO2) content of inhaled and exhaled air, known as capnography, or measuring chest cavity expansion [53]. Moreover, it is also possible to obtain RSP from EMG data acquired from the respiratory muscles [54].

4-9-Discussion
There are many different ways to obtain data from various sensors and methods, and it is essential to select the best approach. The selection depends on the requirements of the user and the application area of the ERS. Here, the basic categorization clarifies the standard physiological sensors used for ERS. Figure 16 groups the ERS measurement techniques, which helps in selecting the appropriate measurement sensors for ERS. This selection is based on unconscious data collection. In the conscious process, emotion identification is made using self-evaluation questionnaires, which is less reliable, as the participant may not identify the real emotions or may provide inaccurate answers due to an inability to understand the questions. The unconscious process, however, offers a wide variety of measurement parameters. These physiological signals are further divided into electrical and non-electrical categories. The electrical parameters can be regarded as the main signals, which provide the most precise outcome, whereas the non-electrical measurements capture body responses driven by electrical signals. The electrical parameter measurements consist of two structures: a) direct (self-generating) sensors (such as EEG, ECG, HRV, EMG, EOG) and b) modulating sensors (such as GSR). Moreover, the non-electrical signals can be obtained using contact and contactless approaches such as the thermal camera, rPPG, IROG, and VOG.
An engineer who builds an ERS for advanced driver-assistance systems might be interested in a non-invasive approach; hence, contactless sensors such as rPPG and video oculography systems are better suited to the project. In contrast, accuracy is essential when building an ERS-based healthcare system, so in that case direct electrical sensors such as EEG or ECG are the better option.

5-Emotion Recognition System (ERS)
In this section, methods for physiological signal-based ERS (RQ3-5) are discussed. The system can be trained in two different ways. The most common way is to use traditional machine learning (ML) models, and the other option is deep learning (DL). The conventional method requires feature extraction and feature selection techniques, whereas DL does not [55]: a DL model learns features directly from the data by design, so it can be trained without a separate feature extraction step. The overall ERS process is presented in Figure 17.

5-1-Pre-Processing
Preprocessing of physiological signals is essential for both traditional and DL approaches; in this step, noise due to crosstalk, electromagnetic interference, etc., is eliminated.
Baseline wander is one of the typical noise types present in ECG and PPG recordings, usually caused by motion, respiration, skin-electrode impedance, and so on. To deal with such noisy data, the moving-average preprocessing technique can, for instance, be applied to smooth the signal. Another standard method is the wavelet transform (WT), which has also been used to eliminate baseline wander and to detect wave characteristics [56]. The most common forms of filtering are the low-pass filter [57,58] and the high-pass filter [59]. There are several other filtering methods, such as independent component analysis (ICA), empirical mode decomposition (EMD), and discrete wavelet transform (DWT). EMD, like WT, is used for filtering and noise cancellation; it is an entirely signal-dependent and adaptive approach suitable for real-time applications [60]. Lahmiri and Boukadoum (2015) [61] stated that DWT is more effective than Fourier-based filtering for removing noise because of its multi-scale approximation (MSA). ICA, on the other hand, can decompose multichannel signals, which are linear mixtures of different sources, into independent components [62].
In general, abnormal data due to noise cannot be avoided during signal collection, and raw signals normally contain interference. Removing the artifact components through visual observation requires some expertise. Different approaches (filtering, DWT, ICA, EMD) are needed for different bio-signals and different sources of interference in order to remove noise according to its frequency-domain and time-domain characteristics.
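The moving-average technique mentioned above can be sketched as follows. This is a minimal illustration and not the pipeline of any cited work: the function names, the window length, and the synthetic signal are assumptions, and the slow trend estimated by a long moving average is simply subtracted to suppress baseline wander.

```python
import numpy as np


def moving_average(signal, window):
    """Smooth a 1-D signal with a centred moving average.

    'window' should be odd so that the average is centred on each sample.
    Samples within window // 2 of either end are affected by zero padding.
    """
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")


def remove_baseline(signal, window):
    """Subtract the slow trend (long moving average) from the signal."""
    signal = np.asarray(signal, dtype=float)
    return signal - moving_average(signal, window)


# Synthetic example: a 5 Hz component riding on a slow linear drift.
fs = 250                                  # assumed sampling rate in Hz
t = np.arange(0, 4.0, 1.0 / fs)
drift = 0.5 * t                           # stands in for baseline wander
clean = np.sin(2 * np.pi * 5 * t)
corrected = remove_baseline(clean + drift, window=251)
print(np.abs(corrected[300:700] - clean[300:700]).max())
```

A short window smooths high-frequency noise, whereas a window spanning many cycles of the signal of interest follows only the drift; the window length therefore plays the role of the filter cut-off.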

5-2-Traditional Machine Learning Method Flow
There are three main steps in the ML, i.e., feature extraction, feature selection, and emotion classification. In this section, RQ3 and RQ4 are answered.

5-2-1-Feature Extraction
Feature extraction is a significant step in an ERS. Several feature extraction methods have been popular among ERS researchers. For example, in Hindarto and Sumarno (2016) [63], the Fast Fourier Transform (FFT) was used to extract features from EEG signals; the FFT loses only a small amount of signal information during the transformation. However, without enough samples, it cannot resolve enough frequencies.
The time-frequency distribution (TFD) method obtains features in the time-frequency plane under a stationarity principle. The features contain both time-domain and frequency-domain information from the provided signal. However, the main issue with TFD is that it requires noiseless signals to perform effectively, so the raw data must be pre-processed correctly to remove the noise. Another issue is that its performance is slow because of the gradient ascent computation [64].
Meanwhile, eigenvalue and eigenvector methods [65] are another option for feature extraction. The eigenvector method only obtains frequency information from sinusoidal signals. Unlike TFD, it can process signals buried in noise by using an eigen decomposition of the correlation matrix. The disadvantage of this technique is that it might produce false zeros when the Pisarenko approach is used with the lowest eigenvalue.
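A minimal sketch of FFT-based feature extraction is given below. It computes the power in a frequency band of a single-channel signal; the band limits (8-13 Hz and 13-30 Hz), the sampling rate, and the function name are assumptions chosen for illustration and are not taken from [63].

```python
import numpy as np


def band_power(signal, fs, low_hz, high_hz):
    """Sum the FFT power of 'signal' over the band [low_hz, high_hz)."""
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)   # bin spacing is fs / n
    power = np.abs(spectrum) ** 2 / n
    in_band = (freqs >= low_hz) & (freqs < high_hz)
    return power[in_band].sum()


# Synthetic test signal: a pure 10 Hz sinusoid, 2 s at 128 Hz (256 samples).
fs = 128
t = np.arange(256) / fs
x = np.sin(2 * np.pi * 10 * t)

alpha = band_power(x, fs, 8, 13)    # contains the 10 Hz component
beta = band_power(x, fs, 13, 30)    # should be close to zero
print(alpha, beta)
```

The spacing between frequency bins is fs / n, so a shorter recording gives coarser bins; this is the sense in which too few samples limit the frequencies that can be resolved.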
WT and phase space reconstruction have also been used for feature extraction. WT is known to work better for sudden and transient changes in the signal. It also works well for irregular data patterns. It can be used to extract both time and frequency features from the data with linear features. However, it is challenging to select a proper mother wavelet for extracting valuable features. WT methods can be categorized into continuous wavelet transform (CWT) and discrete wavelet transform (DWT). The disadvantage of CWT is that the scaling and translation parameters change continuously. Another WT approach is the dual-tree complex wavelet transform (DT-CWT) [66]. DT-CWT uses both real and imaginary tree wavelet filters to obtain complex shifted and dilated mother wavelets. Unlike DWT, DT-CWT has the prominent characteristics of approximate shift invariance and better anti-aliasing [67].
The auto-regressive (AR) method [64] is used to obtain frequency features from signals with a sharp spectrum. AR limits spectral leakage and improves frequency resolution. It performs well even when only short data segments are provided, as the AR spectrum does not depend on the length of the data segment. However, it is challenging to estimate the model order: if the order is selected wrongly, the AR model gives a poor spectral estimate.
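To make the DWT idea concrete, the sketch below implements a single decomposition level with the Haar wavelet, the simplest mother wavelet. The Haar choice and the function names are assumptions made for illustration; the cited works may use other wavelets and several decomposition levels.

```python
import numpy as np


def haar_dwt(signal):
    """One level of the Haar DWT.

    Returns (approximation, detail) coefficients, each half the input length.
    The approximation captures the slow trend, the detail captures changes
    between neighbouring samples.
    """
    x = np.asarray(signal, dtype=float)
    if len(x) % 2:
        raise ValueError("signal length must be even")
    pairs = x.reshape(-1, 2)
    approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)
    return approx, detail


def haar_idwt(approx, detail):
    """Invert one level of the Haar DWT."""
    approx = np.asarray(approx, dtype=float)
    detail = np.asarray(detail, dtype=float)
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2)
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x


x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 0.0, 2.0])
a, d = haar_dwt(x)
print(a, d)
print(haar_idwt(a, d))   # reconstructs x
```

Applying `haar_dwt` again to the approximation coefficients gives the next, coarser level, which is how a multi-level decomposition is built up; simple statistics of the coefficients at each level can then serve as features.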
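One common way to fit an AR model is the Yule-Walker method, sketched below. Using Yule-Walker here is an assumption, since the text does not say which estimator the cited works use; the function name and the test signal are also hypothetical. The model convention is x[n] = a_1 x[n-1] + ... + a_p x[n-p] + e[n], and the estimated coefficients can be used as features.

```python
import numpy as np


def yule_walker(signal, order):
    """Estimate AR coefficients of the given order by the Yule-Walker method.

    Returns (coeffs, noise_var), where coeffs[k-1] multiplies x[n-k].
    """
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Biased autocorrelation estimates r[0], ..., r[order]
    r = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(order + 1)])
    # Toeplitz autocorrelation matrix R with R[i, j] = r[|i - j|]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    coeffs = np.linalg.solve(R, r[1:])
    noise_var = r[0] - np.dot(coeffs, r[1:])
    return coeffs, noise_var


# A pure sinusoid obeys x[n] = 2*cos(w)*x[n-1] - x[n-2], so an order-2 fit
# should recover coefficients close to (2*cos(w), -1).
w = 2 * np.pi * 0.1
samples = np.sin(w * np.arange(20000))
coeffs, noise_var = yule_walker(samples, order=2)
print(coeffs)
```

The `order` argument is the model order discussed above: too low an order cannot represent all spectral peaks, while too high an order may introduce spurious ones.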
There are other feature extraction methods available. EEG feature extraction using differential entropy (DE) is discussed in Chen et al. (2019) [11]. Empirical mode decomposition (EMD) can be used to obtain different time and frequency data in the form of a series of intrinsic mode functions (IMFs) [68]. Researchers have used it to extract features from physiological signals for emotional state recognition, such as ECG [69] and EEG [68]. However, it also suffers from unavoidable limitations in some domains. The issues stated in [70] are the frequency resolution stopping criterion, complementary ensemble empirical mode decomposition, and the influence of the sampling frequency. There is no specific proven rule for setting the stopping criterion threshold for the frequency resolution [70]. Deep belief networks (DBNs) have also been used for feature extraction [72]. The main benefit of extracting features using a DBN is that it is an unsupervised process and can handle large amounts of unlabeled data [73]. Moreover, DBNs can calculate the output weights of the necessary variables by integrating an approximation of the inference procedure. However, DBNs have a few limitations: the inference procedure is restricted to a bottom-up pass, and the greedy layer-wise training learns the features of a single layer at a time and does not readjust them with the rest of the layers [74].
The extracted features vary with the type of physiological signal and the feature extraction method used. Some of the features are presented in Table 3 according to the physiological signal.

5-2-2-Feature Selection & Reduction
Once features are extracted from the raw data, it is essential to find the high-quality, informative features, which might be correlated with each other, and to remove the features that are unrelated. Useless features may cause the following issues:
• They can cause overfitting and weak outcomes in data analytics, which leads to low prediction accuracy.
• Processing data with useless features might take a long time.
Hence, feature optimization is a necessity. The popular techniques for feature selection are reliefF, linear discriminant analysis (LDA), principal component analysis (PCA), and kernel-PCA.
The reliefF algorithm is a filter method that weights each feature according to its relevance to each class. For example, Gómez-Lara et al. (2019) [84] used a relief algorithm to filter features from EEG signals according to their relevance to each class. Another study showed that feature selection on EEG signal data with the reliefF algorithm could improve classification accuracy [85]. The reliefF algorithm can also be used to select features from ECG signals [86].
LDA and PCA are traditional linear techniques and are, technically, feature reduction methods. Zhang et al. (2018) [56] recommended feature reduction rather than feature selection. The authors of [87] applied PCA and reported an accuracy of 100% for ECG data input for all levels of emotional states and dictionaries. Kernel PCA uses a global nonlinear approach. However, with feature reduction techniques, some important information might be lost in the reduction process.
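The following minimal sketch shows PCA as a feature reduction step, implemented with a singular value decomposition. The function name and the toy data are assumptions for illustration. The explained-variance ratio it returns indicates how much information is retained, which relates to the caveat above that reduction may discard useful information.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project a (samples x features) matrix onto its top principal components.

    Returns (projected, explained_ratio), where explained_ratio[i] is the
    fraction of total variance carried by component i.
    """
    X = np.asarray(features, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return X_centered @ components.T, explained_ratio[:n_components]


# Toy data: the second feature is exactly twice the first, so one component
# carries all of the variance.
t = np.arange(10.0)
X = np.column_stack([t, 2 * t])
Z, ratio = pca_reduce(X, n_components=1)
print(Z.shape, ratio)
```

In practice, the number of components is often chosen so that the cumulative explained-variance ratio exceeds a chosen threshold.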
Stepwise regression (SW) consists of regressing multiple variables while removing the least contributing predictors step by step. Only independent variables with non-zero coefficients are included in the final regression model. The Akaike information criterion (AIC) is used in SW as the stopping criterion. However, SW has a few significant issues, as presented by Smith (2018) [88]. One issue is that a local optimum found by adding or removing parameters one by one is not guaranteed to be a global optimum. Another problem is that it follows automatic rules based on statistical correlation values without considering whether or not the result is sensible.
Genetic algorithms (GAs) have also been applied to feature selection. In a GA, the selection is based on natural biological evolution: in nature, organisms have evolved over generations to better adapt to their environment. GAs can be used to maximize the performance of a predictive model on an unseen data set. GAs need a population of individuals and several generations to produce better approximations, depending on mutation and crossover probability parameters. At each generation, according to a fitness criterion, a new set of individuals, i.e., subsets of predictors, is created and recombined using operators borrowed from natural genetics. As a GA performs global optimization, it is susceptible to over-fitting, mainly when it is combined with distance-based classifiers [89].
Random forest (RF) is a well-known machine learning algorithm that includes a feature selection function based on the Gini value. Resampling methods such as cross-validation and bootstrap are helpful for feature selection during model building. These methods can maximize the model's performance but increase the computational cost. RF-recursive feature elimination (RF-RFE) provides a reliable assessment of predictors and presents a ranked set of the best predictors at the end.
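The Gini value mentioned above can be illustrated with a minimal sketch. It computes the Gini impurity of a set of class labels and the impurity decrease obtained by splitting on one feature at a threshold. This covers only a single split; a random forest aggregates such decreases over many trees. The function names and the toy data are assumptions for illustration.

```python
import numpy as np


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_c^2) of a label array (0.0 if empty)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def gini_decrease(feature, labels, threshold):
    """Impurity decrease from splitting 'labels' at feature <= threshold."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    weighted = (len(left) * gini_impurity(left)
                + len(right) * gini_impurity(right)) / len(labels)
    return gini_impurity(labels) - weighted


labels = np.array([0, 0, 1, 1])
informative = np.array([1.0, 2.0, 3.0, 4.0])     # separates the classes
uninformative = np.array([1.0, 3.0, 2.0, 4.0])   # does not
print(gini_decrease(informative, labels, 2.5))
print(gini_decrease(uninformative, labels, 2.5))
```

A feature that yields larger impurity decreases is ranked as more important, which is the basis on which low-ranked features can be eliminated.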

5-2-3-Machine Learning Algorithms for Emotion Classification
After the useful features have been identified through feature selection, the ML model must be trained using the selected input data, and the outcome is the class of each data item; this is known as classification. Multiple classification techniques are popular in ERS, namely probabilistic neural network (PNN) [87], linear discriminant analysis (LDA) [28], recurrent neural network (RNN) [66], quadratic discriminant analysis (QDA) [66], K-nearest neighbor (KNN) [66], random forest (RF) [90], and support vector machine (SVM) [91].

Artificial Neural Network (ANN)
ANN-based classifiers are commonly used in this field. A multilayer perceptron (MLP) is a class of feedforward artificial neural networks (FANN). The term MLP is used ambiguously, sometimes loosely for any feedforward ANN and sometimes strictly for networks composed of multiple layers of perceptrons (with threshold activation). As an introductory presentation of MLP, Figure 18 shows a network with one hidden layer, which is connected to every node of the input and output layers. The hidden layer of the ANN is calculated using Equation 1, and the output layer is calculated using Equation 2.
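Equations 1 and 2 are not reproduced in this text. A conventional form that is consistent with the symbols defined in the next paragraph would be the following, where the input-to-hidden weights are written $w_{ij}$ (an assumed reading of the source's notation), the activation functions $f$ and $g$ are assumptions, and bias terms are omitted:

```latex
\begin{align}
  h_j &= f\Big(\sum_{i} w_{ij}\, x_i\Big) \tag{1} \\
  y_k &= g\Big(\sum_{j} w_{kj}\, h_j\Big) \tag{2}
\end{align}
```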
Here, $w_{ij}$ and $w_{kj}$ are weight vectors. The input data vector is 'x', the output data vector is 'y', and the hidden data vector is 'h'. The position in the input vector is 'i', the position of the hidden layer node is 'j', and the position in the output vector is 'k'.
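The forward pass of the one-hidden-layer MLP described above can be sketched as follows. This is a minimal illustration under stated assumptions: each hidden value is taken as an activation of the weighted sum of the inputs, each output value as an activation of the weighted sum of the hidden values, the activation is assumed to be a sigmoid, and bias terms are omitted. The function names and example weights are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic activation (an assumed choice; the text does not fix one)."""
    return 1.0 / (1.0 + math.exp(-z))


def mlp_forward(x, w_hidden, w_output):
    """Forward pass through one hidden layer.

    w_hidden[j][i] is the weight from input position i to hidden node j.
    w_output[k][j] is the weight from hidden node j to output position k.
    """
    # Hidden layer: every hidden node sees every input value.
    h = [sigmoid(sum(w * x_i for w, x_i in zip(row, x))) for row in w_hidden]
    # Output layer: every output node sees every hidden value.
    y = [sigmoid(sum(w * h_j for w, h_j in zip(row, h))) for row in w_output]
    return h, y


# Example with 3 inputs, 2 hidden nodes, and 1 output (arbitrary weights).
x = [0.5, -1.0, 2.0]
w_hidden = [[0.1, 0.4, -0.2], [-0.3, 0.2, 0.5]]
w_output = [[0.7, -0.6]]
hidden, output = mlp_forward(x, w_hidden, w_output)
print(hidden, output)
```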

Figure 18. MLP
MLP can tolerate missing data and still be trained correctly. It is also fault tolerant; thus, the corruption of a single node does not prevent correct output calculation. However, it has a few limitations, such as relying on lag observations. Moreover, because it learns a static mapping function with fixed inputs and outputs, the mapping between input and output must be defined correctly to get good performance. Another variant of the ANN classifier is the recurrent neural network (RNN). In RNN, the current hidden state is calculated from the previous hidden state and the current input data. For a clearer picture of RNN, an unfolded view is presented in Figure 19. The basic formula of RNN is presented in Equation 3.
where $h_t$ represents the hidden state at time-step $t$, $x_t$ is the current input data, and $\theta$ is the final parameter of the function, which encapsulates the weights and biases of the network (for example, $W_y$ and $W_x$).

Figure 19. RNN Unfold
Here, $W_y$ and $W_x$ are the weight vectors for the output and hidden layers, respectively, and $W$ is the weight vector applied across the different time-steps $t$. The input word vector is 'x' and the output word vector is 'y'. The activation function is calculated using Equation 4, and the output is calculated using Equation 5.
RNN can take input sequences of any length. The trained model can retain information over time, which makes it a valuable model for time-series data. It shares weights across the time-steps, so even if the training dataset is large, the model size does not grow. An improved version of RNN, namely simple recurrent units (SRU), was implemented by Wei et al. (2020) [66] for EEG-based ERS, and SRU was reported to perform better than MB and SVM. However, RNN has a few drawbacks. Due to the recurrent processing, the complete computation process is slow. It also suffers from the exploding and vanishing gradient problems.
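The recurrence and weight sharing described above can be sketched with scalar states. This minimal example assumes the common form in which the hidden state is $\tanh(W h_{t-1} + W_x x_t)$ and the output is $W_y h_t$; the exact form of the original Equations 4 and 5 may differ, and the function name and weight values are hypothetical.

```python
import math


def rnn_forward(inputs, w_h, w_x, w_y, h0=0.0):
    """Run a scalar RNN over a sequence and return (outputs, final hidden state).

    The same three weights are reused at every time-step, so the number of
    parameters does not depend on the sequence length.
    """
    h = h0
    outputs = []
    for x_t in inputs:
        # The new hidden state depends on the previous state and the current input.
        h = math.tanh(w_h * h + w_x * x_t)
        # The output at this time-step is read from the hidden state.
        outputs.append(w_y * h)
    return outputs, h


sequence = [0.2, -0.1, 0.4, 0.0]
outputs, last_hidden = rnn_forward(sequence, w_h=0.5, w_x=1.0, w_y=2.0)
print(outputs, last_hidden)
```

Because the hidden state must be computed step by step, the loop cannot be parallelized across time-steps, which is the source of the slow computation mentioned above.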

Support Vector Machine (SVM)
SVM performs classification by finding the optimal hyperplane that separates the data into two classes with maximum margin and no interior points. The optimal hyperplane is obtained using Equation 6.
Here, $b$ is the intercept (bias) term of the hyperplane equation. The distance between the hyperplane presented in Equation 6 and a given point vector, expressed through the feature map $\phi(\cdot)$, can be calculated using Equation 7.
Here, $\lVert w \rVert_2$ is the Euclidean norm of the weight vector $w$ of length $l$, and it is calculated using Equation 8.
SVM is a common choice among researchers due to its effectiveness and memory efficiency. SVM is particularly effective in high-dimensional spaces. In Elsayyad et al. (2017) [86], SVM is used for classifying ECG signals. Another work that used SVM is by Domínguez-Jiménez et al. (2020) [91]. They compared classifiers using mean accuracy and ROC and concluded that SVM with the linear kernel (SVML) provided good performance for identifying sadness and amusement. Nevertheless, SVM has a few downsides, such as the lack of native multi-class classification, the cost of SVM optimization, and difficulties with probability estimates and parameter selection. These issues were addressed by Chang and Lin (2011) [92], whose improved version was also used for emotion recognition [93].
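The point-to-hyperplane distance described above can be sketched as follows, assuming the standard form in which the hyperplane is $w^{T}\phi(x) + b = 0$ and the distance is $|w^{T}\phi(x) + b| / \lVert w \rVert_2$. For simplicity the feature map is assumed to be the identity (a linear kernel), and the function names and numbers are illustrative.

```python
import math


def euclidean_norm(w):
    """Square root of the sum of squared components of w."""
    return math.sqrt(sum(w_i * w_i for w_i in w))


def decision_value(w, b, x):
    """Signed value of w . x + b; its sign gives the predicted class."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from point x to the hyperplane w . x + b = 0."""
    return abs(decision_value(w, b, x)) / euclidean_norm(w)


w, b = [3.0, 4.0], -5.0
point = [3.0, 4.0]
print(decision_value(w, b, point), distance_to_hyperplane(w, b, point))
```

The margin that SVM maximizes is this distance evaluated at the training points closest to the hyperplane.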

Random Forest (RF)
For each decision tree, the importance of a node is calculated using Gini Importance, as presented in Equation 9, assuming that each node has only two child nodes (binary tree). Here, each node m is weighted by the number of samples that reach it, and an impurity value is calculated for node m; left(m) and right(m) are the child nodes from the left and right splits on node m, respectively.
The splits involving a given predictor variable are then averaged over all trees in the forest, which can also be used to average splits on variables contained in a group. If a dataset T is split into two subsets, T1 and T2, containing N1 and N2 samples, respectively, then the Gini index of the split for T is defined in Equation 10.
The split with the smallest Gini(T) is selected to split the node, as it has the lowest impurity. The Gini index of a node that contains a single class is 0. As mentioned in the previous section, RF calculates the importance of a node using Gini Importance. Peker et al. (2015) [85] compared RF with feedforward neural network (FFNN), SVM, NB, the C4.5 decision tree algorithm, and the radial basis function (RBF) network of ANN and found that RF provides the best classification accuracy. The main reason RF performs well is that its learning process reduces the overfitting issue of decision trees. It works for both classification and regression and for both categorical and continuous data. It also copes with the data quality issue of missing data. However, processing a large number of trees requires more computational power and longer processing time.
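The split selection described above can be sketched as follows, assuming the usual definitions in which the Gini index of a node is one minus the sum of squared class proportions and the Gini index of a split is the size-weighted average of its two child nodes. The function names, feature values, and labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one node: 0 when the node holds a single class."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Impurity of a binary split, weighted by the size of each child node."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Return (threshold, split impurity) with the smallest split Gini index."""
    best = None
    # Every distinct value except the largest is a candidate threshold.
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = gini_split(left, right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


feature = [1, 2, 3, 4]
emotion = ["calm", "calm", "stress", "stress"]
print(best_threshold(feature, emotion))
```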

Naive Bayes (NB)
Bayes' theorem provides a way of calculating the posterior probability P(x|f) from P(x), P(f), and P(f|x) using Equation 11.
Here, 'x' is the class and 'f' is the set of features representing the class. P(x) is the prior probability of the class 'x' occurring independently. The prior probability of the features 'f' is P(f). P(x|f) is the posterior probability of the class 'x' occurring given 'f', whereas P(f|x) is the likelihood, which is the probability of the features occurring given class 'x'.
NB is a well-known classification algorithm. It is commonly used for text classification, spam filtering, multiclass prediction, etc. However, the main issue with NB is that it treats the input features as independent of each other, which limits the use of the learning algorithm in some cases. The algorithm also assigns zero probability to feature values in the test data that were not seen during training. Hence, it is essential to apply a smoothing technique when training the model.
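The posterior calculation and the zero-probability issue described above can be sketched with a small categorical NB classifier. This is an illustrative sketch only: it assumes Laplace (add-alpha) smoothing as the smoothing technique, and the function names, feature values, and labels are hypothetical.

```python
from collections import Counter, defaultdict


def train_nb(samples, labels, alpha=1.0):
    """Count classes and per-class feature values for a categorical NB model."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    vocab = [set() for _ in samples[0]]  # distinct values seen per feature
    for features, label in zip(samples, labels):
        for i, value in enumerate(features):
            value_counts[(label, i)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab, alpha


def posterior(model, features):
    """Return P(x|f) for every class x, treating the features as independent."""
    class_counts, value_counts, vocab, alpha = model
    total = sum(class_counts.values())
    joint = {}
    for label, count in class_counts.items():
        p = count / total  # prior P(x)
        for i, value in enumerate(features):
            # Smoothed likelihood P(f_i|x): never zero, even for unseen values.
            seen = value_counts[(label, i)][value]
            p *= (seen + alpha) / (count + alpha * len(vocab[i]))
        joint[label] = p
    evidence = sum(joint.values())  # P(f), used to normalize the posteriors
    return {label: p / evidence for label, p in joint.items()}


samples = [("high", "fast"), ("high", "slow"), ("low", "slow"), ("low", "slow")]
labels = ["stress", "stress", "calm", "calm"]
model = train_nb(samples, labels)
print(posterior(model, ("high", "fast")))
```

Without smoothing (alpha set to 0), the class "calm" would receive a probability of exactly zero for this input because it was never observed together with the value "high".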

Adaboost
Adaboost works as a boosting algorithm for decision trees in binary classification cases. It combines a group of weak classifiers to achieve reasonable accuracy. Each training instance is assigned a weight to boost the learning. Initially, the weight (w) of each instance is set using Equation 12.
where $x_i$ is the input data at position $i$, and $n$ is the size of the data used for training.
Next, a weak classifier will be trained using the weighted samples by using the training dataset. Then the misclassification rate is obtained using Equation 13.
where the correct count is the total number of outputs correctly predicted by the trained model.

Equation 13 is then modified with the weighting values of the training inputs, giving Equation 14. Here, each input value i has a calculated weight and a prediction error, which is 1 if the input is misclassified and 0 if it is correctly classified.
Another necessary calculation is to obtain a stage value; this can be done using Equation 15. It provides the weight of any predictions made by the model.
Here, ln is the natural logarithm, and the error term is the misclassification rate calculated by Equation 14.
Adaboost classification is known to be very simple to implement. However, the algorithm only supports binary classification. Adaboost is also not tolerant of outliers and noisy data.
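One boosting round, as described above, can be sketched as follows. The sketch assumes uniform initial weights of 1/n, a weighted misclassification rate equal to the sum of weight times error divided by the sum of weights, and a stage value of ln((1 - error) / error). The weight update step is not given in the text and is an additional assumption, as are the function names and the example predictions. Some formulations of Adaboost scale the stage value by 1/2.

```python
import math


def initial_weights(n):
    """Every training instance starts with the same weight, 1/n."""
    return [1.0 / n] * n


def weighted_error(weights, y_true, y_pred):
    """Weighted misclassification rate of one weak classifier."""
    # The per-instance error is 1 if misclassified and 0 if correctly classified.
    errors = [0 if pred == true else 1 for pred, true in zip(y_pred, y_true)]
    return sum(w * e for w, e in zip(weights, errors)) / sum(weights)


def stage_value(error):
    """Weight given to the weak classifier's predictions (needs 0 < error < 1)."""
    return math.log((1.0 - error) / error)


def update_weights(weights, y_true, y_pred, stage):
    """Assumed update: raise the weight of misclassified instances, then normalize."""
    raised = [w * math.exp(stage * (0 if pred == true else 1))
              for w, pred, true in zip(weights, y_pred, y_true)]
    total = sum(raised)
    return [w / total for w in raised]


y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]  # hypothetical weak classifier: last instance is wrong
weights = initial_weights(len(y_true))
error = weighted_error(weights, y_true, y_pred)
stage = stage_value(error)
new_weights = update_weights(weights, y_true, y_pred, stage)
print(error, stage, new_weights)
```

The misclassified instance ends up with a larger weight than the others, so the next weak classifier is pushed to focus on it; this is also why outliers and noisy labels can dominate training.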

5-3-Deep Learning Algorithms for Emotion Classification
In this section, RQ5 is answered. Five deep learning algorithms that are used in ERS are discussed.

5-3-1-Convolutional Neural Network (CNN)
CNN is a class of DL algorithms and a type of FFNN. Its architecture uses shared weights, which gives it translation invariance characteristics. CNN is well established in many different domains, and it is also gaining popularity for classifying the physiological input signals (such as ECG and EEG) of ERS. Martinez et al. (2013) [94] used CNN for classifying mental states (i.e., excitement, relaxation, fun, and anxiety) using SC and BV pulse signals. In another study, different statistical features were obtained from the benchmark dataset DEAP and passed to a CNN model for emotional state classification [95]. Song et al. (2020) [96] combined CNN with the ability to dynamically learn a new system, proposing dynamical graph convolutional neural networks (DGCNN). The dynamical architecture helps to learn the connections between various EEG channels. The EEG channel data can be represented as an adjacency matrix, from which different features are generated and classified on the SEED dataset. In [97], a deep CNN (DCNN) was implemented on a dataset of bio-signals (ECG and GSR) from the AMIGOS database for emotional state detection by associating the ECG and GSR signal data with arousal and valence levels. Al Machot et al. (2019) [98] implemented a human ERS by proposing a CNN and showed that the system performs effectively for both subject-dependent and subject-independent human ER using the DEAP and MAHNOB datasets. They worked explicitly on detecting an individual's stress. CNN's main issue is that it requires a large dataset to obtain an effective trained model, and its hyper-parameter tuning is non-trivial [99].

5-3-2-Deep Belief Network (DBN)
The deep belief network (DBN) is another DL technique, built from simpler restricted Boltzmann machine (RBM) models. DBN learns deep representations of the input features through pre-training, extracting deep features from the input data gradually. It therefore does not require a separate feature extraction step: DBN itself is capable of extracting high-level features from different types of data. Kawde & Verma (2017) [55] skipped the feature extraction step and fed four biosignals directly into a DBN. These data (i.e., EMG, EEG, GSR, and EOG) were obtained from the DEAP dataset. They achieved greater than 70% accuracy for valence and arousal.

5-3-3-Probabilistic Neural Network (PNN)
Probabilistic neural network (PNN) is an FFNN-based algorithm that follows the Bayesian approach. PNN is known for its fast training capability due to its ANN structure. It has also been shown to give more accurate classification performance with greater noise tolerance, owing to its insensitivity to outliers. In [102], PNN was applied in an EEG-based ERS, and its performance was evaluated using three datasets: MAHNOB, DEAP, and data from a mobile EEG sensor. They used only PNN, feeding it the feature vector for emotional state classification using GSR and ECG signals. However, PNN takes relatively more time to classify new cases and requires more storage for the trained model.

5-3-4-Long Short Term Memory (LSTM)
Long short-term memory (LSTM) is another deep learning architecture, based on the artificial recurrent neural network (RNN). LSTM contains feedback connections and can handle the vanishing gradient issue of RNN. LSTM can process both single data points and sequences of data. Hence, it can process a sequence of data obtained from physiological signals, as implemented by Wöllmer et al. (2013) [103]. Li et al. (2016) [104] presented a framework with LSTM as the classifier and EEG signals as input data and found that the classifier performed well and provided accurate output at each prediction. Another study extracted rational asymmetry (RASM) features from EEG signals, trained an LSTM to explore EEG signal correlations, and obtained 76.67% classification accuracy [105]. Xing et al. (2019) [106] proposed a framework that used EEG signals and trained an LSTM-RNN classifier by integrating context relations between the feature sequences, achieving enhanced performance. In similar work, Alhagry et al. (2017) [107] presented an end-to-end design in which an LSTM classifier learns features for arousal, valence, and liking.
LSTM is designed to overcome the fundamental issue of RNN, i.e., the vanishing gradient problem. However, it does not entirely overcome the issue, because data still has to be transferred from one cell to another for evaluation. To be effective, the architecture requires a large amount of training data and higher-specification hardware. Moreover, LSTM models are susceptible to overfitting.

5-3-4-1-SincNet -Customized Deep Learning
SincNet is a customized CNN-based algorithm designed by Ravanelli and Bengio (2018) [108] for a speech recognition system. It uses band-pass filters parameterized by sinc functions. Only the low and high cutoff frequencies of each filter are learned from the raw data, rather than every element of each filter as in a typical CNN algorithm. The SincNet classifier was shown to be efficient for speaker recognition.
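The idea of a filter defined only by its two cutoff frequencies can be sketched as follows. The sketch assumes the band-pass form $g[n] = 2 f_2\,\mathrm{sinc}(2\pi f_2 n) - 2 f_1\,\mathrm{sinc}(2\pi f_1 n)$ with $\mathrm{sinc}(x) = \sin(x)/x$ and cutoffs normalized to the sampling rate. Windowing and the learning of the cutoffs are omitted, and the function names and values are illustrative.

```python
import math


def sinc(x):
    """Unnormalized sinc function, sin(x)/x, with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(x) / x


def sinc_bandpass(f_low, f_high, length):
    """Band-pass kernel built from two cutoffs (in cycles per sample).

    Assumes 0 <= f_low < f_high <= 0.5 and an odd kernel length. Only the two
    cutoffs would be trainable; every tap is derived from them.
    """
    half = length // 2
    return [2 * f_high * sinc(2 * math.pi * f_high * n)
            - 2 * f_low * sinc(2 * math.pi * f_low * n)
            for n in range(-half, half + 1)]


kernel = sinc_bandpass(f_low=0.1, f_high=0.2, length=11)
print(kernel)
```

An 11-tap filter in a typical CNN layer has 11 free parameters, whereas the kernel above is fully determined by 2, which is the parameter saving described in the paragraph above.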
SincNet can also handle the classification of EEG measurement data. In Zeng et al. (2019) [109], an improved SincNet-based classification method, SincNet-R, was proposed. It contains three CNN and three DNN layers. The proposed technique was tested using EEG measurements and was reported to perform better than other classification techniques, such as CNN, LSTM, and SVM.

5-4-Performance Evaluation
In this section, RQ6 is answered. The trained classifier needs to be evaluated to make sure its predictions are accurate. The most common way to test a classifier is to calculate its accuracy rate and error rate. For more detail, the precision rate (P) and recall rate (R) are calculated. These rates can be defined using the confusion matrix, as shown in Table 4. The accuracy rate (A) is the ratio of correctly classified outputs to the total input, as defined by Equation 16. The error rate (e) is the ratio of misclassified outputs to the total input, as defined by Equation 17.
P and R are interdependent and tend to move in opposite directions: if R rises, P will decrease, and vice versa. They can be calculated using Equations 18 and 19, respectively. F1 is the harmonic mean of P and R. An F1 score reaches its best value at 1 (perfect precision and recall). The formula for calculating F1 is given in Equation 20.
Receiver Operating Characteristic (ROC) curves can be useful for selecting the best classifier. The ROC's horizontal and vertical axes are the false positive rate (FPR) and the true positive rate (TPR), respectively. The FPR and TPR can be calculated using Equations 21 and 22, respectively.
An example of ROC is presented in Figure 20, where the two lines represent two different classifiers. The nearer a line is to the upper left corner, the better the performance of the trained classifier. In the example given, the blue line is nearer to the upper left corner than the green line, which indicates that the blue classifier is better than the green classifier. Additionally, the blue line encloses a larger area under the ROC curve (AUC), which also indicates that the blue line's classifier works better than the classifier represented by the green line.
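The evaluation measures described in this section can be computed from the four confusion matrix counts, as in the following sketch. It assumes the standard binary definitions (true positives TP, false positives FP, false negatives FN, true negatives TN); the function names and the example counts are illustrative and are not taken from any reviewed study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the common rates from binary confusion matrix counts.

    Assumes every denominator is non-zero (each class occurs and is predicted).
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        # F1 is the harmonic mean of precision and recall.
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),
        "tpr": recall,  # the true positive rate is the same quantity as recall
    }


def auc_trapezoid(roc_points):
    """Area under a ROC curve given (FPR, TPR) points, by the trapezoid rule."""
    points = sorted(roc_points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
# A one-threshold ROC curve: the classifier's point plus the two end points.
area = auc_trapezoid([(0.0, 0.0), (metrics["fpr"], metrics["tpr"]), (1.0, 1.0)])
print(metrics, area)
```

A classifier whose ROC curve lies closer to the upper left corner yields a larger value from `auc_trapezoid`, which matches the comparison of the two curves described above.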

6-Previous Work on ERS
This section summarizes the reviewed ERS, which are tabulated in Table 5. In the table, "Ref" contains the citation number of the referred work, "Signal" presents the type of input data the researchers used, "No. of participant/dataset" states how many participants were involved in generating the dataset or which benchmark dataset was used, and "Emotion stages" presents which classes of emotions were adopted. Next, "Inducement" contains the methods of inducing emotions in the participants. The "Technique and Features" column presents the techniques used for feature extraction and selection. Next, "Classification" contains information on the ML model used by the researchers. Finally, "Accuracy (%)" presents their outcome as a percentage. The systematic review shows that the biosensor data most commonly used by researchers are ECG and EEG signals. Some researchers preferred multimodal data; however, most of them used a single modality. The most commonly used single-modality data is EEG from the DEAP dataset.
In recent work, even though deep learning models are becoming popular among researchers working on classification, including ERS, traditional ML continues to be chosen by researchers. From Table 3, it can be seen that ML models such as SVM, RF, and NN are still commonly used for classifying emotions, and SVM classification accuracy is comparatively better than that of other ML models. In ML-based ERS, feature extraction is necessary to form the data signal, as most physiological data is in time-series format. Deep learning requires less pre-processing than regular ML model training. This motivates researchers to adopt deep learning so that the feature extraction and selection process can be avoided. Because deep learning performs feature extraction and selection in its hidden layers, a few researchers have also used it as a pre-processing technique.
The datasets used for training the learning models are developed by inducing emotion in individuals. Unconscious emotion inducement is more commonly used than conscious inducement. The most common materials used are music, movie clips, audio, images, and video games. The most common pictures/images used to induce emotion are obtained from the International Affective Picture System (IAPS) [152].
Both discrete and arousal/valence emotion models are popular in this field. However, arousal and valence alone are not enough for a user to understand the individual's exact feelings. The rule of thumb for ML classification states that the number of training samples should be sufficiently large to get the best classification result. In the work tabulated above, the minimum number of participants is six, and the maximum is 60. Different researchers obtained different accuracy and performance levels for different input data, which shows that there is no single fixed method for a given case. A multimodal model, as researched by Jang et al. (2015) [116], was found to be the best model for the input and output data. Many physiological signals are non-stationary and chaotic. Commonly, time and frequency features are extracted from these non-stationary physiological signals to reduce the impact of their non-stationary characteristics on subsequent processing.

7-Conclusion and Future Work
In this paper, an initial systematic review was presented to answer research questions for developing ERS. It was concluded that the discrete and multi-dimensional emotional state models help distinguish between different emotional values, which are usually used as the output classes of ERS. Input data can be obtained from various physiological signals (ECG, EEG, EMG, HRV, EOG, EDA/GSR/SC, SKT, RSP) by using biosensors. The ERS can be built using one source or a fusion of sources of signal data. However, a single modality is preferred because it reduces hardware cost. This paper describes the whole framework of ERS. Physiological measurement processing and data exploration methods play an essential role in selecting the best classification method and biosensors. Emotion classification can be implemented using traditional ML and deep learning models. ML models require feature extraction and feature selection, whereas deep learning models extract features automatically instead of relying on manual extraction. It is crucial to evaluate the trained classifier's performance before integrating the model into the actual system. The evaluation can be carried out using the confusion matrix values. Current research indicates that the most effective ERS methods are deep learning models (CNN, DBN, PNN, LSTM, and SincNet). However, some traditional ML models (such as SVM and RF) can also provide classification with reasonable accuracy. According to the accuracy obtained by different researchers, there is room to improve model performance. In particular, work with a single modality needs to improve, as a single input will reduce the overall system expense.
The findings from this systematic review are going to be used as guidelines in building and designing the final ERS. In the future, the presented system will be implemented to detect emotions in real-time.

8-2-Data Availability Statement
Data sharing is not applicable to this article.

8-3-Funding
This project is funded by TM Research & Development Grant (RDTC/190988), which is awarded to the Multimedia University.

8-6-Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this manuscript. In addition, the ethical issues, including plagiarism, informed consent, misconduct, data fabrication and/or falsification, double publication and/or submission, and redundancies have been completely observed by the authors.