Deep Learning in Predicting High School Grades: A Quantum Space of Representation

This paper applies deep learning to the prediction of Portuguese high school grades. A deep multilayer perceptron and a multiple linear regression are implemented. The objective is to demonstrate the adequacy of deep learning as a quantitative explanatory paradigm when compared with the classical econometrics approach. The results encompass point predictions, prediction intervals, variable gradients, and the impact of an increase in class size on grades. Deep learning's generalization error is lower in student grade prediction, and its prediction intervals are more accurate. The deep multilayer perceptron gradient empirical distributions largely align with the regression coefficient estimates, indicating a satisfactory regression fit. Based on gradient discrepancies, a student's mother being an employer does not seem to be a positive factor. A benign paradigm shift concerning the balance between home and career affairs for both genders should be reinforced. The deep multilayer perceptron broadens the spectrum of possibilities, providing a quantum solution hinged on a universal approximator. In the case of an academic achievement-critical factor such as class size, where the literature is unanimous on neither its importance nor its direction, the multilayer perceptron formed three distinct clusters according to the individual gradient signs.

1-Introduction
Educational data mining (EDM) and learning analytics (LA) often use neural networks to study educational realities, extract valuable knowledge from digital platform data, and explain academic achievement (AA). Learning systems with the ability to anticipate students at risk of failing are a promising development for improving learning contexts and academic attainment. However, there still seems to be an ongoing preference for traditional methods such as multiple linear regression [5]. On the other hand, EDM and LA develop extensive knowledge models suitable for predictive analysis alone. These models do not have the traditional explanatory nature built upon the measurement of the literature-based AA determinants [6]. To the best of the authors' knowledge, no educational econometrics study has considered deep learning as the explanatory quantitative method. This article aims to fill this important scientific gap.
The study of the determinants of AA is crucial to promoting accurate educational policies. Moreover, the success of a country's education system can leverage the entire nation's wealth [7]. Promoting an improvement in the conceptual framework, or in the quantitative approach that supports it, is a meaningful and necessary breakthrough. Applying deep learning to infer relationships between concepts is not the same as using it for purely predictive purposes. Since deep learning is based on a universal approximator, the vast number of underlying parameters makes interpretation and knowledge extraction more challenging. It is necessary to ensure that developments in scientific experimentation do not bring spurious complications: any added complexity must be justified by a better approximation of the reality under scrutiny. Thus, deciphering the deep learning black box is a valuable scientific undertaking [8]. In this study, we address this challenging task by computing the deep gradients for each variable-observation pair and comparing their distributions with the traditional βs of the multilinear regression.
The adoption of deep learning as an experimental approach in the educational and social sciences alike has remarkable advantages beyond its predictive capacity. The paradigm does not depend on a specific mathematical form to express relationships between concepts and has a particular aptitude for representing social phenomena whose heterogeneity is paramount [9]. The treatment of conceptual heterogeneity is undertaken naturally and spontaneously. By widening the spectrum of possibilities, deep learning introduces a capacity to anticipate nonconformities, which induces the search for fairer and more equitable policies. Any policy measure that brings about changes in the critical factors of AA is evaluated within the heterogeneous spectrum of both the possible outcomes and the underlying gradient structure. For example, there is room for a critical factor with an average positive impact on students' grades to have a detrimental effect on a hypothetical individual example. This study undertakes this comprehensive analysis for the critical AA factor of class size, for which the literature is unanimous on neither its importance nor its direction. This paper therefore aims to apply deep learning to predict upper secondary students' AA, highlighting the revolutionary character of its widespread adoption. It seeks to reflect on the repercussions for the AA domain (and for the social sciences in general) of using a paradigm that has the intrinsic ability to create a quantum space of representation of social phenomena. For this purpose, we implement deep learning and multilinear regression simultaneously to predict the upper secondary grades assigned by Portuguese education system teachers at the end of the 2018-19 school year. The discussion that follows stems from the interpretation and comparison of the results regarding point and interval predictions, independent variable gradients, and the likely effect of a generalized increase in class size.
The remainder of the document is organized as follows: first, a review of AA literature is presented, followed by a detailed description of the methodology and the underlying algorithms. Then, the empirical results are shown and interpreted, followed by the discussion and conclusions.

2-Literature Review
In the scientific literature, AA determinants are commonly classified into student, parents, and school critical factors [10]. A thorough assessment of the conditional background induced by those three analytical axes is of utmost importance when explaining students' AA. Cognitive ability has long been considered the most essential determinant of AA [11,12]. Not surprisingly, students' scores can be anticipated accurately from their Intelligence Quotient [13], despite the significant role that is left for other important factors [14]. When it comes to gender, females generally attain better scores in school, especially in languages, and less so in Math [15][16][17]. The tendency to create a negative peer view of the school activities undermines males' levels of engagement, motivation, and achievement [18]. There is a relationship between certain personality traits, such as organization and steadiness of effort, and overachievement [19].
There is an AA gap between different ethnic groups. Black students in the US persistently underperform [20]. Although the advantage does not extend to subsequent generations in the US, first-generation children of African, Asian, and Hispanic origins achieve higher education levels than their parents did [21]. AA tends to be poorer when the country of origin has a low level of economic development and better when that country is politically stable [22]. Using personal computers at school can improve AA. However, students tend to use them primarily for unhelpful leisure activities such as emailing friends and browsing the Internet [23]. There is a negative relationship between the non-academic use of information and communication technologies and student grades [24]. Greater use of internet applications is also associated with sleeping late, fatigue, class absence, and AA underperformance [25].
Parents' expectations about their children's education attainment positively affect their AA, which is more significant than a proper home structure and supervision [26]. Underachieving students are bound to benefit from good relationships between parents and school [27]. Furthermore, parental involvement seems to especially help low socioeconomic status (SES) students [28]. There is a strong positive relationship between the SES of the student's family and AA, highlighting education inequalities and the importance of resources and cultural capital [29]. The association between parents' education and AA remains even after controlling for variables associated with intelligence and personality [30], underlining the prominent role of schools in providing cultural experiences and additional stimuli that are lacking at home. In addition, having private lessons, which is associated with parental education and family income, can be decisive for students' AA [31].
There is some controversy in the literature surrounding the relationship between class size and AA. Hoxby (2000) [32] concluded that the class size effect is insignificant, with estimates precise enough to rule out even minor effects. By contrast, Krueger (1999) [33] concluded that smaller classes improve AA, with minority and impoverished students benefiting the most. Smaller classes appear to have a favourable effect on AA in education systems where the lecturing quality seems to be lower [34]. In a more convergent tone, smaller schools promote AA, providing the greatest benefit to students with learning difficulties and lower SES [35]. An adequate school environment and design are conducive to overachievement. Students and school stakeholders should be provided with a peaceful and comfortable learning environment with clean air and good light [36]. When introducing changes in the school environment, an inclusive design process is recommended that welcomes genuine input from teachers and students [37].
Lecturing ability and teacher quality are important for AA in general and influence underperforming students in particular [38]. There is a positive relationship between teachers' ability and college grades [39]. However, many measurable teacher characteristics seem to be unrelated to teacher quality, which is intrinsically linked to unobservable factors. This finding points to policies favouring teaching evaluation based on students' performance [40]. Along the same lines, Rivkin et al. (2005) [41] corroborated that lecturing effectiveness is undoubtedly a significant AA determinant. However, in the same study, teacher education and experience revealed only a weak effect.
The LA/EDM field is a predictive branch of the AA domain that uses machine learning to disclose relevant behaviour patterns embedded in educational databases. LA/EDM research continues to grow. However, there are only a few regression studies, as most are designed to solve classification and clustering problems [42,43]. Typically, LA/EDM learning systems resort to socio-demographic variables, digital log data, and course assignment scores to anticipate students' AA. They are extensive knowledge models appropriate for predictive but not explanatory analysis [6]. Artificial neural networks (ANNs) have also been shown to perform among the best when predicting grades [44]. Table 1 shows a representative set of the studies that use ANNs in the experimental phase. It is worth mentioning that our research goes far beyond their scope and depth. For instance, none of those in Table 1 involves estimating prediction intervals, computing the deep learning gradients, or analysing the effects of policy measures.

3-Methodology
Supervised learning involves learning a function that maps an input to an output based on a set of input-output pairs. The function is inferred from labelled data consisting of training examples. Each example is a pair consisting of an input vector and an output value, also called the supervision signal. Each component of the input vector corresponds to a feature or attribute. A supervised learning algorithm analyses the training data and infers a function to be used to map new examples. The learned function should accurately anticipate the class labels in the case of classification or the numeric target variable in the case of regression. The learning algorithm should have the statistical quality of generalizing properly from training to unseen data [57].
The dataset was split into 60% for training, 20% for validation, and 20% for testing. All the variables were standardized. The deep multilayer perceptron (MLP) implementation includes training and test performance statistics, test prediction intervals, training and test gradients, and the analysis of the effects on grades of a class size increase. The multilinear regression (MLR) implementation does not include the computation of gradients because they coincide with the regression coefficients. The core of the experimental phase involved eight main steps. The first consisted of a feature selection procedure based on the Lasso regression algorithm. In the second step, the multilinear regression results were computed on both the training and test sets. In the third step, a thorough architecture-topology and hyperparameter optimization procedure of the deep MLP was undertaken. Next, the deep MLP was trained. Then the deep MLP test prediction intervals were calculated. In the sixth step, the training and test gradients were determined. Finally, the seventh and eighth steps comprised predicting the class size effects on grades for both the MLR and the deep MLP. Figure 1 displays the research methodology followed in this study.
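As a concrete illustration, the split and standardization step can be sketched as follows. The use of scikit-learn, the placeholder data, and the choice of estimating the standardization statistics on the training set alone are assumptions of this sketch, not details reported by the implementation.

```python
# Minimal sketch of the 60/20/20 split and standardization (assumptions noted above).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.randn(1000, 85)   # placeholder feature matrix (85 selected variables)
y = np.random.randn(1000)       # placeholder grades

# 20% test holdout first, then 25% of the remaining 80% as validation (= 20% overall).
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Standardize all variables; statistics are estimated on the training set only.
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = (scaler.transform(s) for s in (X_train, X_val, X_test))
```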

3-1-Multilinear Regression
MLR establishes a linear relationship between a dependent variable, to be explained and predicted, and a set of independent variables. It provides easily interpretable results by imposing important restrictions. The error terms are assumed to be independent of one another, homoscedastic, and with a null mean. The model and the individual statistical significance tests of the coefficients presuppose that the error term follows a Gaussian distribution. The ordinary least squares method was used in the learning phase, and the model parameters were estimated from the training and validation sets. The point and interval predictions on the test set followed standard practice [58]. The implementation was based on the statsmodels Python library [59].
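A minimal sketch of this MLR step with statsmodels follows; the variable names carry over from the split sketch above, and treating the 95% observation interval returned by summary_frame as the prediction interval is an assumption of the sketch.

```python
# OLS fit plus point and interval predictions with statsmodels (sketch).
import statsmodels.api as sm

X_design = sm.add_constant(X_train)          # add the intercept column
ols = sm.OLS(y_train, X_design).fit()        # ordinary least squares estimation
print(ols.summary())                         # coefficients and significance tests

pred = ols.get_prediction(sm.add_constant(X_test))
frame = pred.summary_frame(alpha=0.05)       # point predictions and 95% limits
y_hat = frame["mean"]
lower, upper = frame["obs_ci_lower"], frame["obs_ci_upper"]
```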

3-2-Deep Multilayer Perceptron
The MLP stems from the perceptron model [60], which is capable of solving linearly separable classification problems. The MLP architecture adds hidden layers between the input and output layers. The number of nodes of the input layer equals the number of input variables. It is a feed-forward topology, as the connections between nodes are established from lower to upper layers, and no connections exist between nodes of the same layer. Each connection is assigned a weight. The input of every node in any hidden layer or the output layer is a weighted average of the outputs of the nodes in the preceding layer plus a bias. The input is transformed through a nonlinear activation function into a new signal that is propagated forward up to the output layer [61]. The theoretical analysis of an MLP is not an easy task, as the nonlinearity of the distributed processing and the high connectivity enlarge the optimization search space to numerous possible representations of the input patterns by the hidden nodes. The task becomes even more difficult in the case of a deep MLP with several large hidden layers. The learning phase of an MLP consists of optimizing the weights and biases to minimize the gap between the network output and the target. This optimization is carried out by the backpropagation algorithm [62] combined with gradient descent techniques. The learning process has two phases. In the forward phase, the signals are propagated from lower to upper layers up to the output layer, and the weights and biases remain unchanged. In the backward phase, the network error is first computed and then propagated backward layer by layer, inducing the weights and biases to change in the direction determined by the gradient of the loss function. The learning phase is considered successful when it reaches a configuration of the weights that results in an acceptable value of the loss function [61,63,64].
The implementation was developed using Keras [65], which is a deep learning API written in Python, running on top of TensorFlow (an end-to-end machine learning platform). It was developed with a focus on enabling fast experimentation and is characterized by flexibility and scalability. In fact, as stated in the Keras documentation, it is possible to run Keras on large clusters of GPUs, and export Keras models to run in the browser or on a mobile device.
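For orientation, a minimal Keras definition of a deep MLP regressor of this kind might look as follows; the layer widths, initializer, and optimizer shown here are placeholders, not the tuned configuration reported later.

```python
# Sketch of a deep MLP regressor in Keras (placeholder topology and settings).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(85,)),                        # one input node per variable
    layers.Dense(64, activation="relu",
                 kernel_initializer=keras.initializers.RandomNormal(stddev=0.05)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1),                                  # linear output for regression
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100, batch_size=32)
```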

3-2-1-Layer weight initializers
The assignment of initial node weights is done just before the learning phase. Very small initial weights tend to produce vanishing backpropagated gradients, whereas large initial weights can induce exploding gradients [66]. The hyper-tuning procedure encompassed four weight initialization alternatives: the random normal initializer with mean zero and standard deviation of 0.05, the random uniform initializer between -0.05 and 0.05, the normalized random uniform initializer, and the normalized random normal initializer [67]. In terms of biases, the ones initializer, which activates every node in the deep MLP, and the most commonly used zeros initializer were included.

3-2-2-Activation Function
The ANN can learn very complex patterns due to both the nonlinearity of the activation function and the existence of hidden layers. The activation function of the hidden layers is the Rectified Linear Unit, $\phi(x) = \max(x, 0)$, as it typically enhances learning in networks with many layers [68]. The output layer has no activation function.

3-2-3-Dropout
Dropout is a regularization technique that randomly and temporarily stops training some nodes and their interconnections. Dropout regularization can be compared to model ensembles without the explicit need to create multiple learners [69]. The ANN generalization ability is enhanced because dropout avoids adapting weights to overfit the training set. To keep the mean weight unchanged between training and testing, the weights applied to unseen data are scaled as follows [70]:

$w_{\text{test}} = (1 - p)\, w$

where $p$ is the dropout rate, the probability of not training the node.
The hyper-tuning phase evaluated a dropout layer, with different dropout rates, after each hyper-tuned dense layer. A dropout layer sets inputs to zero according to the dropout rate.

3-2-4-Batch Size and Batch Normalization
The batch size corresponds to the number of observations considered in the forward step of the backpropagation before updating the network's weights [71]. There is a trade-off between computation cost and ANN accuracy in the batch size. A larger batch size makes computation more efficient but can degrade ANN accuracy. The default batch size of 32 examples was taken as the starting point [72], and the batch size search space was built from powers of two [71].
Batch normalization consists of normalizing the layer inputs for each training batch to keep their mean close to zero and their standard deviation close to 1. With the distributions of the layer inputs stabilized, the optimizer is less prone to layer saturation, accelerating learning, reducing the importance of the weight initialization, and eliminating the need for dropout [73]. The option of having a batch normalization layer before the activation was evaluated in the hyper-tuning phase, as sketched below.
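A sketch of how the two regularization options can be wired into a hidden block follows; the functional-API layout and argument names are assumptions of this sketch.

```python
# Hidden block with the two hyper-tuned options: an optional BatchNormalization
# before the activation and an optional Dropout after the dense layer (sketch).
from tensorflow.keras import layers

def hidden_block(x, width, batch_norm=False, dropout_rate=None):
    x = layers.Dense(width)(x)                 # pre-activation dense layer
    if batch_norm:
        x = layers.BatchNormalization()(x)     # normalize inputs per training batch
    x = layers.Activation("relu")(x)
    if dropout_rate:
        x = layers.Dropout(dropout_rate)(x)    # randomly zero inputs at the given rate
    return x
```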

3-2-5-Optimizers
The hyper-search includes a representative set of gradient descent optimizer variants to analyse which best suits the pattern of the data and its convergence.

3-2-5-1-Mini-Batch Gradient Descent with Momentum
The mini-batch gradient descent [3] with momentum updates the weights and biases $\theta$ for a learning rate $\eta$ and a momentum hyperparameter $\gamma$ as follows:

$v_t = \gamma\, v_{t-1} + \eta\, \nabla_\theta L(\theta_{t-1})$

$\theta_t = \theta_{t-1} - v_t$

where $L$ is the loss function and $v_t$ the velocity term. The momentum addition makes the actual update dependent on the previous gradient, accelerating convergence and avoiding excessive oscillation.
An epoch is a complete pass through the entire training set. In the batch gradient descent, there is one update per epoch. In the case of the mini-batch version, as the internal parameters are updated for every successive subsample of the training data, there are several updates per epoch. In the case of the stochastic version, the update is undertaken for every single example. The mini-batch version comes up as a good compromise between the large gradient oscillation of the stochastic version that demands lower learning rates and more time to converge and the computation cost of the batch version that computes the gradients for the entire training set at once.
In the hyper-tuning phase, its adoption was evaluated for different learning rates and momentum coefficients.

3-2-5-2-Root Mean Square Propagation (RMSprop)
The gradients of each weight or bias can differ substantially, making it hard to find a single learning rate that fits every case. Higher gradients should correspond to lower learning rates in terms of convergence and efficiency. RMSprop is based on the mini-batch gradient descent and introduces adaptive learning rates. It divides the current mini-batch gradient by a moving average of the squared gradients, resulting in a different learning rate for each weight:

$s_t = \rho\, s_{t-1} + (1 - \rho)\, g_t^2$

$\theta_t = \theta_{t-1} - \dfrac{\eta}{\sqrt{s_t} + \epsilon}\, g_t$

where $g_t$ is the mini-batch gradient, $\rho$ the moving average coefficient, and $\epsilon > 0$ avoids a null denominator. In the hyper-tuning phase, its inclusion was evaluated for different hyperparameter values.

3-2-5-3-Adaptive Moment Estimation (Adam)
The Adam optimizer [74] uses adaptive learning rates and momentum. As sparse features are bound to generate sparse gradients, their learning rates should be higher. The adaptive learning rates allow different feature learning rates based on the sum of squares of their previous gradients.
The Adam optimizer updates the weights and biases for a core learning rate $\eta$, a momentum hyperparameter $\beta_1$, and an adaptive learning rate hyperparameter $\beta_2$:

$m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t$

$v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2$

$\hat m_t = \dfrac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \dfrac{v_t}{1 - \beta_2^t}$

$\theta_t = \theta_{t-1} - \dfrac{\eta}{\sqrt{\hat v_t} + \epsilon}\, \hat m_t$

where $\epsilon > 0$ avoids the null denominator case and $\hat v_t$ differs across weights. As the initial values $m_0$ and $v_0$ are zero, the rectifications $\hat m_t$ and $\hat v_t$ recentre the exponential averages.
In the hyper-tuning phase, the default optimizer was Adam. When tuned, the core learning rate $\eta$, the momentum hyperparameter $\beta_1$, and the adaptive learning rate hyperparameter $\beta_2$ were included in the search space.

3-2-5-4-Learning Rate Schedule
Scheduling the learning rate consists of reducing it as training progresses. It is sometimes called an annealing rate because it allows both a higher weight variance at the beginning, to avoid local minima, and a lower variance in the final epochs, enhancing the likelihood of convergence [72,75].
In the hyper-tuning phase, an exponential learning rate schedule was put forward:

$\eta_s = \eta_0\, c^{\,s}$

where $c \in \left]0,1\right]$ and $s$ is the number of completed steps, incremented after every fixed percentage of the total batches.
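The optimizer variants and the schedule in the search space map directly onto their Keras counterparts, as sketched below; the hyperparameter values shown are illustrative assumptions, not the tuned settings.

```python
# Keras counterparts of the optimizers in the search space (illustrative values).
from tensorflow import keras

sgd = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)      # mini-batch GD with momentum
rmsprop = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)  # adaptive per-weight rates
adam = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

# Exponential schedule: learning rate = initial * decay_rate ** (step / decay_steps).
schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001, decay_steps=1000, decay_rate=0.9)
adam_scheduled = keras.optimizers.Adam(learning_rate=schedule)
```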

3-3-Feature Selection
Lasso multilinear regression [76] introduces an L1 regularization term into the MLR model, penalizing the magnitude of the regression coefficients:

$\hat\beta = \arg\min_{\beta} \left\{ \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \right\}$

As the shrinkage pressure $\lambda$ increases, the resulting model is likely to be simpler and sparser. In a feature selection procedure, the variables that have a null $\hat\beta$ are dropped, as they are considered unimportant for the explanation of the target variable.
In the feature selection phase, the choice of the regularization factor $\lambda$ was carried out through a four-fold cross-validation grid search. The $\lambda$ of the feature selection model is the highest value for which the loss function remains below the optimum plus its standard deviation.
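A minimal scikit-learn sketch of this selection step follows. Note that LassoCV returns the cross-validation optimum, so the one-standard-deviation rule described above is applied manually here; the four-fold setting is taken from the text, everything else is an assumption of the sketch.

```python
# Lasso feature selection with four-fold CV and a one-standard-deviation rule (sketch).
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

cv_lasso = LassoCV(cv=4).fit(X_train, y_train)
mean_mse = cv_lasso.mse_path_.mean(axis=1)          # mean CV loss per candidate lambda
std_mse = cv_lasso.mse_path_.std(axis=1)
best = np.argmin(mean_mse)
threshold = mean_mse[best] + std_mse[best]          # optimum plus its standard deviation
alpha_1sd = cv_lasso.alphas_[mean_mse <= threshold].max()  # highest lambda under the threshold

selected = np.flatnonzero(Lasso(alpha=alpha_1sd).fit(X_train, y_train).coef_ != 0)
X_train_sel = X_train[:, selected]                  # drop variables with null coefficients
```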

3-4-Hyper-Tuning
The hyperparameter selection was divided into three steps. The first was based on the hyperband optimization method, whereas the other two were based on the Bayesian optimization method.
The hyperband optimization algorithm was used to select the deep MLP topology. The aim was to include as many depth and width combinations as possible while staying within a reasonable computation budget.

3-4-1-Hyperband Optimization
The hyperband optimization [77] speeds up the random search algorithm [78] (commonly used for hyper-parameter optimization) by introducing an adaptive mechanism and an early stopping system. For the same computation budget, these two components allow the algorithm to look at more possible configurations with respect to traditional hyperparameter optimization approaches. The hyperband undertakes a grid search for n possible configurations. Each grid search iteration is called a bracket and includes a complete run of the Successive Halving algorithm [79].
The schedule used is described in Table 2. The Max-epochs refer to the maximum iterations per configuration and the Factor to the configuration down-sampling rate.
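Using KerasTuner, the topology search can be sketched as follows; the depth and width bounds, as well as the Max-epochs and Factor values, are assumptions of this sketch rather than the exact entries of Tables 2 and 4.

```python
# Hyperband topology search with KerasTuner (bounds and budget are assumptions).
import keras_tuner as kt
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
    model = keras.Sequential([layers.Input(shape=(85,))])
    for i in range(hp.Int("hidden_layers", 1, 15)):              # depth of the MLP
        model.add(layers.Dense(hp.Int(f"width_{i}", 2, 50), activation="relu"))
    model.add(layers.Dense(1))
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

tuner = kt.Hyperband(build_model, objective="val_mae", max_epochs=30, factor=3)
# tuner.search(X_train, y_train, validation_data=(X_val, y_val))
```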

3-4-2-Bayesian Optimization
Bayesian optimization uses Bayes' Theorem to estimate an acquisition function that determines the location of the next search point. The acquisition function represents a formal trade-off between exploration, i.e., high-variance areas of the surrogate objective function with insufficient posterior information, and exploitation, i.e., areas for which posterior information points to adequate objective function values. It is cost-efficient because it minimizes the number of configuration evaluations required, and it suits non-convex optimization problems [80].
In the hyper-tuning phase, the Bayesian optimization maximum number of trials was set to 200.
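The analogous Bayesian step, with the 200-trial budget stated above, might be sketched as follows, reusing the illustrative build_model from the previous sketch.

```python
# Bayesian optimization with a 200-trial budget (sketch; build_model as above).
tuner_bo = kt.BayesianOptimization(build_model, objective="val_mae", max_trials=200)
# tuner_bo.search(X_train, y_train, validation_data=(X_val, y_val))
```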

3-5-Deep MLP Prediction Intervals
Let us suppose a target random variable $y$ as follows [81]:

$y = f(x) + \varepsilon$

where $x$ is a vector of independent variables and $\varepsilon$ is a term of stochastic noise with mean $\mu$ and finite variance $\sigma^2$. Approximating $f$ with a statistical learning model $\hat f(x;\hat\theta)$ decomposes the prediction error into two parts:

$y - \hat f(x;\hat\theta) = \left[ f(x) - \hat f(x;\hat\theta) \right] + \varepsilon$

where $f(x) - \hat f(x;\hat\theta)$ is the model error.
The prediction for any unseen example $x_0$ is:

$\hat y_0 = \hat f(x_0;\hat\theta)$

3-5-1-Model Error
A bootstrap can approximate the model error distribution. Let us draw with replacement $B$ random subsamples of the training set and fit a model on each of them. The bootstrap predictions on a validation set $VS$ can be denoted $\hat f_b(x_v)$, for $b = 1, \dots, B$ and $x_v \in VS$. The mean of the bootstrap distribution,

$\bar f(x_v) = \frac{1}{B} \sum_{b=1}^{B} \hat f_b(x_v),$

converges to the true mean of the model. In turn, the empirical distribution of the centred bootstrap samples $\hat f_b(x_v) - \bar f(x_v)$ converges to the distribution of the model error $f(x_v) - \hat f(x_v;\hat\theta)$.

3-5-2-Stochastic Error
The distribution of the stochastic error can be approximated by the distribution of the residuals projected on the validation set:

$\hat\varepsilon_v = y_v - \hat f(x_v;\hat\theta), \qquad x_v \in VS$

Combining the bootstrap model errors with the validation residuals yields the set $T$ of total prediction errors. For a level of significance of $\alpha = 5\%$, the 2.5% and the 97.5% quantiles of the $T$ set were taken to build the prediction intervals:

$PI(x_0) = \hat f(x_0;\hat\theta) + \left( T_{2.5\%},\; T_{97.5\%} \right)$

In the experimental phase, the trained deep MLP was used as the statistical learning model $\hat f(x;\hat\theta)$, and the bootstrap was carried out as a subsequent fine-tuning. The number of bootstrap samples was 200, and the number of epochs was 30.
The accuracy and adequacy of the prediction limits were inferred from the Prediction Interval Coverage Probability (PICP) and the Mean Prediction Interval Width (MPIW), as follows [82]:

$PICP = \dfrac{c}{n}$

where $c$ is the number of test set samples whose target falls inside the prediction interval and $n$ is the total number of test set samples; and

$MPIW = \dfrac{1}{n} \sum_{i=1}^{n} \left( U_i - L_i \right)$

where $U_i$ and $L_i$ are the upper and the lower limits of the $i$-th prediction interval, respectively.
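Both metrics reduce to a few lines of numpy, as sketched below; lower and upper are assumed to be the per-example interval limits from the procedure above.

```python
# PICP and MPIW for a set of prediction intervals (sketch).
import numpy as np

def picp(y, lower, upper):
    """Share of targets falling inside their prediction interval (c / n)."""
    return np.mean((y >= lower) & (y <= upper))

def mpiw(lower, upper):
    """Mean width of the prediction intervals."""
    return np.mean(upper - lower)
```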

4-Results

4-1-Data
The dataset comprises 673,992 grades (from 0 to 20) from Portuguese upper secondary students. The data refer to the three final high school years (10th, 11th, and 12th) and comprise 27 subjects, from Portuguese and English to Physics and Math, for the 2018-19 academic year (see Appendix I). A dummy variable was associated with each subject.
Regarding the proportion of years, 29% are 10th grades, 36% 11th grades, and 35% 12th grades. 53% of the grades are from girls. 14% are from half-scholarship students and 12% from those holding a full scholarship. The dataset was built from a series of Microsoft® SQL Server Management Studio queries. There are 34 features, 7 of which are from Statistics Portugal and the remainder from the Directorate-General for Statistics of Education and Science of the Portuguese Ministry of Education. The latter are essentially categorical variables, creating a sparse dataset in terms of measurements of AA critical factors (see Appendix II for more details).
The one-hot encoding ended with 131 independent variables to be selected to the input space via the Lasso Regression selection procedure. The dataset was split into 404,394 observations for training, 134,799 for validation, and 134,799 for testing. The test set is a complete holdout set that did not participate in any step of the learning phase, replicating unseen data.

4-2-1-Feature Selection
The Lasso feature selection procedure picked 85 of the 131 available predictive variables for an optimized shrinkage pressure of 0.004 (Figure 2). The distance between the student's home and the school was considered irrelevant. The dummy variables concerning students whose fathers' nationality is from either wealthy Western countries or poor Eastern European ones were discarded, as their behaviour is not significantly different from that of nationals. The dummy variables related to the employment situation and education level of the student's guardian were largely discarded because they are strongly correlated with the corresponding dummy variables for the parents. In terms of both parents' job situation, several dummies were considered irrelevant and indistinguishable from the base status of being employed; the unemployment dummy, however, passed the Lasso filter for both parents. Almost every dummy associated with the parents' education level entered the input space. In terms of scholarship, the half-support dummy was discarded, as it did not differ significantly from the non-scholarship situation. Among the Statistics Portugal socioeconomic variables, the illiteracy rate, the unemployment rate, and the importance of the primary sector were also discarded.

4-2-2-Hyper-Tuning
The initial deep MLP changes throughout the hyper-tuning process because it incorporates the tuning optimization of the preceding steps. For reference, the base MLP is shown in Table 3.

4-2-2-1-Deep MLP Topology
The first step of the hyper-tuning phase was to optimize the topology of the deep MLP through a hyperband search. The search space was built according to Table 4. The optimization results show a clear preference for topologies with fewer than 10 hidden layers and a global size of fewer than 250 nodes (see Figure 3). On the other hand, topologies with a depth greater than 12 tend to have a higher MAE. This outcome arises from the pattern of the data itself and not from possible divergence issues, as some deep topologies reach fair MAE values (size and colour of the dots in Figure 3).

Figure 3. Hidden layers, size, and validation MAE
The selected topology consists of 6 hidden layers with widths of 45, 45, 3, 22, 48, and 19. The deep MLP seems to allocate the first three layers to condense the data and then the latter ones to search for the universal approximator. The MLP reduces the dimensionality of the data with an edge: it does not follow a predefined linear or kernelized mathematical transformation.

4-2-2-2-Weight and Bias Initializations, Dropout Layer, and Batch Normalization
The second step of the hyper-tuning phase consists of choosing the weight and bias initialization method, the existence of a dropout layer after each dense layer, and a batch normalization before every activation. The search space was built according to Table 5. The selected combination was random normal and ones for the weight and bias initializations, with batch normalization but no dropout. The Bayesian optimization directed the search toward areas where the weight and bias initializations were random normal (68%) and zeros (91%), respectively, and where neither dropout nor batch normalization existed (83.50%). Thus, only the choices of random normal for the weight initializer and the absence of dropout can be said to have a robust decision basis. The other choices stand on weaker ground.

4-2-2-3-Optimizer and Batch Size
The last step of hyper-tuning encompasses the batch size optimization and the optimizer choice, along with the tuning of its hyperparameters: learning rate and schedule, momentum, and adaptive learning rate factors. The search space is described in Table 6. The Bayesian choices of Adam as optimizer and 64 as batch size (see Table 7) are robust, as they are present in 55% and 45% of the 20 best combinations, respectively. However, the choice of a learning rate schedule is weak, as half of the 20 best combinations have no learning schedule. The learning rate and the batch size increased from the defaults of 0.001 and 32 to 0.00553 and 64, respectively. The increase in the learning rate is not unexpected given the existence of a learning rate schedule and the larger batch size.

4-2-3-Learning Results
The deep MLP presents better results than the MLR in both training and testing (Table 8). The MLP MAEs in training and test are 0.6357 and 0.6484, respectively, better than the MLR's 0.6944 and 0.6910. The multilinear regression training set includes the validation set. Regarding the MLP, the validation set allowed saving the best combination of epoch weights, used further to compute the prediction limits and gradients. The MLP suffers from some overfitting, as the test results are poorer than both the validation and training results. Several variance reduction techniques were considered when optimizing the architecture and the hyperparameters, so the MLP overfitting should be interpreted as a virtuous cost of achieving a better generalization error. The deep MLP training optimization converged smoothly and reached a low-variance plateau around epoch 600 (Figure 4); the weight combination that prevailed corresponds to the 799th epoch.

Figure 4. Deep MLP training convergence (Depicted from the Keras-Tensorflow learning history)
The prediction limits were built for $\alpha = 5\%$, and both the deep MLP and the MLR have a PICP greater than 95%. The deep MLP prediction interval is 5% narrower than the MLR's (Table 9).

4-2-4-Gradients Analysis
The gradients correspond to the first derivatives of the output with respect to the input variables. In the MLR, the gradients are the data-invariant βs. In the deep MLP, the gradients vary from data point to data point, forming a vector of βs for each input variable. The analysis consists of comparing the MLP mean β with the MLR β in light of what is expected from the literature.
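A sketch of this gradient computation with TensorFlow's GradientTape follows; `model` and `X_test` are assumed to be the trained deep MLP and the standardized test matrix from the earlier sketches.

```python
# Per-observation input gradients of the trained deep MLP (sketch).
import tensorflow as tf

x = tf.convert_to_tensor(X_test, dtype=tf.float32)
with tf.GradientTape() as tape:
    tape.watch(x)                        # track the inputs, not just the weights
    y_hat = model(x)
grads = tape.gradient(y_hat, x)          # one beta per variable-observation pair
mean_betas = tf.reduce_mean(grads, axis=0).numpy()   # to compare with the MLR betas
```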
Inconsistencies between the MLR β and the mean MLP β were found in only 8 out of 85 input variables (see Table 10). The guardian not being a parent or a close relative can indicate a dysfunctional family background, which is detrimental to AA. However, the MLP βs for both the "guardian is not a relative" and the "guardian is a relative but not a parent" input variables have a contradictory positive sign. Regarding internet usage, the MLP β sign is negative, differing from the positive β in the MLR. In the literature, the use of the Internet is reported as having both positive and negative effects on AA, depending on whether it is directed to school activities or entertainment. Regarding the professional teacher category, it might be expected that any category below being a definitive permanent staff member of a school would be detrimental to AA. However, in two such cases the MLR has a positive β, and in one such case the MLP also has a positive mean β. The MLR and MLP βs disagree in sign once again for the collective dwellings input variable. As some collective dwellings, such as hotels and state buildings, are bound to be found in high-income urban zones, while others, like shopping centres and hospitals, can be placed in suburban areas, the literature does not indicate a specific sign for this β. More inconsistencies between the results and the literature are noted in Table 11. The results show that students belonging to the Chinese community tend to have better grades than the natives, contradicting the literature. Other examples are the negative effect of teacher age and the positive effect of teacher years to retirement on AA. Indeed, it could be expected that more lecturing experience would result in higher grades. Curiously, female teachers tend to assign lower grades, a finding not explicitly addressed by the literature. Fixed-contract teachers tend to assign higher grades than fully permanent teachers, contradicting the notion that a teacher with a stable career is more efficient in lecturing, thereby yielding higher AA levels. The results also show that the mother being an employer is detrimental to the student's AA, contradicting to some degree the positive association between parental SES and AA. Lastly, the results show an unequivocal negative class size effect on AA, even though the literature is inconclusive in this regard.

4-2-5-Class Size Effect
To analyse an increase of five students in the size of the classes, the test set was modified accordingly. The impacts on grades then arise naturally from the difference between the predictions on the modified and the original test sets.
In the MLR, the impact is the same whichever test example is considered, driven by the $\hat\beta$ associated with class size: the grade of every example decreased by 0.0282.
In the deep MLP, each test example has its own gradient, and the impacts on grades vary accordingly. The mean impact is a decrease of 0.1047, with a standard deviation of 0.3747. The deep MLP therefore anticipates, on average, a more substantial effect on grades than the MLR does. The test set splits into three clusters regarding the type of impact on grades: a first cluster in which grades are predicted to improve, a second cluster in which grades are predicted to worsen, and a third cluster in which grades are predicted to remain unchanged (see Table 12). The formation of the clusters closely followed the test set gradients, even though there is a clear difference between first derivatives and finite differences in the MLP framework. The confusion matrix in Table 13 highlights that 78.46% of the impacts on grades are consistent with the respective gradient sign.
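The counterfactual itself can be sketched in a few lines; class_size_idx and class_size_std (the column index and the training-set standard deviation of the raw class-size variable) are hypothetical names introduced for this sketch.

```python
# Class-size counterfactual: +5 students for every test example (sketch).
import numpy as np

X_mod = X_test.copy()
X_mod[:, class_size_idx] += 5.0 / class_size_std     # +5 students on the standardized scale
impacts = model.predict(X_mod).ravel() - model.predict(X_test).ravel()
print(impacts.mean(), impacts.std())                 # heterogeneous, unlike the constant MLR shift
```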

5-Discussion
The feature selection procedure solved the multicollinearity problems concerning the variables measured simultaneously for the guardian and the parents, as well as the socioeconomic variables retrieved from Statistics Portugal. On the other hand, regularization techniques such as dropout and batch normalization had only a minor role in the deep MLP hyper-tuning optimization, seemingly coherent with the high-bias, knowledge-intensive model in question and its inherently low variance [6].
In terms of efficiency, the deep MLP has better results than the MLR, whatever the metric or approach. The deep MLP generalization error is smaller in student grade prediction, and its prediction intervals are more accurate. This added generalization ability is a hallmark of machine learning, particularly deep learning. Furthermore, the deep MLP gradient empirical distributions are primarily in line with the regression coefficient estimates of the MLR, pointing to a satisfactory MLR fit to the pattern embedded in the data. The relationship between the structure of the MLR regression coefficients and the deep MLP gradient empirical distributions corroborates the absence of significant specification distortions in the MLR, strengthening its results and inferences. In the presence of a strong nonlinear pattern, the divergence between the gradient structures would be accentuated. In fact, the deep MLP implementation turns out to be an extremely robust way to assess the adequacy and soundness of the MLR fit.
In terms of discrepancies between the resulting gradients and what would be expected according to the literature, it should be highlighted that teachers with fixed-term contracts tend to assign higher grades than teachers permanently attached to a school. AA seems to be negatively associated with lecturing experience, as older teachers closer to retirement tend to assign lower grades. However, care should be taken when interpreting this empirical result. Perhaps, with more lecturing experience, teachers tend to increase their stringency for excellence concerning student performance, resulting in lower grades for comparable attainments of AA. Female teachers also tend to assign lower grades, which can likewise be associated with stricter evaluation criteria. The AA of students belonging to the Chinese community highlights successful integration, sound economic and social endowments, and efficient support networks [83]. The mother being an employer does not seem to be a positive factor in the student's AA. This is an important empirical result because it is essential to ensure that women's empowerment in their aspirations, objectives, undertakings, and civic participation is followed by a benign paradigm change in terms of the balance between home and career affairs for both genders. Therefore, the father should reinforce his role at home, and career demands should not follow the more aggressive patterns of Western patriarchal society.
The deep MLP broadens the spectrum of possibilities and greets each individual specificity as a core element of the phenomenon by providing a quantum solution hinged on a universal approximator. For example, there is room for a critical factor with an average positive impact on students' grades to have a detrimental effect on a hypothetical individual example. In the case of a critical AA factor such as class size, for which the literature is unanimous regarding neither its importance nor its direction, the MLP formed three distinct clusters according to the individual gradients. The first cluster is formed by the students most likely to benefit from the increase, followed by those most likely to be indifferent to it, and the third cluster comprises the students most likely to be harmed by it. The gradients anticipate the likely response to a change in class size and should therefore be considered in decision-making processes and policy design.
The deep MLP can have a revolutionary effect on the social sciences in general and the educational sciences in particular. Deterministic mathematical functions cannot formalize social science conceptual relationships without an evident loss of explanatory and predictive power. The heterogeneity of responses to social phenomena is a pattern that should be accepted into social conceptual frameworks. Forging a quantitative basis that does not need a deterministic functional assumption and welcomes high levels of heterogeneity is a decisive breakthrough, clearly adequate for the complexity of social phenomena. The aim is not to increase complications. The objective is to use a quantum method of empirical inference and prediction that can anticipate the conceptual behaviour of phenomena, extending it to the complexity and heterogeneity that have always been the hallmark of the social sciences. Moreover, within this heterogeneity, it is possible to achieve the character of "new normality" in the presence of relational divergence between concepts and to enhance the ex-ante tools that can explain, anticipate, and resolve concrete inequities and discrepancies.

6-Conclusion
The high school grades attributed by teachers appeared to be negatively associated with lecturing experience. However, drawing conclusions about AA from this is not straightforward. For instance, a simple increase in teaching stringency as teachers grow older would result in lower grades for comparable AA attainments. Female teachers also tend to attribute lower grades, which can likewise be linked to stricter evaluation criteria. The mother being an employer is detrimental to student AA. It is of utmost importance to ensure that women's empowerment in their aspirations, objectives, undertakings, and civic participation is followed by an appropriate balance between home and career affairs for both genders.
The deep MLP is more efficient than the classical MLR in predicting students' grades. However, the adoption of deep learning as an experimental approach in the educational and social sciences also has remarkable advantages beyond its predictive capacity. We are dealing with a paradigm that does not depend on a specific mathematical form to express relationships between concepts and has a particular aptitude for representing social phenomena whose heterogeneity is paramount. The treatment of conceptual heterogeneity is undertaken naturally and spontaneously. By widening the spectrum of possibilities, deep learning introduces a capacity to anticipate nonconformities, which induces the search for fairer and more equitable policies. In deep learning, any policy measure that induces changes in the critical factors of AA is evaluated within the heterogeneous spectrum of both the possible outcomes and the underlying gradient structure. Deep learning recreates a quantum space of representation and explanation of phenomena that promotes a diversity of leads and accurate predictions. On the other hand, in the presence of more uniform realities, it establishes an intelligible relationship with the MLR and the classic meaning of its coefficients. The presence or absence of a strong empirical relationship between the deep learning gradients and the classic MLR coefficients is thus a robust means of assessing the correctness of the latter's implementation.

6-1-Limitations
As in any study of this nature, some limitations need to be acknowledged. The vast majority of the variables under consideration are categorical and do not directly measure the critical factors of AA. They are proxy variables with measurement biases. There is no variable associated with parental involvement or with the school environment and design. The target variable itself, being teacher-attributed grades rather than exam scores, is susceptible to issues such as differences in the stringency of teachers' assessment criteria. Adopting a data-driven approach to policy definition and design requires a substantial improvement in the quantity and quality of the data, so as to forge a capable and reliable education data system.

7-2-Data Availability Statement
Data were obtained from DGEEC-Direção Geral de Estatísticas da Educação e da Ciência and are available from the authors upon reasonable request with the permission of DGEEC-Direção Geral de Estatísticas da Educação e da Ciência.

7-5-Conflicts of Interest
The authors declare that there is no conflict of interests regarding the publication of this manuscript. In addition, the ethical issues, including plagiarism, informed consent, misconduct, data fabrication and/or falsification, double publication and/or submission, and redundancies have been completely observed by the authors.