Machine Learning Bias in Predicting High School Grades: A Knowledge Perspective

This study focuses on machine learning bias when predicting teacher grades. The experimental phase consists of predicting the grades of 11th and 12th grade Portuguese high school students and computing the bias and variance decomposition. In the base implementation, only the academic achievement critical factors are considered. In the second implementation, the preceding year's grade is appended as an input variable. The machine learning algorithms in use are random forest, support vector machine, and extreme gradient boosting machine. The poor performance of the machine learning algorithms stems from either the poor precision of the input space or the lack of a sound record of student performance. We introduce the new concept of knowledge bias and a new predictive model classification. Precision education would reduce bias by providing low-bias intensive-knowledge models. To avoid bias, it is not necessary to add knowledge to the input space. Low-bias extensive-knowledge models are achievable simply by appending the student's earlier performance record to the model. The low-bias intensive-knowledge learning models promoted by precision education are suited to designing new policies and actions toward academic attainment. If the aim is solely prediction, opting for a low-bias knowledge-extensive model can be appropriate and correct.

Precision education presupposes an extensive database of the critical factors that influence the students' Academic Achievement (AA). In addition to the introduction of biological factors, it requires sharpening the metrics currently in use and improving their representative intake. Advanced data analytics will be needed to evaluate their importance and influence on AA, and machine learning algorithms will be used extensively for their greater predictive ability. The comprehensive and continuous data collection that is paramount in the precision education framework is an extension of the ongoing datafication process of the 21st century digital economy, perceived as a perpetual cycle of capital accumulation [6].
As precision education arose with the prospect of seriously augmenting the predictive ability of machine learning algorithms to anticipate teachers' grades and test scores, the present study focuses on ascertaining the specificities of the machine learning bias in the AA scientific domain. Without neglecting the profound ethical issues in letting an algorithm alone shape the future of human beings [7], the lack of success in using predictive models to assign grades also seems to corroborate the appropriateness of studying the machine learning bias. A remarkable example is the 2020 International Baccalaureate final exam [8]. Due to the SARS-CoV-2 pandemic crisis, the International Baccalaureate, an educational organization from Geneva that offers a worldwide high school program, decided not to hold the final exam in 2020. Instead, the final scores were awarded by an algorithm that failed miserably, despite being allegedly based on the coursework and schools' predicted grades. Therefore, this study sheds light on both the structure of the machine learning bias that is bound to appear when predicting grades and the likely precision education effect on the performance of the algorithms.
We introduce the knowledge bias concept that fills an important gap in the predictive model classification. The knowledge space comprises every known and unknown critical factor that exerts some influence on the target concept [9]. The knowledge bias appears as the divergence between the input space composed of the actual critical factors in use and that theoretically optimal space. Depending on whether its knowledge bias is low or high, a model is classified as an intensive-knowledge or extensive-knowledge model, respectively. The latter is suited only to evaluate the execution of policies and actions in a post-inception phase. When conceiving and planning, only intensive-knowledge learning models are appropriate to assist the decision process of which critical factors should be swayed to produce the desired results. The knowledge bias is most important for classifying machine learning implementations in the social sciences, in which the longitudinal regularity of the target concept behaviour is stronger and the knowledge about the critical factors is weaker.
The conclusions are drawn from the simultaneous analysis of two different implementations: a base implementation relying on a feature space that includes only the variables related to the AA critical factors, and a second implementation in which the one-year lagged grade of the student is appended, emphasizing the influence of the student's historical path. Bearing that in mind, we carry out various random forest, support vector machine, and extreme gradient boosting machine regressor implementations not only to predict the grades (attributed by teachers) of 11th and 12th grade students in Portuguese public high schools but also to compute the bias and variance decomposition through a bootstrap procedure. In addition, we use the knowledge bias concept to feed the discussion and to build the conclusions. A Lasso procedure is used to select the input space variables, along with a random forest feature importance structure analysis to operationalize the concept of knowledge bias. The research questions are the following:
• What are the factors that can explain the underperformance of machine learning algorithms when predicting student grades?
• Is precision education bound to improve machine learning bias when predicting grades?
• Is the machine learning bias an unbiased indicator of the model embedded knowledge?
The remainder of this paper is organized as follows: Section two proceeds with an AA critical factors literature review and presents the machine learning implementations that are appearing in the domain; Section three describes the methodology, the machine learning algorithms, and the research process in detail. Section four begins by presenting the data and how they were collected and organized. Then the results are shown and interpreted concerning the hyperparameter optimization, the prediction, the bias and variance decomposition, and the knowledge intensity of the implementations. The duality between the implementations in terms of the generalization error and bias is demonstrated and compared with their incorporated knowledge. Section five discusses the results and introduces the knowledge bias concept as a means of differentiating effects and functions of the models. Finally, Section six presents the main conclusions and answers the research questions directly.

2-Literature Review
The literature has extensively confirmed the student's cognitive ability as the main determinant of AA [10,11]. However, on average, it leaves unexplained 51-75% of the total variance [12]. Males more often develop a negative peer attitude toward school [13], corroborating the empirical evidence of a gender gap in favour of females that reaches high visibility in linguistics, although lower in mathematics [14][15][16]. Indeed, a personal attitude with an adequate level of diligence, organization, focus, and resilience is conducive to overachievement [17]. The AA can vary according to ethnicity, as in the US, where white students seem to outperform consistently [18]. A similar gap can be found regarding immigrant groups [19]. Low Socioeconomic Status (SES) immigrant students from small communities whose parents have left their home countries due to political entanglements normally underperform [20].
Using the internet and a personal computer to make learning tasks easier, more attractive, and more diverse favours AA [21,22]. However, excessive use for leisure activities can be detrimental [23]. Parents' participation in school activities motivates their children to outperform [24,25] and is especially important amongst lower SES students [26]. Parental involvement forges a suitable and convenient attitude toward teachers and school tasks [27]. There is empirical evidence that supports a positive relationship between SES and AA [28,29], magnifying the role played by a convenient endowment of social and cultural capital. Steinmayr et al. (2010) [30] show that parents' education is positively associated with AA even after controlling for student intelligence and personality. Using a concept of SES that includes parental education and occupation, household size, and possessions, Tesfagiorgis et al. (2020) [31] conclude that there is a positive association between SES and AA. Tomul and Savasci (2012) [32] found that parental educational status and the average income per capita were important positive factors related to AA.
The association between AA and class size is not straightforward. Hoxby (2000) [33] estimated that class size does not have a statistically significant effect on AA. Krueger (1999) [34] found otherwise: class size has a generally negative effect on AA, which is stronger for minority students and those of lower SES. Wößmann and West (2006) [35] studied the effect of class size in 11 countries and concluded that its magnitude depends on the educational system itself and the teachers' lecturing abilities. In a less controversial stand, smaller schools seem to improve the academic outcomes of both lower SES students and those with greater learning needs [36,37]. Schneider (2002) [38] highlighted the importance of schools' indoor environmental conditions such as noise, light, temperature, and comfort for teachers and students alike to be properly motivated. Furthermore, the architectural features of the school should embody the expectations of the school participants [39].
Lecturing ability inferred by panel data fixed effects emerges as a positive factor on AA [40,41]. Rivkin et al. (2005) [42] concluded that the teachers' fixed effects on the 9th-grade math test score were substantial and educationally relevant. It is argued that the teacher's role in the AA is to a great extent related to unobservable personal characteristics and that the experience and education level of teachers have a minor role. In turn, Wayne & Youngs (2003) [43] add that teachers' college grades seem to be positively correlated to AA. Last, teacher quality has not only a short-term but also a long-term positive effect on the student academic outcomes [44].
In the AA literature some studies have used machine learning algorithms to substantiate their conclusions (Table 1). However, there is a clear preference for solving classification instead of regression problems [45], and to the best of the authors' knowledge no published studies addressing bias and variance decomposition exist.
Table 1 (surviving fragment). Sorensen (2019) [51]: 220,685 students; decision tree and support vector machine classifiers; school dropout. The preceding entry, whose other cells are missing, concerns evaluating the admission criteria of a Saudi university.

3-1-Supervised Learning Algorithms
Supervised learning consists of finding a mathematical function that efficiently maps the input space of the predictive variables into the output space of the target variables. In the learning phase, a supervised learning algorithm uses the actual association between input and output variables to build a machine able to approximate the target outputs from the input variables alone. Supervised learning is used for solving classification problems, in which the target variables are categorical, and regression problems, in which the target variables are continuous [52].
For each dataset, 70% of the examples were assigned to training and 30% to testing. The input variables were standardized on the training set, and the same transformation was subsequently applied to the corresponding test set. In the learning phase the model is built upon the training dataset and is further evaluated in terms of generalization error on the holdout test set. In parallel, a four-fold cross-validation procedure on the training set was carried out to evaluate its consistency with the test dataset. Furthermore, as the 10th high school year's dataset was used specifically for both the Lasso feature selection procedure and the hyperparameter tuning, the cross-validation and the bias and variance decomposition bootstrap are virtually unbiased.
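The split and standardization described above can be sketched as follows. This is a minimal illustration on synthetic data, not the study's code: the data, the random forest settings, and every variable name are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the grade data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(400, 6))
y = 10.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.0, size=400)

# 70% of the examples go to training and 30% to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# The standardization is estimated on the training set only and then
# applied unchanged to the test set.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Four-fold cross-validation on the training set, to be compared with
# the error measured on the holdout test set.
model = RandomForestRegressor(n_estimators=50, random_state=0)
cv_mae = -cross_val_score(
    model, X_train_std, y_train, cv=4, scoring="neg_mean_absolute_error"
)
model.fit(X_train_std, y_train)
test_mae = float(np.mean(np.abs(y_test - model.predict(X_test_std))))
```

Fitting the scaler on the training set alone keeps test-set information out of the learning phase, which is why the test features end up only approximately centred.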
Before the training phase, the algorithms' hyperparameters were optimized through a four-fold cross-validation procedure [53,54]. Once the hyperparameters to be optimized were selected, a search space was built, and a random grid search [55,56] was carried out. The hyperparameters' combination that maximizes the algorithms' four-fold average performance was picked and further used in training, evaluation, and the bias and variance decomposition. The algorithms' implementations follow the scikit-learn Python module documentation [57].
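A random search embedded in a four-fold cross-validation can be sketched with scikit-learn's `RandomizedSearchCV`. The search space, the number of trials, and the synthetic data below are illustrative assumptions, not the values used in the study.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

# Synthetic regression data (hypothetical, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=200)

# Hypothetical search space for an RBF-kernel SVR.
search_space = {
    "C": loguniform(1e-1, 1e2),
    "gamma": loguniform(1e-3, 1e0),
}

# Each trial draws one hyperparameter combination at random and scores it
# by its average performance over four cross-validation folds.
search = RandomizedSearchCV(
    SVR(kernel="rbf"),
    param_distributions=search_space,
    n_iter=20,
    cv=4,
    scoring="neg_mean_absolute_error",
    random_state=1,
)
search.fit(X, y)

best_params = search.best_params_
best_cv_mae = -search.best_score_  # the scorer is negated, so flip the sign
```

Because scikit-learn scorers follow a "greater is better" convention, maximizing the negated MAE is the same as minimizing the cross-validation MAE.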

3-1-1-Random Forest
The Random Forest (RF) [58] is a randomized decision tree ensemble resulting from a bootstrap aggregating procedure. In the decision tree algorithm the input space is split successively in a way that minimizes a cost function, normally purity-linked in the case of classification and pattern recognition, or the mean square error in the case of regression. In each step, usually a pair of new nodes representing two different subsets of the input space is created. In a randomized decision tree the input variables that take part in the optimized split decision are selected randomly [59]. The partition process ends when the cost function gains are no longer perceived as significant. The final nodes are called leaves and deliver the decision rules guiding the target variable estimation and prediction. The random forest aggregates the randomized decision trees by majority vote in the case of classification or by computing the mean of their scores in the case of regression.
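The aggregation rule for regression can be checked directly: in scikit-learn, the forest's prediction is the mean of its trees' predictions. The sketch below uses synthetic data and arbitrary settings chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (hypothetical, for illustration only).
rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(300, 4))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(
    n_estimators=25,
    max_features=0.5,  # random subset of input variables tried at each split
    bootstrap=True,    # each tree is trained on a bootstrap sample
    random_state=2,
).fit(X, y)

# Collect every tree's prediction and average them by hand.
tree_predictions = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_mean = tree_predictions.mean(axis=0)
forest_prediction = forest.predict(X)
```

Here `manual_mean` and `forest_prediction` agree up to floating-point error, which is the "mean of the scores" aggregation described above.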

3-1-2-Support Vector Regression
The Support Vector Regression (SVR) algorithm's main intention is to find a function that approximates a continuous target variable with a deviation not exceeding ε ∈ ℝ+ [60]. In the soft margin SVR, some flexibility is added that augments the algorithm's generalization ability by allowing a deviation beyond ε, at a cost of C, through the introduction of slack variables ξ ≥ 0. For the primal form of the SVR optimization problem see, e.g., Mohri et al. (2018) [53].
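For reference, the soft margin primal problem is commonly written as below. The notation is the standard textbook one and is an assumption here, not taken from the paper: $\mathbf{w}$ and $b$ are the weights and intercept, $\Phi$ is the feature map, $\varepsilon$ is the tube half-width, $C$ is the penalty, and $\xi_i, \xi_i^{*}$ are the slack variables for deviations above and below the tube.

```latex
\begin{aligned}
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi},\, \boldsymbol{\xi}^{*}} \quad
  & \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
    + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^{*} \right) \\
\text{subject to} \quad
  & y_i - \left( \mathbf{w} \cdot \Phi(\mathbf{x}_i) + b \right) \le \varepsilon + \xi_i, \\
  & \left( \mathbf{w} \cdot \Phi(\mathbf{x}_i) + b \right) - y_i \le \varepsilon + \xi_i^{*}, \\
  & \xi_i \ge 0, \quad \xi_i^{*} \ge 0, \qquad i = 1, \dots, n.
\end{aligned}
```

A larger $C$ penalizes deviations beyond the tube more heavily, while a larger $\varepsilon$ tolerates more error at no cost.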
The SVR Lagrange multiplier dual form of the mathematical optimization problem [61] highlights two fundamental characteristics of the algorithm. The approximated function depends solely on the inner products between the examples that lie outside the ε-tube (the support vectors) and every actual example, whichever feature space is used to represent them. In our case, and to add a nonlinear character to the approximation, the Gaussian radial basis function (RBF) kernel was applied to compute the inner products of an extrapolated infinite-dimensional space.
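The dependence on kernel values alone can be illustrated with scikit-learn's `SVR`: its prediction can be rebuilt from the dual coefficients, the support vectors, and a hand-written Gaussian RBF kernel. The helper `gaussian_rbf`, the data, and the hyperparameter values are hypothetical choices for this sketch.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVR


def gaussian_rbf(A, B, gamma):
    """Gaussian RBF kernel matrix: exp(-gamma * ||a - b||^2) for all pairs."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


# Synthetic regression data (hypothetical, for illustration only).
rng = np.random.default_rng(3)
X = rng.normal(size=(120, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=120)

gamma = 0.5
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=gamma).fit(X, y)

# Dual form: the prediction is a weighted sum of kernel values between
# the support vectors and the query points, plus an intercept.
K = gaussian_rbf(svr.support_vectors_, X, gamma)
manual_prediction = (svr.dual_coef_ @ K + svr.intercept_).ravel()
```

No explicit coordinates in the infinite-dimensional feature space are ever computed; only the kernel matrix `K` is needed.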

3-1-3-Extreme Gradient Boosting Machine
Boosting, like bagging, is a machine learning ensemble method. It consists of building a strong learner by training several weak learners on different training sets [62]. The main differences from bagging lie in both the training set resampling process, which is built specifically to generate complementary learning, and the assignment of the weak learners' weights, which is based on performance [63]. Essentially, and contrary to the case of bagging, the sample probability distribution is changed in each iteration to allow the next weak learners to focus on reducing the bias in the preceding worst-performing examples. The gradient boosting machine [54,64] creates a chain in which each weak learner is moulded to minimize the generalization error of the previous iteration. In our case, the weak learners are regression decision trees, and the loss function is the square loss. To improve robustness, the extreme gradient boosting (XGB) machine [65] adds to the gradient boosting decision tree framework two regularization hyperparameters that control the size of the trees and the magnitude of their scores.
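The chaining idea can be sketched with plain gradient boosting under the square loss, where each shallow regression tree is fitted to the residuals left by the previous iterations. This is a simplified illustration and not the XGB algorithm itself: it omits the regularization terms, and the data, tree depth, learning rate, and number of rounds are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical, for illustration only).
rng = np.random.default_rng(4)
X = rng.uniform(-2.0, 2.0, size=(300, 2))
y = X[:, 0] ** 2 + np.sin(2.0 * X[:, 1]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.3
n_rounds = 30

# Start from the constant prediction that minimizes the square loss.
prediction = np.full_like(y, y.mean())
train_mse = [float(np.mean((y - prediction) ** 2))]
trees = []

for _ in range(n_rounds):
    # For the square loss, the negative gradient is the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Each weak learner corrects part of the error left by its predecessors.
    prediction = prediction + learning_rate * tree.predict(X)
    trees.append(tree)
    train_mse.append(float(np.mean((y - prediction) ** 2)))
```

The training error recorded in `train_mse` never increases from one round to the next, which illustrates the bias-reduction focus of boosting and also why it can end up fitting noise.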

3-2-Bias and Variance Decomposition for Regression
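The following bias and variance decomposition is based on Mehta et al. (2019) [66]. Consider a target random variable $y$ that can be approximated from a vector of independent variables $\mathbf{x}$ as follows:

$$y = f(\mathbf{x}; \boldsymbol{\theta}) + \varepsilon \tag{1}$$

where $\varepsilon$ is an irreducible stochastic term, $f$ is the unknown real function that maps $\mathbf{x}$ into $y$, and $\boldsymbol{\theta}$ is a vector of parameters.
Suppose that a dataset $\mathcal{D} = (\mathbf{X}, \mathbf{y})$ was randomly drawn from the population and a statistical learning procedure was carried out to estimate $\boldsymbol{\theta}$. In regression, the square error is normally elected as the estimation cost function:

$$C(\mathbf{y}, f(\mathbf{X}; \boldsymbol{\theta})) = \sum_i \left( y_i - f(\mathbf{x}_i; \boldsymbol{\theta}) \right)^2 \tag{2}$$

The optimization problem underlying the parameters' estimation can be formalized as follows:

$$\hat{\boldsymbol{\theta}}_{\mathcal{D}} = \arg\min_{\boldsymbol{\theta}} C(\mathbf{y}, f(\mathbf{X}; \boldsymbol{\theta})) \tag{3}$$

Every dataset $\mathcal{D} = (\mathbf{X}, \mathbf{y})$ that can be randomly drawn from the population produces a different $\hat{\boldsymbol{\theta}}_{\mathcal{D}}$ and a specific value for the cost function. The cost function expected value for unseen data prediction, i.e., not belonging to the actual $\mathcal{D} = (\mathbf{X}, \mathbf{y})$ that was used to learn, comes as follows:

$$E_{\mathcal{D},\varepsilon}\left[ C(\mathbf{y}, f(\mathbf{X}; \hat{\boldsymbol{\theta}}_{\mathcal{D}})) \right] = \underbrace{\sum_i \left( f(\mathbf{x}_i; \boldsymbol{\theta}) - E_{\mathcal{D}}\left[ f(\mathbf{x}_i; \hat{\boldsymbol{\theta}}_{\mathcal{D}}) \right] \right)^2}_{\text{bias}^2} + \underbrace{\sum_i E_{\mathcal{D}}\left[ \left( f(\mathbf{x}_i; \hat{\boldsymbol{\theta}}_{\mathcal{D}}) - E_{\mathcal{D}}\left[ f(\mathbf{x}_i; \hat{\boldsymbol{\theta}}_{\mathcal{D}}) \right] \right)^2 \right]}_{\text{variance}} + \underbrace{\sum_i \sigma_{\varepsilon}^2}_{\text{irreducible variance}} \tag{4}$$

where $\sigma_{\varepsilon}^2$ is the variance of $\varepsilon$ and the bias measures the deviation of the model's expected value relative to the true value. In turn, the variance measures the sensitivity of the model estimates to sample variations. Finally, the irreducible variance refers to the structural noise that is inherent to the target variable. There is an empirical trade-off between bias and variance [54].
Although complex functions are being approximated, small training sets may call for simple models that, although asymptotically biased, perform better on unseen data. A bootstrap [67] of 200 samples of the training dataset was employed, and the bias and variance decomposition was computed on the applicable test dataset. The mean square error cost function was decomposed instead of the square error, with $n$ denoting the number of test examples:

$$\mathrm{MSE} = \frac{1}{n} \sum_i \left( f(\mathbf{x}_i; \boldsymbol{\theta}) - E_{\mathcal{D}}\left[ f(\mathbf{x}_i; \hat{\boldsymbol{\theta}}_{\mathcal{D}}) \right] \right)^2 + \frac{1}{n} \sum_i E_{\mathcal{D}}\left[ \left( f(\mathbf{x}_i; \hat{\boldsymbol{\theta}}_{\mathcal{D}}) - E_{\mathcal{D}}\left[ f(\mathbf{x}_i; \hat{\boldsymbol{\theta}}_{\mathcal{D}}) \right] \right)^2 \right] + \sigma_{\varepsilon}^2 \tag{5}$$

Note that as the true function $f(\mathbf{x}; \boldsymbol{\theta})$ in Equation 5 is unknown, the bias and the irreducible variance cannot be empirically separated.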
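A bootstrap estimate of this decomposition can be sketched as follows. Since the true function is unknown, the sketch reports the bias and the irreducible variance as a single term, obtained as the MSE minus the variance of the predictions. The function `bootstrap_decomposition`, the decision tree learner, and the synthetic data are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def bootstrap_decomposition(make_model, X_train, y_train, X_test, y_test,
                            n_bootstrap=200, seed=0):
    """Estimate MSE, bias (plus irreducible variance), and variance."""
    rng = np.random.default_rng(seed)
    n_train = X_train.shape[0]
    predictions = np.empty((n_bootstrap, X_test.shape[0]))
    for b in range(n_bootstrap):
        # Resample the training set with replacement and refit the model.
        idx = rng.integers(0, n_train, size=n_train)
        model = make_model().fit(X_train[idx], y_train[idx])
        predictions[b] = model.predict(X_test)
    # Average square error over all bootstrap models and test examples.
    mse = float(np.mean((predictions - y_test) ** 2))
    # Spread of the predictions around their own mean, per test example.
    variance = float(np.mean(predictions.var(axis=0)))
    # What remains is the bias together with the irreducible variance.
    bias_plus_noise = mse - variance
    return mse, bias_plus_noise, variance, predictions


# Synthetic regression data (hypothetical, for illustration only).
rng = np.random.default_rng(5)
X = rng.uniform(-2.0, 2.0, size=(300, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.5, size=300)
X_train, y_train = X[:210], y[:210]
X_test, y_test = X[210:], y[210:]

mse, bias_plus_noise, variance, predictions = bootstrap_decomposition(
    lambda: DecisionTreeRegressor(max_depth=4, random_state=0),
    X_train, y_train, X_test, y_test,
)
```

With the population variance used here, the remaining term equals the mean square deviation between the averaged prediction and the observed target, so the three quantities add up exactly.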

3-3-Feature Selection
Before the algorithms' hyperparameter tuning, an optimization procedure of the input space was undertaken, consisting of selecting the predictive variables according to the strength of their association with the target variable. The Lasso multilinear regression model [68] was used, comprising a classic multilinear regression and an L1 norm regularization term that exerts some pressure on the less important regression coefficients to converge to zero.
Through a four-fold cross-validation grid search procedure, the model with the highest shrinkage pressure λ whose cost function was not higher than the optimum plus its cross-validation standard deviation was picked, and the variables with null estimated coefficients were subsequently discarded. The model knowledge intensity can be inferred from the input space dimension, that is, the number of critical factors in the model.
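The selection rule can be sketched with scikit-learn's `LassoCV`, whose regularization parameter `alpha` plays the role of the shrinkage pressure λ. The grid of candidate values, the synthetic data, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

# Synthetic data in which only the first four variables matter
# (hypothetical, for illustration only).
rng = np.random.default_rng(6)
n_samples, n_features = 300, 20
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:4] = [3.0, -2.0, 1.5, 1.0]
y = X @ true_coef + rng.normal(scale=1.0, size=n_samples)

# Four-fold cross-validation over a grid of shrinkage pressures.
alphas = np.logspace(-3, 1, 40)
cv_fit = LassoCV(alphas=alphas, cv=4).fit(X, y)

# Mean and standard deviation of the cross-validation MSE per alpha.
mean_mse = cv_fit.mse_path_.mean(axis=1)
std_mse = cv_fit.mse_path_.std(axis=1)

# Pick the largest alpha whose cost does not exceed the optimum plus
# the cross-validation standard deviation at the optimum.
best = int(np.argmin(mean_mse))
threshold = mean_mse[best] + std_mse[best]
alpha_selected = float(cv_fit.alphas_[mean_mse <= threshold].max())

# Refit with the selected alpha and discard variables with null coefficients.
selector = Lasso(alpha=alpha_selected).fit(X, y)
kept = np.flatnonzero(selector.coef_ != 0.0)
dropped = np.flatnonzero(selector.coef_ == 0.0)
```

Choosing the largest eligible value rather than the optimum itself favours a sparser input space at a small, statistically indistinguishable cost in error.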

3-4-Methodology Steps
The order of the methodology's steps is the following (Figure 1):
• To select the variables of the input space, we used the Lasso multilinear regression model, the base implementation, and the 10th-grade dataset.
• To take into account any latent procedural bias, we used three different machine learning algorithms: the random forest, the support vector regression, and the extreme gradient boosting machine. As the first is a bootstrapping ensemble, the second is a kernelized linear model, and the third is a boosting ensemble, we believe that together they constitute a comprehensive set of algorithms.
• To tune the hyperparameters, we performed the following sub-steps using the base implementation and the 10th-grade dataset:
  o We built a search space of hyperparameters to be optimized.
  o Then, we carried out a random grid search embedded in a four-fold cross-validation procedure.
  o Finally, we selected the hyperparameters' combination that minimizes the algorithms' four-fold average mean absolute error (MAE).
• The training-test split was carried out at the grade level, assigning 70% of the examples to training and 30% to testing. The training dataset was standardized and the test dataset was transformed accordingly.
• The models were trained and their generalization error computed on the holdout test set. In addition, a four-fold cross-validation on the training set was used to evaluate its consistency with the test set.
• We made use of a bootstrap procedure to compute the bias and variance decomposition:
  o We generated 200 models from 200 subsamples of the training dataset [69].
  o With those models, we predicted the grades of the test dataset 200 times.
  o Then, we computed the mean square error (MSE) and the variance of those predictions.
  o Finally, we assigned to bias the difference between them.
• The knowledge intensity of a model was deduced from the number of relevant variables that are associated with the critical factors and from the structure of the random forest feature importance. We applied the Lasso multilinear regression model to the entire set of variables, using both base and second implementations and both 11th-grade and 12th-grade datasets, aiming at finding the variables that are sufficiently important to participate in the learning model. Subsequently, we computed their random forest feature importance and aggregated them according to the related critical factor.
We specifically used the 10th-grade dataset to select the variables and to tune the hyperparameters to ensure the robustness of the bias and variance decompositions.

4-1-Data
The experimental data come mainly from the information system of the Directorate-General for Statistics of Education and Science of the Portuguese Ministry of Education. The system was designed to assist the administrative management of the Portuguese public education system and to store information about students, schools, and teachers from pre-school and basic to high school. Through a series of Microsoft® SQL Server Management Studio queries it was possible to build a global dataset consisting of 96,346 grades from 10,364 high school historical student paths. It includes observations from the 2014-2015 to 2017-2018 academic years. The subjects were aggregated into four classes: Portuguese language, foreign languages, quantitative and natural sciences, and human and social sciences. A split into 10th, 11th, and 12th grades was also carried out to feed the intended implementations (see Table 2). The dataset is composed of 40 features that are related to the AA critical factors identified in the literature review (see Table 3 and Annex for full feature description). The family non-classic dwellings, the collective dwellings, the literacy rate, the post-secondary schooling rate, the primary sector importance, the secondary sector importance, and the unemployment rate were retrieved from Statistics Portugal. Given the one-hot encoding of the categorical features, the number of predictive variables available to be selected by the Lasso filter added up to 120.

4-2-1-Feature Selection
A shrinkage pressure λ of 0.02 was used for the feature selection and 56 variables were subsequently dropped (Figure 2). The most important dropped variables were the internet usage, parish literacy rate, post-secondary schooling rate, and primary sector importance. The internet usage is strongly correlated with the computer usage, and the shrinkage pressure tends to reject the weaker of the two. The parish literacy rate, post-secondary schooling rate, and primary sector importance belong to a set of seven SES variables retrieved from Statistics Portugal. The dropping of the other variables corresponds to the clustering of homogeneous feature categories in terms of effect on the AA.

4-2-2-Hyperparameter Optimization
The initial search space and the four-fold cross-validation random grid search results are shown in Table 4. The random grid search had 200 trials for each algorithm. The goal of the procedure is to minimize the cross-validation mean absolute error. According to the hyperparameter optimization procedure, the RFs were built from a 100% bootstrap of 420 trees. Two restrictions were imposed. First, the minimum number of examples required to be at a leaf could not be less than 0.009 of the dataset's length. Second, the minimum number of samples required to split an internal node could not be less than 0.001. The SVR hyperparameter optimization procedure set the penalty C to 9.541 and the RBF kernel coefficient to 0.004. Concerning the XGB, the procedure set the number of trees to 156, the subsample and column subsample to 1, the maximum tree depth to 20, the boosting learning rate to 0.42, the L2 regularization term on weights λ to 0.4, and the minimum number of instances in a child to 131. The XGB performance was substantially improved by the hyperparameter optimization, as shown by the large dispersion of the trial points on the scatter plot of Figure 3. The RF performance did not change greatly from trial to trial, inducing a concentrated cloud of points in the scatter plot. The SVR behaved more in line with the RF than with the XGB, despite exhibiting a tendency toward greater overfitting.

Figure 3. Random search trials
The average performances of the three algorithms were very similar (Figure 4). The RF had the smallest average MAE and the XGB the largest. In contrast, the MAE of the RF best trial, the elected hyperparameter combination, was 2.0377, while that of the XGB was only 1.9073. The SVR fell in the middle with 2.0337. The flatness of the XGB empirical distribution curve in Figure 4 also highlights the bias focus of the algorithm. The elected hyperparameter combinations lie within the search spaces, far from the edges, ensuring that at least a local optimum was reached.

4-2-3-Prediction Training Phase
To evaluate the algorithms' performance, the MSE, the MAE, and the coefficient of determination (R²) are shown in Table 5. It is apparent that the second implementation, which includes the lagged student grade as an input variable, overwhelmingly outperforms the base implementation, which considers only the critical factors. The base implementation led to poor fits to the training data. On the other hand, the second implementation reaches a good accuracy level. This is true regardless of which algorithm is considered. The XGB has the best results overall, and its edge is much more pronounced in the base implementation. Boosting is a machine learning method whose principal objective is to reduce bias, even if it is more prone to overfitting. The RF comes next, being surpassed by the SVR only in the 12th-year base implementation.
The duality between base and second implementations in favour of the latter is well represented in Figure 5. Only the XGB shortens the distance between both implementations. However, it does so by overfitting the training data, and the gain does not translate into generalization ability.
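The three metrics reported in Table 5 can be computed as in the sketch below. The grade values are hypothetical and serve only to show the calculation.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical observed and predicted grades, for illustration only.
y_true = np.array([12.0, 15.0, 9.0, 18.0, 14.0])
y_pred = np.array([13.0, 14.0, 11.0, 17.0, 14.0])

mse = mean_squared_error(y_true, y_pred)   # mean of the squared errors
mae = mean_absolute_error(y_true, y_pred)  # mean of the absolute errors
r2 = r2_score(y_true, y_pred)              # 1 - SS_residual / SS_total

# The same quantities written out explicitly.
errors = y_true - y_pred
mse_manual = float(np.mean(errors**2))
mae_manual = float(np.mean(np.abs(errors)))
r2_manual = 1.0 - float(np.sum(errors**2)) / float(
    np.sum((y_true - y_true.mean()) ** 2)
)
```

The MSE penalizes large errors more heavily than the MAE, while R² expresses the error relative to the variance of the observed grades.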

4-2-4-Prediction Test Phase
The test results are shown in Table 6. They are poorer than the training results, highlighting the existence of overfitting. Figure 6 illustrates the difference between the train and test phases. The deterioration is generally more acute in the base implementation. Every algorithm exhibits at least some overfitting, but it is intense in the XGB case, especially in the base implementation, which is invariably located in the graphs' upper right corner. The second implementation still presents an appropriate accuracy and seems to yield a good level of robustness.

Figure 6. Overfitting and train-test gap.
In the base implementation the XGB training edge is significantly shortened, and in the second implementation it virtually disappears. Indeed, the SVR even takes the lead in the 12th-year second implementation. The training four-fold cross-validation results converge with the test results, as both the feature selection and the hyperparameter optimization were undertaken on the 10th-year base implementation dataset. The XGB cross-validation standard error is in line with those of the RF and SVR, indicating that the strong overperformance of the XGB base implementation in training and its further fall in the test evaluation are almost certainly due to noise retention. When predicting student grades, the second implementation is better than the base implementation regardless of which algorithm is taken. The duality between the implementations deepens in the test phase as we evaluate the generalization ability of the algorithms on unseen data. The XGB test results are not blurred by overfitting issues and end up converging to the other algorithms' performances. In Figure 7 the dual zones of the base and second implementations are much clearer.

4-2-5-Bias and Variance Decomposition
As the irreducible variance and the target variable stochastic process are not supposed to vary with the implementations, the aggregate of the bias and the irreducible variance is hereafter referred to as bias.
As in the prediction, the second implementation provides a pronounced improvement over the base implementation, with an MSE maximum decline of 71.10% in the 12th year SVR and a minimum of 63.18% in the 11th year XGB (see Table 7). The decrease in the bias explains a major percentage of the MSE improvement, reaching a maximum of 98.61% in the RF and a minimum of 82.25% in the XGB, both for the 11th year. Though far from being decisive, the variance also decreases, contributing to the MSE improvement (see Figure 8).

Figure 8 (panels: MSE, bias, and variance).
The best bias results correspond to the XGB implementations, which is consistent with the main purpose of the boosting technique. In turn, the RF presents the best MSE results, which were essentially built upon the variance performance that the inherent RF bootstrap is meant to provide. The SVR improves performance in the second implementation and is quite effective in adapting to the strong signal of the lagged teacher grade. The duality between the base and second implementation generalization ability described in the prediction sections corresponds to a bias duality in the bias and variance decomposition (see Figure 9). The duality in terms of variance does not appear perfect because the XGB variance compares poorly with that of any other algorithm implementation, a classic example of the well-known bias and variance trade-off [70].

Figure 9. Base and second implementation duality (panels: MSE, bias, and variance).
Table 8 shows the knowledge incorporated in the different implementations per AA critical factor. (Subjects) refers to the classes presented in Table 2 and is not related to any critical factor. The base implementations of the 11th and 12th grades have 49 and 52 input variables respectively, contrasting with the 16 and 25 input variables of the second implementations (Table 8). Due to the introduction of the lagged teacher grade as a predictive variable, the Lasso method of selecting relevant input variables discards a much larger number of predictive variables associated with AA critical factors. Through the analysis of the RF feature importance structure, it is concluded that the critical factors that most contribute to the final solution in the base implementations are the cognitive ability and the SES. However, the importance of the lagged teacher grade, 96.7% for the 11th year and 96.2% for the 12th year, overpowers any contribution of the critical factors to the final solution in the second implementations. Thus, the base implementations are considered knowledge-intensive when compared to the second implementations. The graphs in Figure 10 were built upon the first component of a principal components analysis of the Lasso variables and the RF feature importance. It is strongly and positively correlated with both variables and explains 91.3% of total variance. Concerning knowledge (Figure 10) there is also a duality between the base and second implementation. However, in this case, the base implementation takes the lead and incorporates more knowledge than the second implementation.

Figure 10 (panels: Cognition, gender, and ethnics; Computer, internet, and SES; School, class size, and lecturing).

5-1-Discussion
In machine learning, bias can refer to any factor, embedded either in the algorithm architecture or in the concept representation form, which leads to preferring one learning generalization over another in a way that is inconsistent with the ground knowledge of the experimental examples [71]. The procedural bias, or algorithm bias, concerns the appropriateness of the search heuristics' preferences for the paths and approaches that assist the learning process. One example is the problem of structural bias, which consists of the inability of evolutionary algorithms to carry out an impartial search that includes every part of the search space [72]. A set of well-known state-of-the-art machine learning algorithms (RF, SVR, and XGB) was purposefully called on to factor in the procedural bias. Its influence can be regarded as negligible, as the algorithms' performances are quite similar throughout the implementations.
The representational bias concerns the adequacy of the search space to define, explain, and predict the target concept [73]. The dataset bias problem of the image object detector domain, which limits the generalization ability to test datasets within the learning source, is an example of representational bias [74]. Another example can be found in the size of Big Data datasets extracted from digital platforms, which often leads researchers to generalize their conclusions to the entire population when in fact the datasets represent only individuals with a special propensity to use those platforms [75]. The second implementation presents an adequate performance in terms of generalization error and bias, despite the small role of the AA critical factors in the definition of its input space. Its accuracy is built upon the student's historical path. The base implementation shows poor performance in terms of generalization error and bias due to the lack of precision in the measurement of the critical factors. In the current study, the poor performance of machine learning algorithms when predicting student grades is related to the input space's poor precision and the lack of a sound student historical path. Indeed, the representational bias is set to a minimum when the search space captures every nuance of the target concept. However, it is not decisive whether the search space is established upon a deferred measurement of the same target concept or upon a comprehensive knowledge and precise measurement of its determinants.
The concept of knowledge bias refers to the gap between the target concept knowledge space and the input search space. The former can have unknown dimensions and includes every element that affects the target concept. In turn, the input search space normally has only a subset of those elements, adding knowledge bias to the learning model. The concept of knowledge bias is pivotal to framing the effect of precision education on machine learning bias. The base implementations have poorer performance and wider machine learning bias relative to the second implementations. However, the knowledge bias is weaker in the former, as the base search space invariably has more critical factor components with greater RF feature importance. Therefore, it is possible to avoid machine learning bias and augment the generalization ability of a model without adding knowledge. Precision education would improve the machine learning bias through a decrease in knowledge bias. More precisely, precision education would mostly improve high-bias knowledge-intensive machine learning models, and the effect on low-bias knowledge-extensive models such as the second implementation would be marginal. By no means is its role diminished. First, it is worth mentioning that the expansion of knowledge about the AA critical factors is important in the design of novel conceptualizations in the AA domain [9]. Second, low-bias knowledge-intensive learning models are crucial to designing new policies and actions, as the goal is to mould the critical factors in a way that is conducive to AA attainments. Last, low-bias knowledge-extensive learning models are suitable for evaluating the same policies and actions, but only in a post-design phase. Indeed, they do not assist the education stakeholders in the design of policies, as low-bias knowledge-intensive learning models do. On the other hand, as long as there is irreducible variance, the grades predicted by any algorithm have an ever-present quantum bias. The essential foundation of the individual resides in the quantum, and student evaluation through real-life assessments is a way to ensure the freedom of being.

5-2-Limitations
This study has several limitations. Cognitive ability is not directly represented by student intelligence quotient data, and there is no measure of the student attitude toward school activities or of the corresponding parental involvement. The set of SES variables does not include income data or family size. Furthermore, the comfort and adequacy of the school infrastructure and the teachers' lecturing abilities are also omitted. The lack of depth and scope in the dataset can explain a non-significant part of the performance differences reported in the results. The adoption of a precise and data-driven approach in the management and storage of education data is a cornerstone of the implementation of a precision education framework.

6-Conclusion
As for the first research question, we conclude that the poor performance of machine learning algorithms when predicting student grades is related to the input space's poor precision and the lack of a sound student historical path. To anticipate students' grades through a machine learning implementation, we must collect either a comprehensive dataset that includes the entire range of the critical factors or the most recent preceding grades. On the other hand, the information systems that support the national education cluster should be designed in such a way as to allow every important piece of information about the AA critical factors to be collected. This background is much needed if the aim is to implement machine learning models that would be decisive both in educational policy planning and in the decision-making process of the educational stakeholders. Regarding the second research question, precision education would mostly improve high-bias knowledge-intensive machine learning models, and the effect on low-bias knowledge-extensive models such as the second implementation would be marginal. If the education stakeholders' objective is to design policies and define new actions, a low-bias knowledge-intensive model, the search space of which is formed by every critical factor, is almost mandatory, as it produces less biased estimates of the effects of the critical factors. The adoption of the precision education framework can provide such models. If the aim is to anticipate students' grades, a knowledge-extensive model can be sufficient and appropriate, depending solely on the generalization error it conveys.
Concerning the third research question, the second implementation has a greater knowledge bias when compared to the base implementation even though it has a lower machine learning bias. Therefore, it is possible to reduce the machine learning bias without adding knowledge to the learning model. It can be accomplished by simple deferred observation of the target concept.

7-2-Data Availability Statement
3rd Party Data: Restrictions apply to the availability of these data. Data were obtained from DGEEC-Direção Geral de Estatísticas da Educação e da Ciência and are available from the authors upon reasonable request with the permission of DGEEC-Direção Geral de Estatísticas da Educação e da Ciência.

7-4-Conflicts of Interest
The authors declare that there is no conflict of interests regarding the publication of this manuscript. In addition, the ethical issues, including plagiarism, informed consent, misconduct, data fabrication and/or falsification, double publication and/or submission, and redundancies have been completely observed by the authors.