Missing data analysis using machine learning methods to predict the performance of technical students

Machine learning (ML) has become an emerging technology able to solve problems in many areas, including education, medicine, robotics and aerospace. ML is a specific field of artificial intelligence which designs computational models able to learn from data. However, to develop an ML model, it is necessary to ensure data quality, since real-world data are often incomplete, noisy and inconsistent. This paper evaluates state-of-the-art missing data treatment methods using ML algorithms to classify the performance of technical high school students at the Federal Institute of Goiás in Brazil. The aim is to provide an efficient computational tool to aid educational performance that allows educators to verify a student's tendency to fail. The results indicate that the ignoring and discarding method outperforms the other missing data treatment methods. Moreover, the tests reveal that Sequential Minimal Optimization, Neural Networks and Bagging outperform the other ML algorithms, such as Naive Bayes and Decision Tree, in terms of classification accuracy.


Introduction
Machine Learning (ML) is concerned with the question of how to build computer programs that learn and improve automatically through experience (Jordan and Mitchell, 2015). Nowadays, ML has become ubiquitous and indispensable for solving complex problems in most areas of science (Obermeyer and Emanuel, 2016; de Miranda et al., 2016). For example, Jean et al. (2016) combined satellite imagery and ML algorithms to predict poverty in Africa. Reviews of ML applications to analyze genome sequencing data and to support the diagnosis of diseases are provided by Libbrecht and Noble (2015) and Tagaris et al. (2018), respectively. Ahmad et al. (2018) developed a tutorial that covers the definitions, nuances, challenges, and requirements for the design of interpretable and explainable machine learning models and systems in healthcare.
ML has also been used as a tool for decision making, prediction and optimization in the area of education. de Melo et al. (2017) proposed ML algorithms to predict educational performance using the database of an education institution. The proposed algorithms allow the education professional, even in the first months of the school year, to verify a student's tendency to fail.
Similar work applying ML in education includes the following: Minaei-Bidgoli et al. (2003) present an approach to classifying students in order to predict their final grade based on features extracted from logged data in a Web-based education system; Kolo et al. (2015) use a decision tree approach for predicting students' academic performance; Yukselturk et al. (2014) use four ML algorithms to classify students who dropped out of school; Ayinde et al. (2013) identify interesting patterns in educational data that could contribute to predicting student performance; and Kumar et al. (2011) use ML algorithms to predict the performance of students in their final exam.
The development of ML algorithms involves some difficulties. For example, the performance of an ML algorithm depends on the quality of the data employed during its development. Additionally, in real data sets, noisy, missing and unreliable samples are common. For this reason, pre-processing techniques, such as outlier removal and missing data treatment, are necessary to handle these problems. Usually, missing values occur because data are forgotten or lost, because certain values are not applicable for a given variable, or because the designer of the data does not care about the values (Soares, 2015). Missing values occur in several applications, so several missing data treatment techniques have been proposed in the literature (Zhu et al., 2018). The most common technique is the ignoring and discarding approach, which discards all samples with missing values. However, this technique is not viable when the data set is small; in that situation, other techniques (such as mean or median substitution, and linear interpolation) are preferred. This paper evaluates a number of missing data treatment techniques using state-of-the-art ML algorithms in order to predict students' performance in technical high school education at the Federal Institute of Goiás in Brazil. The main objective is to provide an efficient and valuable computational tool to aid educational performance that allows educators to verify a student's tendency to fail. Since the data set available to predict the students' performance contains missing values, this paper investigates and evaluates missing data treatment techniques for designing ML algorithms. The experimental results reveal that, for this case study, the ignoring and discarding approach outperforms the other missing data treatment techniques when applied with most ML algorithms.
The main contributions of this paper are: (i) proposal and evaluation of state-of-the-art missing data treatment methods using a case study to predict student performance; and (ii) evaluation of state-of-the-art ML algorithms (including an ensemble learning algorithm) using state-of-the-art missing data treatment methods and this real case study.
The paper is organized as follows: Section 2 and Section 3 present background on the state-of-the-art missing data treatment methods and ML algorithms, respectively; Section 4 presents the methods and materials applied in this paper; Section 5 presents and discusses the experimental results using the missing data treatment methods and ML algorithms to predict student performance; finally, Section 6 presents concluding remarks.

Missing Data Treatment Methods
This section presents popular methods for missing data treatment in the ML scope, since missing values can affect the accuracy of ML algorithms.
As described previously, several missing data treatment methods have been proposed in the literature (Zahin et al., 2018). The most popular ones include discarding the samples with missing values (known as ignoring and discarding, or listwise deletion) and imputation approaches. The first approach reduces the data set size by eliminating all the samples with missing values; imputation approaches, on the other hand, aim to keep the data set size by replacing the missing values with plausible values. According to Gao et al. (2018), imputation approaches outperform ignoring and discarding approaches, as they produce complete data sets and make use of the samples that deletion techniques would remove.
Taking this into account, this work evaluates the ignoring and discarding approach and six imputation approaches to deal with missing values:
• Ignoring and discarding. This approach removes all samples that present missing values. The main advantage is convenience; however, it reduces the number of samples.
• Mean imputation. The missing value for a given attribute in a sample is replaced by the mean of all the observed values of that attribute. If the replaced value is not conditioned on the values of other attributes in the record, this approach is called unconditional mean imputation. Although this approach is simple to implement, its disadvantage is that the variance of the replaced attribute and its covariance with other attributes are systematically underestimated (Lakshminarayan et al., 1999).
• Median imputation. In this technique, each missing value in each attribute is replaced by the median of all non-missing values of that attribute. It should be employed when the distribution of the underlying attribute is not symmetric. Like the mean imputation approach, this technique is simple to implement.
• Last Observation Carried Forward (LOCF). In this approach, a missing value in an attribute is replaced by the last measured value before the missing one. This approach is easy to understand and implement, but it assumes that the value of the attribute remains unchanged (Kang, 2013).
• Linear interpolation. The missing value is computed by linear interpolation between the known values that surround it. The usual motivation for linear interpolation is simplicity, and linear functions are the easiest to determine (Pownuk and Kreinovich, 2017).
• Spline interpolation. The missing value is replaced by piece-wise cubic spline interpolation of the non-missing values of that attribute. A spline function consists of polynomial pieces on sub-intervals joined together with certain continuity conditions (De Boor et al., 1978).
• Piecewise Cubic Hermite Interpolating Polynomial (PCHIP). This method replaces the missing value by shape-preserving piece-wise cubic interpolation of the non-missing values of that attribute.
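To make the simpler treatments listed in this section concrete, the following is a minimal Python sketch, not the implementation used in this study (the experiments were run in Weka). The function names, the use of None as the missing-value marker and the toy grade column are all assumptions made for illustration. Spline and PCHIP interpolation are left out because they would normally rely on a numerical library.

```python
from statistics import mean, median

MISSING = None  # hypothetical marker for a missing value


def drop_incomplete(rows):
    """Ignoring and discarding: keep only rows without missing values."""
    return [row for row in rows if MISSING not in row]


def fill_constant(column, statistic):
    """Mean or median imputation: fill gaps with a statistic of the observed values."""
    observed = [v for v in column if v is not MISSING]
    fill = statistic(observed)
    return [fill if v is MISSING else v for v in column]


def locf(column):
    """Last Observation Carried Forward; gaps before the first observation stay missing."""
    out, last = [], MISSING
    for v in column:
        if v is not MISSING:
            last = v
        out.append(last)
    return out


def linear_interpolate(column):
    """Linear interpolation between the nearest observed neighbours.

    Gaps before the first or after the last observation stay missing.
    """
    out = list(column)
    known = [i for i, v in enumerate(column) if v is not MISSING]
    for left, right in zip(known, known[1:]):
        step = (column[right] - column[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = column[left] + step * (i - left)
    return out


# Toy column of grades (illustrative values only).
grades = [6.0, None, 8.0, None, None, 5.0]
mean_filled = fill_constant(grades, mean)
median_filled = fill_constant(grades, median)
locf_filled = locf(grades)
linear_filled = linear_interpolate(grades)
```

On the toy column, LOCF repeats 6.0 and 8.0 into the gaps, whereas linear interpolation produces a ramp between the observed neighbours; this illustrates why the methods can lead to different classifier accuracies on the same data.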

Machine Learning Algorithms
ML approaches are computer programs used to solve problems using data or past experience (Rudolph and Martinez, 2015). They are employed in a wide range of applications, including forecasting problems (Zhang, Teng and Chen, 2018) and image classification problems (Yuan et al., 2019). This work compares six state-of-the-art ML algorithms able to automatically classify the performance of students in a technical high school education. To do so, five single learning algorithms (i.e. Naive Bayes, Sequential Minimal Optimization, Decision Tree, Decision Rule and Neural Network) and one ensemble learning algorithm (i.e. Bagging) are considered. The next subsections detail each ML algorithm.

Naive Bayes
Naive Bayes is a Bayesian probabilistic algorithm based on Bayes' Theorem. It is a simple ML algorithm, with clear semantics, to represent, use, and learn probabilistic knowledge (Witten et al., 2016). The term "naive" comes from the hypothesis that the attribute values of a sample are conditionally independent of one another given its class.
To design a Naive Bayes model, consider a data set $D$ with $N$ samples, each with attributes $x_i$ and label $y_i$, associated with a supervised classification task, where $x_i = \{x_{i,1}, \ldots, x_{i,m}\}$; $m$ is the number of attributes; $x_{i,j}$ is the $j$-th attribute value of $x_i$; and $y_i \in C = \{c_1, \ldots, c_K\}$ is one of $K$ classes (with $K = 2$ in this study). Moreover, consider that $P(x_{i,j} \mid c_k)$ denotes the conditional probability of attribute value $x_{i,j}$ given class $c_k \in C$ (Faceli et al., 2011), and $P(c_k)$ is the prior probability of class $c_k$ in the data set $D$. Then, for a given test instance $x_t$, the output value (estimated class) of the Naive Bayes model is given by Eq. (1) (Wu et al., 2015):

$$\hat{y}_t = \arg\max_{c_k \in C} \; P(c_k) \prod_{j=1}^{m} P(x_{t,j} \mid c_k) \qquad (1)$$

The Naive Bayes learning algorithm involves a learning step in which the values of $P(c_k)$ and $P(x_{i,j} \mid c_k)$ are calculated. According to Shanahan (2012), one difference between the Naive Bayes algorithm and other ML algorithms is that there is no explicit search through the space of possible models; instead, the model is obtained without searching, by calculating the frequency of various data combinations within the training samples.
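The frequency-counting nature of Naive Bayes learning can be sketched in a few lines of Python. This is an illustrative sketch for categorical attributes only, not the Weka implementation used in the experiments; the function names, the toy samples and the add-one (Laplace) smoothing are assumptions introduced here.

```python
from collections import Counter, defaultdict


def train_naive_bayes(samples, labels):
    """Estimate the counts behind P(c_k) and P(x_j = v | c_k)."""
    class_counts = Counter(labels)
    # value_counts[(j, c)][v] = number of class-c samples whose attribute j equals v
    value_counts = defaultdict(Counter)
    values_seen = defaultdict(set)
    for x, y in zip(samples, labels):
        for j, v in enumerate(x):
            value_counts[(j, y)][v] += 1
            values_seen[j].add(v)
    return class_counts, value_counts, values_seen


def predict(model, x):
    """Return the class maximising P(c_k) * prod_j P(x_j | c_k)."""
    class_counts, value_counts, values_seen = model
    n = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, count in class_counts.items():
        score = count / n  # prior P(c_k)
        for j, v in enumerate(x):
            # Add-one smoothing is an assumption of this sketch, not of the paper.
            score *= (value_counts[(j, c)][v] + 1) / (count + len(values_seen[j]))
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical toy data: (grade band, attendance band) -> label.
samples = [("low", "poor"), ("low", "good"), ("high", "good"),
           ("high", "good"), ("high", "poor"), ("low", "poor")]
labels = ["Disapproved", "Disapproved", "Approved",
          "Approved", "Approved", "Disapproved"]
model = train_naive_bayes(samples, labels)
```

Training only counts occurrences, which matches the remark that no explicit search over candidate models takes place.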

Sequential Minimal Optimization
Support Vector Machines (SVMs) are algorithms based on statistical learning theory. Training an SVM model requires the solution of a large Quadratic Programming (QP) optimization problem. According to Platt (1999), the Sequential Minimal Optimization (SMO) algorithm breaks this QP problem into a series of the smallest possible QP sub-problems. These sub-problems are solved analytically, which avoids the use of time-consuming numerical optimization. SMO is an efficient learning algorithm for handling large training data sets, because the amount of memory it requires is linear in the size of the training set. The SMO algorithm is detailed in Zhang, Wang, Lu, Wang and Ma (2018).

J48 - Decision Tree
J48 is a decision tree learning model based on the C4.5 algorithm, which builds a decision tree using a divide and conquer strategy (Ruggieri, 2002). The goal of the J48 learning algorithm is to create a tree that includes: a root node, which consists of all input data; internal nodes, which are associated with a decision function; and leaf nodes, which show the output of a given input. The J48 model has been reported to outperform other decision tree models in terms of classification accuracy (Pham et al., 2017).
Moreover, the J48 algorithm has other attractive features. For example, it is available as open source in the WEKA project, is easy to understand, makes use of categorical and continuous values, handles missing values, and provides a tree pruning process (Aljawarneh et al., 2017).
In the J48 algorithm, a model is built in two main stages (Bharti et al., 2010; Tien Bui et al., 2014): in the first stage, a classification tree is grown; in the second stage, the classification tree is pruned. Specifically, in the first stage, the input attribute with the highest gain ratio is determined and placed at the root node of the classification tree; a division process is then carried out, using the training data set, to create sub-nodes based on the values of the attribute at the root node. In the second stage, the gain ratio is computed individually for all sub-nodes, and the class labels (landslide or non-landslide in the cited study; Approved or Disapproved in this work) are then determined based on the gain ratio of each sub-node.
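The split criterion of C4.5, on which J48 is based, is the gain ratio: the information gain of a split divided by the entropy of the split itself. The sketch below shows how it could be computed for one categorical attribute; the function names and toy values are assumptions for illustration, and the handling of continuous attributes, missing values and pruning in J48 is not reproduced.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on a categorical attribute, divided by split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy of the labels after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: one attribute separates the classes perfectly, the other not at all.
labels = ["Approved", "Approved", "Disapproved", "Disapproved"]
informative = ["a", "a", "b", "b"]
uninformative = ["a", "b", "a", "b"]
```

The attribute with the highest gain ratio would be chosen for the root node, and the same computation would then be repeated inside each sub-node.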

OneR - Decision Rules
OneR is a simple classification algorithm that nevertheless achieves a high degree of accuracy. It generates one rule for each predictor in the data, and then selects the rule with the lowest total error as its "single rule". To create a rule for a predictor, it constructs a frequency table of that predictor against the target. OneR produces rules that are only slightly less accurate than those of state-of-the-art classification algorithms, while remaining simple for humans to interpret (Witten et al., 2016). Therefore, the main features of the OneR algorithm include simplicity, a high degree of accuracy and easy interpretation of the rules.
The main steps of the OneR algorithm are (Nasa and Suman, 2012): (1) for each attribute j and for each value v of that attribute, create a rule as follows: (2) count how often each class appears; (3) find the most frequent class c; (4) make the rule assign class c to this attribute value; (5) calculate the error rate of the rules of each attribute; and (6) select the attribute whose rules produce the lowest error rate.
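The OneR steps can be sketched as follows for categorical attributes. This is a minimal illustration and not the Weka OneR classifier (which also discretises numeric attributes); the function name and the toy data are assumptions.

```python
from collections import Counter, defaultdict


def one_r(samples, labels):
    """Return (attribute index, rule mapping value -> class, training errors)."""
    best = None
    for j in range(len(samples[0])):
        # Frequency table of attribute j against the target: value -> class counts.
        table = defaultdict(Counter)
        for x, y in zip(samples, labels):
            table[x[j]][y] += 1
        # Each value predicts its most frequent class.
        rule = {v: counts.most_common(1)[0][0] for v, counts in table.items()}
        # Errors are the samples that do not belong to the majority class of their value.
        errors = sum(sum(counts.values()) - max(counts.values())
                     for counts in table.values())
        if best is None or errors < best[2]:
            best = (j, rule, errors)
    return best


# Hypothetical toy data: (course, grade band) -> label.
samples = [("info", "low"), ("info", "high"), ("elec", "low"),
           ("elec", "high"), ("info", "high"), ("elec", "low")]
labels = ["Disapproved", "Approved", "Disapproved",
          "Approved", "Approved", "Disapproved"]
attribute, rule, errors = one_r(samples, labels)
```

In this toy example the grade band alone separates the classes, so OneR selects it and ignores the course attribute.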

Neural Networks
Neural Networks (NNs) are ML algorithms inspired by biological neurons. The NN model has processing elements (neurons), connections between them (weights) and a learning/training algorithm. The main features of the NN model are generalization, robustness, massive parallelism, learning and adaptation (Kasabov, 1996). There are many NN architectures, but the Multilayer Perceptron (MLP) is the most popular and efficient one (Soares, 2015). It contains one input layer, one or more hidden layers and one output layer. Fig. 1 shows a generic MLP architecture, where the input layer has x neurons, the hidden layer has h neurons, and the output layer has y neurons.
Many NN learning algorithms can be found in the literature to obtain the NN parameters (weights and biases), but the most popular is the backpropagation algorithm. It iteratively employs a gradient descent method to select the NN parameters, and its main advantages include the backward propagation of errors, good performance on problems in which the relationship between output and inputs is unknown, flexibility, and great learning ability (Saduf and Wani, 2013).

Bagging Ensemble
According to Soares (2015), ensemble learning models are sets of learning algorithms that combine in some way their decisions, or their learning algorithms, or different data to obtain accurate predictions. This is because, in most cases, an ensemble learning model is more accurate than any single model used separately. The effectiveness of ensemble learning models has been demonstrated in different applications (Tamvakis et al., 2018).
The most popular ensemble learning model is Bagging (Breiman, 1996). It promotes diversity between the individual models by creating a different training data set for each model using bootstrap (Dinakaran and Thangaiah, 2017). Bootstrap is a resampling approach that produces a new training data set by randomly drawing, with replacement, from the original training data set. Statistically, each new training data set contains on average about 63.2% of the distinct samples of the original training set. In the Bagging algorithm, after each model is trained with a different training data set, the aggregation step combines all the models' outputs/classifications using a simple voting method (i.e. the models have the same contribution to the final classification) (Soares et al., 2012).
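The 63.2% figure follows from the fact that a given sample is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 for large n. The sketch below checks this numerically and shows the two building blocks of Bagging (bootstrap resampling and simple voting). The function names, the seed and the data size are assumptions chosen for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items uniformly at random with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Simple voting: every model contributes equally to the final class."""
    return Counter(predictions).most_common(1)[0][0]


def expected_unique_fraction(n):
    """Expected fraction of distinct original samples in one bootstrap sample."""
    return 1 - (1 - 1 / n) ** n


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
data = list(range(1000))
fractions = [len(set(bootstrap_sample(data, rng))) / len(data) for _ in range(200)]
average_fraction = sum(fractions) / len(fractions)
```

Here `average_fraction` is the empirical counterpart of `expected_unique_fraction(1000)`; both should be close to 0.632.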

Problem description: Student performance prediction
This paper aims to develop a predictive tool, using ML algorithms, able to predict whether a student will be "approved" or "disapproved" in the initial bimesters of each technical high school year based on historical data. To do so, state-of-the-art ML algorithms and missing data strategies are compared in order to create a powerful predictive tool to estimate student performance. This tool will help education professionals in actions to improve Brazilian education and student performance. For example, Stimpson and Cummings (2014) proposed ML algorithms to help education professionals. Their results revealed that providing information early to education professionals supports the development of targeted intervention methods, as it allows accurate estimations to be made earlier in the course. In this work, data were collected from a technical high school of a campus of the Federal Institute of Education, Science and Technology of Goiás, in Brazil, where each technical course is divided into four years and each year is divided into four bimesters (a bimester corresponds to two months). In each bimester, assessments and attendance checks are performed for each student. A student is approved if the average of the bimesters' grades is equal to or greater than 6.0 and the attendance is equal to or greater than 75% of the classes. Data were obtained from students of three technical high school courses (Electronics, Electrotechnology and Informatics) between the years 2008 and 2013.

[Fig. 1: generic MLP architecture with input layer, hidden layer and output layer.]
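The approval criterion stated in this section can be written as a short predicate. The function name and argument layout are assumptions made for illustration; only the two thresholds (6.0 and 75%) come from the text.

```python
def is_approved(bimester_grades, attendance_pct):
    """Approved if the mean bimester grade is >= 6.0 and attendance is >= 75%."""
    average = sum(bimester_grades) / len(bimester_grades)
    return average >= 6.0 and attendance_pct >= 75.0


# Illustrative values only.
borderline = is_approved([6.0, 6.0, 6.0, 6.0], 75.0)
low_grades = is_approved([5.0, 7.0, 5.0, 6.0], 90.0)
low_attendance = is_approved([9.0, 9.0, 9.0, 9.0], 70.0)
```

This label is only known after the fourth bimester, which is why the classifiers are trained to estimate it from earlier bimesters.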
The original data set contains 15788 samples (where each sample is associated with a student), and it was divided into three data sets:
• data set 1: grade and attendance of the first bimester;
• data set 2: grade and attendance of the first and second bimesters;
• data set 3: grade and attendance of the first, second and third bimesters;
where data sets 1, 2 and 3 have 7, 9 and 11 attributes, respectively, and each sample carries the label Approved or Disapproved. Therefore, this is a binary classification task (a two-class problem). Table 1 illustrates the attributes for all the data sets. It should be observed that data set 1 uses data (course's ID; and student's data, grade and number of missed classes) from the first bimester to predict an approval or a disapproval in a school year, while data set 2 employs data from the first and second bimesters, and data set 3 employs data from the first, second and third bimesters, for the same prediction. Additionally, it should be pointed out that data from the fourth bimester are not used because, in this period, the final student performance (approved or disapproved) is obtained. Further details of each attribute can be found in de Melo et al. (2017).
The data sets present a large number of missing values. For example, data set 1, data set 2 and data set 3 present 14.54%, 27.28% and 31.84% of missing data in the attributes, respectively. To deal with these missing values, seven missing data treatment methods are evaluated in this paper (as described in Section 2). For each data set (data set 1, data set 2 and data set 3), the seven missing data treatment methods were applied to build seven different pre-processed data sets (cases), and each pre-processed data set was then employed to design ML algorithms. Table 2 shows this procedure and the identification of each pre-processed data set. Therefore, the number of pre-processed data sets is 21.
To evaluate each ML algorithm (described in Section 3) trained with a pre-processed data set, the k-fold cross-validation method is applied. In cross-validation, the data set is randomly divided into k folds of equal size. At each cross-validation iteration, one fold is used for testing (to obtain the classification accuracy) and the other k - 1 folds are used for training the ML algorithm. The test values (classification accuracy values) are calculated and averaged over all the k folds, and this average value corresponds to the final accuracy of the ML algorithm (Bouckaert et al., 2008).
In this work, the number of folds is set to 10. Moreover, in order to get statistically meaningful results, each ML algorithm is evaluated in 20 runs. That is, for each ML algorithm, 10-fold cross-validation is applied in 20 runs. Therefore, in the experimental results below, the classification accuracy corresponds to the average of the 10-fold cross-validation values over 20 runs. The algorithm with the best performance (i.e. the highest percentage of correctly predicted samples) will be used in the ensemble learning model (Bagging).
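The evaluation protocol (10-fold cross-validation repeated in 20 runs, with the accuracies averaged) can be sketched as follows. This is an illustrative sketch and not the Weka Experimenter procedure used in the study; the function names, the seed, the placeholder majority-class learner and the toy labels are assumptions.

```python
import random
from collections import Counter


def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and split them into k folds of (almost) equal size."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def majority_classifier(train_x, train_y):
    """Placeholder learner: always predicts the most frequent training label."""
    majority = Counter(train_y).most_common(1)[0][0]
    return lambda x: majority


def repeated_cv_accuracy(samples, labels, fit, k=10, runs=20, seed=0):
    """Average accuracy of k-fold cross-validation over several runs."""
    rng = random.Random(seed)
    run_means = []
    for _ in range(runs):
        fold_accuracies = []
        for test_idx in k_fold_indices(len(samples), k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(samples)) if i not in held_out]
            model = fit([samples[i] for i in train_idx],
                        [labels[i] for i in train_idx])
            correct = sum(model(samples[i]) == labels[i] for i in test_idx)
            fold_accuracies.append(correct / len(test_idx))
        run_means.append(sum(fold_accuracies) / k)
    return sum(run_means) / runs


# Hypothetical toy data: 70 approved and 30 disapproved students.
samples = list(range(100))
labels = ["Approved"] * 70 + ["Disapproved"] * 30
baseline_accuracy = repeated_cv_accuracy(samples, labels, majority_classifier)
```

Any learner with the same `fit(train_x, train_y)` interface could be passed in place of the majority-class baseline, which is useful as a reference point when reading accuracy tables.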
The ML algorithms were implemented using Weka (Waikato Environment for Knowledge Analysis). It is software developed by students at the University of Waikato in New Zealand, with the initial purpose of identifying information coming from raw data in agricultural applications (Vaithiyanathan et al., 2013).
The Weka software contains many tools for data pre-processing, classification, clustering, association, regression and feature selection. This paper uses six classification algorithms (Naive Bayes, SMO, J48, OneR, NN and Bagging (MLP mode)) from the Weka software, where the parameters of the ML algorithms are set to their default values (which, for the NN, corresponds to six perceptrons).

Experimental Results and Discussions
In this section, missing data treatment techniques and state-of-the-art ML algorithms are evaluated and discussed for predicting students' approval and disapproval in technical high school education. The experiments were run in the Weka software on a PC equipped with an Intel(R) Core(TM) i5-7200U 2.5 GHz-2.71 GHz processor with 4 cores and 4 GB of RAM. Experimental results using all the missing data treatment techniques and the single ML algorithms (Naive Bayes, SMO, J48, OneR and NN) are presented in Table 3.
As described previously, for each missing data treatment and for each ML algorithm, the simulation was conducted in 20 runs. The average and standard deviation of the percentage of correctly classified instances (classification accuracy) using 10-fold cross-validation over 20 runs are reported. The underlined values highlight the ML algorithm with the best performance in a case (pre-processed data set); the values highlighted in blue indicate the best missing data treatment technique for a single ML algorithm; and the circle (•) marks results with a statistically significant improvement, as generated automatically by Weka.
From the results presented in Table 3, some conclusions can be drawn, for example:
• the J48 algorithm outperforms the other single ML algorithms in most cases from data set 1 (i.e. it achieves the best performance in five cases from data set 1);
• the NN algorithm outperforms the other single ML algorithms in all the cases from data set 2;
• the SMO algorithm outperforms the other single ML algorithms in most cases from data set 3 (i.e. it achieves the best performance in six cases from data set 3);
• the NN algorithm presents the best classification accuracy averaged over all the cases;
• the best classification accuracy (i.e. 96.0574±0.5261) was achieved by the SMO algorithm in case 21, which uses the ignoring and discarding method on data set 3;
• the results from data set 3 are better than those from the other data sets, since it uses more attributes;
• in most results, the best missing data treatment technique for an ML algorithm is the ignoring and discarding method;
• in most results, the missing data treatment techniques with the worst performance are the median imputation and spline interpolation methods.
Therefore, according to Table 3, over all cases, the single ML algorithms that present the best performance are SMO, J48 and NN. Moreover, NN presents the best classification accuracy when compared to the other ML algorithms. Thus, a more detailed study of the NN algorithm is carried out by tuning the number of hidden neurons using case 21 (data set 3 with the ignoring and discarding method). Table 4 shows the NN performance when the number of neurons in the hidden layer varies from 1 to 20, using the pre-processed data from case 21. As can be seen, the NN accuracy tends to decrease when the number of hidden neurons increases.
Other experiments are performed to analyze the performance of an ensemble system (Bagging) using NN as the ML algorithm. Table 5 presents the Bagging performance when the number of hidden neurons in the NN algorithm varies from 1 to 20. As can be seen, the best accuracy of Bagging with NN is achieved when the number of hidden neurons is 4 (highlighted in blue in the table). In addition, Fig. 2 compares the performance of Bagging with NN to that of a single NN when the number of hidden neurons varies. The results show that, for every number of hidden neurons, the Bagging algorithm outperforms a single NN algorithm in terms of average classification accuracy.
Additionally, other tests with case 21 are performed to compare the performance of SMO, NN (with one hidden neuron) and Bagging (with four hidden neurons) using other evaluation metrics, as shown in Table 6. The evaluation metrics are the standard classification accuracy (%), kappa statistic, Mean Absolute Error (MAE), precision, recall, F-Measure, Matthews correlation coefficient (MCC), and Receiver Operating Characteristic (ROC) area (Faceli et al., 2011). The algorithms present good performance, close to that of the "ideal algorithm" (where "ideal algorithm" represents the best/ideal performance of an ML algorithm). SMO presents the MAE, recall and F-Measure nearest to those of the ideal algorithm. Moreover, NN presents the best precision and ROC area. Finally, Bagging has classification accuracy, kappa statistic, F-Measure and MCC almost equal to those of the ideal algorithm.
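Most of the metrics listed for Table 6 can be derived from the four entries of a binary confusion matrix. The sketch below uses the standard textbook definitions for a single positive class; the function name and the counts are hypothetical, and Weka's reported values may differ because it can average per-class figures. MAE and ROC area are omitted because they depend on the predicted class probabilities rather than on the confusion matrix alone.

```python
from math import sqrt


def binary_metrics(tp, fp, fn, tn):
    """Standard metrics from a 2x2 confusion matrix (no guard against zero denominators)."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f_measure": f_measure, "mcc": mcc, "kappa": kappa}


# Hypothetical counts, not taken from Table 6.
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
```

For an "ideal algorithm" with no false positives or false negatives, every one of these metrics equals 1.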

Conclusions
This work evaluates seven missing data treatment methods and six ML algorithms to estimate students' performance in technical high school education at the Federal Institute of Goiás in Brazil. The aim is to propose an efficient computational tool to aid educational performance. According to the reported results, for this case study, the missing data treatment method with the best performance is the ignoring and discarding method, while the median imputation and spline interpolation methods have the worst performance. Regarding the ML algorithms, three out of the six achieved the best classification accuracy: SMO (96.0574%), NN (95.9747%) and Bagging (96.0714%). Therefore, it can be concluded that the ML algorithms that used the ignoring and discarding method can be used in the classification of students' performance.
As a further improvement, such tools could also be applied to undergraduate courses. For this purpose, the same analyses and procedures can be used, which can bring significant improvements to the pedagogical assistance of educational institutions.
Future work will be devoted to analyzing other missing data treatment methods and optimizing the parameters of the ML algorithms. Moreover, future efforts will be made to apply other ML algorithms, such as deep learning algorithms, to this case study.