A PCA- and SPCA-based procedure for variable selection in agriculture

Agricultural data mining often demands processing databases with few observations and high dimensionality. Since sample complexity grows with data dimensionality, these factors can undermine the confidence in the obtained results and lead to overfitting. One approach to reducing data dimensionality and sample complexity is to select the attributes that are relevant to describe the phenomenon of interest. This work presents a procedure that combines search methods with supervised and unsupervised principal component analysis to select variables. The procedure removes variables that are irrelevant or that have little influence on data variation, and it evaluates the impact of variable selection on regression and classification tasks. Whenever possible, the number of selected variables meets the sample complexity requirements. The procedure was tested by selecting variables for the regression of multivariate linear models and for the training of artificial neural network classifiers on a precision agriculture data set. The proposed procedure supports trade-offs between dimensionality reduction and model accuracy.


Introduction
Mathematical modeling is an important tool to capture the interactions among environment, soil and plants and thereby to gain a better understanding of the physiological processes [25]. It usually demands developing multivariate models that relate a set of variables describing physical and chemical features of the plants and soil involved in a study. An important step in multivariate modeling of agricultural data is selecting the variables most relevant to the response variable [24,27]. A suitable variable set allows [11,14,12]: (i) discarding redundant attributes or attributes that add noise to the data; (ii) reducing the risk of overfitting; (iii) reducing sample complexity; (iv) making the model simpler; (v) saving time and resources in future data collection.
A point related to item (iii) is that the confidence in data mining results depends on the size of the sample used to learn a pattern or function [7]. It follows that knowing how many observations a procedure needs in order to induce a model is a very important practical issue. Learning theorists call this aspect of automated learning sample complexity [15]: the smallest number of cases a learning algorithm requires to learn a concept from data, given a prespecified error limit on its predictions.
Because sample complexity grows with data dimensionality [8], variable selection procedures can ease the requirements on sample size. Basically, if the data analyst removes low-relevance variables before learning the model, it becomes possible to learn a better model from a smaller sample [18]. Such an approach may be useful in domains where data sets with high dimensionality and few cases are common, a usual situation in agricultural, medical and biological research [19].
Principal component analysis (PCA) is a standard technique for this kind of dimensionality reduction. However, PCA is an unsupervised procedure. That is, variable selection with PCA does not take into account the influence of any input variable on the output variable when deciding whether the former should be discarded. To overcome this limitation, the supervised principal component analysis (SPCA) proposed by Bair [1] provides a way to integrate supervised information into principal component analysis. The basic idea of SPCA is to apply a filter that preselects the attributes with the strongest association with the response variable and then to use PCA to remove input attributes that are correlated among themselves.
Jolliffe [9] presented two PCA-based procedures for variable selection. Those methods, called B2 and B4, do not use supervised information to perform the task. In view of this, the present work proposes a variable selection procedure that combines the B2 and B4 methods with supervised PCA to select attributes for inducing regressors or classifiers. Additionally, the proposed procedure provides information that can be used to trade off sample complexity against model accuracy. Experiments on agricultural data sets allow the effectiveness of the approach to be evaluated.
The paper is organized as follows: Section 2 presents the basic theory used in this work; Section 3 describes the materials and methods used to perform the experiments; Section 4 presents the experimental results; Section 5 discusses the results; and Section 6 presents the final remarks.

Background review

PCA-based variable selection
The purpose of a variable selection method is to choose the most relevant variables of a data set for data mining tasks. This approach, also called feature selection, attribute selection or dimensionality reduction, has been implemented with a variety of techniques. The PCA procedure provides supporting information for some of them.
Principal component analysis is a procedure for reducing the dimensionality of a data set with many interrelated variables [11,3]. The basic idea is to apply a linear transformation that highlights the shape of the data variation. This linear transformation defines a new space whose main axes are called principal components.
The components are ordered so that the first ones are aligned with the directions of largest variation in the data set. More formally, let D be a data set with multivariate observations and let Y be the matrix obtained from D by subtracting the mean of every variable in the original data set. Here, Y is an N × d matrix, where N is the number of observations and d is the number of variables. Let S be the sample covariance matrix computed from Y, Λ the diagonal matrix of the eigenvalues of S, and X the matrix of the eigenvectors corresponding to the eigenvalues in Λ. Furthermore, the matrices are organized so that the first elements on the diagonal of Λ are the largest ones. It then follows that

S = XΛX^T. (1)

Dimensionality reduction is obtained by removing the last (least informative) components. After this, data analysis can proceed by observing the behavior of the data in the space defined by the first (main) principal components. Another approach is to evaluate the contribution (loading) of the original variables on the main components and then use some criterion to select or remove variables.
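Under the definitions above, a minimal sketch of this decomposition in Python (using numpy; the function name is illustrative, not the authors' code) centers the data, computes the sample covariance matrix S and returns the eigenvalues Λ and eigenvectors X sorted by decreasing variance:

```python
import numpy as np

def pca(D):
    """Eigendecomposition of the sample covariance matrix, following the
    notation of the text: Y is the mean-centered data, S the covariance
    matrix, Lambda the eigenvalues and X the loadings (eigenvectors)."""
    Y = D - D.mean(axis=0)                  # center each variable
    S = np.cov(Y, rowvar=False)             # sample covariance, d x d
    eigvals, eigvecs = np.linalg.eigh(S)    # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]       # sort by decreasing eigenvalue
    Lambda = eigvals[order]
    X = eigvecs[:, order]                   # column i = loadings of component i
    return Lambda, X

# Example: 100 observations of 5 correlated variables
rng = np.random.default_rng(0)
D = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
Lambda, X = pca(D)
print(Lambda)  # variances along each principal component, descending
```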
Jolliffe [9,10] presents two PCA-based variable selection procedures called B2 and B4, described in Figures 1 and 2, respectively.

Figure 1: B2 procedure
1. compute the principal components of the data set;
2. set an eigenvalue threshold below which variables will be cut off; this threshold is denoted λ0;
3. sort the principal components in ascending order of eigenvalue; let u1, ..., ud be the ordered sequence of components;
4. for i = 1 to d: determine the highest-loading variable X in ui that is not yet associated with a component, and associate X with ui;
5. remove the variables associated with the principal components whose eigenvalues are less than λ0.

Figure 2: B4 procedure
1. compute the principal components of the data set;
2. define an eigenvalue threshold (λ1) above which variables will be cut off;
3. sort the principal components in descending order of eigenvalue; let u1, ..., ud be the ordered sequence of components;
4. for i = 1 to d: determine the lowest-loading variable X in ui that is not yet associated with a component, and associate X with ui;
5. remove the variables associated with the principal components whose eigenvalues are greater than λ1.

The main difference between B2 and B4 is that B2 rejects the variables strongly related to the components that represent the smallest part of the data variation, whereas B4 rejects the variables weakly related to the components that represent most of the data variation.
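As an illustration of the two rejection rules, the sketch below is one reading of Figures 1 and 2 (the helper name and thresholds are assumptions); it consumes the sorted eigenvalues and loadings produced by the pca sketch above and returns the variables each rule would reject:

```python
import numpy as np

def reject_b2_b4(Lambda, X, names, lambda0, lambda1):
    """Lambda: eigenvalues in descending order; X: loadings, rows are
    variables and columns are components; names: variable labels."""
    d = len(Lambda)
    # B2: walk components in ASCENDING eigenvalue order, associate each
    # with its highest-loading free variable, reject when eigenvalue < lambda0
    rejected_b2, free = [], set(range(d))
    for i in range(d - 1, -1, -1):
        j = max(free, key=lambda v: abs(X[v, i]))
        free.discard(j)
        if Lambda[i] < lambda0:
            rejected_b2.append(names[j])
    # B4: walk components in DESCENDING eigenvalue order, associate each
    # with its lowest-loading free variable, reject when eigenvalue > lambda1
    rejected_b4, free = [], set(range(d))
    for i in range(d):
        j = min(free, key=lambda v: abs(X[v, i]))
        free.discard(j)
        if Lambda[i] > lambda1:
            rejected_b4.append(names[j])
    return rejected_b2, rejected_b4
```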

Supervised principal component analysis
SPCA concentrates the principal component analysis on the input variables that have more influence on a response variable [1]. To do so, SPCA preselects the input variables with the greatest influence on the output and then runs PCA on the selected ones. Figure 3 shows the SPCA procedure. L is a threshold on the standardized regression coefficient s, computed as

s_j = X_j^T y / sqrt(X_j^T X_j),

where y is the (centered) response vector, X_j is the column matrix associated with the respective attribute, and s_j is the standardized regression coefficient of X_j.

Figure 3: SPCA procedure
1. compute the standardized regression coefficient s for each variable;
2. create a reduced matrix from D with only those variables whose standardized regression coefficient has absolute value greater than a threshold L (L is estimated by cross-validation);
3. compute the first (or first few) principal components of the reduced data set;
4. use these principal components in a regression model to predict the outcome.
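A minimal sketch of the pre-selection step (steps 1 and 2 of Figure 3), assuming the standardized coefficient s_j = X_j^T y / sqrt(X_j^T X_j) reconstructed above; the function name and the way L is passed in are illustrative, and L itself would be tuned by cross-validation as the figure states:

```python
import numpy as np

def spca_preselect(D, y, L):
    """Keep the variables whose standardized univariate regression
    coefficient exceeds the threshold L in absolute value."""
    Y = D - D.mean(axis=0)
    yc = y - y.mean()
    # s_j = X_j^T y / sqrt(X_j^T X_j), computed column-wise
    s = (Y.T @ yc) / np.sqrt((Y ** 2).sum(axis=0))
    keep = np.abs(s) > L
    return D[:, keep], keep
```

The first few principal components of the reduced matrix can then be computed with the pca sketch of Section 2.1 and used as regressors.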

Regression and classification models
Consider a data set with N observations of a response variable Y and d predictor variables X1, X2, ..., Xd. The goal of multiple linear regression (MLR) is to determine a relationship between Y and the predictors of the form [2]

Y = β0 + β1 X1 + ... + βd Xd + ε. (2)

In Equation 2, β0 is the intercept, βi, i = 1..d, are the linear coefficients and ε is a random disturbance or error.
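For concreteness, Equation 2 can be fitted by ordinary least squares; the sketch below (hypothetical helper names) also includes the coefficient of determination used later to rank the linear models:

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares estimate of Equation 2:
    y = b0 + b1*x1 + ... + bd*xd + error."""
    A = np.column_stack([np.ones(len(X)), X])    # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta                                  # [b0, b1, ..., bd]

def r2_score(y, y_hat):
    """Coefficient of determination R^2."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```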
An artificial neural network (ANN) is a mathematical model formed by a set of interconnected processing units called artificial neurons [7]. Each unit u computes its value using an activation function that integrates the messages sent to u by the units connected to it. The connections among units are weighted. ANNs work by taking an input signal and propagating it forward, layer by layer, until it arrives at the output layer. When the signal reaches the output layer, the activation functions of its units are computed and the results are related to the response variable and interpreted by the user. A multilayer neural network typically has one layer of input units, one or more intermediate layers, and one layer of output units. The ANN weights are trained from data sets using specialized algorithms.

Sample complexity
In computational learning theory, the number of examples or cases required by a learning algorithm to learn a concept or pattern is called the sample complexity [15]. In this context, the probably approximately correct (PAC) learning framework provides results for calculating bounds on the sample complexity of a given learning task. In summary, the sample size m that allows an algorithm to learn, with probability at least 1 − δ, a concept whose error is at most ε satisfies

m ≥ (1/ε) (4 log2(2/δ) + 8 VC log2(13/ε)). (3)

In Inequation 3, VC is the VC dimension of the learning problem [6]. For linear regression, the VC dimension is given by VC(H) = d + 1, where d is the number of variables (dimensions of the data) [22]. For neural networks, the VC dimension is bounded by

VC = 2 (r + 1) q log2(e q). (4)

In this equation, q is the number of nodes in the hidden layers, r is the number of inputs of each hidden node and e is the base of the natural logarithm.
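A worked computation of these bounds is sketched below, assuming the common Blumer/Mitchell formulation of Inequation 3 shown above; other formulations use different constants, so the m values reported in Section 4 may derive from a slightly different version of the bound:

```python
import math

def sample_complexity(vc, eps=0.1, delta=0.05):
    """PAC sample-size bound of Inequation 3 (assumed formulation)."""
    return math.ceil((1.0 / eps) *
                     (4 * math.log2(2 / delta) +
                      8 * vc * math.log2(13 / eps)))

def vc_linear(d):
    """VC dimension of linear regression on d variables."""
    return d + 1

def vc_mlp(q, r):
    """VC bound for q hidden threshold units with r inputs each (Equation 4)."""
    return 2 * (r + 1) * q * math.log2(math.e * q)

print(sample_complexity(vc_linear(6)))   # bound for an MLR on six variables
```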

Materials and methods
The data set was generated as described in [16]. Before processing, outliers were removed: a value was considered an outlier if it exceeded the mean plus or minus three times the standard deviation [4]. Every record with an outlier in at least one variable was cut from the data set. After this preprocessing, the number of cases in the data set was reduced from 2416 to 2138.

Next, the methods B2 and B4 proposed by [9] were applied to determine the variables to be used for regressing the multivariate linear models (MLR) and for training the neural networks. The aim was to generate models that predict crop yield from the input variables in the data set (chemical and physical soil properties). The coefficient of determination (R2) was used to choose the best linear model generated by each procedure. The best generated ANNs were determined by evaluating the sum of squared errors (SSE).
The generated networks were multilayer perceptrons [21] with one input layer, one hidden layer and one output layer. The input layer was defined by the variables selected by B2 and B4. The number of nodes in the hidden layer was 1/10 of the number of neurons in the input layer, with a minimum of 1. The output layer was set to have just one neuron: the crop yield. The learning algorithm was backpropagation [7] and the number of training epochs was 100.
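A sketch of this topology using scikit-learn's MLPRegressor in place of the original backpropagation implementation (the solver choice and the helper name are assumptions):

```python
from sklearn.neural_network import MLPRegressor

def build_mlp(n_inputs):
    """Network shaped as in the text: one hidden layer with 1/10 of the
    input neurons (at least one) and a single output, the crop yield."""
    hidden = max(1, n_inputs // 10)
    return MLPRegressor(hidden_layer_sizes=(hidden,),
                        solver='sgd',      # gradient-descent backpropagation
                        max_iter=100,      # 100 training epochs, as in the text
                        random_state=0)
```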
Subsequently, two methods for variable selection were developed. These methods, called B2+S and B4+S, combine SPCA with the B2 and B4 procedures. Figure 4 presents pseudocode describing the B2+S and B4+S procedures.

Figure 4: Method for variable selection through supervised principal component analysis (B2+S, B4+S)
1. input a data set D with n attributes;
2. input a and b such that a, b ∈ N and a ≥ 2; these inputs represent the minimum and maximum number of variables to be rejected;
3. for each z from a to b do:
(a) for p from 1 to z − 1 do:
i. remove p variables using the supervised criterion;
ii. remove q = z − p variables with B2 or B4;
iii. form a group Gp with the p + q rejected variables and form the subgroup of remaining variables Rp = D − Gp;
iv. build a model with the remaining variables Rp and run it to obtain its performance (R2, sum of squared errors or relative error);
v. save the performance Vp into a vector V and go to the next p;
(b) save the Vp with the best performance value in V into a vector Z, then go to the next z;
4. return the vector Z with the best selections and analyze them to decide which subset of variables has provided the most accurate model.
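The loop of Figure 4 can be sketched as follows; supervised_reject implements the SPCA filter of Section 2.2, while unsupervised_reject and model_score are hypothetical callables standing in for B2/B4 and for the MLR or ANN evaluation:

```python
import numpy as np

def supervised_reject(D, y, names, p):
    """Reject the p variables with the weakest standardized coefficient s_j."""
    Y = D - D.mean(axis=0)
    s = np.abs(Y.T @ (y - y.mean())) / np.sqrt((Y ** 2).sum(axis=0))
    return [names[j] for j in np.argsort(s)[:p]]

def b_plus_s_search(D, y, names, a, b, unsupervised_reject, model_score):
    """Sketch of the Figure 4 procedure for B2+S / B4+S."""
    Z = []
    for z in range(a, b + 1):                 # z: total variables to reject
        best = None
        for p in range(1, z):                 # p rejected by the supervised filter
            sup = supervised_reject(D, y, names, p)
            remaining = [n for n in names if n not in sup]
            cols = [names.index(n) for n in remaining]
            uns = unsupervised_reject(D[:, cols], remaining, z - p)  # q = z - p
            R = [n for n in remaining if n not in uns]
            score = model_score(D[:, [names.index(n) for n in R]], y)
            if best is None or score > best[0]:
                best = (score, R)
        Z.append(best)                        # best (score, subset) for this z
    return Z          # inspect Z to trade subset size off against accuracy
```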
As in the first test, the B2+S and B4+S procedures were applied to select variables for linear regression and neural network training. Once again, the task was to generate models that predict crop yield from the input variables. The neural network topology was constrained as described above, and the linear and ANN performances were evaluated by R2 and SSE, respectively.
The data set was repeatedly split into training data (66% of the cases in the original data set) and validation data (33% of the cases). In all, each test was repeated on 35 different randomly generated training and validation sets. The overall result of each procedure was computed as the mean and standard deviation of the R2 and SSE scores. We considered removing up to 15 variables from the original data. The sample complexity was estimated assuming ε = 0.1 and δ = 0.05.
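A sketch of this evaluation protocol, assuming scikit-learn's train_test_split for the random splits (fit and score are hypothetical callables wrapping the model of interest):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def repeated_eval(D, y, fit, score, n_repeats=35, seed=0):
    """35 random 66%/33% splits; returns mean and standard deviation
    of the chosen score (R^2 for MLR, SSE for the ANNs)."""
    rng = np.random.RandomState(seed)
    results = []
    for _ in range(n_repeats):
        Xtr, Xva, ytr, yva = train_test_split(
            D, y, test_size=0.33, random_state=rng)
        model = fit(Xtr, ytr)
        results.append(score(model, Xva, yva))
    return np.mean(results), np.std(results)
```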

Results
The experiments were carried out to compare the influence of the selection criteria on the predictive performance of the generated classifiers and regressors. The results are presented in Tables 1 and 2. The first column of Table 1 indicates the number of selected attributes; columns two to five present the mean and standard deviation of the coefficient of determination of the MLR models learned from the attributes selected with B2, B4, B2+S and B4+S, respectively. The last column shows the sample complexity (m). Variable selection using B2+S and B4+S improved the performance of MLR learning. This conclusion is supported by observing in Table 1 that the mean coefficient of determination of the regressors generated after B2+S and B4+S selection was higher than that of the regressors generated with B2 and B4 (except in one case). These better results were probably achieved because the supervised procedure discards attributes that are irrelevant for prediction before PCA is applied to select the variables with more influence on the uncorrelated principal components. That is, even a variable that contributes to the data variation can be discarded if it has a low correlation with the output variable.
Furthermore, the results indicate that variable selection can provide a criterion for trading off sample complexity requirements against the maximization of accuracy scores (or, conversely, the minimization of error-based scores) when learning from small data sets. Table 1 shows that the sample complexity requirement (m) decreases as fewer variables are selected. In particular, the sample size demand is met by the MLR models with five and six variables. The best mean R2 was obtained by the models with six variables selected by B2+S. In this learning configuration, PAC learning theory specifies that the sample must have 1248 records. Since the preprocessed data set has 2138 cases, it satisfies the PAC learning requirement (ε = 0.1 and δ = 0.05). Additionally, it should be observed that the supervised selection procedure allowed the induction of a model that meets the sample complexity constraints with only a modest decline in the determination score (R2 ≈ 0.684).
The results of the tests with neural networks indicated that the supervised selection procedure contributed to generating better classifiers than those obtained by applying B2 and B4 alone. In particular, the results also showed that the B2+S variable selector allowed ANNs to be built with five or six input variables. These networks showed a small decline in performance and a model structure requiring a sample size of 1079 cases.

Discussion
The B2+S and B4+S procedures reduce data dimensionality by applying a sequential search that removes irrelevant variables. The search explores supervised information (the standardized regression coefficient) and principal component analysis, combined with classifier/regressor performance, to decide on attribute removal.

The main advantage of this approach is that it makes it possible to evaluate how much a smaller set of attributes reduces the requirements on sample size while still fulfilling the machine learning expectations.

The attributes selected by the B2+S and B4+S procedures preserved most of the variation in the data and most of the influence on crop yield. This had a positive impact on reducing the computational effort of the regression and classification tasks. The heuristic of the B2+S and B4+S procedures favors the removal of poor regressors and correlated attributes, providing a strategy to trade off dimensionality reduction against model precision.

When the aim is to select relevant variables for linear regression while meeting PAC learning requirements, the results showed that B2+S and B4+S performed better than B2 and B4. This can be seen most easily in Figures 5 and 6 for the models with five or six variables, where the supervised PCA scores are superior to the PCA scores. The difference comes from the fact that SPCA selects the variables that retain most of the variance of the original data and have the greatest influence on the output variable. The performance of the models derived from the supervised procedures was significantly better than that of the models generated with the unsupervised ones (paired t-test at the 5% significance level). For the neural networks the results were mixed; however, the supervised PCA criterion obtained the best results for the models with fewer attributes.
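The significance check mentioned above can be reproduced with a paired t-test over the 35 per-split scores; a minimal sketch using scipy (the argument names are illustrative):

```python
from scipy.stats import ttest_rel

def compare_selectors(scores_supervised, scores_unsupervised, alpha=0.05):
    """Paired t-test over the per-split scores of two selectors, at the
    5% significance level used in the text."""
    t_stat, p_value = ttest_rel(scores_supervised, scores_unsupervised)
    return p_value < alpha, t_stat, p_value
```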
Krzanowski [13], Jolliffe [9] and King et al. [12] also used PCA and sequential search for variable selection. In those works, the efficacy of the selection process was measured by performing a Procrustes analysis to determine how much of the variation in the original data was preserved by the selected attributes. King et al. [12] also used a similarity score that summarizes the correlation matrix between the principal components computed from the original and the reduced data. Those works do not consider the effect of variable selection on tasks such as classification and regression. In contrast, the procedures presented here use the performance of the machine learning process as the main score for quantifying the effectiveness of the supervised/unsupervised strategies applied in variable selection.
The sequential search strategy employed by the proposed procedure can demand considerable computational effort [5]. Such a framework may be unsuitable for variable selection in large, highly dimensional data sets. However, the proposed procedure could be sped up by using a local search strategy such as hill climbing with multiple restarts [21]. In such an implementation, it is necessary to consider the trade-off between learning performance and speed.

Conclusions
This work presented a procedure that combines supervised PCA with Jolliffe's B2 and B4 procedures to select variables. The proposed procedure was compared with the original B2 and B4 methods with respect to two factors: model quality and sample complexity demands. The experiments show that, in general, the B2+S and B4+S procedures outperformed B2 and B4 when selecting variables for linear regression and neural network classification on an agricultural data set. The experiments also support the argument that PCA-based selection methods for reducing data dimensionality and sample complexity requirements usually produce better regressors and classifiers when combined with supervised information.
Regarding the application domain, reducing the sample complexity requirements is an important issue when applying multivariate methods to analyze agricultural data sets, because in many situations few cases are available. The proposed procedure can contribute to extending the range of multivariate data analysis in agricultural research through dimensionality reduction.


Table 1: Experimental results for MLR (R2)

Table 2 has a similar structure, except that columns two to five report the means and variances of the sum of squared errors of the ANNs learned on each test set.

Table 2: Experimental results for ANN classifier (SSE)