A genetic algorithm using Calinski-Harabasz index for automatic clustering problem

Data clustering is a technique that aims to represent a dataset as clusters according to the similarities among its elements. Clustering algorithms usually assume that the number of clusters is known. Unfortunately, the optimal number of clusters is unknown for many applications. This kind of problem is called Automatic Clustering. There are several cluster validity indexes for evaluating solutions, and it is known that the quality of a result is influenced by the chosen function. Based on this, this article describes a genetic algorithm for the resolution of the automatic clustering problem using the Calinski-Harabasz Index as the evaluation function. Comparisons between its results and those of other algorithms in the literature are also presented. In a first analysis, equivalent or higher fitness values are found in at least 58% of the cases for each comparison. Our algorithm could also find the correct number of clusters, or values close to it, in 33 cases out of 48. In another comparison, some fitness values are lower, even with the correct number of clusters, but the partitionings are graphically adequate. Thus, it is observed that our proposal is justified and that improvements can be studied for the cases in which the correct number of clusters is not found.


Introduction
Data clustering is a technique that organizes a dataset into clusters defined by the similarities between its elements. Sometimes the number of clusters that represents the set is initially unknown. This case is called the Automatic Clustering Problem (ACP): in addition to identifying the clustering, the ideal number of clusters is part of the solution to be discovered (Linden, 2009, Cruz, 2010, Ochi et al., 2004, José-García and Gómez-Flores, 2016, Gan et al., 2007). There are many methods available in the literature. Cruz (2010) proposed several methods for ACP resolution using the Silhouette index (SI) as the evaluation criterion. Semaan et al. (2012) presented a method for solving this problem, called MRDBSCAN, evaluating solutions with the SI function. Kettani et al. (2015) aimed to improve the initial definition of the number of clusters found in the K-means algorithm, with optimality measured by the Calinski-Harabasz index (CHI). Finally, Pacheco et al. (2017) presented an algorithm inspired by the behavior of ants to solve data clustering problems. This ACO algorithm performed its experiments with the SI evaluation function.
There are proposals in the literature that use the CHI as an evaluation criterion for the resolution of automatic clustering. Because of its simple implementation and low computational cost, its use is stated to be a good choice for finding solutions with good cluster formation (Kettani et al., 2015, Harsh and Ball, 2016). Thus, this article presents a Genetic Algorithm to solve the Automatic Clustering Problem. The procedure is based on an algorithm from the literature, and the cluster validity index used for evaluation is the CHI. Experiments were performed and their results were compared to other works in the literature.
This paper is structured as follows: the next section presents the cluster validity index applied in the ACP resolution. Section 3 describes the methodology used in this paper. The fourth section presents, compares, and analyzes the results obtained in the experiments. Finally, Section 5 presents the conclusions about the work as a whole.

The cluster validity index
A solution generated by a clustering algorithm is evaluated by a cluster validity index. Such an index analyzes the goodness of the clustering, usually by relating the internal cohesion of the clusters to the separation between them. Clustering algorithms based on metaheuristics often use cluster validity indexes as the objective function to be optimized (José-García and Gómez-Flores, 2016, Mishra et al., 2016). Some works present proposals for solving the Automatic Clustering Problem using the CHI as the cluster validity index. The use of this index is stated to be a good choice in the search for solutions with good cluster formation because, in addition to being simple to implement, it is not computationally expensive and, in general, its results are robust when compared to other cluster validation methods (Kettani et al., 2015, Harsh and Ball, 2016). Thus, the cluster validity index used in this article is the CHI, presented below.

Calinski-Harabasz index
The function evaluates cohesion through the sum of distances of the cluster elements to their respective centroids. The separation criterion is calculated from the sum of the distances between the centroid of each cluster and the global centroid of the dataset. The computational cost of this function is not high and, in general, outperforms that of other cluster validity indexes; its complexity is O(n). When used by a metaheuristic, the objective is to maximize its value. For a partition of n elements into k clusters, the function is defined as follows (Kettani et al., 2015, Mishra et al., 2016, Maulik and Bandyopadhyay, 2002, Caliński and Harabasz, 1974):

CH(k) = [ trace(B) / (k - 1) ] / [ trace(W) / (n - k) ]

where trace(B) is the between-cluster scatter, the sum over the clusters of the squared distance between each cluster centroid and the global centroid weighted by the cluster size, and trace(W) is the within-cluster scatter, the sum of the squared distances of the elements to their respective cluster centroids.
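As a reference, the index can be sketched in a few lines. The following is an illustrative NumPy implementation of the standard definition, assuming points in an (n, d) array and a hard partition given by integer labels; it is not the code used in this work.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Calinski-Harabasz index for a hard partition.

    X: (n, d) array of points; labels: length-n array of cluster ids.
    Higher values indicate denser, better-separated clusters.
    """
    n = X.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    if k < 2:
        raise ValueError("CHI needs at least two clusters")
    global_centroid = X.mean(axis=0)
    between = 0.0  # trace of the between-cluster scatter matrix
    within = 0.0   # trace of the within-cluster scatter matrix
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - global_centroid) ** 2)
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))
```

For two tight, well-separated clusters the index is large, and it grows as cohesion improves or separation increases, which is why a metaheuristic maximizes it.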

The used methodology
In this paper, the ACP is solved by an algorithm based on the Genetic Algorithm metaheuristic. The partitional form is adopted: the algorithm performs hard clustering, and the Euclidean distance determines the similarity between the elements. The methodology is based on the method named Constructive Evolutionary Algorithm with Local Search 1 (AECBL1), already known in the literature. The AECBL1 solves the ACP through the concept of the Evolutionary Algorithm metaheuristic. It has two phases: an initial one responsible for generating the initial clusters, and another consisting of a genetic algorithm with local search (Cruz, 2010).

The formation of initial clusters
For the initial organization of clusters, the procedure named Generate Initial Solution 1 (GSI1) is used. From the dataset X, it organizes the elements into sets to form the genetic algorithm's initial solution. To decrease the cardinality of the problem's input data, each of these temporary clusters is treated as a single object by the algorithm. Its pseudocode is presented in Algorithm 1 (Cruz, 2010).
A region with an agglomeration of points originates a cluster. For each element, the shortest distance from it to any other element is determined. Then, the average of all these shortest distances, called davg, is calculated (Cruz, 2010). From this, each element x_i of X is defined as the center of a circle whose radius is r = α * davg. Then, the group of elements belonging to each circle, N_i = circle(x_i, r), is generated (Cruz, 2010).

Algorithm 1: GSI1
A list T stores the number of elements of each circle and is sorted in descending order of these cardinalities. The elements corresponding to each position of T then define the initial clusters produced by this procedure, forming B = {B_1, B_2, . . . , B_t}. These clusters do not share any elements because, when a circle is chosen, the elements belonging to it will not be part of any other (Cruz, 2010).
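The steps above can be sketched as follows. This is an illustrative implementation under the stated definitions (nearest-neighbour distances, davg, r = α * davg, circles processed in descending order of cardinality), using an O(n^2) distance computation; it does not reproduce the original implementation.

```python
import numpy as np

def gsi1(X, alpha=1.0):
    """Sketch of the GSI1 initial-clustering step.

    Returns a list of index lists B = [B1, ..., Bt], pairwise disjoint.
    """
    n = len(X)
    # pairwise Euclidean distances (O(n^2); fine for a sketch)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # davg: average over all points of the nearest-neighbour distance
    np.fill_diagonal(dist, np.inf)
    davg = dist.min(axis=1).mean()
    r = alpha * davg
    np.fill_diagonal(dist, 0.0)
    # circle(x_i, r): every point within radius r of x_i (including x_i)
    circles = [set(np.flatnonzero(dist[i] <= r)) for i in range(n)]
    # process circles in descending order of cardinality (the list T)
    order = sorted(range(n), key=lambda i: len(circles[i]), reverse=True)
    assigned = set()
    clusters = []
    for i in order:
        members = circles[i] - assigned  # already-chosen points are skipped
        if members:
            clusters.append(sorted(members))
            assigned |= members
    return clusters
```

Because each point is removed from consideration once its circle is chosen, the resulting initial clusters are disjoint, as required.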

Genetic Algorithm
The methodology of this work called Genetic Algorithm with Local Search 6 (AGBL6) is based on AECBL1's evolutionary module. It corresponds to an Evolutionary Algorithm with three Local Search techniques and notions of adaptive memory (Cruz, 2010).
After the GSI1 processing, the algorithm begins its execution by initializing a population. Next, Gmax iterations, corresponding to the Genetic Algorithm's generations, are performed. In each generation, individuals are selected for the reproduction, crossover, and mutation operations. The selected pairs are defined as follows: the first individual is chosen among the 50% fittest and the second among the entire population, both at random and without repetition in the choice of individuals. The number of pairs that undergo crossover is defined by a rate pc, and the two-point method is applied. Then the mutation operator is applied with probability pm; it exchanges one of the characters of one of the individuals in the pair. If any generated individual represents an invalid configuration, one without clusters, another is randomly generated to replace it. Note that whenever a chromosome is generated, either by applying an operator or during a search process, it is evaluated by the CHI, used here as the fitness function (Cruz, 2010).
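A minimal sketch of the selection, crossover, and mutation operators described above, assuming chromosomes represented as lists of symbols. The function names are illustrative, the pairing rule is interpreted here as "no repetition within a pair", and `select_pairs` returns population indices; none of this is the exact original code.

```python
import random

def select_pairs(population, fitness, n_pairs):
    """First parent drawn at random from the fittest 50%, second from
    the whole population; the two members of a pair are distinct."""
    ranked = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    top_half = ranked[: max(1, len(ranked) // 2)]
    pairs = []
    for _ in range(n_pairs):
        a = random.choice(top_half)
        b = random.choice([i for i in range(len(population)) if i != a])
        pairs.append((a, b))
    return pairs

def two_point_crossover(p1, p2):
    """Two-point crossover: swap the segment between two cut points."""
    i, j = sorted(random.sample(range(1, len(p1)), 2))
    return p1[:i] + p2[i:j] + p1[j:], p2[:i] + p1[i:j] + p2[j:]

def mutate(chrom, alphabet):
    """Mutation: exchange one randomly chosen gene for another symbol."""
    pos = random.randrange(len(chrom))
    new = random.choice([s for s in alphabet if s != chrom[pos]])
    return chrom[:pos] + [new] + chrom[pos + 1:]
```

In the full algorithm, every chromosome produced by these operators would then be checked for validity and evaluated by the CHI.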
The Individual Inversion local search is applied to the fittest individuals in the population every t iterations. Path-Relinking runs every r iterations between the best individual of the population and the best of the Elite set. The Elite set holds five individuals and saves the best solution of each iteration, provided it is better than the worst solution in the set and different from all the others; at the end of the processing, the Peer Exchange search runs on this set (Cruz, 2010).
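Path-Relinking between two chromosomes can be sketched as below. This is a generic version, not the exact AECBL1 procedure: it walks gene by gene from one solution toward the other and keeps the best intermediate according to a fitness callable (the CHI in this work; the example test uses a toy fitness).

```python
def path_relinking(start, target, fitness):
    """Move from start toward target one differing gene at a time,
    returning the best solution seen along the path and its fitness."""
    current = list(start)
    best, best_fit = list(current), fitness(current)
    for pos in range(len(current)):
        if current[pos] != target[pos]:
            current[pos] = target[pos]  # take one step along the path
            f = fitness(current)
            if f > best_fit:
                best, best_fit = list(current), f
    return best, best_fit
```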
Finally, the algorithm returns the best of all solutions after completing its operations. Algorithm 2 shows the method's operation (Cruz, 2010).

Computational experiments
The developed work was implemented in the C++ programming language using the g++ compiler in version 4.8.4.
The parameters were set as follows:
• α: tested values vary within the range [0.5, 12], according to the particularities of the dataset.
• Tpop: the population size was defined as 1/3 of the number of generated initial clusters, i.e., 1/3 of the chromosome size, limited to a maximum of 30 individuals.
• Gmax: the number of generations was set to 50.
• pc: the number of pairs of individuals selected for the crossover operation is equivalent to 40% of the population size.
• pm: the chance of the mutation operation being applied to a selected individual was set at 10%.
• t: the Individual Inversion search was stipulated to be applied every five iterations. Thus, for the 20% best individuals in the population the search is performed in the following generations: 5, 10, 15, 20, 25, 30, 35, 40, 45, and 50. The operator is applied similarly to the AECBL1 algorithm, with the same frequency set.
• r: the application of Path-Relinking was defined for the following generations: 18, 28, 38, and 48. The AECBL1 performs the search this way and it was decided to keep it as well.
The following tables present a comparison between the algorithm presented in this work and some literature proposals, namely AECBL1, MRDBSCAN, AK-means, and ACO (Cruz, 2010, Semaan et al., 2012, Kettani et al., 2015, Pacheco et al., 2017). To compare with the other methods, the algorithm is executed normally, the solution is found, and the Silhouette index is then computed, since most of the methods in the literature use this index to report their results. So AGBL6 uses the CHI to find its solutions and, at the end, evaluates its final solution with the SI in order to compare with the other algorithms. In each table the highest results are highlighted in bold. Results whose SI values differ by up to 0.02 are considered equivalent, so that possible rounding in the strategies of other works is disregarded. Each algorithm from this work was run 30 times for each instance.
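For this comparison step, the Silhouette index of a final solution can be computed as sketched below. This is an illustrative NumPy implementation of the standard definition, where a singleton cluster receives a silhouette of zero by convention; it is not the code used in this work.

```python
import numpy as np

def silhouette_index(X, labels):
    """Mean silhouette over all points: s(i) = (b - a) / max(a, b),
    where a is the mean distance from point i to its own cluster and
    b is the smallest mean distance from i to another cluster."""
    n = len(X)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))  # pairwise Euclidean distances
    clusters = np.unique(labels)
    s = np.zeros(n)
    for i in range(n):
        own = labels == labels[i]
        if own.sum() == 1:
            s[i] = 0.0  # singleton cluster: silhouette defined as 0
            continue
        a = dist[i][own].sum() / (own.sum() - 1)  # excludes the zero self-distance
        b = min(dist[i][labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()
```

The index lies in [-1, 1], with values near 1 indicating compact, well-separated clusters, which is why it is the common yardstick in the literature results compared here.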
A comparison between AGBL6 and AECBL1, MRDBSCAN and ACO algorithms is shown in Table 1. The "Dataset" column indicates the name of each dataset. The column "AGBL6" includes the results obtained using the CHI function as an evaluation. The "MRDBSCAN", "AECBL1", and "ACO" columns contain the results of the respective literature algorithms: MRDBSCAN, AECBL1, and ACO. The "Literature" column corresponds to the known information in the literature. Each dataset has its results presented in three rows. The first, "Number of clusters", contains the number of clusters obtained in the final solution. The second, "SI value", includes the SI values for the final solution. The third, "CHI value", presents the values obtained by the CHI in AGBL6.
Considering the AGBL6, the technique showed good results, and its SI values often tied with those of the literature works. In comparison with AECBL1, similar results were found in 28 instances. Lower values were obtained in 19 cases, and AGBL6 won once, in the broken ring dataset. Although AGBL6 lost in approximately 39% of the cases, it obtained results equivalent to those of AECBL1 in about 58% of the datasets. Regarding the MRDBSCAN algorithm, the SI values were tied in 17 instances, AGBL6 won in 25 others, and its results were lower only five times. Thus, AGBL6 obtained SI values equal to or greater than MRDBSCAN's in at least 89% of the cases. Considering ACO, the values were equivalent in 15 datasets, higher in 16 cases, and lower in eight. Thus, the results of AGBL6 tied with or beat ACO's in at least 79% of the instances.
Regarding the number of clusters, AGBL6 found the correct value for 22 instances. AECBL1 obtained the correct value more often, for a total of 37 datasets. The other two, MRDBSCAN and ACO, found it in 14 and 15 situations, respectively. Also considering the discovery of close values, both AGBL6 and ACO obtained values with a certain degree of similarity in 11 instances. MRDBSCAN's values were close in 17 cases, and AECBL1's in eight.
A comparison between AGBL6 and AK-means algorithms is shown in Table 2.
The name of each dataset is displayed in the "Dataset" column. The "AGBL6" column presents the results generated from AGBL6. The results obtained by AK-means are presented in the "AK-means" column. The "Literature" column corresponds to the known information in the literature. Each dataset has its results presented in three rows. The first, "Number of clusters", contains the number of clusters generated by the solution. The second, "SI value", displays the values obtained by the Silhouette Index for the final solution. The last, "CHI value", indicates the values obtained by the respective function in AGBL6.
When comparing the results, it is observed that AGBL6 obtained lower SI values in 12 cases and tied or won in four.
The AGBL6 generated the correct number of clusters for half of the instances. AK-means hits this value in 13 cases. Considering values close to correct, with a difference of up to two units, AGBL6 did so for two datasets, and AK-means in one case.
Some AGBL6 results obtained unsatisfactory SI values, being lower than those of other algorithms in the comparisons. Considering AK-means, it was observed that some SI values of its solutions are much higher than those obtained by AGBL6. However, for some datasets, although the AGBL6 SI result is smaller, the correct number of clusters is found. Given these facts, it was decided to perform a more careful analysis: some of the resolutions in which the correct number of clusters is found by AGBL6, and whose instance dimensions allow for illustration, are examined graphically. Solutions of four R^2 datasets, different in type and size, are presented below: ruspini, R15, 300p2c1, and 1000p6c. Each figure was made with the gnuplot program, via the command line, and the illustrations represent the organization of the clusters generated for each instance.
Graphically, it is clear that the configuration of clusters for the four datasets is adequate, presenting some homogeneity in each cluster and good demarcations between the clusters. Although higher silhouette values were not obtained, the visual analysis, together with the correct number of clusters found, indicates that the generated solutions can be considered optimal, since their partitions are correct. The AGBL6 algorithm is a valid strategy, but it is not 100% accurate. Therefore, it can be improved to fit the cases where the correct number of clusters is not found.

Conclusions and future works
This work presented a methodology for the resolution of the Automatic Clustering Problem based on the Genetic Algorithm metaheuristic, using the Calinski-Harabasz index for the formation of clusters. The results obtained were compared to other works in the literature.
When comparing AGBL6 to other studies, it was found able to produce fitness values equivalent to or higher than ACO's and MRDBSCAN's in 79% and 89% of the cases, respectively. There were also ties with AECBL1 for 58% of the datasets. Regarding the number of clusters, in the first comparison with literature works, AGBL6 obtained the correct value for 22 instances out of 48, second only to AECBL1. The comparison with AK-means showed that its fitness values are often higher than the results obtained by AGBL6, even when the correct number of clusters was found by this proposal. In ten of the 16 datasets, AGBL6 was able to find the correct number of clusters or values close to it. Thus, a detailed analysis was performed, observing the solutions' results graphically. It was found that the partitions are adequate, constituted by clusters of appropriate formation. It can thus be concluded that the AGBL6 proposal is justified; however, improvements should be studied to correct the cases where the correct number of clusters is not found.
In future work, the use of a multi-objective optimization method can be a promising strategy in the search for better solutions. Another possibility to be explored is the adoption of different cluster validity indexes at different stages of the algorithm: one function can be employed only in the initial phase of cluster formation and another in the remainder of the processing.