An updated analysis of seasonal variations of the security vulnerability discovery process

Several factors may influence security vulnerability discovery rates. Projecting these rates might help the development and prioritization of software patches. Previous work studied the seasonal behavior of the vulnerability discovery process for several operating systems and web-related software systems. We propose a replication study of an experiment conducted more than a decade ago to understand the changes in the dynamics of security vulnerability discovery rates. In contrast to the findings from ten years ago, the investigated systems do not exhibit a year-end peak. Moreover, the higher incidence during mid-year months for Microsoft operating systems was only noticed for the most recent Windows OSes: Windows 8.1 and Windows 10. These results highlight the relevance of reproducibility in scientific works. For cybersecurity studies in particular, understanding the impact of specific findings over time might uncover unexpected trends and provide valuable insights.

Statistics show that more than 70% of computer systems have vulnerabilities that could be exploited by an attacker (Positive Technologies, 2018). Data losses due to such vulnerabilities are typically of two types: either the data is confidential to an organization or private to an individual. Regardless of the category, such attacks may result in loss of money or reputation.
Thus, managing security flaws in computer systems becomes an essential task, not only to identify vulnerabilities but also to create a knowledge base on the subject and, therefore, to understand possible patterns of behavior. Several references show the importance of vulnerability analysis for understanding the security risks of an organization (Goel and Mehtre, 2015; Shameli-Sendi et al., 2016; Singh et al., 2016). In this context, Joh and Malaiya (2009) proposed a study on the seasonal variation in the security vulnerability discovery process of software. The authors' idea was to investigate the National Vulnerability Database (NVD) between 1995 and 2007 and discover at what time of year vulnerability discovery rates were higher, detecting possible causes. Their results suggested a possible seasonality pattern, meaning that it should be taken into account in decision-making and in security vulnerability forecasting models.
Given that new technologies, computer systems, and consequently security issues have arisen since the work conducted by Joh and Malaiya (2009), it is essential to verify whether the trends found in their work still exist. To do this, we analyzed security vulnerability data reported from 2008 to 2017 for the following systems: Windows Vista, Windows 7, Windows 8, Windows 8.1, Windows 10, Solaris, Red Hat, Ubuntu, Mac OS X, Internet Information Services (IIS), Internet Explorer (IE), Chrome, Firefox, and Apache. Additionally, two approaches were employed to find the patterns of seasonality: calculation of the seasonality index and analysis of the autocorrelation function (ACF) for each of the months. Since cyber-attacks are rapidly evolving, this type of study is important to understand the impact of temporal factors on information security issues. Previous studies investigate the behavior of security incidents and reinforce the importance of this research topic. They analyze factors such as how patterns in past movements are useful for forecasting incidents (Condon et al., 2008; Miani et al., 2015; Liu et al., 2015), the relationship between deployed security measures and the frequency of security incidents over time (Kuypers et al., 2016), and the prevalence of different sizes of data breaches over time (Liu et al., 2015).
The goal of this work is to use time-series analysis to evaluate the seasonality in the software vulnerability discovery process using a public database maintained by NIST. We want to i) identify software systems that exhibit seasonal trends (in the process of vulnerability discovery) in the data collected and ii) compare our results with the results of the previous study by Joh and Malaiya (2009). In general terms, two types of results can be found: the confirmation of the patterns identified by Joh and Malaiya (2009) or the existence of new patterns.
The paper is organized as follows. Section 2 presents the theoretical foundation of security vulnerabilities. Section 3 discusses related work. Section 4 details the study that is being replicated (Joh and Malaiya, 2009) and also the proposed methodology. Section 5 presents the results. Conclusions and future work are provided in Section 6.

Security vulnerabilities
In the context of computer security, a vulnerability can be defined as a failure in a system that allows the realization and execution of an attack on a computer system (Aparecido and Bellezi, 2014). To exploit a vulnerability, an attacker must have at least one appropriate tool or technique that can connect to a system weakness; these tools or scripts are called exploits (Whitman and Mattord, 2012).
In general, vulnerabilities have a life cycle, as illustrated in Fig. 1 (Xiao et al., 2018). According to this figure, from the discovery of a vulnerability onward, there is a race between developers/users and attackers: while developers try to release patches so that users can install them and no longer be vulnerable, attackers attempt to exploit these vulnerabilities through automated tools before users install the patches. During the vulnerability discovery phase, when developers or attackers discover system failures, these vulnerabilities can be disclosed to the public. Such disclosure may occur either in public forums or through the release of an update for vulnerability correction (Xiao et al., 2018).
Data on software security vulnerabilities are often found using specialized search portals that store and maintain information about vulnerabilities and security holes, such as NVD (NIST, 2004), Secunia (Secunia, 2009), and US-CERT (CERT, 1991). We use the NVD portal as the main source of data, as done by previous work such as Joh and Malaiya (2009), Alhazmi et al. (2007), Roumani et al. (2015), Johnson et al. (2016), Han et al. (2017), and Anand et al. (2020).

Related Work
Several studies attempt to characterize computer security vulnerabilities. Since the goal of this work is to understand factors (seasonality) that might be used to forecast vulnerabilities, we focus on this type of study in this section.
One of the first works investigating vulnerability prediction is Alhazmi et al. (2007). The authors analyzed the number of vulnerabilities per unit of code size (density) for Windows and Red Hat Linux. They found that vulnerability densities tend to fall within a range of values, similar to defect density. They used this result to model the vulnerability discovery process using a logistic model. Alhazmi and Malaiya (2008) describe the applicability and significance of several vulnerability discovery models for four operating systems (Windows XP, Windows 95, Red Hat Linux 6.2, and Red Hat Fedora). Vulnerability discovery models were examined using the Akaike information criterion (AIC) and the chi-square test. The evaluation found that the AML model is generally better in the long run, with better performance for Windows 95, Red Hat Linux 6.2, and Red Hat Fedora. Roumani et al. (2015) evaluate the use of time-series modeling for the vulnerability disclosure issue. They applied two techniques, autoregressive integrated moving average (ARIMA) and exponential smoothing, to predict the number of vulnerabilities for five web browsers: Chrome, Firefox, Internet Explorer, Opera, and Safari. Their findings suggest that such modeling techniques can be useful for vulnerability prediction.
A recent study by Movahedi et al. (2019) compared the performance of time-series models and neural network models for predicting vulnerabilities. The authors found that neural network models outperform time-series models in all cases in terms of prediction accuracy. Yasasin et al. (2020) use several vulnerability modeling techniques (exponential smoothing, ARIMA, Croston, and neural networks) but tackle a slightly different issue: predicting the number of post-release security vulnerabilities in subsequent periods of time. They found that the optimal forecasting methodology depends on the software and that some techniques (ARIMA and Croston) outperform exponential smoothing and neural networks.
Regarding specific seasonal factors, Joh and Malaiya (2009) examined a set of widely used software systems (including Apache, IIS, Internet Explorer, Firefox, Safari, and Java (JRE)) to investigate possible annual variations in vulnerability discovery processes. They also examined the weekly frequency in the distribution of security updates (patches) and exploited vulnerabilities.
For all software groups examined, the authors found a higher vulnerability detection rate in certain months. In Microsoft products, they reported a higher incidence during mid-year periods. They also observed a 7-day periodic behavior in the vulnerability activity data and confirmed that more activity occurs during the week than on weekends. Specifically, vulnerability activity figures for Tuesdays tended to be higher than for the other days of the week. Results showed that periodicity needs to be considered for the optimal allocation of resources and the evaluation of security risks.
The objective of the present study is to analyze seasonality at two different moments: between 1995 and 2007 and between 2008 and 2017. The primary motivation is to evaluate how the behavior of security issues has evolved over the years. Joh and Malaiya (2009) studied the seasonal variation in the software security vulnerability discovery process, aiming to find seasonal patterns in the available data for the studied software, to discover at which time of year the vulnerability rate tends to be higher, and to detect possible causes of this event.

Previous Work: Joh and Malaiya (2009)
From the collected data, the authors analyzed the possible existence of seasonality using two statistical methods. The first was a seasonal index method, assessed with the chi-square test, which provides a specific index for each month; the second was the autocorrelation function, which provides correlation information between months. In this way, the authors were able to investigate the behavior of each of the systems.
The authors divided the systems into three categories: Windows, non-Windows, and Web. The Windows operating systems were Windows NT, Windows XP, Windows 2000, and Windows Server 2003; the non-Windows operating systems were Sun Solaris, Red Hat Linux, HP-UX, and Mac OS X; and the Web applications were IIS, IE, Apache, and Firefox. The vulnerability data used by the authors covered the period from 1995 to 2007.
Results suggest that, for the Windows category, June and December had a high rate of vulnerability discovery, whereas February, March, April, and September had a below-average vulnerability detection rate. Both the non-Windows and Web categories showed that December had a high vulnerability discovery rate. According to the authors, this seasonality may be associated with the beginning of the school semester and with festive periods such as Christmas and New Year, since people buy new computers with the operating systems described above.

Our Approach
Some of the operating systems used by Joh and Malaiya (2009) have been discontinued. For this reason, in this paper, we perform a data collection considering more recent Windows operating systems, such as Windows Vista, Windows 7, Windows 8, Windows 8.1, and Windows 10. From these data, we replicated the same techniques used in Joh and Malaiya (2009), that is, the extraction of seasonal indexes, the chi-square test, and the autocorrelation function. Through these statistical methods, it is possible to infer the behavior of the vulnerabilities by month and year.
For better visualization of the data produced by the seasonal index and the autocorrelation function, we constructed time-series graphs for each system group and individual charts for each system, respectively. When both graphs suggested possible seasonality, we built a box-plot chart to refine the analysis.
The method used to evaluate the seasonality of the vulnerability discovery process involves the following steps:
i. Collection of vulnerability data for each system from 2008 to 2017;
ii. Calculation of the seasonality index for each system;
iii. Application of the autocorrelation function for each of the months for all systems;
iv. Individual analysis of the autocorrelation function to detect possible seasonality for each system;
v. Construction of box-plot graphics only for the systems that show possible seasonality.

Data Collection
XML files are made available by NVD in zip format. Each year has its own file with the necessary information. After extracting the XML file, the next step is to import it into an Excel table for better visualization of the vulnerability data. Excel was chosen because of its filtering tools. Fig. 2 shows a sample of how to import data from the XML file. The file contains precisely 35 header fields, but only three are essential for the data collection: the published, name3, and vendor fields, which hold, respectively, the vulnerability publication date, the name of the operating system where the vulnerability was found, and the name of the company responsible for the system.
The name3 field lists all known software types, so the second step is to filter this field according to the desired software (in this case, Solaris). All fields are then updated automatically with the corresponding information for that system. After choosing the system, the published column allows filtering vulnerabilities by month of disclosure to facilitate counting. Other information, such as the vulnerability name reported by CVE (name) and the vulnerability severity, is not relevant to this work; only the publication date of the vulnerabilities is used.
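For readers who prefer a scriptable alternative to the Excel filtering described above, the counting step can be sketched in Python. This is a minimal sketch over a hypothetical XML layout (the real NVD feed schema is more elaborate); only the field names published, name3, and vendor come from the description above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, simplified layout standing in for an NVD yearly feed.
SAMPLE = """
<entries>
  <entry published="2010-03-15" name3="solaris" vendor="sun"/>
  <entry published="2010-03-20" name3="solaris" vendor="sun"/>
  <entry published="2010-07-02" name3="windows_7" vendor="microsoft"/>
</entries>
"""

def monthly_counts(xml_text, product):
    """Count disclosed vulnerabilities per (year, month) for one product."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for entry in root.iter("entry"):
        if entry.get("name3") == product:
            year, month, _day = entry.get("published").split("-")
            counts[(int(year), int(month))] += 1
    return counts

print(monthly_counts(SAMPLE, "solaris"))  # Counter({(2010, 3): 2})
```

Filtering by name3 and grouping on the published date mirrors the two Excel steps (product filter, then month filter) in one pass.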
For the present work, we used the XML files from 2008 to 2017. For some of the chosen software, the XML files did not cover all ten years analyzed, but only the period from the software's release onward, or omitted months in which no vulnerabilities were found. The following is a summary of the data collected:

Results
This section presents the results of the seasonality analysis and compares them to the results of Joh and Malaiya (2009).

Data Analysis
The seasonality index is a measure widely used to evaluate seasonal trends and may indicate how much the average of a particular period tends to be above (or below) the expected value (Arsham, 1994). The monthly values of the seasonal index are given by Eq. (1):

s_i = d_i / d̄   (1)

where s_i is the seasonal index for the i-th month, d_i is the mean value of the i-th month, and d̄ is the grand average (footnote 1). Hence, for instance, a seasonal index of 1.25 indicates that the expected value for that month is 25% greater than 1/12 of the overall average, for which the seasonal index would be 1.
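As a concrete illustration, Eq. (1) can be computed directly from the monthly means, together with a uniform-expectation chi-square statistic of the kind used to test the indexes. The toy numbers below are illustrative, not taken from the paper, and the exact test setup in the paper is assumed:

```python
import statistics

def seasonal_indexes(monthly_means):
    """Seasonal index per Eq. (1): s_i = d_i / d_bar."""
    grand = statistics.mean(monthly_means)   # grand average d_bar
    return [d / grand for d in monthly_means]

def chi_square_stat(monthly_totals):
    """Chi-square statistic against a uniform monthly expectation."""
    expected = sum(monthly_totals) / 12
    return sum((o - expected) ** 2 / expected for o in monthly_totals)

# Toy data: mean vulnerabilities disclosed per calendar month (Jan..Dec).
means = [4, 5, 6, 5, 5, 10, 5, 5, 4, 5, 3, 3]
print(round(seasonal_indexes(means)[5], 2))  # 2.0 -> June is twice the average
```

A perfectly uniform series yields a chi-square statistic of zero; the larger the statistic, the stronger the evidence against the no-seasonality null hypothesis.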
To check whether the seasonal indexes are statistically significant, the chi-square (χ²) test for the null hypothesis H₀ (no seasonal effect) was calculated. To be statistically significant, the computed χ² value (χ²_s) must be greater than the χ² critical value (χ²_c), with a small enough p-value.

The other approach used to characterize seasonality is the autocorrelation function (ACF). ACF analysis provides specific relationship information between related months. With time-series values z_b, z_{b+1}, ..., z_n, the ACF at time lag k, denoted r_k, is given by Eq. (2) (Bowerman, 1987):

r_k = Σ_{t=b}^{n−k} (z_t − z̄)(z_{t+k} − z̄) / Σ_{t=b}^{n} (z_t − z̄)²   (2)

where z̄ = (Σ_{t=b}^{n} z_t) / (n − b + 1) represents the mean of the observations. Autocorrelation coefficients close to zero indicate the absence of seasonality in that interval of observation, while values close to 1 indicate a significant relationship in that interval.

1. http://home.ubalt.edu/ntsbarsh/business-stat/statdata/forecast.htm
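Eq. (2) translates directly into code. This sketch assumes the series starts at index b = 0; the toy series is illustrative:

```python
def acf(z, k):
    """Sample autocorrelation r_k of series z at lag k, per Eq. (2)."""
    n = len(z)
    zbar = sum(z) / n                       # mean of the observations
    den = sum((zt - zbar) ** 2 for zt in z)
    num = sum((z[t] - zbar) * (z[t + k] - zbar) for t in range(n - k))
    return num / den

# A series that repeats every 12 months has a large coefficient at lag 12.
series = list(range(1, 13)) * 5             # five identical "years"
print(round(acf(series, 12), 2))            # 0.8
```

For a perfectly 12-periodic series of five years, lag 12 keeps 4 of the 5 years' worth of matched terms in the numerator, hence the value 0.8 rather than 1.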
The value of r_k, along with correlogram analysis (a graph of r_k values arranged by lag interval), is used to establish criteria for a time series. According to Heckert et al. (2002) and Brockwell and Davis (2016), three situations can occur: i. the time series is considered "random" or "stationary" if most of the autocorrelation coefficients lie within the confidence intervals and no pattern is detected in the correlogram; ii. the time series is considered "non-stationary" when the values of the autocorrelation coefficients decrease slowly as k increases, characterizing a trending behavior; iii. the time series is considered "strongly non-stationary" when the values of the autocorrelation coefficients decrease slowly as k increases, but according to a periodicity (seasonal pattern). The main difference between situations 2 and 3 is that a "moderately" non-stationary time series does not usually exhibit periodic behavior; in such data, it is common to find isolated r_k components related to a growing (or decreasing) trend, or some types of patterns related to seasonality.
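The three cases above can be turned into a rough automatic triage. This is only a sketch: the ±1.96/√n confidence bound and the focus on lags 12 and 24 are assumed conventions for monthly data, not taken from the paper.

```python
import math

def acf(z, k):
    """Sample autocorrelation at lag k (restating Eq. (2))."""
    zbar = sum(z) / len(z)
    den = sum((zt - zbar) ** 2 for zt in z)
    return sum((z[t] - zbar) * (z[t + k] - zbar)
               for t in range(len(z) - k)) / den

def classify(series, max_lag=24):
    """Rough triage into the three correlogram cases described above."""
    bound = 1.96 / math.sqrt(len(series))        # assumed confidence bound
    coeffs = [acf(series, k) for k in range(1, max_lag + 1)]
    outside = sum(1 for r in coeffs if abs(r) > bound)
    if outside <= max_lag // 10:                 # (i) essentially random
        return "stationary"
    if abs(coeffs[11]) > bound or abs(coeffs[23]) > bound:  # lags 12 and 24
        return "non-stationary (seasonal)"       # (iii) periodic decay
    return "non-stationary (trend)"              # (ii) slow, aperiodic decay
```

In practice the paper's judgment is made by visually inspecting the correlogram; a heuristic like this only pre-screens candidates for the box-plot step.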
The results of the analysis for the software groups, using the seasonality index, chi-square test, and autocorrelation, are shown and explained in the next sections. Based on such methods, the following sequence of steps is used to identify patterns of seasonality in the data series:
i. Analyze the time series of discovered vulnerabilities per year: is it possible to see months with a greater or lesser number of vulnerabilities in all years?
ii. Compute and analyze the seasonality index: identify the months in which the index value is greater or less than one;
iii. Calculate and analyze the autocorrelation coefficients and the correlogram: exclude the software for which the series is considered stationary (that is, the seasonal indices greater or less than one found previously are potential outliers and are not associated with seasonal patterns);
iv. Investigate the box-plot of the discovered vulnerabilities per year and correlate it with the seasonality index (is the average of a given month genuinely higher, or was it the product of an outlier?).

Fig. 3 shows the time series of the number of vulnerabilities found over the ten years. For these operating systems, the total number of vulnerabilities was 1261 for Windows Vista (Fig. 3a), 1637 for Windows 7 (Fig. 3b), 471 for Windows 8 (Fig. 3c), 759 for Windows 8.1 (Fig. 3d), and 1466 for Windows 10 (Fig. 3e).

Windows Operating System
The time series for Windows systems show no visible pattern. It is possible to notice a high dispersion in the data, with a high concentration of vulnerabilities in some years (2010 and 2011 for Windows Vista and 2015 for Windows 8, for example) and a low concentration in others (2014 for Windows Vista and most of the series for Windows 7).
With the help of the seasonal index presented in Table 1 and in Fig. 4a, it is possible to note that some months have a higher probability of disclosure of vulnerabilities than others. All systems have a low seasonal index in January and December, and March shows a high seasonal index in all of them. However, to check whether such indices can be associated with seasonal patterns, additional tests are needed, such as calculating the autocorrelation coefficients.
The autocorrelation coefficients are shown in Table 2.
It is possible to see that Windows 7 displays a stationary pattern (absence of seasonality), unlike the other systems, in which several autocorrelation coefficients fall outside the confidence interval. The next step of the analysis involves the construction of box-plots for the four systems that exhibited non-stationary behavior.
With the help of the seasonal indexes shown in Table 1 and the box-plots shown in Fig. 5, it is possible to note that Windows Vista exhibits the following behavior: few vulnerabilities released in January and an increase in vulnerabilities between February and April. Windows 8 shows a decrease in vulnerabilities in January and April but an increase in the second half of the year (August-September). Windows 8.1 has few vulnerabilities in January and December but an increase in June. Likewise, Windows 10 has fewer vulnerabilities in January and December, while June exhibits an increase.

Non-Windows Operating Systems
For non-Windows operating systems, the totals included 756 vulnerabilities for Solaris and 418 for Red Hat. With the help of the seasonal indexes presented in Table 3 and in Fig. 4b, it is possible to note that some months have a higher probability of disclosure of vulnerabilities than others. All systems show a low seasonal index in February and March, but no month shows a high seasonal index across all systems.
The autocorrelation coefficients are shown in Table 4. Mac OS X exhibits a stationary pattern (absence of seasonality), unlike the other systems, in which several autocorrelation coefficients fall outside the confidence interval. The next step of the analysis involves the construction of box-plots for the three systems that exhibited non-stationary behavior.
Using the seasonal indexes shown in Table 3 and the box-plots shown in Fig. 5, it is possible to note that Solaris displays the following behavior: an increase in vulnerabilities released in January and August and few vulnerabilities released in February. Red Hat shows an increase in vulnerabilities in June and October and few vulnerabilities released in the remaining months. Ubuntu has few vulnerabilities in September but an increase in April and June.

Web Servers and Browsers
The number of vulnerabilities for Web servers and browsers included 35 for IIS, 2528 for IE, and 60180 for Firefox. The seasonal indexes in Table 5 and in Fig. 4c show that all of the systems have low seasonal indexes in January, but no month exhibits a high seasonal index across all systems. However, to check whether such indices can be associated with seasonal patterns, additional tests were conducted. The autocorrelation coefficients in Table 6 show that all systems exhibit a non-stationary pattern, with several coefficients falling outside the confidence interval. The next step of the analysis involves the construction of box-plots for the four systems that exhibited non-stationary behavior.
The analysis of the box-plots (Fig. 5) and the seasonal indexes reveals that IIS exhibits the following behavior: an increase in vulnerabilities released in July and September and few vulnerabilities in the other months. The number of vulnerabilities associated with IE increased in February and June and decreased in January. Firefox has fewer vulnerabilities in January and April but an increase in February.

Fig. 5 shows two box-plot graphs for Chrome. In the first graph, April and July present a small number of vulnerabilities, while March, August, and September show an increase. In the second graph, March, May, and August show an increase in reported vulnerabilities, with few vulnerabilities reported in most other months. Comparing the two graphs, it is possible to notice that the seasonality present in the first graph has entirely changed with respect to the second. This fact reinforces the importance of studying security events using different time windows.

Of all the software analyzed, only the Windows 7 and Mac OS systems did not present any seasonality pattern. January was the month with the lowest incidence of vulnerabilities for Windows systems and Web applications. For non-Windows OSes, February and September have a lower incidence of vulnerabilities. For most of the studied systems, June is the month with the highest incidence of vulnerabilities.

Table 7 illustrates the main similarities and differences between the two papers. For the systems studied, Joh and Malaiya (2009) found that June and December had a high rate of vulnerability discovery for Windows systems. For non-Windows systems and Web applications, December showed a high vulnerability discovery rate. However, our results show that this pattern has changed over the years: Windows systems, non-Windows systems, and Web applications now have a higher incidence of vulnerabilities in June.
In other words, the year-end peak for these systems found by Joh and Malaiya (2009) no longer exists. It would also be important to study the reasons behind such changes.

An important issue found during the analysis is related to the collected data. In many situations, NVD returned months in which no vulnerability was disclosed. Systems such as Apache and IIS, for example, had only 19 and 35 vulnerabilities, respectively, over the ten years. In future work, it would be interesting to evaluate what actually happened in those months as a way of providing context for the analysis of the results.

Our work shows the importance of performing this type of study with updated data. That is, behaviors associated with information security issues, such as the modeling of security vulnerabilities, might change over time. This result emphasizes the importance of conducting cybersecurity replication studies in order to confirm or clarify specific outcomes. The usable privacy and security community and references (Coopamootoo

Conclusion
The purpose of this paper was to find possible seasonalities in a ten-year vulnerability dataset composed of Windows operating systems, non-Windows operating systems, and Web applications. In summary, from the collected dataset of vulnerabilities, we computed the seasonal index for each system, applied the autocorrelation function, and constructed box-plot graphs for the systems that presented seasonality. Next, we compared our results with previous work (Joh and Malaiya, 2009). By comparing both results, we observed that the seasonality reported in the updated dataset (our paper) has changed. Joh and Malaiya (2009) concluded that Windows operating systems had seasonality in June and December, while other operating systems and Web applications had seasonality in December. However, this study indicates that all of these system groups exhibited seasonality in June. This result reinforces the relevance of replicating cybersecurity studies in order to understand the impact of some findings over time.
For future work, we would like to explore the seasonality factor in order to forecast the future behavior of vulnerability disclosures. For this, we might use several time-series modeling techniques such as moving average, exponential smoothing, and ARIMA (Auto-Regressive Integrated Moving Average) models. For instance, the ARIMA model can be applied in cases where the data show evidence of nonstationarity, such as the data collected for this work. Evaluating the forecast capabilities of machine learning models is another important research topic. Finally, it would be interesting to consider other relevant software, for example, IoT applications.
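As a sketch of the forecasting direction outlined above, two simple baselines illustrate the idea; a real ARIMA fit would typically rely on a library such as statsmodels. The function names, the smoothing weight, and the toy data are illustrative assumptions, not from the paper:

```python
def ses(series, alpha=0.3):
    """One-step-ahead forecast via simple exponential smoothing.
    alpha is the (assumed) weight given to the newest observation."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

def seasonal_naive(series, period=12):
    """Forecast the next value as the observation one period earlier."""
    return series[-period]

monthly = [4, 5, 6, 5, 5, 10, 5, 5, 4, 5, 3, 3] * 3  # toy 3-year series
print(ses(monthly))             # smoothed level; ignores seasonality
print(seasonal_naive(monthly))  # 4: repeats last year's same-month count
```

The contrast between the two baselines mirrors the paper's finding: if disclosure counts really are seasonal, a seasonality-aware forecast (seasonal-naive, seasonal ARIMA) should beat a purely level-based one.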