Evaluation of Boosted Regression Tree for the Prediction of the Maximum 24-Hour Concentration of Particulate Matter

Air pollution poses a considerable danger to health and the environment. The objective of this study was to assess the characteristics of air quality and predict PM10 concentrations using boosted regression trees (BRTs). Maximum daily PM10 concentration data from 2002 to 2016 were obtained from the air quality monitoring station in Kuching, Sarawak. Eighty percent of the monitoring records were used for training and twenty percent for validation of the models. The best iteration of the BRT model was selected by optimizing the prediction performance, while the BRT model itself was constructed from multiple regression tree models. The two main parameters, the learning rate (lr) and tree complexity (tc), were fixed at 0.01 and 5, respectively. The number of trees (nt) was determined using an independent test set (test), 5-fold cross-validation (CV) and out-of-bag (OOB) estimation. The BRT model selected by CV was a better guide than the one selected by OOB for predicting PM10 concentrations. The performance indicators showed that the model was adequate for next-day prediction (PA = 0.638, R² = 0.427, IA = 0.749, NAE = 0.267, and RMSE = 28.455).


I. INTRODUCTION
In Malaysia, air quality is monitored continuously throughout the country by the Department of Environment (DOE) at 65 stations. Afroz et al. [1] discussed air pollution caused by open burning and forest fires in Malaysia, which has become harmful to public health and the environment. According to [2], PM10 and O3 are the major causes of the unhealthy days recorded in Malaysia. PM10 is particulate matter with an aerodynamic diameter of less than 10 μm [3]. It is one of the main causes of pneumoconiosis once it enters the bronchi and alveoli. The smaller the dust particles, the deeper into the respiratory tract they penetrate [4].
Many previous studies have predicted future PM10 concentrations using a variety of methods. Multiple linear regression (MLR) is the most common of these. Juneng et al. [5] used MLR to analyse the predictive relationship between the dependent variable (PM10) and the independent variables. They showed that local meteorological factors, particularly local surface air temperature, humidity and wind speed, dominate the fluctuations of PM10 over the Klang Valley during the summer monsoon. Moreover, Ul-Saufie et al. [6] used a quantile regression model to predict future (next day, next 2 days and next 3 days) PM10 concentration levels in Seberang Perai, Malaysia, and compared the results with those of MLR. Despite the success of MLR, according to [7], it has difficulty identifying the most important contributors when there is high correlation, or multicollinearity, between the independent variables in the regression equation. One of the favoured techniques for predicting a complex system is the artificial neural network (ANN), such as the ANN model used by [8] to predict PM10 concentrations from the hourly data of a subway platform. According to [9], however, the predictive aspect of validation in an ANN model is not sufficient to fully assess whether the developed model captures the underlying dynamics between the independent and dependent variables.
BRTs are very reliable and flexible for dealing with complex responses, including interactions and nonlinearities [10]. A BRT is a single model that combines many regression trees. Each regression tree grows by repeated binary splits and stops when certain criteria are met. In recent years, BRTs have been successfully implemented in air quality forecasting applications [11]-[14]. Table I lists recent studies that have been conducted on air pollution in Malaysia. It shows that few studies have used a BRT to predict PM10 concentrations in Malaysia. A BRT works very well with large datasets and is robust to missing values and outliers. Therefore, this study was conducted to predict PM10 concentrations using the BRT approach developed by [15]. In contrast to other researchers, who used hourly and daily averaged data, this study used maximum daily data. Furthermore, this study used the BRT to predict next-day concentrations, which differs from the BRT prediction produced by [16].

II. MATERIALS AND METHODS
The main research site for this study was Kuching, Sarawak (Latitude: 1°36'27" N; Longitude: 110°22'42" E). Kuching is the capital city of Sarawak, and it has been classified by the DOE as an industrial area in the state. Sarawak is the largest of the 13 states in Malaysia. It is located in northwest Borneo, and is bordered by the Malaysian state of Sabah to the northeast, Kalimantan (the Indonesian portion of Borneo) to the south, and Brunei to the north. The sampling station was the Kuching Air Monitoring Station (Latitude: 1°33'44" N; Longitude: 110°23'19" E). This area was selected to provide an overall representation of the level of air quality in Kuching, Sarawak.
The parameters PM10, CO, SO2, NO2, relative humidity (RH), temperature (T), and wind speed (WS) were used to predict the maximum 24-hour concentration of PM10. The BRT model was fitted in R version 3.4.2 using the GBM (Generalized Boosted Regression Models) package version 1.6-3.1. The GBM package offers three methods for estimating the optimal number of iterations, namely, five-fold CV, an independent test set (test), and out-of-bag (OOB) estimation.
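As a rough illustration of how five-fold CV can pick the number of iterations, the sketch below deals the training rows into folds and averages the per-iteration validation errors across folds, taking the iteration with the lowest mean error. It is plain Python, not the R workflow used in this study. The function names `kfold_indices` and `best_iteration` and the toy error curves are hypothetical; the curves stand in for the errors a boosted model would produce on each held-out fold.

```python
import random


def kfold_indices(n, k, seed=0):
    """Shuffle row indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def best_iteration(fold_curves):
    """Average per-iteration validation errors over folds.

    fold_curves[f][m] is the validation error of fold f after m + 1 trees.
    Returns (number_of_trees, mean_curve) for the lowest mean error.
    """
    n_iter = len(fold_curves[0])
    n_folds = len(fold_curves)
    mean_curve = [sum(curve[m] for curve in fold_curves) / n_folds
                  for m in range(n_iter)]
    best = min(range(n_iter), key=mean_curve.__getitem__)
    return best + 1, mean_curve


# Five folds over 3198 training rows (the training-set size reported in Section III).
folds = kfold_indices(3198, 5)
print([len(f) for f in folds])

# Toy validation-error curves (assumed values, one list per fold).
toy_curves = [
    [4.0, 3.0, 2.5, 2.7],
    [4.2, 3.1, 2.4, 2.6],
    [3.8, 2.9, 2.6, 2.8],
]
nt, curve = best_iteration(toy_curves)
print(nt)
```

Each fold serves once as the validation part while the remaining folds are used for fitting; the final model would then be refitted on all training rows with the selected number of trees.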
The OOB method assesses the reduction in deviance on observations that were not used in selecting the next regression tree. Ridgeway et al. [21] state that OOB is a conservative way to find the best iteration, as it underestimates the reduction in deviance. The advantage of this method is that no information available for learning the structure of the model is lost, since it does not set aside a large independent data set. According to Kohavi et al. [22], CV estimates of predictive performance may be erratic, and the procedure is repetitive: the model is cross-validated once per fold, and the final GBM model is then fitted to all the data with the selected number of iterations. This study used five-fold CV. Lastly, the independent test set method uses a single holdout test set to select the optimal number of iterations. The disadvantage of this method is that the holdout set takes up a large number of observations, leaving a reduced data set to estimate the overall structure of the model. The steps for the BRT algorithm are summarized as follows [23]: Input:

 
Step 3: Get the final output


The BRT algorithm fits decision trees to the data, and the loss function is used to appraise how well the model predicts.
Step 2, the additive fitting of weak learners, comprises four sub-steps. Here $r_{mi}$ is the negative gradient for the i-th sample at the m-th tree, and $R_{jm}$ is the j-th leaf node of the m-th tree. It is a looping process that fits a regression tree to the residuals. This means that once the first tree has been fitted, the next tree is fitted to the prediction error of the current model in order to improve its accuracy. The learning rate (lr) for this study was set at 0.01. The three methods for estimating the optimal number of iterations (test, OOB, and CV) each monitor an error estimate on data not used for fitting and identify the number of iterations beyond which it stops improving. Performance indicators were used to determine the best BRT model from the three methods (OOB, test and CV) for future PM10 concentration predictions in Kuching, Sarawak. The models were validated by the root mean square error (RMSE), normalized absolute error (NAE), index of agreement (IA), prediction accuracy (PA) and coefficient of determination (R²). The equations used were reported by [24].
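The looping process described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GBM implementation used in the study: it assumes squared-error loss (so the negative gradient is simply the residual), uses single-split trees (stumps) instead of the tree complexity of 5 used here, and all function names are hypothetical.

```python
def fit_stump(X, r):
    """Fit one binary split to the residuals r by least squares.

    Returns (feature, threshold, left_value, right_value).
    """
    n, p = len(X), len(X[0])
    total = sum(r)
    best = None
    for j in range(p):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for pos in range(1, n):
            left_sum += r[order[pos - 1]]
            a, b = X[order[pos - 1]][j], X[order[pos]][j]
            if a == b:
                continue  # cannot split between identical values
            # Maximizing this quantity is equivalent to minimizing squared error.
            gain = left_sum ** 2 / pos + (total - left_sum) ** 2 / (n - pos)
            if best is None or gain > best[0]:
                best = (gain, j, (a + b) / 2,
                        left_sum / pos, (total - left_sum) / (n - pos))
    if best is None:  # no valid split: predict the mean residual everywhere
        mean = total / n
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, x):
    j, threshold, left_value, right_value = stump
    return left_value if x[j] <= threshold else right_value


def boost(X, y, n_trees=500, lr=0.01):
    """Gradient boosting for squared-error loss with stumps as weak learners."""
    f0 = sum(y) / len(y)                  # Step 1: constant initial model
    pred = [f0] * len(y)
    trees = []
    for _ in range(n_trees):              # Step 2: add one weak learner per pass
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # negative gradient
        stump = fit_stump(X, residuals)   # leaf regions and leaf values
        trees.append(stump)
        pred = [pi + lr * stump_predict(stump, xi) for pi, xi in zip(pred, X)]
    return f0, lr, trees                  # Step 3: the final additive model


def predict(model, x, n_trees=None):
    """Predict with the first n_trees trees (all trees if n_trees is None)."""
    f0, lr, trees = model
    return f0 + lr * sum(stump_predict(t, x) for t in trees[:n_trees])


# Toy data (assumed values): one predictor, a step-shaped response.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [10.0, 10.0, 30.0, 30.0]
model = boost(X, y, n_trees=300, lr=0.01)
for m in (1, 100, 300):
    print(m, round(predict(model, [0.0], n_trees=m), 3))
```

With a learning rate as small as 0.01, each tree corrects only a small fraction of the remaining error, which is why many trees are needed and why the number of trees has to be chosen by one of the test, OOB or CV methods.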

III. RESULTS AND DISCUSSION
The model was developed using 80% of the 3998 sets of data (3198) to forecast for the next day, while the remaining 20% (800) were used to compare the performance for future predictions of PM10 concentrations in Kuching, Sarawak. Table II shows the descriptive statistics of the gaseous and meteorological parameters in Kuching, Sarawak. According to Uyanik et al. [25], a variable whose skewness coefficient lies within the acceptable range of ±1 may be regarded as not substantially skewed. PM10, CO, NO2, SO2 and relative humidity were highly skewed because their skewness coefficients were less than -1 or greater than +1, while wind speed and temperature were moderately skewed since their skewness coefficients were between -½ and -1 or between +½ and +1. The kurtosis coefficients also differed greatly from that of the normal distribution, indicating that these variables were not normally distributed.
Fig. 1 shows the box plots and descriptive statistics for the maximum daily PM10 concentrations in Kuching, Sarawak from 2002 to 2016. A box plot is a simple graphical display that is suitable for showing important data features such as the central tendency, dispersion, skewness and potential outliers through the three quartiles and the minimum and maximum observations [26]. The skewness values indicated that the PM10 record contained extreme events. The maximum value for the PM10 concentration was 526 μg m⁻³ (07 October 2012) while the minimum value was 14 μg m⁻³. In 2012, the country experienced several short spells of haze due to transboundary air pollution as a result of forest fires in Central and Northern Sumatra, Indonesia. These contributed to a slight deterioration in the overall air quality. In many years, the recorded PM10 concentrations in Kuching, Sarawak reached 200 μg m⁻³, which is very unhealthy.
Fig. 2 shows the long-term monthly record of air quality data in Kuching, Sarawak. PM10 concentrations were found to be higher from August to October.
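The skewness rule of thumb applied to Table II can be written out as a short Python sketch. The estimator below is the plain moment coefficient of skewness; which estimator the statistics software used for Table II is not stated, so this choice is an assumption, as are the function names, the label for the |skewness| < ½ band, and the toy data.

```python
def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


def classify_skew(g):
    """Apply the thresholds quoted in the text to a skewness coefficient g."""
    size = abs(g)
    if size > 1:
        return "highly skewed"
    if size >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"  # label assumed; the text does not name this band


# Toy series (assumed values): mostly low readings with one extreme episode.
toy_pm10 = [1.0, 1.0, 1.0, 1.0, 10.0]
g = skewness(toy_pm10)
print(round(g, 3), classify_skew(g))
```

A single extreme reading is enough to push a short series into the highly skewed band, which mirrors how haze episodes dominate the shape of the PM10 distribution.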
According to [27], the inter-annual phenomenon known as the El Niño-Southern Oscillation (ENSO) is substantially related to an increase in sea surface temperatures. As a result, rainfall over Southeast Asia decreases, which affects the Malaysia-Indonesia region, and anomalous easterly winds from August to October may enhance the transport of pyrogenic emissions across international borders from Kalimantan to Kuching, Sarawak. Incidents of open burning and forest fires increased due to ENSO, such as the 1997 forest fires in Borneo and Sumatra. The concentrations of CO, SO2 and NO2 were due to emissions from motor vehicles used by locals as well as tourists. The NO2, CO and SO2 profiles showed a similar pattern throughout the years. Since an increase in relative humidity lowers the temperature in Kuching, Sarawak from November to December, it also decreases the PM10 concentration.
Performance indicators were used to compare the performances for future predictions of PM10 concentration in Kuching, Sarawak. Table III shows the values of the performance indicators. The accuracy measures used were the prediction accuracy, coefficient of determination, and index of agreement, while the error measures used were the normalized absolute error and root mean square error. The results showed that CV was the best method for estimating the optimal number of iterations in the BRT.
As illustrated in Fig. 3, the most influential variable for predicting the maximum PM10 concentration for the next day was the PM10 concentration for the previous day, with a relative influence of 61.41%. The second most influential variable was either NO2 or CO, at about 10%. The least influential variable for predicting the maximum PM10 was SO2, at only 2.32%. Therefore, the outcome is more precise if the previous PM10 concentrations are used as one of the parameters to predict future PM10 concentrations. According to Chaloulakou et al. [28], the coefficient of determination, R², improved to 0.65 when the previous day's PM10 concentrations were used as an extra input. In addition, Caselli et al. [29] found that using the previous day's PM10 concentrations as an independent variable gives better results than a model without them. Reference [30] obtained results similar to those of [28] and [29].
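For readers who want to compute indicators of this kind, the sketch below gives common textbook forms of RMSE, NAE, IA and R² in Python. The exact equations used in this study are those reported in [24] and may differ in detail (for example, in using n - 1 instead of n), so the formulas here are assumptions, as are the function names and the toy data. PA is left out because its formula is given only in [24].

```python
import math


def rmse(obs, pred):
    """Root mean square error (divisor n assumed)."""
    n = len(obs)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / n)


def nae(obs, pred):
    """Normalized absolute error: total absolute error over total observed."""
    return sum(abs(p - o) for o, p in zip(obs, pred)) / sum(obs)


def index_of_agreement(obs, pred):
    """Willmott's index of agreement; 1 means perfect agreement."""
    obar = sum(obs) / len(obs)
    num = sum((p - o) ** 2 for o, p in zip(obs, pred))
    den = sum((abs(p - obar) + abs(o - obar)) ** 2 for o, p in zip(obs, pred))
    return 1 - num / den


def r_squared(obs, pred):
    """Squared Pearson correlation between observed and predicted values."""
    obar = sum(obs) / len(obs)
    pbar = sum(pred) / len(pred)
    cov = sum((o - obar) * (p - pbar) for o, p in zip(obs, pred))
    var_o = sum((o - obar) ** 2 for o in obs)
    var_p = sum((p - pbar) ** 2 for p in pred)
    return cov ** 2 / (var_o * var_p)


# Toy observed and predicted concentrations (assumed values, in μg m⁻³).
observed = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
for name, fn in [("RMSE", rmse), ("NAE", nae),
                 ("IA", index_of_agreement), ("R2", r_squared)]:
    print(name, round(fn(observed, predicted), 4))
```

Lower values are better for the two error measures (RMSE and NAE), while values closer to 1 are better for the accuracy measures (IA and R²).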

IV. CONCLUSION
This study has shown that the BRT method can be used as an alternative method to predict the maximum daily PM10 concentration for the next day in Kuching, Sarawak. The OOB, CV and test methods were used to determine the number of trees. An assessment of the performance of the model showed that the CV method gave better predictions, with lower error (NAE = 0.2673, RMSE = 28.4554) and greater accuracy (IA = 0.7492, PA = 0.6375, R² = 0.4269), than the OOB and test methods. Because this study used maximum daily data, its results are more relevant for helping the government or authorities to issue early warnings to the public about severe haze in the future.