Accurate Prediction of Streamflow Using Long Short-Term Memory Network: A Case Study in the Brazos River Basin in Texas

Accurate prediction of streamflow plays a pivotal role in effective reservoir system operations. Specifically, streamflow forecasting provides valuable information for reservoir operators to make critical decisions on water release amounts so as to maximize reservoir storage benefits, considering tradeoffs among flood control, municipal water supply, irrigation, hydropower, etc. This task, however, poses daunting challenges due to the complex mechanisms of the underlying physical processes as well as the influence of uncontrollable factors. Hence, developing a robust, mathematically driven model, in tandem with the supervision of proficient hydrologists for validation purposes, to ensure an accurate forecast of discharge flows is of paramount importance. To this end, a deep learning framework using a variant of recurrent neural networks called the Long Short-Term Memory (LSTM) network is presented and evaluated, without loss of generality, for a watershed outlet at the United States Geological Survey (USGS) gauge station near Hempstead within the poorly gauged region of the Brazos basin in Texas, with temporal coverage of 2007-2010. In this work, antecedent precipitation observations and climate variability indices are utilized as the potential predictors. Our model is, however, scalable and transferable, and can be deployed across basins with various drainage areas. We assess the performance of our predictive model via the Pearson correlation coefficient (r) and the Nash–Sutcliffe model efficiency (NSE) coefficient between the predicted and observed streamflow, achieving r of 0.9542 and NSE of 0.8859, respectively.


I. INTRODUCTION
Decision making relative to water resources requires the characterization and understanding of heterogeneous, and often chaotic, environmental systems, and accordingly, a well-grounded scheme for water budget management. Reliable planning and management of water resources to ensure sustainable use of watershed budgets necessitates precise and robust models that enable scientists to examine the potential factors contributing to streamflow patterns. Conventional physically based models have long been applied to simulate hydrological regimes. Nevertheless, characterized by complexity, non-linearity, and uncertainty in variable estimation, these models have often failed to meet the practical needs of dynamic hydrological analysis. These issues are aggravated when the data are erroneous or environmental noise leaks into the observations, which consequently deteriorates model performance. Moreover, as [1] declares, physically based models require a large amount of data for calibration and validation purposes, and are thus computationally intense. We, however, aim at developing a nimble data-driven model that can efficiently discover the relationships between the input and output variables given a sufficient amount of data. Additionally, this model needs to be robust enough to hold its promise even in the presence of noise.
Data-driven models, and deep learning architectures in particular, have recently gained immense applicability in hydrologic modeling, as the underlying inter-relationships among the parameters are learned through the data in an automated manner, without any external effort exerted by human experts [2]-[4]. These models often outperform statistical models, as no prior model assumption for the input-output relation is required (examples of statistical methods can be found in [5]-[11]). An appropriately designed deep learning architecture would in turn enable us to extract the governing physical inter-relationships, with the learning parameters being adjustable [12]. Moreover, these models are in principle scalable, leading to fast-computing engines that process variable amounts of data efficiently. Backed by advanced optimization theories, and transferable enough to be deployed under variant hydrological conditions and across different basins, these models have now emerged as powerful alternatives to conventional hydrological models. Specifically, for the purpose of this work, we compare our proposed data-driven model against the Catchment-based Macro-scale Floodplain (CaMa-Flood) model [13]-[16], a physically explicit hydrological model designed to simulate the hydrodynamics of continental-scale rivers. Interested readers are referred to [17].
Several studies have investigated the capability of artificial intelligence to forecast streamflow. To name a few, the authors of [18], [19] presented Support Vector Machines (SVMs) to forecast flows at two time scales: seasonal flow volumes and hourly stream flows. In a similar study, Lin et al. [20] compared the performance of autoregressive moving-average (ARMA) models with SVMs. Their results demonstrate the prominent potential of SVMs for long-term streamflow prediction.
The objective of this study is, therefore, to propose a fully automated framework to forecast the streamflow at a desired outlet of a watershed, for both short- and long-term purposes, based on past meteorological input. We work this out by performing basin delineation and extracting the potential predicting cells within the gridded digital elevation model (DEM) of the basin. We exploit the antecedent precipitation observations (the fundamental input variable for the hydrological modelling of river basins [21]) and the indices denoting climate variability as the potential predictors for streamflow forecasting.
The rest of the paper is organized as follows: Section II briefly introduces the region of study for this work. The dataset used in this study is explained in Section III. Section IV details the proposed method to detect the cells that potentially contribute to the streamflow at the location of an outlet. The results are shown in Section V. Section VI concludes the paper.

II. REGION OF STUDY
The Brazos River is the 11th longest river in the United States, and its basin is the second biggest by area within Texas, with a size of 45,000 km², flowing 840 miles from the confluence of the Salt and Double Mountain forks in Stonewall County to the Gulf of Mexico and carrying the largest average annual flow volume among the rivers in Texas (Fig. 1). Note that this topography is originally a 1 km Shuttle Radar Topography Mission (SRTM) DEM, and the map shown is the 0.0625° spatially averaged version. The aforementioned characteristics turn this basin into an appealing case study for scientific scrutiny. We explore the watershed outlet with coordinates 30.09375° (latitude) and -96.21875° (longitude), located at the lower reach of the basin. Most of the cells within the basin drain into this outlet, providing us with sufficient data points to train our learning architecture.

III. DATASET
The precipitation data utilized in this work are extracted from the publicly available Livneh database [22]. This database contains Conterminous United States (CONUS) near-surface gridded meteorological and derived hydrological data at daily temporal resolution, spanning 1915-2011, at a spatial resolution of 0.0625°.

IV. METHODOLOGY
We extracted the watershed boundary from the DEM of the Brazos basin using ArcGIS version 10.6, a geographic information system developed by ESRI. We first delineate the basin and extract the desired watershed for a pre-set outlet location. We then investigate the cells within the watershed as candidates contributing to the streamflow at the location of the outlet. Section IV.A itemizes the stages to extract such cells. Once the contributive cells are detected, we retrieve the potential predictors for these cells from the database. These predictors are shown in Table I. In order to accommodate cases leading to an abrupt change in the streamflow values, we also incorporate the streamflow with a temporal lag of one day into the features.

The learning architecture of our choice is the long short-term memory (LSTM) deep learning architecture [34], [35], the state-of-the-art architecture among recurrent neural networks [36]. Although traditional recurrent neural networks contain feedback loops in the recurrent layer, which allow them to maintain information in memory over time, they fail to deal with cases that require learning long-term temporal dependencies. This is because the gradient of the loss function decays exponentially with time, the so-called "vanishing gradient" problem. This flaw, therefore, casts doubt on their applicability to miscellaneous time series prediction tasks, as the interdependence among data points across different time stamps could be of either short or long term. LSTMs, however, hold promise to capture any such temporal inter-dependency.
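As a small illustrative sketch (array and function names are our own, not taken from the paper's code), the feature assembly described above, i.e., per-cell precipitation predictors plus a one-day-lagged streamflow, could look like:

```python
import numpy as np

def build_features(precip, streamflow):
    """Assemble a feature matrix from gridded precipitation and lagged flow.

    precip:     (T, n_cells) antecedent precipitation per contributive cell
    streamflow: (T,) observed streamflow at the outlet
    Returns X of shape (T-1, n_cells+1) and target y of shape (T-1,):
    the features for day t combine the precipitation of day t with the
    streamflow of day t-1, and the target is the streamflow of day t.
    """
    lagged_flow = streamflow[:-1]                    # flow with a one-day lag
    X = np.column_stack([precip[1:], lagged_flow])   # precip(t) + flow(t-1)
    y = streamflow[1:]                               # flow(t) to be predicted
    return X, y

# Tiny synthetic example: 5 days, 3 contributive cells
precip = np.arange(15, dtype=float).reshape(5, 3)
flow = np.array([10.0, 12.0, 11.0, 15.0, 14.0])
X, y = build_features(precip, flow)
```

In practice the matrix X would then be reshaped into the (samples, timesteps, features) layout that sequence models expect.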
Due to their ability to retain information from past input data points, LSTM networks are often considered ideal candidates for both short and long time series prediction problems (short- and long-term streamflow prediction in this work). Ultimately, these predictors are used to train the proposed LSTM network. We extract the contributive cells by applying a threshold on the flow accumulation pattern of the basin. Assuming n cells are selected as the contributive cells, the predictors listed in Table I are retrieved for each of them; the resulting network is depicted in Fig. 3. This network consists of an input layer with 140 neurons and a stacked block of LSTM cells with ten hidden layers, followed by a fully connected layer. Each LSTM cell consists of 300 memory units. The back-propagation optimization to train the weights across the network is performed using the Adam optimizer [37]. A dropout module [38] is employed between each pair of layers to avoid over-fitting [39] and to further improve prediction accuracy. Ultimately, the performance of the network is evaluated over the testing data using the Pearson correlation and Nash-Sutcliffe model efficiency coefficients. We trained the machine using the first 1000 days (~68% of the total) and tested against the remaining 461 days (~32% of the total). The programming language used for this work is the free and open-source Python 3.7 [40]. The deep learning framework was implemented using the well-known TensorFlow [41] and Keras [42] libraries. Calculations were mainly performed via the Scikit-learn [43] and NumPy [44] packages. All analysis was carried out on an Ubuntu-based machine with an Intel Xeon CPU 6136 processor.

Fig. 3. The proposed deep learning framework. This network consists of ten stacked LSTM blocks followed by a fully connected layer. To evade over-fitting, we deploy a dropout module between each pair of layers.
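To make the LSTM gating mechanism concrete, the following is a minimal NumPy sketch of a single LSTM cell step. This is a didactic illustration, not the paper's Keras implementation; the weight layout and names are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One forward step of an LSTM cell.

    x: (n_in,) input at time t; h_prev, c_prev: (n_hid,) previous states.
    W: (4*n_hid, n_in), U: (4*n_hid, n_hid), b: (4*n_hid,) stacked
    parameters for the input (i), forget (f), and output (o) gates and
    the candidate cell update (g), in that order.
    """
    n_hid = h_prev.size
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0*n_hid:1*n_hid])   # input gate
    f = sigmoid(z[1*n_hid:2*n_hid])   # forget gate
    o = sigmoid(z[2*n_hid:3*n_hid])   # output gate
    g = np.tanh(z[3*n_hid:4*n_hid])   # candidate update
    c = f * c_prev + i * g            # additive cell-state update
    h = o * np.tanh(c)                # new hidden state
    return h, c

# Tiny smoke run with random parameters
rng = np.random.default_rng(0)
n_in, n_hid = 5, 4
W = rng.normal(size=(4*n_hid, n_in))
U = rng.normal(size=(4*n_hid, n_hid))
b = np.zeros(4*n_hid)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
```

The additive cell-state update, c = f * c_prev + i * g, is what allows gradients to flow across long horizons, mitigating the vanishing-gradient issue described above.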

A. How to Detect the Contributive Cells?
In this section, we explain the strategy to determine the contributive cells for streamflow prediction at the outlet location. It is worth mentioning that the key distinction of our proposed method is that we use a sub-set of the cells within the watershed contributing to the accumulation of flow at the location of the outlet. We, herein, propose a systematic method to detect these cells and build the deep learning architecture over them. Thus, the following six-step strategy is suggested:
1) As indicated earlier, the process begins by capturing the DEM of the basin using ArcGIS.
2) DEMs often suffer from depressions, or pits, which are areas of a landscape wherein flow ultimately terminates without reaching an ocean or the edge of the DEM. We fill these depressions using the Fill tool in the Hydrology toolbox.
3) Determine the location of the outlet within the watershed.
We are interested in the outlet located at 30.09375° (latitude) and -96.21875° (longitude).
4) Basin delineation is the task of creating a boundary defining the contributing area, or the geographical cells, for a particular outlet within a watershed by analyzing flow directions within the DEM of a basin. Delineation is part of the process known as watershed segmentation, i.e., dividing the watershed into discrete land and channel segments to analyze watershed behavior. Basin delineation is often considered the key point in studying the potential development of water resources in a basin. Interested readers are referred to [45], [46] for a comprehensive note on basin delineation using DEMs. We first delineate the basin around the outlet point to detect the boundary of the potential contributive cells towards an accurate approximation of streamflow. Basin delineation is usually performed using ArcGIS; instead, we accomplished this task via a Python-based toolbox. Fig. 4 shows the delineated watershed for the investigated outlet.
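The flow-direction and flow-accumulation idea underlying the delineation above (and step 5 below) can be sketched in a few lines of Python. This is a toy D8 illustration on a tiny DEM, not the toolbox actually used in this work:

```python
import numpy as np

def d8_flow_accumulation(dem):
    """Toy D8 flow accumulation: each cell drains to its steepest downhill
    8-neighbor; accumulation counts the cells draining through each cell
    (the cell itself included)."""
    rows, cols = dem.shape
    nbrs = [(-1,-1), (-1,0), (-1,1), (0,-1), (0,1), (1,-1), (1,0), (1,1)]
    downstream = {}
    for r in range(rows):
        for c in range(cols):
            best, drop = None, 0.0
            for dr, dc in nbrs:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    d = dem[r, c] - dem[rr, cc]
                    if d > drop:
                        best, drop = (rr, cc), d
            downstream[(r, c)] = best      # None means a pit or an outlet
    acc = np.ones((rows, cols), dtype=int)
    # Process cells from highest to lowest so upstream counts propagate
    order = sorted(downstream, key=lambda rc: dem[rc], reverse=True)
    for cell in order:
        ds = downstream[cell]
        if ds is not None:
            acc[ds] += acc[cell]
    return acc

# 3x3 DEM sloping toward the bottom-right corner (the outlet)
dem = np.array([[9., 8., 7.],
                [8., 6., 5.],
                [7., 5., 1.]])
acc = d8_flow_accumulation(dem)
```

Thresholding an accumulation grid like `acc` is exactly how the river network, and hence the most contributive cells, are extracted in step 5.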

5) For the next step, we perform the flow accumulation technique to gauge the number of upstream cells draining to each cell. With this technique, we are able to extract the river network, and thus find the most contributive cells within the watershed for the streamflow at the outlet location. To this end, we first calculate the flow direction using the steepest descent from each cell, guided by a flow direction map [47]. Fig. 5 illustrates the flow accumulation plot for the outlet.
6) Once the potential cells are detected, we train a many-to-one deep learning architecture to map the hydrological parameters of these cells to the streamflow of the outlet.
This six-step strategy is summarized in Fig. 7.

Fig. 7. The six stages of extracting the potential predicting cells within a basin for streamflow prediction at the location of a pre-set outlet of a watershed. We solely need the gridded digital elevation model of the basin to kick off the procedure.

V. RESULTS
Fig. 8 illustrates the comparison between the streamflow predicted using the proposed learning network and CaMa-Flood versus the observed streamflow. As shown, the proposed learning network has been able to track both the low and high peaks, which is indicative of the robustness of this model in accommodating different hydrological scenarios. CaMa-Flood, likewise, has generally traced the observed signal but has failed at certain timestamps. Specifically, a significant mismatch is observed around the day indexed by 350. This mismatch has, consequently, led to an overall poor performance for this model. One possible reason could be the fact that CaMa-Flood cannot account for reservoir operations. Our proposed network achieved a Pearson correlation coefficient of 0.9542 and a Nash-Sutcliffe model efficiency coefficient of 0.8859. These metrics are defined in Equations (1) and (2) in Section V.A, respectively.
Table II summarizes the comparison between the proposed network and CaMa-Flood in terms of these evaluation metrics.

A. Evaluation Metrics
For further clarification of our analysis, we herein define the evaluation metrics used in this work. The Pearson correlation coefficient between the observed flow Q_o and the model-derived flow Q_m is defined as

r = Σ_t (Q_o^t − Q̄_o)(Q_m^t − Q̄_m) / sqrt( Σ_t (Q_o^t − Q̄_o)² · Σ_t (Q_m^t − Q̄_m)² )    (1)

and the Nash-Sutcliffe model efficiency coefficient between the observed and model-derived flow is defined as

NSE = 1 − Σ_t (Q_o^t − Q_m^t)² / Σ_t (Q_o^t − Q̄_o)²    (2)

where Q_o^t and Q_m^t denote the observed and modeled flow at time t, and Q̄_o and Q̄_m denote their respective means over the evaluation period.
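Both metrics can be computed directly in NumPy; the following is a small self-contained sketch (array names are illustrative):

```python
import numpy as np

def pearson_r(obs, sim):
    """Pearson correlation coefficient between observed and simulated flow."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    od, sd = obs - obs.mean(), sim - sim.mean()
    return float((od @ sd) / np.sqrt((od @ od) * (sd @ sd)))

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit; 0 means the model is
    no better than predicting the mean of the observations."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2))

obs = np.array([10.0, 12.0, 14.0, 13.0, 15.0])
perfect = obs.copy()
```

Note that, unlike r, NSE can be negative, which flags a model performing worse than the observed mean.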

VI. CONCLUSION
Conventional models exploring the physical mechanics of environmental events have long disclosed invaluable information to scientists towards a precise explanation of these events. These models, however, are limited in broader applications due to their complex nature and tedious validation processes. To this end, data-driven techniques have been deployed to replicate multi-scale conventional environmental as well as hydrological models [18], [19], [20], [48]-[51]. These techniques benefit from advanced optimization theories to delve into the data and unravel the intrinsic relationships of the parameters in an efficient manner. Furthermore, with the advent of deep learning (a more sophisticated class of machine learning techniques), scientists are now able to probe the data in an automated manner, without the explicit need for manual feature engineering that conventional machine learning techniques imposed.
In this work, for the very first time to the best of our knowledge, we propose an automated learning pipeline to forecast the streamflow within a basin using the digital elevation topography and the state-of-the-art deep learning framework, i.e., the long short-term memory network. Without loss of generality, we applied our network to the flow observations collected at an outlet located at the bottom of the Brazos basin in Texas and validated it against the USGS gauge station near Hempstead. This method, however, is generic and could easily be harnessed for any basin. This is in fact the key distinction of deep learning models, i.e., both scalability and tunability for miscellaneous scenarios. Additionally, our proposed model outperformed CaMa-Flood, a model commonly used in the hydrologic community for streamflow simulation. The results presented in this work denote the prominent capacity of deep learning for an accurate prediction of discharges and, of course, its potential as an efficient alternative to conventional hydrological models. Beyond this capacity, data-driven models exhibit great potential to capture anthropogenic impacts such as water use and reservoir operation, which are difficult, if not impossible, to reflect in distributed hydrological models.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

AUTHOR CONTRIBUTIONS
HGD contributed significantly towards the data preparation as well as the algorithm development and its implementation; HGD, RS, DS and YW contributed significantly towards drafting the manuscript, verifying the findings of the work and the related interpretation of the results; DB and JS supervised the project and have been significantly involved in revising the results for the intellectual content as well as providing critical feedback; all authors had approved the final version.