Having explored the data sources, this week we focus on the geographic processing of the data and start to explore some preliminary questions in flood risk prediction.

Geo-processing of data: for each of the 1 km × 1 km grid cells created as learning units in the Ziguinchor region, we extract topographic and weather features.

1) Weather: in the NOAA weather data, two weather stations are located in the region we study, so the weather features of each grid cell are calculated as a weighted average of the two stations based on the distance to each. Because the flood images are 14-day composites, the weather data are aggregated to the same time scale. As the figure shows, the total flooded area in this region closely tracks precipitation and dew point across all time points from 2015 to 2017.
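The distance-based weighting can be sketched as follows. This is a minimal illustration, assuming inverse-distance weights and great-circle distances; the exact weighting scheme is not stated above, and the function names, station coordinates, and readings are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def idw_value(cell, stations, power=1.0):
    """Inverse-distance-weighted average of station readings at a cell centroid.

    cell: (lat, lon) of the grid centroid.
    stations: list of (lat, lon, value) tuples.
    power: exponent on distance (assumption: 1.0, i.e. plain 1/d weights).
    """
    weights, values = [], []
    for lat, lon, value in stations:
        d = haversine_km(cell[0], cell[1], lat, lon)
        if d == 0:
            return value  # centroid sits exactly on a station
        weights.append(1.0 / d ** power)
        values.append(value)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical stations: (lat, lon, 14-day precipitation in mm)
stations = [(12.5, -16.0, 10.0), (12.5, -16.4, 20.0)]
midpoint_value = idw_value((12.5, -16.2), stations)
```

A cell halfway between the two stations receives the plain mean of the two readings, and a cell closer to one station leans toward that station's value.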

2) Water area and waterways: both are spatial data, with polygons or lines on the map representing water at a given location. For each grid polygon, we calculate its intersection area with the water area polygons and the distance from the grid centroid to the nearest waterway. These two features capture the geographic relation between our target area and nearby water.
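Both features can be illustrated with plain geometry. The sketch below assumes planar (projected) coordinates in metres and axis-aligned square grid cells; all function names are hypothetical, and in practice a GIS library would normally handle this step.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b in planar coordinates."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def distance_to_nearest_waterway(centroid, waterways):
    """waterways: list of polylines, each a list of (x, y) vertices."""
    return min(
        point_segment_distance(centroid, line[i], line[i + 1])
        for line in waterways
        for i in range(len(line) - 1)
    )

def clip_to_box(polygon, xmin, ymin, xmax, ymax):
    """Sutherland-Hodgman clip of a polygon against an axis-aligned box."""
    def clip(points, inside, intersect):
        out = []
        for i, cur in enumerate(points):
            prev = points[i - 1]
            if inside(cur):
                if not inside(prev):
                    out.append(intersect(prev, cur))
                out.append(cur)
            elif inside(prev):
                out.append(intersect(prev, cur))
        return out

    def x_cut(x):
        return lambda p, q: (x, p[1] + (q[1] - p[1]) * (x - p[0]) / (q[0] - p[0]))

    def y_cut(y):
        return lambda p, q: (p[0] + (q[0] - p[0]) * (y - p[1]) / (q[1] - p[1]), y)

    pts = list(polygon)
    for inside, cut in [
        (lambda p: p[0] >= xmin, x_cut(xmin)),
        (lambda p: p[0] <= xmax, x_cut(xmax)),
        (lambda p: p[1] >= ymin, y_cut(ymin)),
        (lambda p: p[1] <= ymax, y_cut(ymax)),
    ]:
        pts = clip(pts, inside, cut)
        if not pts:
            break
    return pts

def polygon_area(points):
    """Polygon area by the shoelace formula (0 for an empty polygon)."""
    n = len(points)
    s = sum(
        points[i][0] * points[(i + 1) % n][1] - points[(i + 1) % n][0] * points[i][1]
        for i in range(n)
    )
    return abs(s) / 2.0

# Hypothetical 1 km cell with a water polygon overlapping its lower-right part
water = [(500, -500), (1500, -500), (1500, 500), (500, 500)]
water_area = polygon_area(clip_to_box(water, 0, 0, 1000, 1000))
river_dist = distance_to_nearest_waterway((500, 500), [[(0, 0), (1000, 0)]])
```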

3) Elevation and slope: elevation is raster data at 3-arc-second resolution, which is also used to generate the slope values. We apply zonal statistics to the raster file and the grid shapefile to obtain the mean, max, min, and standard deviation of elevation and slope in each grid unit.
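As a rough illustration of zonal statistics, the sketch below assumes the grid has already been rasterized so that every pixel carries the id of the grid cell it falls in. The function name and sample values are hypothetical, and the population standard deviation is an assumption.

```python
import statistics
from collections import defaultdict

def zonal_stats(values, zones, nodata=None):
    """Summarize raster values per zone.

    values, zones: equally shaped 2-D lists; zones holds a grid-cell id
    per pixel (None where the pixel belongs to no cell).
    """
    buckets = defaultdict(list)
    for value_row, zone_row in zip(values, zones):
        for v, z in zip(value_row, zone_row):
            if z is None or v == nodata:
                continue  # skip pixels outside the grid or without data
            buckets[z].append(v)
    return {
        z: {
            "mean": statistics.fmean(vs),
            "max": max(vs),
            "min": min(vs),
            "std": statistics.pstdev(vs),  # assumption: population std
        }
        for z, vs in buckets.items()
    }

# Hypothetical 2 x 4 elevation raster (metres) covering two grid cells
elevation = [[1, 2, 10, 20],
             [3, 4, 30, 40]]
zones = [[0, 0, 1, 1],
         [0, 0, 1, 1]]
stats = zonal_stats(elevation, zones)
```

The same function applies unchanged to a slope raster aligned with the elevation raster.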

4) Land cover: water storage capacity has a strong effect on flood formation, so land cover type could be a highly predictive feature for flood risk. We downloaded a land cover map from the Food and Agriculture Organization of the United Nations and calculate the percentage of each surface type in each grid cell.
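The per-cell percentages amount to counting classes. The following sketch assumes the land cover map has been sampled to one class label per pixel inside a grid cell; the class names are hypothetical placeholders, not the categories of the FAO product.

```python
from collections import Counter

def land_cover_percentages(pixels):
    """Percentage of each land cover class among the pixels of one grid cell."""
    counts = Counter(pixels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cls: 100.0 * n / total for cls, n in counts.items()}

# Hypothetical class labels for the pixels of a single grid cell
cell_pixels = ["cropland", "cropland", "wetland", "forest"]
shares = land_cover_percentages(cell_pixels)
```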

After data wrangling, we explored several questions to build models around. Some have produced decent results that we will keep working on.

Deep Learning Models:

In this method, the dataset is composed of image patches randomly sampled from a color-coded map of Senegal, where each color represents a different feature of the land. We currently have a convolutional neural network that can classify whether a patch of land will be flooded within the next year with 85% accuracy. The dataset was labeled using a simple algorithm that counts occurrences of RGB values within a specified range; if the number of occurrences is above a certain threshold, the image is labeled as ‘flooded’.
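The labeling rule can be sketched in a few lines. The RGB bounds, the threshold, and the function name below are hypothetical; the actual values used for the dataset are not given here.

```python
def label_patch(pixels, lower, upper, threshold):
    """Label a patch by counting pixels whose RGB value lies in a range.

    pixels: iterable of (r, g, b) tuples.
    lower, upper: inclusive per-channel bounds, e.g. (0, 0, 150) and (80, 80, 255).
    threshold: the patch is 'flooded' if more than this many pixels match.
    """
    hits = sum(
        1
        for px in pixels
        if all(lo <= c <= hi for c, lo, hi in zip(px, lower, upper))
    )
    return "flooded" if hits > threshold else "not flooded"

# Hypothetical patch: three bluish pixels and one green pixel
patch = [(10, 20, 200), (30, 40, 220), (0, 0, 255), (20, 180, 30)]
blue_lower, blue_upper = (0, 0, 150), (80, 80, 255)
label = label_patch(patch, blue_lower, blue_upper, threshold=2)
```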

The classification accuracy is used to see how well the current network architecture can encode information, so that an autoencoder model that can generate flood patterns within specific images can be built on top of it. Another model in progress is one that can predict flood risk in a specified area with respect to time. An RNN is being used for this purpose. We hope to represent risk in different areas using a choropleth map.

Machine Learning Models:

- Regression: Can we model the average flooded area of each grid cell over all study days using the static topographic features and averaged weather features?

Random forest model: R-squared=0.806

- Classification: Can we model whether a grid cell is flooded or not over the study dates, where flooded is determined by a variable threshold on the percentage of flooded area in the cell?

Stochastic gradient boosting: Accuracy=0.895 when threshold is 0.25

- Regression: Extending the classification question above, can we model the number of floods over the study dates, where a flood is defined by a threshold?

Random forest model: R-squared=0.734

- Regression: Extending the previous question, can we train on the 2015 data and test on 2016 to evaluate model performance?

Random forest model: R-squared=0.730

- Regression & Classification: Can we predict the amount of flooding per cell per date, or, by setting a threshold on the amount of flooding, whether the cell flooded or not?

Generalized linear model: R-squared=0.155
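To make the targets in the questions above concrete, here is a minimal sketch of how they could be derived from per-cell, per-date flooded fractions, together with a year-based split and the R-squared metric. The record layout, function names, and the choice to apply the threshold to the mean fraction for the flooded label are assumptions made for illustration.

```python
from collections import defaultdict

def build_targets(records, threshold=0.25):
    """records: list of (cell_id, year, flooded_fraction), one per cell and date."""
    by_cell = defaultdict(list)
    for cell_id, _year, frac in records:
        by_cell[cell_id].append(frac)
    targets = {}
    for cell_id, fracs in by_cell.items():
        mean_frac = sum(fracs) / len(fracs)
        targets[cell_id] = {
            "mean_fraction": mean_frac,                        # average-flooding regression
            "flooded": mean_frac > threshold,                  # classification (assumed rule)
            "flood_count": sum(f > threshold for f in fracs),  # number-of-floods regression
        }
    return targets

def split_by_year(records, train_year=2015, test_year=2016):
    """Temporal split: fit on one year, evaluate on the next."""
    train = [r for r in records if r[1] == train_year]
    test = [r for r in records if r[1] == test_year]
    return train, test

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical records for two grid cells
records = [("a", 2015, 0.1), ("a", 2015, 0.5), ("a", 2016, 0.3),
           ("b", 2015, 0.0), ("b", 2016, 0.2)]
targets = build_targets(records, threshold=0.25)
train, test = split_by_year(records)
```

Splitting by year rather than at random keeps the test year unseen during training, which is closer to how a forecast would actually be used.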

Next week, we will keep exploring models for the questions above.