This week, we completed our event detection algorithm. Instead of pursuing a sine-curve-fitting model, we decided to use the established NOAA predicted tides as our ground truth. We fitted those NOAA predictions to our sensor data by allowing a horizontal and vertical shift, and used the residuals between the sensor data and the shifted NOAA predictions as our criterion for judging “interesting-ness”. With this, we were able to confirm that high residual values indeed corresponded to established “interesting” events that Dr. Clark had told us about. To capture both short- and long-term events, we also allowed the test window to be 1 hour, 1 day, or 3 days. Finally, test windows with fewer than a certain number of points (e.g., 50) were also flagged.
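The fitting step can be sketched roughly as follows. This is a minimal illustration, not our production code: a synthetic semidiurnal sine stands in for the real NOAA series and sensor readings, and a generic optimizer finds the horizontal and vertical shift that minimizes the squared residuals.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic stand-ins for the NOAA predicted tide and our sensor readings
# (a ~12.42 h semidiurnal sine); the real series would be loaded here.
t = np.arange(0, 7 * 24, 0.1)                        # one week, in hours
noaa = 1.2 * np.sin(2 * np.pi * t / 12.42)
sensor = 1.2 * np.sin(2 * np.pi * (t - 0.5) / 12.42) + 0.3

def sse(params):
    """Sum of squared residuals after shifting the NOAA curve."""
    dt, dy = params
    shifted = np.interp(t, t + dt, noaa) + dy        # horizontal + vertical shift
    return np.sum((sensor - shifted) ** 2)

fit = minimize(sse, x0=[0.0, 0.0], method="Nelder-Mead")
dt, dy = fit.x
residuals = sensor - (np.interp(t, t + dt, noaa) + dy)
score = np.sum(residuals ** 2)   # a high score on a test window flags it as "interesting"
```

The same `score` computed on a held-out test window is what gets compared against the training residual to decide whether something unusual happened.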
In the images below, the left plot shows the training set (7 days of data) and the right shows the test set (currently set to 1 day). The residual values are 9.4 and 4128.19, respectively. From this, we can hypothesize that the shifted NOAA curve fit the training data relatively well, while the spike in the test residual suggests an event in the past day. Indeed, the test plot shows a significant downward spike on that day.
We also made some quality-of-life changes to our visualizations. Earlier, the plots would draw lines across gaps of missing or “fuzzy” data. We moved from a line plot to a scatter plot in response; the eye can fill in the pattern anyway. We also created a better selector for the sensors, so now you can create custom groups of whatever sensors you want to compare to each other.
This week we began working on the second phase of our project: anomaly and interesting event detection. Anomaly detection aims to detect time periods when sensors are not working properly, whether that means missing data or incorrect readings. Anomalies generally pertain to sensor maintenance, since flagging them helps monitor the health of the sensor network. Interesting events include environmentally induced events, such as rising high-tide levels due to rainfall. These kinds of events will be flagged for further investigation into their underlying causes.
Our overall approach is to choose a baseline for ‘normal’ tide patterns for each sensor and to compare new readings to this baseline. We will then calculate the residuals between the baseline and the new readings to determine if there is an interesting event/anomaly occurring.
A challenge we encountered during anomaly/interesting event detection is finding a reliable baseline tide pattern. We considered using published astronomical tide predictions, but many online resources prohibit use of their data outside their own sites. We also plotted the predicted tide patterns from the NOAA Fort Pulaski gauge. Although this served as a fairly accurate ground truth for our own Fort Pulaski sensors, using the NOAA gauge as a ground truth for our other sensors would require adjustment calculations to account for differences in inland distance and tide magnitude. The above graph shows the NOAA gauge predictions (in blue) with our own Pulaski sensor data (in orange), and we can see that they align relatively well. In weighing the possibilities for a reliable baseline, we are tentatively moving toward fitting a sine function that captures the tide patterns; the primary benefit of this method is that we would have a fitted sine function for each sensor. Any deviations from this fitted function would be classified along a range of possible “interesting” events.
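A minimal sketch of the sine-fitting idea, using scipy’s `curve_fit` on synthetic readings. The fixed semidiurnal period, the noise level, and the 3-sigma deviation threshold are illustrative assumptions, not settled choices:

```python
import numpy as np
from scipy.optimize import curve_fit

# Candidate per-sensor baseline: a sine at the dominant semidiurnal tidal
# period (~12.42 h). Synthetic noisy readings stand in for a real sensor.
def tide(t, amp, phase, offset):
    return amp * np.sin(2 * np.pi * t / 12.42 + phase) + offset

t = np.arange(0, 7 * 24, 0.1)                      # one week, in hours
rng = np.random.default_rng(0)
readings = tide(t, 1.1, 0.4, 2.0) + rng.normal(0, 0.05, t.size)

# Fit amplitude, phase, and vertical offset for this sensor.
params, _ = curve_fit(tide, t, readings, p0=[1.0, 0.0, 0.0])
residual = readings - tide(t, *params)
flagged = np.abs(residual) > 3 * residual.std()    # candidate deviations
```

Each sensor would get its own `params`, and the size of `residual` over a window would place deviations on the “interesting” spectrum.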
This week we also gave a mid-program presentation to our fellow interns, Chris Le Dantec, and invited guests. We got feedback on our preliminary visualizations and approaches to anomaly/interesting event detection. Moving forward, there is a lot of work left to be done, but we have a clear vision for what we need to do to complete our project in the coming weeks.
After our meeting with Russ last Thursday, we got a better handle on the kinds of visualizations to focus on. With regard to the goals we mentioned last week, we are moving forward with further data analysis to detect anomalies and “interesting events” on the sensors. There are a variety of things that need to be flagged for sensor maintenance (such as a sensor outputting unreliable data on a day that it rains, or a sensor not outputting any data at all) and things that don’t (a bird flying under the sensor). We began working to parse out events that require Russ’ attention and built a tool he could use to rapidly identify problem sensors. We also intend to begin “interesting event detection”: flagging things like rainfall causing higher water levels in some regions than in others.
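One of the simplest maintenance checks (“sensor not outputting any data”) can be sketched like this. The function name, column schema, and six-hour window are our own illustrative assumptions, not the actual tool:

```python
import pandas as pd

# Hypothetical maintenance check: flag any sensor that has reported
# nothing within the last few hours.
def flag_silent_sensors(df, now, window_hours=6):
    """df is long-form readings with 'sensor' and 'timestamp' columns."""
    cutoff = now - pd.Timedelta(hours=window_hours)
    recently_active = set(df.loc[df["timestamp"] >= cutoff, "sensor"])
    return sorted(set(df["sensor"]) - recently_active)

# Tiny example: sensor "b" has been silent since midnight.
df = pd.DataFrame({
    "sensor": ["a", "a", "b"],
    "timestamp": pd.to_datetime(
        ["2019-06-01 00:00", "2019-06-01 12:00", "2019-06-01 00:00"]),
})
silent = flag_silent_sensors(df, now=pd.Timestamp("2019-06-01 13:00"))
```

Checks like this cover the “no data” case; the “unreliable data on a rainy day” case needs the residual-based analysis instead.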
In terms of visualizations, we completed two complementary plots: one that shows the water level of each sensor over time, and one that shows the max/min/average water level of each sensor for a particular day. We wanted these plots to get a more detailed look at trends over a long period of time, as well as at how the different sensors compare to one another. We will use them in our ongoing data analysis work to cross-check whether an anomaly/interesting event has truly occurred.
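The daily max/min/average summary feeding the second plot can be computed in a few lines of pandas. The sensor names, column schema, and 6-minute sampling interval below are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical long-form readings for two sensors at 6-minute intervals
# (sensor names and the column schema are stand-ins).
idx = pd.date_range("2019-04-01", "2019-06-30 23:54", freq="6min")
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "sensor": np.repeat(["fort_pulaski", "bull_river"], idx.size),
    "timestamp": np.tile(idx, 2),
    "water_level_m": rng.normal(1.0, 0.3, 2 * idx.size),
})

# Daily max/min/mean per sensor -- the summary series behind the second plot.
daily = (df.set_index("timestamp")
           .groupby("sensor")["water_level_m"]
           .resample("D")
           .agg(["max", "min", "mean"]))
```

The resulting frame is indexed by (sensor, day), which maps directly onto a multi-line plot of one summary statistic per sensor.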
The first plot shows all of the sensors plotted over time with the option to change the time range.
This plot depicts a subset of the sensors. This time the maximum water level for each day is shown from April to June.
By the end of the program, we aim to combine our anomaly/interesting event detection with our visualization capabilities to create a more streamlined way to monitor the Sea Level Sensor network. In this sense, our project has pivoted away from a public facing project to one that can aid people like Dr. Clark in detecting and monitoring events on the coast.
We also met with Dr. Jayma Koval, another collaborator on the Sea Level Sensors project who works more closely with developing curriculum in Chatham county middle schools. She shared with us her insights and feedback on our visualizations, as well as described the curriculum that is being taught.
Although a primary challenge last week was getting familiar with the project space, especially considering that there are multiple groups working in it, we’ve also gained much more clarity on the direction and goals of our project. We were able to position ourselves and define our scope after speaking with a few other people from the project. Instead of creating a public-facing product, we’ve pivoted to creating visualizations and perhaps a predictive model that would (1) assess the robustness of the data by comparing it with astronomical tide predictions, and (2) visualize and analyze summary statistics for each sensor (max, min, average water level) over a longer time frame, both to (a) assess the long-term changes in these sensors and (b) predict whether an event is indeed unusual/interesting and requires further human assessment.
Now that we’ve got a better (but still not excellent) handle on D3, we’ve started finalizing some basic visualizations and making mock-ups of more complicated plots. One of our big successes has been the “Hurricane visualization” (created using D3 and Leaflet). We took data from temporary water level sensors that USGS deployed to monitor flooding during Hurricanes Matthew and Irma. These sensors reported the maximum height that the water level reached during the storms. There is also a permanent sea level sensor at Fort Pulaski that gauges water levels for the entire Georgia coast. We created a visualization that compares the water levels at Fort Pulaski to the observed water levels at the various sensor locations. This plot aims to communicate the need for a permanent sensor network by showing that water levels vary from location to location.
Using D3, we’ve also gotten started on basic plots that visualize the sensor data over time in different ways (linear and radial). These will serve as the base for the exploratory plots mentioned earlier (in goal 2), but we’ve encountered some challenges in parsing the CSV into a useful format, as well as with some of the CSV data being out of chronological order. This is where we’ll be continuing our efforts in the next week, hopefully finishing a multi-line plot that summarizes max/min/average water levels for all ~30 sensors over the course of a few months.
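The out-of-order timestamps are easy to repair in a preprocessing pass before the data reaches the plotting code; here is a sketch in Python (the column names `timestamp` and `water_level_m` are assumptions, not the real export schema):

```python
import io
import pandas as pd

# Parse timestamps and restore chronological order before plotting.
def load_readings(path_or_buffer):
    df = pd.read_csv(path_or_buffer, parse_dates=["timestamp"])
    return df.sort_values("timestamp").reset_index(drop=True)

# Inline example standing in for one of the sensor CSV exports,
# with its rows deliberately out of order.
raw = io.StringIO(
    "timestamp,water_level_m\n"
    "2019-06-02 00:00,1.2\n"
    "2019-06-01 00:00,0.9\n"
)
readings = load_readings(raw)
```

Sorting once at load time keeps the D3 side simple, since line generators assume points arrive in x-order.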
Finally, we got the opportunity to visit the 2019 MLSE conference, hosted by Georgia Tech (Columbia next year). This conference celebrates the interdisciplinary aspect of machine learning; data science isn’t just something for computer scientists to fawn over, but a tool for revolutionizing all fields. There were talks by scientists from STEM fields such as materials science and biomedical engineering as well as data scientists who focused on public policy and social good. In addition, the conference kicked off with a Women in Data Science Day, featuring talks and workshops focused on sharing experiences.