Albany Hub: Week 6

This week we put the final touches on the database. This involved cleaning addresses, cleaning the census data, and pulling in housing data from ATTOM’s API. The ATTOM dataset contains attributes of each property in Albany, such as square footage, number of rooms, flooring style, and the date of the most recent major improvements. We hope to use these fields to identify reference groups across Albany, which will allow us to analyze a difference in means between households that did and did not receive funding. In our context, a reference group consists of households whose properties are similar to those of the households that received project funding.

To begin this analysis, we constructed tables for each of our utility types (gas, electric, water, and sewage) and looked at the number of projects funded, the number of unique addresses with each utility type, and mean consumption by block group. These tables show preliminary findings on potential differences in utility consumption between funded and nonfunded homes. We hope to investigate them further by looking at outliers, normalizing by square footage, and running t-tests between the two groups of houses.
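As a sketch of the planned difference-in-means test, the snippet below computes Welch's t-statistic for two groups of consumption readings. The numbers and group labels are entirely hypothetical; the real test will run on the utility tables described above.

```python
import math
from statistics import mean, variance

# Hypothetical data: consumption readings (e.g., monthly kWh) for funded and
# comparable non-funded households in the same block group.
funded = [410.0, 395.5, 388.2, 402.7]
unfunded = [450.1, 470.3, 465.8, 442.9]

def welch_t(a, b):
    """Welch's t-statistic for a difference in means with unequal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    return (mean(a) - mean(b)) / math.sqrt(va + vb)

t = welch_t(funded, unfunded)
print(f"difference in means = {mean(funded) - mean(unfunded):.1f}, t = {t:.2f}")
```

A negative t here would indicate lower mean consumption among funded homes, which is the kind of effect the reference-group comparison is meant to surface.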

Finally, we geared up for our pair of days in Albany. We met with Amanda Meng, a research scientist working on the open government data aspect of our project. She will bring us to Albany, where we can ask staff clarifying questions about the programs (eligibility requirements, monitoring, direction, and motivation) while she conducts interviews with staff and participants of the housing projects.

We’ll be here soon!

Cheeeeeers from Albany! (in a few days)



FloodBud Week 6

This week, we completed our event detection algorithm. Instead of pursuing a sine curve fitting model, we decided to use the established NOAA predicted tides as the ground truth, fitting those predictions to our sensor data with a horizontal and vertical shift. Using the residuals between the sensor data and the NOAA predictions as our criterion for “interesting-ness”, we were able to confirm that high residual values indeed corresponded to established “interesting” events that Dr. Clark had told us about. To capture both short- and long-term events, we also allow the test window to be 1 hour, 1 day, or 3 days. Finally, test windows with fewer than a certain number of points (e.g., 50) are also flagged.
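A minimal sketch of this idea, with a toy sinusoid standing in for the real NOAA predictions and a grid search over the horizontal and vertical shift (an assumption for illustration; any optimizer would do):

```python
import math

# Toy stand-in for NOAA predicted tides: a 12.42-hour tidal cycle sampled hourly.
def predicted(t_hours):
    return 3.0 * math.sin(2 * math.pi * t_hours / 12.42)

# Hypothetical sensor readings: the same tide, shifted 1 h later and 0.5 ft higher.
times = list(range(0, 72))
sensor = [predicted(t - 1.0) + 0.5 for t in times]

# Grid-search the horizontal (phase) and vertical (datum) shift that minimizes
# the sum of squared residuals between shifted predictions and sensor data.
best = None
for cand_dt in [x / 10 for x in range(-30, 31)]:      # -3 h .. +3 h
    for cand_dy in [y / 10 for y in range(-20, 21)]:  # -2 ft .. +2 ft
        sse = sum((predicted(t - cand_dt) + cand_dy - s) ** 2
                  for t, s in zip(times, sensor))
        if best is None or sse < best[0]:
            best = (sse, cand_dt, cand_dy)

sse, dt, dy = best
print(f"best shift: dt={dt:+.1f} h, dy={dy:+.1f} ft, residual SSE={sse:.2f}")
# A large residual SSE on a new window would flag an "interesting" event.
```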

In the images below, the left plot shows the training set (7 days of data) and the right shows the test set (currently set to 1 day). The residual values are 9.4 and 4128.19, respectively. From this, we can hypothesize that the adjusted NOAA curve fits the training data well, while the spike in the test residual suggests an event in the past day. Indeed, the test plot shows a significant downward spike on that day.

We also made some quality-of-life changes to our visualizations. Earlier, the plot would draw lines across gaps of missing or “fuzzy” data, so we moved from a line plot to a scatter plot; the eye can fill in the pattern anyway. We also created a better sensor selector, so you can now create custom groups of whichever sensors you want to compare.

GwinNETTwork: Week 5

We just passed week 5! Wow, time flies. We have been finishing up the last of the data visualization this week.

This Tuesday, the Civic Data Science teams gave our mid-term presentations to a few of our advisors and other GT faculty members and their graduate students. The presentation centered on our progress over the last 5 weeks.

For the rest of the week, we have been finishing up the Python script for associating data points with the coordinates of the traffic lights, and refining the bubble map into its final form.

Jason has been in charge of the script, constantly improving and refining its logic so that we can eventually associate each point with one of the 730+ intersections. We are mainly interested in data points within a 1,000-foot square of an intersection, so Jason has been writing a script that associates data points with the intersection they are approaching. This has proven tricky because matching one coordinate to another requires some finessing with the speed and bearing fields.
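A rough sketch of the bounding-box step, with hypothetical intersection coordinates and a flat-earth approximation that is reasonable at this scale:

```python
import math

# Hypothetical intersection list: (name, lat, lon).
intersections = [
    ("Peachtree Industrial @ Main", 33.960, -84.100),
    ("Peachtree Industrial @ Oak",  33.970, -84.110),
]

FEET_PER_DEG_LAT = 364000  # rough conversion near metro Atlanta's latitude

def nearest_intersection(lat, lon, half_box_ft=500):
    """Return the intersection whose 1,000 ft square contains the point, if any."""
    for name, ilat, ilon in intersections:
        dy = (lat - ilat) * FEET_PER_DEG_LAT
        dx = (lon - ilon) * FEET_PER_DEG_LAT * math.cos(math.radians(ilat))
        if abs(dx) <= half_box_ft and abs(dy) <= half_box_ft:
            return name
    return None

print(nearest_intersection(33.9601, -84.1001))  # inside the first box
print(nearest_intersection(33.950, -84.100))    # outside every box -> None
```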

In the meantime, we have greatly improved the bubble map since last week. We got the hover option running (displaying coordinates and speed when the cursor hovers over a bubble) and started adding filtering and layering functionality. The level of complexity we hope to reach with the map still seems distant: we need the intersection logic from the Python script running first so that we can start aggregating data on the bubble map and display the visualization we want. We’ve decided that the map should display average speed and average delay at the intersection boundaries.

This week, we started working with a graduate student on our team, Zixiu Fu, on the final objective of this project: connecting emergency vehicles to traffic lights. We also met with another graduate student, Somdut Roy, who works with Dr. Guin, for help with developing our Leaflet map, given his experience. In the coming weeks, we hope to improve our visual map and our scripts, for example by filtering our data points to those near intersections. We also hope to visit the site in Gwinnett County and gather more information from firefighters about the procedures generally followed when approaching an intersection.

FloodBud Week 5

This week we began working on the second phase of our project: anomaly and interesting event detection. Anomaly detection aims to find time periods when sensors are not working properly, whether that means missing data or incorrect readings; it supports sensor maintenance by monitoring the health of the sensor network. Interesting events include environmentally induced events, such as high tide levels rising after rainfall. These kinds of events will be flagged for further investigation into their underlying causes.

Our overall approach is to choose a baseline for ‘normal’ tide patterns for each sensor and to compare new readings to this baseline. We will then calculate the residuals between the baseline and the new readings to determine if there is an interesting event/anomaly occurring.  

A challenge we encountered during anomaly/interesting event detection is finding a reliable baseline tide pattern. We considered using published astronomical tide predictions, but many online resources prohibit using their data anywhere other than on their own sites. We also plotted the predicted tide patterns from the NOAA Fort Pulaski gauge, and although these served as a fairly accurate ground truth for our own Fort Pulaski sensors, using the NOAA gauge as a ground truth for our other sensors would require adjustment calculations to account for differences in inland distance and tide magnitude. The graph above shows the NOAA gauge predictions (in blue) with our own Pulaski sensor data (orange); the two align relatively well. In considering possible baselines, we are tentatively moving toward fitting a sine function that captures the tide patterns, the primary benefit being a fitted function for each sensor. Any deviations from this fitted function would be classified along a range of possible “interesting” events.
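One way to fit such a per-sensor sine function, assuming the dominant (M2) tidal period is known, is to expand A·sin(ωt + φ) + C into a·sin(ωt) + b·cos(ωt) + c and solve the resulting linear least-squares problem. The data below is synthetic:

```python
import numpy as np

# Tidal period (M2 constituent) in hours; the fit below assumes it is known.
PERIOD = 12.42
omega = 2 * np.pi / PERIOD

# Synthetic sensor readings: a 2.5 ft amplitude tide with a 1 ft offset plus noise.
rng = np.random.default_rng(0)
t = np.arange(0, 72, 0.25)
level = 2.5 * np.sin(omega * t + 0.8) + 1.0 + rng.normal(0, 0.05, t.size)

# The model is linear in (a, b, c) after expanding the phase-shifted sine,
# with amplitude A = sqrt(a^2 + b^2).
X = np.column_stack([np.sin(omega * t), np.cos(omega * t), np.ones_like(t)])
a, b, c = np.linalg.lstsq(X, level, rcond=None)[0]

amplitude = np.hypot(a, b)
print(f"fitted amplitude={amplitude:.2f} ft, offset={c:.2f} ft")

# Residuals against the fitted curve; large ones mark candidate "interesting" events.
residuals = level - X @ np.array([a, b, c])
```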

This week we also gave a mid-program presentation to our fellow interns, Chris Le Dantec, and invited guests. We got feedback on our preliminary visualizations and approaches to anomaly/interesting event detection. Moving forward, there is a lot of work left to be done, but we have a clear vision for what we need to do to complete our project in the coming weeks. 

Albany Hub: Week 5

It’s week 5, which means we’re already over halfway through the program! The rate of work has been picking up. This week, we gave our midterm presentation to Dr. Le Dantec and a group of researchers and visualization experts. It was a great opportunity to share our research and gain some valuable feedback. We also got to watch the presentations of GwinNETTwork and FloodBud and learn about their work. 

Last week, we worked on the database and finally incorporated Census data. The data we obtained contains information such as employment rate, median income, and vacancy status for the block groups and Census tracts within Albany, which will be invaluable in helping us evaluate the success of the housing projects. After downloading the data from American FactFinder, we restructured it so that each row corresponds to a unique tract-block group combination, joined the data sets by tract and block group, deleted irrelevant and repeated fields, and then renamed the fields to make analysis easier. We named the resulting table census_blockgroup.
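The join step can be sketched with two hypothetical FactFinder extracts (the column names are illustrative, not the real ones):

```python
import pandas as pd

# Hypothetical extracts from two American FactFinder downloads, each with one
# row per tract-block group combination.
income = pd.DataFrame({
    "tract": ["0101", "0101", "0102"],
    "block_group": ["1", "2", "1"],
    "median_income": [31250, 28400, 45100],
})
vacancy = pd.DataFrame({
    "tract": ["0101", "0101", "0102"],
    "block_group": ["1", "2", "1"],
    "vacant_units": [14, 22, 7],
})

# Join the data sets on the unique tract-block group key; short, consistent
# field names make the later analysis queries easier to write.
census_blockgroup = income.merge(vacancy, on=["tract", "block_group"], how="inner")
print(census_blockgroup)
```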

We integrated all of our tables into a SQL database, which will make it simple to query and retrieve data. This involved picking the column names wisely and ensuring that every column was encoded as the data type that minimizes storage size without truncation. The five tables we have are utilities, housing projects, weather, Census, and addresses. The next step will be to use this for statistical evaluation; hopefully by next week, we will have summary tables to characterize the key variables. 
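As an illustration of the schema idea (table and column names are simplified stand-ins, and SQLite is used here purely for the sketch), two of the five tables might look like:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a throwaway database for illustration

# SQLite's type affinities are loose, but the same principle (smallest faithful
# type) applies to other engines: INTEGER for ids, REAL for readings, TEXT for
# identifiers that aren't arithmetic.
conn.executescript("""
CREATE TABLE addresses (
    address_id  INTEGER PRIMARY KEY,
    street      TEXT NOT NULL,
    tract       TEXT,
    block_group TEXT
);
CREATE TABLE utilities (
    address_id  INTEGER REFERENCES addresses(address_id),
    bill_month  TEXT,     -- 'YYYY-MM'
    utility     TEXT,     -- gas / electric / water / sewage
    consumption REAL
);
""")

conn.execute("INSERT INTO addresses VALUES (1, '200 PINE AVE', '0101', '1')")
conn.execute("INSERT INTO utilities VALUES (1, '2019-05', 'electric', 612.0)")

# A join of the kind the query editor runs in real time.
row = conn.execute("""
    SELECT a.street, u.utility, u.consumption
    FROM utilities u JOIN addresses a USING (address_id)
""").fetchone()
print(row)
```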

Here is a short snapshot of our data; since there are so many columns, it wouldn’t all fit on screen!

From the utilities:

From the census data:

An image of the query editor, which lets us retrieve information from the database in real time:

GwinNETTwork: Week 4

This Tuesday, one representative of each of our Civic Data Science teams traveled with our advisor, Professor Chris Le Dantec, to Macon to attend the press release of the new Georgia Smart Community Challenge winners of 2019.

Angela was able to go and represent our Gwinnett team. At the event, we learned that Milton, Woodstock, Macon, and Columbus are the four new cities awarded the funds and resources provided by Georgia Tech! We also found out that our current project advisor, Dr. Angshuman Guin, will continue as a head researcher in the coming year with the Milton team!

Kutub Gandhi (from the Chatham team), Olivia Fiol (from the Albany team), and Angela Lau (from the Gwinnett team) with the Georgia Smart Community Corps and GT President Peterson in Macon, Georgia!

Back at Georgia Tech, David and Jason continued to make progress on the project. Before Angela left to represent our group at the Smart Community Challenge event, all of us came together to discuss what the next version of the website would look like. We are thinking about grouping sets of nearby points into one big clickable bubble: when the user clicks the bubble, the JavaScript code we designed using Leaflet zooms into that location and displays information about it. Here is a rough draft of the website design we came up with:

We hope to design and implement more advanced features as time goes on. David and Angela continued developing the next version of the website. Jason, meanwhile, was coding and debugging a script to analyze the firetruck data and output meaningful results to the same file. Specifically, he looked at how the fire truck locations behaved within a 1,000-foot box around each traffic light in Gwinnett County. Jason was looking for which points were near an intersection, whether the fire trucks at those locations were approaching or receding from the intersection, and what maneuver the firetruck performed once it reached the intersection: did it turn left, turn right, or continue straight? We hope to determine whether these actions affect the delay the firetruck experiences.
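One way to sketch the approaching-versus-receding check is to compare the truck's reported heading with the bearing from the truck to the intersection; the coordinates and the 90-degree tolerance below are hypothetical:

```python
import math

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2, in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360

def is_approaching(truck_lat, truck_lon, truck_heading, int_lat, int_lon, tol=90):
    """Approaching if the truck's heading points roughly toward the intersection."""
    to_intersection = bearing_deg(truck_lat, truck_lon, int_lat, int_lon)
    diff = abs((truck_heading - to_intersection + 180) % 360 - 180)
    return diff < tol

# Hypothetical fix: a truck due south of an intersection, heading north (0 deg).
print(is_approaching(33.950, -84.100, 0, 33.951, -84.100))    # heading toward it
print(is_approaching(33.950, -84.100, 180, 33.951, -84.100))  # heading away
```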

FloodBud Week 4

After our meeting with Russ last Thursday, we got a better handle on the kinds of visualizations to focus on. With regard to the goals we mentioned last week, we are moving forward with further data analysis to detect anomalies and “interesting events” on the sensors. A variety of things need to be flagged for sensor maintenance (such as a sensor outputting unreliable data on a rainy day, or not outputting any data at all), and some things don’t (a bird flying under the sensor). We began working to parse out events that require Russ’ attention and built a tool he can use to rapidly identify problem sensors. We also intend to begin “interesting event detection”, flagging things like rainfall causing higher water levels in some regions than others.

In terms of visualizations, we completed two complementary plots: one that shows the water level of each sensor, and one that shows the max/min/avg water level of each sensor for a particular day. We wanted these plots to get a more detailed look at the trends over a long period of time, as well as at how the different sensors compare to one another. We will use them in our ongoing data analysis to cross-check whether an anomaly/interesting event has truly occurred.
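The daily max/min/avg summary behind the second plot can be sketched as a simple group-by over calendar days; the readings below are made up:

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical readings: (ISO timestamp, water level in ft) for one sensor.
readings = [
    ("2019-06-01T00:00", 3.1), ("2019-06-01T06:12", 7.4), ("2019-06-01T12:30", 2.8),
    ("2019-06-02T01:00", 3.3), ("2019-06-02T07:05", 8.9), ("2019-06-02T13:15", 2.5),
]

# Group by calendar day, then summarize to the max/min/avg used in the plot.
by_day = defaultdict(list)
for stamp, level in readings:
    by_day[datetime.fromisoformat(stamp).date()].append(level)

for day in sorted(by_day):
    levels = by_day[day]
    print(day, f"max={max(levels)} min={min(levels)} avg={mean(levels):.2f}")
```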

The first plot shows all of the sensors plotted over time with the option to change the time range.

This plot depicts a subset of the sensors. This time the maximum water level for each day is shown from April to June.

By the end of the program, we aim to combine our anomaly/interesting event detection with our visualization capabilities to create a more streamlined way to monitor the Sea Level Sensor network. In this sense, our project has pivoted away from a public facing project to one that can aid people like Dr. Clark in detecting and monitoring events on the coast.

We also met with Dr. Jayma Koval, another collaborator on the Sea Level Sensors project who works more closely with developing curriculum in Chatham county middle schools. She shared with us her insights and feedback on our visualizations, as well as described the curriculum that is being taught.


Albany Hub: Week 4

We’ve faced two big challenges this past week: getting to know Albany spatially and preparing all data for our research-grade database. Because of the nature of these problems, our team has split into our specializations and tackled them in groups of two.

On Thursday and Friday, Olivia and David worked to map Albany’s housing projects in ArcGIS, layered with attributes like median household income and political ward boundaries. At the moment, we only have tract-level data for the indicators we’ve obtained; we even contacted the Census Bureau to request more granular data, but unfortunately they can only offer it at the tract level. With this median household income data, we’ve created a rough draft of what Albany looks like spatially. All shading results from the default settings of ArcGIS. The dots represent different projects, colors represent each project’s federal funding source, and the size of a dot represents the amount of money invested in the project. Again, the dot sizes are the result of default settings in ArcGIS and do not accurately reflect the full picture of Albany. The tract shapes displayed are those either fully or partially within Albany; they do not reflect the boundaries of the city.

We used the same dots of the projects for the next visual but layered with political ward boundaries rather than Census tracts. While the previous map does not contain the boundaries of Albany, this map does.

While Olivia and David developed these maps, Mirabel and Billy worked to clean up all of the data sources for the database: the weather of Albany, information on the housing projects, utility billing, and data from the Census. Most of their work has gone into standardizing addresses, ensuring that streets are recorded as streets rather than drives or avenues, checking that address numbers line up across datasets, and running other checks to verify consistency across datasets. Essentially, we need the addresses to match in all datasets so that when they’re merged, no valuable information is lost. Mirabel also took the time to geocode all addresses in Albany, associating each address with a physical location on a map. As we continue our spatial analysis, this addition will be indispensable.
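A tiny sketch of what the suffix standardization looks like; the mapping here is hypothetical and far from complete (in practice, a dedicated library such as usaddress helps):

```python
# Hypothetical, abbreviated suffix mapping for illustration only.
SUFFIXES = {"STREET": "ST", "ST.": "ST", "DRIVE": "DR", "DR.": "DR",
            "AVENUE": "AVE", "AVE.": "AVE", "ROAD": "RD", "RD.": "RD"}

def standardize(address):
    """Uppercase, collapse whitespace, and normalize street-suffix tokens."""
    tokens = address.upper().split()
    return " ".join(SUFFIXES.get(t, t) for t in tokens)

# The same household recorded three different ways now matches across datasets.
print(standardize("200 Pine Avenue"))  # 200 PINE AVE
print(standardize("200  Pine Ave."))   # 200 PINE AVE
print(standardize("200 pine ave"))     # 200 PINE AVE
```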

Besides cleanup and geocoding, they set up the basic structure of our database. Our internal database will be queried using SQL behind the Georgia Tech firewall. With this database, we will be able to answer our own and our advisor’s research questions through statistical analysis. All of our questions will be centered around the following motivation: to evaluate the effectiveness of Albany’s energy efficiency housing projects. Our focus for this week will be on finalizing this database. We’re expected to present the first draft to Dr. Asensio tomorrow.

Also this past week, Olivia attended the Georgia Smart Communities Challenge press conference in Macon to represent our project. Albany Hub is part of the current lineup of projects for this past year’s winners. At the conference, the newly awarded communities were announced. Congratulations to the winning communities! You can find more information about the challenge and the winning proposals here.

That’s all for now. Talk to you next week!

GwinNETTwork: Week 3

It’s been another fun week for us. The first couple of days we got some insight into the fields of data science and machine learning research at the Machine Learning in Science and Engineering conference, where we got to go to many talks from professors and students at universities around the country. In addition, we were also able to mingle and network with some graduate students, professors, and even industry professionals during poster sessions in the afternoon.

Even while attending the MLSE conference, we were still able to have a productive week. We met with Dr. Guin again, where we went over several things including:

  • With the webpage hosting the Leaflet.js heatmap getting closer to completion, we went over how to use the Windows Remote Desktops to SSH/SCP into the Linux box that will host the heatmap
  • Improved the filtering by removing firetruck locations based on speed: a location is dropped if the firetruck was moving at less than 7 miles per hour for more than 4 minutes. We understand that this threshold might change as new filtering methods are introduced over time.
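The speed filter above can be sketched as follows; the fixes and the exact run-handling are illustrative (the real script may treat gaps between fixes differently):

```python
from datetime import datetime, timedelta

# Hypothetical GPS fixes: (timestamp, speed in mph). The filter drops any run of
# consecutive fixes below 7 mph that lasts longer than 4 minutes (e.g., a truck
# parked at the station), while keeping brief slowdowns at traffic lights.
fixes = [
    (datetime(2019, 6, 20, 9, 0), 35.0),
    (datetime(2019, 6, 20, 9, 1), 5.0),
    (datetime(2019, 6, 20, 9, 2), 4.0),
    (datetime(2019, 6, 20, 9, 8), 3.0),   # slow run now spans 7 minutes
    (datetime(2019, 6, 20, 9, 9), 30.0),
]

def drop_idle_runs(fixes, max_speed=7.0, max_duration=timedelta(minutes=4)):
    kept, run = [], []
    for stamp, speed in fixes:
        if speed < max_speed:
            run.append((stamp, speed))
            continue
        if run and run[-1][0] - run[0][0] <= max_duration:
            kept.extend(run)        # brief slowdown: keep it
        run = []
        kept.append((stamp, speed))
    if run and run[-1][0] - run[0][0] <= max_duration:
        kept.extend(run)
    return kept

print(len(drop_idle_runs(fixes)))  # the 7-minute idle run is dropped
```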

This is a sample of the heat map we have completed:

As a result of the filtering, this is the image we were able to come up with in QGIS 3:

In the above map, the blue and red dots represent firetruck locations before the filter was applied; only the blue dots remain after the filtering script runs.

We are nearing completion of producing initial visualizations of one fire station’s set of data. Next week, we will start rolling this process out to the other 15+ sensors among 6 fire stations along the Peachtree Industrial Boulevard corridor and upload this heat map onto the website! Afterwards, we will hopefully start collaborating with other researchers on this project and connect the fire truck data with the traffic light/sensor data in order to eventually optimize routes for emergency vehicles in the future.

A challenge that we continue to face is QGIS being slow, unreliable, or crashing at random times, causing unnecessary frustration. As far as the data points themselves go, we are still working on edge cases for filtering out the irrelevant points. Additionally, we are also wondering if and how it might be possible to improve upon the current Leaflet.js heat map that we have.

FloodBud Week 3

A primary challenge we encountered last week was getting familiar with the project space, especially since multiple groups are working in it, but we have also gained much more clarity on the direction and goals of our project. After speaking with a few other people on the project, we were able to position ourselves and define our scope. Instead of creating a public-facing product, we’ve pivoted to creating visualizations, and perhaps a predictive model, that would (1) assess the robustness of the data by comparing it with astronomical tide predictions, and (2) visualize and analyze summary statistics for each sensor (max, min, average water level) over a longer time frame, both to assess the long-term changes of these sensors and to predict whether an event is truly unusual/interesting and requires further human assessment.

Now that we’ve got a better (but still not excellent) handle on D3, we’ve started finalizing some basic visualizations and making mock-ups of more complicated plots. One of our big successes has been the “Hurricane visualization” (created using D3 and Leaflet). We took data from temporary water level sensors that USGS deployed to monitor flooding during Hurricanes Irma and Matthew; these sensors reported the maximum height the water level reached during the storms. There is also a permanent sea level sensor at Fort Pulaski that gauges water levels for the entire Georgia coast. We created a visualization comparing the water levels at Fort Pulaski to the observed water levels at the various sensor locations. This plot aims to communicate the need for a permanent sensor network by pointing out that water levels vary from location to location.

Using D3, we’ve also started creating basic plots that visualize the sensor data over time in different ways (linear and radial). These will serve as the base for the exploratory plots mentioned earlier (in goal 2), but we’ve encountered some challenges in parsing the CSV into a useful format, as well as with some of the CSV data being in non-chronological order. This is where we’ll continue our efforts next week, hopefully finishing a multi-line plot that summarizes max/min/average water levels for all ~30 sensors over the course of a few months.
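The out-of-order rows are straightforward to fix before plotting by sorting on the parsed timestamp; the tiny CSV below is made up:

```python
import csv
import io
from datetime import datetime

# Hypothetical sensor export with rows out of chronological order.
raw = """timestamp,level_ft
2019-06-02T00:00,3.1
2019-06-01T12:00,2.8
2019-06-01T00:00,3.4
"""

# Parse, then sort by the actual timestamp rather than the string.
rows = list(csv.DictReader(io.StringIO(raw)))
rows.sort(key=lambda r: datetime.fromisoformat(r["timestamp"]))
print([r["timestamp"] for r in rows])
```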

Finally, we got the opportunity to visit the 2019 MLSE conference, hosted by Georgia Tech (Columbia next year). This conference celebrates the interdisciplinary aspect of machine learning; data science isn’t just something for computer scientists to fawn over, but a tool for revolutionizing all fields. There were talks by scientists from STEM fields such as materials science and biomedical engineering as well as data scientists who focused on public policy and social good. In addition, the conference kicked off with a Women in Data Science Day, featuring talks and workshops focused on sharing experiences.