UN Data for Climate Action – Predicting and Alleviating Road Flooding in Senegal

After ten busy weeks, we have completed our work for the UN Data for Climate Action Challenge. Here is a wrap-up of the research questions we focused on and the solutions we developed.

The background is that climate change has the potential to raise flood risk for coastal countries such as Senegal. Given the large proportion of unpaved roads in Senegal, flooding could damage the road network and reduce residents’ access to services. Because African countries face a funding deficit for infrastructure development, it is critical to identify which roads should be prioritized in preparing for the possible damage brought by climate change. We propose two steps to identify these roads.

First, we need to evaluate the flood risk, under climate change, of the areas that roads pass through. To do this, we build a flood risk model based on topographic features and historical weather data for the study area.

The second step is to analyze the contribution of each road segment to regional connectivity. Roads that are critical to accessibility and exposed to flood risk should be prioritized for weatherproofing.

Applying optimization techniques, we can then determine explicit plans for allocating road maintenance funds. Multiple sustainable development objectives can be explored within this framework, such as maximizing rural connectivity or minimizing the expected number of people isolated due to flooding. This approach has the potential to minimize the long-term cost of establishing a reliable road network while helping to buffer vulnerable populations from extreme weather events.

Flood Risk Prediction Model

For flood risk prediction, we collected data from multiple sources: flooding maps of Senegal from NASA, daily weather data from NOAA, land cover data from the Food and Agriculture Organization of the UN, and different types of maps from OpenStreetMap. With this rich information on topography, hydrology, and weather, we are able to build machine learning models to evaluate flood risk at a 1 km × 1 km analysis unit. The framework below shows the features we use, the targets, the algorithms used to build the models, and the evaluation methods.

A critical step is joining the target flooding area with all of the features so that they are at the same spatial scale. For raster files, we mainly use zonal statistics to get values for each grid cell. For land cover and water-area data, we calculate the intersection area of feature polygons with each grid cell. For daily weather, we use a weighted average, where the weights are determined by the distance from the grid cell to two weather stations.
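To make the weather step concrete, here is a minimal sketch of the distance-weighted average, assuming planar coordinates and two stations with hypothetical readings (the real pipeline works on projected GIS coordinates):

```python
import math

def idw_two_stations(cell_xy, station_a, station_b):
    """Distance-weighted average of two stations' readings for one
    grid cell. Each station is ((x, y), reading); all values here
    are hypothetical and use planar coordinates."""
    (xa, ya), va = station_a
    (xb, yb), vb = station_b
    x, y = cell_xy
    # inverse-distance weights (assumes the cell is not exactly at a station)
    wa = 1.0 / math.hypot(x - xa, y - ya)
    wb = 1.0 / math.hypot(x - xb, y - yb)
    return (wa * va + wb * vb) / (wa + wb)

# a cell halfway between the stations gets the plain average
print(idw_two_stations((0.5, 0.0), ((0.0, 0.0), 10.0), ((1.0, 0.0), 20.0)))  # 15.0
```

A cell closer to one station is pulled toward that station’s reading, which is the behavior we want when only two stations cover the study area.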

First, we train regression models, using the proportion of flooding in each grid cell during each biweekly time period as the target. We chose three machine-learning models to train on the data: Support Vector Machines (SVM), Random Forests (RF), and XGBoost. The best RF model achieves promising performance, with an R-squared (how closely the data fit the regression line) of about 0.7056 on the test set and a root mean square error (RMSE) of about 0.1041. The top 10 most important features show that dynamic historical weather features, especially historical temperature and precipitation, drive changes in the flooded area.
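As an illustration of this modeling step, here is a hedged sketch using scikit-learn on synthetic data; the features, coefficients, and sample sizes are invented stand-ins for the real topographic and weather features, not the actual challenge data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# synthetic stand-ins for the real features (topography, weather lags, ...)
X = rng.random((500, 6))
# flooded proportion in [0, 1], driven mostly by two "weather" columns
y = np.clip(0.6 * X[:, 0] + 0.3 * X[:, 1] + 0.05 * rng.standard_normal(500), 0, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = rf.predict(X_te)
print("R^2:", round(r2_score(y_te, pred), 3))
print("RMSE:", round(mean_squared_error(y_te, pred) ** 0.5, 3))
# feature_importances_ ranks which inputs drive the predictions,
# mirroring the "top 10 features" analysis described above
print("top feature index:", int(np.argmax(rf.feature_importances_)))
```

On real data the same `feature_importances_` attribute is what surfaces temperature and precipitation as the dominant drivers.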

However, the regression results do not reflect how adversely a road passing through an area may be affected by flooding. This is challenging to quantify, as the change in a grid cell’s flooded area is not directly related to the probability of a road becoming flooded. Therefore, we set a threshold to determine whether a grid cell is flooded during a particular biweekly period, turning the task into a classification problem. Each sample is labeled as flooded or not based on the percentage of flooded area in the grid cell. To be conservative, the threshold is set at 0.5: if at least 50% of a grid cell’s area is flooded during a biweekly period, the sample is labeled as flooded, and vice versa. The table shows the model evaluation and performance on the test dataset.
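The thresholding step itself is simple; a sketch with the 0.5 threshold from the text, applied to hypothetical flooded proportions:

```python
import numpy as np

FLOOD_THRESHOLD = 0.5  # a cell counts as "flooded" if >= 50% of its area floods

def label_cells(flooded_proportion):
    """Turn the regression target (proportion of a grid cell flooded
    per biweekly period) into binary classification labels."""
    p = np.asarray(flooded_proportion, dtype=float)
    return (p >= FLOOD_THRESHOLD).astype(int)

print(label_cells([0.05, 0.49, 0.5, 0.92]).tolist())  # [0, 0, 1, 1]
```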

A visualization of the historical flood risk map and the predicted map shows that we can precisely capture areas with high flood risk such as #1, #2, and #3. Meanwhile, for some areas with historically low flood risk (#4), our model can overestimate the risk. Such areas may not have flooded frequently in the past but, according to our model, are probably at risk of flooding in the future. The predictions thus offer suggestive information for future preparation.

Road Network Optimization

We use telecommunication data from Orange to estimate traffic flow on road segments. We began by generating Voronoi regions for the cellular network towers, computed via the Delaunay triangulation of the tower locations, and assigning road intersections to each Voronoi region. We then assigned population flow to the edges by checking whether a user was in transition: a user is in transition if the tower handling their cell phone use changed from one time stamp to the next. For a user in transition, we calculate the shortest path between two randomly chosen roads corresponding to the origin and destination regions. After the path is calculated, we increment the population count of each edge on the path by one for the date of the destination’s time stamp.
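A simplified sketch of the flow-assignment step, using networkx on a toy road graph (the node IDs, regions, and transition example are hypothetical; the real pipeline uses Voronoi regions built from the Orange tower locations):

```python
import networkx as nx

# toy road graph: intersections 0-1-2-3 along a line; region_of maps each
# intersection to its (hypothetical) cell-tower Voronoi region
G = nx.path_graph(4)
nx.set_edge_attributes(G, 0, "flow")
region_of = {0: "A", 1: "A", 2: "B", 3: "B"}

def assign_flow(G, origin, dest):
    """Increment the flow counter on every edge of the shortest path
    between an origin and a destination intersection."""
    path = nx.shortest_path(G, origin, dest)
    for u, v in zip(path, path[1:]):
        G.edges[u, v]["flow"] += 1

# a user whose serving tower changed from region A to region B is
# "in transition"; route one trip from intersection 0 to intersection 2
assign_flow(G, 0, 2)
print({e: G.edges[e]["flow"] for e in G.edges})
```

Summing these counters over all transitions and days yields the per-edge traffic estimate used downstream.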

The second task was to determine which edges in our graph were most at risk of being flooded. Using the 14-day composite flood maps from NASA, we calculate the amount of flooding on a road during a particular time period: the sum of the flooded areas along the road segment, divided by the length of the entire segment. The assumption is that if a road segment is frequently flooded, or a large proportion of it is flooded, then the segment has a higher risk of failing. We therefore define the flood risk of a road as the sum of its flooded proportions over all time periods.

The third task was to determine the overall importance of each road segment, so that repairs or preemptive fortifications can be based on the value of the road. We define road importance as the impact a segment’s removal would have on accessibility for the surrounding regions. This is computed by finding the distance traveled by all inhabitants on two separate paths and taking the difference. The first path is the original, intact path; the second is the alternate route taken if one of the roads in the original path is damaged. The bigger the difference, the worse the new route, and thus the greater the impact the flooding of the chosen road has on accessibility. We calculated the importance of the top 20 riskiest roads.
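The detour-based importance measure can be sketched as follows, again with networkx; the graph, edge lengths, and origin-destination pair are hypothetical:

```python
import networkx as nx

def road_importance(G, edge, origin, dest, weight="length"):
    """Importance of a road segment = extra distance travellers must
    cover between origin and dest if that segment is flooded out."""
    base = nx.shortest_path_length(G, origin, dest, weight=weight)
    H = G.copy()
    H.remove_edge(*edge)
    try:
        detour = nx.shortest_path_length(H, origin, dest, weight=weight)
    except nx.NetworkXNoPath:
        return float("inf")  # losing this road disconnects the pair entirely
    return detour - base

G = nx.Graph()
G.add_edge("a", "b", length=1.0)   # direct road
G.add_edge("a", "c", length=2.0)   # detour ...
G.add_edge("c", "b", length=2.0)   # ... via c
print(road_importance(G, ("a", "b"), "a", "b"))  # 4.0 - 1.0 = 3.0
```

Summing this quantity over all affected origin-destination pairs, weighted by the estimated traffic, gives a population-level accessibility impact for each segment.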

In conclusion, we address the road optimization problem by building a flood risk model, evaluating road traffic based on mobility behaviors extracted from cell phone records, and combining the two to assess road importance. We hope these models can help decision makers craft more efficient climate-mitigation strategies for transportation.

We thank our mentor Bistra Dilkina, as well as Caleb Robinson and Amrita Gupta, for their useful advice.

Finishing the Course

After 10 busy weeks, our project is complete! We’re proud to have worked with the Food Bank to better understand this important policy issue.
Let’s recap what we’ve accomplished and why.

The Atlanta Community Food Bank (ACFB) aspires to eliminate hunger in its service area by 2025, and to help achieve this goal, the food bank is raising awareness about SNAP among its clients and donors. SNAP is a federal program that helps low-income families purchase food. The food bank asked us to gauge public opinion on SNAP and to determine what kinds of arguments were being made for and against it. They were also interested in learning about Georgia politicians’ positions on SNAP. To analyze public opinion, we examined Twitter data and news articles. To track politicians’ positions, we created a tool that lets the Food Bank see local politicians’ voting records on bills related to food insecurity. We analyzed the sentiment of the tweets and news articles and visualized the results to show how sentiment changed in response to current events. We also used the data to create a map showing how sentiment varies across different news outlets. We displayed our results in an R Shiny web app that is accessible to the food bank and the public.


Sentiment Analysis
Sentiment analysis is a form of text analysis that determines the subjectivity, polarity (positive or negative), and polarity strength (weakly positive, mildly positive, strongly positive, etc.) of a text. In other words, sentiment analysis tries to gauge the tone of the writer. To conduct our sentiment analysis, we scraped news articles and tweets that contained key words such as “SNAP”, “food stamps”, and “EBT”.

The Vader and AFINN packages in Python were used to conduct unsupervised sentiment analysis. Vader (short for Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool. AFINN is a dictionary of words that rates connotation severity from -5 to 5; a sentence’s sentiment score is the sum of its word scores. The Vader tool gauges overall syntactic sentiment more than word usage; conversely, AFINN gauges the types of words being used and their intensity. Additionally, sentences containing key words (words relating to SNAP) were given a higher weight so that sentiment towards this issue would be amplified. Each article was tokenized to the sentence level, and each sentence was given a sentiment score by the two tools. The sentence scores were then aggregated for each article using the weight assigned to each sentence; this aggregated score represents the sentiment of the article. To take the impact of each article into account, article scores were further aggregated with respect to the traffic level of the website and the reading level of the article. This process is visualized below.
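To illustrate the scoring and weighting scheme, here is a minimal AFINN-style sketch with a toy lexicon; the real AFINN list covers roughly 3,300 words, and the actual key-word weights used in the project are not reproduced here:

```python
# toy AFINN-style lexicon; the real AFINN list scores words from -5 to +5
TOY_AFINN = {"helps": 2, "great": 3, "cuts": -1, "hurt": -2}
KEY_WORDS = {"snap", "ebt", "stamps"}  # SNAP-related terms get extra weight

def sentence_score(sentence, key_weight=2.0):
    """AFINN-style score for one sentence: the sum of its word scores,
    up-weighted when the sentence mentions a key word."""
    words = sentence.lower().split()
    raw = sum(TOY_AFINN.get(w, 0) for w in words)
    weight = key_weight if KEY_WORDS.intersection(words) else 1.0
    return raw * weight, weight

def article_score(sentences):
    """Weighted average of sentence scores -> one score per article."""
    scored = [sentence_score(s) for s in sentences]
    return sum(s for s, _ in scored) / sum(w for _, w in scored)

print(article_score(["SNAP helps families", "budget cuts hurt"]))
# (2*2 + (-3)*1) / (2 + 1) ≈ 0.33
```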

Sentiment Analysis Process

Additionally, information on the arguments and topics in these articles would be very useful to the ACFB. To extract it, preliminary topic modeling (Latent Dirichlet Allocation) was performed to pull topical words from the text; it returns a set of words with a probabilistic weight on each word indicating its importance. Bigram collocation was used to detect the most frequent and meaningful two-word phrases. Term frequency-inverse document frequency (TFIDF) was used to detect important words across all the documents. Named Entity Recognition (NER), via tools from the Stanford Natural Language Processing Group and gensim, was used to detect key people and locations mentioned in the articles. After generating all of these statistics, each word from the TFIDF, bigram collocation, and NER outputs was multiplied by the weight computed for its document, and all the words were aggregated into a single list. From this list, a word cloud can be generated to visualize meaningful words. Word clouds are of particular interest to our partners at the food bank. Along with the word cloud, aggregating it by date helps the viewer understand the subject of the sentiment and better decipher public opinion about SNAP.

Sentiment Visualization


Spatial Analysis
The AFINN and Vader scores were linked to the geocoded news outlets. Using ArcMap 10.4, spatial analysis was conducted on the outlets to determine whether articles with positive or negative sentiment about SNAP cluster geographically. To do this, a hexagon grid was created over the extent of a U.S. shapefile, and a spatial join attached the number of news outlets to the hexagon polygons. After the spatial join, hot spot analysis was done by calculating the Getis-Ord Gi* statistic, which identifies clusters of hot and cold spots by looking at each feature’s location in relation to its neighbors.
The outputs of the Getis-Ord Gi* statistic are z-scores: areas with statistically significant high z-scores are hot spots, while areas with statistically significant low z-scores are cold spots. Significance is determined by comparing the local sum of a feature and its neighbors to the sum over all features; if the difference between the observed local sum and the expected local sum is too large to be due to chance, the z-score is statistically significant. In the context of this research, hot spots are areas in which the articles have a positive sentiment on SNAP and cold spots are areas in which the articles have a negative sentiment on SNAP.
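The Gi* computation can be sketched directly from its standard formula; below is a NumPy version on a toy row of five hexagons with hypothetical sentiment values (the production analysis used ArcMap’s implementation):

```python
import numpy as np

def getis_ord_gi_star(x, W):
    """Getis-Ord Gi* z-scores. x: one value per polygon; W: binary
    spatial-weights matrix in which each polygon neighbours itself
    (the '*' in Gi*)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    s = np.sqrt((x ** 2).mean() - xbar ** 2)
    wx = W @ x                       # local weighted sums
    wsum = W.sum(axis=1)
    w2sum = (W ** 2).sum(axis=1)
    denom = s * np.sqrt((n * w2sum - wsum ** 2) / (n - 1))
    return (wx - xbar * wsum) / denom

# five hexagons in a row, hypothetical sentiment values; neighbours are
# the adjacent cells plus the cell itself
x = [0.0, 0.1, 2.0, 2.2, 0.2]
W = np.eye(5)
for i in range(4):
    W[i, i + 1] = W[i + 1, i] = 1.0
z = getis_ord_gi_star(x, W)
print(np.round(z, 2))  # the high-valued middle cells come out as hot spots
```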


SNAP InfoMap


Politician Tracking Tool
The voting records of Georgia state representatives were collected through Open States, a site that collects data on state legislatures. Bills were selected if they contained the phrases “food stamps”, “SNAP”, “food bank”, “food desert”, “hunger”, “food insecurity”, or “georgia peach card”. Bills with no recorded votes were removed, as were votes cast by representatives no longer in office.
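The bill-selection step amounts to a keyword filter; a sketch with hypothetical bill records (the real data comes from Open States):

```python
KEYWORDS = ["food stamps", "snap", "food bank", "food desert",
            "hunger", "food insecurity", "georgia peach card"]

def select_bills(bills):
    """Keep bills whose title or description mentions a SNAP-related
    phrase; drop bills that never came to a vote."""
    kept = []
    for bill in bills:
        text = (bill["title"] + " " + bill.get("description", "")).lower()
        if any(k in text for k in KEYWORDS) and bill.get("votes"):
            kept.append(bill)
    return kept

# hypothetical records shaped loosely like Open States bill entries
bills = [
    {"title": "An act on food stamps eligibility", "votes": [{"yes": 30}]},
    {"title": "Highway funding act", "votes": [{"yes": 50}]},
    {"title": "SNAP outreach act", "votes": []},  # no recorded votes
]
print([b["title"] for b in select_bills(bills)])
```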

On the web app, users can select which chamber of the Georgia General Assembly they want (House or Senate), then choose a politician to learn about. The web app then displays the legislator’s voting record on bills relating to food stamps and links the user to further resources, such as the text of the bills and the legislator’s site.

Politician Tracking Tool


The food bank is planning on using our tools to inform their interaction with media outlets, to prepare for meetings with politicians, and to adjust their social media and outreach messaging.
We are proud to have been able to work alongside the food bank to create this web app. Frequent feedback and discussions with the Atlanta Community Food Bank helped us to shape our project to suit their needs.


Thank you!
We thank our mentor, Carl DiSalvo, Associate Professor and Coordinator for the MS in Human-Computer Interaction at Georgia Tech for his guidance and advice. We also thank our Food Bank partners Lauren Waits, Director of Government Affairs; Allison Young, Marketing Manager; and Jocelyn Leitch, Data and Insights Analyst; for educating us about food policy, food insecurity in Atlanta and across the nation, and the work of the Atlanta Community Food Bank. Finally, we would like to thank the staff and students participating in the Data Science for Social Good – Atlanta program for their support and assistance.

Seeing Like a Bike: Cycling into the Sunset

The main focus of our project is to provide valid data to improve cycling conditions in the City of Atlanta. Our target users were bikers who may not feel comfortable riding during peak rush-hour traffic.

To do this, we had to measure the stress of cyclists in different situations and tag road segments based on a Level of Traffic Stress (LTS) model. Measuring a cyclist’s stress requires taking into account environmental factors like traffic, infrastructure, and pollution.

Traditionally, information about these factors would be gathered through surveys and by reporting incidents as they occur. The Seeing Like a Bike team, though, chose a sensor-based approach to data collection.

The data collected would help us determine the LTS level of a given road segment. An LTS of 1 describes a path where a less confident rider would be comfortable, while an LTS of 4 is suited only to well-seasoned riders.

While the overarching problem is an urban and social one, on the sensor side it reduces to five main engineering problems. The problems, their solutions, and the system design we came up with can be seen below:


The front Master box was 3D printed, while the back Slave box was a laser cut ABS box. The sensors were screwed into place as can be seen in the video below:

Next came the data collection for which we thank Jeremy, Mariam, and Jihwan for helping out by taking the boxes out to collect data. The data visualizations can be seen below:

The above shows the proximity sensors during a ride from TSRB to Home Park. Here we can see deflections in the values while passing cars and other obstacles.

The above shows the PM sensors along the vertical axis during a ride from Midtown to Downtown. Here the PM sensors show very high peaks at exactly the times when the rider crossed large clouds of smoke and dust.

Along with the data, we also collected video with a GoPro mounted on the bikes. The video was used to tag specific instances in the footage and the data; in this way, several 5-second signatures were marked and saved. The next step was to build a classifier for the right- and left-side sensors. This was done by going through the video and tagging every obstacle that was apparent. One such video can be seen below:

This tagging was done to train the machine learning model. The time-based data was first translated into the distance domain to remove the bias caused by objects moving at different speeds relative to the bike. Data interpolation and smoothing were then applied to arrive at the proximity-pattern features.
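The time-to-distance translation can be sketched as a resampling step; the speeds, sampling rate, and proximity values below are hypothetical:

```python
import numpy as np

def to_distance_domain(t, speed, proximity, step=1.0):
    """Resample a proximity time series onto a uniform distance grid,
    removing the bias introduced by the bike's varying speed."""
    t = np.asarray(t, dtype=float)
    v = np.asarray(speed, dtype=float)
    # cumulative distance travelled at each sample time
    d = np.concatenate([[0.0], np.cumsum(np.diff(t) * v[:-1])])
    grid = np.arange(0.0, d[-1], step)
    return grid, np.interp(grid, d, proximity)

# 1 Hz samples; the bike accelerates from 2 m/s to 4 m/s (hypothetical units)
t = [0, 1, 2, 3]
speed = [2, 2, 4, 4]
prox = [5.0, 5.0, 1.0, 5.0]
grid, resampled = to_distance_domain(t, speed, prox, step=1.0)
print(grid.tolist())   # one sample per metre of travel
```

After this, an obstacle’s signature has the same width in the data whether the bike passed it quickly or slowly, which is what makes the signatures comparable.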

Two machine learning algorithms, Support Vector Machine (SVM) and Random Forest, were used to predict the class of each data segment. To compare the prediction models against the baseline classification power (randomly predicting classes), we plotted them together across varying train-test splits. Accuracy was around 50%: promising, but not yet enough. This was largely because we needed more data; the model was learning and improving quickly, and was already far better than the baseline score of around 18%.

In the future, we would like to see the prediction power of the model improve significantly through the collection of more training data. Some feature engineering would also significantly improve the accuracy. The model could then be used to set well-defined, data-driven boundaries for the LTS model, helping policy makers make the city safer for cyclists of all comfort levels.

Ultimately, our overarching goal is to identify the environmental factors that give rise to bike riders’ stress levels. To get there, the environment itself must first be identified, since sensors cannot detect semantic-level objects. Once we can tune and refine the prediction model for detecting environmental factors through feature engineering and modeling, it will be possible to advance to the real questions: how do bicycle infrastructure and environmental factors affect riders’ stress levels, and how can these relationships be used to construct the Level of Traffic Stress (LTS) model?


Ending on a Good Note

Our project team with mentor Professor Ellen Zegura at the final presentation

The final presentation on Monday evening was a great opportunity for us to reflect on all of the hard work and learning we’ve done on our housing justice projects this summer. Our first project was an analysis and visualization of Atlanta’s Anti-Displacement Tax Fund, and the second was an interactive mapping tool to assist the Atlanta Legal Aid Society with a case about contract for deed properties. Although these housing justice issues are extremely complex and without clear solutions, we are proud of the results and tools we have created, and hope that they will enable our community partners to use data to better advocate for housing justice.

The Anti Displacement Tax Fund

The Anti-Displacement Tax Fund was developed as a response to community concerns surrounding rising property taxes and potential displacement due to urban revitalization projects on Atlanta’s Westside, namely the Mercedes Benz stadium and the western portion of the Beltline trail. The tax fund promises to help prevent displacement by offsetting property tax increases for eligible homeowners on the Westside, but community members remain concerned about how effective it may actually be in the long run. Our goals for this project were to calculate the number of eligible homeowners and the total cost of the program over time. We also sought to make our results accessible and open to community members by developing an interactive web application and getting community feedback along the way.

Our team hit some roadblocks initially with data collection and determining the best methods to achieve our desired results, but we eventually found 410 eligible homeowners using Fulton County Tax Assessor data, lien data from the Georgia Clerk’s Authority, and income modeling based on housing characteristics and Zillow data. Using historical tax assessor data, we forecasted property appreciation and property taxes for the next seven years by comparing the Westside to the Old Fourth Ward neighborhood, an area on Atlanta’s Eastside that previously experienced rising property taxes with Beltline construction. These neighborhoods were used to create clusters of properties with similar home characteristics and forecast property appreciation at the household level. By combining these results with our eligibility estimates, we were able to calculate an overall 7-year program cost of almost $1.8 million, much higher than the only previous public estimate.

To make the data more open and accessible to community members, we created an online, interactive map tool: http://dssg.gatech.edu/adt/. The map shows a shaded region representing neighborhoods eligible for the tax fund and dots representing homes. You can search for or click on a house to display information about program eligibility and forecasted property taxes for that home. An edit feature that will allow community members to update property information, improving the eligibility and cost estimates, is also under development. We are excited to have gotten great feedback about the tool from our partners at the Westside Atlanta Land Trust, and hope that it will allow both community members and policy makers to evaluate the program’s impact and alternative options. (View screenshot below)

Contract for Deeds: A Harbour Case Study

Our other project was an interactive mapping tool of contract for deed properties for our partners at the Atlanta Legal Aid Society. You can view the login page for this map here: http://dssg.gatech.edu/housing/login.html. Our project is called JUMA, or “Justice Map,” and allows Atlanta Legal Aid to view properties currently or previously owned by Harbour Portfolio, a real estate investment company. Atlanta Legal Aid is currently involved in a lawsuit against Harbour over allegations of discriminatory and deceptive lending tied to its “contract for deed” business model. Contracts for deed allow people who cannot afford a traditional mortgage to purchase a home through monthly payments, but buyers do not receive the title to the home until the purchase price is paid in full, and they can be evicted if they default on any payment.

Our mapping tool displays information about Harbour’s properties, including the current owner and appraised value. The information can be edited and notes can be left on each property to help with organization for the case. There are also demographic overlays such as income by zip code and racial density by census tract. We have enjoyed developing this tool with feedback from Atlanta Legal Aid and hope that it will allow them to interact with the Harbour properties in a new way to shed light on their case.


Using data science for projects that benefit the social good is extremely rewarding. We are so grateful to our partners at the Westside Atlanta Land Trust and the Atlanta Legal Aid Society; our sponsors from NSF, South Big Data Hub, Georgia Tech, and LexisNexis; and our amazing mentors Ellen, Amanda, and Chris for giving us this unique opportunity to learn about housing justice, data analysis, and community involvement.

Seeing Like a Bike: Iterations for Better Data Quality

Our team has been refining the sensor boxes as we collect data. Our friendly colleagues volunteered to ride a bike, and each time they came back with a handful of data along with a list of issues to fix in the hardware, software, and the data itself. Even though technical challenges arising from physical shocks and vibrations can hardly be eliminated, given the nature of the electrical parts used, many issues have been resolved through small, iterative changes.

Trial and Error: Data Collection and Sensor Box Refinement

Without the help of our awesome colleagues, this would not have been possible.

Whenever the LEDs indicated a sensor malfunction, or we found wrong data, we unpacked the box and examined the flaws. The minor (?) issues we identified and fixed are as follows:

  • Occasional hiccups in the communication between the Pi and the Arduino -> resolved by implementing timeout and reset functionality.
  • Impedance issue: all of a sudden, an Arduino board stopped working and sent out “NACK” signals, and never returned to normal after resetting the board -> resolved by removing some of the solder from our custom bridge. Too much solder on the PCB lowers the impedance and blocks weak signals going in and out of the Arduino.
  • Cable order: the pin order on some cables for the sonar sensor was reversed. This did not raise an error, but the data was wrong -> we examined all the cables, bridges, and sensor pins.
  • Broken wires: some cables looked fine on the outside, but a wire was broken inside the socket. This can be prevented in the future by using stronger cables and sockets that can withstand bike shocks.
  • Hardware errors: some Arduino boards, USB-to-TTL connectors, and sensors were found to be damaged and out of order -> this was the hardest part to identify; once found, the parts had to be replaced.

Gas Calibration Data Collected

With the help of Raj, a Ph.D. student from the Department of Environmental Science, we were able to co-locate our gas sensors at the official gas sensing station ten minutes from Georgia Tech. By comparing our sensor data with official data from the station, we expect to calibrate the gas sensors to some degree. Since the temporal resolution of the official data is one hour, it will be hard to calibrate very precisely; even so, this should greatly increase our gas sensors’ accuracy.


Environmental Signatures and Ground Truth Data

If we can identify what objects are around the bike just by looking at the sensor data, we can use that data for semantic-level analyses. Without a guaranteed connection between the sensor data and real-world objects, modeling environmental factors from sensor data would hardly be convincing, given the noisy nature of sensors. Our strategy for analyzing the sensor data begins with creating semantic-level signatures and classifying each segment of the streaming data from the bikes. To do this, we recorded environmental information as video and audio using a GoPro and voice recorders. These qualitative data provide ground-truth information for the sensor data.

Based on the Level of Traffic Stress (LTS) model, we listed possible obstacles and objects along the biking routes. After aligning the GoPro video and sensor streams by time, we qualitatively tagged each segment of the video (only when the scene was not too complex). For example, when a vehicle passes the bike and there are no other objects around in a video segment, we assume that the corresponding sensor data is a typical signature of a vehicle passing by. After a test ride in downtown Atlanta, we collected ground-truth data.


Here are some examples for creating signatures: (1) a narrow street with cars parked in parallel, and (2) a city road with a car passing by the rider.


The temporal pattern of the Lidar data corresponding to this video segment is as follows:


Since the frequency of the proximity values might provide better indications for objects rather than the temporal pattern of it, we converted this signal into the frequency domain using the Discrete Cosine Transform.
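A sketch of this transform with SciPy, using a synthetic trace whose energy sits in a single known frequency bin (real Lidar traces are of course noisier):

```python
import numpy as np
from scipy.fft import dct

n, k0 = 64, 16                     # 64 samples; energy placed at DCT bin 16
t = np.arange(n)
# hypothetical Lidar trace: a constant baseline plus one oscillating
# component, standing in for regular dips as the bike passes parked cars
signal = 3.0 + np.cos(np.pi * k0 * (2 * t + 1) / (2 * n))

coeffs = dct(signal, norm="ortho")          # DCT-II with orthonormal scaling
k = int(np.abs(coeffs[1:]).argmax()) + 1    # skip the DC (baseline) term
print("dominant DCT bin:", k)               # -> 16
```

Reading off which bins carry the most energy is exactly how the parked-car and passing-car cases are distinguished in the frequency signatures below.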

This frequency signature can be used to classify similar environmental factors in the data. Similar to this, the case where a car is passing by the rider is as follows.

These two cases show distinctive patterns to some degree. The graph of a street with cars parked in parallel shows a regular change in Lidar values, which results in high middle-level frequencies (around bins 4 to 7). Meanwhile, the case where a car passes the rider shows a higher value at a low frequency (around bins 2-3), since the Lidar value changes sharply at one moment. Of course, these are exploratory signatures, and more ground-truth data and other sensors need to be aggregated to produce robust signatures.

We are working on generating more ground-truth data. The classification performance for data segments depends on (1) the quality of signatures, (2) the quality of ground-truth data, and (3) the prediction model (feature engineering). We hope to finish the first-round classifications of sensory data in a few days with a high prediction performance.

We are reporting our final results at the DSSG final presentation on Monday (July 24th, 2017).

Week 9: The Final Push

After scrambling the past couple of weeks with paper submission deadlines and the mid-program presentation, we are now working on re-running our models, finalizing our estimates, and making updates to our interactive tools. Time is of the essence, as we need to have everything wrapped up and ready to pass along in less than two weeks.

We realized last week that our original income model based on home characteristics was producing some strange results, with incomes well above the IRS distribution for the region. This was causing us to underpredict eligibility based on income, so we modified the model by creating dummy variables in the Consumer Expenditure Survey data to classify properties in the South, in urban areas, and with Black owners. All of these characteristics are representative of the eligible neighborhoods in Westside Atlanta, and accounting for them gave us household-level results that more accurately mirror the IRS income distribution and actual incomes in the area. We are also re-running the other pieces of our models, including the owner-occupancy classification, total program costs under different scenarios, and the home-value appreciation clusters for Old Fourth Ward and the Westside.

Above: Map of clusters of homes in Old Fourth Ward and Westside neighborhoods, based on property appreciation trends and important home characteristics

As we enter into the final stages of our project, we have set up meetings with our community partners to receive feedback on our interactive mapping tools. On Monday, we met with members of the Westside Atlanta Land Trust (WALT) to discuss the eligibility tool. While we got great feedback on our progress, we still have some important changes to make, including implementing an edit functionality that will allow residents to update or approve their information in the database. This function will hopefully provide some ownership, oversight, and verification of the data, as owners know their own household characteristics best. We are also working on reformatting the information displayed about the properties and creating a dropdown box for sensitive data, such as lien data and estimated income. It is vital to be careful and respectful with how we display community members’ personal data.

We are now also making good progress on the Harbour Portfolio (predatory lending) mapping tool. On Thursday, we will have a meeting with Sarah Stein from Atlanta Legal Aid to demonstrate our progress, including new demographic overlays, a Zillow search function, and classification of properties owned by plaintiffs. The tool is also now hosted on the Georgia Tech server! You can see our login page at: dssg.gatech.edu/housing.

Above: Current progress on our Harbour mapping tool

We’re hoping to have our modeling and web applications completed by Wednesday evening so that we can spend Thursday and Friday making our poster and preparing for the final presentation on Monday. Time is flying!

Nearing the finish line

This past week we’ve been working on creating visualizations of the data we’ve collected and preparing it for the R Shiny SNAP app.


Below is an analysis we created for understanding SNAP sentiment over time. The y axis shows the VADER score, and the x axis shows the date. When you hover over one of the bars, you can see the most frequent words from that time frame. This is important because positive sentiment can be due to many things – for example, people may be speaking positively about budget cuts to SNAP, or they may be speaking positively about SNAP itself. Showing the most frequent words will help to tease out the meaning.
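The per-time-frame summary behind that chart can be sketched like this. The records and scores below are made up for illustration; in the real pipeline the compound scores would come from VADER's SentimentIntensityAnalyzer:

```python
from collections import Counter
import re

# Hypothetical records: (month, VADER compound score, text).
posts = [
    ("2017-06", 0.6, "SNAP benefits help families buy food"),
    ("2017-06", -0.4, "budget cuts threaten SNAP funding"),
    ("2017-07", 0.3, "SNAP enrollment numbers improve"),
]

def summarize(posts, month, top_n=3):
    """Mean compound score and most frequent words for one time frame."""
    subset = [(score, text) for (m, score, text) in posts if m == month]
    mean_score = sum(score for score, _ in subset) / len(subset)
    words = Counter(
        w for _, text in subset for w in re.findall(r"[a-z]+", text.lower())
    )
    return mean_score, words.most_common(top_n)
```

Pairing the mean score with the top words is what lets a viewer tell a positive bar about SNAP itself apart from a positive bar about cuts to SNAP.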

Additionally, we continued to work further on our map of news outlets. We included information about sentiment and then created a hotspot map of sentiment. In the map below, blue is a cold spot (negative about SNAP), and red is a hot spot (positive about SNAP). We are also planning on adding the top words to start extracting the meaning of this sentiment.

Finally, our politician tracking tool is coming along nicely. The data has been cleaned and is being displayed in R Shiny. Below is a screenshot of the application: you choose whether you are researching a senator or a member of the House of Representatives, and then select the specific representative. Going forward we will include more detail on the bills and the representatives.

The good, the bad and the ugly

The good: And there is light.

Front sensing box light on

While adjusting the first pilot sensing device (unit 1.0 beta #1), the team has been working on several parallel tasks to make it possible to start collecting pilot data by the end of this week. Most of the issues with the current design are hardware related, though some industrial design concerns persist and will need to be addressed in future iterations of the case design. The overall goal of this phase of the project is to fix the critical issues that would prevent users from collecting data during regular operation of their bikes. We assume the physical design and software implementation can be further refined, but the priority is collecting data, so that we can establish a meaningful feedback loop between evaluating the gathered data, testing the hypotheses and assumptions made at the start of the project, and improving the sensing devices.

From a purely physical, construction standpoint, the anchoring of the back sensing box to the rack needed to be improved. During the last trip, the original plastic clamps could not withstand the tension from the screws and broke apart. We replaced them with new metal clamps, and the anchoring now looks like it will hold much longer. Installation is also easier, since fewer screws are needed, and the joint retains some flexibility thanks to the give of the folded metal plate itself.

Back sensing box with new metal clamps


The bad: Without information, you’d go crazy. (Arthur C. Clarke)

The most demanding and critical issue, however, is without doubt the random, inconsistent malfunctioning of the custom gas sensor board. It is critical because, at least in this first model of the sensing unit, detecting different profiles of gases and pollutants in the air is considered an important factor affecting rider stress, and also a useful way to distinguish, indirectly and without using images, different kinds of vehicles close to the bike. Many different issues surfaced while testing the behavior of the gas sensors: 560 kΩ resistors had to be replaced with 470 kΩ resistors, the ADC boards showed memory allocation issues, higher voltage turned out to improve not only signal resolution but also sensor stability, and sensor and amplifier behavior was inconsistent. In the end, debugging the devices to allow correct calibration came down to testing the custom boards one by one.

Custom boards for gas sensors

Given that the objective is to have at least 3 functioning bikes by the end of the current week, we were able to discard the clearly faulty boards and focus on the ones showing consistent, though not yet calibrated, data. The sensor flaws were evident enough that, even at this preliminary stage, our custom vacuum device could detect major failures in the tested sensors.

Left: well-functioning sensors displaying stabilized readings under a controlled atmosphere. Right: malfunctioning sensors displaying extremely noisy and inconsistent readings.


Stable readings (1, 3 and 4) and unstable readings (2)

After identifying the usable sensors, the next step is calibrating them at the official Atlanta gas sensing station, which provides air quality data for public institutions in the Atlanta metropolitan area.

The ugly: I just want to say one word to you—just one word …“plastics!” … There’s a great future in plastics. (Mr. McGuire in “The Graduate”)

Concurrently, while the gas sensors were being adjusted, work proceeded on getting the whole system running for the two new sensing units to be mounted on bikes this week. New cases, boxes, and electronic components have been assembled and are already being tested. Some issues with the strictly physical components of the units have been detected, but none serious enough to prevent the pilot data collection. At least three additional design iterations will be needed to nail down the design of the devices. Different plastic materials and printing technologies have been tested, and more remain to be tried; for development and prototyping purposes, however, this exploration is useful and helps surface design issues.


New parts for the units 2 and 3

Despite the issues encountered during hardware adjustment, we have been able to deliver the first sensing units to external users, who will provide important, meaningful data for improving the whole system and, above all, for modeling the riding conditions the project aims to study and understand. The second unit is already running, and the third will be soon, thanks to brave volunteers willing to help collect the data.


This first deployment of the sensing units focuses on adjusting the sensors, designing the data cleaning and processing pipeline, and finally building a basic data model able to encompass the project's theoretical background: the quality of the biking infrastructure and the effect of traffic on riders' stress levels.
By the end of this week, we expect to have the gas sensors calibrated after a 24-hour test at the Atlanta air quality facilities. That will allow us to fully equip three bikes with the sensors and start collecting data and performing the first analysis and modeling on this pilot data.

Almost there!!!

The past two weeks were pretty hectic. We spent long hours in the lab and did tons of number crunching. The prior week, we had our midterm presentation, and last week we had a deadline for paper submission. Fortunately, both went pretty well. We submitted our work, titled “Displacement and Housing Justice: Analysis and Visualization of Atlanta’s Anti-Displacement Tax Fund,” to the Bloomberg Data for Good Exchange conference. We are very excited about our work and hope it will be accepted.

Our primary focus for the last two weeks has been the Anti-Displacement Tax Fund project. We have been trying to model all eligibility requirements and predict future property taxes, and worked on the Income and Liens eligibility requirements. Hayley has been gathering liens data from the Georgia Superior Court Clerks’ Cooperative Authority. Since the data is not easily extractable from the source, a random sample of 173 homeowners in Westside was gathered (23% of all homeowners). From this sample, 60 were found to have current liens, and the resulting ratio of 0.35 was used as the lien rate. For the income eligibility requirement, Bhavya and Chris have been trying to model household-level income. The income eligibility model utilizes data from the U.S. Bureau of Labor Statistics Consumer Expenditure Survey (CEX) for the years 2013, 2014, and 2015, along with Zillow data. The model's main output variable of interest is the household’s before-tax income. An innovation of the analysis is the use of observable, physical characteristics of a house to predict a household’s before-tax income. The XGBoost model outperformed the other machine learning models and was used to predict income for all households in the relevant Westside neighborhoods.
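The lien-rate estimate from that sample works out as below. The confidence interval is an added illustration of the sampling uncertainty, not part of the original analysis:

```python
import math

# Lien-rate estimate from the random Westside sample described above.
sample_size = 173   # randomly sampled homeowners (~23% of all homeowners)
with_liens = 60     # sampled homeowners found to have current liens
lien_rate = with_liens / sample_size   # ~0.35, the rate used downstream

# Rough 95% confidence interval (normal approximation); this is an
# added illustration, not a figure from the project itself.
se = math.sqrt(lien_rate * (1 - lien_rate) / sample_size)
ci_low, ci_high = lien_rate - 1.96 * se, lien_rate + 1.96 * se
```

With 173 of roughly 750 homeowners sampled, the interval is fairly wide, which is one reason gathering comprehensive lien data remains on the to-do list.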

Jeremy estimated future assessed values for all households in Westside. We calculated the property tax for each household, as well as the increase in property taxes over the next 7 years. In the end, we found that 411 households in Westside might be eligible for the Tax Fund, and that the total cost of the program could reach $821,484.
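The per-household tax-increase calculation has roughly this shape. The millage rate and appreciation rate below are placeholder assumptions for illustration, not the project's actual estimates (Georgia does assess property at 40% of fair market value):

```python
# Assumed values for illustration only, not the project's estimates.
MILLAGE = 0.032          # hypothetical combined city/county millage rate
ASSESSMENT_RATIO = 0.40  # Georgia assesses at 40% of fair market value

def annual_tax(market_value):
    """Annual property tax bill for a given fair market value."""
    return market_value * ASSESSMENT_RATIO * MILLAGE

def tax_hike(current_value, years=7, annual_growth=0.08):
    """Increase in the annual tax bill after `years` of appreciation."""
    future_value = current_value * (1 + annual_growth) ** years
    return annual_tax(future_value) - annual_tax(current_value)
```

Summing the projected hike over all eligible households, under the appreciation trends from the clustering work, is what yields a total program cost estimate.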


To make our work more accessible to the community, we decided to build a web app to present our results and spread awareness about this tax fund. To this end, Vishwamitra created a mapping tool for all the Westside households, as shown in the above picture. This web app shows whether a particular household is eligible for the Tax Fund based on the different eligibility conditions. The ones with a green marker represent eligible homeowners, and the red ones are either unsure or not eligible. For better visualization, neighborhood and BeltLine overlays have been added. When we click on any household represented by a dot, a popup appears on the side containing eligibility information such as income, liens, and owner occupancy. The popup also contains estimated property taxes for that property for the 7 years from 2018 to 2024. To spread awareness about the eligibility requirements, a window has been added showing the eligibility criteria laid down by the Westside Future Fund.

Next, we plan to further improve our income and tax estimation algorithms. We plan to gather comprehensive lien data for all households to obtain a more accurate estimate of which homes qualify. Our community partners have also expressed interest in exploring the addition of residents of an adjacent neighborhood, Washington Park, to the geographic scope of the tax fund. Therefore, we would like to estimate the cost and possible effects of adding Washington Park to the tax fund.

Pushing through milestones

This week was a season of growth for our team, best characterized by milestones, triumphs, and valuable lessons. In the wake of the mid-term presentations, our team headed back to the drawing board to work out kinks in our poster, oral presentation, research, and paper. Individually, we made it a priority to plan for the conclusion of the program and the tools and deliverables that will then be available to our partners and the community. To ensure the longevity of the work we have produced through this program, we have decided to look not only to our own resources, such as automated code and already prepared scripts, but also to the opportunity to inspire future Data Science for Social Good students, whether they participate in the program, pursue an internship with the Atlanta food bank, or take a special topics course for social good at Georgia Tech. By cleaning our code, providing documentation, and automating applicable tools and interactive applications, it will be significantly easier for our partners to maintain the provided deliverables and continue to see results in “real time.”

Image: Proudly posing with our poster. (From left to right: Dorris, John, Miriam, Mizzani)


Looking toward the future, we are optimistic about the functionality of our project, and we are proud that we were able to turn data into actionable information and meaningful storytelling. The research we have conducted over the course of this program, building on past work in topic modeling, deep learning, and multi-aspect sentiment analysis of media coverage relating to food, public safety, and health, is uncharted territory with endless paths and destinations. With our findings, we hope to inspire others interested in helping community organizations and small companies visualize their progress and the public perception of their brand, and to provide tangible analytics tools that improve their strategic planning, business decision-making, and tactical reporting.

As we continue to grow and invest in this project, we look forward to adding practical functions such as website traffic metrics and real-time sentiment analytics, linking the topic modeling tools to the sentiment analysis tool, and finding the most effective and efficient way to handle duplicate news articles. Adding these features will enhance the usability of the tools and further encourage users to stay engaged and comfortable with the interactive design. As we move closer to the end of the project, we are determined to complete our remaining tasks: cleaning data, debugging code, geocoding articles to generate spatial statistics, etc. Although tedious and time-consuming, manually geocoding our sources will improve the accuracy of our results in locating the news outlets behind the articles collected across the nation. We look forward to expanding our tool in the near future to serve the food bank network at large and to using data to contribute to the continued success of the Atlanta Community Food Bank and non-profit organizations across the globe.

Image: Current geocoding status of locations based on the collected articles. There are 1,600 articles in total that were collected using webhose.io in this sample.