Last Friday, we met with our partners from the Food Bank, who gave us further guidance on the direction they would like our data collection to take, along with feedback on the SNAP app that we are building for them. They plan to host the app on their website, and in the near future they would like us to present it to relevant stakeholders. These regular meetings with our stakeholders are valuable because they help us keep the data collection, data analysis, and app development aligned with their overall goals for this project. To better understand the internal operations of the Atlanta Community Food Bank, we will be taking a tour of the food bank next week.
This week, we spent a significant amount of time cleaning and organizing the data that we collected. We have been working with data from ProPublica Congress and Open States to determine Georgia politicians' voting records on issues related to food stamps. In addition, we have cleaned and organized the Twitter data based on whether the tweets are geo-tagged and contain relevant information about SNAP. Since only about one percent of tweets are geo-tagged, the number of geo-tagged tweets is quite small, and the number of geo-tagged tweets relevant to SNAP is even smaller. Our Facebook data has been cleaned and organized as well. We plan to run sentiment analysis and social network analysis on this data to better understand the discourse around SNAP on social media.
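The tweet-filtering step above can be sketched as a simple two-stage filter. The field names (`coordinates`, `text`) follow Twitter's JSON layout, but the keyword list is a hypothetical stand-in for our actual search terms:

```python
# Sketch of the two-stage tweet filter: keep tweets that are geo-tagged
# AND mention SNAP-related terms. The keyword list here is illustrative,
# not our full term set.
SNAP_KEYWORDS = ("snap", "food stamps", "ebt")

def is_geo_tagged(tweet):
    """A tweet counts as geo-tagged if its coordinates field is non-null."""
    return tweet.get("coordinates") is not None

def mentions_snap(tweet):
    """Case-insensitive keyword match against the tweet text."""
    text = tweet.get("text", "").lower()
    return any(kw in text for kw in SNAP_KEYWORDS)

def filter_tweets(tweets):
    """Return only the geo-tagged, SNAP-relevant tweets."""
    return [t for t in tweets if is_geo_tagged(t) and mentions_snap(t)]

sample = [
    {"text": "EBT card declined again", "coordinates": {"type": "Point"}},
    {"text": "Great game last night!", "coordinates": {"type": "Point"}},
    {"text": "SNAP benefits cut proposed", "coordinates": None},
]
kept = filter_tweets(sample)  # only the first tweet passes both filters
```

The same two predicates, applied in sequence, explain why the final set is so small: each stage discards the overwhelming majority of tweets.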
One main chore this week is sentiment analysis of the 1,600 articles we collected through Webhose, a service that lets one scrape websites based on search terms. As previously mentioned, the two sentiment analysis tools we are using are VADER and AFINN. Each article is treated as one instance in the data, and each instance is tokenized into sentences and words in order to extract features. The sentiment analysis, in particular, works on sentence tokens and produces a numerical score. One difficulty is deciding what weight, or number of people, each instance should represent: it would not be fair to give every text the same weight when some texts show evidence of reaching more people (likes, hits, retweets). To address this, we gathered engagement features for each instance. Additionally, information about the arguments and topics that frequently appear in these articles would be very useful to the stakeholders. To extract topical words from the text, we performed preliminary topic modeling with Latent Dirichlet Allocation (LDA), and we have also used n-grams, named entity recognition, and tf-idf (term frequency-inverse document frequency).
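The scoring-and-weighting scheme can be sketched in miniature. The tiny lexicon below is an illustrative subset standing in for AFINN (a published wordlist with valences from -5 to +5), and the likes-plus-shares weighting rule is our assumption for the sketch, not a fixed formula from either tool:

```python
# Miniature sketch of AFINN-style sentiment scoring with engagement
# weighting. LEXICON is a tiny illustrative subset, not the real AFINN list.
LEXICON = {"good": 3, "help": 2, "cut": -1, "bad": -3, "hunger": -2}

def sentence_score(sentence):
    """Sum the valence of each lexicon word found in the sentence."""
    words = sentence.lower().split()
    return sum(LEXICON.get(w.strip(".,!?"), 0) for w in words)

def article_score(sentences):
    """Average the per-sentence scores over one article (one instance)."""
    scores = [sentence_score(s) for s in sentences]
    return sum(scores) / len(scores) if scores else 0.0

def weighted_corpus_score(articles):
    """Corpus-level score where each article is weighted by engagement.

    Assumed weighting rule: likes + shares, so an article seen by more
    people counts proportionally more toward the corpus average.
    """
    total_weight = sum(a["likes"] + a["shares"] for a in articles)
    return sum(
        article_score(a["sentences"]) * (a["likes"] + a["shares"])
        for a in articles
    ) / total_weight

corpus = [
    {"sentences": ["SNAP programs help families.", "A good program."],
     "likes": 90, "shares": 10},
    {"sentences": ["Proposed cut is bad news."],
     "likes": 10, "shares": 0},
]
```

With these numbers, the first article scores (2 + 3) / 2 = 2.5 and the second scores -4, but the heavily shared first article dominates the weighted corpus score, which is exactly the behavior the engagement features are meant to capture.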
The collected articles are currently being organized for social network analysis. We will conduct this analysis not only across all of the articles, but also by news source and by news sources in Georgia. By doing this, we can better understand how SNAP is reported on at various levels.
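One common way to build a source-level network, sketched below under assumptions (the article records and topic labels are hypothetical), is to link two news sources whenever they cover the same topic; degree counts then show which outlets sit at the center of the SNAP conversation:

```python
# Sketch of a source-level network: connect two news sources with an
# undirected edge whenever they cover the same topic. The article
# records and topic labels below are hypothetical examples.
from collections import defaultdict
from itertools import combinations

articles = [
    {"source": "AJC", "topic": "work requirements"},
    {"source": "NYT", "topic": "work requirements"},
    {"source": "AJC", "topic": "farm bill"},
    {"source": "WSJ", "topic": "farm bill"},
]

# Group sources by the topic they covered.
sources_by_topic = defaultdict(set)
for a in articles:
    sources_by_topic[a["topic"]].add(a["source"])

# Every pair of sources sharing a topic gets one edge.
edges = set()
for sources in sources_by_topic.values():
    for u, v in combinations(sorted(sources), 2):
        edges.add((u, v))

# Degree = number of distinct sources each outlet is connected to.
degree = defaultdict(int)
for u, v in edges:
    degree[u] += 1
    degree[v] += 1
```

Restricting `articles` to Georgia outlets before building the edges gives the state-level view of the same network, which is the "various levels" comparison described above.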
We are currently working on more add-ons for our SNAP App. This week, we added background information about the SNAP program to the app. We have also been working on integrating an automated social media analytics tool into the app, so that the client can see real-time updates on press coverage of the main SNAP topics. We have been looking into how the sentiment analysis could be automated, as well as how frameworks such as Hadoop or Spark could store the massive amount of data used in our analytics. We are also exploring the possibility of integrating Google Trends and Google Keyword Planner into the app.
Next week, we will continue with these data chores. We also hope to add more features to the SNAP app, such as politician tracking and Google Trends.