Hello week three! Last week, team EV kicked off the project with an exploratory data analysis to get familiar with our dataset. We found several interesting results and have posted them here so readers can take a look!
This week, we started to dive into our first two subprojects: sentiment analysis of reviews and categorization of reviews by topic. For sentiment analysis, we’ve successfully built a Support Vector Machine model that predicts whether a review has positive or negative sentiment with around 84% accuracy against hand-labeled reviews. This model already outperforms the previous one, which was built in Microsoft Azure Machine Learning Studio, but we are still looking for ways to improve it further.
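For readers curious what an SVM sentiment classifier looks like in practice, here is a minimal sketch. It assumes scikit-learn with TF-IDF features; our actual preprocessing, features, and hyperparameters differ, and the toy reviews below are illustrative, not drawn from our dataset.

```python
# Minimal SVM sentiment-classifier sketch (assumes scikit-learn).
# Toy reviews and labels are made up for illustration only.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_reviews = [
    "Charging was fast and the station worked perfectly",
    "Great location, easy to use, would charge here again",
    "The charger was broken and support never answered",
    "Terrible experience, the station failed twice",
]
train_labels = ["positive", "positive", "negative", "negative"]

# TF-IDF turns each review into a weighted bag-of-words vector;
# LinearSVC then fits a linear SVM in that vector space.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_reviews, train_labels)

print(model.predict(["The station worked great and charging was easy"]))
```

In our real pipeline the model is trained on thousands of hand-labeled reviews, and accuracy is measured on a held-out set rather than the training data.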
For our subproject on review categorization, there is significantly more prep work to do before analysis. Before we can analyze anything, we must categorize over 140,000 reviews! Our plan is to first build a training set of 14,000 reviews, then use machine learning techniques to classify the rest. Even so, 14,000 is a lot of reviews, so the EV team is going to crowdsource the categorization of a portion of them through Amazon’s Mechanical Turk. While this will ultimately speed up our process, we first have to write an IRB protocol and set up a proper environment to collect valid training data from MTurk. These two steps, writing the IRB and setting up MTurk, are where we focused our work on subproject two this week. By the end of next week, we’re hoping to have all of the materials ready to start asking Turkers to complete our tasks.
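The label-a-subset-then-classify-the-rest workflow can be sketched in a few lines. This assumes scikit-learn; the categories, reviews, and the Naive Bayes classifier below are all illustrative placeholders, not our final design, and the real training set will come from hand labeling and MTurk.

```python
# Sketch of the planned workflow: fit a classifier on a small hand-labeled
# subset, then machine-assign categories to the remaining reviews.
# Categories and reviews here are hypothetical examples.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled = [
    ("The charging cable would not lock into the port", "hardware"),
    ("Connector was damaged and the screen was cracked", "hardware"),
    ("The app charged my card twice for one session", "billing"),
    ("I was billed even though the session never started", "billing"),
]
texts, categories = zip(*labeled)

# Bag-of-words counts plus multinomial Naive Bayes: a simple,
# fast baseline for topic categorization of short texts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, categories)

# The bulk of the corpus then gets machine-assigned categories.
unlabeled = [
    "Broken connector at stall three",
    "Double charge on my credit card",
]
for review, category in zip(unlabeled, model.predict(unlabeled)):
    print(f"{category}: {review}")
```

At 140,000 reviews, even a modest classifier trained on the 14,000-review set lets us avoid hand labeling the other 90% of the corpus.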