Reflections from the DSSG 2015 Fire Team

(posted on behalf of the DSSG Fire team students)


A year after the culmination of our DSSG 2015 summer program, we wanted to share how the work we did with DSSG last summer has developed since then.

We spent the summer of 2015 working with the Atlanta Fire Rescue Department (AFRD) to help them use disparate data sources from various city departments to identify new properties requiring fire inspection. We also built a predictive model to help them prioritize fire inspections according to the fire risk of commercial properties.

You can see some of our blog posts from last summer on beginning the project, understanding and joining our data sources, riding along with fire inspectors to understand their existing processes, conducting preliminary analyses of the data, and building a predictive model of fire risk.

We created a framework, which we call Firebird, to describe this process of property discovery and risk prediction, as seen below.

[Figure: The Firebird framework for property discovery and fire risk prediction]

As a more permanent home for this work, we have created a website for Firebird, which provides a high-level overview of the project and includes a link to our code on GitHub.

At the end of last summer, we presented our work to an audience of local data scientists at the final summer presentation at General Assembly, garnering interest from several firefighters from neighboring counties who were in attendance. Following that presentation, Fire Chief Joel Baker, the head of AFRD, invited our team to speak at a meeting of the AFRD executive staff, including the battalion chiefs for each of the seven battalions that comprise the city of Atlanta.

Since then, AFRD has already begun to implement our recommendations, from prioritizing inspections of the properties at highest risk of fire, to opening conversations about allocating inspection personnel and resources to reflect the distribution of commercial properties requiring inspection across the city.

In September 2015, we submitted and presented a short paper describing our work and its outcomes at the Bloomberg Data for Good Exchange, a conference on applications of data science to problems of social good, involving participants from academia, industry, government, and NGOs.

Then, wanting to further the impact of this work, we submitted a full paper to the 2016 Knowledge Discovery and Data Mining (KDD) conference, a top conference in the data mining field. It has recently been accepted, and we will be presenting the work there in August. A pre-print draft of the paper can be found here.

Finally, two representatives from our project, Dr. Bistra Dilkina and Dr. Matt Hinds-Aldrich, presented this work at the National Fire Protection Association (NFPA) Annual Conference this June. The NFPA magazine also recently published an article, "Embracing Analytics," with a nice description of our work, explaining our process and its results to a wider audience of fire professionals.

A Predictive Model for Fire Risk in Atlanta

As part of our deliverables to the Atlanta Fire Rescue Department (AFRD), we are giving them a list of potential properties to inspect. However, we needed to prioritize this list based on fire risk so that AFRD can best allocate their inspection resources. To do so, we created a model that predicts fire risk based on characteristics of properties in Atlanta. The model was built in the R statistical programming language using a Support Vector Machine (SVM) algorithm, with 58 independent variables predicting fire as the outcome variable. Data sources for the model's features include the CoStar properties dataset, parcel data and SCI data from the City of Atlanta, demographic data from the U.S. Census Bureau, and fire incident and inspection data from AFRD. Features were based on property location, land or property use, financial factors, time-based factors such as year built, condition, occupancy, size, building details, owner information, demographics of the property's location, and inspection data.
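To make the modeling step concrete, here is a minimal sketch in R of how such an SVM could be fit using the e1071 package. The file name, data frame, and column names are illustrative assumptions, not the actual Firebird code.

```r
# Illustrative sketch only -- the file name, data frame, and column
# names are assumptions, not the actual Firebird pipeline.
library(e1071)  # provides svm()

# One row per commercial property: 58 feature columns plus a binary
# 'fire' outcome (1 = had a fire incident, 0 = did not).
properties <- read.csv("properties_with_features.csv")
properties$fire <- factor(properties$fire)

# Fit an SVM classifier. probability = TRUE lets us later extract a
# continuous 0-1 risk score instead of just a class label.
fire_model <- svm(fire ~ ., data = properties,
                  kernel = "radial", probability = TRUE)
```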

Prediction Model Validation

Our model proved highly predictive of fires. We validated it in two ways:

First, we validated our model using a time-based approach. Ideally, we would predict which buildings will catch fire in the coming year and then look into the future to see which actually did. Since we can't look into the future, we simulated this approach by using data from 2011–2014 to predict fires in the last year of data, 2014–2015. We used 10 bootstrapped random samples and averaged across them to calculate our results. The model did very well, with an average accuracy of 0.77 and an average area under the ROC curve (AUC) of 0.75. Here is a confusion matrix of the results:


Figure 1: Confusion matrix for time-based model validation approach.
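As a rough illustration of this time-based scheme, here is a sketch in R, again with hypothetical variable and column names (e.g. a period label separating the training years from the held-out year):

```r
# Sketch of the time-based validation. 'period' is a hypothetical label
# separating training years (2011-2014) from the held-out year (2014-2015).
library(e1071)

train <- subset(properties, period == "2011-2014")
test  <- subset(properties, period == "2014-2015")

results <- replicate(10, {
  boot <- train[sample(nrow(train), replace = TRUE), ]  # bootstrap resample
  m    <- svm(fire ~ ., data = boot, kernel = "radial")
  pred <- predict(m, test)
  cm   <- table(Predicted = pred, Actual = test$fire)   # confusion matrix
  c(accuracy = mean(pred == test$fire),
    tpr      = cm["1", "1"] / sum(cm[, "1"]))           # true positive rate
})
rowMeans(results)  # average accuracy and TPR over the 10 bootstrap runs
```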

The most important metric in this case is the true positive rate – that is, of the properties that actually had a fire, how many the model correctly predicted. Of the properties in the last year of data that did have a fire, our model correctly identified 73.31%. This means that for every 10 fires, our model would have predicted approximately seven of them. Considering how few fires occur (only about 6% of properties have fires), this is far better than guessing at random which properties would catch fire.

We also validated our model using 10-fold cross-validation, a more standard machine learning validation approach. The model again did quite well, with an average accuracy of 0.78 and an average AUC of 0.73. Here is a confusion matrix of the results:


Figure 2: Confusion matrix for 10-fold cross-validation approach.
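For reference, e1071's svm() can perform k-fold cross-validation directly via its cross argument. A minimal sketch, using the same hypothetical data frame as above:

```r
# Sketch: 10-fold cross-validation using e1071's built-in 'cross' argument.
cv_model <- svm(fire ~ ., data = properties,
                kernel = "radial", cross = 10)
cv_model$accuracies    # per-fold accuracies
cv_model$tot.accuracy  # overall cross-validated accuracy (in percent)
```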

In this validation, the model identified 67.56% of the true positives. This means that for every 10 fires, our model would have predicted almost seven of them.

It is worth briefly discussing the implications of the false positives in this model. In both validation approaches, we had a substantial number of false positives – that is, properties that our model predicted would have a fire, but that did not actually have one. Though many predictive models try to maximize specificity (the proportion of actual negatives correctly classified as negatives) by reducing false positives, in the context of deciding which properties to inspect, false positives are actually quite valuable. False positives represent properties that share many characteristics with properties that did catch fire. Because they share these characteristics, they may themselves be at high risk of fire and should be inspected by AFRD. Additionally, because in a sense our training set and the data set we ultimately apply the model to are the same (the list of commercial properties in Atlanta), a perfect model with no false positives would do nothing more than tell us which buildings had previously caught fire. While that is useful to know, it is data AFRD already has. False positives give us the added value of flagging properties that have not caught fire, but are at risk due to their characteristics.

We want to give the caveat that this particular model is not necessarily the best fit for the data. Although we tried many other algorithms and configurations of features and found this model to be the most predictive, further experimentation would likely yield an even more predictive model. We encourage AFRD and others to build upon our methods to improve the model.

Applying the Predictive Model to Potential Inspections

After we built the predictive model, we applied it to the list of current and potential inspections so that AFRD could prioritize inspections of the properties most at risk of fire. To do this, we first computed the raw output of the prediction model for each property on this list, a score between 0 and 1 (see Figure 3 below). To make these scores more useful, we translated them to a 1-10 scale, and then divided them into low risk (1), medium risk (2-5), and high risk (6-10).


Figure 3: Transforming model output to risk scores.
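A minimal sketch of this transformation in R; the exact mapping and cut points are our assumptions based on the bins described above, and 'inspect_list' is a hypothetical data frame of properties to score:

```r
# Sketch: turn raw SVM fire probabilities (0-1) into 1-10 risk scores
# and low/medium/high bins. 'inspect_list' and the cut points are
# assumptions for illustration.
pred <- predict(fire_model, inspect_list, probability = TRUE)
raw  <- attr(pred, "probabilities")[, "1"]  # P(fire) for each property

score <- pmax(1, ceiling(raw * 10))         # map (0, 1] onto 1..10
risk  <- cut(score, breaks = c(0, 1, 5, 10),
             labels = c("low", "medium", "high"))
```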

We then applied these risk scores to the list of current and potential properties to inspect, and included them on the interactive map.

As a result of this work, AFRD will be able to focus their inspection efforts on those commercial properties in Atlanta that are most at risk of fire. We hope that this focused inspection will result in fewer fires, fewer fire-related injuries, and fewer fire-related deaths in Atlanta.

Thanks for following our blog posts this summer! It’s been a pleasure to work with Dr. Matt Hinds-Aldrich and the rest of our contacts at AFRD. Please feel free to contact me at ohaimson@uci.edu with any questions about this blog post or the project in general.

– Oliver Haimson

Answering the Call of Duty

We have just wrapped up our second week working with the United Way of Metro Atlanta's 211 Call Center (for more information on the 211 call center, check out last week's blog post!).  This past week has been very productive and has given us more insight into the problem we're trying to solve this summer.


After analyzing some of the sample data we gathered last week, we decided to collect more data about abandoned calls.  Looking at the data from the past few months, we noticed a few numbers that have each called hundreds of times a month.  We will notify the 211 director of this issue.


Our main goal for the summer is to analyze the data and design a phone menu that benefits both callers and agents.  The current menu, pictured below, is long and repetitive, with some inaccurate prompts.  Among the improvements we hope to make on the caller's end are condensing the information, removing the repetitive sections, and allowing repeat callers to skip information they already know.

[Figure: The current 211 call tree menu]

On Wednesday, we had the opportunity to experience the other end of calls by listening to agents handle them.  This was very beneficial, as it allowed us to see what agents do on each call and what they would like to see improved.  The biggest potential improvement involves data entry.  All of the information gathered is manually entered by the agent while they are on the phone.  Agents have to rush to input the caller's age, zip code, insurance status, employment information, and more, all while trying to find the best organizations to handle the caller's needs. One way we hope to improve this is to have callers input numerical data (phone number, age, zip code, etc.) and answer yes/no questions (veteran status, insurance status, etc.) before the call is connected to an agent.

To Get a 1-2-1 Response, Call 2-1-1

United Way of Metro Atlanta is seriously committed to its customers, working to maintain full satisfaction and ensure that all individuals have the opportunity to thrive as part of a prospering community. The United Way of Metropolitan Atlanta was the first to introduce a 2-1-1 service, in 1997. 2-1-1 is operated entirely by private non-profit community-service organizations. The organization offers a variety of informational services, ranging from debt counseling and financial assistance to emergency food and homeless services.

We are a committed team of Georgia Tech students working closely with United Way of Greater Atlanta's 2-1-1 to help them reach their goal of providing the best services to disadvantaged individuals in the community. Our team consists of Hamid Mohammadi, a master's student in Statistics; Richard Huckaby, a student in Computer Engineering; and myself, Fatheia Ahmeda, an Applied Mathematics student.

We met with Mr. Zubler, the director of United Way 2-1-1, on May 20th at the Atlanta main office.  Mr. Zubler gave us a brief introduction to the operation of his office, including a tour of the facility. He was very helpful and patient with all our inquiries, and he provided us with sample data that included records of day-to-day activities. The next day, Thursday, May 21st, Hamid went back to the office and spent the afternoon with the director exploring more data.  As a first step in analyzing the data, we constructed a visualization that displays the distribution of call volume for the top 20 counties in Georgia.
[Figure: Distribution of call volume for the top 20 counties in Georgia]

We are planning several approaches to work with the available data to minimize wait time for callers and optimize the Interactive Voice Response (IVR) system based on our analysis.  Automating routine customer service interactions is the most realistic way to address the wait time issue: an automated voice menu will reduce the need for agents to handle calls manually and thus considerably decrease wait times.

Several critical pieces of data are scattered across different files, which makes merging the data for analysis challenging. For example, the file containing the duration of each call is separate from the one containing the origin/location of the call, and there is no unique identifier to join the records from each file.
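One possible workaround, sketched below in R, is an approximate join on call timestamps. The file and column names here are hypothetical, and simultaneous calls would still collide under this scheme, so this is only one idea among the approaches we are considering.

```r
# Hypothetical sketch: approximately join two 211 call logs on timestamp.
# File and column names (start_time, etc.) are assumptions, not the
# actual 2-1-1 data layout.
durations <- read.csv("call_durations.csv", stringsAsFactors = FALSE)
origins   <- read.csv("call_origins.csv",   stringsAsFactors = FALSE)

durations$start_time <- as.POSIXct(durations$start_time)
origins$start_time   <- as.POSIXct(origins$start_time)

# Truncate both timestamps to the minute and join on that key, accepting
# that calls starting in the same minute may still collide.
durations$key <- format(durations$start_time, "%Y-%m-%d %H:%M")
origins$key   <- format(origins$start_time,   "%Y-%m-%d %H:%M")
merged <- merge(durations, origins, by = "key")
```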
As a team, we are excited to face this challenge and will exert our best effort to complete our tasks.