As part of our deliverables to the Atlanta Fire Rescue Department (AFRD), we are giving them a list of potential properties to inspect. However, we needed to be able to prioritize this list based on fire risk, so that AFRD can best allocate their inspection resources. To prioritize the list of properties to inspect, we created a model that predicts fire risk based on certain characteristics of properties in Atlanta. This model was built in the R statistical programming language and used an SVM (Support Vector Machine) algorithm. The model used 58 independent variables to predict fire as the outcome variable. Data sources for features in the model include the Costar properties dataset, Parcel data and SCI data from the City of Atlanta, demographic data from the U.S. Census Bureau, and fire incident and inspection data from AFRD. Features were based on property location, land or property use, financial factors, time-based factors such as year built, condition, occupancy, size, building details, owner information, demographics of property location, and inspection data.
Prediction Model Validation
Our model performed well at predicting fires. We validated it in two ways:
First, we validated our model using a time-based approach. Ideally, we would run the model to predict which buildings will catch on fire in the next year, then look into the future to see which ones actually did. Because we can’t look into the future, we simulated this approach by using data from 2011 – 2014 to predict fires in the last year of data, 2014 – 2015. We drew 10 bootstrapped random samples and averaged the results across them. The model did very well, with an average accuracy of 0.77 and an average area under the curve (AUC) of 0.75. Here is a confusion matrix of the results:
Figure 1: Confusion matrix for time-based model validation approach.
The most important metric in this case is the true positive rate, that is, the share of properties that actually had a fire that the model also predicted would have a fire. Of the properties in the last year of data that did have a fire, our model was able to predict 73.31% of them. This means that for every 10 fires, our model would have predicted approximately seven of them. Considering how few fires occur (only about 6% of properties have fires), this is much better than guessing by chance at which properties would catch on fire.
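To make the time-based approach concrete, here is a minimal Python sketch (our actual model was built in R with an SVM). Everything in it is an assumption for illustration: the record fields (`year`, `age`, `fire`), the toy data, the stand-in classifier `fit_threshold`, and the choice to bootstrap the training rows. It only shows the shape of the procedure: train on earlier periods, test on the last one, repeat over 10 bootstrapped samples, and average the true positive rate.

```python
import random
from statistics import mean

def temporal_split(records, cutoff_year):
    """Train on records before cutoff_year, test on records from cutoff_year on."""
    train = [r for r in records if r["year"] < cutoff_year]
    test = [r for r in records if r["year"] >= cutoff_year]
    return train, test

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def recall(y_true, y_pred):
    """Share of actual fires (1s) that were also predicted as fires."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

def fit_threshold(train):
    """Hypothetical stand-in classifier (NOT an SVM): flag any building at
    least as old as the youngest building that burned in the training rows."""
    ages = [r["age"] for r in train if r["fire"] == 1]
    threshold = min(ages) if ages else float("inf")
    return lambda r: 1 if r["age"] >= threshold else 0

def time_based_validation(records, cutoff_year, fit, n_boot=10, seed=0):
    """Average recall on the held-out period over n_boot bootstrapped fits."""
    rng = random.Random(seed)
    train, test = temporal_split(records, cutoff_year)
    y_true = [r["fire"] for r in test]
    scores = []
    for _ in range(n_boot):
        model = fit(bootstrap_sample(train, rng))
        y_pred = [model(r) for r in test]
        scores.append(recall(y_true, y_pred))
    return mean(scores)

# Made-up toy records, for illustration only.
records = [
    {"year": 2011, "age": 60, "fire": 1},
    {"year": 2012, "age": 10, "fire": 0},
    {"year": 2013, "age": 55, "fire": 1},
    {"year": 2013, "age": 20, "fire": 0},
    {"year": 2014, "age": 70, "fire": 1},
    {"year": 2014, "age": 15, "fire": 0},
]
print(time_based_validation(records, cutoff_year=2014, fit=fit_threshold))
```

The key design point is that no information from the final period leaks into training, which is what lets the test stand in for "looking into the future."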
We also validated our model using 10-fold cross validation, a more standard machine learning validation approach. This model also did quite well, with an average accuracy of 0.78 and average AUC of 0.73. Here is a confusion matrix of the results:
Figure 2: Confusion matrix for 10-fold cross-validation approach.
In this validation, the model correctly identified 67.56% of the properties that had a fire. This means that for every 10 fires, our model would have predicted almost 7 of them.
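For readers unfamiliar with 10-fold cross-validation, the sketch below shows the general mechanics in Python. It is not our R code: the helper names, the `fire` field, and the toy majority-vote classifier are all hypothetical. The data is shuffled and dealt into 10 folds; each fold is held out once while the model is fit on the other nine, and the 10 scores are averaged.

```python
import random

def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle row indices and deal them into k folds of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(rows, fit, score, k=10, seed=0):
    """Hold out each fold once, fit on the rest, and average the scores."""
    folds = k_fold_indices(len(rows), k, seed)
    results = []
    for held_out in folds:
        held = set(held_out)
        train = [rows[i] for i in range(len(rows)) if i not in held]
        test = [rows[i] for i in held_out]
        model = fit(train)
        results.append(score(model, test))
    return sum(results) / len(results)

def fit_majority(train):
    """Hypothetical stand-in classifier: always predict the majority label."""
    labels = [r["fire"] for r in train]
    majority = 1 if sum(labels) * 2 >= len(labels) else 0
    return lambda r: majority

def accuracy(model, test):
    """Share of held-out rows whose label the model predicts correctly."""
    return sum(1 for r in test if model(r) == r["fire"]) / len(test)
```

A call would look like `cross_validate(rows, fit_majority, accuracy)`, where `rows` is a list of dictionaries with a `fire` label and at least 10 entries so that no fold is empty.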
It is worth briefly discussing the implications of the false positives in this model. In both validation approaches, we had a substantial number of false positives, that is, properties that our model predicted would have a fire but that did not actually have one. Though many predictive models try to maximize specificity (the ratio of true negatives to all actual negatives) by increasing true negatives and reducing false positives, in the context of determining which properties to inspect, false positives are actually quite valuable. False positives represent properties that share many characteristics with the properties that did catch on fire. Because they have these characteristics, these properties may be at high risk of catching on fire and should be inspected by AFRD. Additionally, because in a sense our training set and the data set that we ultimately apply the model to are the same (that is, the list of commercial properties in Atlanta), a perfect model with no false positives would do nothing more than tell us which buildings had previously caught on fire. While this is useful to know, it is data AFRD already has. False positives give us the added value of identifying properties that have not caught on fire but are at risk of fire due to their characteristics.
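The metrics discussed here all come from the four cells of a confusion matrix. The short Python sketch below shows how they relate. The counts are made-up round numbers chosen for easy arithmetic; they are not the values from Figure 1 or Figure 2.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute common metrics from confusion-matrix counts.

    tp: predicted fire, had fire      fp: predicted fire, no fire
    tn: predicted no fire, no fire    fn: predicted no fire, had fire
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": tp / (tp + fn),        # true positive rate
        "specificity": tn / (tn + fp),   # true negatives / all actual negatives
        "precision": tp / (tp + fp),     # share of flagged properties that burned
    }

# Illustrative counts only (100 hypothetical properties, 10 with fires).
metrics = confusion_metrics(tp=7, fp=20, tn=70, fn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy example the model catches 7 of 10 fires, but most of the 27 flagged properties did not burn. For an inspection list, those 20 extra properties are the ones that look like fire-prone buildings without yet having had a fire.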
We want to give the caveat that this particular model is not necessarily the best fit of the data. Although we tried many other algorithms and configurations of factors and found this model to be the most predictive, further experimentation would undoubtedly yield a more predictive model. We encourage AFRD or others to build upon our methods to improve the model if they wish.
Applying the Predictive Model to Potential Inspections
After we built the predictive model, we applied it to the list of current and potential inspections so that AFRD could prioritize inspections to focus on properties most at risk of fire. To do this, we first computed the raw output of the prediction model on this list of properties. This generated a score between 0 and 1 for each property (see Figure 3 below). To make these scores more useful, we translated them to a 1-10 scale. Then we divided the scaled scores into low risk (1), medium risk (2-5), and high risk (6-10).
Figure 3: Transforming model output to risk scores.
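As a rough illustration of this transformation, here is a small Python sketch. The low, medium, and high cutoffs are the ones described above, but the exact mapping from the raw 0 to 1 output to the 1-10 scale is an assumption: the sketch simply uses ten equal-width bins, which may differ from what we actually did. The function names are hypothetical.

```python
import math

def to_risk_score(raw):
    """Map a raw model output in [0, 1] to an integer risk score from 1 to 10.

    Assumption: ten equal-width bins, so [0.0, 0.1) -> 1, [0.1, 0.2) -> 2,
    and so on, with 1.0 included in the top bin.
    """
    if not 0.0 <= raw <= 1.0:
        raise ValueError("raw model output must be between 0 and 1")
    return min(10, math.floor(raw * 10) + 1)

def risk_category(score):
    """Group a 1-10 risk score into low (1), medium (2-5), or high (6-10)."""
    if score == 1:
        return "low"
    if score <= 5:
        return "medium"
    return "high"

for raw in (0.05, 0.45, 0.95):
    score = to_risk_score(raw)
    print(raw, score, risk_category(score))
```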
We then applied these risk scores to the list of current and potential properties to inspect, and included them on the interactive map.
As a result of this work, AFRD will be able to focus their inspection efforts on those commercial properties in Atlanta that are most at risk of fire. We hope that this focused inspection will result in fewer fires, fewer fire-related injuries, and fewer fire-related deaths in Atlanta.
Thanks for following our blog posts this summer! It’s been a pleasure to work with Dr. Matt Hinds-Aldrich and the rest of our contacts at AFRD. Please feel free to contact me at email@example.com with any questions about this blog post or the project in general.
– Oliver Haimson