We started out with just two pieces of information to model energy usage at Georgia Tech: a timestamp and the energy usage for every hour of the past three years. Below is a visualization of the raw data for one of the buildings. As the plot shows, we train all of our models on the first two years of data and reserve the last year to assess model performance.
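The chronological split described above can be sketched in a few lines. This is a minimal illustration, not our actual pipeline: the function name `split_by_time`, the cutoff date, and the readings are all made up for the example.

```python
from datetime import datetime, timedelta

def split_by_time(readings, cutoff):
    """Split (timestamp, kwh) pairs into rows before the cutoff and rows at or after it."""
    train = [r for r in readings if r[0] < cutoff]
    test = [r for r in readings if r[0] >= cutoff]
    return train, test

# Made-up hourly readings that straddle a hypothetical cutoff date.
start = datetime(2015, 12, 31, 22)
readings = [(start + timedelta(hours=h), 250.0 + h) for h in range(4)]

# Everything before the cutoff is used for training, the rest for evaluation.
train, test = split_by_time(readings, cutoff=datetime(2016, 1, 1))
```

Splitting by time rather than at random keeps the evaluation honest for this kind of data: the model is always tested on hours that come after everything it was trained on.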
Using the date and time of day alone, it is difficult to predict energy usage accurately. However, from the timestamp we were able to engineer several other features, including the month, day of the week, day of the year, hour of the day, and an indicator variable for whether or not the day is a holiday. Our best model using these variables as predictors was a Generalized Additive Model (GAM), which gave us an R-squared of 0.55 and an average error of about 24.6 kWh (8.6%). According to this model, the hour of day was the most important predictor of energy consumption at a given hour. This was not a bad baseline, but there was plenty of room for improvement.
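As a rough sketch of that feature engineering, the snippet below derives the five features from a single timestamp using only the standard library. The holiday set and the function name `time_features` are placeholders for illustration; the calendar we actually used is not shown here.

```python
from datetime import date, datetime

# Hypothetical holiday calendar, for illustration only.
HOLIDAYS = {date(2015, 1, 1), date(2015, 7, 4), date(2015, 12, 25)}

def time_features(ts, holidays=HOLIDAYS):
    """Derive calendar features from one hourly timestamp."""
    return {
        "month": ts.month,
        "day_of_week": ts.weekday(),            # Monday = 0, Sunday = 6
        "day_of_year": ts.timetuple().tm_yday,  # 1 through 365 (or 366)
        "hour": ts.hour,
        "is_holiday": int(ts.date() in holidays),
    }

features = time_features(datetime(2015, 7, 4, 14, 0))
```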
Incorporating External Data
We have since incorporated two external datasets into our model. We scraped weather data from Weather Underground, which provides real-time weather data down to the minute. From that, we were able to get the temperature, humidity, and other conditions at any point over the last few years. The other dataset is class schedule information from OSCAR, Georgia Tech’s online student portal. From this, we were able to determine the number of classes taking place (as well as the number of students enrolled in those classes) at any given point, in any given building, over the last three years. Our hypothesis was that there would be a strong, positive correlation between the number of classes taking place and energy consumption.
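To make the class schedule features concrete, here is a minimal sketch of counting the classes in session (and the students enrolled in them) for one building during one hour. The row layout, the sample schedule, and the function name `occupancy_features` are assumptions for the example and do not reflect the actual OSCAR export format.

```python
from datetime import datetime, timedelta

# Hypothetical schedule rows: (building, class start, class end, students enrolled).
SCHEDULE = [
    ("CULC", datetime(2016, 1, 11, 9, 0), datetime(2016, 1, 11, 10, 0), 120),
    ("CULC", datetime(2016, 1, 11, 9, 30), datetime(2016, 1, 11, 11, 0), 45),
    ("CULC", datetime(2016, 1, 11, 13, 0), datetime(2016, 1, 11, 14, 0), 60),
    ("Other Hall", datetime(2016, 1, 11, 9, 0), datetime(2016, 1, 11, 10, 0), 30),
]

def occupancy_features(schedule, building, hour_start):
    """Return (number of classes, enrolled students) overlapping the given hour."""
    hour_end = hour_start + timedelta(hours=1)
    in_session = [
        row for row in schedule
        if row[0] == building and row[1] < hour_end and row[2] > hour_start
    ]
    return len(in_session), sum(row[3] for row in in_session)

n_classes, n_students = occupancy_features(SCHEDULE, "CULC", datetime(2016, 1, 11, 9, 0))
```

A class counts toward an hour if it overlaps that hour at all, which is one of several reasonable conventions; counting only classes that cover the full hour would be another.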
Incorporating all of the above information into our models improves our results significantly. Our best overall model of energy usage so far is a weighted average of a GAM and gradient boosted decision trees. Both methods work fairly well on their own, but when we average their predictions, we get results that are superior to either method individually. This model gives us an R-squared of 0.75 and an average error of about 19.4 kWh (6.7%). Below is a graph of the predictions of our model superimposed over the actual data. Our model does reasonably well, but it tends to underpredict extreme values. Going forward, we will continue to try other methods and add more features in an effort to improve our predictions.
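The blending step and the two metrics can be sketched as follows. The numbers are toy values chosen so that the two models' errors partly cancel, the 0.5 weight is arbitrary, and treating "average error" as mean absolute error is an assumption made for this example.

```python
def blend(preds_a, preds_b, weight_a=0.5):
    """Weighted average of two models' predictions, element by element."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(preds_a, preds_b)]

def mean_abs_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """1 minus the ratio of residual variation to total variation."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

# Toy values in kWh, not real data.
actual = [100, 200, 300, 400]
gam_preds = [120, 190, 310, 380]
gbm_preds = [90, 220, 290, 410]

blended = blend(gam_preds, gbm_preds, weight_a=0.5)
for name, preds in [("GAM", gam_preds), ("GBM", gbm_preds), ("blend", blended)]:
    print(name, mean_abs_error(actual, preds), round(r_squared(actual, preds), 4))
```

Averaging helps most when the two models make different kinds of mistakes, which is plausible here because a GAM and boosted trees fit the data in very different ways. In practice the weight would be tuned on held-out data rather than fixed in advance.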
Interpreting Results
These models are not just useful for prediction. One useful result from tree-based algorithms like gradient boosting is the relative importance of each variable, which, roughly speaking, tells us how much of the variation in energy usage is explained by that variable. As we can see, the number of classes is by far the most important predictor of energy usage, followed by hour of the day, number of students, and day of the year.
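To give a feel for what a variable importance score measures, here is a sketch of a related, model-agnostic technique: permutation importance. Note that this is not the split-based importance that gradient boosting libraries typically report, and it is not how the ranking above was produced. The stand-in `toy_model`, its coefficients, and the data are invented for the example, and a fixed cyclic shift replaces the usual random shuffle so the output is reproducible.

```python
def mse(model, rows, targets):
    """Mean squared error of a prediction function over a list of feature dicts."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, feature):
    """Increase in MSE after scrambling one feature column.

    A cyclic shift by one row stands in for a random shuffle.
    """
    baseline = mse(model, rows, targets)
    values = [r[feature] for r in rows]
    shifted = values[1:] + values[:1]
    permuted = [{**r, feature: v} for r, v in zip(rows, shifted)]
    return mse(model, permuted, targets) - baseline

def toy_model(row):
    """Hypothetical stand-in for a fitted model; it ignores day_of_year."""
    return 50 + 1.4 * row["n_classes"] + 2.0 * row["hour"]

rows = [
    {"n_classes": 0, "hour": 8, "day_of_year": 10},
    {"n_classes": 5, "hour": 10, "day_of_year": 20},
    {"n_classes": 10, "hour": 12, "day_of_year": 30},
    {"n_classes": 2, "hour": 14, "day_of_year": 40},
]
targets = [toy_model(r) for r in rows]  # targets the toy model fits exactly

importances = {
    name: permutation_importance(toy_model, rows, targets, name)
    for name in ("n_classes", "hour", "day_of_year")
}
```

A feature the model never uses gets an importance of zero, because scrambling it changes no prediction. Features the model leans on heavily produce a large jump in error when scrambled.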
Using the GAM, we can make inferences about the precise nature of these relationships. For example, for every additional class being held in the Clough Undergraduate Learning Commons (CULC), we estimate that energy expenditure will increase by approximately 1.4 kWh, holding all other predictors constant. This kind of information can provide useful insights into how to improve energy efficiency on campus.
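One way to read the 1.4 kWh estimate is through the additive structure of the model. The form below is only a sketch: it assumes the number of classes enters as a linear term, and the particular smooth terms shown are illustrative rather than the exact specification we fit.

```latex
\mathbb{E}[y_t] = \beta_0 + f_1(\text{hour}_t) + f_2(\text{day of year}_t)
  + f_3(\text{temperature}_t) + \beta_c \,\text{classes}_t,
\qquad \hat{\beta}_c \approx 1.4 \text{ kWh per class}
```

Here y_t is the energy usage in hour t, each f_j is a smooth function estimated from the data, and the coefficient on classes is the expected change in usage for one additional class when every other term is held fixed.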
Going Forward
Currently, we are looking exclusively at the CULC. Later on, we will use what we learned from modeling the CULC to model other buildings on campus as well as the campus as a whole. That way, we can target the buildings that are most inefficient and most in need of an upgrade. However, since many buildings on campus don’t hold classes, the class schedule data will have limited utility for them. As a result, we plan to use the number of people connected to each building’s WiFi as a proxy for building occupancy.