Today marks the completion of the CDS program. Boy, does time fly.
It feels like just yesterday we met with our PI, Dr. Asensio, to go over initial designs of the database we constructed over the last 10 weeks. From the beginning, the objective of this project was to build a comprehensive database to help city officials and Georgia Tech evaluate the impacts of housing investment on utility consumption. The main challenge we faced was that city data were spread across many different departments and entities, many of which had different data entry practices. We also obtained much of the data from sources outside the city, such as the Census, NOAA, and a private real estate data company, since this information is not housed within Albany’s databases. Collecting these data turned out to be harder than expected, as each dataset posed its own problems related to access, standardization, or volume.
To wrangle all these disparate datasets into a workable structure, much of our work this summer focused on using automated processing methods to merge data and evaluate performance in ways that were not previously possible. This involved standardizing housing addresses within Albany (spelling, street suffixes, cardinal directions), geocoding all those addresses, parsing data from HUD reports, converting datasets to time series format, and then linking all of these datasets into a relational database structure. In the end, we were able to build a SQL Server database hosted on Azure that links information on utilities, taxes, each housing project, Census data at the block group and tract levels, weather, and real estate information. We used Python to clean and merge the data, ArcGIS for some spatial exploration, and RStudio for preliminary analysis. While we didn’t come away with many tangible insights to share with the city, we created the infrastructure necessary to transition into the analysis phase of the project. The data have come a long way, and we can’t thank everyone involved with the CDS program enough for giving us the opportunity to work with real data that will be used to make a significant impact.
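To give a flavor of the address standardization step, here is a minimal sketch in Python. It is not our actual script: the function name `standardize_address` and the two lookup tables are hypothetical and far shorter than what real city records need. The idea is simply to reduce different spellings of the same address to one string that can serve as a join key across datasets.

```python
import re

# Hypothetical, deliberately short lookup tables (assumptions for illustration).
SUFFIXES = {
    "STREET": "ST", "STR": "ST",
    "AVENUE": "AVE", "AV": "AVE",
    "ROAD": "RD", "DRIVE": "DR", "LANE": "LN",
    "BOULEVARD": "BLVD", "COURT": "CT",
}
DIRECTIONS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W"}


def standardize_address(raw: str) -> str:
    """Return an upper-cased address with abbreviated directions and suffix."""
    # Upper-case, replace punctuation with spaces, and split on whitespace.
    tokens = re.sub(r"[^\w\s]", " ", raw.upper()).split()
    result = []
    for i, tok in enumerate(tokens):
        if tok in DIRECTIONS:
            # Spelled-out cardinal directions become one-letter abbreviations.
            tok = DIRECTIONS[tok]
        elif i == len(tokens) - 1 and tok in SUFFIXES:
            # Only the final token is treated as the street suffix.
            tok = SUFFIXES[tok]
        result.append(tok)
    return " ".join(result)


# Two spellings of the same address collapse to the same key.
variants = ["123 north  Main Street.", "123 N. Main St"]
keys = {standardize_address(v) for v in variants}
print(keys)
```

This sketch has known gaps: a street actually named "North Street" would be abbreviated incorrectly, and a suffix followed by a trailing direction or unit number is left untouched. Handling cases like these is a large part of what makes address cleaning slow.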
We presented our work in its final form to ESRI and Albany city officials on Wednesday, ahead of the CDS end-of-program presentations that same day. They were excited to see the work we had done and were interested in scheduling a meeting with city officials. We can’t wait to hand the database off and see what kinds of stories will be told. All of our scripts are commented, our process has been documented, and we have constructed a visual schema and data dictionary for the database. This will allow the city to more easily maintain the database and add data in the future. Hopefully, the database will help the city make better-informed policy decisions and initiate conversations between the city and its citizens regarding energy efficiency, housing investment, and neighborhood blight.