Weekly Happenings: This week we met with our advisor twice to clarify the data we need. To prepare, we scoped out the objectives and potential issues as much as possible and presented our ideas in a short report. I’ll share just the first few sections with you to give a sense of what we’re doing. On the fun side, Daniel Garcia and I went to the Jazz Festival and had a good time. We also met up with a group from CRUISE (plus random people on the field) and played soccer for a solid two hours! Finally, we found a decent coffee maker down on the first floor of Klaus, and I felt like writing some (bad) poetry.
Below is an intro to the problem and some of the challenges we face.
Introduction: The objective of this project is to study large datasets of network logs and use them to identify patterns in people’s movements. Our goal is to better understand migration patterns, space utilization, and WiFi usage trends in order to allow for better resource allocation and planning. One way to determine how people make use of the available spaces is to look at the times and places at which they log in to a wireless network, in conjunction with external data sources (e.g., demographics and weather).
Challenges: There are several challenges concerning the use of WiFi data. In many situations, several access points are within range of a device. The device therefore has several options for connecting to the network and chooses only one (its address is then assigned dynamically via DHCP). Devices do not necessarily search actively for the closest access point. Furthermore, access points can have different signal strengths, so a stronger access point may attract more devices, including devices that are physically closer to a different access point. All of these considerations lead to discrepancies between a device’s actual location and the log data indicating the access point it is connected to. For example, if a device connects to the nearest access point and then moves, the original connection may be maintained even though closer access points are now available. Some places also have sparse coverage, so a connection may be established with an access point on another floor or in another building.
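To make the "sticky connection" example concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the linear signal model, the roaming threshold, and the names `signal`, `logged_aps`, `ap_a`, and `ap_b` are hypothetical and do not come from our data or from any particular device. It only shows how the access point that ends up in the log can differ from the nearest one.

```python
# Toy model of a device that keeps its current access point (AP)
# until the signal drops below a roaming threshold.
# All numbers are illustrative assumptions, not measurements.

def signal(device_x, ap_x):
    # Assumed model: signal falls off linearly with distance.
    return -40.0 - 2.0 * abs(device_x - ap_x)

def nearest_ap(device_x, aps):
    # The AP that is physically closest to the device.
    return min(aps, key=lambda name: abs(device_x - aps[name]))

def logged_aps(path, aps, roam_threshold=-70.0):
    # The AP the device would appear under in the logs at each step.
    current = None
    log = []
    for x in path:
        # Only look for a new AP when there is no connection yet
        # or the current signal is below the threshold.
        if current is None or signal(x, aps[current]) < roam_threshold:
            current = max(aps, key=lambda name: signal(x, aps[name]))
        log.append(current)
    return log

# Hypothetical hallway with two APs and a device walking from one to the other.
aps = {"ap_a": 0.0, "ap_b": 20.0}
path = [0.0, 5.0, 12.0, 14.0, 16.0, 20.0]

logged = logged_aps(path, aps)
nearest = [nearest_ap(x, aps) for x in path]
mismatches = [x for x, l, n in zip(path, logged, nearest) if l != n]
print(logged)
print(nearest)
print(mismatches)
```

In this toy run the device is still logged at `ap_a` at positions 12 and 14, even though `ap_b` is closer there, which is the kind of discrepancy described above.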
Another challenge is to develop a heuristic for linking the number of people in a location with the number of WiFi-connected devices. Overestimates occur when a person has multiple devices, and underestimates occur when a person has no device or is connected to an access point further away. To examine this issue, we performed three validation studies of the hotspot data, each time recording the number of people in a location over a period of two hours. We found that the recorded number of devices was much lower than the number of people. This by no means provides us with a heuristic; it only highlights the care that must be exercised when interpreting the number of WiFi connections in an area.
Additionally, the data itself is clouded by several artifacts of the protocols used to authenticate and maintain wireless connections. As a result, there are many duplicate entries with the same timestamp and MAC address, which artificially inflates the size of the dataset. As noted above, it is also common to have devices in multiple locations associated with the same user ID at the same time, which adds to the challenge of associating people with devices.
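As a rough sketch of how these two artifacts could be handled, the Python below drops entries that repeat a (timestamp, MAC address) pair and flags user IDs seen at more than one access point at the same time. The record layout, the sample rows, and the function names `drop_duplicates` and `users_in_multiple_places` are made up for illustration; the real logs have a different format and this is not our actual pipeline.

```python
from collections import defaultdict

# Hypothetical records: (timestamp, mac_address, user_id, access_point).
records = [
    ("t1", "aa:bb:cc:00:00:01", "user1", "building-a-ap1"),
    ("t1", "aa:bb:cc:00:00:01", "user1", "building-a-ap1"),  # duplicate entry
    ("t1", "aa:bb:cc:00:00:02", "user1", "building-b-ap4"),  # same user, second device
    ("t2", "aa:bb:cc:00:00:03", "user2", "building-a-ap2"),
]

def drop_duplicates(rows):
    # Keep only the first row for each (timestamp, MAC address) pair.
    seen = set()
    unique = []
    for row in rows:
        key = (row[0], row[1])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def users_in_multiple_places(rows):
    # Map (timestamp, user_id) to the set of access points seen,
    # keeping only the cases with more than one access point.
    places = defaultdict(set)
    for timestamp, _mac, user_id, access_point in rows:
        places[(timestamp, user_id)].add(access_point)
    return {key: aps for key, aps in places.items() if len(aps) > 1}

unique_records = drop_duplicates(records)
conflicts = users_in_multiple_places(unique_records)
print(len(records), len(unique_records))
print(conflicts)
```

Flagging the conflicts rather than deleting them is a deliberate choice in this sketch: a user ID in two places at once may simply be one person with two devices, so the rows are still informative.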
Much WiFi data there was,
So from the third of Klaus,
Three set out to analyze it much.
Never daunted never sloth (not really)
Too much data there was to see,
So set out the three,
To find better ways to analyze it.
Places been and places going,
All is possible from data flowing,
But be wary of false conclusions,
Correlation does not causation mean!