Exercising the Numbers

A Northwestern University EECS 349 Project

The Problem

Northwestern University's largest gym is Henry Crown Sports Pavilion, hereafter referred to as HCSP. At times, HCSP can be very crowded, and certain types of equipment, such as treadmills or squat racks, may not be available depending on how many people are at the gym. It is frustrating to plan a workout, arrive at the gym, and find that all the necessary equipment is taken.

The Solution

Our goal was to build a machine learning model that estimates the number of people at the gym at a given time in the future. This would allow people to make a more informed decision about whether to go to the gym, based on their specific needs. We narrowed the main factors affecting the number of people at the gym down to the following:

After acquiring the above data, we wrote a Python script to compile it into a .csv file that Weka could interpret. We then entered the training/testing phase, where we iteratively found the best format for our .csv file.
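The compilation script itself is not reproduced here, but a minimal sketch of its general shape follows. The attribute names and example records are invented for illustration; the real script worked from the raw data described above.

```python
import csv

# Hypothetical raw records: attribute names and values here are
# assumptions, standing in for the data sources described above.
records = [
    {"day": "Monday", "time_slot": "17:00", "quarter_week": 5, "attendance": 112},
    {"day": "Tuesday", "time_slot": "09:30", "quarter_week": 5, "attendance": 57},
]

# Weka's CSVLoader infers attribute names from the header row,
# so writing a header makes the file directly loadable.
with open("hcsp_attendance.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["day", "time_slot", "quarter_week", "attendance"]
    )
    writer.writeheader()
    writer.writerows(records)
```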

Testing and Training

Our training/testing data was compiled into four different CSV files, differentiated in two ways.

Attendance was our class label, so the first distinction was between predicting a nominal, binned range of attendance values and predicting the actual numeric attendance.

The second distinction stems from the nature of our dataset: a large number of entries had 0 under the attendance label. We therefore also built datasets without those entries to see whether removing them would improve performance.
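As an illustration of this preprocessing, here is a minimal Python sketch that drops zero-attendance rows and bins the remainder into four nominal groups. The bin cutoffs and file names are assumptions; the report states the number of groups but not the boundaries.

```python
import csv

def bin_attendance(n):
    # Four nominal groups, matching the report; the cutoffs are invented.
    if n < 40:
        return "low"
    elif n < 80:
        return "medium"
    elif n < 120:
        return "high"
    return "very_high"

with open("hcsp_attendance.csv", newline="") as src, \
     open("hcsp_nominal_nonzero.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        n = int(row["attendance"])
        if n == 0:  # drop zero-attendance entries for the "nonzero" datasets
            continue
        row["attendance"] = bin_attendance(n)
        writer.writerow(row)
```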

We tested most of the algorithms we learned throughout the course using the Weka package. The ML methods included Decision Trees, Nearest Neighbor, Linear Regression, Multilayer Perceptron, and Logistic Regression.

We used 10-fold cross-validation to build and evaluate models with each of the aforementioned algorithms.
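These experiments were run through Weka itself. Purely as an illustration of the evaluation setup, the scikit-learn snippet below shows an equivalent 10-fold cross-validation comparison. The file and column names are assumptions, k-nearest-neighbors stands in for Weka's KStar (which scikit-learn does not implement), and Linear Regression is omitted since it applies only to the numeric-attendance variant.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

# File and column names are assumptions; nominal features are one-hot encoded
# so every classifier receives numeric inputs.
data = pd.read_csv("hcsp_nominal_nonzero.csv")
X = pd.get_dummies(data.drop(columns=["attendance"]))
y = data["attendance"]

classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Nearest Neighbor (stand-in for KStar)": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multilayer Perceptron": MLPClassifier(max_iter=500),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```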

Results

We can see from our graphs that removing examples where the attendance was zero generally decreased the 10-fold cross-validation accuracy of our algorithms. The best-performing algorithm was the nearest-neighbor variant KStar, which classified our data quite well, reaching an accuracy of 83.215% under 10-fold cross-validation. See the graph below for the accuracies of every algorithm tested on our data.

Analyzing our results, it makes sense that the KStar nearest-neighbor algorithm performs best on the nominal set: the data plays directly to the strengths of a nearest-neighbor approach. The number of patrons at HCSP during a given half-hour time slot is likely to be close to the number during a slot with similar attributes, so we can expect nearest neighbor to classify well, especially once the attendance numbers are binned into four groups.
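A toy sketch makes the intuition concrete. The features here (day-of-week index, hour of day) and all values are invented; this is simple 1-nearest-neighbor, not KStar's entropy-based distance.

```python
def one_nn(query, examples):
    # Return the label of the training example closest to the query vector.
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(examples, key=lambda ex: dist(ex[0], query))[1]

# Features are (day-of-week index, hour of day); values invented for clarity.
training = [
    ((0, 17.0), "high"),    # Monday 5:00 pm, peak time
    ((0, 9.5), "medium"),   # Monday 9:30 am
    ((5, 22.0), "low"),     # Saturday 10:00 pm
]

# A Monday 5:30 pm query sits closest to the Monday 5:00 pm slot.
print(one_nn((0, 17.5), training))  # -> high
```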

Future work for this project includes adding and removing attributes to improve the model, since better attribute selection could well increase validation accuracy. Given the accuracy of the KStar algorithm, it would also be worth implementing the model in a web-based application for use by Northwestern students.

Team Members

Yannick Mamudo
Samir Joshi
Matt Niemer
Caroline Grace Alexander