Predicting college choice based on academic and extracurricular interests

Luke Olney / LukeOlney2016@u.northwestern.edu / Professor Douglas Downey's EECS 349 (Machine Learning) class at Northwestern University

Introduction

This project is devoted to the following question: how do students' interests affect which colleges and universities they apply to, and which they get into? This is a variant on a topic familiar to those applying to college: a number of online tools are available for predicting undergraduate admissions success. The high availability of data, and the high dependence of admissions decisions on certain numerical quantities (SAT score, GPA, class rank, etc.), make this a natural fit for a machine learning task. In particular, the phenomenon of ‘decision threads’ ^3 on the CollegeConfidential.com message boards provides large quantities of semi-structured data on the factors that go into college admission, both quantitative and qualitative. As near mirrors of students’ college applications, however, these threads provide more information than what is necessary to get admitted into a particular university. They provide a full record of personal information like extracurriculars, job experience, volunteer service, and summer activities, which tell a great deal about

That colleges possess certain “personalities” is a common theme in popular and social media, inspiring guides and quizzes in publications from US News ^4 to Buzzfeed ^5. One possible approach to this task is to retrieve publicly shared information from social networks like Facebook, and existing personality research has commonly taken that approach ^2. The advantage of the CollegeConfidential data is that it clearly delineates categories that are relevant to a student’s reasons for applying to a college, which may be difficult to distinguish in the data that Facebook's Graph API supplies.

Though my results show that a student's interests can't predict with any certainty where they will apply, the most informative words nevertheless show interesting patterns. I also confirmed that it is possible to predict, from interests alone, whether a student will be admitted to a school in a particular athletic conference.

Methods and Results

All data was collected from so-called "results" threads on the message board of CollegeConfidential.com, where students who applied to a particular university post whether they were accepted or rejected, along with information that typically goes on a college application. I prioritized collection from early decision or early action threads, since these applicants have made a greater commitment to their school and therefore provide a stronger signal for the school's personality. This partially makes up for the lack of data on which school a person chooses among those that accepted them. Data reported here is sometimes in a regular format, but adherence to the posted format is often low, and the format varies between threads for different universities in different years. My scraper attempted to collect text that belonged to one of the 'interests' categories that were often defined in the suggested format, including: 'Summer Activities', 'Volunteer/Community service', 'Extracurriculars', and 'Job/Work Experience.'
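The category extraction described above might look roughly like the sketch below. This is not the scraper used for the project: the function name extract_interests, the "Label: text" line layout, and the example post are all assumptions made for illustration, and real threads deviate from any single format.

```python
import re

# Category labels taken from the list in the text; the matching rules
# below are an assumption, since real posts vary in spelling and layout.
INTEREST_LABELS = [
    "Summer Activities",
    "Volunteer/Community service",
    "Extracurriculars",
    "Job/Work Experience",
]


def extract_interests(post_text):
    """Return {label: text} for each interest category found in a post.

    A line counts as a match if it starts with a known label (case is
    ignored), optionally followed by a parenthetical note, then a colon.
    Labels with nothing after the colon are skipped.
    """
    found = {}
    for line in post_text.splitlines():
        for label in INTEREST_LABELS:
            pattern = r"^\s*" + re.escape(label) + r"[^:]*:\s*(.*)$"
            match = re.match(pattern, line, flags=re.IGNORECASE)
            if match and match.group(1).strip():
                found[label] = match.group(1).strip()
    return found


# Hypothetical post, loosely following a suggested thread format.
example_post = (
    "Decision: Accepted\n"
    "SAT: 2200\n"
    "Extracurriculars (name, grade levels): Debate captain, orchestra\n"
    "Job/work experience: Lab intern\n"
    "Summer Activities:\n"
)
interests = extract_interests(example_post)
```

Lines that do not begin with one of the four labels (such as the decision and SAT lines in the example) are ignored, which is one simple way to cope with posts that only partly follow the format.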

Data came from schools belonging to the Big-10, NESCAC, and Ivy League athletic conferences. I collected responses from a total of 643 users over two years (2014-2015). There was also substantial cross-posting between these groups: ~21% of users whose responses I collected posted to more than one group. That portion of users who didn’t cross-post is still useful to consider because their post certainly provides a signal for which college they choose, even if it detracts from my original purpose.

I considered both the task of predicting whether students applied to a school in a particular athletic conference and the task of predicting whether they were accepted, given that they applied; the latter is mainly a control to gauge the effectiveness of the first. I began with a straightforward bag-of-words representation using unigrams, bigrams, and trigrams of the input text. I tried this with and without tf-idf weighting, which controls for the influence of words that appear frequently in general. Trigrams proved useful in capturing the names of certain clubs or organizations, such as 'Habitat for Humanity.' I used an 80-20 split of the collected data for training and validation, and trained Naive Bayes and logistic regression classifiers. The conditional independence assumptions of Naive Bayes are particularly apt here, since applicants usually present their activities in a laundry list format (though often with extensive explanation).
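As an illustration of the representation and classifier described above, here is a minimal pure-Python sketch of n-gram extraction, an 80-20 split, and a multinomial Naive Bayes classifier with add-one smoothing. It is not the project's code: the names ngrams, split_80_20, and NaiveBayes, the whitespace tokenization, and the toy training examples are assumptions, and tf-idf weighting and logistic regression are left out for brevity.

```python
import math
import random
from collections import Counter, defaultdict


def ngrams(text, max_n=3):
    """Return all word n-grams of length 1..max_n from lowercased text."""
    words = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


def split_80_20(examples, seed=0):
    """Shuffle (text, label) pairs and split them 80% train, 20% validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


class NaiveBayes:
    """Multinomial Naive Bayes over n-gram counts with add-one smoothing."""

    def fit(self, examples):
        self.label_counts = Counter()
        self.gram_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in examples:
            self.label_counts[label] += 1
            for gram in ngrams(text):
                self.gram_counts[label][gram] += 1
                self.vocab.add(gram)
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.gram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of per-n-gram log likelihoods.
            score = math.log(doc_count / total_docs)
            for gram in ngrams(text):
                if gram in self.vocab:  # skip n-grams unseen in training
                    score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data; labels stand in for conference names.
toy_examples = [
    ("debate team captain", "NESCAC"),
    ("art club debate", "NESCAC"),
    ("startup founder intern", "Ivy League"),
    ("research lab intern founder", "Ivy League"),
]
model = NaiveBayes().fit(toy_examples)
prediction = model.predict("debate and art")
```

Calling ngrams("Habitat for Humanity") yields the three unigrams, two bigrams, and the full trigram "habitat for humanity", which is how a multi-word organization name can become a single feature.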

Accuracy does not vary significantly between classifiers, although it is much better on the 'accepted' task than on the 'applied' task. Accuracy was higher for Ivy League 'accepted' predictions, suggesting that a person's activities are relatively more useful for gaining acceptance there, and lower among Big-10 schools.

Although hypothesis testing shows that the predictions are better than chance, the 'accepted' prediction task is much more effective than the 'applied.' Nonetheless, the most informative features tell an interesting story, showing that in aggregate there are certain features that define the applicant pool of each group of schools. In particular, the words that most inform whether a person will apply to an Ivy League school include 'founder' and 'intern,' which indicate career ambition, and also appear often in the . The words that best predict application to NESCAC schools include activities like 'debate' and 'art.' Words that are highly predictive for 'accepted' don't show this kind of variety; the interests that are most useful for getting accepted seem to be so among all the schools considered.
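The text does not say which hypothesis test was used. One common way to check whether a classifier beats chance is a one-sided binomial test on the number of correct validation predictions; the sketch below assumes that choice. The function name binomial_p_value is hypothetical, and the chance level is a parameter because the right baseline depends on how the classes are balanced.

```python
from math import comb


def binomial_p_value(correct, total, chance=0.5):
    """One-sided p-value: P(X >= correct) for X ~ Binomial(total, chance).

    A small value means that getting at least `correct` predictions
    right out of `total` would be unlikely for a classifier that
    guesses correctly with probability `chance`.
    """
    return sum(
        comb(total, k) * chance ** k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical numbers for illustration only (not the project's results):
# 8 correct out of 10 against a 50% chance baseline.
p_value = binomial_p_value(8, 10)  # 56/1024 = 0.0546875
```

With a validation set this small, even 80% accuracy is not significant at the 5% level, which is why the size of the held-out set matters when comparing accuracies that are close to the baseline.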

Figure 1: Learning methods and accuracy

Method	Accuracy (applied)	Accuracy (accepted)
B	.54	.76
Naive Bayes	.54	.81
Naive Bayes w/ tf-idf weighting	.57	.85
Logistic Regression	.52	.79

Figure 2: Examples of informative words (for applied)

NESCAC	Ivy League	Big-10
debate	founder	captain
volunteer	intern	founder
music	lab	club

Figure 3: Examples of informative words (for accepted)

NESCAC	Ivy League	Big-10
founder	lab	state
lab	founder	intern
intern	intern	hospital
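The source does not state how the informative words in Figures 2 and 3 were ranked. One simple possibility, sketched here as an assumption, is to score each word by the smoothed log-ratio of its relative frequency in one group versus all other groups. The function name informative_words and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def informative_words(docs_by_group, target, top_k=3):
    """Rank words by how much more frequent they are in `target`.

    The score is log P(word | target) - log P(word | other groups),
    with add-one smoothing so that unseen words do not produce log(0).
    Ties are broken alphabetically.
    """
    target_counts = Counter()
    other_counts = Counter()
    for group, docs in docs_by_group.items():
        bucket = target_counts if group == target else other_counts
        for doc in docs:
            bucket.update(doc.lower().split())

    vocab = set(target_counts) | set(other_counts)
    target_total = sum(target_counts.values()) + len(vocab)
    other_total = sum(other_counts.values()) + len(vocab)

    def score(word):
        in_target = (target_counts[word] + 1) / target_total
        in_other = (other_counts[word] + 1) / other_total
        return math.log(in_target) - math.log(in_other)

    return sorted(vocab, key=lambda w: (-score(w), w))[:top_k]


# Toy, made-up documents; group names stand in for conferences.
toy_docs = {
    "NESCAC": ["debate debate art"],
    "Ivy League": ["founder intern founder"],
}
top_nescac = informative_words(toy_docs, "NESCAC", top_k=2)
```

A ranking of this kind compares one group against the rest, so a word such as 'founder' can rank highly for several groups on the 'accepted' task if it is common among accepted applicants everywhere.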

Future Work

A similar analysis could extend to other groupings of schools. Possibilities include a breakdown to individual schools or a comparison of different categories of schools (e.g., technical and liberal arts). Another sort of analysis might try to learn the semantics of different activities, grouping activities that are similar either in topic or in usefulness for gaining admission.