Educational Data Mining
on H.S. Student Web Searches


Search engines have become a vital part of everyday life. Whether the need is informational, navigational, or transactional, most people turn to a search engine. This demand paved the way for the study of web query classification: labeling search queries with appropriate categories. The big three search engines (Google, Bing, and Yahoo) generate billions of dollars in ad revenue by classifying web queries in this way. Along with returning relevant search results, this classification lets search engine providers serve relevant online advertisements to users.

The number of mobile devices used in the K-12 learning space has been growing steadily. School-issued mobile devices are required by law to have web filters installed to block inappropriate content. These web filters log students' online activity. This research investigates student web query logs, which may provide valuable information on student learning with school-issued mobile devices [1]. Data mining techniques are used to perform an in-depth analysis of student web queries [2]. Subsequently, machine learning algorithms are used for the binary classification of these web queries as either school related or non-school related [3]. Lastly, a regression analysis is performed to test whether a correlation exists between these web queries and student grade point average (GPA) [4].
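The regression step can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each student is summarized by the share of their queries labeled school related, and that this share is compared with GPA. The predictor, the function names fit_line and pearson_r, and the sample values are all hypothetical and are not taken from the study.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Made-up illustration values, NOT data from the study.
    school_share = [0.2, 0.4, 0.5, 0.7, 0.9]  # share of school-related queries
    gpa = [2.1, 2.8, 2.9, 3.4, 3.8]           # grade point average
    slope, intercept = fit_line(school_share, gpa)
    print(f"GPA ~ {slope:.2f} * share + {intercept:.2f}")
    print(f"Pearson r = {pearson_r(school_share, gpa):.3f}")
```

A correlation coefficient near zero would suggest no linear relationship between the two variables. A coefficient near 1 or -1 would indicate a strong linear association, though not causation.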

Project Description

For this project, the dataset described below will be used. The original code for this project was written in R and Java. The classification algorithm was implemented in Java [3]. This code needs to be refactored by applying Object-Oriented Programming (OOP) concepts. New features will be added to the algorithm to improve classifier accuracy. The code also needs to be documented so that an API can be created. A local SQL database may be used to store the data temporarily so that the classifier can use it during the query enrichment process [3].
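One possible shape for such a refactoring is sketched below in Python for brevity (the project code itself is in R and Java). Every name here (EnrichmentStore, QueryClassifier, KeywordClassifier, the enrichment table and its columns) is hypothetical. The keyword matcher is only a placeholder standing in for the real classification algorithm, which is not described in this document. The sketch assumes that query enrichment means attaching extra text to a query before classifying it.

```python
import sqlite3
from abc import ABC, abstractmethod


class EnrichmentStore:
    """Temporary local SQL store for enrichment text (hypothetical schema)."""

    def __init__(self, path=":memory:"):
        # ":memory:" keeps the database in RAM; pass a file path to persist it.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS enrichment ("
            "query TEXT PRIMARY KEY, extra_text TEXT NOT NULL)"
        )

    def put(self, query, extra_text):
        """Insert or overwrite the enrichment text for a query."""
        self.conn.execute(
            "INSERT OR REPLACE INTO enrichment (query, extra_text) VALUES (?, ?)",
            (query, extra_text),
        )
        self.conn.commit()

    def get(self, query):
        """Return stored enrichment text, or an empty string if none exists."""
        row = self.conn.execute(
            "SELECT extra_text FROM enrichment WHERE query = ?", (query,)
        ).fetchone()
        return row[0] if row else ""


class QueryClassifier(ABC):
    """Common interface so classifiers can be swapped without changing callers."""

    @abstractmethod
    def is_school_related(self, query: str) -> bool:
        """Return True for school related, False for non-school related."""


class KeywordClassifier(QueryClassifier):
    """Placeholder classifier: matches enriched query words against keywords."""

    def __init__(self, keywords, store):
        self.keywords = {k.lower() for k in keywords}
        self.store = store

    def is_school_related(self, query: str) -> bool:
        # Enrich the raw query with any stored text before matching.
        enriched = f"{query} {self.store.get(query)}".lower()
        return any(word in self.keywords for word in enriched.split())


if __name__ == "__main__":
    store = EnrichmentStore()
    store.put("mitosis", "biology cell division")
    classifier = KeywordClassifier({"biology", "algebra"}, store)
    print(classifier.is_school_related("mitosis"))
```

Keeping storage and classification behind separate classes means a new feature or a different algorithm can be added as another QueryClassifier subclass, and the interface gives the API documentation a natural unit to describe.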

This project is a continuation of two previous semester projects:

  1. 2016 Fall Project Paper
  2. 2016 Spring Project Paper

Project Data

For this study, web query reports were extracted from web filter logs, anonymized, and saved as comma-separated values (CSV) files by an authorized school administrator. All the CSV files are then merged together. The raw data pulled from these logs will comprise approximately 1,000,000 online search queries (SQ).
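The merge step might look like the following minimal Python sketch. It assumes every CSV file starts with the same header row, which this document does not confirm. The function name merge_csv and the column names in the usage note are hypothetical.

```python
import csv


def merge_csv(sources, out):
    """Merge CSV streams that share one header; return the data rows written.

    sources: iterable of open text streams, each starting with a header row.
    out: writable text stream that receives the header once, then all rows.
    """
    writer = csv.writer(out)
    header = None
    rows_written = 0
    for src in sources:
        reader = csv.reader(src)
        file_header = next(reader, None)
        if file_header is None:
            continue  # skip completely empty inputs
        if header is None:
            header = file_header
            writer.writerow(header)
        elif file_header != header:
            # Assumption: differing headers signal an error rather than a remap.
            raise ValueError(f"header mismatch: {file_header} != {header}")
        for row in reader:
            writer.writerow(row)
            rows_written += 1
    return rows_written
```

With files on disk, each stream would be opened with newline="" as the csv module recommends, for example open("log_1.csv", newline=""), where log_1.csv is a placeholder file name. Using the csv module instead of concatenating raw lines keeps quoted fields intact, which matters for search queries that contain commas.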