Stylometry applies statistical analysis, pattern recognition, and artificial intelligence techniques to characterize an author's writing style. For features, it typically uses word frequencies and patterns in the use of common parts of speech. It may also use "rare pairs", which capture an individual's habits of word collocation. A paper and an MIT thesis describe existing systems. The paper uses a Java-based data mining package (C4.5, neural network, and SVM classifiers).
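As an illustration of the word-frequency features mentioned above, here is a minimal Python sketch. It is not the existing system's feature extractor: the function names, the tokenization rule, and the short `FUNCTION_WORDS` list are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical vocabulary of common function words (an assumption for this
# sketch; a real system would use a longer, carefully chosen list).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "i", "it"]


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequency_features(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each vocabulary word in the text.

    The result is a list of floats in the same order as `vocabulary`,
    so it can be used directly as a feature vector.
    """
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        # An empty text has no measurable style; return an all-zero vector.
        return [0.0] * len(vocabulary)
    counts = Counter(tokens)
    return [counts[word] / total for word in vocabulary]
```

For example, `word_frequency_features("The cat and the dog", ["the", "and", "cat"])` divides each count by the five tokens in the text. Using relative rather than raw frequencies keeps emails of different lengths comparable.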
This semester's project will have two focuses. First, we will become familiar with the system and collect additional data. The data will be plaintext emails, collected from as many participants (subjects) as possible. Each participant (including each team member) will write ten emails, each 100 to 200 words long and each on a different subject (e.g., what you like/don't like to eat, what you like/don't like to do, what type of schoolwork you like/dislike, etc.).
Second, and most importantly, we will format the feature-vector data so that other project teams, specifically the Biometric Authentication System and Data Mining Systems teams, can process it easily.
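One possible shape for such a shared format is a CSV file with one row per email. The sketch below is only an assumption about what the other teams might accept: the column names `subject_id` and `sample_id`, the six-decimal precision, and the function name are hypothetical and would need to be agreed with those teams.

```python
import csv
import io


def write_feature_vectors(rows, feature_names, out):
    """Write feature vectors to `out` as CSV, one row per email.

    rows          -- iterable of (subject_id, sample_id, features) tuples
    feature_names -- column names for the feature values
    out           -- any writable text file object
    """
    writer = csv.writer(out, lineterminator="\n")
    # Header row: two identifying columns followed by one column per feature.
    writer.writerow(["subject_id", "sample_id"] + list(feature_names))
    for subject_id, sample_id, features in rows:
        if len(features) != len(feature_names):
            raise ValueError("feature vector length does not match header")
        # Fixed precision keeps the file stable across platforms.
        writer.writerow([subject_id, sample_id] + [f"{v:.6f}" for v in features])


# Usage: write to an in-memory buffer (a real run would open a file instead).
buffer = io.StringIO()
write_feature_vectors([("s01", 1, [0.4, 0.2])], ["freq_the", "freq_and"], buffer)
```

A header row and a fixed column order let a consuming program check that it is reading the features it expects before processing any data.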
If time permits, we will also implement the nearest neighbor algorithm correctly, rerun the previous experiment, possibly improve the method of running experiments, and run a larger experiment that combines the new data with the old.
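For reference, a basic nearest neighbor classifier can be sketched in a few lines of Python. This is a generic 1-nearest-neighbor sketch, not the project's existing implementation; Euclidean distance is an assumption, since the distance measure is not specified here, and the function names and sample labels are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(query, training):
    """Return the label of the training vector closest to `query`.

    training -- non-empty list of (label, feature_vector) pairs
    """
    if not training:
        raise ValueError("training set is empty")
    # Pick the training sample with the smallest distance to the query.
    best_label, _ = min(training, key=lambda item: euclidean(query, item[1]))
    return best_label


# Usage with two hypothetical authors and two-dimensional feature vectors.
training = [("alice", [0.4, 0.2]), ("bob", [0.1, 0.5])]
predicted = nearest_neighbor([0.35, 0.25], training)
```

Here the query vector is closer to the first training sample, so `predicted` is `"alice"`. When rerunning an experiment on the collected emails, the sample being classified should be left out of the training set, since otherwise it would always match itself at distance zero.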