Stylometry (see Wikipedia definition)
is the study of the unique linguistic styles and writing behaviors of individuals in order to determine authorship. Stylometry can be used to attribute authorship to anonymous or
disputed documents, and it has legal as well as academic and literary applications.
Stylometry has been used to determine, or at least narrow down, the authorship of historic documents,
ransom notes, and other documents of forensic interest.
Stylometry uses statistical analysis, pattern recognition, and artificial intelligence techniques.
Typical stylometric features include word frequencies and patterns
in the use of common parts of speech.
A framework paper and
MIT Thesis describe some existing systems.
This is a continuation of previous projects; see the associated Research Day 2010 paper entitled
Stylometry System - Use Cases and Feasibility Study
and last semester's technical paper.
Biometric systems consist of data collection, feature extraction, and pattern classification.
Here, the data are simply plaintext files,
and you will use the text data that corresponds to the similarly captured keystroke data.
The design of the stylometry features is based on the following criteria:
- ratios of counts likely to characterize an individual's stylometric preferences
- the categories and many features described in the above "framework paper"
- the character-based features of the keystroke system
(for stylometry, however,
these are character and digram frequency statistics rather than key-press duration and transition statistics)
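The character, digram, and ratio features named in the criteria above can be sketched roughly as follows. This is a minimal illustration, not the project's feature extraction program: the function names, the restriction to alphabetic characters, and the three particular ratios are all assumptions chosen for the example.

```python
from collections import Counter


def char_frequencies(text):
    """Relative frequency of each alphabetic character (case-folded)."""
    letters = [c for c in text.lower() if c.isalpha()]
    total = len(letters)
    counts = Counter(letters)
    return {c: n / total for c, n in counts.items()} if total else {}


def digram_frequencies(text):
    """Relative frequency of adjacent letter pairs within each word."""
    pairs = []
    for word in text.lower().split():
        letters = [c for c in word if c.isalpha()]
        pairs.extend(a + b for a, b in zip(letters, letters[1:]))
    total = len(pairs)
    counts = Counter(pairs)
    return {p: n / total for p, n in counts.items()} if total else {}


def count_ratios(text):
    """A few ratio-of-counts features (hypothetical choices for illustration)."""
    words = text.split()
    n_chars = len(text)
    n_words = len(words)
    if n_words == 0 or n_chars == 0:
        return {"letters_per_word": 0.0,
                "uppercase_per_char": 0.0,
                "unique_words_per_word": 0.0}
    return {
        "letters_per_word": sum(c.isalpha() for c in text) / n_words,
        "uppercase_per_char": sum(c.isupper() for c in text) / n_chars,
        "unique_words_per_word": len({w.lower() for w in words}) / n_words,
    }


if __name__ == "__main__":
    sample = "The cat sat. The cat ran."
    print(char_frequencies(sample))
    print(digram_frequencies(sample))
    print(count_ratios(sample))
```

Ratios and relative frequencies are used here instead of raw counts so that samples of different lengths remain comparable.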
The first major objective of this work is to obtain authentication accuracy of the stylometry system
and compare it to that of the keystroke system.
The second major objective is to combine the keystroke and stylometry systems so that the
combined system is more accurate than either the standalone keystroke system or the standalone stylometry system.
Last semester, Vinnie Monaco developed the data capture and feature extraction programs for this work.
The data capturing program captures the text portion of the keystroke data collected by the keystroke project
so that we are working on the same underlying data.
We also use a generic Feature Data Format so that we can:
- use the generic nearest-neighbor classifier used in the keystroke work
- easily combine the keystroke and stylometry feature data to obtain combined stylometry and keystroke performance
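To illustrate why a shared feature format helps, here is a minimal sketch of combining two feature vectors and classifying with a plain 1-nearest-neighbor rule. It is not the project's classifier or its Feature Data Format: the list-of-floats representation, the Euclidean distance, the function names, and the example values are all assumptions made for the example.

```python
import math


def combine(keystroke_vec, stylometry_vec):
    """Concatenate the two feature vectors extracted from the same sample."""
    return list(keystroke_vec) + list(stylometry_vec)


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(query, labeled_vectors):
    """Return the label of the training vector closest to the query.

    labeled_vectors is a list of (label, vector) pairs.
    """
    label, _ = min(labeled_vectors, key=lambda item: euclidean(query, item[1]))
    return label


if __name__ == "__main__":
    # Made-up values, for illustration only.
    training = [
        ("subject_a", combine([0.10, 0.20], [0.50])),
        ("subject_b", combine([0.90, 0.80], [0.10])),
    ]
    query = combine([0.15, 0.25], [0.45])
    print(nearest_neighbor(query, training))
```

Because both feature sets are plain numeric vectors, the same classifier code runs unchanged on keystroke-only, stylometry-only, or combined vectors. In practice the features would likely need to be normalized to a common scale before concatenation so that neither set dominates the distance.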
Before we can use these programs, however, Vinnie will revise the data capturing program to use the more accurate
This semester's team, together with the Keystroke team,
will collect new data samples from a population of about 50 subjects: 10 samples from each subject,
in two sets of five samples recorded at least two weeks apart.
You will also learn how to run the programs (code) of the system
and run various experiments.
Fast Agile XP Deliverables
We will use the agile methodology,
particularly Extreme Programming (XP), which involves small releases and fast turnarounds in roughly two-week iterations.
Many of these deliverables can be done in parallel by different members or subsets of the team.
The following is the current list of deliverables
(ordered by the date initiated, deliverable modifications marked in red,
initiated date marked in bold red if programming involved,
completion date and related comments marked in green,
pseudo-code marked in blue):
- 2/1 (one week) Review the work from last semester.
Work with your instructor, customer John Stewart, and Vinnie Monaco to plan the work for the semester.
- 2/1 (ongoing for the semester).
Collect data samples together with the Keystroke team.
- 2/1 (1-2 weeks).
Work with Vinnie Monaco to learn how to run the various programs of the system.