Stylometry (see Wikipedia definition)
is the study of the unique linguistic styles and writing behaviors of individuals in order to determine authorship. Stylometry can be used to attribute authorship to anonymous or
disputed documents, and it has legal as well as academic and literary applications.
Stylometry has been used to determine (or narrow the possibilities of) the authorship of historic documents,
ransom notes, and other documents of forensic interest.
Stylometry uses statistical analysis, pattern recognition, and artificial intelligence techniques.
The input data are simply plaintext files.
For features, stylometry typically analyzes the text by using word frequencies and identifying patterns
in common parts of speech.
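
As a rough illustration of the word-frequency features mentioned above, the following is a minimal sketch and not the PSBS feature extractor. The function name `word_frequencies`, the tokenizing regular expression, and the `top_n` parameter are assumptions made for this example; part-of-speech patterns are omitted because they would require a tagger.

```python
import re
from collections import Counter

def word_frequencies(text, top_n=5):
    """Return the top_n most common words with their relative frequencies.

    Hypothetical helper for illustration; tokenization is a simple
    lowercase letters-and-apostrophes match, which is an assumption.
    """
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter(words)
    # Relative frequency = count of the word / total number of words.
    return [(word, count / total) for word, count in counts.most_common(top_n)]

if __name__ == "__main__":
    print(word_frequencies("the cat and the dog and the bird", top_n=2))
```

Relative frequencies are used instead of raw counts so that samples of different lengths can be compared.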
A framework paper and
an MIT thesis describe some existing systems.
Last semester we developed a reasonably robust Pace University Stylometry Biometric System (PSBS),
and Vinnie Monaco is currently enlarging its feature set.
The design of the stylometry features is based on the following criteria:
- ratios of counts likely to characterize an individual's stylometric preferences
- the categories and many features described in the above "framework paper"
- the character-based features of the keystroke system
(for stylometry, however,
these are character and digram frequency statistics rather than key-press duration and transition statistics)
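
The character and digram frequency statistics in the last criterion could be computed along these lines. This is a sketch under stated assumptions, not the actual PSBS feature set: the function name is hypothetical, only alphabetic characters are counted, and digrams are taken as adjacent letter pairs that do not span spaces or punctuation.

```python
from collections import Counter

def char_and_digram_frequencies(text):
    """Return (char_freqs, digram_freqs) as dicts of relative frequencies.

    Assumptions for illustration: case is ignored, only letters count,
    and a digram is two adjacent letters in the original text.
    """
    chars = text.lower()
    char_counts = Counter(c for c in chars if c.isalpha())
    digram_counts = Counter(
        a + b for a, b in zip(chars, chars[1:]) if a.isalpha() and b.isalpha()
    )

    def normalize(counts):
        total = sum(counts.values())
        # An empty Counter yields an empty dict, so no division by zero occurs.
        return {key: value / total for key, value in counts.items()}

    return normalize(char_counts), normalize(digram_counts)

if __name__ == "__main__":
    char_freqs, digram_freqs = char_and_digram_frequencies("abab")
    print(char_freqs, digram_freqs)
```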
The 2008 federal Higher Education Opportunity Act requires institutions of higher learning
to make greater access control efforts to assure that students of record
are those actually accessing the systems and taking exams in online courses,
adopting identification technologies as they become more ubiquitous.
To meet these needs, keystroke and stylometry biometrics were investigated at Pace University
towards developing a robust system to authenticate (verify) online test takers.
The performance of the stylometry system on online tests, however, was rather poor, and simply fusing the keystroke and stylometry systems
by combining their features did not improve on the performance of the keystroke system alone.
This work has been described in
last semester's technical paper from Research Day 2011
and extended in the IJCB2011 paper
to be presented at the International Joint Conference on Biometrics in October 2011.
Everything related to last semester's stylometry project (user guides for input system and feature extractor, technical papers) is at
Revised data collected.
Because the stylometry results were rather poor last semester,
this project will focus solely on stylometry and on much longer text input with the aim of obtaining reasonable accuracy on the PSBS.
This semester we will first find books on the Internet, for example see
where you can cut-and-paste an HTML book into Notepad to get it into text form (.txt).
We will start with 30 authors and 10 writing samples from each author, 5 for training and 5 for testing.
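
The planned split (30 authors, 10 samples each, 5 for training and 5 for testing) can be sketched as follows. The dictionary layout and the name `split_samples` are assumptions for illustration; taking the first five samples for training is an arbitrary choice here, not a stated project decision.

```python
def split_samples(samples_by_author, n_train=5):
    """Split each author's samples into training and testing lists.

    samples_by_author: dict mapping an author name to a list of samples
    (hypothetical layout). The first n_train samples go to training.
    """
    train, test = {}, {}
    for author, samples in samples_by_author.items():
        train[author] = samples[:n_train]
        test[author] = samples[n_train:]
    return train, test

if __name__ == "__main__":
    # Placeholder file names, not real data.
    corpus = {f"author{a:02d}": [f"a{a:02d}_s{s}.txt" for s in range(10)]
              for a in range(30)}
    train, test = split_samples(corpus)
    print(sum(len(v) for v in train.values()), sum(len(v) for v in test.values()))
```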
Most of this semester's effort will be running experiments to obtain accuracy (e.g., Equal Error Rate) as a function of text length
or as a function of population size (number of authors).
We would like to run experiments with samples of different lengths, measured in words:
for example, the first 250 words of each of the 300 samples, the first 500 words of each sample, etc.
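
For reference, the Equal Error Rate is the operating point at which the false accept rate (FAR) equals the false reject rate (FRR). A minimal sketch of how it might be approximated from match scores is shown here. It assumes that a higher score means "more likely the same author" and that scores are available as two plain lists; neither assumption comes from the PSBS itself, and the function name is hypothetical.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping every observed score as a threshold.

    genuine:  scores for same-author comparisons (assumed higher = better match)
    impostor: scores for different-author comparisons
    Returns (eer, threshold) at the threshold where FAR and FRR are closest.
    """
    best = None
    for threshold in sorted(set(genuine) | set(impostor)):
        # FRR: fraction of genuine scores rejected (below the threshold).
        frr = sum(s < threshold for s in genuine) / len(genuine)
        # FAR: fraction of impostor scores accepted (at or above the threshold).
        far = sum(s >= threshold for s in impostor) / len(impostor)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]

if __name__ == "__main__":
    # Made-up scores for illustration only.
    print(equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))
```

Running this once per sample length (or per population size) would give one EER value per condition, which is the kind of curve the experiments aim to produce.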
We will likely
We have some long-text samples and expect to obtain more from DPS students and DPS graduates teaching at various institutions.
Fast Agile XP Deliverables
We will use the agile methodology,
particularly Extreme Programming (XP), which involves small releases and fast turnarounds in roughly two-week iterations.
Many of these deliverables can be done in parallel by different members or subsets of the team.
The following is the current list of deliverables
(ordered by the date initiated, deliverable modifications marked in red,
initiated date marked in bold red if programming involved,
completion date and related comments marked in green,
pseudo-code marked in blue):
Work with your customers to determine the type or types of books to use.
For example, do we want all the books to be of the same type, like science fiction or romance?
Or do we want books of a variety of types and from different time periods to facilitate distinguishing among them?
Then find 30 authors who have each written 10 books. This yields 300 books, half for training and half for testing PSBS.
Create a program that reads in a book text file and outputs samples of different lengths:
the first 250, 500, 750, 1000, 1500, 2000, 2500, 3000, 4000, 5000 words (10 different sample lengths).
- 9/22 (1-2 weeks)
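
One possible shape for the sample-extraction program is sketched here. It is a starting point under stated assumptions, not a required design: words are taken to be whitespace-separated tokens, the output file naming (`<book>_<n>.txt`) is hypothetical, and lengths longer than the book are skipped.

```python
from pathlib import Path

# The ten sample lengths listed in the deliverable.
LENGTHS = [250, 500, 750, 1000, 1500, 2000, 2500, 3000, 4000, 5000]

def first_n_words(text, n):
    """Return the first n whitespace-separated words of text as one string."""
    return " ".join(text.split()[:n])

def write_samples(book_path, out_dir, lengths=LENGTHS):
    """Write one truncated sample per length; return the paths written."""
    book_path = Path(book_path)
    text = book_path.read_text(encoding="utf-8", errors="replace")
    word_count = len(text.split())
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for n in lengths:
        if word_count < n:
            continue  # the book is too short for this sample length
        target = out_dir / f"{book_path.stem}_{n}.txt"  # hypothetical naming
        target.write_text(first_n_words(text, n), encoding="utf-8")
        written.append(target)
    return written
```

Usage would look like `write_samples("some_book.txt", "samples")`, where both names are placeholders.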
Experiment 1: using the 250-word samples, train PSBS on 5 samples from each of the 30 authors and
test PSBS on the other 5 samples from each of the 30 authors to obtain a Receiver Operating Characteristic (ROC) Curve.
Work with Vinnie Monaco to learn how to run the various programs of the system.
- 9/22 (1-2 weeks)
Run similar experiments using the other sample lengths.
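
The ROC curve requested in Experiment 1 is a set of (false accept rate, true accept rate) pairs, one per decision threshold. A minimal sketch of how such points might be computed from match scores is given here. The score lists, the convention that higher scores indicate the same author, and the name `roc_points` are all assumptions for illustration and do not describe how PSBS produces its output.

```python
def roc_points(genuine, impostor):
    """Return sorted (FAR, TAR) pairs, one per candidate threshold.

    genuine:  scores for same-author comparisons (assumed higher = better match)
    impostor: scores for different-author comparisons
    """
    points = set()
    for threshold in set(genuine) | set(impostor):
        # TAR: fraction of genuine scores accepted at this threshold.
        tar = sum(s >= threshold for s in genuine) / len(genuine)
        # FAR: fraction of impostor scores accepted at this threshold.
        far = sum(s >= threshold for s in impostor) / len(impostor)
        points.add((far, tar))
    points.add((0.0, 0.0))  # a threshold above every score accepts nothing
    return sorted(points)

if __name__ == "__main__":
    # Made-up scores for illustration only.
    for far, tar in roc_points([0.9, 0.8], [0.3, 0.1]):
        print(far, tar)
```

Plotting these pairs for each sample length on the same axes would make it easy to compare how performance changes with text length.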