Stylometry (see Wikipedia definition)
is the study of the unique linguistic styles and writing behaviors of individuals in order to determine authorship. Stylometry can be used to attribute authorship to anonymous or
disputed documents, and it has legal as well as academic and literary applications.
Stylometry has been used to determine (or narrow the possibilities of) the authorship of historical documents
and, in forensics, of ransom notes and other documents.
Stylometry uses statistical analysis, pattern recognition, and artificial intelligence techniques.
For features, stylometry typically analyzes the text by using word frequencies and identifying patterns
in common parts of speech.
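As a rough illustration of the word-frequency idea, the sketch below computes the relative frequency of a few common function words in a text. The word list and the function name `word_frequencies` are assumptions chosen for this example, not part of any existing system, and real studies typically use much longer word lists.

```python
import re
from collections import Counter

# Hypothetical, deliberately short list of function words (an assumption
# for illustration; not taken from any particular stylometry system).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "for"]

def word_frequencies(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    return {w: counts[w] / total for w in FUNCTION_WORDS}

sample = "The cat sat on the mat, and the dog lay in the sun."
freqs = word_frequencies(sample)
print(freqs["the"])  # share of all words in the sample that are "the"
```

Two authors writing on the same topic often differ in how frequently they use such words, which is why they are a common starting point for a feature set.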
A paper and
an MIT thesis describe some existing systems.
A previous system was developed and tested; see the Stylometry
Technical Paper (fall 2007) and associated slides. Also see the Authentication Technical Paper (fall 2007) and associated slides.
A rather primitive feature extraction component was developed in C# (not of current value).
Since we will not be using the previously developed system, this is a new project rather than a continuation.
This project has two parts which can be done in parallel until completion of the first part.
Part 1 (two- to three-week duration)
Conduct a library and internet search to determine an interesting and unique application of stylometry for research.
Unique applications might include determining the age or gender of the author, verifying one's identity
in biometric applications (such as the identity of a student taking an online test), or determining email authorship.
A table enumerating all the possible applications of stylometry would be appropriate.
Unless otherwise determined, the focus this semester will be determining email authorship,
an area of forensic linguistics.
The DPS customer is particularly interested in creating stylometric profiles of a user
based on the user's social networking comments.
A profile from a networking site such as Facebook can be scanned for comments from a user, as
these comments are tagged with the author's name.
Emails from the same person can then be tested, with the existing system, against this profile for matching.
The software would scan HTML pages from a user profile, extract comments that follow the posting person's name,
and use these comments to build a stylometric profile of the user.
This stylometric profile could then be used to identify the authors of emails.
Emails and profile comments are both informal online forms of communication,
and the use of special characters may be similar in these two communication venues.
This special application might require the coding of unique stylometry features to capture such things
as the usage of chatroom shorthand.
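One way such application-specific features might look is sketched below: rates of chatroom shorthand and special characters, plus counts of repeated exclamation marks and simple emoticons. The shorthand list, the emoticon pattern, and the name `informal_features` are all assumptions made for illustration; a real feature set would be chosen from the collected data.

```python
import re

# Hypothetical shorthand vocabulary, for illustration only.
SHORTHAND = {"lol", "brb", "omg", "btw", "thx", "u", "r", "gr8"}

def informal_features(text):
    """Compute a few features aimed at informal online writing."""
    tokens = re.findall(r"[A-Za-z0-9']+", text.lower())
    n_tokens = len(tokens) or 1   # avoid division by zero
    n_chars = len(text) or 1
    shorthand_count = sum(1 for t in tokens if t in SHORTHAND)
    # Special characters: anything that is neither alphanumeric nor whitespace.
    special_count = sum(1 for c in text if not c.isalnum() and not c.isspace())
    return {
        "shorthand_rate": shorthand_count / n_tokens,
        "special_char_rate": special_count / n_chars,
        "exclamation_runs": len(re.findall(r"!{2,}", text)),
        "emoticons": len(re.findall(r"[:;]-?[)(DP]", text)),
    }

features = informal_features("omg thx!! c u l8r :) btw gr8 pic")
print(features)
```

Features of this kind would sit in the application-specific category, alongside the general features computed for any text.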
Build logical arguments for why such software would be both valuable and unique,
and develop use cases to support the arguments.
Part 2 (full semester duration)
Develop a powerful stylometry system.
First, read the existing literature and check for the possibility of available existing software.
The system and experimental setup will have several components.
- Data collection will provide input to the system and, in this case, simply involves the collection of plaintext files.
- A feature extraction component calculates such measurements as the average word length, letter frequencies,
etc. To create a powerful stylometry system, use as many features as possible.
Divide the features into two categories: general features and those more specific to this application.
For ease of processing we will use a special Feature Data Format.
- A pattern classification component will estimate authorship (a 1-of-n problem).
For simplicity and to demonstrate feasibility we will use the nearest neighbor classifier.
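The feature extraction and classification components above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the project's design: the feature vector holds only the average word length and the 26 relative letter frequencies, the distance is Euclidean, and the function names, author labels, and sample texts are invented for the example. It does not use the Feature Data Format, which is not specified here.

```python
import math
import re
import string

def extract_features(text):
    """Return [average word length] + 26 relative letter frequencies."""
    words = re.findall(r"[A-Za-z]+", text)
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    avg_word_len = sum(len(w) for w in words) / (len(words) or 1)
    n_letters = len(letters) or 1  # avoid division by zero
    letter_freqs = [letters.count(ch) / n_letters for ch in string.ascii_lowercase]
    return [avg_word_len] + letter_freqs

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(query, labeled_samples):
    """Return the author whose sample vector is closest to the query.

    labeled_samples is a list of (author, feature_vector) pairs.
    """
    author, _ = min(labeled_samples, key=lambda s: euclidean(query, s[1]))
    return author

# Made-up training texts for two hypothetical authors.
training = [
    ("alice", extract_features(
        "I shall endeavour to respond expeditiously regarding tomorrow's arrangements.")),
    ("bob", extract_features(
        "ok see you at the pub at six, do not be late")),
]

query = extract_features("fine, see you at six then, do not be late again")
predicted = nearest_neighbor(query, training)
print(predicted)
```

Note that the features here are unscaled, so the average word length dominates the distance. With many features on different scales, some normalization (for example, rescaling each feature to a common range) would likely be needed before the nearest neighbor comparison is meaningful.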