Authorship Verification of Social Media Content
Authorship verification is often framed as the task of comparing multiple pieces of writing to determine whether they were produced by a single author,
for example comparing Shakespeare's known works with writings suspected to be his.
This project, however, addresses the setting in which there is no closed set of candidate authors, only a single suspect (the purported author),
and the challenge is to determine whether or not the suspect wrote the text.
Traditional machine learning algorithms such as support vector machines, decision trees,
and naïve Bayes have achieved high accuracy rates for this type of classification problem.
However, newer natural language processing algorithms, such as word2vec, may have the potential to improve accuracy rates even further.
This project aims to explore the effectiveness of using the word2vec algorithm for authorship verification on social media content
(e.g., Facebook posts, tweets, and microblogs).
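To make the single-suspect setting concrete, the following is a minimal sketch of one way word vectors could be used for verification. It is not the project's prescribed method: the approach (averaging word vectors into a profile and comparing by cosine similarity), the function names `average_vector`, `cosine`, and `verify`, the `threshold` value, and the two-dimensional `TOY_EMBEDDINGS` are all assumptions made up for illustration. In practice the embeddings would come from a trained word2vec model and the threshold would be tuned on held-out data.

```python
import math

def average_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def verify(known_posts, questioned_post, embeddings, threshold=0.8):
    """Accept the questioned post if it is close to the suspect's profile.

    The threshold is a hypothetical value, not a recommended setting.
    """
    profile_tokens = [t for post in known_posts for t in post.lower().split()]
    profile = average_vector(profile_tokens, embeddings)
    candidate = average_vector(questioned_post.lower().split(), embeddings)
    if profile is None or candidate is None:
        return False  # nothing to compare, so do not attribute the post
    return cosine(profile, candidate) >= threshold

# Toy embeddings, invented for illustration only.
TOY_EMBEDDINGS = {
    "lol": [0.9, 0.1], "omg": [0.8, 0.2], "haha": [0.85, 0.15],
    "regards": [0.1, 0.9], "sincerely": [0.2, 0.8], "thanks": [0.15, 0.85],
}

suspect_posts = ["lol omg", "haha lol"]
same_style = verify(suspect_posts, "omg haha", TOY_EMBEDDINGS)
other_style = verify(suspect_posts, "regards sincerely", TOY_EMBEDDINGS)
```

Because only one suspect is modeled, the decision reduces to a similarity threshold rather than a choice among candidate authors.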
- Implement word2vec using Python;
- Run a basic example to demonstrate how the algorithm works;
- Run a text classification experiment on an existing social media dataset (Facebook dataset for 30 users);
- Identify at least one other social media dataset and run a text classification experiment on it ("big data" set);
- Compare the results of the two experiments above.
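As a starting point for the first two items, here is a minimal sketch of the skip-gram variant of word2vec in plain Python. It makes several simplifying assumptions that are not part of the project description: it uses a full softmax over the vocabulary instead of negative sampling or hierarchical softmax, it does no subsampling of frequent words, and the function name `train_skipgram`, its default parameters, and the toy corpus are hypothetical. It is intended only to show how the algorithm works on a small example; a real experiment would more likely rely on an established implementation.

```python
import math
import random

def train_skipgram(sentences, dim=8, window=2, epochs=50, lr=0.05, seed=0):
    """Train skip-gram embeddings with a full softmax (illustrative only)."""
    rng = random.Random(seed)
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    size = len(vocab)
    # Input (word) vectors start small and random; output (context) vectors start at zero.
    w_in = [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in range(size)]
    w_out = [[0.0] * dim for _ in range(size)]

    # Collect (center, context) index pairs that fall inside the window.
    pairs = []
    for s in sentences:
        for i, center in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    pairs.append((index[center], index[s[j]]))

    for _ in range(epochs):
        rng.shuffle(pairs)
        for c, o in pairs:
            h = w_in[c]  # hidden layer is just the center word's vector
            scores = [sum(h[k] * w_out[j][k] for k in range(dim)) for j in range(size)]
            top = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(x - top) for x in scores]
            total = sum(exps)
            grad_h = [0.0] * dim
            for j in range(size):
                # Prediction error: softmax probability minus one-hot target.
                err = exps[j] / total - (1.0 if j == o else 0.0)
                for k in range(dim):
                    grad_h[k] += err * w_out[j][k]  # uses the pre-update output weight
                    w_out[j][k] -= lr * err * h[k]
            for k in range(dim):
                h[k] -= lr * grad_h[k]
    return {w: w_in[index[w]] for w in vocab}

# Toy corpus, invented for illustration only.
toy_corpus = [
    ["i", "like", "cats"],
    ["i", "like", "dogs"],
    ["you", "like", "cats"],
]
vectors = train_skipgram(toy_corpus, dim=4, epochs=20)
```

The returned dictionary maps each vocabulary word to its learned input vector. The full softmax costs time proportional to the vocabulary size for every training pair, which is why practical implementations replace it with approximations on large corpora.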
- Li, Jenny S., et al. "Authorship Authentication Using Short Messages from Social Networking Sites."
2014 IEEE 11th International Conference on e-Business Engineering (ICEBE). IEEE, 2014.
- Rong, Xin. "word2vec parameter learning explained." arXiv preprint arXiv:1411.2738 (2014).
This project is a continuation of last semester's project; see the
2016 Fall Project Paper.