ChunkedHCs Algorithm for Authorship Verification Problems: Reddit Case Study

Anh Duc Le (Munster Technological University and Rigr AI); Justin McGuinness (Munster Technological University); Edward Dixon (Rigr AI)

DFRWS USA 2021

Abstract

Cybercrime can be associated with undisclosed social media accounts that are deliberately used to conduct unethical or illegal activities such as cyberbullying, fraudulent transactions, and human trafficking. The objective of this paper is to identify whether two social media accounts belong to the same person by examining the accounts’ writing, i.e. comments and posts. To that end, this preliminary study introduces a new algorithm, ChunkedHCs, designed specifically for the authorship verification task: deciding whether a pair of texts was written by the same person. Previous machine learning and deep learning approaches to authorship verification often involve complex feature selection or sophisticated preprocessing steps, owing to the complexity introduced by topic heterogeneity. These limitations motivate the search for a simpler yet more robust approach that offers competitive verification ability. ChunkedHCs is based on the Higher Criticism statistical test (Donoho & Jin, 2004) and the HC-based similarity algorithm (Kipnis, 2020a, 2020b; Kestemont et al., 2020). On Reddit users’ data, ChunkedHCs shows promising performance, with an accuracy of 0.94 and an F1 of 0.9381 for texts between 29,000 and 30,000 characters. It is speculated that the algorithm could also be applied to identify whether two accounts are used by the same person on other social media platforms such as Facebook, Twitter and even dark web forums. Various avenues of further research on ChunkedHCs are also proposed.

Downloads

ChunkedHCs Algorithm for Authorship Verification Problems: Reddit Case Study (Paper)

ChunkedHCs Algorithm for Authorship Verification Problems: Reddit Case Study (Slides)