“Because the computer said so!”
Can computational authorship analysis be trusted?
This study belongs to the domain of authorship analysis (AA), a discipline under the umbrella of forensic linguistics in which writing style is analysed as a means of authorship identification.
Owing to recent advances in natural language processing and machine learning, computational methods of AA are attracting more interest than traditional stylistic analysis by human experts. It may only be a matter of time before software assists, if not replaces, the forensic examiner. But can we trust its verdict? Existing computational methods of AA are criticised for their lack of theoretical motivation, black-box methodologies and controversial results, and many argue that they are ultimately unable to deliver viable forensic evidence.
The study replicates a popular algorithm of computational AA in order to open one of the existing black boxes. It takes a closer look at the so-called “bag-of-words” (BoW) approach, a word-distribution method used in the majority of AA models; evaluates the parameters on which the algorithm bases its conclusions; and offers detailed linguistic explanations for the statistical results these discriminators produce.
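For readers unfamiliar with the term, the general idea of a bag-of-words representation can be sketched in a few lines of Python. This is a minimal illustration only, not the algorithm replicated in the study: the tokenisation rule, the function names `bag_of_words` and `manhattan_distance`, and the choice of Manhattan distance as the comparison measure are all assumptions made for the example.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return the relative frequency of each word, ignoring word order.

    Tokenisation here is a deliberately naive assumption: lower-cased
    runs of ASCII letters and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: count / total for word, count in counts.items()}


def manhattan_distance(profile_a, profile_b):
    """Sum of absolute frequency differences over the joint vocabulary.

    A smaller value means the two word distributions are more alike.
    The distance measure is an illustrative choice, not the study's.
    """
    vocabulary = set(profile_a) | set(profile_b)
    return sum(
        abs(profile_a.get(word, 0.0) - profile_b.get(word, 0.0))
        for word in vocabulary
    )


if __name__ == "__main__":
    known = bag_of_words("I did not send the letter and I did not see it")
    disputed = bag_of_words("I never sent the letter and never saw it")
    print(manhattan_distance(known, disputed))
```

Because word order is discarded, two texts that use the same words in the same proportions yield identical profiles, which is one reason why the linguistic interpretation of such features needs separate justification.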
The framework behind the design of this study draws on multidimensional analysis – a multivariate analytical approach to linguistic variation. By building on the theory of systemic functional linguistics and variationist sociolinguistics, the study takes steps toward solving the existing problem of the theoretical validity of computational AA.
Copyright (c) 2019 Author and Journal of Language Works
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
The author/the authors hold the rights to articles presented in the journal. The author/the authors are granted the right to reproduce their article as they see fit, if they mention LWorks as the original publisher of the article.