PP-Fr-am - S2 -4: Parallel papers Friday morning
11:00am - 11:30am
Authorship attribution for short Chinese instant messages based on dependency grammar
Guangdong University of Foreign Studies, People's Republic of China
This study aims to identify possible discriminant syntactic features for forensic authorship attribution of short Chinese instant messages. Naturally produced messages by several authors were sampled from WeChat, an instant messaging application. The dependency relations between the words in each sentence were manually annotated, so that the annotated sentences form a treebank. Several stylometric features based on dependency structure were then tested with a number of classification algorithms. Some of these features proved to be useful discriminators in this case and may therefore be added to the pool of features available for forensic authorship attribution.
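As an illustration of what a dependency-based stylometric feature might look like, the sketch below computes mean dependency distance for one annotated sentence. The abstract does not say which features were tested, so the choice of this measure, the head-index encoding, and the example sentence are all assumptions made for illustration.

```python
from statistics import mean


def dependency_distances(heads):
    """Return the linear distance between each word and its head.

    `heads[i]` is the 1-based position of the head of word i+1;
    0 marks the root, which has no head and is skipped.
    (This encoding is an assumption, not the study's annotation format.)
    """
    return [abs(head - pos)
            for pos, head in enumerate(heads, start=1)
            if head != 0]


def mean_dependency_distance(heads):
    """Average head-dependent distance for one sentence (0.0 if no arcs)."""
    distances = dependency_distances(heads)
    return mean(distances) if distances else 0.0


# Hypothetical annotation of a four-word message: 我 喜欢 这个 电影
# 我 -> 喜欢 (2), 喜欢 -> root (0), 这个 -> 电影 (4), 电影 -> 喜欢 (2)
example_heads = [2, 0, 4, 2]
mdd = mean_dependency_distance(example_heads)
```

A per-author profile could then be built by averaging such values over all of an author's messages and passing them, alongside other features, to a classifier.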
11:30am - 12:00pm
Analysis of speaker and co-articulation effects based on sub-band cepstral variances in the Japanese vowels of 300 male speakers
Australian National University, Australia; J.P. French Associates, Forensic Speech & Acoustics Laboratory, UK
The longstanding aim of achieving robust forensic voice identification is hampered by a number of complex and intertwined factors of variability in the speech signal such as: (1) speaker differences; (2) co-articulation effects; (3) channel conditions; (4) elicitation styles. Here we study the effects of factors (1) and (2) using a 300-speaker corpus of the 5 Japanese vowels extracted from monosyllabic utterances with 10 phonetic contexts varying only in their preceding consonants. These citation-style utterances were recorded 4 times in one channel condition (microphone with 4-kHz bandwidth), and linear-prediction cepstral coefficients were automatically obtained from the steady-state frames of the vowel nuclei. Using two-way analysis of variance and a band-selective cepstral distance to generate the variances, speaker and context effects were then separated per vowel and in consecutive sub-bands selected across the full range of 4 kHz. The results show that context effects tend to be stronger in sub-bands corresponding to the lower-formant range, while those spanning the higher-formant range contain relatively more speaker influences. These findings are consistent with observations reported in previous studies based on formant frequencies, and therefore lend support to our proposition that the sub-band information contained in the cepstrum holds the capacity for acoustic-phonetic interpretability of within- and between-speaker variability.
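The separation of speaker and context effects described above can be illustrated with a minimal two-way decomposition of sums of squares. This is a simplified sketch, not the authors' procedure: it assumes one measurement per speaker-context cell (the study has four repetitions), and the table values stand in for a hypothetical per-sub-band measurement.

```python
def two_way_sums_of_squares(table):
    """Decompose a balanced speakers x contexts table into sums of squares.

    table[i][j] is a single measurement for speaker i in context j.
    Returns (ss_speaker, ss_context, ss_residual).
    """
    n_spk = len(table)
    n_ctx = len(table[0])
    grand = sum(sum(row) for row in table) / (n_spk * n_ctx)

    # Marginal means for each speaker (rows) and each context (columns).
    spk_means = [sum(row) / n_ctx for row in table]
    ctx_means = [sum(table[i][j] for i in range(n_spk)) / n_spk
                 for j in range(n_ctx)]

    ss_speaker = n_ctx * sum((m - grand) ** 2 for m in spk_means)
    ss_context = n_spk * sum((m - grand) ** 2 for m in ctx_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_residual = ss_total - ss_speaker - ss_context
    return ss_speaker, ss_context, ss_residual


# Made-up values for one sub-band: 3 speakers (rows) x 2 contexts (columns).
toy_table = [[1.0, 2.0],
             [3.0, 4.0],
             [5.0, 6.0]]
ss_speaker, ss_context, ss_residual = two_way_sums_of_squares(toy_table)
```

Repeating the decomposition for each sub-band and comparing the speaker and context terms would give the kind of band-by-band picture the abstract reports.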
12:00pm - 12:30pm
Large-scale authorship attribution with sociolinguistically dynamic data
Aston University, United Kingdom
This paper presents an evaluation of two approaches to large-scale authorship attribution. The data sets contain over 60 million posts (ca. 3 billion word tokens) contributed to online discussion boards by over one million registered members, which makes them significantly larger in terms of both the number of documents and authors than any other experimental collection to date. Importantly from a forensic linguistic perspective, the data sets are also highly interactive and dynamic, featuring hundreds of thousands of authors engaging in complex polylogic exchanges on a wide range of topics over several years. We believe such an experimental setup reduces some of the typical biases found in automated authorship attribution experiments which have used fairly static data (e.g. blog posts or emails).
The first approach reported is a K-Nearest Neighbours (KNN) algorithm which transforms text samples into query vectors and collects aggregated relevance scores of probable authors. The second approach is a FastText classifier (Joulin et al. 2016) utilising recent advances in natural language processing such as vector-based word representations obtained through neural network training. Depending on the number of test samples used for classification, our recall rate is 44 to 75 per cent at the 30th rank of the prediction lists. We discuss the implications of our findings for the notion of idiolect and, more widely, for internet-scale authorship attribution.
Joulin, A., Grave, E., Bojanowski, P. and Mikolov, T. (2016). ‘Bag of Tricks for Efficient Text Classification.’ arXiv preprint arXiv:1607.01759.
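The recall figure "at the 30th rank" quoted in the abstract can be read as the share of test cases whose true author appears among the top 30 candidates. The sketch below shows that calculation under this assumed definition; the function name and the toy data are hypothetical, and the abstract does not specify how the authors computed the measure.

```python
def recall_at_rank(ranked_predictions, true_authors, k):
    """Fraction of test cases whose true author is within the top-k candidates.

    ranked_predictions[i] is a list of candidate authors for test case i,
    ordered from most to least probable; true_authors[i] is the real author.
    """
    hits = sum(1 for ranked, truth in zip(ranked_predictions, true_authors)
               if truth in ranked[:k])
    return hits / len(true_authors)


# Toy prediction lists for three test samples (made-up author labels).
predictions = [["a", "b", "c"],
               ["b", "c", "a"],
               ["c", "a", "b"]]
truths = ["b", "a", "c"]

recall_top2 = recall_at_rank(predictions, truths, k=2)
recall_top3 = recall_at_rank(predictions, truths, k=3)
```

With over one million candidate authors, a ranked measure of this kind is more informative than top-1 accuracy, since it shows how far an analyst would need to read down the list.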
12:30pm - 1:00pm
‘To Whom This May Concern’: audience profiling in forensic (con)texts
Centre for Forensic Linguistics, Aston University
This paper demonstrates the utility of audience analysis in forensic casework, and how and why such analysis might be of use in real-world investigations. Using a real case as an exemplar, the paper demonstrates that we can move beyond profiling authors to considering who an author’s actual intended audience is and whether this is consistent with the language of the message. This is of particular use in fraud examinations, when linguistic data has been uncovered which investigators consider to ‘have something odd about it’. The analysis normally begins with authorship profiling, the emphasis being on the question “what kind of linguistic persona wrote this?”. However, another question is often just as relevant – “who is the intended audience for this communication?”
Audience analysis makes use of existing linguistic theories, in particular Bell’s (1984) Audience Design, Grice’s (1975) Maxims, and Sperber and Wilson’s (1986) Relevance Theory, and this paper demonstrates how such theories can be incorporated into a linguistic toolkit which can be applied to real data.
The data in this case consist of an exchange of emails between two financiers, flagged because an investigator felt there was something “odd” about the language. Closer analysis (in the form of audience analysis) indicated that the language of the content was inconsistent with the profile of the primary addressee, a finding which resulted in a change in direction of the investigation and the discovery of a transatlantic fraud in the making.