Conference Agenda

Overview and details of the sessions of this conference, with abstracts and downloads where available.

Session Overview
#SK5: Quantitative Textual Analysis Paper Session 2
Friday, 26/Jul/2019:
1:40pm - 2:40pm

Session Chair: Patrick Juola
Location: Salons 4 & 5, Grand Ballroom, Marriott City Center


Named Entity Recognition in the Humanities: The Herodotos Project

Brian Daniel Joseph1, Christopher Gerard Brown1, Micha Elsner1, Marie-Catherine de Marneffe1, Alexander Erdmann1,2, James Wolfe1, Colleen Kron1, William Little1, Andrew Kessler1, Petra Ajaka1, Benjamin Allen1, Yukun Feng1, Morgan Amonett1, Amber Huskey1, Charles Woodrum1

1The Ohio State University, United States of America; 2New York University Abu Dhabi

Solutions to the computational linguistic problem known as Named Entity Recognition (NER) have much to offer humanists. In this presentation, we offer a “proof of concept” in support of this assertion by introducing the Herodotos Project and its development and use of NER systems for Classical languages.

NER is a machine-learning technique that trains on large amounts of text in which humans have manually, accurately, and consistently identified the named entities of interest. Trained NER systems learn to generalize, automatically identifying and distinguishing the annotated classes of named entities in other, similar texts. While high-quality NER systems are available for high-resource languages like English and for domains like newswire, no such systems are readily available for Classical Latin or Greek, both low-resource languages, or for the range of text types available in them.
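As an illustration of the kind of annotation such systems train on, the following is a minimal sketch of BIO-style entity tagging and of how labeled spans are recovered from it. The tag names (GRP for group, PRS for person) and the example clause are illustrative assumptions, not the project's actual annotation scheme or data:

```python
# Minimal sketch of BIO-style NER annotation: each token carries a tag
# marking the Beginning of, Inside of, or Outside (O) any named entity.
# Tag names (GRP = group, PRS = person) are illustrative only.

def extract_entities(tagged_tokens):
    """Collect (text, label) spans from a BIO-tagged token sequence."""
    entities, current, label = [], [], None
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            if current:  # close the previous entity span
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)  # continue the open entity span
        else:
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:  # flush a span that runs to the end of the sequence
        entities.append((" ".join(current), label))
    return entities

# A hand-annotated Latin clause: "Helvetii legatos ad Caesarem mittunt."
tagged = [
    ("Helvetii", "B-GRP"), ("legatos", "O"), ("ad", "O"),
    ("Caesarem", "B-PRS"), ("mittunt", "O"),
]
print(extract_entities(tagged))  # [('Helvetii', 'GRP'), ('Caesarem', 'PRS')]
```

A trained system predicts such tags automatically; the manual annotation described above provides the gold-standard sequences it learns from.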

The Herodotos Project is an ethnohistory project that aims first to develop a catalogue of the ancient peoples named in classical sources (groups, tribes, empires, etc.), and then to compile all known information about them and to map and describe their interactions. In this context, we have developed Latin-based and Greek-based NER systems, and with them we have been able to identify automatically, with a high degree of accuracy (over 90%), the names of ancient groups of peoples in classical texts, as well as other relevant named entities, specifically persons and places.

We demonstrate our systems here and discuss the value that their development can offer to humanistic research. First and foremost, we outline how they can be generalized to apply to other languages and to other entities of interest, e.g. dates or quantities. Second, since training our NER systems has involved annotating considerable amounts of Latin and Greek text, we detail other uses to which our annotated Classical texts can be put. Finally, we discuss several issues in the manual annotation process regarding the identification and distinction of person, group, and place names. These issues hold implications of wider interest for research into onomastics, which we see as a basic humanistic enterprise given the importance of names and naming (think of historical figures, their often eponymous achievements or political movements, literary characters, etc.) to much of the humanities.

Hacking Multi-word Named Entity Recognition on HathiTrust Extracted Features Data

Patrick J Burns

New York University, United States of America

Multi-word named entity recognition (NER), the automated process of extracting from texts the names of people, places, and other lexical objects consisting of more than one word, is a core natural language processing (NLP) task. Current popular methods for NER used in Digital Humanities projects, such as those included in the Natural Language Toolkit or Stanford CoreNLP, use statistical models trained on sequential text, meaning that the context created by adjacent words in a text determines whether a word is tagged as an entity or as belonging to a named entity (NE). This short paper looks at a situation where statistical models cannot be used because sequential text is unavailable, and suggests a “hack” for approximating multi-word NER under this constraint. Specifically, it looks at attempts to extract multi-word NEs from the HathiTrust Extracted Features (HTEF) dataset. The HTEF dataset provides page-level token counts for nearly 16 million volumes in the HathiTrust collection in an effort to provide “non-consumptive” access to book contents. That is, HTEF data is provided in a pseudo-random manner, as a scrambled list of token counts rather than as words in a consecutive, readable order. Accordingly, statistical NER methods cannot be used with HTEF data.

In recent projects involving HTEF data, I have had initial success with extracting multi-word NEs with a different approach, namely by 1. creating permutations of page-level tokens provided in the HTEF that are likely to be NE constituents; and 2. querying these permutations against an open knowledge base, specifically Wikidata, in order to determine if they are valid NEs. The method, implemented in Python, can be summarized with the following example: 1. given the phrase “New York and New Jersey”, we construct a dictionary with the following word counts—{‘and’: 1, ’Jersey’: 1, ’New’: 2, ‘York’: 1}; 2. taking all of the permutations of potential NE constituents, here defined by capital letters, we construct the following list—[’Jersey New’, ‘Jersey York’, ‘New Jersey’, ‘New York’, ‘York Jersey’, ‘York New’]; and, lastly, 3. querying Wikidata for all items in this list, we return only positive matches as valid entities, i.e. the following list—[‘New Jersey’, ‘New York’]. This method does not account for all possible multi-word NEs (for example, because of the capitalization constraint in step 2, “City of New York” would not be considered due to the lowercase ‘of’), but nevertheless represents a novel solution for performing a core NLP task on a pseudo-random text collection. Moreover, it represents a good example of using a general knowledge base, like Wikidata, as a validation mechanism in NLP tasks. Lastly, it represents an attempt to push the boundaries of what can be derived from the HTEF dataset, while also respecting its non-consumptive nature, i.e. it is only trying to extract information from the dataset based on the likelihood of adjacent tokens, not to reconstruct entire sentences, paragraphs, or pages. With these points considered, the paper should be a useful model for other text-focused Digital Humanities projects looking to extend NLP methods to non-consumptive text collections or similarly challenging text-as-data datasets.
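The three-step method summarized above can be sketched as follows. This is an assumed reconstruction, not the author's code: the real implementation queries the live Wikidata API as its validation step, which is replaced here by a stub lookup set so the example is self-contained:

```python
from itertools import permutations

def candidate_entities(token_counts, knowledge_base, max_len=2):
    """Approximate multi-word NER over a bag of page-level token counts.

    1. Keep tokens that look like NE constituents (capitalized).
    2. Generate ordered permutations of those tokens.
    3. Keep only the permutations found in the knowledge base.
    """
    # Step 1: capitalization as a cheap proxy for NE constituents.
    constituents = [t for t in token_counts if t[:1].isupper()]
    # Step 2: ordered permutations of the candidate constituents.
    candidates = {" ".join(p) for n in range(2, max_len + 1)
                  for p in permutations(constituents, n)}
    # Step 3: validate candidates against the knowledge base.
    return sorted(c for c in candidates if c in knowledge_base)

# Stub standing in for a Wikidata query (the paper's actual validator).
KB = {"New Jersey", "New York", "Jersey City"}
counts = {"and": 1, "Jersey": 1, "New": 2, "York": 1}
print(candidate_entities(counts, KB))  # ['New Jersey', 'New York']
```

Note that because the token bag carries no word order, every ordering of the constituents must be generated and tested, which is why invalid candidates like ‘York New’ are produced and then filtered out by the knowledge-base lookup.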

Analyzing the Effectiveness of Using Character N-grams to Perform Authorship Attribution in the English Language

David Gerald Berdik

Duquesne University, United States of America

Authorship attribution is a subfield of natural language processing that can be applied to practical issues such as copyright disputes. While there are many different methods that can be used to perform such an analysis, the effectiveness of these methods varies depending on the material that is being analyzed as well as the parameters chosen for the selected methods. One of these methods involves using groups of n consecutive characters, called n-grams, where n refers to the number of characters in the gram. It is expected that n-grams of different lengths will vary in their accuracy of attributing the correct author to a questioned document. Specifically, it is expected that as n-grams become larger, performance will improve, reach a peak, and then begin to degrade.

Using Patrick Juola’s Java Graphical Authorship Attribution Program (JGAAP), we performed an analysis on Koppel and Schler’s blog corpus by taking all corpus entries with at least 300 sentences, separating their first 100 sentences and last 100 sentences into separate entries, and running character n-gram tests for n from 1 to 50 to determine what an ideal size would be for performing authorship attribution using character n-grams. Based on the results of testing, we showed that, contrary to the bell-curve-like performance that was anticipated, n-gram accuracy peaks fairly early before beginning its decline. Future work will include character n-gram analysis on different languages to determine how much variance, if any, is present between languages.
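The character n-gram features tested here can be extracted in a few lines. This sketch is not JGAAP's implementation; it simply counts overlapping character sequences of length n, the raw feature an attribution method would then compare across candidate authors:

```python
from collections import Counter

def char_ngrams(text, n):
    """Count overlapping character n-grams of length n in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Trigrams (n = 3) over a short string:
print(char_ngrams("banana", 3))
# 'ana' occurs twice; 'ban' and 'nan' once each.
```

A longer text of length L yields L - n + 1 overlapping grams, so as n grows, individual grams become rarer and profiles sparser, one plausible reason accuracy eventually degrades for large n.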

Conference: ACH 2019