#SG5: Quantitative Textual Analysis Paper Session 1
Thursday, 25/Jul/2019:
2:00pm - 3:30pm

Session Chair: Heather Froehlich
Location: Salons 4 & 5, Grand Ballroom, Marriott City Center

Topic Modeling and Textual Analysis of American Scientific Journals, 1818 – 1922

Shawn Martin

Indiana University, United States of America

A distant reading of some of the most prominent American scientific publications of the nineteenth century reveals some very clear patterns. Applying LDA topic modeling and textual analysis methods to over one hundred years of the American Journal of Science (AJS), Proceedings of the American Association for the Advancement of Science (PAAAS), and the Journal of the American Chemical Society (JACS), published between 1818 and 1922, helps historians to understand how these journals reflect the larger social context of nineteenth-century American science.

Through much of the early nineteenth century, AJS served as a news source for American scientists; in the mid-nineteenth century it began to publish more original research in a variety of fields, and by the twentieth century it was dedicated almost entirely to geology. PAAAS was fairly similar to AJS, but by the twentieth century it was almost entirely dedicated to news of the society, and it discussed the theory and method of science somewhat more often than AJS. JACS combined elements of both AJS and PAAAS. Early in its publication, JACS published research, but it was not until the 1890s that JACS began to serve as a news source and a space for discussion of theory and method in chemistry.

Overall, this analysis shows that there was an increase in discussion of business and professional issues and a shift in the journals that scientists used to discuss these issues. This shift happened during a very specific period, 1870 – 1890, the very same time that specialized scientific societies, particularly the American Chemical Society, split from the more generalized American Association for the Advancement of Science. Therefore, at least in the United States, there is a clear shift in the professional identity of scientists in the nineteenth century from generalists to specialists, and topic modeling allows historians of science to clearly identify this evolution of professional identity.

Using HathiTrust to Explore Librarianship’s Past

Eric Novotny

Penn State University, United States of America

Using tools developed by HathiTrust, I have analyzed hundreds of volumes of library science journals published between 1876 and 1920, a formative period for American librarianship. In 1876 the American Library Association was founded and the Library Journal was established. Over the next several decades more specialized journals appeared. These early journals provided a forum for librarians to discuss common concerns.

My research explores a collection of library journals I created in the HathiTrust Digital Library. The collection includes national library publications such as the Library Journal as well as many state publications like the Bulletin of the Iowa Library Commission, and the Wisconsin Library Commission Bulletin. The collection expands the canon and provides historians a more complete picture of early library conversations taking place across the country.

I will share findings generated from the computer analysis and discuss how these results compare with studies conducted using more traditional methods. Additionally, in the spirit of the early library journals, which emphasized the exchange of practical information, I will discuss the process of creating a corpus for text analysis in HathiTrust and the challenges and opportunities afforded by the HathiTrust research tools.

Best Practices in Authorship Attribution in Greek

Sean Vinsick, David Berdik, Patrick Juola

Duquesne University, United States of America

Authorship attribution is a key task, not only in humanities scholarship, but also in applications such as the resolution of legal disputes. As with other forensic disciplines, accuracy is key to fairness and justice. Recent scholarship has focused substantial effort on finding the best and most accurate techniques for determining the author of a document, but much of this effort has been focused on English. Other languages, such as Greek, have not received as much attention.

Using the Java Graphical Authorship Attribution Program (JGAAP), we ran hundreds of thousands of experiments testing the ability to determine the correct author on a corpus of 50 Greek-language blogs. We performed hundreds of thousands of additional experiments testing the ability to attribute the correct author on a corpus of Greek-language tweets by 161 authors. Each experiment uses various combinations of canonicizers, event drivers, and distance measurements. With a leave-one-out analysis driver, a single document is held out from the training set and treated as being of unknown authorship; it is then compared against each of the known documents, and the candidate authors are ranked in order of similarity. This method lets us grade the correctness of each experiment across all authors by checking the held-out document's author against the highest-ranked author. After grading, we can quantitatively demonstrate which experiment options provide more accurate authorship attribution.
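The leave-one-out grading described above can be illustrated with a minimal Python sketch. This is not JGAAP and does not use its API: the canonicizer (lowercasing and whitespace normalization), the event driver (character trigrams), the distance (cosine), and all function names are assumptions chosen for illustration, standing in for one of the many combinations the abstract mentions.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Canonicize (lowercase, collapse whitespace), then count character n-grams."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_distance(a, b):
    """Cosine distance between two n-gram count vectors (1.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def leave_one_out_accuracy(docs, n=3):
    """docs is a list of (author, text) pairs.

    Each document is held out in turn, compared against every remaining
    document, and the authors are ranked by their closest known document.
    Returns the fraction of held-out documents whose top-ranked author
    matches the true author.
    """
    if not docs:
        return 0.0
    features = [(author, char_ngrams(text, n)) for author, text in docs]
    correct = 0
    for i, (true_author, held_out) in enumerate(features):
        best = {}  # author -> smallest distance to any of that author's known docs
        for j, (author, known) in enumerate(features):
            if j == i:
                continue  # the held-out document is never part of the known set
            d = cosine_distance(held_out, known)
            if author not in best or d < best[author]:
                best[author] = d
        ranking = sorted(best, key=best.get)
        if ranking and ranking[0] == true_author:
            correct += 1
    return correct / len(features)


# Toy corpus invented for this sketch; it is not data from the study.
TOY_DOCS = [
    ("A", "aaaa bbbb aaaa"),
    ("A", "aaaa aaaa bbbb"),
    ("B", "xxxx yyyy xxxx"),
    ("B", "yyyy xxxx xxxx"),
]
```

Calling `leave_one_out_accuracy(TOY_DOCS)` grades one such configuration; repeating the call with other feature extractors or distance functions is the kind of comparison the experiments above perform at much larger scale.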

Parallel Lines: Modeling Event Modality and the Possible Worlds of Fiction

Matthew Sims, David Bamman

UC Berkeley, United States of America

In his essay “The Art of Fiction,” Henry James harshly criticizes the English novelist Anthony Trollope. “[Trollope] admits,” James writes, “that the events he narrates have not really happened, and that he can give his narrative any turn the reader may like best. Such a betrayal of a sacred office seems to me, I confess, a terrible crime.” James is making a strong claim here for the sanctity of representational authenticity, for the novelist’s responsibility to truth. Truth, however, is a slippery concept, in narrative fiction as well as in life. Rather than trying to account for the verisimilitude of fiction, then, we propose a far more straightforward question: within the context of a novel, how might we go about determining those events that are depicted as actually occurring as opposed to all those events that could have occurred based on the expectations, assumptions, and imaginations of the characters and the narrator?

Although this may seem like a difficult problem, there is in fact a direct way to address it. Computational work in event detection in natural language processing (including datasets released under the ACE, ERE and TAC-KBP programs) represents events in part through their linguistic modality: specifically, whether an event is asserted as occurring (“He opened the door”) or not (“He wanted to open the door”; “He would be in trouble if he opened the door”). By adopting this theoretical framework, we can use a powerful form of representation for distinguishing between actual occurrences and possible occurrences (including beliefs, hypotheticals, commands, threats, desires, and promises).
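As a rough illustration of this representation, the Python sketch below stores each event mention with a flag for whether it is asserted as occurring, using the three example sentences above. The `EventMention` class, its field names, the choice of trigger words, and the `asserted_share` helper are hypothetical; they are not the annotation schema of ACE, ERE, or TAC-KBP, nor of the authors' dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventMention:
    sentence: str   # sentence containing the event
    trigger: str    # word taken to denote the event (an assumption of this sketch)
    asserted: bool  # True if the event is depicted as actually occurring


def asserted_share(mentions):
    """Fraction of event mentions that are asserted as occurring."""
    if not mentions:
        return 0.0
    return sum(m.asserted for m in mentions) / len(mentions)


# Labels follow the abstract's examples: only the first opening of the
# door is asserted; the desired and hypothetical openings are not.
EXAMPLES = [
    EventMention("He opened the door", "opened", True),
    EventMention("He wanted to open the door", "open", False),
    EventMention("He would be in trouble if he opened the door", "opened", False),
]
```

Computing `asserted_share` over the mentions in a whole text gives one simple quantity that could be compared across novels or genres.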

Our current work in this space involves annotating events in 200,000 words from 100 different literary texts (creating a new dataset with very different qualities than the news data examined in previous work). Using a neural model trained to distinguish event modality in this annotated dataset, we will discuss the empirical distinctions we find when applying the model to a large collection of novels.

We believe quantifying asserted and unasserted events is useful not only for comparing genres and charting historical shifts in novelistic events, but also for exploring how novels construct possible worlds in parallel to those specific worlds that are realized by the plot. In fact, for some novels, what may have happened or could have happened is arguably as compelling as what actually occurred, indicating in turn the richness of the novelistic imagination.

