Should language models be treated as models? If so, of what?
In this talk, I am going to argue that language models (including large language models such as OpenAI’s GPT series) should be treated as scientific models of external languages: languages understood as social objects, plausibly sets of linguistic conventions adopted by a community. In order to defend this view, I will first reject two related positions: the first claims that language models can be used as models of linguistic competence, while the second claims that language models are (merely) models of their training data.
Many within computational linguistics, and specifically the field of distributional semantics, are excited about the prospect of language model technology as a new form of scientific inquiry into language (Baroni, 2022; Lenci, 2008; Sahlgren, 2008; Westera & Boleda, 2019). Perhaps the most vocal recent proponent of this view is Piantadosi (2023), who claims that language model technology challenges some of the core claims of the generative linguistic tradition. Although they do not phrase it this way, the best interpretation of this view is that language models can be treated as models of linguistic competence, so that inspecting a model becomes a way of investigating linguistic competence. But the idea that deep learning neural networks could inform linguistic inquiry in this way has been criticized by Chomsky (Chomsky et al., 2023; Norvig, 2012) and others (Dupre, 2021; Veres, 2022). Roughly put, these critics worry that language models are blank-slate systems (i.e. they do not have the same innate restrictions as humans) that simulate speaker performance without emulating speaker competence. Although evidence emerging from the probing classifier literature attenuates the force of these criticisms, I will argue that the critics are nevertheless right.
Many who are skeptical of the possibility of language models providing linguistic insight have instead claimed that language models are merely models of their training data. After all, language models are constructed by training a system to predict new text given what has come previously, according to the distributional properties of the data it was trained on. This is the second position I will consider. A similar view can be found in Chiang’s (2023) suggestion that language models are best thought of as compressions of their training data, as a JPEG is of a higher-resolution image. This view also has an affinity with Kilgarriff’s (1997) famous claim that word meanings only exist relative to the statistical properties of corpora. However, I will argue against this position, for in assessing a language model we do not evaluate its success at the prediction task it was trained on, but set the model to work on new evaluation tasks. One of the striking findings about language model technology is that these systems perform so well across a wide range of natural language processing tasks. The nature of the evaluation tasks for such models, as well as their success in them, reveals that we are not holding models to a standard internal to the training corpus but are instead testing the extent to which they track something consistent across both their training and evaluation sets.
I will argue, then, that language models should be thought of as models of the external language understood as a social object: the E-language, in Chomsky’s (1986) terms. What language models are trained on is the actual activity of a language, where all instances across training and evaluation sets are taken to share the feature of being part of the wider language. Viewed through this lens, we can see the exciting possibility that language models bring, for they provide us with a way of exploring E-languages that was not available before. If E-languages are sets of social conventions, then they are undoubtedly highly complex objects, and if we acknowledge that any speaker’s cognizance of that set of conventions is going to be incomplete and imperfect, then access to such a complex object has previously looked fraught with difficulty. This is partly why Chomsky (1986) has taken there to be no point in positing E-languages. But now that we are able to construct models of an E-language, and in doing so bypass the cognitive domain in a way that was not possible before, we have a new and exciting way of investigating them. I will finish by drawing upon recent work in the philosophy of science on the use of deep learning models in scientific practice in order to further support the positive view defended here (Creel, 2020; Shech & Tamir, 2023; Sullivan, 2022, 2023).
References
Baroni, M. (2022). On the proper role of linguistically-oriented deep net analysis in linguistic theorizing (arXiv:2106.08694). arXiv. https://doi.org/10.48550/arXiv.2106.08694
Chiang, T. (2023). ChatGPT is a blurry JPEG of the web. The New Yorker. https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web
Chomsky, N. (1986). Knowledge of language: Its nature, origin, and use. Praeger.
Chomsky, N., Roberts, I., & Watumull, J. (2023, March 8). The false promise of ChatGPT. The New York Times. https://www.nytimes.com/2023/03/08/opinion/noam-chomsky-chatgpt-ai.html
Creel, K. A. (2020). Transparency in Complex Computational Systems. Philosophy of Science, 87(4), 568–589. https://doi.org/10.1086/709729
Dupre, G. (2021). (What) can deep learning contribute to theoretical linguistics? Minds and Machines, 31(4), 617–635. https://doi.org/10.1007/s11023-021-09571-w
Kilgarriff, A. (1997). I don’t believe in word senses. Computers and the Humanities, 31(2), 91–113. https://doi.org/10.1023/A:1000583911091
Lenci, A. (2008). Distributional semantics in linguistic and cognitive research. Italian Journal of Linguistics, 20(1), 32.
Norvig, P. (2012). Colorless green ideas learn furiously: Chomsky and the two cultures of statistical learning. Significance, 9(4), 30–33. https://doi.org/10.1111/j.1740-9713.2012.00590.x
Piantadosi, S. (2023). Modern language models refute Chomsky’s approach to language. LingBuzz. https://lingbuzz.net/lingbuzz/007180
Sahlgren, M. (2008). The distributional hypothesis. Italian Journal of Linguistics, 20(1), 33–53.
Shech, E., & Tamir, M. (2023). Understanding from Deep Learning Models in Context [Preprint]. https://philsci-archive.pitt.edu/21296/
Sullivan, E. (2022). Understanding from Machine Learning Models. The British Journal for the Philosophy of Science, 73(1), 109–133. https://doi.org/10.1093/bjps/axz035
Sullivan, E. (2023). Do Machine Learning Models Represent Their Targets? Philosophy of Science, 1–11. https://doi.org/10.1017/psa.2023.151
Veres, C. (2022). Large language models are not models of natural language: They are corpus models. IEEE Access, 10, 61970–61979. https://doi.org/10.1109/ACCESS.2022.3182505
Westera, M., & Boleda, G. (2019). Don’t blame distributional semantics if it can’t do entailment. IWCS. https://doi.org/10.18653/v1/W19-0410