Psycholinguistic Corpora

Basic course information


In traditional psycholinguistic studies, researchers create a list of isolated sentences that are often peculiar (e.g., double center-embedding: The mouse that the cat that the dog liked chased ran away.) and present them to experiment participants. As they read or listen to each sentence, participants are often asked to perform somewhat unnatural tasks, such as acceptability judgments ("Did you find the sentence to be acceptable?") or lexical decision tasks ("Did you see a word or a non-word?"). While these studies have contributed substantially to psycholinguistics, participants in them seem to be doing something quite different from everyday language processing: outside of labs, people do not read lists of isolated, peculiar sentences or judge whether a sentence is acceptable. It is therefore an open question how much of the findings from these traditional studies apply to natural, everyday language processing.

Recently, studies using psycholinguistic corpora have been attracting increasing attention. Psycholinguistic corpora are data sets that consist of natural texts or audio (e.g., story books) together with people's behavioral or neural responses, recorded as they simply read those texts or listen to the audio. Unlike in the traditional approach, both the stimuli and the task are closer to what people actually do outside of labs. Furthermore, these data sets are often publicly available, so researchers can analyze the data in whatever way they want without collecting data themselves.

In this seminar, we will discuss research papers using psycholinguistic corpora to learn (i) what psycholinguistic corpora can add to our understanding of human language processing (and what they cannot tell us), and (ii) how psycholinguistic corpora can be used. Since the central theme of the seminar is the approach using psycholinguistic corpora rather than any particular research question, we will discuss papers on various topics: parsing, acoustic/phonological representations, reference resolution, etc.

Course policies

Communication with Instructor

Names/Pronouns and Self-Identifications:

I recognize the importance of a diverse student body, and I am committed to fostering an inclusive and equitable classroom environment. I invite you, if you wish, to tell me how you want to be referred to in this class, both in terms of your name and your pronouns (he/him, she/her, they/them, etc.). Keep in mind that the pronouns someone uses are not necessarily indicative of their gender identity. Additionally, it is your choice whether to disclose how you identify in terms of your gender, race, class, sexuality, religion, and dis/ability, among all aspects of your identity (e.g., should it come up in classroom conversation about our experiences and perspectives); these identities should be self-identified, not presumed or imposed. I will do my best to address and refer to all students accordingly, and I ask you to do the same for all of your fellow students.


Overall structure

4 credits

75%: Presentation

15%: Discussion questions

10%: Active participation in discussion

7 credits

45%: Paper

40%: Presentation

10%: Discussion questions

5%: Active participation in discussion

See the evaluation guidelines for more details.

See the guidelines for the term paper here.

Reading list

NOTE: This list is tentative and will be updated in the coming couple of weeks. You are more than welcome to suggest a paper or a topic!

Prediction and Surprisal theory

Demberg & Keller (2008):

Boston et al. (2008):

Frank & Bod (2011):

Brennan & Hale (2019):

Smith & Levy (2013):

Brodbeck et al. (2022):

Heilbron et al. (2023):

Frequency vs Predictability

Futrell et al. (2021): (about the corpus itself)

Yan et al. (2018):

Shain (2019):

Semantic processing (Semantic similarity)

Broderick et al. (2018):


Dubey et al. (2013):

Jaffe et al. (2020):

Seminck (2020): (short review)

Acoustic/phonological representations

Gillis et al. (2021): (Not just sounds)

Khalighinejad et al. (2017):

Gwilliams et al. (2022):