Background

Chomsky and others have very directly claimed that large language models (LLMs) are equally capable of learning possible and impossible human languages.

Intro

The authors provide extensive new experimental evidence to test the claim that LLMs are equally capable of learning possible and impossible languages in the human sense.

Arguably, the central challenge for such work is the fact that there is no agreed-upon way of distinguishing these two groups.

They do not feel positioned to assert such a definition themselves, so they instead offer some examples of impossible languages on a continuum of intuitive complexity.

What they find is that language models indeed struggle to learn impossible languages.

Experiment 1: Impossible Languages

Shuffle Languages, Reverse Languages, Hop Languages

Experiment 1: Impossible Languages

Pretraining dataset: BabyLM dataset

Model: GPT-2 small, trained from scratch

Evaluation set: 10k sentences from the BabyLM dataset

Experiment 2: Language Models Disprefer Counting Rules

Question:

They show that impossible languages are harder for GPT-2 to learn. However, perplexity is a coarse-grained metric of language learning, and the question remains: do language models learn natural grammatical structures better than impossible grammars?
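
To make the "coarse-grained" point concrete, here is a minimal sketch of perplexity computed from per-token log-probabilities. The function name and the toy probabilities are illustrative assumptions, not the authors' evaluation code. Because perplexity averages over every token, two models that behave very differently at individual positions can still receive the same score.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities assigned to each token.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Toy example: two hypothetical models scoring the same two-token sequence.
model_a = [math.log(0.5), math.log(0.5)]    # equally unsure about both tokens
model_b = [math.log(0.25), math.log(1.0)]   # unsure about one, certain about the other

ppl_a = perplexity(model_a)
ppl_b = perplexity(model_b)
```

Both toy models end up with the same perplexity even though their per-token behavior differs, which is why a targeted, position-level measure is useful as a follow-up.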

Experiment 2: Language Models Disprefer Counting Rules

Review:

NOHOP: All 3rd-person present tense verbs in the input sentence are lemmatized, and the sentence is tokenized. For each 3rd-person present tense verb, a special marker representing the verb’s number and tense is placed right after the lemmatized verb. Singular verbs are marked with a special token S, and plural verbs are marked with P. Like the other control languages, NOHOP has a pattern that is most similar to English.

TOKENHOP: Identical transformation to NOHOP, but the special number/tense markers are placed 4 tokens after the verb.

WORDHOP: Identical transformation to NOHOP and TOKENHOP, but the special number/tense markers are placed 4 words after the verb, skipping punctuation.
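
The three transformations above can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the input representation (a list of words, each a list of subword tokens, with verbs already lemmatized), the `hop` function, the `verb_markers` mapping, and the choice to place the marker at the end of the sentence when fewer than four tokens or words remain are all hypothetical.

```python
import string

def is_punct(word):
    """True if every character of the (joined) word is punctuation."""
    return all(ch in string.punctuation for ch in "".join(word))

def hop(words, verb_markers, mode, distance=4):
    """Insert number/tense markers for lemmatized verbs.

    words: list of words, each a list of subword tokens (assumed format).
    verb_markers: {word index of a lemmatized verb: "S" or "P"}.
    mode: "nohop", "tokenhop", or "wordhop".
    """
    # Flatten words into tokens, remembering where each word ends.
    tokens, word_end = [], []
    for word in words:
        tokens.extend(word)
        word_end.append(len(tokens))  # index just past the word's last token

    inserts = []  # (position in the original token list, marker)
    for idx, marker in verb_markers.items():
        pos = word_end[idx]  # NOHOP: right after the verb
        if mode == "tokenhop":
            # Count `distance` tokens after the verb (assumption: clamp at end).
            pos = min(pos + distance, len(tokens))
        elif mode == "wordhop":
            # Count `distance` words after the verb, skipping punctuation.
            seen, j = 0, idx + 1
            while j < len(words) and seen < distance:
                if not is_punct(words[j]):
                    seen += 1
                pos = word_end[j]
                j += 1
        elif mode != "nohop":
            raise ValueError(f"unknown mode: {mode}")
        inserts.append((pos, marker))

    # Insert from right to left so earlier positions stay valid.
    for pos, marker in sorted(inserts, reverse=True):
        tokens.insert(pos, marker)
    return tokens

# Hypothetical tokenization of "He cleans his very messy bookshelf."
words = [["He"], ["clean"], ["his"], ["very"], ["mess", "y"], ["book", "shelf"], ["."]]
nohop = hop(words, {1: "S"}, "nohop")
tokenhop = hop(words, {1: "S"}, "tokenhop")
wordhop = hop(words, {1: "S"}, "wordhop")
```

In this toy sentence the TOKENHOP marker lands before "book" because "mess" and "y" count as two tokens, while the WORDHOP marker lands after the whole word "bookshelf", just before the final period.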

Experiment 2: Language Models Disprefer Counting Rules

Surprisal of the marker token

Surprisal difference (expected to be large)
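
As a minimal sketch of the two quantities on this slide: surprisal is the negative log-probability of a token given its context. The surprisal difference is assumed here to be the surprisal of the token that appears when the marker is omitted minus the surprisal of the marker itself, so that a large positive value means the model expected the marker. The function names and probabilities below are illustrative assumptions, not values from the paper.

```python
import math

def surprisal(prob):
    """Surprisal in bits of a token the model assigned probability `prob`."""
    return -math.log2(prob)

def surprisal_difference(p_marker, p_next_without_marker):
    """Assumed definition: surprisal of the token seen when the marker is
    omitted, minus surprisal of the marker at the same position."""
    return surprisal(p_next_without_marker) - surprisal(p_marker)

# Toy numbers: the model gives the marker probability 0.5 at its expected
# position, and probability 0.125 to the next word when the marker is missing.
marker_bits = surprisal(0.5)
diff_bits = surprisal_difference(0.5, 0.125)
```

A model that has learned the marking rule should show low marker surprisal and a large positive difference.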

Experiment 2: Language Models Disprefer Counting Rules

The NOHOP model, which has the verb marking pattern most similar to English, consistently has the lowest mean marker surprisal across training steps.

The NOHOP model also has the highest mean surprisal difference across training steps.

Both of these results indicate that GPT-2 has learned to expect the marker tokens when they follow a more natural grammatical pattern and was very surprised when they did not appear at the correct positions.

Discussion and Conclusion

Contra claims by Chomsky and others that LLMs cannot possibly inform our understanding of human language, they argue there is great value in treating LLMs as a comparative system for human language and in understanding what systems like LLMs can and cannot learn.

They have shown that GPT-2 models do not master their set of synthetic impossible languages as well as natural ones, challenging the unfounded assertions stated previously.

Even in the absence of a clear definition of what constitutes a possible or impossible language, they believe that their investigations advance this debate regarding LLMs.

LLMs lack strong in-built linguistic priors, yet they can learn complex syntactic structures.

Yao