Codility

Classifying texts by CEFR level

Hypothesis

  • The frequency of a word in a language corresponds to a certain level of command of that language
    • The more frequent the word, the earlier in the learning path a learner will learn it

Problem

  • How to map a text to a certain level of difficulty or command of the language

What can we measure in text?

  • text
    • units
      • words
      • N-grams
      • syntactic patterns
  • each unit's frequency in real language
    • corpus
      • US English
      • contemporary
      • domain? (excluding IT terms?)
  • CEFR level?
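A minimal sketch of how two of the units above (words and N-grams) could be counted in a text. The tokenizer is a deliberately naive assumption, the function names are hypothetical, and syntactic patterns are left out because they would need a parser.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (naive rule, an assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unit_frequencies(text, n=1):
    """Count each unit in the text: words when n == 1, n-grams otherwise."""
    tokens = tokenize(text)
    units = tokens if n == 1 else ngrams(tokens, n)
    return Counter(units)

# Invented sample sentence, for illustration only.
sample = "I have a hand. You have a hand."
word_counts = unit_frequencies(sample)         # counts of single words
bigram_counts = unit_frequencies(sample, n=2)  # counts of 2-grams
```

These counts are absolute frequencies within one text; the same functions could be run over a corpus to get the reference frequencies.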

general language corpus

  • 500 most frequent: A1
    • (70,000 - 2,000 occurrences)
  • 1000 most frequent: A2
    • (1,999 - 500 occurrences)
  • etc.

CEFR specialized corpus

  • texts for all levels
    • get the relative frequency of each unit in each level

Where does each unit fall when units are sorted by absolute frequency?

In which level does each unit appear with the most similar relative frequency?
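The second question can be sketched as a nearest-value lookup: compute the relative frequency of a unit in each level's texts, then pick the level whose value is closest to the unit's relative frequency in the text being classified. The toy corpora and function names below are hypothetical, and absolute difference is only one possible similarity measure.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Relative frequency of each unit: its count divided by the total token count."""
    total = len(tokens)
    return {unit: count / total for unit, count in Counter(tokens).items()}

def closest_level(unit, text_rel_freq, level_rel_freqs):
    """Return the level whose relative frequency for `unit` is nearest to the text's."""
    return min(
        level_rel_freqs,
        key=lambda level: abs(level_rel_freqs[level].get(unit, 0.0) - text_rel_freq),
    )

# Hypothetical toy corpora, one token list per CEFR level (invented for illustration).
level_tokens = {
    "A1": "i have a hand i have a dog".split(),
    "B1": "the committee will have a meeting about the proposal today".split(),
}
level_rel_freqs = {lvl: relative_frequencies(toks) for lvl, toks in level_tokens.items()}

# Text to classify (also invented).
text_tokens = "we have a plan and we have a map".split()
text_rel = relative_frequencies(text_tokens)
level_for_have = closest_level("have", text_rel["have"], level_rel_freqs)
```

How the per-unit answers are combined into one level for the whole text is a separate design decision that this sketch does not make.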

Map frequencies to CEFR levels?

This is just an example; expert advice would be needed to define these groups.

general language corpus

  • A1: 70,000 - 2,000
  • A2: 1,999 - 500
  • B1: 499 - 100
  • etc.

Examples (the word appears this number of times in the general language corpus)

  • 'have': 3942
  • 'hand': 431

CEFR specialized corpus

Examples

  • 'have': 5/613
  • 'hand': 1/613
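The general-corpus bands and the two example counts above can be expressed as a simple range lookup. This is only a sketch: the three bands are the illustrative ones from the slide, the levels beyond B1 are not given in the source and are left out, and `level_from_frequency` is a hypothetical name.

```python
# Illustrative bands from the slide: (level, upper bound, lower bound) of the
# absolute frequency in the general language corpus. Real bands need expert input.
BANDS = [
    ("A1", 70_000, 2_000),
    ("A2", 1_999, 500),
    ("B1", 499, 100),
]

def level_from_frequency(count):
    """Return the CEFR level whose band contains `count`, or None if no band matches."""
    for level, high, low in BANDS:
        if low <= count <= high:
            return level
    return None  # bands beyond B1 are elided ("etc.") in the source

print(level_from_frequency(3942))  # 'have' -> A1
print(level_from_frequency(431))   # 'hand' -> B1
```

With these bands, 'have' (3942 occurrences) lands in A1 and 'hand' (431 occurrences) lands in B1.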

Codility

By msoutopico
