Codility
Classifying texts by CEFR level
Hypothesis
- The frequency of a word in a language corresponds to a certain level of command of that language
- The more frequent the word, the earlier in your learning path you will learn it
Problem
- How to map a text to a certain level of difficulty or command of the language
text
units
- words
- N-grams
- syntactic patterns
frequency
excluding IT terms?
each unit's
in real language
corpus
CEFR level?
- US English
- contemporary
- domain?
What can we measure in text?
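As a rough illustration of measuring the units listed above, the sketch below counts words and N-grams in a text. The tokenizer is deliberately naive, and the function names (`tokenize`, `ngrams`, `unit_counts`) and the sample sentence are assumptions made for this example, not part of the original slides. Syntactic patterns are left out because they would need a parser.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive, English-only)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return all n-grams over the token list, each as a tuple of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unit_counts(text, n=1):
    """Count each unit in the text: single words for n=1, n-grams otherwise."""
    return Counter(ngrams(tokenize(text), n))


# Hypothetical sample text, invented for illustration only.
sample = "I have a hand and you have a hand"
word_counts = unit_counts(sample, 1)
bigram_counts = unit_counts(sample, 2)

print(word_counts.most_common(3))
print(bigram_counts.most_common(3))
```

Units are stored as tuples even for single words so that words and N-grams can share the same counting code.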
general language corpus
- 500 most frequent: A1
- (70,000 - 2,000 occurrences)
- 1000 most frequent: A2
- (1,999 - 500 occurrences)
- etc.
CEFR specialized corpus
- texts for all levels
- get the relative frequency of each unit in each level
Where does each unit rank according to its absolute frequency?
In which level does each unit appear with the most similar relative frequency?
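One possible reading of the relative-frequency question is sketched below: compute a unit's relative frequency in the text being classified, then pick the level of the CEFR specialized corpus where that unit's relative frequency is nearest. This interpretation is an assumption, as are the function names and every number in the example, which are invented purely for illustration.

```python
def relative_frequency(count, total):
    """Relative frequency of a unit: its count divided by the total number of units."""
    return count / total


def closest_level(unit_rel_freq, level_rel_freqs):
    """Return the level whose relative frequency for the unit is nearest to unit_rel_freq."""
    return min(
        level_rel_freqs,
        key=lambda level: abs(level_rel_freqs[level] - unit_rel_freq),
    )


# Hypothetical relative frequencies of a single unit in each level's texts.
level_rel_freqs = {"A1": 0.012, "A2": 0.009, "B1": 0.004}

# Hypothetical observation: the unit occurs 4 times in a 500-unit text.
observed = relative_frequency(4, 500)

print(closest_level(observed, level_rel_freqs))
```

Absolute difference is only the simplest distance to try; a ratio or log-scale comparison might suit word frequencies better, which is a choice the slides leave open.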
Map frequencies to CEFR levels?
Just an example; advice from an expert would be needed to create these groups.
general language corpus
- A1: 70,000 - 2,000
- A2: 1,999 - 500
- B1: 499 - 100
- etc
CEFR specialized corpus
Examples
'have': 3942
'hand': 431
'have': 5/613
'hand': 1/613
The word appears this number of times in the general language corpus
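The lookup from absolute frequency to level could be sketched as follows, using the example bands from the "Map frequencies to CEFR levels?" slide and the two general-corpus counts given above. The bands are only the slide's illustrative groups, the names `BANDS` and `level_for_count` are assumptions, and counts outside the three listed bands return `None` because the slide leaves the remaining levels as "etc".

```python
# Example bands from the slide: (level, lowest count, highest count).
# These groups are illustrative; real bands would need expert input.
BANDS = [
    ("A1", 2000, 70000),
    ("A2", 500, 1999),
    ("B1", 100, 499),
]


def level_for_count(count, bands=BANDS):
    """Return the level whose band contains the absolute frequency, or None."""
    for level, low, high in bands:
        if low <= count <= high:
            return level
    return None  # outside the bands listed on the slide


# Counts in the general language corpus, as given in the examples.
general_counts = {"have": 3942, "hand": 431}

for word, count in general_counts.items():
    print(word, count, level_for_count(count))
```

With these bands, 'have' (3942) falls in A1 and 'hand' (431) falls in B1.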
By msoutopico