Annotating a Code-switching Corpus

Ways and Challenges

Lingzi Zhuang

Bingyan Hu

Outline

Introduction: code-switching
Corpus Description
Our Tasks:
1. POS Tagging
2. Language ID
Future Work
1. Parsing
2. Semantic Role Labelling

Code-Switching

All cases where lexical items and grammatical features of two languages appear in one sentence.

Muysken, Pieter. Bilingual Speech: A Typology of Code-Mixing. Port Chester, NY, USA: Cambridge University Press, 2001.

CS in Singapore/Malaysia

Official recognition: English, Mandarin, Malay, Tamil (SG & ML)
Lingua franca: English (SG & ML), Malay (ML, national language)
Chinese varieties: Hokkien (Min-nan), Teochew (Chao-shan), Hakka (Ke-jia), Cantonese (Yue).
Mandarin (written language, lingua franca among various Chinese-speakers). Language of instruction.
In Singapore, the 1979 Speak Mandarin Campaign gives official status to Mandarin alone among the Chinese varieties
Official bilingualism → extensive code-switching
Loans from dialects

https://commons.wikimedia.org/wiki/File:Map_of_sinitic_languages_cropped-en.svg

CS in Singapore/Malaysia

English ~ Singlish ~ Singaporean Mandarin ~ Mandarin
https://www.youtube.com/watch?v=y7f_2Cw-XhM
Different levels of mixing
1. Lexical (insertion of lexical items)
2. Syntactical (alternation between structures)
Idiolectal variation

I didn’t know that … yeah… I didn’t tell you ’cause I thought that nǐ (you)… yǒu (have) meeting … yeah wǒ jiù (so I) méiyǒu (did not) reconcile nàgè (that) part with nǐ jiǎng de nàgè part (that part which you mentioned).

SEAME Corpus

Southeast Asia Mandarin English Code-switching corpus (LDC2015S04)
Nanyang Technological U (SG), U Sains Malaysia (ML)
156 speakers; 19-33 yrs; balanced in gender.
- 82% Singaporean, 18% Malaysian
192 hrs audio in conversational and interview (monologue) styles
63 hrs of individual, sentence-/semantic chunk-level utterances transcribed.
- 18% from conversational, 82% from interview.
- 54% Singaporean, 46% Malaysian.

http://www.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2015-05/2015-05-Seame/

SEAME Corpus

01NC01FBX_0101	86300	88370	then area five 的 total 是
01NC01FBX_0101	165090	167860	不 懂 but official result 还 没有 出 i think 出 了 他们 就 会
01NC01FBX_0101	275720	281420	as in 我 可以 meet la but 我 不 懂 我 还 以为 你们 我 不 懂 你们 有 confirm 去 gym then 我 自己 也 没有 带 我 东西
01NC01FBX_0101	532940	538300	maybe 他 at least at least 他 没有 跟 你 讲 他 在 做工 during the week saturday and sunday
01NC01FBX_0101	579040	580900	做工 做到 很 迟 then in the end 他
01NC01FBX_0101	597330	606920	then 就 不 懂 讲 什么 话 then andy 就 started saying that like 你 去 civil service 你 真的 要 有 like 你 的 honors 那种 不然 就 like 很 disadvantage in terms of 你 的 pay 这些 then 我 就 讲 你 不 是 second up
01NC01FBX_0101	615770	625430	andy's school 的 那个 miss singapore universe 那个 头发 短短 then 每次 参加 那种 pageant 就 总之 她 蛮 出名 的 then 就 他 突然间 讲 到 like peggy 是 第六 年 liao 了
01NC01FBX_0101	625650	627930	她 还 在 读 她 的 对对 对 她 今年 是 sixth year
01NC01FBX_0101	669870	673820	oh 他 拿 third class 他 差一点点 他 的 F.Y.P. screwed up 他 拿 到 B. minus C. plus
01NC01FBX_0101	706820	709160	屁 没有 打包 啊 他 没有 打包 过
01NC01FBX_0101	743340	747050	我 我 是 觉得 很 浪费 那个 sem 我 die die 继续 take then 看 怎么样 讲
01NC01FBX_0101	860180	864580	but then 如果 你 不 take I.A. 你 take I.O. 你 要 clear more electives
01NC01FBX_0101	1031580	1036200	she could have transferred course eh 你 懂 我 有 两 个 or should say 我 那个 F.Y.P. friend 那个 男 的
01NC01FBX_0101	1036230	1045850	他 是 from engine 的 then 他 就 就 也 是 a levels 考得 很 烂 很 烂 then 就 被 丢 进 engine but then 他 year one sem one 就 考得 like three point something 就 not bad then 他 就 apply then 就 换
01NC01FBX_0101	1045860	1051850	他 year one sem two 就 来 econs then 我 还有 多 两 个 friend 也 是 有 一个 女 的 更 惨 她 是 我 J.C. first three months 的 friend
01NC01FBX_0101	1099510	1105490	超 喜欢 啦 我 觉得 我 是 读 对 东西 我 很 开心 我 那 时候 appeal accountancy 我 没有 进
01NC01FBX_0101	1106560	1115600	因为 我 的 first choice 我 放 accountancy second choice 我 才放 econs then 我 就 没有 进 accountancy 因为 那 时候 那个 cutoff 是 a a B.S. 我 不 是 蛮 高 的 我 就 拿 B.B.B. 那种
01NC01FBX_0101	1115690	1118680	then in the end 就读 econs then 我 还去 appeal 一 次
01NC01FBX_0101	1124910	1131600	就 我 觉得 three years then some more 它 是 一个 professional job then 我 就 觉得 i mean like why spend four years doing a general arts
01NC01FBX_0101	1330860	1335250	but i think 这种 business 的 应该 没有 很 凶 like 那种 major project 酱
01NC01FBX_0101	1470440	1480000	从 我 一 到 那个 toa-payoh M.R.T. station 我 就 看 很多 人 惨 了 那个 announcement 就 讲 there was some delay in the previous train then 就 什么 it might cause it might cause a 什么 delay in your ride 什么 东西
01NC01FBX_0101	1556330	1560860	but then 谁 会要 从 ang-mo-kio 搭 到 jurong-east then 搭 去 pasir-ris
01NC01FBX_0101	1869620	1878690	它 有 那个 show flat then 我 跟 我 friend 我 跟 jerrin 就 很 gian to 去 看 then 就 弄到 很 美 很美 可是 很 小 很 小 but 它 的 five rooms hor 就 你家 也 是 five rooms 对 吗
01NC01FBX_0101	1898810	1906880	and then 他 就 pay 了 like almost sixty six hundred thousand for 那个 屋子 就 more than half of a million for 一个 新 的 H.D.B.flat leh
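The sample lines above appear to follow a four-field, tab-separated layout: recording ID, start offset, end offset, and the word-segmented text. A minimal Python sketch of reading one such line is given below. The field names and the reading of the offsets as milliseconds are assumptions made for illustration, not part of the corpus documentation quoted here.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    recording_id: str
    start: int          # assumed to be a millisecond offset
    end: int            # assumed to be a millisecond offset
    tokens: list        # whitespace-separated tokens of the transcript


def parse_line(line):
    """Split one transcript line into its four tab-separated fields."""
    recording_id, start, end, text = line.rstrip("\n").split("\t", 3)
    return Utterance(recording_id, int(start), int(end), text.split())


sample = "01NC01FBX_0101\t86300\t88370\tthen area five 的 total 是"
utt = parse_line(sample)
print(utt.tokens)
```

Splitting on whitespace relies on the transcript already being word-segmented, as it is in the sample.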

Our Task: so far...

Given word-segmented transcriptions, add several types of lexical/syntactic annotation that are potentially useful for extracting features of code-switching behaviour.
- POS tags
- Language identification: Mandarin, English, other
  - Named entities
  - Singlish/Manglish discourse particles (ultimately loaned from local varieties of Chinese)

I am trying to avoid it [ar] (… emphatic declarative)
But then if cannot get a bank then die already [lorh] (… hasten affirmation of new circumstance)

Our Task: Overall Challenges

Spoken language corpus
- Non-standard words/spellings
- Fragments, repetitions, etc.
Inconsistent quality of transcription
- Spelling/character mistakes (e.g. "have not" transcribed as "kerosene")
- Undelivered promises
  - discourse particles, named entities, loans
Problematic utterance selection
- Many utterances contain more than one sentence; no boundary marked

POS-Tagging

Word-level segmentation of Chinese parts
- Original SEAME segmentation less-than-ideal
- Stanford Chinese Word Segmenter works well with bilingual Chinese-English data
POS-tag Chinese and English parts separately
Automatic POS-tagging using Stanford POS-tagger
- Penn Treebank & Penn Chinese Treebank standards
Map PTB and CTB tags to Universal POS Tagset
- "A set of coarse POS categories exists cross-linguistically in one form or another" (Carnie)

http://universaldependencies.github.io/docs/u/pos/

http://www.petrovi.de/data/lrec.pdf
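The mapping step can be sketched as two lookup tables, one per treebank tagset. The tables below are deliberately partial and illustrative: they cover the tags that occur in the worked example in these slides plus a few common ones, and the fallback tag "X" for unlisted tags is an assumption. The names CTB_TO_UNIVERSAL, PTB_TO_UNIVERSAL and to_universal are hypothetical.

```python
# Partial, illustrative mappings; a real table would cover every treebank tag.
CTB_TO_UNIVERSAL = {
    "PN": "PRON", "AD": "ADV", "VV": "VERB", "CC": "CONJ",
    "NN": "NOUN", "NR": "PROPN", "P": "ADP", "CD": "NUM",
}
PTB_TO_UNIVERSAL = {
    "VB": "VERB", "NNP": "PROPN", "NN": "NOUN", "JJ": "ADJ",
    "RB": "ADV", "PRP": "PRON", "IN": "ADP", "CC": "CONJ",
    "DT": "DET", "CD": "NUM",
}


def to_universal(tagged, table, unknown="X"):
    """Replace each treebank tag with its coarse tag; unlisted tags become `unknown`."""
    return [(word, table.get(tag, unknown)) for word, tag in tagged]


english = to_universal([("take", "VB"), ("I.O.", "NNP")], PTB_TO_UNIVERSAL)
chinese = to_universal([("他们", "PN"), ("就", "AD"), ("要", "VV")], CTB_TO_UNIVERSAL)
print(english, chinese)
```

Keeping one table per source tagset means the Chinese and English parts can still be tagged separately and only meet after mapping.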

POS-Tagging

Worked example (Chinese | English | Chinese):

他们就要 take I.O. 所以可以自己去找

1. Segment and split by language
Chinese: ["他们", "就", "要"; "所以", "可以", "自己", "去", "找"]
English: ["take", "I.O."]

2. Tag each part separately (CTB and PTB tags)
Chinese: [("他们", "PN"), ("就", "AD"), ("要", "VV"); ("所以", "CC"), ("可以", "VV"), ("自己", "AD"), ("去", "VV"), ("找", "VV")]
English: [("take", "VB"), ("I.O.", "NNP")]

3. Map to universal tags
Chinese: [("他们", "PRON"), ("就", "ADV"), ("要", "VERB"); ("所以", "CONJ"), ("可以", "VERB"), ("自己", "ADV"), ("去", "VERB"), ("找", "VERB")]
English: [("take", "VERB"), ("I.O.", "PROPN")]

4. Merge in the original word order
[("他们", "PRON"), ("就", "ADV"), ("要", "VERB"), ("take", "VERB"), ("I.O.", "PROPN"), ("所以", "CONJ"), ("可以", "VERB"), ("自己", "ADV"), ("去", "VERB"), ("找", "VERB")]
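The split-tag-merge procedure in this example can be sketched in Python as follows. This is a sketch under stated assumptions: a token counts as Chinese if it contains a CJK ideograph, each contiguous run is tagged on its own, and the `taggers` argument is a hypothetical stand-in for real taggers (the placeholder taggers here only record which tagger saw each word).

```python
import re
from itertools import groupby

# Assumption: any token containing a CJK Unified Ideograph is Chinese.
HAN = re.compile(r"[\u4e00-\u9fff]")


def script_of(token):
    return "zh" if HAN.search(token) else "en"


def language_runs(tokens):
    """Group consecutive same-script tokens into (lang, tokens) runs."""
    return [(lang, list(group)) for lang, group in groupby(tokens, key=script_of)]


def tag_runs(runs, taggers):
    """Tag each run with its language's tagger, then merge in original order.

    `taggers` maps "zh"/"en" to a callable that takes a token list and
    returns (word, tag) pairs.
    """
    merged = []
    for lang, run in runs:
        merged.extend(taggers[lang](run))
    return merged


tokens = ["他们", "就", "要", "take", "I.O.", "所以", "可以", "自己", "去", "找"]
runs = language_runs(tokens)

# Placeholder taggers standing in for real Chinese and English POS taggers.
placeholder = {
    "zh": lambda words: [(w, "ZH-TAG") for w in words],
    "en": lambda words: [(w, "EN-TAG") for w in words],
}
tagged = tag_runs(runs, placeholder)
print(runs)
```

Because runs are kept in their original order, merging is a plain concatenation and no index bookkeeping is needed.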

POS-Tagging

Problems fixed:
- Discourse particles (global search using a list)
  - "lah", "leh", etc.
In progress:
- Named entities
  - Manual identification using crowdsourcing
  - Multi-word names split into one token per word ("lord" "of" "the" "rings" type)
- Discourse markers
  - "well", "you know", "like", "right", "okay", etc.
  - Crowdsourcing results unsatisfactory
  - Manual disambiguation in lab?
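The list-based fix for discourse particles might look like the sketch below. The particle list is illustrative (spellings taken from examples in these slides), and the tag name "PARTICLE" is a hypothetical placeholder, since the slides do not say which tag particles receive.

```python
# Illustrative list; spellings taken from examples in these slides.
DISCOURSE_PARTICLES = {"lah", "leh", "lorh", "loh", "hor"}


def retag_particles(tagged, particles=DISCOURSE_PARTICLES, tag="PARTICLE"):
    """Overwrite the tag of every token found in the particle list."""
    return [
        (word, tag if word.lower() in particles else old_tag)
        for word, old_tag in tagged
    ]


# The input tags here are made up to show the override, not tagger output.
before = [("can", "VERB"), ("meet", "VERB"), ("lah", "NOUN")]
after = retag_particles(before)
print(after)
```

A plain list lookup like this cannot separate a particle from a homographic ordinary word, which is one reason the ambiguous discourse markers listed above need manual disambiguation.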

POS-Tagging

Unfixed:
- POS-tags for loans (Malay, non-Mandarin Chinese)
Systematic errors:
- Breaking up the sentences limits context scope for POS-tagger
  - Words on the margin may not be tagged accurately
  - POS-ambiguous words are more likely to receive the wrong tag
- Stuttering (partial repetition), especially at code-switching boundaries, produces half-words whose POS tags might be noisy

We will like (*VERB) 去 (go to) 一个人 (someone’s) 家里 (home).

我 (I) 做了 (did) 一点点 (a little) lit (*VERB) literature (NOUN) review ah
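One possible way to flag half-words such as "lit" before "literature" is a prefix heuristic, sketched below. This is an illustration rather than a method used in this work: the function name and the `min_len` threshold are assumptions, and the heuristic will also flag legitimate pairs such as "in into".

```python
def flag_partial_repetitions(tokens, min_len=2):
    """Return indices of tokens that are a proper prefix of the next token."""
    flagged = []
    for i in range(len(tokens) - 1):
        cur, nxt = tokens[i].lower(), tokens[i + 1].lower()
        if min_len <= len(cur) < len(nxt) and nxt.startswith(cur):
            flagged.append(i)
    return flagged


tokens = ["我", "做了", "一点点", "lit", "literature", "review", "ah"]
suspects = flag_partial_repetitions(tokens)
print([tokens[i] for i in suspects])
```

Flagged tokens could then be excluded from tagging or given a dedicated tag instead of a noisy automatic one.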

Language ID

Chinese and English character sets are mutually exclusive
Outstanding cases:
- discourse particles (solved)
- uncaught loans from Malay and non-Mandarin Chinese
  - Approach: the dictionary method (check tokens against word lists)
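Token-level language ID along these lines can be sketched as a word-list check followed by a character-set test. The label names follow the three classes named earlier (Mandarin, English, other). The contents of OTHER_WORDS are illustrative, and a list of Malay and non-Mandarin Chinese loans would have to be supplied in the same way.

```python
import re

HAN = re.compile(r"[\u4e00-\u9fff]")            # CJK Unified Ideographs
LATIN = re.compile(r"^[A-Za-z][A-Za-z.'\-]*$")  # letters plus . ' - inside the token

# Illustrative word list: discourse particles seen in these slides.
OTHER_WORDS = {"lah", "leh", "lorh", "loh", "hor"}


def identify_language(token, other_words=OTHER_WORDS):
    """Label one token as "Mandarin", "English" or "other"."""
    if token.lower() in other_words:   # the word list takes priority
        return "other"
    if HAN.search(token):
        return "Mandarin"
    if LATIN.match(token):
        return "English"
    return "other"


tokens = ["一个", "新", "的", "flat", "leh"]
labels = [identify_language(t) for t in tokens]
print(list(zip(tokens, labels)))
```

Checking the word list first matters because particles and loans are written in Latin letters and would otherwise be labelled English.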

Future: from lexical to syntactical

Possible next steps:
- Parsing?
  - Structure of colloquial speech tends to be flat
- Semantic role labelling?
  - Current semantic role labelling tools are monolingual and rely on parsing information
  - Translate-label-replace trick?

Future: from lexical to syntactical

因为突然间喜欢 photography 就一直找 photography [loh]

--- Because suddenly like photography, so always find photography []

--- 因为突然间喜欢摄影就一直找摄影 []

http://cogcomp.cs.illinois.edu/page/demo_view/

http://barbar.cs.lth.se:8081/parse

http://www.ltp-cloud.com/demo/

Future: from lexical to syntactical

But nevertheless 我有 learn 了另外一种 technique
- But nevertheless I have learn [] another one kind technique
- 但是但是我有学了另外一种技巧

Future

From lexical to syntactical: the translate-label-replace trick
- "Alternation" idea: each utterance has a "basic structure", which is either Chinese or English.
- If the basic structure is English, word-translate the Chinese parts into English and use an English semantic parser; and vice versa.
- Criterion: verbs? (Basic syntax is English if most VERBs are English, and vice versa.)
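The verb-majority criterion and the word-translation step can be sketched as below. Everything here is a sketch under assumptions: the tags on the example tokens are assigned by hand for illustration, the one-entry lexicon is a toy, ties fall back to a `default` language, and the final step of mapping labels back onto the original words is not shown.

```python
from collections import Counter


def basic_language(tokens, default="zh"):
    """Pick the language that contributes most VERBs; ties fall back to `default`.

    `tokens` is a list of (word, universal_tag, lang) triples.
    """
    verbs = Counter(lang for _, tag, lang in tokens if tag == "VERB")
    if verbs["en"] > verbs["zh"]:
        return "en"
    if verbs["zh"] > verbs["en"]:
        return "zh"
    return default


def translate_into_basic(tokens, lexicon, default="zh"):
    """Word-translate tokens of the other language into the basic language."""
    target = basic_language(tokens, default)
    words = []
    for word, _, lang in tokens:
        if lang != target:
            word = lexicon.get((lang, word), word)  # keep the word if no entry
        words.append(word)
    return target, words


# Hand-tagged tokens for the first example utterance (tags are illustrative).
utterance = [
    ("因为", "CONJ", "zh"), ("突然间", "ADV", "zh"), ("喜欢", "VERB", "zh"),
    ("photography", "NOUN", "en"), ("就", "ADV", "zh"), ("一直", "ADV", "zh"),
    ("找", "VERB", "zh"), ("photography", "NOUN", "en"),
]
toy_lexicon = {("en", "photography"): "摄影"}
target, words = translate_into_basic(utterance, toy_lexicon)
print(target, "".join(words))
```

For the second example ("But nevertheless 我有 learn 了另外一种 technique") the verb count may well be tied, so the tie-breaking rule would need to be decided explicitly.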

Thank you!
