Annotating a Code-switching Corpus

Ways and Challenges

Lingzi Zhuang

Bingyan Hu

Outline

  1. Introduction: code-switching
  2. Corpus Description
  3. Our Tasks:
    1. POS Tagging
    2. Language ID
  4. Future Work
    1. Parsing
    2. Semantic Role Labelling

Code-Switching

"All cases where lexical items and grammatical features of two languages appear in one sentence."

Muysken, Pieter. Bilingual Speech: A Typology of Code-Mixing. Port Chester, NY, USA: Cambridge University Press, 2001.

CS in Singapore/Malaysia

  1. Official recognition: English, Mandarin, Malay, Tamil (SG & ML)

  2. Lingua franca: English (SG & ML), Malay (ML, national language)

  3. Chinese varieties: Hokkien (Min-nan), Teochew (Chao-shan), Hakka (Ke-jia), Cantonese (Yue).

  4. Mandarin (written language, lingua franca among various Chinese-speakers). Language of instruction.

  5. In Singapore, the 1979 Speak Mandarin Campaign prescribes official status for Mandarin alone among the Chinese varieties

  6. Official bilingualism → extensive code-switching

  7. Loans from dialects

CS in Singapore/Malaysia

  1. English ~ Singlish ~ Singaporean Mandarin ~ Mandarin

  2. https://www.youtube.com/watch?v=y7f_2Cw-XhM

  3. Different levels of mixing

    1. Lexical (insertion of lexical items)

    2. Syntactical (alternation between structures)

  4. Idiolectal variation

I didn’t know that … yeah … I didn’t tell you ’cause I thought that (you) … yǒu (have) meeting … yeah wǒ jiù (so I) méiyǒu (did not) reconcile nàgè (that) part with nǐ jiǎng de nàgè part (that part which you mentioned).

SEAME Corpus

  1. Southeast Asia Mandarin English Code-switching corpus (LDC2015S04)

  2. Nanyang Technological University (SG), Universiti Sains Malaysia (ML)

  3. 156 speakers; 19-33 yrs; balanced in gender.

    • 82% Singaporean, 18% Malaysian

  4. 192 hrs audio in conversational and interview (monologue) styles

  5. 63 hrs of individual, sentence-/semantic chunk-level utterances transcribed.

    • 18% from conversational, 82% from interview.  

    • 54% Singaporean, 46% Malaysian.

http://www.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2015-05/2015-05-Seame/

SEAME Corpus

01NC01FBX_0101	86300	88370	then area five 的 total 是
01NC01FBX_0101	165090	167860	不 懂 but official result 还 没有 出 i think 出 了 他们 就 会
01NC01FBX_0101	275720	281420	as in 我 可以 meet la but 我 不 懂 我 还 以为 你们 我 不 懂 你们 有 confirm 去 gym then 我 自己 也 没有 带 我 东西
01NC01FBX_0101	532940	538300	maybe 他 at least at least 他 没有 跟 你 讲 他 在 做工 during the week saturday and sunday
01NC01FBX_0101	579040	580900	做工 做到 很 迟 then in the end 他
01NC01FBX_0101	597330	606920	then 就 不 懂 讲 什么 话 then andy 就 started saying that like 你 去 civil service 你 真的 要 有 like 你 的 honors 那种 不然 就 like 很 disadvantage in terms of 你 的 pay 这些 then 我 就 讲 你 不 是 second up
01NC01FBX_0101	615770	625430	andy's school 的 那个 miss singapore universe 那个 头发 短短 then 每次 参加 那种 pageant 就 总之 她 蛮 出名 的 then 就 他 突然间 讲 到 like peggy 是 第六 年 liao 了
01NC01FBX_0101	625650	627930	她 还 在 读 她 的 对对 对 她 今年 是 sixth year
01NC01FBX_0101	669870	673820	oh 他 拿 third class 他 差一点点 他 的 F.Y.P. screwed up 他 拿 到 B. minus C. plus
01NC01FBX_0101	706820	709160	屁 没有 打包 啊 他 没有 打包 过
01NC01FBX_0101	743340	747050	我 我 是 觉得 很 浪费 那个 sem 我 die die 继续 take then 看 怎么样 讲
01NC01FBX_0101	860180	864580	but then 如果 你 不 take I.A. 你 take I.O. 你 要 clear more electives
01NC01FBX_0101	1031580	1036200	she could have transferred course eh 你 懂 我 有 两 个 or should say 我 那个 F.Y.P. friend 那个 男 的
01NC01FBX_0101	1036230	1045850	他 是 from engine 的 then 他 就 就 也 是 a levels 考得 很 烂 很 烂 then 就 被 丢 进 engine but then 他 year one sem one 就 考得 like three point something 就 not bad then 他 就 apply then 就 换
01NC01FBX_0101	1045860	1051850	他 year one sem two 就 来 econs then 我 还有 多 两 个 friend 也 是 有 一个 女 的 更 惨 她 是 我 J.C. first three months 的 friend
01NC01FBX_0101	1099510	1105490	超 喜欢 啦 我 觉得 我 是 读 对 东西 我 很 开心 我 那 时候 appeal accountancy 我 没有 进
01NC01FBX_0101	1106560	1115600	因为 我 的 first choice 我 放 accountancy second choice 我 才放 econs then 我 就 没有 进 accountancy 因为 那 时候 那个 cutoff 是 a a B.S. 我 不 是 蛮 高 的 我 就 拿 B.B.B. 那种
01NC01FBX_0101	1115690	1118680	then in the end 就读 econs then 我 还去 appeal 一 次
01NC01FBX_0101	1124910	1131600	就 我 觉得 three years then some more 它 是 一个 professional job then 我 就 觉得 i mean like why spend four years doing a general arts
01NC01FBX_0101	1330860	1335250	but i think 这种 business 的 应该 没有 很 凶 like 那种 major project 酱
01NC01FBX_0101	1470440	1480000	从 我 一 到 那个 toa-payoh M.R.T. station 我 就 看 很多 人 惨 了 那个 announcement 就 讲 there was some delay in the previous train then 就 什么 it might cause it might cause a 什么 delay in your ride 什么 东西
01NC01FBX_0101	1556330	1560860	but then 谁 会要 从 ang-mo-kio 搭 到 jurong-east then 搭 去 pasir-ris
01NC01FBX_0101	1869620	1878690	它 有 那个 show flat then 我 跟 我 friend 我 跟 jerrin 就 很 gian to 去 看 then 就 弄到 很 美 很美 可是 很 小 很 小 but 它 的 five rooms hor 就 你家 也 是 five rooms 对 吗
01NC01FBX_0101	1898810	1906880	and then 他 就 pay 了 like almost sixty six hundred thousand for 那个 屋子 就 more than half of a million for 一个 新 的 H.D.B.flat leh
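
The sample lines above appear to hold four tab-separated fields: a recording ID, two numeric offsets, and the word-segmented transcript. A minimal sketch for reading one such line follows. The field names are illustrative labels chosen here, and reading the two numbers as start and end times is an assumption, not something the slides state.

```python
from typing import List, NamedTuple


class Utterance(NamedTuple):
    recording_id: str   # e.g. "01NC01FBX_0101" (field name is an assumption)
    start: int          # assumed start offset of the utterance
    end: int            # assumed end offset of the utterance
    tokens: List[str]   # whitespace-separated words of the transcript


def parse_line(line: str) -> Utterance:
    """Split one transcript line into its four tab-separated fields."""
    rec_id, start, end, text = line.rstrip("\n").split("\t", 3)
    return Utterance(rec_id, int(start), int(end), text.split())


sample = "01NC01FBX_0101\t86300\t88370\tthen area five 的 total 是"
utt = parse_line(sample)
print(utt.tokens)
```

Keeping the transcript as a token list makes the later per-word annotations (POS tag, language ID) easy to attach.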

Our Task: so far...

  1. Given word-segmented transcriptions, add lexical and syntactic annotations that are potentially useful as features for studying code-switching behaviour:

    • POS tags

    • Language identification: Mandarin, English, other

      • Named entities

      • Singlish/Manglish discourse particles (ultimately loaned from local varieties of Chinese)

  • I am trying to avoid it [ar] (… emphatic declarative)

  • But then if cannot get a bank then die already [lorh] (… hasten affirmation of new circumstance)

Our Task: Overall Challenges

  • Spoken language corpus

    • Non-standard words/spellings

    • Fragments, repetitions, etc.

  • Inconsistent quality of transcription

    • Spelling/character mistakes (e.g. "have not" transcribed as "kerosene")

    • Undelivered promises

      • discourse particles, named entities, loans

  • Problematic utterance selection

    • Many utterances contain more than one sentence; no boundary marked

POS-Tagging

  • Word-level segmentation of Chinese parts

    • Original SEAME segmentation less-than-ideal

    • Stanford Chinese Word Segmenter works well with bilingual Chinese-English data

  • POS-tag Chinese and English parts separately

  • Automatic POS-tagging using Stanford POS-tagger

    • Penn Treebank & Penn Chinese Treebank standards

  • Map PTB and CTB tags to Universal POS Tagset

    • "A set of coarse POS categories exists cross-linguistically in one form or another" (Carnie)
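
The tag-mapping step can be sketched as a pair of lookup tables. The tables below are deliberately partial: they cover the tags used in this deck's worked example plus a few common ones added here as assumptions. Falling back to "X" for unknown tags is also a choice made for this sketch.

```python
# Partial, illustrative mappings (not the full tagsets).
CTB_TO_UNIVERSAL = {
    "PN": "PRON", "AD": "ADV", "VV": "VERB", "CC": "CONJ",  # from the example
    "NN": "NOUN", "NR": "PROPN",                            # added assumption
}
PTB_TO_UNIVERSAL = {
    "VB": "VERB", "NNP": "PROPN",                           # from the example
    "NN": "NOUN", "RB": "ADV", "PRP": "PRON", "CC": "CONJ", # added assumption
}


def to_universal(tagged, table):
    """Map (word, treebank_tag) pairs to (word, universal_tag) pairs."""
    return [(word, table.get(tag, "X")) for word, tag in tagged]


english = to_universal([("take", "VB"), ("I.O.", "NNP")], PTB_TO_UNIVERSAL)
chinese = to_universal([("他们", "PN"), ("就", "AD"), ("要", "VV")], CTB_TO_UNIVERSAL)
print(english, chinese)
```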

POS-Tagging

Example utterance, with its language segments:

  Chinese/中文: 他们 就 要
  English/英文: take I.O.
  Chinese/中文: 所以 可以 自己 去 找

Step 1: split the utterance by language

  Chinese/中文: ["他们", "就", "要"; "所以", "可以", "自己", "去", "找"]
  English/英文: ["take", "I.O."]

Step 2: POS-tag each language separately (CTB and PTB tags)

  Chinese/中文: [("他们", "PN"), ("就", "AD"), ("要", "VV"); ("所以", "CC"), ("可以", "VV"), ("自己", "AD"), ("去", "VV"), ("找", "VV")]
  English/英文: [("take", "VB"), ("I.O.", "NNP")]

Step 3: map to universal tags

  Chinese/中文: [("他们", "PRON"), ("就", "ADV"), ("要", "VERB"); ("所以", "CONJ"), ("可以", "VERB"), ("自己", "ADV"), ("去", "VERB"), ("找", "VERB")]
  English/英文: [("take", "VERB"), ("I.O.", "PROPN")]

Step 4: merge back into the original word order

  [("他们", "PRON"), ("就", "ADV"), ("要", "VERB"), ("take", "VERB"), ("I.O.", "PROPN"), ("所以", "CONJ"), ("可以", "VERB"), ("自己", "ADV"), ("去", "VERB"), ("找", "VERB")]
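
The split, tag, and merge procedure illustrated above can be sketched as follows. The two toy taggers are hypothetical dictionary lookups standing in for the Stanford taggers, and detecting Chinese via the CJK Unified Ideographs range is an assumption of this sketch.

```python
import itertools


def is_chinese(token):
    """True if the token contains at least one CJK Unified Ideograph."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in token)


def tag_mixed(tokens, tag_zh, tag_en):
    """Tag each same-language run separately, then merge in original order."""
    merged = []
    for chinese, run in itertools.groupby(tokens, key=is_chinese):
        run = list(run)
        merged.extend(tag_zh(run) if chinese else tag_en(run))
    return merged


# Toy stand-ins for real taggers; they only know the example's words.
TOY_ZH = {"他们": "PN", "就": "AD", "要": "VV", "所以": "CC",
          "可以": "VV", "自己": "AD", "去": "VV", "找": "VV"}
TOY_EN = {"take": "VB", "I.O.": "NNP"}


def toy_tag_zh(run):
    return [(word, TOY_ZH[word]) for word in run]


def toy_tag_en(run):
    return [(word, TOY_EN[word]) for word in run]


utterance = "他们 就 要 take I.O. 所以 可以 自己 去 找".split()
tagged = tag_mixed(utterance, toy_tag_zh, toy_tag_en)
print(tagged)
```

Because each run is tagged on its own, a real tagger sees only a few words of context at a time, which is the source of the margin errors discussed under "Systematic error".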

POS-Tagging

  • Problems fixed:

    • discourse particles (global search using a list)

      • "lah", "leh", etc.

  • In progress:

    • Named entities

      • Manual identification using crowdsourcing

      • Multi-token names of the "lord" "of" "the" "rings" type

    • Discourse markers

      • "well", "you know", "like", "right", "okay", etc.

      • Crowdsourcing results unsatisfactory

      • Manual disambiguation in lab?
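
The list-based fix for discourse particles might look like the sketch below. "lah" and "leh" come from the list above, while "lorh" and "ar" are taken from the earlier example sentences. The tag name "PRT" is a hypothetical choice, since the slides do not say which tag the particles receive.

```python
# Seed list; a real list would be longer.
DISCOURSE_PARTICLES = {"lah", "leh", "lorh", "ar"}
PARTICLE_TAG = "PRT"  # hypothetical tag name, not confirmed by the slides


def retag_particles(tagged, particles=DISCOURSE_PARTICLES, tag=PARTICLE_TAG):
    """Overwrite the tagger's guess for any token found in the particle list."""
    return [(word, tag) if word.lower() in particles else (word, pos)
            for word, pos in tagged]


noisy = [("then", "ADV"), ("die", "VERB"), ("already", "ADV"), ("lorh", "NOUN")]
fixed = retag_particles(noisy)
print(fixed)
```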

POS-Tagging

  • Unfixed:

    • POS-tags for loans (Malay, non-Mandarin Chinese)

  • Systematic error

    • Breaking up the sentences limits context scope for POS-tagger

      • Words on the margin may not be tagged accurately

      • POS-ambiguous words are more likely to receive the wrong tag

    • Stuttering (partial repetition), especially at code-switching boundaries, produces half-words whose POS tags might be noisy

 We will like (*VERB) 去 (go to) 一个 人 (someone’s) 家里 (home).

我 (I) 做 了 (did) 一点点 (a little) lit (*VERB) literature (NOUN) review ah
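
One way to locate stuttered half-words such as "lit" before "literature" is to flag any token that is a proper prefix of the token after it. This is a heuristic suggested here, not a method described in the slides. It will also flag real word pairs such as "the then", so its output is a list of candidates for review.

```python
def flag_half_words(tokens, min_len=2):
    """Return indices of tokens that are a proper prefix of the next token."""
    flagged = []
    for i in range(len(tokens) - 1):
        cur, nxt = tokens[i].lower(), tokens[i + 1].lower()
        if min_len <= len(cur) < len(nxt) and nxt.startswith(cur):
            flagged.append(i)
    return flagged


tokens = "我 做 了 一点点 lit literature review ah".split()
candidates = flag_half_words(tokens)
print([tokens[i] for i in candidates])
```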

Language ID

  • Chinese and English character sets are mutually exclusive

  • Outstanding cases:

    • discourse particles (solved)

    • Uncaught loans from Malay and non-Mandarin Chinese

      • Possible fix: a dictionary (word-list) lookup
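
A minimal sketch of character-set language ID with a word-list override is given below. The label strings follow the three categories named earlier (Mandarin, English, other). The word list contains only particles attested in this deck, and a loan dictionary for Malay or non-Mandarin Chinese would have to be supplied separately.

```python
OTHER_WORDS = frozenset({"lah", "leh", "lorh", "loh"})  # illustrative list


def language_id(token, other_words=OTHER_WORDS):
    """Label a token as 'Mandarin', 'English', or 'other'."""
    if token.lower() in other_words:        # word list takes priority
        return "other"
    if any("\u4e00" <= ch <= "\u9fff" for ch in token):
        return "Mandarin"                   # contains a CJK ideograph
    if any(ch.isascii() and ch.isalpha() for ch in token):
        return "English"                    # contains an ASCII letter
    return "other"


labels = [language_id(t) for t in "喜欢 photography loh".split()]
print(labels)
```

The word-list check runs first because particles and romanised loans are written in Latin letters and would otherwise be labelled English.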

Future: from lexical to syntactical

 

  • Possible next steps:
    • Parsing?
      • Structure of colloquial speech tends to be flat
    • Semantic role labelling?
      • Current semantic role labelling tools are monolingual and rely on parsing information
      • Translate-label-replace trick?

Future: from lexical to syntactical

因为 突然间 喜欢 photography 就 一直 找 photography [loh]

--- Word-translated into English: Because suddenly like photography, so always find photography []

--- Word-translated into Chinese: 因为 突然间 喜欢 摄影 就 一直 找 摄影 []

Future: from lexical to syntactical

  • But nevertheless 我 有 learn 了 另外 一 种 technique

    • Word-translated into English: But nevertheless I have learn [] another one kind technique

    • Word-translated into Chinese: 但是 但是 我 有 学 了 另外 一 种 技巧
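
The word-by-word translation in the example above can be sketched with a toy bilingual lexicon. The lexicon entries are illustrative, and emitting "[]" for a Chinese word with no entry mirrors the empty brackets in the example. Both are assumptions of this sketch.

```python
def is_chinese(token):
    """True if the token contains at least one CJK Unified Ideograph."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in token)


def word_translate_to_english(tokens, lexicon):
    """Replace each Chinese token by its lexicon entry, or '[]' if missing."""
    return [lexicon.get(tok, "[]") if is_chinese(tok) else tok
            for tok in tokens]


# Toy lexicon covering only this example; 了 is intentionally left out.
TOY_LEXICON = {"我": "I", "有": "have", "另外": "another", "一": "one", "种": "kind"}

tokens = "But nevertheless 我 有 learn 了 另外 一 种 technique".split()
translated = " ".join(word_translate_to_english(tokens, TOY_LEXICON))
print(translated)  # But nevertheless I have learn [] another one kind technique
```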

Future: the translate-label-replace trick

  • "Alternation" idea: each utterance has a "basic structure", which is either Chinese or English.
  • If the basic structure is English, word-translate the Chinese parts into English and run an English semantic parser; if it is Chinese, do the reverse.
  • Open question: what is the criterion? One candidate is the verbs: the basic syntax is English if most VERBs are English, and Chinese if most are Chinese.
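
The verb-majority criterion could be sketched as below. The input format (word, universal POS tag, language label) and the choice to return None on a tie are assumptions of this sketch. The slides leave open how ties and verbless utterances should be handled.

```python
from collections import Counter


def basic_structure(annotated):
    """Pick the language contributing the most VERB tokens.

    `annotated` is a list of (word, universal_pos, language) triples.
    Returns "English", "Mandarin", or None when the counts tie.
    """
    verbs = Counter(lang for _, pos, lang in annotated if pos == "VERB")
    if verbs["English"] == verbs["Mandarin"]:
        return None
    return "English" if verbs["English"] > verbs["Mandarin"] else "Mandarin"


annotated = [
    ("他们", "PRON", "Mandarin"), ("就", "ADV", "Mandarin"),
    ("要", "VERB", "Mandarin"), ("take", "VERB", "English"),
    ("I.O.", "PROPN", "English"), ("所以", "CONJ", "Mandarin"),
    ("可以", "VERB", "Mandarin"), ("去", "VERB", "Mandarin"),
    ("找", "VERB", "Mandarin"),
]
print(basic_structure(annotated))
```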
           

Thank you!
