Language Segmentation of Twitter Tweets using Weakly Supervised Language Model Induction

David Alfter

15 September 2015

@daalft

The Problem

[n. ag. fr. abhijjhita in med. function] one who covets M <smallcaps>i.</smallcaps> 287 (T. abhijjhātar, v. l. °itar) = A <smallcaps>v.</smallcaps> 265 (T. °itar, v. l. °ātar).


Pacati,[Ved.pacati,Idg*peqǔōAv.pac-; Obulg.peka to fry,roast,Lith,kepū bake,Grpέssw cook,pέpwn ripe] to cook,boil,roast Vin.IV,264; fig.torment in purgatory (trs.and intrs.):Niraye pacitvā after roasting in N.S.II,225,PvA.10,14.-- ppr.pacanto tormenting,Gen.pacato (+Caus.pācayato) D.I,52 (expld at DA.I,159,where read pacato for paccato,by pare daṇḍena pīḷentassa).-- pp.pakka (q.v.).‹-› Caus.pacāpeti & pāceti (q.v.).-- Pass.paccati to be roasted or tormented (q.v.).(Page 382)

Abbha, (nt.) [Vedic abhra nt. & later Sk. abhra m. \"dark cloud\"; Idg. *m̊bhro, cp. Gr. <at>a)fro\\s</at> scum, froth, Lat. imber rain; also Sk. ambha water, Gr. <at>o)/mbros</at> rain, Oir ambu water]. A (dense & dark) cloud, a cloudy mass A <smallcaps>ii.</smallcaps> 53 = Vin <smallcaps>ii.</smallcaps> 295 = Miln 273 in list of to things that obscure moon-- & sunshine, viz. <b>abbhaŋ mahikā</b> (mahiyā A) <b>dhū- marajo</b> (megho Miln), <b>Rāhu</b> . This list is referred to at SnA 487 & VvA 134. S <smallcaps>i.</smallcaps> 101 (°sama pabbata a mountain like a thunder--cloud); J <smallcaps>vi.</smallcaps> 581 (abbhaŋ rajo acchādesi); Pv <smallcaps>iv.</smallcaps> 3 <superscript>9</superscript> (nīl° = nīla--megha PvA 251). As f. <b>abbhā</b> at Dhs 617 & DhsA 317 (used in sense of adj. \"dull\"; DhsA expl <superscript>s.</superscript> by valāhaka); perhaps also in <b>abbhāmatta</b> . <br /><b>--kūṭa</b> the point or summit of a storm--cloud Th 1, 1064; J <smallcaps>vi.</smallcaps> 249, 250; Vv 1 <superscript>1</superscript> (= valāhaka--sikhara VvA 12). <b>--ghana</b> a mass of clouds, a thick cloud It 64; Sn 348 (cp. SnA 348). <b>--paṭala</b> a mass of clouds DhsA 239. <b>--mutta</b> free from clouds Sn 687 (also as abbhāmutta Dh 382). <b>--saŋvilāpa</b> thundering S <smallcaps>iv.</smallcaps> 289.

The Intuition

LM 1

LM 2

[n. ag. fr. abhijjhita in med. function] one who covets M <smallcaps>i.</smallcaps> 287 (T. abhijjhātar, v. l. °itar) = A <smallcaps>v.</smallcaps> 265 (T. °itar, v. l. °ātar).

LM 3

LM 1

LM 2

[] M <smallcaps>i.</smallcaps> 287 (T. v. l.) = A <smallcaps>v.</smallcaps> 265 (T., v. l.).

LM 3

n. ag. fr. in med. function one who covets

abhijjhita abhijjhātar °itar °ātar

The Approach

N-GRAM Language Model

P(w_i|w_1,\ldots,w_{i-1}) =
\begin{cases}
d_{w_{i-n+1},\ldots,w_i} \frac{C(w_{i-n+1},\ldots,w_i)}{C(w_{i-n+1},\ldots,w_{i-1})} & \text{if } C(w_{i-n+1},\ldots,w_i) > 0 \\
\alpha_{w_{i-n+1},\ldots,w_{i-1}} P(w_i|w_{i-n+2},\ldots,w_{i-1}) & \text{otherwise}
\end{cases}

n-gram probability
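
The back-off idea in the n-gram probability can be sketched in a few lines of Python. This is a simplified illustration, not the model used in the talk: it is restricted to bigrams, it uses a fixed absolute discount (the hypothetical `discount` parameter) instead of the discount factor d, and the names `train_bigram` and `backoff_prob` are made up for this example.

```python
from collections import Counter

def train_bigram(text):
    """Count unigrams and bigrams over a sequence of symbols."""
    unigrams = Counter(text)
    bigrams = Counter(zip(text, text[1:]))
    return unigrams, bigrams

def backoff_prob(model, prev, cur, discount=0.5):
    """P(cur | prev) with back-off to the unigram distribution.

    Assumption: absolute discounting stands in for the factor d.
    """
    unigrams, bigrams = model
    total = sum(unigrams.values())
    # Symbols observed after `prev`, with their bigram counts.
    followers = {b: c for (a, b), c in bigrams.items() if a == prev}
    context_count = sum(followers.values())
    if context_count == 0:
        # Unknown context: use the plain unigram estimate.
        return unigrams[cur] / total
    if cur in followers:
        # Seen bigram: discounted relative frequency.
        return (followers[cur] - discount) / context_count
    # Unseen bigram: share the reserved mass (the role of alpha)
    # among unseen symbols in proportion to their unigram probability.
    reserved = discount * len(followers) / context_count
    unseen_mass = 1.0 - sum(unigrams[b] / total for b in followers)
    if unseen_mass < 1e-12:
        return 0.0
    return reserved * (unigrams[cur] / total) / unseen_mass

model = train_bigram("abracadabra")
print(backoff_prob(model, "a", "b"))  # seen bigram
print(backoff_prob(model, "a", "r"))  # unseen bigram, backed off
```

With this choice of discounting, the probabilities of all known symbols after a seen context sum to one, which is the purpose of the back-off weight.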

N-GRAM Language Model

P(w) = \frac{1}{\sum_{i=2}^{n}|\log P(w_i|w_{i-2},w_{i-1})|}

word probability
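
As a rough illustration of the word probability formula, the sketch below scores a word by the inverse of the summed absolute log-probabilities of its characters given the two preceding characters. Several things here are assumptions rather than the talk's setup: the w_i are treated as characters, each word is padded with two start symbols (`^^`) so every character has a context, and add-one smoothing replaces the back-off model. The function names are hypothetical.

```python
import math
from collections import Counter

def train_trigram(words):
    """Collect character trigram and context counts from a word list."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for w in words:
        padded = "^^" + w  # assumption: two start symbols as padding
        vocab.update(w)
        for i in range(2, len(padded)):
            tri[padded[i - 2:i + 1]] += 1
            ctx[padded[i - 2:i]] += 1
    return tri, ctx, vocab

def char_prob(model, context, ch):
    """Add-one smoothed P(ch | context); a stand-in for back-off."""
    tri, ctx, vocab = model
    return (tri[context + ch] + 1) / (ctx[context] + len(vocab) + 1)

def word_score(model, word):
    """1 / sum of |log P(w_i | w_{i-2}, w_{i-1})| over the word."""
    padded = "^^" + word
    cost = sum(
        abs(math.log(char_prob(model, padded[i - 2:i], padded[i])))
        for i in range(2, len(padded))
    )
    return 1.0 / cost if cost > 0 else float("inf")

english = train_trigram(["word", "world", "work", "sword"])
print(word_score(english, "word"))
print(word_score(english, "palabra"))
```

A word that resembles the training data accumulates a smaller cost and therefore a higher score. Longer words accumulate more cost, so scores are best compared for the same word across different models.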

[Animation: the sequence "word word mot palabra word palabra" is stepped through frame by frame; the scores displayed across frames are 0.74, 0.02, 0.01, 0.02, 0.89, 0.41 and 0.35. The final frame shows the groups "word word", "word", "mot", "palabra", "palabra".]

The Catch

mot palabra palabra palabra palabras...

Let's reverse it!

mot palabra palabra palabra palabras...

Forward/Backwards generation

Merge most similar models

Similarity measure: Unigram distribution

Final models
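
The merge step above could look roughly like the following sketch. The slides only say that similarity is based on the unigram distribution; the use of character unigrams, cosine similarity, and the representation of a model as the list of words assigned to it are all assumptions made for illustration, as are the function names.

```python
import math
from collections import Counter
from itertools import combinations

def unigram_distribution(words):
    """Relative frequency of each character in a model's words."""
    counts = Counter("".join(words))
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def cosine(p, q):
    """Cosine similarity of two sparse distributions (assumed measure)."""
    dot = sum(v * q.get(c, 0.0) for c, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def merge_most_similar(models):
    """Merge the two models whose unigram distributions are closest."""
    if len(models) < 2:
        return list(models)
    dists = [unigram_distribution(m) for m in models]
    i, j = max(combinations(range(len(models)), 2),
               key=lambda ij: cosine(dists[ij[0]], dists[ij[1]]))
    merged = models[i] + models[j]
    rest = [m for k, m in enumerate(models) if k not in (i, j)]
    return rest + [merged]

models = [["word", "world"], ["palabra", "palabras"], ["words", "sword"]]
print(merge_most_similar(models))
```

Repeating the merge until the remaining models are sufficiently dissimilar would yield the final models; the stopping criterion is not given in the slides and is left open here.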

word word mot palabra palabra palabra word

word-model assignment
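
A minimal sketch of word-model assignment: each word goes to the model that scores it highest. To keep the example short, the scorers here are toy character-unigram models with add-one smoothing, not the trigram models from the talk; `unigram_scorer`, `assign`, and the model labels are hypothetical.

```python
import math
from collections import Counter

def unigram_scorer(training_words):
    """Build a toy scoring function from character unigram counts."""
    counts = Counter("".join(training_words))
    total = sum(counts.values())
    vocab = len(counts)

    def score(word):
        # Inverse of the summed absolute log-probabilities.
        cost = sum(
            abs(math.log((counts[c] + 1) / (total + vocab + 1)))
            for c in word
        )
        return 1.0 / cost if cost else 0.0

    return score

def assign(words, scorers):
    """Pair each word with the name of its best-scoring model."""
    return [(w, max(scorers, key=lambda name: scorers[name](w)))
            for w in words]

scorers = {
    "LM 1": unigram_scorer(["word", "world", "would"]),
    "LM 2": unigram_scorer(["palabra", "palabras", "hablar"]),
}
print(assign(["word", "palabra"], scorers))
```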

The Results

[n. ag. fr. abhijjhita in med. function] one who covets M <smallcaps>i.</smallcaps> 287 (T. abhijjhātar, v. l. °itar) = A <smallcaps>v.</smallcaps> 265 (T. °itar, v. l. °ātar).

Μόλις ψήφισα αυτή τη λύση Internet of Things, στο διαγωνισμό BUSINESS IT EXCELLENCE.

Demain #dhiha6 Keynote 18h @dhiparis "The collective dynamics of science-publish or perish; is it all that counts?" par David

Food and breuvages in Edmonton are ready to go, just waiting for the fans #FWWC2015 #bilingualism

Buna dabo naw (coffee is our bread).

Thank you for your attention


Questions?
