Twitter Hashtag Segmentation

Piyush Bansal


What is a #hashtag?



  • People put the # symbol before a relevant keyword or phrase in their tweet to categorise it.
  • The keyword or phrase starting with # is called a hashtag.

#homesandgarden 

#jtrocks

#happy

#dowhatyoulove

Understanding the problem.


  • The hashtag #lovewhatyoudo can be segmented into "love what you do".
  • Automating this process without human intervention is the key idea of the problem.

Why is this problem so important?



  • Holds huge potential for improving search over tweets on Twitter and posts on Google+ and Facebook.

  • Can be used to solve related problems like URL segmentation, e.g. to classify potentially unsafe blogs or websites beforehand.

  • Can greatly help sentiment analysis of tweets by understanding the hashtags being used and their meaning.

Challenges of the problem


  1. Tweets contain excessively noisy data.
  2. They lack the punctuation and capitalisation rules of grammar.
  3. Phonetic spellings and acronyms pose a challenge to the corpus being used.





One (major?) challenge


Looking at some of the hashtags, we observe something very important...

#nothere
#alwaysahead
#alwaysagain

can all be segmented into multiple semantically correct segmentations. How do we establish which of these is more probable?

Solution design.

[Normalisation of tweets]

  • To cope with noise in the data, we implement syntactic normalisation of tweets.

  • Phonetic spellings haven't been handled yet, but could be corrected using edit distance.
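As a sketch of the edit-distance idea, the standard Levenshtein distance (computed by dynamic programming) could map a phonetic spelling to its nearest dictionary word; the example words here are illustrative, not from the actual corpus.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b via dynamic programming."""
    m, n = len(a), len(b)
    # dp[i][j] = distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[m][n]

print(edit_distance("luv", "love"))  # → 2 (substitute u→o, insert e)
```

A phonetic spelling would then be replaced by the in-vocabulary word minimising this distance.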




Solution design.

[Normal segmentation]


  • Finding the most probable segmentation of a given string is not a big challenge; it can be solved with an O(n^2) dynamic programming solution known classically as text segmentation.

  • But that alone does not solve the problem: as we already saw, a hashtag can have multiple possible segmentations, and the criterion for selecting the best one is nothing but the unigram data model, which cannot be relied upon.
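A minimal sketch of this O(n^2) DP, assuming a toy unigram table; the counts and the unknown-word penalty are illustrative assumptions, not the actual corpus.

```python
from math import log10

# Toy unigram counts (illustrative assumption; a real system would load
# a large unigram corpus).
UNIGRAMS = {"love": 50, "what": 80, "you": 100, "do": 60, "a": 90,
            "always": 40, "ahead": 20, "head": 30}
TOTAL = sum(UNIGRAMS.values())

def word_cost(w):
    """-log10 probability of a word; unseen words get a heavy penalty."""
    if w in UNIGRAMS:
        return -log10(UNIGRAMS[w] / TOTAL)
    return 10.0 + len(w)  # assumed penalty for unknown words

def segment(text):
    """O(n^2) DP: best[i] = minimum cost of segmenting text[:i]."""
    n = len(text)
    best = [0.0] + [float("inf")] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            c = best[j] + word_cost(text[j:i])
            if c < best[i]:
                best[i], back[i] = c, j
    # Walk the back-pointers to reconstruct the segmentation.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(segment("lovewhatyoudo"))  # → ['love', 'what', 'you', 'do']
```

This returns only the single unigram-optimal segmentation, which is exactly the limitation noted above.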

Solution design.

[Looking into existing solutions.]


  • Faced with this problem, I looked for solutions in areas where it already exists, but could not find a satisfying solution that could be put to use in our case.






An Observation.


Variable Length Window Sliding Technique.



An important observation is that word lengths in the unigram corpus closely follow a Zipf-like distribution.

  • Specifically, words of length 3, 4 and 5 are more probable than words of any other length.

  • We consider an imaginary window whose length varies over [LENGTH_START, LENGTH_END], where LENGTH_START is the length with the maximum probability.

Variable Length Window Sliding Technique.


  • If the probability of the text lying inside the window is above a threshold probability called DELTA, we treat that text's probability as 1 and then segment (using DP) the text to its left and right.

  • We then take that segmentation's score as the sum of the -log10 probabilities of its left and right parts, and insert it with its unigram score into a dictionary to avoid duplicates.

  • That way we obtain some of the most probable segmentations (with no guarantee of optimality, though) according to the unigram model.

Variable Length Window Sliding Technique.


  • Finally, to get a semantically better result we rely on bigram data, not just unigram data.

  • Out of the best 5 unigram segmentations, we score these segmentations using bigram data.
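Putting the sliding window and the bigram re-scoring together, a self-contained sketch might look like this. The values of DELTA, LENGTH_START and LENGTH_END, and the toy unigram/bigram counts, are illustrative assumptions, not the actual system's parameters.

```python
from math import log10

# Toy unigram/bigram counts (illustrative assumptions, not the real corpus).
UNIGRAMS = {"always": 60, "a": 100, "head": 30, "ahead": 25}
BIGRAMS = {("always", "ahead"): 12, ("always", "a"): 3, ("a", "head"): 2}
U_TOTAL = sum(UNIGRAMS.values())

DELTA = 0.0001                   # threshold probability for the windowed text
LENGTH_START, LENGTH_END = 4, 7  # assumed window-length range

def p1(w):
    """Unigram probability of a word (0 if unseen)."""
    return UNIGRAMS.get(w, 0) / U_TOTAL

def cost(w):
    """-log10 unigram probability; heavy penalty for unknown words."""
    return -log10(p1(w)) if p1(w) > 0 else 10.0 + len(w)

def dp_segment(text):
    """Classic O(n^2) DP segmentation minimising total unigram cost."""
    best = [0.0] + [float("inf")] * len(text)
    back = [0] * (len(text) + 1)
    for i in range(1, len(text) + 1):
        for j in range(i):
            c = best[j] + cost(text[j:i])
            if c < best[i]:
                best[i], back[i] = c, j
    words, i = [], len(text)
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

def window_candidates(tag, k=5):
    """Slide a window of each length over the hashtag; when the windowed
    text clears DELTA, fix it (treat its probability as 1) and
    DP-segment the text on both sides."""
    cands = {}  # keyed by segmentation to avoid duplicates
    for length in range(LENGTH_START, LENGTH_END + 1):
        for start in range(len(tag) - length + 1):
            mid = tag[start:start + length]
            if p1(mid) > DELTA:
                left = dp_segment(tag[:start])
                right = dp_segment(tag[start + length:])
                seg = tuple(left + [mid] + right)
                score = sum(cost(w) for w in left + right)  # -log10 of both sides
                cands[seg] = min(cands.get(seg, float("inf")), score)
    return sorted(cands.items(), key=lambda kv: kv[1])[:k]

def bigram_score(seg):
    """Re-score a candidate segmentation with bigram counts (higher is better)."""
    return sum(BIGRAMS.get(pair, 0) for pair in zip(seg, seg[1:]))

top = window_candidates("alwaysahead")
best = max(top, key=lambda kv: bigram_score(kv[0]))[0]
print(" ".join(best))  # → always ahead
```

With these toy counts, both "always ahead" and "always a head" survive the unigram stage, and the bigram re-scoring picks between them.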



Another Challenge




How do we make sure that we display only relevant results?

Outputting the k best solutions is easy, but it can also be misleading.


#alwaysahead can only mean either of 

  • always ahead
  • always a head

  • No other segmentation, like "alway sahe ad", should be produced.





How do we take care of this?


Observing the unigram and bigram probabilities, I have written a simple expression that regulates the number of outputs.

It is, however, a mostly hard-coded probabilistic expression derived from a few test cases. It needs to be made more robust and accurate.


Future work.



  • Contextual disambiguation could be used, but tweets are very short, so it is not very useful here.

  • Phonetic spellings can be corrected, or a more web-based corpus can be employed.

  • It is possible to segment hashtags like #GSoC13 keeping some heuristics in mind; we can work on such hashtags next.

