Hello!

My name is @matason

 

this is a series of words

and,this,is,a,series,of,words

concatenated together with commas...

How would you get back to individual words?

JavaScript

'this,is,a,series,of,words'.replace(/,/g, ' ');
this is a series of words

Ruby

'this,is,a,series,of,words'.gsub(/,/, ' ')
this is a series of words

Python

'this,is,a,series,of,words'.replace(',', ' ')
this is a series of words

PHP

str_replace(',', ' ', 'this,is,a,series,of,words');
this is a series of words
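
The examples above swap each comma for a space and hand back one string. If you want the words as separate items, splitting on the delimiter is the usual route. A minimal Python sketch:

```python
# Split on the delimiter to get a list of words rather than one string.
sentence = 'this,is,a,series,of,words'
words = sentence.split(',')
print(words)
```

The other languages have close equivalents (for example `split` in JavaScript and Ruby, `explode` in PHP).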

But what about...

thisisaseriesofwords

concatenated together with thin air...

How would you get back to individual words?

Any ideas?

You will need

  • Sphinx - http://sphinxsearch.com/

Sphinx utilities

  • indexer - to produce a frequency dictionary
  • wordbreaker - to break words

What's a frequency dictionary?

A file containing a list of words, each with its frequency count:

uk 6445
open 6420 
title 6287 
is 6218
return 6066 
window 5996
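
To poke around in a dictionary like this yourself, something along these lines may help. It is only a sketch: it assumes one whitespace-separated "word count" pair per line, as in the sample above, and `parse_frequency_dictionary` is a made-up helper name, not part of Sphinx.

```python
def parse_frequency_dictionary(text):
    """Parse 'word count' lines into a {word: count} dict."""
    frequencies = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, count = parts
        frequencies[word] = int(count)
    return frequencies


# The sample lines from the slide above.
sample = """uk 6445
open 6420
title 6287
is 6218
return 6066
window 5996
"""

frequencies = parse_frequency_dictionary(sample)
print(frequencies['is'])  # 6218
```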

How can I create one?

$ indexer --buildstops demo.dict 100000 --buildfreqs demo -c sphinx.conf

sphinx.conf

  source demo
  {
    type = xmlpipe2
    xmlpipe_command = cat source.xml
    xmlpipe_fixup_utf8 = 1
  }
  index demo
  {
    source = demo
    path = /tmp/demo
  }
  indexer
  {
    mem_limit = 128M
  }

source.xml

  <?xml version="1.0" encoding="utf-8"?>
  <sphinx:docset>

    <sphinx:schema>
      <sphinx:field name="content"/>
    </sphinx:schema>

    <sphinx:document id="1">
      <content><![CDATA[Document content here]]></content>
    </sphinx:document>

    <sphinx:killlist>
      <id>1</id>
    </sphinx:killlist>

  </sphinx:docset>

How do I use wordbreaker?

$ echo thisisaseriesofwordsthatcannotbesplit | wordbreaker --dict demo.dict split

BingoBango!

this is a series of words that cannot be split
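
Curious how a frequency dictionary can drive the split? Here is one common technique, sketched in Python: a dynamic-programming search for the most probable sequence of dictionary words under a unigram model. This is an illustration only, not a description of how wordbreaker works internally; `split_words` and the toy frequency counts are invented for the example.

```python
import math


def split_words(text, frequencies):
    """Return the most probable split of text into dictionary words, or None."""
    total = sum(frequencies.values())
    max_len = max(len(word) for word in frequencies)
    # best[i] holds (log-probability of the best split of text[:i],
    #                start index of the last word in that split).
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in frequencies and best[start][0] > -math.inf:
                score = best[start][0] + math.log(frequencies[word] / total)
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        return None  # text cannot be built from dictionary words
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Toy counts, made up for illustration.
toy_frequencies = {'this': 50, 'is': 60, 'a': 80, 'series': 5, 'of': 70, 'words': 10}
result = split_words('thisisaseriesofwords', toy_frequencies)
print(' '.join(result))  # this is a series of words
```

More frequent words score higher, so when several splits are possible the one built from common words tends to win.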

In the real world...

I used this approach whilst migrating away from a legacy CMS. I was able to take URLs like this:

/toplevelsection/subsectionlevel/afurtherlevel/somepage

and convert them to URLs like this:

/top-level-section/subsection-level/a-further-level/some-page
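
The URL conversion above might be wired up roughly like this. It is a sketch under assumptions, not the migration code itself: `hyphenate_path` is a made-up name, and `split_words` is a hypothetical callable that turns a run-together string into a list of words (in practice it could wrap a call to wordbreaker). A lookup table stands in for it here so the example runs on its own.

```python
def hyphenate_path(path, split_words):
    """Rewrite each path segment as hyphen-separated words.

    split_words is a hypothetical callable: it takes a run-together
    string and returns its words as a list.
    """
    segments = []
    for segment in path.split('/'):
        if segment:  # leave the empty piece before the leading slash alone
            segment = '-'.join(split_words(segment))
        segments.append(segment)
    return '/'.join(segments)


# Stand-in splitter for illustration; a real one would ask wordbreaker.
known_splits = {
    'toplevelsection': ['top', 'level', 'section'],
    'subsectionlevel': ['subsection', 'level'],
    'afurtherlevel': ['a', 'further', 'level'],
    'somepage': ['some', 'page'],
}

new_path = hyphenate_path(
    '/toplevelsection/subsectionlevel/afurtherlevel/somepage',
    lambda segment: known_splits[segment],
)
print(new_path)  # /top-level-section/subsection-level/a-further-level/some-page
```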

So if you ever have

astringthatneedstobesplitintoitsindividualcomponentwords

Remember Sphinx!

Questions?

Resources

http://sphinxsearch.com/
http://sphinxsearch.com/blog/2013/01/29/a-new-tool-in-the-trunk-wordbreaker/

Where there's a wordbreaker there's a way

By matason
