Hello!
My name is @matason

this is a series of words
and
this,is,a,series,of,words
concatenated together with commas...
How would you get back to individual words?
JavaScript
'this,is,a,series,of,words'.replace(/,/g, ' ');
this is a series of words
Ruby
'this,is,a,series,of,words'.gsub(/,/, ' ')
this is a series of words
Python
'this,is,a,series,of,words'.replace(',', ' ')
this is a series of words
PHP
str_replace(',', ' ', 'this,is,a,series,of,words');
this is a series of words
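The examples above swap each comma for a space. If a list of individual words is wanted instead, splitting on the delimiter is a small variation. A minimal Python sketch (the variable names `joined` and `words` are just for illustration):

```python
joined = 'this,is,a,series,of,words'

# Split on the comma delimiter to get a list of words.
words = joined.split(',')
print(words)

# Joining with spaces gives the same string as the replace() examples.
print(' '.join(words))
```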
But what about...
thisisaseriesofwords
concatenated together with thin air...
How would you get back to individual words?
Any ideas?
You will need
- Sphinx - http://sphinxsearch.com/
Sphinx utilities
- indexer - to produce a frequency dictionary
- wordbreaker - to break words
What's a frequency dictionary?
A file containing a list of words, each with its frequency count
uk 6445
open 6420
title 6287
is 6218
return 6066
window 5996
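To make the format concrete, here is a rough Python sketch that reads lines like the ones above into a lookup table. The function name `parse_frequency_dictionary` is made up for this example, and it assumes one "word count" pair per line separated by whitespace, as in the sample shown.

```python
def parse_frequency_dictionary(text):
    """Parse 'word count' lines into a dict mapping word -> count."""
    frequencies = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, count = parts
        frequencies[word] = int(count)
    return frequencies


# The sample entries from the slide above.
sample = """uk 6445
open 6420
title 6287
is 6218
return 6066
window 5996"""

frequencies = parse_frequency_dictionary(sample)
print(frequencies["is"])
```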
How can I create one?
$ indexer --buildstops demo.dict 100000 --buildfreqs demo -c sphinx.conf
sphinx.conf
source demo
{
type = xmlpipe2
xmlpipe_command = cat source.xml
xmlpipe_fixup_utf8 = 1
}
index demo
{
source = demo
path = /tmp/demo
}
indexer
{
mem_limit = 128M
}
source.xml
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
<sphinx:schema>
<sphinx:field name="content"/>
</sphinx:schema>
<sphinx:document id="1">
<content><![CDATA[Document content here]]></content>
</sphinx:document>
<sphinx:killlist>
<id>1</id>
</sphinx:killlist>
</sphinx:docset>
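With more than a handful of documents, source.xml is easier to generate than to write by hand. The sketch below is one possible way to do that in Python, not part of Sphinx itself: `build_docset` is a hypothetical helper, it escapes the content rather than wrapping it in CDATA, and it leaves out the kill-list block.

```python
from xml.sax.saxutils import escape


def build_docset(documents):
    """Return an xmlpipe2 docset string for (id, content) pairs."""
    lines = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
        '<sphinx:field name="content"/>',
        "</sphinx:schema>",
    ]
    for doc_id, content in documents:
        lines.append('<sphinx:document id="%d">' % doc_id)
        # escape() replaces &, < and > so the content stays well-formed XML.
        lines.append("<content>%s</content>" % escape(content))
        lines.append("</sphinx:document>")
    lines.append("</sphinx:docset>")
    return "\n".join(lines)


# Illustrative documents only; real content would come from the corpus.
xml_text = build_docset([(1, "this is a series of words"), (2, "fish & chips")])
print(xml_text)
```

Writing `xml_text` to source.xml would let the `xmlpipe_command = cat source.xml` line in sphinx.conf pick it up.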
How do I use wordbreaker?
$ echo thisisaseriesofwordsthatcannotbesplit | wordbreaker --dict demo.dict split
BingoBango!
this is a series of words that cannot be split
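To get a feel for why a frequency dictionary helps, here is a toy splitter in Python. This is not how wordbreaker is implemented; it is a generic dynamic-programming sketch that picks the segmentation whose words have the highest combined log-probability. The function name `split_words` and the counts in `frequencies` are invented for the example.

```python
import math


def split_words(text, frequencies):
    """Split text into dictionary words, maximizing summed log-probability."""
    total = sum(frequencies.values())
    max_len = max(len(word) for word in frequencies)
    # best[i] holds (score, start) for the best segmentation of text[:i].
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in frequencies and best[start][0] > -math.inf:
                score = best[start][0] + math.log(frequencies[word] / total)
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[len(text)][0] == -math.inf:
        return None  # the text cannot be fully covered by dictionary words
    # Walk the stored start positions backwards to recover the words.
    words = []
    end = len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return list(reversed(words))


# Hypothetical counts, standing in for a real demo.dict.
frequencies = {"this": 50, "is": 60, "a": 80, "series": 5, "of": 70, "words": 10}
print(split_words("thisisaseriesofwords", frequencies))
```

A string that cannot be covered by dictionary words returns `None` here, which is a choice made for this sketch only.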
In the real world...
I used this approach whilst migrating away from a legacy CMS. I was able to take URLs like this:
/toplevelsection/subsectionlevel/afurtherlevel/somepage
and convert them to URLs like this:
/top-level-section/subsection-level/a-further-level/some-page
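The conversion works one path segment at a time. A minimal Python sketch of that step, assuming some splitter function is available (for instance, one that wraps wordbreaker): `hyphenate_path`, `lookup_splitter` and the `known` table are all hypothetical stand-ins, not the code used in the migration.

```python
def hyphenate_path(path, split_words):
    """Rewrite each path segment as its words joined by hyphens."""
    segments = []
    for segment in path.split("/"):
        if not segment:
            segments.append(segment)  # keep empty segments from leading/trailing slashes
            continue
        words = split_words(segment)
        # Fall back to the original segment if the splitter returns nothing.
        segments.append("-".join(words) if words else segment)
    return "/".join(segments)


# Stand-in splitter with hard-coded answers, for illustration only.
known = {
    "toplevelsection": ["top", "level", "section"],
    "somepage": ["some", "page"],
}


def lookup_splitter(segment):
    return known.get(segment)


print(hyphenate_path("/toplevelsection/somepage", lookup_splitter))
```

Passing the splitter in as an argument keeps the URL handling separate from whichever tool actually breaks the words.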
So if you ever have
astringthatneedstobesplitintoitsindividualcomponentwords
Remember Sphinx!
Questions?
Resources
http://sphinxsearch.com/
http://sphinxsearch.com/blog/2013/01/29/a-new-tool-in-the-trunk-wordbreaker/
Where there's a wordbreaker there's a way
By matason