Ryan Stuart, CTO, Kapiche
+RyanStuart85
@rstuart85
+KrisRogers
Follow along at home!
{
'quick': [(1, 1, 4)],
'brown': [(1, 2, 6)],
'fox' : [(1, 3, 12)],
'jumps': [(1, 4, 16)],
'over' : [(1, 5, 22)],
'lazy' : [(1, 7, 31)],
'dog' : [(1, 8, 26)],
}
id | Gender | Location | Date     | Comment
1  | Male   | Brisbane | 01/12/14 | The meetup is awesome. I'd prefer caviar instead of pizza though.
2  | Female | Brisbane | 02/12/14 | Everything is great but I don't trust the free WiFi. #BigBrother.
{
'male' : [(1,)],
'female' : [(2,)],
}
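A rough sketch of how postings like that could be built from the two survey rows above (the record layout and helper code here are my own illustration, not Caterpillar's API):

# Toy sketch: index the categorical fields of the two survey rows above.
rows = [
    (1, {'gender': 'Male', 'location': 'Brisbane'}),
    (2, {'gender': 'Female', 'location': 'Brisbane'}),
]

field_indexes = {}
for doc_id, fields in rows:
    for name, value in fields.items():
        postings = field_indexes.setdefault(name, {})
        postings.setdefault(value.lower(), []).append((doc_id,))

print field_indexes['gender']  # {'male': [(1,)], 'female': [(2,)]}

The free-text Comment field is handled differently: it runs through an analyser first.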
We analyse each field:
Input: "The quick brown fox jumps over the lazy dog."
Token Stream: [ 'the', 'quick', 'brown', 'fox', 'jumps', 'over', ... ]
Post Filters: [ 'quick', 'brown', 'fox', 'jump', 'over', ... ]
Token Info: position, start index, length (analyser dependent)
Result: "quick": [ (doc_id, position, start, end,) ]
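A minimal sketch of that pipeline (the tokeniser, stopword list and "stemmer" below are toys of my own, not Caterpillar's actual analyser):

import re

STOPWORDS = {'the'}

def analyse(doc_id, text):
    # Tokenise, drop stopwords, crudely stem, and emit postings of the
    # form described above: token -> [(doc_id, position, start, end,)].
    postings = {}
    for position, match in enumerate(re.finditer(r'\w+', text.lower())):
        token = match.group()
        if token in STOPWORDS:
            continue
        if token.endswith('s'):  # toy "stemmer": jumps -> jump
            token = token[:-1]
        postings.setdefault(token, []).append(
            (doc_id, position, match.start(), match.end(),))
    return postings

print analyse(1, "The quick brown fox jumps over the lazy dog.")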
gender: 'male' and comment: 'pizza'
This slide is entirely my own opinion and could well be wrong!
Lucene is hard to change at low levels: the index format is too rigid.
Michael McCandless - Committer, PMC member Lucene/Solr
import os
import shutil
import tempfile
from caterpillar.processing.index import IndexWriter, IndexConfig, IndexReader
from caterpillar.processing.schema import TEXT, Schema
from caterpillar.searching.query.querystring import QueryStringQuery
from caterpillar.storage.sqlite import SqliteStorage
path = tempfile.mkdtemp()
try:
    index_dir = os.path.join(path, "examples")
    with open('caterpillar/test_resources/alice.txt', 'r') as f:
        data = f.read()
    with IndexWriter(index_dir, IndexConfig(SqliteStorage, Schema(text=TEXT))) as writer:
        writer.add_document(text=data)
    with IndexReader(index_dir) as reader:
        searcher = reader.searcher()
        results = searcher.search(QueryStringQuery('W*e R?bbit and (thought or little^1.5)'))
        print "Query: 'W*e R?bbit and (thought or little^1.5)'"
        print "Retrieved {} of {} matches".format(len(results), results.num_matches)
finally:
    shutil.rmtree(path)
8 GB 1600 MHz DDR3
APPLE SSD SM512E - 14.1 MB/s Random 4k read/writes (benchmark)
import os

from whoosh.fields import TEXT, Schema, ID
from whoosh.index import create_in

schema = Schema(
    title=TEXT(stored=True),
    text=TEXT(stored=True)
)
ix = create_in(index, schema)  # Create index
writer = ix.writer()
for article in articles:
    writer.add_document(
        title=unicode(article[1], 'utf-8'),
        text=unicode(article[4], 'utf-8')
    )
writer.commit()
The basic code
Actual Code: here.
Result - 18.160s
import lucene
from java.io import File
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, FieldType
from org.apache.lucene.index import FieldInfo, IndexWriter, IndexWriterConfig
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.util import Version
lucene.initVM()  # start the JVM before using any Lucene classes

store = SimpleFSDirectory(File(index_dir))
analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)
config = IndexWriterConfig(Version.LUCENE_CURRENT, analyzer)
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
writer = IndexWriter(store, config)
title = FieldType()
title.setIndexed(True)
title.setStored(True)
title.setTokenized(False)
title.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS)
text = FieldType()
text.setIndexed(True)
text.setStored(True)
text.setTokenized(True)
text.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
for article in articles:
    doc = Document()
    doc.add(Field("title", article[1], title))
    doc.add(Field("text", article[2], text))
    writer.addDocument(doc)
The (not really) basic code
Actual Code: here.
Result - 1.666s
from caterpillar.processing.index import IndexWriter, IndexConfig
from caterpillar.processing.schema import Schema, TEXT, CATEGORICAL_TEXT
from caterpillar.storage.sqlite import SqliteStorage

config = IndexConfig(
    SqliteStorage,
    Schema(
        title=TEXT(indexed=False, stored=True),
        text=TEXT(indexed=True, stored=True),
        url=CATEGORICAL_TEXT(stored=True)
    )
)
with IndexWriter('/tmp/cat-index', config) as writer:  # Create index
    for article in articles:
        writer.add_document(title=article[1], text=article[2])
The basic code
Actual Code: here.
Result - 15.655s
Result - 10.183s
Remove sentence tokenisation.
Use 2 processes to index the articles.
import os

import begin
from concurrent import futures

from caterpillar.processing.index import IndexWriter, IndexConfig
from caterpillar.processing.schema import Schema, TEXT, CATEGORICAL_TEXT

def index(path, articles):
    # Same index code as before...omitted to save space

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

@begin.start
def run(articles):
    chunks = [articles[:len(articles)/2], articles[len(articles)/2:]]
    paths = [
        os.path.join('/tmp/wiki-index', 'index-{}'.format(i))
        for i in range(len(chunks))
    ]
    with futures.ProcessPoolExecutor() as pool:
        pool.map(index, paths, chunks)
Result - 6.299s
>>> import timeit
>>> timeit.timeit('a = 5')
0.03456282615661621
>>> timeit.timeit('foo()', 'def foo(): a = 5')
0.14389896392822266
The practical implication of this for me is that sometimes you need to forgo a "sexier" API in favour of fewer nested calls. I was overly addicted to the neat APIs possible in Python because of the bitterness instilled in me by Java, so I've had to rein that in. Still more to do on that front in Caterpillar.
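As a purely hypothetical illustration (these class and method names are mine, not Caterpillar's): a "pretty" design that makes a couple of method calls per token pays that call overhead on every single token, while a flatter loop pays it once per document.

class PrettyAnalyser(object):
    # Neat, nested API: several function calls per token.
    def analyse(self, tokens):
        return [self._process(t) for t in tokens]

    def _process(self, token):
        return self._stem(self._lowercase(token))

    def _lowercase(self, token):
        return token.lower()

    def _stem(self, token):
        return token[:-1] if token.endswith('s') else token

class FlatterAnalyser(object):
    # Uglier, but the per-token work is inlined into one loop.
    def analyse(self, tokens):
        out = []
        for token in tokens:
            token = token.lower()
            if token.endswith('s'):
                token = token[:-1]
            out.append(token)
        return out

Both produce the same result; the second just avoids a few function calls per token, which is exactly the overhead the timeit numbers above are measuring.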
positions = dict()
for token in tokens:
    if token in positions:
        positions[token].append((token.index, token.position,))
    else:
        positions[token] = [(token.index, token.position,)]
The above code always performs two dictionary operations per token: a membership test, then an append or an assignment. If you are doing this thousands of times in an inner loop, it will add up! It doesn't need to be this way.
positions = dict()
for token in tokens:
    try:
        positions[token].append((token.index, token.position,))
    except KeyError:
        positions[token] = [(token.index, token.position,)]
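A quick way to measure the difference on your own data (the toy token list below is mine; which version wins depends on how often the key already exists):

import timeit

setup = "tokens = ['quick', 'brown', 'fox', 'jump', 'over'] * 200"

look_before_you_leap = '''
positions = dict()
for i, token in enumerate(tokens):
    if token in positions:
        positions[token].append(i)
    else:
        positions[token] = [i]
'''

ask_forgiveness = '''
positions = dict()
for i, token in enumerate(tokens):
    try:
        positions[token].append(i)
    except KeyError:
        positions[token] = [i]
'''

print timeit.timeit(look_before_you_leap, setup, number=1000)
print timeit.timeit(ask_forgiveness, setup, number=1000)

The try/except version tends to win when the key is almost always present, because the exception path only fires the first time each token is seen.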
You can think of map as a for loop moved into C code. The only restriction is that the "loop body" of map must be a function call. Besides the syntactic benefit of list comprehensions, they are often as fast or faster than equivalent use of map.
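For example (a toy case of my own; when the mapped function is a C builtin, map can still come out ahead):

import timeit

# With map() the "loop body" has to be a function call (here a lambda)...
print timeit.timeit('map(lambda x: x * 2, xrange(100))', number=100000)
# ...while a list comprehension can inline the expression and skip the call.
print timeit.timeit('[x * 2 for x in xrange(100)]', number=100000)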
This extends to a more general rule: don't write something yourself if there is an equivalent builtin. Chances are the builtin is written in C and much faster than your pure Python implementation.
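For instance (my own toy comparison), summing a list with the sum() builtin versus a hand-rolled loop:

import timeit

hand_rolled = '''
total = 0
for x in numbers:
    total += x
'''

print timeit.timeit(hand_rolled, 'numbers = range(1000)', number=10000)
print timeit.timeit('sum(numbers)', 'numbers = range(1000)', number=10000)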
This can make a big difference to your everyday code.
>>> timeit.timeit('2 in my_list', 'my_list = [1,2,3,4,5]')
0.05846595764160156
>>> timeit.timeit('2 in my_set', 'my_set = set([1,2,3,4,5])')
0.04645490646362305
Also, don't forget about collections.deque!
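For example (another toy comparison of my own), inserting at the front of a list is O(n) per insert, while appendleft on a deque is O(1):

import timeit

print timeit.timeit('q.insert(0, None)', 'q = []', number=100000)
print timeit.timeit('q.appendleft(None)',
                    'from collections import deque; q = deque()',
                    number=100000)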
Time Complexity article on the Python Wiki: here.
Lecture on Complexity of Python Operations: here.
PythonSpeed article on the Wiki: here.
import cProfile
import re
cProfile.run('re.compile("foo|bar")')

Produces:

         197 function calls (192 primitive calls) in 0.002 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.001    0.001 <string>:1(<module>)
        1    0.000    0.000    0.001    0.001 re.py:212(compile)
        1    0.000    0.000    0.001    0.001 re.py:268(_compile)
        1    0.000    0.000    0.000    0.000 sre_compile.py:172(_compile_charset)
        1    0.000    0.000    0.000    0.000 sre_compile.py:201(_optimize_charset)
        4    0.000    0.000    0.000    0.000 sre_compile.py:25(_identityfunction)
      3/1    0.000    0.000    0.000    0.000 sre_compile.py:33(_compile)
import cProfile
import pstats

def do_some_work_yo():
    print 'Some work'

pr = cProfile.Profile()
pr.runcall(do_some_work_yo)
ps = pstats.Stats(pr)
ps.dump_stats("storage_benchmark.profile")
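You can also inspect the stats in place rather than dumping them to a file, e.g.:

ps.sort_stats('cumulative').print_stats(10)  # 10 most expensive calls by cumulative time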
Demo......
You need to tell it which functions to profile. The easiest way to start is via the kernprof script.
$ kernprof -l script_to_profile.py
kernprof inserts a LineProfiler instance into the __builtins__ namespace under the name profile. It can be used as a decorator, so adjust your code accordingly:
@profile
def slow_function(a, b, c):
...
kernprof saves its results to a file called <script_name>.lprof. This is the output you run line_profiler on.
$ python -m line_profiler script_to_profile.py.lprof
Pystone(1.1) time for 50000 passes = 2.48
This machine benchmarks at 20161.3 pystones/second
Wrote profile results to pystone.py.lprof
Timer unit: 1e-06 s
File: pystone.py
Function: Proc2 at line 149
Total time: 0.606656 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
149 @profile
150 def Proc2(IntParIO):
151 50000 82003 1.6 13.5 IntLoc = IntParIO + 10
152 50000 63162 1.3 10.4 while 1:
153 50000 69065 1.4 11.4 if Char1Glob == 'A':
154 50000 66354 1.3 10.9 IntLoc = IntLoc - 1
155 50000 67263 1.3 11.1 IntParIO = IntLoc - IntGlob
156 50000 65494 1.3 10.8 EnumLoc = Ident1
157 50000 68001 1.4 11.2 if EnumLoc == Ident1:
158 50000 63739 1.3 10.5 break
159 50000 61575 1.2 10.1 return IntParIO
This slide deck: http://goo.gl/KDY6sm
Caterpillar: https://github.com/Kapiche/caterpillar/
Caterpillar-lsi: https://github.com/Kapiche/caterpillar-lsi
Kapiche: http://kapiche.com