Personal Knowledge Management Powered by Apache Solr

Ruda Zhang

PhD student, Civil Engineering

University of Southern California

2015/08/28

Evolution of personal knowledge management

Pre-digital era
PC era
Web era
AI era

Pre-digital era

notebooks,
library classification;

PC era

digital documents,
"file" system;

Web era

web pages,
"personal wiki"
- gollum,
- DokuWiki,
- wikidPad

AI era

search bar,
NLP stacks (Solr)
- search engine,
- clustering.

Knowledge Management with Apache Solr

Taking enterprise search software for personal use.

Functionality needs

powerful search engine (Lucene)
- rich document indexing (Tika)
- auto-completion, spell check
- entry suggestion
dynamic clustering and labeling by document topics
- integrating note snippets into specific "views", which facilitates long-form writing.
- visualization: treemap (FoamTree), network
navigation: facets, breadcrumbs
cross-referencing / hyperlinking
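
As a rough illustration of the search and navigation items above, here is a minimal sketch that builds a faceted query URL for Solr's standard `/select` handler. The core name `notes`, the facet field `topic`, and the local base URL are assumptions made for illustration, not part of the project. The `spellcheck` parameter only has an effect if a spellcheck component is configured on the handler.

```python
from urllib.parse import urlencode


def build_search_params(query, facet_fields=("topic",), rows=10):
    """Return Solr /select parameters as (name, value) pairs.

    A list of pairs is used so that facet.field can be repeated.
    """
    params = [
        ("q", query),
        ("rows", str(rows)),
        ("wt", "json"),
        ("spellcheck", "true"),  # assumes a spellcheck component is configured
    ]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return params


def build_search_url(query, core="notes", base="http://localhost:8983/solr", **kwargs):
    """Build the full query URL; `core` and `base` are hypothetical defaults."""
    return f"{base}/{core}/select?{urlencode(build_search_params(query, **kwargs))}"


if __name__ == "__main__":
    print(build_search_url("random walk", facet_fields=("topic", "type")))
```

The facet counts returned by such a query are what a UI would use to render the facet navigation and breadcrumbs.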

Tailoring for personal knowledge pool

Data models (mostly schema-less natural language): .txt, .md, .docx, .pptx, .xlsx, .pdf, .html
- minimal tailoring of legacy document markups and schemas.
- users should be editing note contents, not HTML files.

PKM use cases (against the global knowledge space)
- logs: note snippets kept "in the sack" for later in-depth documentation.
- notes: personally organized notes, beyond the one-step reach of a Google search.
- literature: References for attribution. [Significant for critical writing.]

Apache Solr

one of the most popular enterprise search platforms
- Netflix, Digg, NASA, Slack ...
many library catalogs also run on Solr
built on Apache Lucene, an information retrieval software library

Indexing

Indexing rich documents with Apache Tika.
Apache Tika has a wide spectrum of supported formats.
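
A minimal sketch of how a rich document could be sent to Solr for Tika-based extraction, assuming Solr's extracting request handler is enabled at `/update/extract`. The core name `notes`, the local base URL, and the choice of the file path as document ID are assumptions for illustration. The request is only built here; sending it requires a running Solr instance.

```python
import mimetypes
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_request(path, data=None, core="notes",
                          base="http://localhost:8983/solr", commit=True):
    """Build a POST request that asks Solr (via Tika) to index one file.

    `core` and `base` are hypothetical defaults. The file path doubles
    as the document ID, passed through the `literal.id` parameter.
    """
    path = Path(path)
    if data is None:
        data = path.read_bytes()
    params = {
        "literal.id": path.as_posix(),
        "commit": "true" if commit else "false",
    }
    # Fall back to a generic type when the extension is unknown.
    content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    url = f"{base}/{core}/update/extract?{urlencode(params)}"
    return Request(url, data=data, headers={"Content-Type": content_type}, method="POST")


def index_file(path):
    """Send the file to Solr; needs a reachable Solr server."""
    with urlopen(build_extract_request(path)) as response:
        return response.status
```

Committing on every file is convenient for a small personal collection; for bulk imports it is usually cheaper to pass `commit=False` and commit once at the end.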

Clustering

Carrot² is a document thematic clustering engine, written in Java and distributed under the BSD license.
- Clustering algorithms: Lingo, STC, Lingo3G™.
- Visualization: FoamTree, Circles
- Search feeds: Lucene, Solr, other search APIs.
Front-ends
- Carrot² Web Application
- Carrot² Document Clustering Workbench
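
A minimal sketch of turning clustering output into groups for a treemap-style view. It assumes the search response carries a `clusters` list whose entries have `labels` and `docs`, which is the shape Solr's clustering component is documented to return; the exact shape may vary by version. The sample response and the `label`/`weight` output keys are made up for illustration.

```python
def clusters_to_groups(response):
    """Convert a clustering response into labeled, weighted groups.

    Each group is weighted by its number of documents, and groups are
    returned largest first.
    """
    groups = []
    for cluster in response.get("clusters", []):
        label = ", ".join(cluster.get("labels", [])) or "(unlabeled)"
        docs = list(cluster.get("docs", []))
        groups.append({"label": label, "weight": len(docs), "docs": docs})
    return sorted(groups, key=lambda group: group["weight"], reverse=True)


# Hypothetical response fragment, not real output.
SAMPLE_RESPONSE = {
    "clusters": [
        {"labels": ["Traffic Flow"], "docs": ["log-12"]},
        {"labels": ["Random Walk"], "docs": ["note-3", "note-7", "paper-1"]},
    ]
}

if __name__ == "__main__":
    for group in clusters_to_groups(SAMPLE_RESPONSE):
        print(group["label"], group["weight"])
```

Weighting by document count is only one option; a cluster score, where the response provides one, could be used instead.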

Web UI

Velocity Search UI [built-in]
Project Blacklight, a Ruby on Rails Engine plugin.
Lucidworks Fusion, a platform for building enterprise search applications.

Shredder: current status

figuring out file type support (.md, .jpg, etc.)
building UI
stay tuned via my GitHub repo
