Personal Knowledge Management Powered by Apache Solr
Ruda Zhang
PhD student, Civil Engineering
University of Southern California
2015/08/28
Evolution of personal knowledge management
- Pre-digital era
- PC era
- Web era
- AI era
Pre-digital era
- notebooks,
- library classification;
PC era
- digital documents,
- "file" system;
Web era
- webpage,
- "personal wiki": gollum, DokuWiki, wikidPad
AI era
- search bar,
- NLP stacks (Solr): search engine, clustering.
Knowledge Management with Apache Solr
- Taking enterprise search software for personal use.
Functionality Needs
- powerful search engine (Lucene)
- rich document indexing (Tika)
- auto-completion, spell check
- entry suggestion
- dynamic clustering and labeling by document topics
- integrating note snippets into specific "views", which facilitates long-form writing.
- visualization: treemap (FoamTree), network
- navigation: facets, breadcrumbs
- cross-referencing / hyperlinking
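Several of these needs (search, facets) map onto ordinary Solr query parameters. A minimal sketch, assuming a local Solr at `http://localhost:8983/solr`, a hypothetical core named `notes`, and a hypothetical `topic` field; `facet_counts` relies on Solr's default JSON layout, in which each facet field is a flat list alternating value and count:

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # assumed local Solr instance
CORE = "notes"                       # hypothetical core name


def select_url(query, facet_field="topic", rows=10):
    """Build a /select URL that asks for matches plus facet counts."""
    params = [
        ("q", query),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", facet_field),  # hypothetical field name
    ]
    return f"{SOLR}/{CORE}/select?{urlencode(params)}"


def facet_counts(response, facet_field="topic"):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][facet_field]
    return dict(zip(flat[0::2], flat[1::2]))
```

The resulting dict can feed a facet sidebar directly: each key is a facet value, each value the number of matching documents.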
Tailoring for personal knowledge pool
- Data models (mostly schema-less natural language): .txt, .md, .docx, .pptx, .xlsx, .pdf, .html
- minimal tailoring of legacy document markups and schemas.
- users should be editing note contents, not HTML files.
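Leaving legacy documents untouched means the indexer only has to find them. A minimal sketch of that step, assuming the knowledge pool is a plain directory tree; the function name `collect_documents` is hypothetical, and the extension set is simply the list above:

```python
from pathlib import Path

# File types named in the data-model list above.
SUPPORTED = {".txt", ".md", ".docx", ".pptx", ".xlsx", ".pdf", ".html"}


def collect_documents(root):
    """Return supported files under root, sorted for a stable order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
```

Matching on the lowercased suffix keeps files such as `report.PDF` in the pool without renaming them.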
Tailoring for personal knowledge pool
- PKM use cases (against the global knowledge space)
- logs: Snippets of notes "in sack" for in-depth documenting.
- notes: Personally organized notes, beyond one-step reach of Google search.
- literature: References for attribution. [Significant for critical writing.]
Apache Solr
- the most popular enterprise search platform
- Netflix, digg, NASA, Slack ...
- Many libraries are also using Solr
- built on Apache Lucene, an information retrieval software library
Indexing
- Indexing rich documents with Apache Tika.
- Apache Tika has a wide spectrum of supported formats.
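Solr exposes Tika through its extracting request handler, conventionally mounted at `/update/extract`. A minimal sketch of posting one file to it, assuming a local Solr, a hypothetical core named `notes`, and that the handler is enabled in that core's configuration; `literal.id` and `commit` are the handler's standard parameters:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SOLR = "http://localhost:8983/solr"  # assumed local Solr instance
CORE = "notes"                       # hypothetical core name


def extract_url(doc_id, commit=True):
    """Build the /update/extract URL for one document id."""
    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    return f"{SOLR}/{CORE}/update/extract?{urlencode(params)}"


def index_file(path, doc_id):
    """POST the raw file bytes; Tika detects the format on the server."""
    with open(path, "rb") as f:
        body = f.read()
    req = Request(
        extract_url(doc_id),
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": "application/octet-stream"},
    )
    with urlopen(req) as resp:
        return resp.status
```

Using the file's relative path as `doc_id` is one simple choice: re-indexing an edited file then overwrites its earlier entry instead of duplicating it.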
Clustering
- Carrot² is a document thematic clustering engine, written in Java and distributed under the BSD license.
- Clustering algorithms: Lingo, STC, Lingo3G™.
- Visualization: FoamTree, Circles
- Search feeds: Lucene, Solr, other search APIs.
- Front-ends: Carrot² Web Application, Carrot² Document Clustering Workbench
Web UI
- Velocity Search UI [built-in]
- Project Blacklight, a Ruby on Rails Engine plugin.
- Lucidworks Fusion, a platform for building enterprise search applications.
Shredder: current status
- figuring out file type support (.md, .jpg, etc.)
- building UI
- stay tuned on my GitHub repo