Personal Knowledge Management Powered by Apache Solr

Ruda Zhang

PhD student, Civil Engineering

University of Southern California

2015/08/28

Evolution of personal knowledge management

  • Pre-digital era
  • PC era
  • Web era
  • AI era

Pre-digital era

  • notebooks,
  • library classification;

PC era

  • digital documents,
  • "file" system;

Web era

  • web pages,
  • "personal wiki"
    • gollum,
    • DokuWiki,
    • wikidPad

AI era

  • search bar,
  • NLP stacks (Solr)
    • search engine,
    • clustering.

Knowledge Management with Apache Solr

  • Taking enterprise search software for personal use.

Functionality Needs

  • powerful search engine (Lucene)
    • rich document indexing (Tika)
    • auto-completion, spell check
    • entry suggestion
  • dynamic clustering and labeling by document topics
    • integrating note snippets into specific "views", which facilitates long-form writing.
    • visualization: treemap (FoamTree), network
  • navigation: facets, breadcrumbs
  • cross-referencing / hyperlinking
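As a rough illustration of the search and facet navigation items above, the sketch below builds a query URL for Solr's standard /select handler. The host, the core name "notes", and the facet field "kind" are assumptions made for the example, not settings from this project.

```python
from urllib.parse import urlencode


def build_search_url(base_url, core, text, facet_fields=(), rows=10):
    """Return a Solr /select URL for a keyword query with optional facets."""
    params = [
        ("q", text),
        ("rows", str(rows)),
        ("wt", "json"),
    ]
    if facet_fields:
        params.append(("facet", "true"))
        # Solr accepts the facet.field key repeated, once per field.
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url.rstrip('/')}/solr/{core}/select?{urlencode(params)}"


# Hypothetical core "notes" and facet field "kind" (e.g. logs / notes / literature).
url = build_search_url("http://localhost:8983", "notes", "random walk", ["kind"])
```

Facet counts returned for a field like "kind" are what a UI could render as the navigation links mentioned above.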

Tailoring for personal knowledge pool

  • Data models (mostly schema-less natural language): .txt, .md, .docx, .pptx, .xlsx, .pdf, .html
    • minimal tailoring of legacy document markups and schemas.
    • users should be editing note contents, not HTML files.

Tailoring for personal knowledge pool

  • PKM use cases (against the global knowledge space)
    • logs: Snippets of notes "in sack" for in-depth documenting.
    • notes: Personally organized notes, beyond one-step reach of Google search.
    • literature: References for attribution. [Significant for critical writing.]

Apache Solr

  • one of the most popular open-source enterprise search platforms
    • users include Netflix, Digg, NASA, Slack ...
  • Many libraries are also using Solr
  • built on Apache Lucene, an information retrieval software library

Indexing

  • Indexing rich documents with Apache Tika.
  • Apache Tika has a wide spectrum of supported formats.
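A minimal sketch of rich document indexing, assuming a local Solr instance with the Tika-based extracting request handler enabled at /update/extract (as in Solr's example configurations). The core name "notes" and the use of the file path as the document id are assumptions for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(base_url, core, doc_id, commit=True):
    """Return the /update/extract URL that assigns doc_id to the posted file."""
    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    return f"{base_url.rstrip('/')}/solr/{core}/update/extract?{urlencode(params)}"


def index_file(path, base_url="http://localhost:8983", core="notes"):
    """POST the raw file bytes to Solr; Tika detects the format and extracts text."""
    with open(path, "rb") as fh:
        body = fh.read()
    request = Request(
        build_extract_url(base_url, core, path),
        data=body,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    with urlopen(request) as response:
        return response.status
```

Calling index_file("paper.pdf") requires a running Solr server; build_extract_url can be inspected without one.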

Clustering

  • Carrot² is a thematic document clustering engine, written in Java and distributed under the BSD license.
    • Clustering algorithms: Lingo, STC, and the commercial Lingo3G™.
    • Visualization: FoamTree, Circles
    • Search feeds: Lucene, Solr, other search APIs.
  • Front-ends
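When Solr's clustering component is enabled, the JSON response carries a "clusters" list whose entries hold "labels" and "docs". The sketch below turns such a response into a label-to-documents mapping that a front-end could feed to a treemap. The sample response is made up for illustration, and the helper name is hypothetical.

```python
def cluster_labels(response):
    """Map each cluster's first label to the list of document ids it contains."""
    groups = {}
    for cluster in response.get("clusters", []):
        labels = cluster.get("labels") or ["(unlabeled)"]
        groups[labels[0]] = list(cluster.get("docs", []))
    return groups


# Made-up response fragment in the shape returned by the clustering component.
sample = {
    "clusters": [
        {"labels": ["Random Walk"], "score": 4.2, "docs": ["note-1", "note-7"]},
        {"labels": ["Traffic Flow"], "score": 3.1, "docs": ["note-3"]},
    ]
}
groups = cluster_labels(sample)
```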

Web UI

Shredder: current status

  • figuring out file type support (.md, .jpg, etc.)
  • building UI
  • stay tuned on my GitHub repo

shredder: personal knowledge management powered by Apache Solr

By rudazhan
