searching across the federation

noah ó donnaile 06/03/19

Brief intro to distributed hashtables (which underlie or inspire many of the search systems)
Deep(er) dive into YaCy, a noteworthy distributed search system
Short mention of other relevant approaches
Shoutout to thesis about handling search spam

NB: federated search != federation search

Federated search: A user makes a single query request which is distributed to the search engines, databases or other query engines participating in the federation. The federated search then aggregates the results that are received from the search engines for presentation to the user.

Eg. Searching for users and posts at the same time is a federated search.
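As a rough illustration of that definition, here is a minimal sketch in which one query fans out to two backends and the results are merged. The backends, their data and every function name are hypothetical stand-ins, not part of Rabble or any real system.

```python
from typing import Callable

# Hypothetical in-memory data standing in for two separate search backends.
USERS = ["noah", "aoife", "ross"]
POSTS = ["noah writes about search", "a post about cats"]

Result = tuple[str, str]  # (kind, matched text)


def search_users(query: str) -> list[Result]:
    """Backend 1: substring match over usernames."""
    return [("user", u) for u in USERS if query in u]


def search_posts(query: str) -> list[Result]:
    """Backend 2: substring match over post bodies."""
    return [("post", p) for p in POSTS if query in p]


def federated_search(query: str, backends: list[Callable[[str], list[Result]]]) -> list[Result]:
    """Send the same query to every backend and aggregate what comes back."""
    results: list[Result] = []
    for backend in backends:
        results.extend(backend(query))
    return results


hits = federated_search("noah", [search_users, search_posts])
```

The caller sees a single result list and does not need to know how many backends answered.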

What is a DHT?

A distributed hashtable (DHT) is a key-value store spread across a number of machines.

There are many implementations and strategies for an efficient DHT (just like for a non-distributed hashtable).

DHTs are versatile, and are used somewhere in many distributed systems, such as BitTorrent, Ethereum (for node discovery) and IPFS (all use Kademlia-based DHTs).

DHTs work with key-value data pairs. Typically you use them to find a value given a specific key.
A structured DHT may allow lookups of a specific key's value in O(log(N)) queries instead of O(N), as would be the case in an unstructured P2P network (where you must ask every node if it knows about this key).
DHTs usually require control over where a specific value is stored, which may not be suitable for Rabble, where content is mandated to be stored (at least) on the instance which created it.
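To make the O(log(N)) claim concrete, here is a simplified sketch of greedy routing on a Chord-style ring. It assumes every ID on the ring is occupied and that each node has "fingers" pointing 1, 2, 4, ... positions ahead. Real Chord handles sparse rings, joins and failures; none of that is modelled here, and the function name is made up for illustration.

```python
def chord_hops(start: int, key: int, bits: int) -> int:
    """Count the hops needed to reach `key` from `start` on a full ring of 2**bits nodes."""
    size = 1 << bits
    current, hops = start, 0
    while current != key:
        distance = (key - current) % size
        # Follow the largest finger (a power of two) that does not overshoot the key.
        jump = 1 << (distance.bit_length() - 1)
        current = (current + jump) % size
        hops += 1
    return hops


BITS = 10  # a ring of 1024 nodes
worst_case = max(chord_hops(0, key, BITS) for key in range(1 << BITS))
```

Each hop clears the highest remaining bit of the distance, so on this idealised ring no lookup takes more than `BITS` hops, compared with asking up to 1024 nodes in an unstructured network.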

Do not confuse DHT in this context with Dihydrotestosterone, a hormone related to hair loss.

DHTs are a well-studied area.
Rabble must deal with potentially malicious agents in its federated search system. There are well-defined strategies to handle attacks on DHTs, especially Sybil attacks:
- Centralised index of trusted nodes (not suitable for Rabble)
- Calculation of trust by social graph topology (worth looking into in our case)
- Proof of work required for new nodes (buzzword alert!)
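The proof-of-work option can be sketched in a few lines. This is a hashcash-style illustration only: the puzzle format, the difficulty and the function names are assumptions, not something taken from any particular DHT.

```python
import hashlib
from itertools import count


def pow_ok(public_key: str, nonce: int, difficulty_bits: int) -> bool:
    """True if SHA-256(public_key:nonce) starts with `difficulty_bits` zero bits."""
    digest = hashlib.sha256(f"{public_key}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0


def mine_nonce(public_key: str, difficulty_bits: int) -> int:
    """Brute-force the smallest nonce that satisfies the puzzle."""
    for nonce in count():
        if pow_ok(public_key, nonce, difficulty_bits):
            return nonce


# A new node pays the cost once; every existing node can verify it with a single hash.
nonce = mine_nonce("example-node-key", 12)
```

Joining costs about 2**12 hashes on average here, while verification costs one, which is what makes creating many fake identities expensive.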

Although DHTs are not obviously suited to the search problem, they can be adapted to it.

A DHT typically has a structured overlay network on top of the distributed network using a system like Pastry or Chord.

With this structure, you may search the DHT with arbitrary queries (ie. not just based on key-values), eg. using Structella (flooding and random walks to propagate queries across a Pastry network) or DQ-DHT (dynamic queries across a Chord network).

A DHT can also just be used as part of a thorough P2P search implementation, like in YaCy.

YaCy

A P2P web search engine

Based on info from:
Description of the YaCy Distributed Web Search Engine, Michael Herrmann, Kai-Chun Ning, Claudia Diaz, and Bart Preneel, 2014

YaCy is a peer-to-peer web search engine from 2003, still in active dev.

YaCy Peer IDs

The network is made up of peers. Each peer has a single global ID ("hash") obtained at first start-up. When a peer goes offline and comes back online, it reuses its old ID.

(The paper doesn't explain what, if anything, stops a malicious peer from assuming an ID that is already owned.)

When a well-behaving new peer tries to get an ID, it must know the entire state of the network, and then chooses an unused ID based on a well-defined algorithm.

RWI DB

When a YaCy instance finds a new piece of content (in its case, when it crawls a new URL), it:

- removes stop words
- adds a {word, document URL} pair for each word occurring in the document to its RWI (reverse word index) database.
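The two indexing steps above can be sketched as a small inverted index. The stop-word list, the whitespace tokenisation and the function name are illustrative assumptions; YaCy's real RWI entries carry more information than a bare URL.

```python
from collections import defaultdict

# Illustrative stop words only, not YaCy's actual list.
STOP_WORDS = {"a", "the", "is", "of", "and"}


def index_document(rwi: dict[str, set[str]], url: str, text: str) -> None:
    """Add a {word, document URL} pair for every non-stop word in the document."""
    for word in set(text.lower().split()):
        if word not in STOP_WORDS:
            rwi[word].add(url)


rwi: dict[str, set[str]] = defaultdict(set)
index_document(rwi, "http://example.org/a", "The federation is a network")
index_document(rwi, "http://example.org/b", "Search the network")
```

Looking up `rwi["network"]` then returns both URLs, which is the reverse (word to documents) direction the name refers to.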

Solr DB

YaCy also adds a metadata document with {page title, encoding, list of clickable links, etc.} into its Solr (ie. Lucene) database.

Knowledge Distribution

YaCy then tries to distribute this newly found content across the network.

- it hashes the RWI entries and Solr document
- it stores these tuples in the 3 closest peers on the network to their hashes (by distance to the peers' hashes) (with allowance for peers who decline to store it) via a DHT-alike routing system.
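A minimal sketch of that placement rule follows. It assumes SHA-256 hashes and XOR as the distance measure, which are stand-ins: YaCy defines its own hash format and distance function. Peer names and the `declined` parameter are hypothetical.

```python
import hashlib


def digest_int(value: str) -> int:
    """Hash a string to a large integer so hashes can be compared by distance."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest(), "big")


def closest_peers(key: str, peer_ids: list[str], replicas: int = 3,
                  declined: frozenset[str] = frozenset()) -> list[str]:
    """Pick the `replicas` willing peers whose hashes are closest to the key's hash."""
    key_hash = digest_int(key)
    willing = [p for p in peer_ids if p not in declined]
    # XOR distance is an assumption for illustration, not YaCy's metric.
    return sorted(willing, key=lambda p: digest_int(p) ^ key_hash)[:replicas]


peers = [f"peer-{i}" for i in range(10)]
targets = closest_peers("federation", peers)
# If the closest peer declines, the next-closest peer takes its place.
fallback = closest_peers("federation", peers, declined=frozenset({targets[0]}))
```

Because every peer can compute the same ordering from the hashes alone, a searcher later knows which peers to ask for a given word without a central directory.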

Search

YaCy searches by both 'YaCy search requests' and 'Solr search requests'. Both searches operate along the following lines:

- The local index is searched for the query.
- A 'candidate set' of peers to forward the query to is compiled; it is compiled differently for each type of search request.
- The peers are queried for results:
  1. For a 'YaCy search request', this involves finding matches for the query words in the distributed RWI index.
  2. For a 'Solr search request', this involves querying the Solr database.
- The results from all of the peers are collated and then sorted by a well-defined algorithm.
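The search flow above can be sketched as follows. Each index is modelled as a plain dict from word to {URL: score}, the candidate set is passed in ready-made, and ranking by summed score is an assumption standing in for YaCy's actual sorting algorithm.

```python
Index = dict[str, dict[str, int]]  # word -> {url: score}


def search(query_words: list[str], local_index: Index, candidate_peers: list[Index]) -> list[str]:
    """Query the local index and each candidate peer, then collate and rank the results."""
    scores: dict[str, int] = {}
    for index in [local_index, *candidate_peers]:
        for word in query_words:
            for url, score in index.get(word, {}).items():
                scores[url] = scores.get(url, 0) + score
    # Highest combined score first; ties are broken by URL for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))


local = {"search": {"http://a.example": 2}}
peer1 = {"search": {"http://b.example": 5}, "p2p": {"http://a.example": 1}}
peer2 = {"p2p": {"http://c.example": 1}}
ranked = search(["p2p", "search"], local, [peer1, peer2])
```

In a real deployment the peer lookups would be network requests that can fail or time out, so the collation step has to tolerate missing answers.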

Advantages of YaCy

Privacy-focused: search terms are hashed before distribution, which makes it a little less obvious what specifically is being searched for.
Robust: it is based on principles of DHTs, and supports nodes disappearing and reappearing. It has some redundancy in the network (everything is stored ~4 times).
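To illustrate the privacy point, here is a sketch of looking a word up by its hash rather than in plaintext. SHA-256 truncated to 12 hex characters is an assumption made for the example; YaCy uses its own word-hash format.

```python
import hashlib


def word_hash(word: str) -> str:
    """Hash a normalised search term (hypothetical format, not YaCy's)."""
    return hashlib.sha256(word.lower().encode()).hexdigest()[:12]


# What a remote peer stores: hashes as keys, never the words themselves.
remote_index = {word_hash("federation"): ["http://example.org/a"]}


def remote_lookup(hashed_term: str) -> list[str]:
    """The remote peer answers using only the hash it was sent."""
    return remote_index.get(hashed_term, [])


results = remote_lookup(word_hash("Federation"))
```

This only makes queries "a little less obvious": a peer can hash a dictionary of common words and compare, so the hash hides little for everyday search terms.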

Disadvantages of YaCy

It's complicated, with a few seemingly arbitrary design decisions.
Documentation is lacking.
It's hard to tell if it acknowledges malicious peers.

Brief description of other systems

Gnutella

An unstructured p2p network providing filesharing capabilities, with support for search. Best known by its client LimeWire. Originally used query flooding for searches, but later versions supported query routing.

iTrust

An unstructured p2p network that distributes content to a random subset of nodes, and then sends queries to random nodes. Handles malicious agents well, but doesn't guarantee finding results (from iTrust: Trustworthy Information Publication, Search and Retrieval, Peter Melliar-Smith et al., 2012).
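The "doesn't guarantee finding results" trade-off can be quantified with a short calculation. This is the standard sampling-without-replacement probability, under the assumptions that holders and queried nodes are chosen uniformly at random and that every node answers honestly. It is a sketch, not a formula quoted from the iTrust paper, and the parameter names are made up.

```python
from math import comb


def hit_probability(nodes: int, holders: int, queried: int) -> float:
    """Chance that at least one of `queried` random nodes is among the `holders`."""
    # Probability that every queried node misses, subtracted from 1.
    return 1 - comb(nodes - holders, queried) / comb(nodes, queried)


# With 4 nodes, 2 holders and 2 queried nodes, only 1 of the 6 possible
# query pairs misses both holders.
small_case = hit_probability(4, 2, 2)
```

Raising either the number of holders or the number of queried nodes pushes the probability towards 1, at the cost of more messages.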

F2F

Suggests using social links between peers in p2p network to encourage good behaviour and discourage malicious actors by creating subnetworks (from F2F: reliable storage in open networks, Jinyang Li & Frank Dabek, 2006).

Pastiche

Based on Pastry. Assumes peers ("buddies") are untrustworthy; detects and mitigates malicious actors by periodically querying them for their stored data. Requires centralised authority to protect against Sybil attacks (from Pastiche: making backup cheap and easy, Cox et al., 2002).

Further reading on handling search spam: Adversarial Web Search, Carlos Castillo & Brian D. Davison, 2010

Questions?
