noah ó donnaile 06/03/19
Federated search: A user makes a single query request which is distributed to the search engines, databases or other query engines participating in the federation. The federated search then aggregates the results that are received from the search engines for presentation to the user.
E.g. searching for users and posts with a single query is a federated search.
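A minimal sketch of the fan-out-and-aggregate idea, using the users/posts example. The two backends (search_users, search_posts) and their in-memory data are hypothetical stand-ins for real search engines; a real federation would also need ranking, deduplication and timeouts.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory backends standing in for real search engines.
USERS = ["alice", "bob", "alina"]
POSTS = ["Hello from Alice", "DHT notes", "Ali Baba review"]


def search_users(query):
    """Return (source, item) pairs for users whose name contains the query."""
    q = query.lower()
    return [("users", u) for u in USERS if q in u.lower()]


def search_posts(query):
    """Return (source, item) pairs for posts whose text contains the query."""
    q = query.lower()
    return [("posts", p) for p in POSTS if q in p.lower()]


def federated_search(query, backends):
    """Send one query to every backend in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        # map() keeps backend order, so the merged output is deterministic.
        result_lists = list(pool.map(lambda backend: backend(query), backends))
    return [hit for hits in result_lists for hit in hits]


if __name__ == "__main__":
    for source, item in federated_search("ali", [search_users, search_posts]):
        print(source, item)
```

The user only ever calls federated_search; which engines take part is a detail of the federation.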
A distributed hashtable (DHT) is a key-value store spread across a number of machines.
There are many implementations and strategies for an efficient DHT (just like for a non-distributed hashtable).
DHTs are versatile, and are used somewhere in many distributed systems, such as Ethereum (for node discovery), IPFS and BitTorrent (all use Kademlia or a variant of it).
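A toy sketch of how a Kademlia-style DHT decides which machines hold a key: node names and keys are hashed into the same ID space, and a key lives on the nodes whose IDs are closest to it by XOR distance. The class ToyDHT, the node names and the replicas parameter are assumptions made for illustration; real Kademlia also has routing tables and iterative lookups, none of which is shown here.

```python
import hashlib


def to_id(text):
    """Hash a string to a 160-bit integer ID (SHA-1 chosen here for illustration)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")


class ToyDHT:
    """All 'machines' live in one process; each has its own local dict."""

    def __init__(self, node_names, replicas=2):
        self.stores = {name: {} for name in node_names}
        self.replicas = replicas

    def responsible(self, key):
        """Return the nodes whose IDs are closest to the key by XOR distance."""
        key_id = to_id(key)
        ranked = sorted(self.stores, key=lambda name: to_id(name) ^ key_id)
        return ranked[: self.replicas]

    def put(self, key, value):
        for name in self.responsible(key):
            self.stores[name][key] = value

    def get(self, key):
        for name in self.responsible(key):
            if key in self.stores[name]:
                return self.stores[name][key]
        return None


if __name__ == "__main__":
    dht = ToyDHT(["node-a", "node-b", "node-c", "node-d"])
    dht.put("cat.jpg", "some bytes")
    print(dht.responsible("cat.jpg"), dht.get("cat.jpg"))
```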
Do not confuse DHT in this context with Dihydrotestosterone, a hormone related to hair loss.
Although DHTs are not obviously suited to the search problem, they can be adapted to it.
A DHT is typically built on a structured overlay network, such as Pastry or Chord, layered on top of the underlying distributed network.
With this structure, you may search the DHT with arbitrary queries (i.e. not just exact-key lookups), e.g. using Structella (flooding and random walks to propagate queries across a Pastry network) or DQ-DHT (dynamic queries across a Chord network).
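A rough sketch of the random-walk idea: the query hops from node to node along overlay links, and each visited node tests its local items against an arbitrary predicate rather than a key. This is a generic illustration, not Structella's actual algorithm; the graph and data dictionaries and the max_hops parameter are assumptions.

```python
import random


def random_walk_search(graph, data, start, predicate, max_hops=20, rng=None):
    """Forward a query along a random walk and collect matching items.

    graph: dict mapping node -> list of neighbour nodes (the overlay links)
    data: dict mapping node -> list of locally stored items
    predicate: any function item -> bool, so queries are not limited to key equality
    """
    rng = rng or random.Random()
    node, hits, visited = start, [], set()
    for _ in range(max_hops + 1):
        if node not in visited:
            # Only evaluate the query the first time a node is reached.
            visited.add(node)
            hits.extend(item for item in data.get(node, []) if predicate(item))
        neighbours = graph.get(node, [])
        if not neighbours:
            break
        node = rng.choice(neighbours)
    return hits


if __name__ == "__main__":
    graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
    data = {"A": ["apple pie"], "B": ["banana"], "C": ["apple tart"]}
    print(random_walk_search(graph, data, "A", lambda s: "apple" in s, max_hops=10))
```

A walk only finds what it happens to pass, so results depend on max_hops and on luck.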
YaCy: a P2P web search engine
Based on info from:
Description of the YaCy Distributed Web Search Engine, Michael Herrmann, Kai-Chun Ning, Claudia Diaz, and Bart Preneel, 2014
The network is made up of peers. Each peer has a single global ID ("hash") obtained at first start-up. When a peer goes offline and comes back online, it reuses its old ID.
(The paper does not explain what stops a malicious peer from claiming an ID that is already owned.)
When a well-behaving new peer tries to get an ID, it must first know the entire state of the network; it then chooses an unused ID using a well-defined algorithm.
When a YaCy instance finds a new piece of content (in its case, when it crawls a new URL), it:
YaCy also adds a metadata document with {page title, encoding, list of clickable links, etc.} into its Solr (i.e. Lucene-based) database.
YaCy then tries to distribute this newly found content across the network.
YaCy searches by both 'YaCy search requests' and 'Solr search requests'. Both searches operate along the following lines:
Gnutella
An unstructured p2p network providing filesharing capabilities, with support for search. Best known by its client LimeWire. Originally used query flooding for searches, but later versions supported query routing.
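A simplified sketch of query flooding with a time-to-live (TTL): each node checks its own files, then forwards the query to its neighbours until the TTL runs out. The graph layout, file names and the default ttl value are assumptions for illustration; the real Gnutella protocol uses message IDs to drop duplicates and routes hits back along the query path.

```python
from collections import deque


def flood_query(graph, data, origin, predicate, ttl=3):
    """Breadth-first flooding: returns (node, item) pairs found within ttl hops."""
    seen = {origin}  # nodes that have already received this query
    queue = deque([(origin, ttl)])
    hits = []
    while queue:
        node, remaining = queue.popleft()
        hits.extend((node, item) for item in data.get(node, []) if predicate(item))
        if remaining == 0:
            continue  # TTL exhausted: this node does not forward the query
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, remaining - 1))
    return hits


if __name__ == "__main__":
    graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
    data = {"C": ["song.mp3"], "D": ["song_live.mp3"]}
    print(flood_query(graph, data, "A", lambda f: f.startswith("song"), ttl=2))
```

With ttl=2 the query started at A never reaches D, so that file is missed even though it exists.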
iTrust
An unstructured p2p network that distributes content to a random subset of nodes, and then sends queries to random nodes. Handles malicious agents well, but doesn't guarantee finding results (from iTrust: Trustworthy Information Publication, Search and Retrieval, Peter Melliar-Smith et al., 2012).
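A back-of-envelope sketch of why results are likely but not guaranteed. Assume (my simplification, with my own symbol names) that content metadata sits on m of the n nodes and that a query goes to r distinct nodes chosen uniformly at random. The chance that at least one queried node holds the metadata is 1 - C(n-m, r) / C(n, r).

```python
from math import comb


def match_probability(n, m, r):
    """Probability that at least one of r randomly queried nodes holds the metadata.

    n: total number of nodes
    m: number of nodes the metadata was distributed to
    r: number of distinct nodes the query is sent to
    """
    if not (0 <= m <= n and 0 <= r <= n):
        raise ValueError("need 0 <= m <= n and 0 <= r <= n")
    # comb(n - m, r) counts the ways to pick r nodes that all lack the metadata;
    # it is 0 when r > n - m, which makes a match certain.
    return 1 - comb(n - m, r) / comb(n, r)


if __name__ == "__main__":
    for r in (10, 30, 60):
        print(r, round(match_probability(1000, 30, r), 3))
```

Raising m or r pushes the probability towards 1, but it only reaches 1 when r > n - m.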
F2F
Suggests using social links between peers in a p2p network to encourage good behaviour and discourage malicious actors by creating subnetworks (from F2F: reliable storage in open networks, Jinyang Li & Frank Dabek, 2006).
Pastiche
Based on Pastry. Assumes peers ("buddies") are untrustworthy; detects and mitigates malicious actors by periodically querying them for their stored data. Requires a centralised authority to protect against Sybil attacks (from Pastiche: making backup cheap and easy, Cox et al., 2002).
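A loose sketch of the periodic-querying idea: the owner remembers a digest of every chunk it hands to a buddy, then occasionally asks for a random sample back and compares. The Buddy and Owner classes, the SHA-256 digests and the sample_size parameter are my assumptions for illustration, not the mechanism described in the Pastiche paper.

```python
import hashlib
import random


class Buddy:
    """A remote peer that is supposed to hold our backup chunks."""

    def __init__(self):
        self.chunks = {}

    def store(self, chunk_id, blob):
        self.chunks[chunk_id] = blob

    def fetch(self, chunk_id):
        return self.chunks.get(chunk_id)


class Owner:
    """Keeps only digests locally and spot-checks the buddy against them."""

    def __init__(self):
        self.digests = {}

    def backup(self, buddy, chunk_id, blob):
        self.digests[chunk_id] = hashlib.sha256(blob).hexdigest()
        buddy.store(chunk_id, blob)

    def spot_check(self, buddy, sample_size=1, rng=None):
        """Ask for a random sample of chunks; False means the buddy failed."""
        rng = rng or random.Random()
        ids = rng.sample(sorted(self.digests), min(sample_size, len(self.digests)))
        for chunk_id in ids:
            blob = buddy.fetch(chunk_id)
            if blob is None or hashlib.sha256(blob).hexdigest() != self.digests[chunk_id]:
                return False
        return True


if __name__ == "__main__":
    owner, buddy = Owner(), Buddy()
    owner.backup(buddy, "c1", b"first chunk")
    owner.backup(buddy, "c2", b"second chunk")
    print(owner.spot_check(buddy, sample_size=2))
```

A small sample_size keeps each check cheap; a buddy that has dropped data is then caught eventually rather than immediately.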
Also see: