ZeroDiscovery

@nd

Neuropil

OpenSearchSymposium 2021 / 12.10.2021

Stephan Schwichtenberg

Hello from

Meet Eliza & Marvin

Security of the Past: Limitations

protects only bilateral IP connections

protects APIs, not the individual data objects

unsuited to rapidly changing data owners and new data channels

Security of the Future: ZeroTrust

trust perimeter has changed

fragmented information (flows) need protection

authn/authz must be possible everywhere

data objects governed by external/internal access policies (AP)

Security for Ecosystems:

Zero Trust / IDSA / AccessPolicies

data object interactions main driver for future IT architecture

devices produce and consume data at the same time

respect different data owners per device

if one fails, all suffer!

Our Approach

www.neuropil.org

OpenSource CyberSecurity Mesh

Milestones

development started in 2014

2016: first exhibition @FROSCON

2019: NGI Zero / EU funded

2021: beta-release HMI 2021

looking for pilots & partners

Random Identities

User Identities

Intent Identities

H256(X)

0...

8...

4...

c...

neuropil & IDSA

why we joined:

rules to enforce data ownership / sovereignty

increase data quality and transparency

building Europe-wide ecosystems

neuropil & IDSA

neuropil@IDSA

decentralized (meta-data) broker

each application/device is a connector

decentralized MQTT

Use Case:

Distributed Search Engine

Neuropil is a project that wants to turn the tables on online search and discovery: instead of search solutions calling the shots, data owners decide what content is publicly searchable in the first place.

They can do this through a new messaging layer that is private and secure by design. Data owners can send cryptographic and unique so-called intent messages that state what specific information can be found where.

The access to the actual information or content is also controlled by data owners, for instance to provide either paid or public free content.

Broker

central broker structures

has to grow with the number of connected instances
any central broker is attackable
information is duplicated, and possibly outdated
- crawling is a waste of energy
- legal aspects of copyright / data ownership
can withhold or change information
needs to understand many different languages (data models)
we would like to search for data, not URLs
who is the broker of all brokers?
- federated broker
- distributed broker

how can we prevent an unfair advantage of any participant?
- leave the data / content where it is
- distribute the search index according to mathematical rules
how can we prevent malicious content?
- add check before content is added
- use "trust" signatures to mark reviewed content (attribute)
how can we protect the privacy of users?
- use PPRL to share information about a document
- encrypt data if transported, allow access control

de-central algorithms

NGI Zero / part 1

subject="urn:neuropil:photo:library:v1"

Identity Token

Node Token

Intent Token

{
"iss": FP(Id),
"sub": "mail:pseudonym@example.com",
"pub": <binary data>,

...
} + sig

{

"partner_dhkey": FP(N),

"attribute_1": "super_secret_sauce",

"attribute_2": bin(x),

} + sig

=> H(sub)

NGI Zero / part 1

step 1 / obfuscate subject:

{ "subject":"urn:neuropil:photo:library:v1" => 0fa6472ba9813c56 }
serves as a rendezvous point

step 2 / send intent token:

intermediate nodes had to store/match/resend intent tokens
parsing / interpretation / validation of tokens (signature check)
open to DoS attacks (flooding with intent tokens)

step 3 / messages exchange:

data is encrypted
routing is based on hash distance / hash table
if (receiver count > 1) => sender has to duplicate messages
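
Steps 1 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the neuropil implementation: BLAKE2b as the 256-bit hash, XOR as the hash distance, and the node names are all hypothetical.

```python
import hashlib


def h256(text: str) -> int:
    """Hash a string into the 256-bit address space (BLAKE2b is an assumption)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=32).digest()
    return int.from_bytes(digest, "big")


def closest_node(target: int, node_ids: list[int]) -> int:
    """Pick the node whose id has the smallest XOR distance to the target."""
    return min(node_ids, key=lambda node_id: node_id ^ target)


# Step 1: the clear-text subject becomes a rendezvous point in the address space.
rendezvous = h256("urn:neuropil:photo:library:v1")

# Step 3: route towards the node closest to that point (hypothetical node names).
nodes = [h256(f"node-{i}") for i in range(8)]
next_hop = closest_node(rendezvous, nodes)
print(f"rendezvous {rendezvous:064x}")
print(f"next hop   {next_hop:064x}")
```

The hash values printed here will not match the shortened example value on the slide, since the actual hash function and encoding are not specified in this deck.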

NGI Zero / part 1 / the past

step 1 / obfuscate subject:

{ "subject":"urn:neuropil:photo:library:v1" => 0fa6472ba9813c56 }
serves as a virtual rendezvous point
create a "pheromone" to discover routing information
- a pheromone is an attenuated, counting bloom filter (false-positive probability 1:1000 at 128 subjects)
- attenuated => capture age/hop count information / signal strength
- counting => allows to remove entries

0fa6 472b a981 3c56 (32 bytes)

01010000 00100100 10001000 00010100 (3*4 bytes)
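
The two properties named above (counting and attenuated) can be sketched like this. It is a toy model under assumptions: the filter size, the number of hash functions, the number of levels and the `1 / (hops + 1)` signal strength are all illustrative choices, not the neuropil parameters.

```python
import hashlib


class CountingBloom:
    """Bloom filter with counters instead of bits, so entries can be removed."""

    def __init__(self, size: int = 256, hashes: int = 4):
        self.size, self.hashes = size, hashes
        self.counters = [0] * size

    def _slots(self, item: str):
        for i in range(self.hashes):
            digest = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, item: str) -> None:
        for slot in self._slots(item):
            self.counters[slot] += 1

    def remove(self, item: str) -> None:
        # Only meaningful for items that were added before.
        for slot in self._slots(item):
            if self.counters[slot] > 0:
                self.counters[slot] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[slot] > 0 for slot in self._slots(item))


class Pheromone:
    """Attenuated filter: level i summarises subjects reachable in i hops."""

    def __init__(self, depth: int = 4):
        self.levels = [CountingBloom() for _ in range(depth)]

    def add(self, subject: str, hops: int = 0) -> None:
        self.levels[hops].add(subject)

    def remove(self, subject: str, hops: int = 0) -> None:
        self.levels[hops].remove(subject)

    def strength(self, subject: str) -> float:
        """Signal strength: 1.0 for a direct hit, weaker for every extra hop."""
        for hops, level in enumerate(self.levels):
            if subject in level:
                return 1.0 / (hops + 1)
        return 0.0


trail = Pheromone()
trail.add("0fa6472ba9813c56", hops=2)
print(trail.strength("0fa6472ba9813c56"))
```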

NGI Zero / part 1 / the future

step 2 / obfuscate attributes of intent token:

attributes = { "urn:osf:search:countrycode": "DE", ... }
build hash of { key, value } pairs, and add the result to a bloom filter
the resulting bloom filter (ABF/Attribute Bloom Filter) can act as:
- a simple policy enforcement:
  - arriving intent tokens have to match the attribute bloom filter
  - the attribute filter can be derived from e.g. simple graphql query
- tbd: a simple syntax validator
- tbd: push the ABF also on the network layer
  - probabilistic event / object dissemination network
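
A minimal sketch of the attribute bloom filter used as a policy check. The filter size, the number of hash functions, BLAKE2b as hash, and the second attribute key are assumptions made for illustration; only the `urn:osf:search:countrycode` attribute comes from the slide.

```python
import hashlib

M_BITS = 512    # assumed filter size
K_HASHES = 3    # assumed number of hash functions


def _positions(key: str, value: str):
    """Bit positions for one { key, value } pair."""
    for i in range(K_HASHES):
        digest = hashlib.blake2b(f"{i}|{key}|{value}".encode(), digest_size=8).digest()
        yield int.from_bytes(digest, "big") % M_BITS


def attribute_filter(attributes: dict) -> int:
    """Build the ABF as an integer bitset from all { key, value } pairs."""
    bits = 0
    for key, value in attributes.items():
        for pos in _positions(key, str(value)):
            bits |= 1 << pos
    return bits


def matches(token_attributes: dict, policy_filter: int) -> bool:
    """True if every bit required by the policy is set in the token's ABF.

    False positives are possible, false negatives are not.
    """
    token_filter = attribute_filter(token_attributes)
    return (token_filter & policy_filter) == policy_filter


# The policy could be derived from a query, e.g. "countrycode must be DE".
policy = attribute_filter({"urn:osf:search:countrycode": "DE"})

token = {
    "urn:osf:search:countrycode": "DE",
    "urn:osf:search:language": "de",  # hypothetical second attribute
}
print(matches(token, policy))
```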

NGI Zero / part 1 / the future

step 2.5 / increase storage capabilities of ABF:

current approach is limited by memory size
new approach based on Roaring Bitmaps (Lemire et al.)
ability to encode far more items into the ABF
more efficient union / intersection (10x-20x faster)
less memory usage

NGI Zero / part 1 / the future

NEW

step 2 / obfuscate attributes of intent token:

NGI Zero / part 1 / the future

{
"iss": FP(Id),
"sub": "urn:neuropil:photo:library:v1",
"pub": <binary data>,

...
} + sig

{

"partner_dhkey": FP(N),

"attribute_1": "super_secret_sauce",

"attribute_2": bin(x),

...

} + sig

Object

=> BF(obj)

Object Fields

=> BF(attributes)

step 3 / discovery of best path and exchange security token

no interpretation of intent tokens in intermediate nodes
no DoS attack possible on virtual rendezvous point
no single point of failure
discovery path uses signal strength as optimization
arriving intent tokens will be filtered

step 4 / messages exchange:

data is encrypted
routing along the pheromone trail (based on probability)
efficient pubsub / sender only has to send one message
arriving messages can be validated

NGI Zero / part 1 / the future

now is the time for questions or a short coffee break

NGI Zero / part 1 / questions

initial idea of the NGI Zero project:

use the virtual address space as a keyword index
"urn:osf:search:v1" => 0fa6472ba9813c56
"mydocument.odt" => 65c3189ab2746af0

approach works for single words / URLs / etc.:

documents contain more than one word: LSH / minhash signatures
what about pictures and other data sets (biology / chemistry / ...)?

not every node wants to be part of a specific search index

need additional subjects to manage search

Neuropil zero search

what is a good "distributed" index?
- define "search entry" attributes / data model (JSON/Ontologies)
- how can we distribute the search entries across a DHT?
map - reduce as a guiding principle
- but what and how to map (de-centralized) ?
  - currently looking into LSH/minhash (ANN/KNN) (for text search)
- but what and how to reduce (user specific) ?
  - define and use a "ranking" based on attributes

NGI Zero / part 2

what is a good "distributed" index?
- define "search entry" attributes / data model (JSON/Ontologies)
- how can we distribute the search entries across a DHT?
cryptographic longterm key hashing (Schnell et al.)
- construct a 256-bit hash value from a vector/dataset (or document)
- discovery through address space
minhash signature / frequency mapping
- use the minhash and its distribution to create a 256-bit hash
- mapping to address space (hamming distance)

Neuropil zero search

minhash signatures:

split text into shingles / ngrams, hash each
min/maxhash (more efficient / less MSE / higher BAR)
seed the minhash with cryptographic hash
variable size possible, but it has to be a multiple of 8
data-dependent minhash signatures
- fixed size, variable shingle size
- variable size, fixed shingling
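
The minhash construction described above can be sketched as follows. This is a plain minhash under assumptions (character shingles, keyed BLAKE2b as the seeded hash, the seed value itself); it does not reproduce the min/maxhash variant mentioned on the slide.

```python
import hashlib


def shingles(text: str, size: int = 3) -> set[str]:
    """Split normalised text into overlapping character n-grams."""
    text = " ".join(text.lower().split())
    return {text[i:i + size] for i in range(max(1, len(text) - size + 1))}


def minhash(text: str, length: int = 8, seed: bytes = b"search-seed",
            shingle_size: int = 3) -> list[int]:
    """Return a minhash signature with `length` entries (a multiple of 8)."""
    if length <= 0 or length % 8 != 0:
        raise ValueError("signature length has to be a multiple of 8")
    grams = shingles(text, shingle_size)
    signature = []
    for row in range(length):
        # One keyed hash function per row, derived from the seed.
        key = hashlib.blake2b(seed + row.to_bytes(4, "big"), digest_size=16).digest()
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode(), key=key, digest_size=8).digest(), "big")
            for g in grams
        ))
    return signature


def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """Share of matching rows estimates the Jaccard similarity of the shingle sets."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


sig_a = minhash("data owners decide what content is searchable", length=16)
sig_b = minhash("data owners decide which content is searchable", length=16)
print(estimated_jaccard(sig_a, sig_b))
```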

Neuropil zero search

compare mmh-signatures / push mmh signatures to bloom filters:

add mmh signatures to a bloom filter
calculation of jaccard similarity / containment
led me to PPRL (privacy-preserving record linkage)
- works also with encrypted strings
- it's about searching, so we can be a bit relaxed
- Schnell et al. propose multibit trees
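
A sketch of comparing signatures through bloom filters. The filters are plain Python integers used as bitsets; the filter size and the two hash functions per value are assumptions. The signature values are the ones used as `minhash(8)` example later in this deck.

```python
import hashlib

M_BITS = 1024  # assumed filter size


def to_bloom(signature: list[int], hashes: int = 2) -> int:
    """Push every signature value into a bloom filter (integer bitset)."""
    bits = 0
    for value in signature:
        for i in range(hashes):
            digest = hashlib.blake2b(f"{i}:{value}".encode(), digest_size=8).digest()
            bits |= 1 << (int.from_bytes(digest, "big") % M_BITS)
    return bits


def popcount(bits: int) -> int:
    return bin(bits).count("1")


def jaccard(a: int, b: int) -> float:
    """|A and B| / |A or B| on the set bits."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def containment(query: int, document: int) -> float:
    """Share of the query's bits that are also set in the document."""
    size = popcount(query)
    return popcount(query & document) / size if size else 1.0


query = to_bloom([15, 54, 9, 23])
document = to_bloom([15, 54, 9, 23, 823, 547, 3948, 336])
print(jaccard(query, document), containment(query, document))
```

Containment is the more useful measure when a short query is compared against a long document: the query can be fully contained even though the Jaccard similarity is low.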

Neuropil zero search

CLKHash - Cryptographic Longterm Keys:

is basically a bloom filter
standardized set of identifiers (tbd for "search")
candidate for a search entry
natural fit with intent token / pheromone
- pheromone is able to capture time information
- intent token contains secured public data
still need to find the correct clustering
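
To make "is basically a bloom filter" concrete, here is a simplified CLK-style encoding. It is a sketch under assumptions: padded bigrams, HMAC-SHA256 with a shared secret, the filter size, and the field names are illustrative and do not follow any particular CLK library or the (still open) identifier set for search.

```python
import hashlib
import hmac


def bigrams(value: str) -> list[str]:
    """Padded character bigrams of one field value."""
    padded = f"_{value.lower()}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def clk(record: dict, secret: bytes, size: int = 1024, hashes: int = 4) -> int:
    """Encode all fields of a record into one keyed bloom filter (integer bitset)."""
    bits = 0
    for field, value in sorted(record.items()):
        for gram in bigrams(str(value)):
            for i in range(hashes):
                mac = hmac.new(secret, f"{i}|{field}|{gram}".encode(),
                               hashlib.sha256).digest()
                bits |= 1 << (int.from_bytes(mac[:8], "big") % size)
    return bits


def dice(a: int, b: int) -> float:
    """Dice coefficient on the set bits, commonly used to compare CLKs."""
    total = bin(a).count("1") + bin(b).count("1")
    return 2 * bin(a & b).count("1") / total if total else 1.0


secret = b"shared-secret"  # hypothetical key shared by the participants
entry = clk({"title": "neuropil zero search", "countrycode": "DE"}, secret)
query = clk({"title": "neuropil search", "countrycode": "DE"}, secret)
print(dice(entry, query))
```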

Neuropil zero search

CLKHash - Cryptographic Longterm Keys:

Searching means: Union and Intersection of BF
Currently slow in our initial implementation
Inspiration from Roaring Bitmaps (Lemire et al.)
combine advantages of both approaches:
- Insert / Query speed from initial implementation
- Union / Intersection speed from roaring bitmaps
- Less Memory Consumption from roaring bitmaps
10x-20x faster // more items in search vector
What is the expected feature set size ?

Neuropil zero search

NEW

LSH - Locality Sensitive Hashing (based on minhash):

split mmh into b bands of r rows each
efficiently reduce the number of comparisons
lots of variants: TreeLSH, BoundedLSH, EnsembleLSH, ...
but:
- designed for target threshold (1/b)^(1/r)
- works on a fixed set of hash tables
- use a variable length hash
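
The banding step can be sketched as follows: two signatures become a candidate pair as soon as they agree on all rows of at least one band. Signature `a` is the `minhash(8)` example from this deck; `b` and `c` are made-up signatures for illustration.

```python
from collections import defaultdict
from itertools import combinations


def band_keys(signature: list[int], bands: int, rows: int) -> list[tuple]:
    """Split a signature into `bands` bucket keys of `rows` values each."""
    if bands * rows != len(signature):
        raise ValueError("bands * rows must equal the signature length")
    return [(band, tuple(signature[band * rows:(band + 1) * rows]))
            for band in range(bands)]


def candidate_pairs(signatures: dict, bands: int, rows: int) -> set:
    """Names that share at least one band bucket become candidate pairs."""
    buckets = defaultdict(set)
    for name, signature in signatures.items():
        for key in band_keys(signature, bands, rows):
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


signatures = {
    "a": [15, 54, 9, 23, 823, 547, 3948, 336],
    "b": [15, 54, 9, 23, 1, 2, 3, 4],   # hypothetical: shares the first half with a
    "c": [9, 9, 9, 9, 9, 9, 9, 9],      # hypothetical: shares one value with a and b
}
print(candidate_pairs(signatures, bands=2, rows=4))  # {('a', 'b')}
print(len(candidate_pairs(signatures, bands=8, rows=1)))  # 3
```

More bands with fewer rows make the filter more permissive, which is why the configuration has to be chosen for a target similarity threshold.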

Neuropil zero search

LPH - Locality Preserving Hashing:

used in spam/malware detection: ssdeep / nilsimsa
- low false positive rate / robust against attacks
used in forensics: tlsh
- comparing which part of two documents are similar
- resulting hash based on threshold (median)
data-dependent hash calculation
variable length hash

Neuropil zero search

More options for text analysis based LSH / LPH

LSH shingle size can now be adjusted
LPH support for text
- e.g. as an idea for URL recognition:
  - https://www.neuropil.org/search/me
  - aaaaa://bbb.cccccccc.ddd/eeeeee/ff
  - aaaaa://bbb.cccccccc.ddd/eeeeee/ggg
  - ... /you
Search Analytics Mode
- Word frequency distribution (based on single words)
- TBD: frequency distribution for different modes of "shingles"
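
One possible reading of the URL recognition idea above: every distinct token gets the next free letter, and the token is replaced by that letter repeated to the token's length. The class name and the tokenisation rule are assumptions; the slide only shows the input and output patterns.

```python
import re
import string


class TokenAlphabet:
    """Assign a letter to each distinct token, in order of first appearance."""

    def __init__(self):
        self._letters = {}

    def _letter(self, token: str) -> str:
        if token not in self._letters:
            index = len(self._letters)
            if index >= len(string.ascii_lowercase):
                raise ValueError("alphabet exhausted")
            self._letters[token] = string.ascii_lowercase[index]
        return self._letters[token]

    def encode(self, url: str) -> str:
        """Replace every alphanumeric token, keep separators such as :// . /"""
        return re.sub(
            r"[A-Za-z0-9]+",
            lambda m: self._letter(m.group(0).lower()) * len(m.group(0)),
            url,
        )


alphabet = TokenAlphabet()
print(alphabet.encode("https://www.neuropil.org/search/me"))
# aaaaa://bbb.cccccccc.ddd/eeeeee/ff
print(alphabet.encode("https://www.neuropil.org/search/you"))
# aaaaa://bbb.cccccccc.ddd/eeeeee/ggg
```

Because the token-to-letter mapping is shared between calls, two URLs with a common prefix keep the same pattern up to the point where they differ.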

Neuropil zero search

NEW

can LSH and LPH work together?

data-dependent hashing looks promising
avoid variable length encoding
querying for data, not a target probability
open for (dynamic) hash table count

Neuropil zero search

let's use a counting bloom filter to compare LSH table distribution!

revisit mmh signature / LSH: assume we have a set of eight hash tables

minhash(8): 15 - 54 - 9 - 23 - 823 - 547 - 3948 - 336

- b=8 / r=1: t = 0.125
- b=4 / r=2: t = 0.5
- b=2 / r=4: t = 0.84
- b=1 / r=8: t = 1.0
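
The threshold values for the different (b, r) settings follow from the approximation t = (1/b)^(1/r). The short check below recomputes them; the candidate probability 1 - (1 - s^r)^b is the standard LSH banding formula and is added here for context, it is not taken from the slides.

```python
def lsh_threshold(bands: int, rows: int) -> float:
    """Approximate similarity at which a pair becomes a likely candidate."""
    return (1 / bands) ** (1 / rows)


def candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Probability that two signatures share at least one complete band."""
    return 1 - (1 - similarity ** rows) ** bands


for bands, rows in [(8, 1), (4, 2), (2, 4), (1, 8)]:
    t = lsh_threshold(bands, rows)
    p = candidate_probability(t, bands, rows)
    print(f"b={bands} r={rows}: t={t:.3f}, P(candidate at t)={p:.3f}")
```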

revisit mmh signature / LSH:

NGI Zero / part 2

L-Quartile

U-Quartile

Median

use the median to calculate relative importance of the eight tables

tables seven and eight are most important to query

revisit mmh signature / LSH:

NGI Zero / part 2

L-Quartile

U-Quartile

Median

use the median? experimenting with different approaches

tables four, seven and eight are most important to query
burden of binomial frequency distribution with hamming distance

NEW

Neuropil zero search

using LSH and LPH together - 256bit hash value

relative importance of virtual tables can be compared
- locally the full hamming distance is used
- distribution is based on partial hamming distance
- is a kind of multi-index
easy to calculate, easy to distribute
- using octile values (3 bits per octile / assuming 85 hash tables)
- uses a BK-tree implementation including binning (neighbour table search)
- on table hit, CLKHashes are compared
- can be extended with additional tables / np_index
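
The arithmetic behind "3 bits per octile / assuming 85 hash tables" is 85 x 3 = 255 bits, which fits into a 256-bit hash value. The sketch below shows one way such a value could be packed and compared. How the per-table scores are obtained and how octiles are assigned (here: by rank) are assumptions, not the NPIndex algorithm.

```python
TABLES = 85          # assumed number of virtual hash tables (from the slide)
BITS_PER_TABLE = 3   # one octile value (0..7) per table


def octile_ranks(scores: list[float]) -> list[int]:
    """Map each score to 0..7 according to its rank among all scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order):
        ranks[index] = position * 8 // len(scores)
    return ranks


def pack_index(scores: list[float]) -> int:
    """Pack 85 octile values into one integer of at most 255 bits."""
    if len(scores) != TABLES:
        raise ValueError(f"expected {TABLES} table scores")
    key = 0
    for rank in octile_ranks(scores):
        key = (key << BITS_PER_TABLE) | rank
    return key


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two index values."""
    return bin(a ^ b).count("1")


scores = [float(i) for i in range(TABLES)]   # hypothetical per-table scores
key = pack_index(scores)
other = pack_index(scores[::-1])
print(key.bit_length(), hamming(key, other))
```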

Neuropil zero search

can LSH and LPH work together - 256bit hash value

will it work in fully distributed mode?
- hash distance routing guarantees query of closest "table"
- nodes can detect the required hamming distance
- storage in multiple search nodes is guaranteed
until now: validated locally with 600k entries
reasonable performance (running unoptimized code)
- parallel execution of queries
- code optimization (cache misses / network runtime / ...)
- using an embedded database for bf (bitmap) comparisons (?)

Neuropil zero search

what is a good "distributed" index?
- defined "search entry" data model
  - intent (CWT) token of data owner
  - claims to be used as attributes (extend e.g. for HTML ...)
  - CLKHash to represent the actual data set
  - can be used "in private": add PPAttributes (minhash)
- defined the distributed search index for a DHT
  - NPIndex: relative importance based on our search entry
  - 256bit data-dependent hash value

Neuropil zero search

what is a good "distributed" index?
- search entries will "disappear"
  - encoded time information will enable "forgetting"
  - automatically evicts malicious content
  - keeps information up to date
- mutual exchange of interest
  - the searcher retrieves a list of possible data sources
  - the content provider retrieves a list of searchers
  - no man-in-the-middle to prevent exchange

Neuropil zero search

How do nodes "discover" their peers?
- Each server node uses four interfaces
  - ability to announce itself: H("urn:np:search:node:v1")
  - ability to store a search entry: H("urn:np:search:entry:v1")
  - ability to query for a search entry: H("urn:np:search:query:v1")
  - ability to receive search result: H("urn:np:search:result:v1")
- Each of those four interfaces can and will be "tweaked"
  - the general subject also contains "peerid" as an attribute
  - the remaining interfaces always use hash concatenation, e.g:
    - H("urn:np:search:entry:v1") + H("np:search:peerid")
    - nodes only connect based on hash distance
    - "entry"/"query"/"result" automagically gain "onion" super-powers

NEW

Neuropil zero search

How do nodes "discover" their peers?
- The general subject can be tweaked with a seed value
  - e.g. category-based search nodes:
    - H("org") + H("urn:np:search:node:v1")
    - H("org") + H("science") + H("urn:np:search:node:v1")
    - H("org") + H("science") + H("search") + H("urn:np:search:node:v1")
    - related categories of a search node can be added as an attribute to enable others to discover more search spaces
- allows building "sub-search-spaces"
  - for and with specific search content
  - for specific search content providers
    - (e.g. data center, special hardware, or content-related)

NEW

Neuropil zero search

How do nodes "discover" their peers?
- The general subject can be tweaked with a seed value:
  - e.g. technical difference:
    - H("bm25")+H("org")+H("science")+H("urn:np:search:node:v1")
    - H("5kmer")+H("org")+H("science")+H("urn:np:search:node:v1")
    - H("ai-hi-fly")+H("org")+H("science")+H("urn:np:search:node:v1")
    - different spaces can be queried in parallel!
    - "reducing" of search results can be done at the query client
  - e.g. private search nodes:
    - H("my:private:secret") + H("urn:np:search:node:v1")
    - will effectively render your own search nodes invisible
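
How seed values separate search spaces can be sketched as below. The way the individual hashes are combined (hashing the concatenation of the part hashes) and the use of BLAKE2b are assumptions; the slides only state that hash concatenation is used.

```python
import hashlib

NODE_SUBJECT = "urn:np:search:node:v1"


def h(text: str) -> bytes:
    """256-bit hash of one part (BLAKE2b is an assumption)."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=32).digest()


def space_address(*parts: str) -> bytes:
    """Combine H(part) values into one 256-bit address.

    A single part maps to H(part); several parts are concatenated and
    hashed again (assumed combination rule).
    """
    if len(parts) == 1:
        return h(parts[0])
    return hashlib.blake2b(b"".join(h(p) for p in parts), digest_size=32).digest()


spaces = {
    "public": space_address(NODE_SUBJECT),
    "org/science": space_address("org", "science", NODE_SUBJECT),
    "bm25": space_address("bm25", "org", "science", NODE_SUBJECT),
    "5kmer": space_address("5kmer", "org", "science", NODE_SUBJECT),
    "private": space_address("my:private:secret", NODE_SUBJECT),
}
for name, address in spaces.items():
    print(f"{name:12} {address.hex()[:16]}...")
```

Each seed yields an unrelated point in the address space, so a node that does not know the seed (for example `my:private:secret`) has no way to derive the subject of that space.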

NEW

Neuropil zero search

H("urn:np:search:node:v1")

SearchNode "Server"

SearchNode "Client"

H("urn:np:search:entry:v1")+

H("urn:np:search:peer:id")

H("urn:np:search:query:v1")+

H("urn:np:search:peer:id")

H("urn:np:search:result:v1")+

H("urn:np:search:peer:id")

SearchNode "Private"

urn:np:search:peer:id=H("/dev/random")

H("urn:np:search:node:v1") +

H("my:private:secret")

NEW

Neuropil zero search

bm25 - 5kmer - space

org-science-AI space

H256(X)

0...

8...

4...

c...

your-private space

NEW

Neuropil zero search

Open Question: Identities
- if every participant can add a search entry, how to establish trust?
- additional "curator" signature for SEO companies
- PKI / web of trust unsuited, but TSA is an answer
Open Question: Time
- all systems must have the same understanding of time
- index is attenuated, entries will disappear after a time
Open Question: Runtime
- Python Binding (full support), Lua / NodeJS (partial) ...
- WASM to execute user supplied map/reduce code?

Neuropil Zero Search

add curator

Open Questions: Search Pipeline Collaboration
- good understanding of the whole process is needed
- adding a search entry (BM25 / TF-IDF / ...)
- querying a search entry

select ranking

exec bm25

select curator

NGI Zero / part 2

now is the time for questions or a longer coffee break

NGI Zero / part 2

Demo

pi-lar GmbH

Kreuzgasse 2-4

50667 Köln

www.pi-lar.net

info@pi-lar.net

eliza@neuropil.org

www.neuropil.org

https://www.gitlab.com/pi-lar/neuropil

neuropil@pi-lar.net

Let's

chat !

Join Our Workshops!

NGI Zero Discovery@Neuropil

By Stephan Schwichtenberg

NGI Zero Discovery@Neuropil

a short introduction to the neuropil messaging layer

Stephan Schwichtenberg

Enterprise architect, integration expert, electric monk, eternal optimist, MBA, secure IoT advocate, Founder of pi-lar GmbH. Neuropil - the first messaging protocol with security and privacy by design. Enabling Secure Global Data Communication.
