ZeroDiscovery

@nd

Neuropil

 

OpenSearchSymposium 2021 / 12.10.2021

Stephan Schwichtenberg

Hello from

Meet Eliza & Marvin

Security of the Past: Limitations

only protection of bilateral IP connections

 

protects APIs, not the individual data objects

 

unsuited for rapid change of data owners / new data channels

Security of the Future: ZeroTrust

trust perimeter has changed

 

fragmented information (flows) need protection

 

authn/authz must be possible everywhere

 

data objects governed by external/internal access policies (AP)

Security for Ecosystems:

Zero Trust / IDSA / AccessPolicies

data object interactions main driver for future IT architecture

 

devices produce and consume data at the same time

 

respect different data owners per device

 

if one fails, all suffer!

Our Approach

www.neuropil.org

OpenSource CyberSecurity Mesh

Milestones

development started in 2014

 

2016: first exhibition @FROSCON

 

2019: NGI Zero / EU funded

 

2021: beta-release HMI 2021

             looking for pilots & partners

Random Identities

User Identities

Intent Identities

H256(X)

0...

8...

4...

c...

neuropil & IDSA

why we joined:

 

rules to enforce data ownership / sovereignty

 

increase data quality and transparency

 

building european-wide ecosystems

neuropil & IDSA

neuropil@IDSA

 

decentralized (meta-data) broker

 

each application/device is a connector

 

decentralized MQTT

Use Case:

Distributed Search Engine

Neuropil is a project that wants to turn the tables on online search and discovery: instead of search solutions calling the shots, data owners decide what content is publicly searchable in the first place.

They can do this through a new messaging layer that is private and secure by design. Data owners can send cryptographic and unique so-called intent messages that state what specific information can be found where.

The access to the actual information or content is also controlled by data owners, for instance to provide either paid or public free content.

central broker structures

[Diagram: a central Broker and the nodes a4, 93, 82, 2c, 3d, 4e, b5, 0a, 1b, 71, 60, 5f]

_

  • has to grow with the number of connected instances
  • any central broker is attackable
  • information is duplicated, and possibly outdated
    • crawling is a waste of energy
    • legal aspects of copyright / data ownership
  • can withhold or change information
  • needs to understand many different languages (data models)
  • we would like to search for data, not URLs
  • who is the broker of all brokers?
    • federated broker
    • distributed broker

central broker structures ?

[Diagram: a Broker and the nodes a4, 93, 82, 2c, 3d, 4e, b5, 0a, 1b, 71, 60, 5f]

  • how can we prevent an unfair advantage for any participant?
    • leave the data / content where it is
    • distribute the search index according to mathematical rules
  • how can we prevent malicious content?
    • add a check before content is added
    • use "trust" signatures to mark reviewed content (attribute)
  • how can we protect the privacy of users?
    • use PPRL to share information about a document
    • encrypt data in transit, allow access control

_

de-central algorithms

NGI Zero / part 1

_

[Diagram: nodes a4, 93, 82, 2c, 3d, 4e, b5, 0a, 1b, 71, 60, 5f]

subject="urn:neuropil:photo:library:v1"

Id

N

I'

Identity Token

Node Token

Intent Token

{
  "iss": FP(Id),
  "sub": "mail:pseudonym@example.com",
  "pub": <binary data>,
  ...
} + sig

{
    "partner_dhkey": FP(N),
    "attribute_1": "super_secret_sauce",
    "attribute_1": bin(x),
} + sig

=> H(sub)

NGI Zero / part 1

_

_

step 1 / obfuscate subject:

  • { "subject":"urn:neuropil:photo:library:v1" =>  0fa6472ba9813c56 }
  • serves as a rendezvous point
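
The subject obfuscation of step 1 can be sketched in a few lines. This is only an illustration: the choice of BLAKE2b with a 256-bit digest and the helper name `subject_hash` are assumptions, and the digest printed here will not reproduce the example value shown on the slide.

```python
import hashlib

def subject_hash(subject: str) -> bytes:
    """Map a subject string to a 256-bit value in the address space.

    Assumption: BLAKE2b stands in for whatever hash function H is used.
    """
    return hashlib.blake2b(subject.encode("utf-8"), digest_size=32).digest()

# Sender and receiver compute the same value independently,
# so the digest can act as their rendezvous point.
rendezvous = subject_hash("urn:neuropil:photo:library:v1")
print(rendezvous.hex()[:16], "...")
```

Because only the digest travels through the network, intermediate nodes see where to route but not the readable subject.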

step 2 / send intent token:

  • intermediate nodes had to store / match / resend intent tokens
  • parsing / interpretation / validation of each token (signature check)
  • DoS attacks possible (flooding with intent tokens)

step 3 / messages exchange:

  • data is encrypted
  • routing is based on hash distance / hash table
  • if (receiver count > 1) => sender has to duplicate messages
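
A minimal sketch of routing by hash distance, using the node labels from the diagrams as one-byte prefixes. The XOR metric and the helper names `next_hop` and `fan_out` are assumptions made for illustration; the slides only state that routing follows hash distance and that the sender duplicates messages for multiple receivers.

```python
# Node labels from the slide diagrams, read as one-byte hex prefixes.
NODES = [0xA4, 0x93, 0x82, 0x2C, 0x3D, 0x4E, 0xB5, 0x0A, 0x1B, 0x71, 0x60, 0x5F]

def distance(a: int, b: int) -> int:
    # Assumption: XOR distance, as used by many distributed hash tables.
    return a ^ b

def next_hop(target: int, known_nodes):
    """Pick the known node whose address is closest to the target hash."""
    return min(known_nodes, key=lambda node: distance(node, target))

def fan_out(message: bytes, receivers):
    """Old behaviour: the sender creates one copy of the message per receiver."""
    return [(receiver, message) for receiver in receivers]

target = 0x0F  # first byte of the obfuscated subject 0fa6...
print(hex(next_hop(target, NODES)))  # prints 0xa
```

With more than one receiver, `fan_out` shows why the sender's cost grew linearly with the receiver count in this earlier design.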

NGI Zero / part 1 / the past

_

step 1 / obfuscate subject:

  • { "subject":"urn:neuropil:photo:library:v1" =>  0fa6472ba9813c56 }
  • serves as a virtual rendezvous point
  • create a "pheromone" to discover routing information
    • a pheromone is an attenuated, counting bloom filter (p = 1:1000 for 128 subjects)
    • attenuated => captures age / hop count information / signal strength
    • counting => allows entries to be removed
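
The pheromone idea can be illustrated with a small counting bloom filter that also tracks a signal strength per slot. This is a sketch under assumptions, not the neuropil data structure: the class name `Pheromone`, the SHA-256 based slot selection and the "decrement by one per hop" attenuation are all chosen for illustration. Only the sizing (128 subjects, error rate 1:1000) comes from the slide.

```python
import hashlib
import math

class Pheromone:
    """Counting bloom filter whose slots also carry a signal strength."""

    def __init__(self, capacity=128, fp_rate=0.001):
        # Standard bloom filter sizing for the stated capacity and error rate.
        self.m = math.ceil(-capacity * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.count = [0] * self.m
        self.strength = [0] * self.m

    def _slots(self, subject_hash: bytes):
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + subject_hash).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, subject_hash: bytes, strength: int):
        for s in self._slots(subject_hash):
            self.count[s] += 1
            self.strength[s] = max(self.strength[s], strength)

    def remove(self, subject_hash: bytes):
        # Counting (instead of single bits) is what makes removal possible.
        for s in self._slots(subject_hash):
            if self.count[s] > 0:
                self.count[s] -= 1
                if self.count[s] == 0:
                    self.strength[s] = 0

    def query(self, subject_hash: bytes) -> int:
        """Return the signal strength for a subject, 0 if it is not present."""
        slots = list(self._slots(subject_hash))
        if any(self.count[s] == 0 for s in slots):
            return 0
        return min(self.strength[s] for s in slots)

    def attenuate(self):
        """Weaken every signal by one step, e.g. once per hop (assumption)."""
        self.strength = [max(0, v - 1) for v in self.strength]
```

A node forwarding the pheromone would call `attenuate()` first, so peers further away observe a weaker signal and can prefer the stronger trail.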

0fa6 472b a981 3c56 (32 bytes)

 

 

01010000 00100100 10001000 00010100 (3*4 bytes)

 

NGI Zero / part 1 / the future

_

step 2 / obfuscate attributes of intent token:

  • attributes = { "urn:osf:search:countrycode": "DE", ... }
  • build hash of { key, value } pairs, and add the result to a bloom filter
  • the resulting bloom filter (ABF/Attribute Bloom Filter) can act as:
    • a simple policy enforcement:
      • arriving intent tokens have to match the attribute bloom filter
      • the attribute filter can be derived from e.g. a simple GraphQL query
    • tbd: a simple syntax validator
    • tbd: push the ABF also on the network layer
      • probabilistic event / object dissemination network
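
A rough sketch of how an attribute bloom filter could act as a simple policy check. The filter size `M`, the hash count `K` and the function names are hypothetical, and the subset test is one possible reading of "have to match"; the slides do not specify the exact rule.

```python
import hashlib

M = 1024  # filter size in bits (hypothetical)
K = 3     # hash functions per attribute (hypothetical)

def positions(key: str, value: str):
    data = f"{key}={value}".encode("utf-8")
    for i in range(K):
        digest = hashlib.sha256(bytes([i]) + data).digest()
        yield int.from_bytes(digest[:8], "big") % M

def attribute_filter(attributes: dict) -> int:
    """Hash every {key, value} pair into a bloom filter, kept as an int bitset."""
    bits = 0
    for key, value in attributes.items():
        for pos in positions(key, value):
            bits |= 1 << pos
    return bits

def matches(policy: int, token_attributes: dict) -> bool:
    """Assumption: a token matches if it sets every bit the policy requires."""
    token_bits = attribute_filter(token_attributes)
    return (policy & token_bits) == policy

policy = attribute_filter({"urn:osf:search:countrycode": "DE"})
print(matches(policy, {"urn:osf:search:countrycode": "DE", "lang": "de"}))
```

As with any bloom filter, a match can be a false positive, so this only works as a cheap pre-filter in front of a real attribute check.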

NGI Zero / part 1 / the future

_

step 2.5 / increase storage capabilities of ABF:

  • current approach is limited by memory size
  • new approach based on Roaring Bitmaps (Lemire et al.)
  • ability to encode far more items into the ABF
  • more efficient union / intersection (10x-20x faster)
  • less memory usage

NGI Zero / part 1 / the future

NEW

_

step 2 / obfuscate attributes of intent token:

NGI Zero / part 1 / the future

{
  "iss": FP(Id),
  "sub": "urn:neuropil:photo:library:v1",
  "pub": <binary data>,
  ...
} + sig

{
    "partner_dhkey": FP(N),
    "attribute_1": "super_secret_sauce",
    "attribute_1": bin(x),
    ...
} + sig

Object

=> BF(obj)

Object Fields

=> BF(attributes)

_

step 3 / discovery of best path and exchange security token

  • no interpretation of intent tokens in intermediate nodes
  • no DoS attack possible on the virtual rendezvous point
  • no single point of failure
  • discovery path uses signal strength as optimization
  • arriving intent tokens will be filtered

step 4 / messages exchange:

  • data is encrypted
  • routing along the pheromone trail (based on probability)
  • efficient pubsub / sender only has to send one message
  • arriving messages can be validated

NGI Zero / part 1 / the future

_

now is the time for questions or a short coffee break

NGI Zero / part 1 / questions

_

initial idea of the NGI Zero project:

  • use the virtual address space as a catchword index
  • "urn:osf:search:v1" =>  0fa6472ba9813c56
  • "mydocument.odt" =>  65c3189ab2746af0

approach works for single words / URLs / etc.:

  • documents contain more than one word: LSH / minhash signatures
  • what about pictures and other data sets (biology / chemistry / ...)?

 

not every node wants to be part of a specific search index

  • need additional subjects to manage search

Neuropil zero search

  • what is a good "distributed" index?
    • define "search entry" attributes / data model (JSON/Ontologies)
    • how can we distribute the search entries across a DHT?
  • map - reduce as a guiding principle
    • but what and how to map (de-centralized)?
      • currently looking into LSH/minhash (ANN/KNN) (for text search)
    • but what and how to reduce (user specific)?
      • define and use a "ranking" based on attributes

_

NGI Zero / part 2

  • what is a good "distributed" index?
    • define "search entry" attributes / data model (JSON/Ontologies)
    • how can we distribute the search entries across a DHT?
  • cryptographic longterm key hashing (Schnell et al.)
    • construct a 256-bit hash value from a vector/dataset (or document)
    • discovery through address space
  • minhash signature / frequency mapping
    • use the minhash and its distribution to create a 256-bit hash
    • mapping to address space (hamming distance)

_

Neuropil zero search

minhash signatures:

  • split text into shingles / ngrams, hash each
  • min/maxhash (more efficient / less MSE / higher BAR)
  • seed the minhash with cryptographic hash
  • variable size possible, but it has to be a multiple of 8
  • data-dependent minhash signatures
    • fixed size, variable shingle size
    • variable size, fixed shingling
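
A minimal minhash sketch along the lines of the bullets above: the text is split into character shingles, each shingle is hashed with a seeded cryptographic hash, and the minimum per seed forms the signature. The function names, the use of BLAKE2b, the 32-bit hash width and the reading of "mod(8)" as "multiple of 8" are assumptions; the min/max variant mentioned on the slide is not shown.

```python
import hashlib

def shingles(text: str, size: int = 3):
    """Character n-grams of the lower-cased text."""
    text = text.lower()
    return {text[i:i + size] for i in range(max(1, len(text) - size + 1))}

def minhash(text: str, num_hashes: int = 8, shingle_size: int = 3):
    # Assumption: the signature size must be a multiple of 8.
    if num_hashes % 8 != 0:
        raise ValueError("signature size must be a multiple of 8")
    grams = shingles(text, shingle_size)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # one seeded hash function per row
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(salt + g.encode("utf-8"), digest_size=4).digest(),
                "big",
            )
            for g in grams
        ))
    return signature

def estimated_jaccard(sig_a, sig_b) -> float:
    """Fraction of equal rows approximates the Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash("neuropil is a secure messaging layer")
b = minhash("neuropil is a private messaging layer")
print(estimated_jaccard(a, b))
```

Changing `shingle_size` while keeping `num_hashes` fixed corresponds to the "fixed size, variable shingle size" option; the opposite option varies `num_hashes` instead.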

_

Neuropil zero search

compare mmh-signatures / push mmh signatures to bloom filters:

  • add mmh signatures to a bloom filter
  • calculation of jaccard similarity / containment 
  • led me to PPRL (privacy preserving record linkage)
    • also works with encrypted strings
    • it's about searching, so we can be a bit relaxed
    • Schnell et al. propose multibit trees
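
Comparing signatures through bloom filters can be sketched as follows. The filter is kept as a Python integer used as a bitset; the size `M`, the hash count `K` and the helper names are hypothetical. The values returned are estimates on the bit level, not exact set similarities.

```python
import hashlib

M = 512  # filter size in bits (hypothetical)
K = 2    # hash functions per value (hypothetical)

def bloom(values) -> int:
    """Insert each signature value into a bloom filter stored as an int bitset."""
    bits = 0
    for v in values:
        for i in range(K):
            digest = hashlib.sha256(f"{i}:{v}".encode("utf-8")).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % M)
    return bits

def popcount(x: int) -> int:
    return bin(x).count("1")

def jaccard(a: int, b: int) -> float:
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0

def containment(a: int, b: int) -> float:
    """Share of the bits of a that are also set in b."""
    own = popcount(a)
    return popcount(a & b) / own if own else 0.0

sig_a = [15, 54, 9, 23, 823, 547, 3948, 336]  # example signature values
sig_b = [15, 54, 9, 23, 1, 2, 3, 4]           # shares the first four rows
print(jaccard(bloom(sig_a), bloom(sig_b)))
```

Only bitwise AND, OR and a population count are needed, which is why the same comparison can also run on filters built from keyed or encrypted inputs.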

_

Neuropil zero search

CLKHash - Cryptographic Longterm Keys:

  • is basically a bloom filter
  • standardized set of identifiers (tbd for "search")
  • candidate for a search entry
  • natural fit with intent token / pheromone
    • pheromone is able to capture time information
    • intent token contains secured public data
  • still need to find the correct clustering

_

Neuropil zero search

CLKHash - Cryptographic Longterm Keys:

  • Searching means: Union and Intersection of BF
  • Currently slow in our initial implementation
  • Inspiration from Roaring Bitmaps (Lemire et al.)
  • combine advantages of both approaches:
    • Insert / Query speed from initial implementation
    • Union / Intersection speed from roaring bitmaps
    • Less Memory Consumption from roaring bitmaps
  • 10x-20x faster // more items in search vector
  • What is the expected feature set size ?

_

Neuropil zero search

NEW

LSH - Locality Sensitive Hashing (based on minhash):

  • split mmh into b bands of r rows each
  • efficiently reduce the number of comparisons
  • lots of variants: TreeLSH, BoundedLSH, EnsembleLSH, ...
  • but:
    • designed for target threshold (1/b)^(1/r)
    • works on a fixed set of hash tables
    • use a variable length hash
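
The banding scheme and its target threshold (1/b)^(1/r) can be sketched like this. The function names are illustrative, and the eight signature values are only example data.

```python
def lsh_threshold(bands: int, rows: int) -> float:
    """Approximate similarity at which two items become likely candidates."""
    return (1 / bands) ** (1 / rows)

def band_keys(signature, bands: int, rows: int):
    """Split a minhash signature into `bands` tuples of `rows` values each."""
    if bands * rows != len(signature):
        raise ValueError("bands * rows must equal the signature length")
    return [tuple(signature[i * rows:(i + 1) * rows]) for i in range(bands)]

def is_candidate(sig_a, sig_b, bands: int, rows: int) -> bool:
    """Two signatures are candidates if at least one band is identical."""
    return any(
        a == b
        for a, b in zip(band_keys(sig_a, bands, rows), band_keys(sig_b, bands, rows))
    )

sig = [15, 54, 9, 23, 823, 547, 3948, 336]
for b, r in [(8, 1), (4, 2), (2, 4), (1, 8)]:
    print(f"b={b} r={r} t={lsh_threshold(b, r):.3f}")
```

Each band is typically stored in its own hash table, which is why a classic LSH index works on a fixed set of tables chosen for one target threshold.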

_

Neuropil zero search

LPH - Locality Preserving Hashing:

  • used in spam/malware detection: ssdeep / nilsimsa
    • low false positive rate / robust against attacks
  • used in forensics: tlsh
    • comparing which parts of two documents are similar
    • resulting hash based on threshold (median)
  • data-dependent hash calculation
  • variable length hash

_

Neuropil zero search

More options for text analysis based LSH / LPH

  • LSH shingle size can now be adjusted
  • LPH support for text
    • e.g. as an idea for URL recognition:
      • https://www.neuropil.org/search/me
      • aaaaa://bbb.cccccccc.ddd/eeeeee/ff
      • aaaaa://bbb.cccccccc.ddd/eeeeee/ggg
      • ...                                                         /you
  • Search Analytics Mode
    • Word frequency distribution (based on single words)
    • TBD: frequency distribution for different modes of "shingles"

_

Neuropil zero search

NEW

can LSH and LPH work together?

  • data-dependent hashing looks promising
  • avoid variable length encoding
  • querying for data, not a target probability
  • open for (dynamic) hash table count

_

Neuropil zero search

let's use a counting bloom filter to compare LSH table distribution!

revisit mmh signature / LSH:  (b=8/r=1; t=0.125)

_

Neuropil zero search

minhash(8): 15 - 54 - 9 - 23 - 823 - 547 - 3948 - 336

assume we have a set of eight hash tables

revisit mmh signature / LSH:  (b=4/r=2; t=0.5)

_

NGI Zero / part 2

minhash(8): 15 - 54 - 9 - 23 - 823 - 547 - 3948 - 336

revisit mmh signature / LSH:  (b=2/r=4; t=0.84)

_

NGI Zero / part 2

minhash(8): 15 - 54 - 9 - 23 - 823 - 547 - 3948 - 336

revisit mmh signature / LSH:  (b=1/r=8; t=1.0)

_

NGI Zero / part 2

minhash(8): 15 - 54 - 9 - 23 - 823 - 547 - 3948 - 336

revisit mmh signature / LSH:  (b=2/r=4)

_

NGI Zero / part 2

minhash(8): 15 - 54 - 9 - 23 - 823 - 547 - 3948 - 336

revisit mmh signature / LSH:

_

NGI Zero / part 2

L-Quartile

U-Quartile

Median

use the median to calculate relative importance of the eight tables

  • tables seven and eight are most important to query
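
One way to turn a per-table distribution into relative importance is to code each table by the quartile it falls into. The sketch below is an assumption about how such a coding could look: the counts are made-up example data, the helper name is hypothetical, and treating the top quartile as "most important" is an interpretation, not something the slides define.

```python
import statistics

def quartile_codes(table_counts):
    """Assign a 2-bit code per table: 00 below the lower quartile ... 11 above the upper."""
    lower, median, upper = statistics.quantiles(table_counts, n=4)
    codes = []
    for count in table_counts:
        if count < lower:
            codes.append("00")
        elif count < median:
            codes.append("01")
        elif count < upper:
            codes.append("10")
        else:
            codes.append("11")
    return codes

# Hypothetical counts for eight LSH tables, e.g. read from a counting bloom filter.
counts = [12, 7, 3, 8, 2, 6, 40, 31]
codes = quartile_codes(counts)
important = [i + 1 for i, code in enumerate(codes) if code == "11"]
print(codes, important)
```

With these example counts, the tables in the top quartile are numbers seven and eight, so a query would visit those first.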

[Figure labels: 00 01 10 11 10 01 00 01 00 01 11 11]

revisit mmh signature / LSH:

_

NGI Zero / part 2

L-Quartile

U-Quartile

Median

use the median? experiments with different approaches

  • tables four, seven and eight are most important to query
  • burden of binomial frequency distribution with hamming distance

[Figure labels: 00 01 10 11 10 01 00 01 00 01 11 11]

NEW

_

Neuropil zero search

using LSH and LPH together - 256bit hash value

  • relative importance of virtual tables can be compared
    • locally the full hamming distance is used
    • distribution is based on partial hamming distance
    • is a kind of multi-index
  • easy to calculate, easy to distribute
    • using octile values (3 bits per octile / assuming 85 hash tables)
    • uses a BK-tree implementation including binning (neighbour table search)
    • on table hit, CLKHash'es are compared
    • can be extended with additional tables / np_index

_

Neuropil zero search

can LSH and LPH work together - 256bit hash value

  • will it work in fully distributed mode?
    • hash distance routing guarantees query of closest "table"
    • nodes can detect the required hamming distance
    • storage in multiple search nodes is guaranteed
  • until now: validated locally with 600k entries
  • reasonable performance (running un-optimized code)
    • parallel execution of queries
    • code optimization (cache misses / network runtime / ...)
    • using an embedded database for bf (bitmap) comparisons (?)

_

Neuropil zero search

  • what is a good "distributed" index?
    • defined "search entry" data model
      • intent (CWT) token of data owner
      • claims to be used as attributes (extend e.g. for HTML ...)
      • CLKHash to represent the actual data set
      • can be used "in private": add PPAttributes (minhash)
    • defined the distributed search index for a DHT
      • NPIndex: relative importance based on our search entry
      • 256-bit data-dependent hash value

_

Neuropil zero search

  • what is a good "distributed" index?
    • search entries will "disappear"
      • encoded time information will enable "forgetting"
      • automatically evicts malicious content
      • ensure actuality of information
    • mutual exchange of interest
      • the searcher retrieves a list of possible data sources
      • the content provider retrieves a list of searchers
      • no man-in-the-middle to prevent exchange

_


Neuropil zero search

  • How do nodes "discover" their peers?
    • Each server node uses four interfaces
      • ability to announce itself: H("urn:np:search:node:v1")
      • ability to store a search entry: H("urn:np:search:entry:v1")
      • ability to query for a search entry: H("urn:np:search:query:v1")
      • ability to receive search result: H("urn:np:search:result:v1")
    • Each of those four interfaces can and will be "tweaked"
      • the general subject also contains "peerid" as an attribute
      • the remaining interfaces always use hash concatenation, e.g:
        • H("urn:np:search:entry:v1") + H("np:search:peerid")
        • nodes only connect based on hash distance
        • "entry"/"query"/"result" automagically gain "onion" super-powers
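
How the four interface subjects could be derived is sketched below. The slides write the combination as H(a) + H(b) and call it hash concatenation; hashing the concatenated digests again, the use of BLAKE2b and the random peer id are assumptions made to get a runnable example.

```python
import hashlib
import os

def h(value: bytes) -> bytes:
    # Assumption: BLAKE2b with a 256-bit digest stands in for H.
    return hashlib.blake2b(value, digest_size=32).digest()

def combine(*digests: bytes) -> bytes:
    """Assumption: '+' is read as hashing the concatenation of the digests."""
    return h(b"".join(digests))

def interface_subjects(peer_id: bytes) -> dict:
    """Derive the announce subject and the three peer-specific subjects."""
    return {
        "node": h(b"urn:np:search:node:v1"),
        "entry": combine(h(b"urn:np:search:entry:v1"), peer_id),
        "query": combine(h(b"urn:np:search:query:v1"), peer_id),
        "result": combine(h(b"urn:np:search:result:v1"), peer_id),
    }

peer_a = h(os.urandom(32))  # random peer id, hypothetical
peer_b = h(os.urandom(32))
subjects_a = interface_subjects(peer_a)
subjects_b = interface_subjects(peer_b)
print(subjects_a["entry"].hex()[:16], subjects_b["entry"].hex()[:16])
```

All nodes share the same announce subject, while the peer-specific subjects end up at unrelated positions in the address space for every peer id.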

NEW

_

Neuropil zero search

  • How do nodes "discover" their peers?
    • The general subject can be tweaked with a seed value
      • e.g. category-based search nodes:
        • H("org") + H("urn:np:search:node:v1")
        • H("org") + H("science") + H("urn:np:search:node:v1")
        • H("org") + H("science") + H("search") + H("urn:np:search:node:v1")
        • related categories of a search node can be added as an attribute to enable others to discover more search spaces
    • allows building "sub-search-spaces"
      • for and with specific search content
      • for specific search content providers
        • (e.g. data center, special hardware or content-related)

NEW

_

Neuropil zero search

  • How do nodes "discover" their peers?
    • The general subject can be tweaked with a seed value:
      • e.g. technical difference:
        • H("bm25")+H("org")+H("science")+H("urn:np:search:node:v1")
        • H("5kmer")+H("org")+H("science")+H("urn:np:search:node:v1")
        • H("ai-hi-fly")+H("org")+H("science")+H("urn:np:search:node:v1")
        • different spaces can be queried in parallel!
        • "reducing" of search results can be done at the query client
      • e.g. private search nodes:
        • H("my:private:secret") + H("urn:np:search:node:v1")
        • will effectively render your own search nodes invisible

NEW

_

Neuropil zero search

H("urn:np:search:node:v1")

SearchNode "Server"

SearchNode "Client"

H("urn:np:search:entry:v1")+

H("urn:np:search:peer:id")

H("urn:np:search:query:v1")+

H("urn:np:search:peer:id")

H("urn:np:search:result:v1")+

H("urn:np:search:peer:id")

SearchNode "Private"

urn:np:search:peer:id=H("/dev/random")

urn:np:search:peer:id=H("/dev/random")

urn:np:search:peer:id=H("/dev/random")

H("urn:np:search:node:v1") +

H("my:private:secret")

NEW

_

Neuropil zero search

bm25 - 5kmer - space

org-science-AI space

H256(X)

0...

8...

4...

c...

your-private space

NEW

_


Neuropil zero search

  • Open Question: Identities
    • if every participant can add a search entry, how to establish trust?
    • additional "curator" signature for SEO companies
    • PKI / web of trust unsuited, but TSA is an answer
  • Open Question: Time
    • all systems must have the same understanding of time
    • index is attenuated, entries will disappear after a time
  • Open Question: Runtime
    • Python Binding (full support), Lua / NodeJS (partial) ...
    • WASM to execute user supplied map/reduce code?

_

Neuropil Zero Search

add curator

  • Open Questions: Search Pipeline Collaboration
    • good understanding of the whole process is needed
    • adding a search entry (BM25 / TF-IDF / ...)
    • querying a search entry

select ranking

exec bm25

select curator

_

NGI Zero / part 2

now is the time for questions or a longer coffee break

_

NGI Zero / part 2

Demo

pi-lar GmbH

Kreuzgasse 2-4

50667 Köln

 

www.pi-lar.net

info@pi-lar.net

eliza@neuropil.org

 

www.neuropil.org

https://www.gitlab.com/pi-lar/neuropil

neuropil@pi-lar.net

Let's

chat !

Join Our Workshops!

NGI Zero Discovery@Neuropil

By Stephan Schwichtenberg

a short introduction to the neuropil messaging layer
