Policy Training

Indexing

Jason M. Coposky

@jason_coposky

Executive Director, iRODS Consortium

June 25-28, 2019

iRODS User Group Meeting 2019

Utrecht, Netherlands

iRODS Capabilities

- Packaged and supported solutions
- Require configuration, not code
- Derived from the majority of use cases observed in the user community

Indexing

A policy framework that provides an asynchronous, scalable full-text and metadata indexing service driven by metadata assigned to collections

The indexing technology of choice is selected by delegating the policy implementation

Document Type identification is delegated to a policy invocation

Indexing Policy Components

Document Type
Indexing Policy Implementation
- irods_policy_indexing_object_index_<technology>
- irods_policy_indexing_object_purge_<technology>
- irods_policy_indexing_metadata_index_<technology>
- irods_policy_indexing_metadata_purge_<technology>

<technology> is directly derived from metadata and is used to delegate the policy invocation
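
As a rough illustration of the naming scheme above, the following Python sketch expands a technology string into the four policy names listed on this slide. The helper `policy_names` is hypothetical and only mirrors the naming convention; it is not part of the indexing plugin.

```python
# Hypothetical helper: it only reproduces the naming convention from the slide.
POLICY_TEMPLATES = (
    "irods_policy_indexing_object_index_{technology}",
    "irods_policy_indexing_object_purge_{technology}",
    "irods_policy_indexing_metadata_index_{technology}",
    "irods_policy_indexing_metadata_purge_{technology}",
)


def policy_names(technology):
    """Return the four policy names that would be invoked for a technology."""
    return [template.format(technology=technology) for template in POLICY_TEMPLATES]


if __name__ == "__main__":
    # The technology string comes from the units field of the indexing AVU.
    for name in policy_names("elasticsearch"):
        print(name)
```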

Core Competencies

Policy

Capabilities

Indexing Overview

Example Implementation

Getting Started

Installing the Indexing Plugins

As the ubuntu user, install the Indexing packages

sudo apt install -y irods-rule-engine-plugin-document-type irods-rule-engine-plugin-elasticsearch irods-rule-engine-plugin-indexing

Configuring Indexing Plugins

As the irods user, edit /etc/irods/server_config.json and add the indexing plugins to the rule_engines array

        "rule_engines": [
            ...
            {
                "instance_name": "irods_rule_engine_plugin-indexing-instance",
                "plugin_name": "irods_rule_engine_plugin-indexing",
                "plugin_specific_configuration": {
                }
            },
            {
                "instance_name": "irods_rule_engine_plugin-elasticsearch-instance",
                "plugin_name": "irods_rule_engine_plugin-elasticsearch",
                "plugin_specific_configuration": {
                    "hosts" : ["http://localhost:9200/"],
                    "bulk_count" : 100,
                    "read_size" : 4194304
                }
            },
            {
                "instance_name": "irods_rule_engine_plugin-document_type-instance",
                "plugin_name": "irods_rule_engine_plugin-document_type",
                "plugin_specific_configuration": {
                }
            },
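
A quick way to double-check this edit is to confirm that all three plugin names appear in the rule_engines array. The Python sketch below is an assumption-laden aid, not an iRODS tool: `missing_plugins` is a hypothetical helper, and it takes the rule_engines list directly because the slide does not show where that array sits inside the full file. The sample list deliberately omits the document_type entry.

```python
import json

# Plugin names taken from the configuration shown on the slide.
REQUIRED_PLUGINS = (
    "irods_rule_engine_plugin-indexing",
    "irods_rule_engine_plugin-elasticsearch",
    "irods_rule_engine_plugin-document_type",
)


def missing_plugins(rule_engines):
    """Return the required plugin names that are absent from a rule_engines list."""
    configured = {engine.get("plugin_name") for engine in rule_engines}
    return [name for name in REQUIRED_PLUGINS if name not in configured]


# Trimmed stand-in for the rule_engines array; document_type is left out on purpose.
SAMPLE_RULE_ENGINES = json.loads("""
[
  {"instance_name": "irods_rule_engine_plugin-indexing-instance",
   "plugin_name": "irods_rule_engine_plugin-indexing",
   "plugin_specific_configuration": {}},
  {"instance_name": "irods_rule_engine_plugin-elasticsearch-instance",
   "plugin_name": "irods_rule_engine_plugin-elasticsearch",
   "plugin_specific_configuration": {
       "hosts": ["http://localhost:9200/"],
       "bulk_count": 100,
       "read_size": 4194304}}
]
""")

if __name__ == "__main__":
    print(missing_plugins(SAMPLE_RULE_ENGINES))
```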

Standing up elasticsearch

As the ubuntu user, install and configure docker

sudo apt-get -y install docker.io

sudo usermod -aG docker $USER

Exit the shell and log back in to evaluate the new group

Run docker ps to ensure you can do so without sudo

As the ubuntu user, pull and start the elasticsearch container

docker pull docker.elastic.co/elasticsearch/elasticsearch:7.4.2
docker run -d -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.4.2

Wait for the container to start ...

curl http://localhost:9200

Create and initialize full_text index

curl -X PUT -H'Content-Type: application/json' http://localhost:9200/full_text_index

curl -X PUT -H'Content-Type: application/json' \
 'http://localhost:9200/full_text_index/_mapping/text?include_type_name=true' \
 -d '{ "properties" : { "absolutePath" : { "type" : "keyword" },
                        "data" : { "type" : "text" } } }'

Create and initialize metadata index

curl -X PUT -H'Content-Type: application/json' http://localhost:9200/metadata_index

curl -X PUT -H'Content-Type: application/json' \
 http://localhost:9200/metadata_index/_mapping \
 -d'{ "properties" : { "url": { "type": "text" }, "zoneName": { "type": "keyword" }, 
                   "absolutePath": { "type": "keyword" }, "fileName": { "type": "text" },
                   "parentPath": { "type": "text" }, "isFile": { "type": "boolean" },
                   "dataSize": { "type": "long" }, "mimeType": { "type": "keyword" },
                   "lastModifiedDate": { "type": "date", "format": "epoch_second" },
                   "metadataEntries": { "type": "nested", "properties": { 
                   "attribute": { "type": "keyword" }, "value": { "type": "text" },
                   "unit": { "type": "keyword" } } } } }'

Tagging collections for indexing

Collections are tagged with metadata to indicate they should be indexed

A new AVU applied to a populated collection will schedule all objects for indexing

New objects placed into a collection with one or more indexing AVUs applied will also be indexed

Objects that are modified or moved into a collection with one or more indexing AVUs applied will also be indexed

Indexing metadata takes the form:

A: irods::indexing::index
V: <index name>::<index type>
U: <technology>

- index name is specific to your index configuration
- index type is either full_text or metadata
- technology specifies which policy will be invoked to perform the indexing (currently elasticsearch)
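
The AVU layout above can be sketched in Python as follows. The helpers `make_indexing_avu` and `parse_indexing_value` are hypothetical names introduced for illustration; only the attribute string, the `::` separator and the two index types come from the slide.

```python
INDEX_ATTRIBUTE = "irods::indexing::index"
INDEX_TYPES = ("full_text", "metadata")


def make_indexing_avu(index_name, index_type, technology):
    """Build the (attribute, value, units) triple described on the slide."""
    if index_type not in INDEX_TYPES:
        raise ValueError("index type must be one of %s" % (INDEX_TYPES,))
    return (INDEX_ATTRIBUTE, "%s::%s" % (index_name, index_type), technology)


def parse_indexing_value(value):
    """Split '<index name>::<index type>' back into its two parts."""
    index_name, separator, index_type = value.rpartition("::")
    if not separator or not index_name or index_type not in INDEX_TYPES:
        raise ValueError("not an indexing value: %r" % value)
    return index_name, index_type


if __name__ == "__main__":
    avu = make_indexing_avu("full_text_index", "full_text", "elasticsearch")
    print(avu)
    print(parse_indexing_value(avu[1]))
```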

As the irods user:

Download some data

wget https://cdn.patricktriest.com/data/books.zip
unzip books.zip

Create a collection to be indexed

imkdir indexed_collection

Put a directory of files into the collection to be indexed

iput -r ./books indexed_collection/books0

Set the metadata on indexed_collection for full_text

imeta set -C indexed_collection irods::indexing::index full_text_index::full_text elasticsearch

A delayed execution job is scheduled which will then scan and schedule indexing jobs

id name
10222 {"collection-name":"/tempZone/home/rods/indexed_collection","index-name":"full_text_index","index-type":"full_text","indexer":"elastic","rule-engine-instance-name":"irods_rule_engine_plugin-indexing-instance","rule-engine-operation":"irods_policy_indexing_collection_index","user-name":"rods"}

Each indexing job will batch upload data for the full_text indexing type

id name
10232 {"attribute":"","index-name":"full_text_index","index-type":"full_text","indexer":"elastic","object-path":"/tempZone/home/rods/indexed_collection/books0/120.txt","rule-engine-instance-name":"irods_rule_engine_plugin-indexing-instance","rule-engine-operation":"irods_policy_indexing_object_index","source-resource":"EMPTY_RESOURCE_NAME","units":"","user-name":"rods","value":""}

...

Count the indexing jobs

~$ iqstat | wc -l
81
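
Since each job name in the queue listing is a JSON document, the listing can be summarized per operation rather than only counted. The Python sketch below assumes the two-column "id name" layout shown above; `count_operations` is a hypothetical helper, and the sample text is built in place rather than captured from a real server.

```python
import json
from collections import Counter


def count_operations(iqstat_output):
    """Tally delayed jobs by their rule-engine-operation field."""
    counts = Counter()
    for line in iqstat_output.splitlines():
        job_id, _, name = line.strip().partition(" ")
        if not job_id.isdigit():
            continue  # header, blank or elided lines
        try:
            job = json.loads(name)
        except json.JSONDecodeError:
            continue  # job name is not JSON, so it is not an indexing job
        if isinstance(job, dict):
            counts[job.get("rule-engine-operation", "unknown")] += 1
    return counts


def _sample_line(job_id, operation):
    # Builds a line in the same shape as the listing on the slide.
    return "%d %s" % (job_id, json.dumps({"rule-engine-operation": operation}))


SAMPLE = "\n".join([
    "id name",
    _sample_line(10222, "irods_policy_indexing_collection_index"),
    _sample_line(10232, "irods_policy_indexing_object_index"),
    _sample_line(10233, "irods_policy_indexing_object_index"),
])

if __name__ == "__main__":
    for operation, count in count_operations(SAMPLE).items():
        print(operation, count)
```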

Inspect the full_text_index in elasticsearch

Search the index for all object paths which contain "books0"

curl -X GET -H'Content-Type: application/json' 'http://localhost:9200/full_text_index/text/_search?pretty=true' -d '
{
  "from": 0, "size": 500,
  "_source": ["absolutePath"],
  "query": {
    "wildcard": {
      "absolutePath": {
        "value": "*/books0/*",
        "boost": 1.0, "rewrite": "constant_score"
      }
    }
  }
}'

Search the index for all contents which contain "the"

curl -X GET -H'Content-Type: application/json' 'http://localhost:9200/full_text_index/text/_search?pretty=true' -d '
{
  "from": 0, "size": 500,
  "_source": ["absolutePath"],
  "query": {
    "term": { "data": "the" }
  }
}'
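
Hand-written JSON inside a shell string is easy to leave unbalanced. One option, sketched here in Python, is to build the same two request bodies as dictionaries and let `json.dumps` produce the text passed to curl. The function names are hypothetical; the field names and values are the ones used in the searches above.

```python
import json


def wildcard_path_query(pattern, size=500):
    """Request body matching absolutePath against a wildcard pattern."""
    return {
        "from": 0,
        "size": size,
        "_source": ["absolutePath"],
        "query": {
            "wildcard": {
                "absolutePath": {
                    "value": pattern,
                    "boost": 1.0,
                    "rewrite": "constant_score",
                }
            }
        },
    }


def term_query(word, size=500):
    """Request body matching a single term in the data field."""
    return {
        "from": 0,
        "size": size,
        "_source": ["absolutePath"],
        "query": {"term": {"data": word}},
    }


if __name__ == "__main__":
    # The printed text can be used as the -d argument of the curl commands.
    print(json.dumps(wildcard_path_query("*/books0/*"), indent=2))
    print(json.dumps(term_query("the"), indent=2))
```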

Add some more data to the indexed_collection

Put another directory of data into the collection to be indexed

iput -r ./books indexed_collection/books1

iqstat | wc -l

An indexing event is asynchronously scheduled because of the existing metadata tag on indexed_collection

Configuring a collection for metadata indexing

Create a subcollection which indexes metadata

Any data put into this new collection will also be scheduled for full text indexing, because its parent holds that metadata tag

imkdir indexed_collection/metadata_indexing

imeta set -C indexed_collection/metadata_indexing irods::indexing::index metadata_index::metadata elasticsearch
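
The effect of tagging a subcollection whose parent is already tagged can be modelled as a walk up the collection hierarchy. The Python sketch below is a toy model under that assumption: the slides only state that data in the subcollection is indexed for both tags, not how the plugin finds them, and `applicable_indexes` and the in-memory `TAGS` table are hypothetical.

```python
import posixpath


def applicable_indexes(object_path, tags):
    """Collect indexing tags from the object's collection and every ancestor.

    tags maps a collection path to a list of
    (index_name, index_type, technology) tuples.
    """
    found = []
    collection = posixpath.dirname(object_path)
    while True:
        found.extend(tags.get(collection, []))
        parent = posixpath.dirname(collection)
        if parent == collection:  # reached the root
            break
        collection = parent
    return found


# Mirrors the two imeta set commands used in this walkthrough.
TAGS = {
    "/tempZone/home/rods/indexed_collection": [
        ("full_text_index", "full_text", "elasticsearch"),
    ],
    "/tempZone/home/rods/indexed_collection/metadata_indexing": [
        ("metadata_index", "metadata", "elasticsearch"),
    ],
}

if __name__ == "__main__":
    path = "/tempZone/home/rods/indexed_collection/metadata_indexing/books2/33.txt"
    for index in applicable_indexes(path, TAGS):
        print(index)
```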

$ iqstat
id     name
10426 {"collection-name":"/tempZone/home/rods/indexed_collection/metadata_indexing","index-name":"metadata_index","index-type":"metadata","indexer":"elasticsearch","rule-engine-instance-name":"irods_rule_engine_plugin-indexing-instance","rule-engine-operation":"irods_policy_indexing_collection_index","user-name":"rods"}

Add some data to the new metadata collection

iput -r books indexed_collection/metadata_indexing/books2

Add some metadata to an object in the collection

imeta add -d indexed_collection/metadata_indexing/books2/33.txt attr0 val0 units0

Check the delayed execution queue

~$ iqstat | wc -l
100

Query the newly indexed metadata

Search for a match to the attribute field

curl -X GET -H'Content-Type: application/json' 'http://localhost:9200/metadata_index/_search?pretty=true' -d '
{
  "from": 0, "size": 500,
  "_source": ["absolutePath", "metadataEntries"],
  "query": {
    "nested": {
      "path": "metadataEntries",
      "query": {
        "bool": {
          "must": [
            { "match": { "metadataEntries.attribute": "attr0" } }
          ]
        }
      }
    }
  }
}'

Add additional indexed metadata

Add some metadata to an object in the collection

imeta add -d indexed_collection/metadata_indexing/books2/844.txt attr1 val1 units1

imeta add -d indexed_collection/metadata_indexing/books2/844.txt attr2 val2 units2

imeta add -d indexed_collection/metadata_indexing/books2/844.txt attr3 val3 units3

Additional Queries

Wildcard query to match all the attribute fields

curl -X GET -H'Content-Type: application/json' 'http://localhost:9200/metadata_index/_search?pretty=true' -d '
{
  "from": 0, "size": 500,
  "_source": ["absolutePath", "metadataEntries"],
  "query": {
    "nested": {
      "path": "metadataEntries",
      "query": {
        "wildcard": {
          "metadataEntries.attribute": "attr*"
        }
      }
    }
  }
}'

Configuring Indexing Resources

An administrator may wish to restrict indexing activities to particular resources, for example when automatically ingesting data.

In order to indicate a resource is available for indexing it may be annotated with metadata:

imeta add -R <resource name> irods::indexing::index true

If no resource is tagged, it is assumed that all resources are available for indexing.

If the tag exists on any resource in the system, it is assumed that every resource available for indexing is tagged, so untagged resources are not used.
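
The two rules above amount to a small selection function. The Python sketch below models them with a plain dictionary; `indexing_resources` and the resource names are hypothetical, and the real plugin reads the tag from the catalog rather than from a dictionary.

```python
def indexing_resources(resources):
    """Return the resources usable for indexing.

    resources maps a resource name to True when it carries the
    irods::indexing::index tag and False otherwise.
    """
    tagged = [name for name, is_tagged in resources.items() if is_tagged]
    # No tag anywhere: every resource is available.
    # At least one tag: only tagged resources are available.
    return tagged if tagged else list(resources)


if __name__ == "__main__":
    print(indexing_resources({"demoResc": False, "ingestResc": False}))
    print(indexing_resources({"demoResc": True, "ingestResc": False}))
```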

Implementing the Document Type Policy

Edit /etc/irods/document_type.re

irods_policy_indexing_document_type_elasticsearch(
*object_path, *source_resource, *document_type) {
# do something terribly interesting with external services

writeLine("serverLog", "Document Type [*object_path]")
*document_type = "text"
}

The Document Type is used as part of the index, which is referenced in the url for the search. Currently we have only indexed data in the default 'text' document type.

Configure the new rule base

Edit /etc/irods/server_config.json

{
    "instance_name": "irods_rule_engine_plugin-irods_rule_language-instance",
    "plugin_name": "irods_rule_engine_plugin-irods_rule_language",
    "plugin_specific_configuration": {
        "re_data_variable_mapping_set": [
            "core"
        ],
        "re_function_name_mapping_set": [
            "core"
        ],
        "re_rulebase_set": [
            "document_type",
            "core"
        ],

Remove the Document Type Plugin

Edit /etc/irods/server_config.json and remove the document_type plugin entry (the last entry shown here) from rule_engines

"rule_engines": [
    {
        "instance_name": "irods_rule_engine_plugin-indexing-instance",
        "plugin_name": "irods_rule_engine_plugin-indexing",
        "plugin_specific_configuration": {
        }
    },
    ...
    {
        "instance_name": "irods_rule_engine_plugin-document_type-instance",
        "plugin_name": "irods_rule_engine_plugin-document_type",
        "plugin_specific_configuration": {
        }
    },
    ...

Test the Document Type Policy

Trigger a new indexing event

iput VERSION.json indexed_collection/file0

iqstat

Check for debug message

grep "Document Type" log/rodsLog*

<SNIP> inString = Document Type [/tempZone/home/rods/indexed_collection/file3]

Overriding the Indexing Policy

Policy Signatures - Implement these four policies to provide service to a new technology

irods_policy_indexing_object_index_<technology>(*object_path, *source_resource, *index_name, *index_type)

irods_policy_indexing_object_purge_<technology>(*object_path, *source_resource, *index_name, *index_type)

irods_policy_indexing_metadata_index_<technology>(*object_path, *attribute, *value, *unit, *index_name)

irods_policy_indexing_metadata_purge_<technology>(*object_path, *attribute, *value, *unit, *index_name)
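
To make the division of labour between the four policies concrete, here is a toy Python model of a technology that keeps its index in memory. It is a sketch only: real implementations are rule engine plugins or rules, the technology name "memory", the class `InMemoryIndexer` and the `read_object` callback are all hypothetical, and only the policy names and argument order come from the signatures above.

```python
class InMemoryIndexer:
    """Toy stand-in for an indexing technology; not an iRODS plugin."""

    technology = "memory"  # hypothetical technology name
    operations = ("object_index", "object_purge", "metadata_index", "metadata_purge")

    def __init__(self, read_object):
        # read_object(object_path, source_resource) -> str, supplied by the caller
        self.read_object = read_object
        self.documents = {}  # (index_name, object_path) -> text
        self.avus = {}       # (index_name, object_path) -> set of AVU triples

    def object_index(self, object_path, source_resource, index_name, index_type):
        text = self.read_object(object_path, source_resource)
        self.documents[(index_name, object_path)] = text

    def object_purge(self, object_path, source_resource, index_name, index_type):
        self.documents.pop((index_name, object_path), None)

    def metadata_index(self, object_path, attribute, value, unit, index_name):
        key = (index_name, object_path)
        self.avus.setdefault(key, set()).add((attribute, value, unit))

    def metadata_purge(self, object_path, attribute, value, unit, index_name):
        key = (index_name, object_path)
        self.avus.get(key, set()).discard((attribute, value, unit))

    def invoke(self, policy_name, *args):
        """Dispatch a policy name of the form shown above to a method."""
        prefix = "irods_policy_indexing_"
        suffix = "_" + self.technology
        if not (policy_name.startswith(prefix) and policy_name.endswith(suffix)):
            raise LookupError(policy_name)
        operation = policy_name[len(prefix):-len(suffix)]
        if operation not in self.operations:
            raise LookupError(policy_name)
        return getattr(self, operation)(*args)


if __name__ == "__main__":
    indexer = InMemoryIndexer(lambda path, resource: "contents of " + path)
    indexer.invoke("irods_policy_indexing_object_index_memory",
                   "/tempZone/home/rods/a.txt", "demoResc", "full_text_index", "full_text")
    print(indexer.documents)
```

A technology that only cares about metadata could leave the two object methods empty, which matches the remark later in the deck that an implementation may ignore an index type.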

Indexing Policy

The Indexing Policy provides a framework that reacts to metadata attributes. Once the policy for the chosen indexing technology is invoked, it may provide any implementation desired.

For instance, given a document type, a Solr implementation can implement geographic indexing rather than full text for the "full_text" type and ignore the "metadata" type.

An implementation for Jena would ignore the "full_text" type and only implement the metadata policies.

Questions?
