Indexing
Jason M. Coposky
@jason_coposky
Executive Director, iRODS Consortium
January 14-16 2020
CINES
Montpellier, France
iRODS Capabilities
- Packaged and supported solutions
- Require configuration, not code
- Derived from the majority of use cases observed in the user community
Indexing
A policy framework that provides an asynchronous, scalable full-text and metadata indexing service driven by collection-assigned metadata
- The indexing technology of choice is selected by delegating the policy implementation
- Document Type identification is delegated to a policy invocation
Indexing Policy Components
- Document Type
- Indexing Policy Implementation
- irods_policy_indexing_object_index_<technology>
- irods_policy_indexing_object_purge_<technology>
- irods_policy_indexing_metadata_index_<technology>
- irods_policy_indexing_metadata_purge_<technology>
- <technology> is directly derived from metadata and is used to delegate the policy invocation
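The naming scheme above can be illustrated with a minimal sketch. The helper `policy_name` and the `OPERATIONS` tuple are hypothetical names introduced here for illustration; they are not part of the indexing plugin, which performs this delegation internally.

```python
# Hypothetical helper: compose the name of the policy that the indexing
# framework would delegate to, given the technology found in the metadata.
OPERATIONS = ("object_index", "object_purge", "metadata_index", "metadata_purge")


def policy_name(operation: str, technology: str) -> str:
    """Return irods_policy_indexing_<operation>_<technology>."""
    if operation not in OPERATIONS:
        raise ValueError(f"unknown indexing operation: {operation}")
    return f"irods_policy_indexing_{operation}_{technology}"


if __name__ == "__main__":
    # Print the four policy names a new technology would need to provide.
    for op in OPERATIONS:
        print(policy_name(op, "elasticsearch"))
```

Under this assumption, supporting a new technology amounts to providing four policies whose names end in that technology's string.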
Core Competencies
Policy
Capabilities
Indexing Overview
Example Implementation
Getting Started
Installing the Indexing Plugins
As the ubuntu user
Download the Indexing packages
wget http://people.renci.org/~jasonc/irods/irods-externals-cpr1.3.0-0_1.0~xenial_amd64.deb
wget http://people.renci.org/~jasonc/irods/irods-externals-elasticlient0.1.0-0_1.0~xenial_amd64.deb
wget http://people.renci.org/~jasonc/irods/irods-rule-engine-plugin-document-type.deb
wget http://people.renci.org/~jasonc/irods/irods-rule-engine-plugin-elasticsearch.deb
wget http://people.renci.org/~jasonc/irods/irods-rule-engine-plugin-indexing.deb
Install the Indexing packages
sudo dpkg -i irods-externals-cpr1.3.0-0_1.0~xenial_amd64.deb \
    irods-externals-elasticlient0.1.0-0_1.0~xenial_amd64.deb \
    irods-rule-engine-plugin-document-type.deb \
    irods-rule-engine-plugin-elasticsearch.deb \
    irods-rule-engine-plugin-indexing.deb
Configuring Indexing Plugins
As the irods user
Edit /etc/irods/server_config.json
"rule_engines": [
    ...
    {
        "instance_name": "irods_rule_engine_plugin-indexing-instance",
        "plugin_name": "irods_rule_engine_plugin-indexing",
        "plugin_specific_configuration": {
        }
    },
    {
        "instance_name": "irods_rule_engine_plugin-elasticsearch-instance",
        "plugin_name": "irods_rule_engine_plugin-elasticsearch",
        "plugin_specific_configuration": {
            "hosts" : ["http://localhost:9200/"],
            "bulk_count" : 100,
            "read_size" : 4194304
        }
    },
    {
        "instance_name": "irods_rule_engine_plugin-document_type-instance",
        "plugin_name": "irods_rule_engine_plugin-document_type",
        "plugin_specific_configuration": {
        }
    },
Setting up Elasticsearch
We will leverage the existing Elastic service from the Audit plugin's ELK container
As the ubuntu user
Create both full_text and metadata indices
curl -X PUT -H'Content-Type: application/json' http://localhost:9200/full_text_index
curl -X PUT -H'Content-Type: application/json' http://localhost:9200/full_text_index/_mapping/text -d '{ "properties" : { "object_path" : { "type" : "text" }, "data" : { "type" : "text" } } }'
curl -X PUT -H'Content-Type: application/json' http://localhost:9200/metadata_index
curl -X PUT -H'Content-Type: application/json' http://localhost:9200/metadata_index/_mapping/text -d '{ "properties" : { "object_path" : { "type" : "text" }, "attribute" : { "type" : "text" }, "value" : { "type" : "text" }, "unit" : { "type" : "text" } } }'
Tagging collections for indexing
Collections are tagged with metadata to indicate they should be indexed
A new AVU applied to a populated collection will schedule all objects for indexing
New objects placed into a collection with one or more indexing AVUs applied will also be indexed
Objects that are modified or moved into a collection with one or more indexing AVUs applied will also be indexed
Tagging collections for indexing
Indexing metadata takes the form:
A: irods::indexing::index
V: <index name>::<index type>
U: <technology>
- index name is specific to your index configuration
- index type is either: full_text or metadata
- technology specifies which policy will be invoked to perform the indexing - currently elasticsearch
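As a minimal sketch of the AVU layout above, the following composes and parses the indexing metadata. The names `IndexingAVU`, `make_indexing_avu` and `parse_value` are hypothetical helpers for illustration only; the attribute string and the two index types come from the slide.

```python
from typing import NamedTuple, Tuple

ATTRIBUTE = "irods::indexing::index"
INDEX_TYPES = ("full_text", "metadata")


class IndexingAVU(NamedTuple):
    attribute: str
    value: str
    unit: str


def make_indexing_avu(index_name: str, index_type: str, technology: str) -> IndexingAVU:
    """Build the A/V/U triple used to tag a collection for indexing."""
    if index_type not in INDEX_TYPES:
        raise ValueError(f"index type must be one of {INDEX_TYPES}")
    return IndexingAVU(ATTRIBUTE, f"{index_name}::{index_type}", technology)


def parse_value(value: str) -> Tuple[str, str]:
    """Split '<index name>::<index type>' into its two parts."""
    index_name, sep, index_type = value.rpartition("::")
    if not sep or not index_name or index_type not in INDEX_TYPES:
        raise ValueError(f"malformed indexing value: {value!r}")
    return index_name, index_type


if __name__ == "__main__":
    print(make_indexing_avu("full_text_index", "full_text", "elasticsearch"))
```

The three fields of the resulting tuple correspond to the three arguments given to imeta after the collection name.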
Tagging collections for indexing
As the irods user:
Download some data
wget https://cdn.patricktriest.com/data/books.zip
unzip books.zip
Create a collection to be indexed
imkdir indexed_collection
Put a directory of files into the collection to be indexed
iput -r ./books indexed_collection/books0
Tagging collections for indexing
Set the metadata on indexed_collection for full_text
imeta set -C indexed_collection irods::indexing::index full_text_index::full_text elasticsearch
A delayed execution job is scheduled which will then scan and schedule indexing jobs
id name
10222 {"collection-name":"/tempZone/home/rods/indexed_collection","index-name":"full_text_index","index-type":"full_text","indexer":"elastic","rule-engine-instance-name":"irods_rule_engine_plugin-indexing-instance","rule-engine-operation":"irods_policy_indexing_collection_index","user-name":"rods"}
Tagging collections for indexing
Each indexing job will batch upload data for the full_text indexing type
id name
10232 {"attribute":"","index-name":"full_text_index","index-type":"full_text","indexer":"elastic","object-path":"/tempZone/home/rods/indexed_collection/books0/120.txt","rule-engine-instance-name":"irods_rule_engine_plugin-indexing-instance","rule-engine-operation":"irods_policy_indexing_object_index","source-resource":"EMPTY_RESOURCE_NAME","units":"","user-name":"rods","value":""}
...
Count the indexing jobs
~$ iqstat | wc -l
81
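The queued jobs can also be grouped by operation rather than only counted. This is a minimal sketch that assumes the iqstat output has the shape shown on these slides (a header line, then a numeric id followed by a JSON document); the sample text is abbreviated from the slide output, and `parse_iqstat` and `count_by_operation` are hypothetical helper names.

```python
import json
from collections import Counter
from typing import Dict, List, Tuple

# Abbreviated sample modelled on the job listings shown above.
SAMPLE = """id name
10222 {"index-name":"full_text_index","index-type":"full_text","rule-engine-operation":"irods_policy_indexing_collection_index"}
10232 {"index-name":"full_text_index","index-type":"full_text","object-path":"/tempZone/home/rods/indexed_collection/books0/120.txt","rule-engine-operation":"irods_policy_indexing_object_index"}
"""


def parse_iqstat(output: str) -> List[Tuple[int, Dict]]:
    """Return (job id, job description) pairs, skipping the header line."""
    jobs = []
    for line in output.splitlines():
        job_id, _, payload = line.strip().partition(" ")
        if not job_id.isdigit():
            continue  # header or blank line
        jobs.append((int(job_id), json.loads(payload)))
    return jobs


def count_by_operation(jobs: List[Tuple[int, Dict]]) -> Counter:
    """Count jobs per rule-engine-operation."""
    return Counter(job["rule-engine-operation"] for _, job in jobs)


if __name__ == "__main__":
    print(count_by_operation(parse_iqstat(SAMPLE)))
```

Because the listing starts with a header line, a plain line count such as wc -l includes that header in its total.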
Inspect the full_text_index in elasticsearch
Search the index for all object_paths which contain "books0"
curl -X GET -H'Content-Type: application/json' 'http://localhost:9200/full_text_index/text/_search?pretty=true' -d '
{
    "from": 0, "size" : 500,
    "_source" : ["object_path"],
    "query" : {
        "term" : { "object_path" : "books0"}
    }
}'
Inspect the full_text_index in elasticsearch
Search the index for all contents which contain "the"
curl -X GET -H'Content-Type: application/json' 'http://localhost:9200/full_text_index/text/_search?pretty=true' -d '
{
    "from": 0, "size" : 500,
    "_source" : ["object_path"],
    "query" : {
        "term" : { "data" : "the"}
    }
}'
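The two searches above differ only in the field and the term. A minimal sketch of a helper that produces the same request body follows; `term_search_body` is a hypothetical name, and the defaults simply mirror the values used in the curl commands.

```python
import json
from typing import Dict, Sequence


def term_search_body(field: str, term: str,
                     source_fields: Sequence[str] = ("object_path",),
                     size: int = 500) -> Dict:
    """Build the JSON body for a term query, as sent with curl -d above."""
    return {
        "from": 0,
        "size": size,
        "_source": list(source_fields),
        "query": {"term": {field: term}},
    }


if __name__ == "__main__":
    # Body for the object_path search, then for the content search.
    print(json.dumps(term_search_body("object_path", "books0"), indent=2))
    print(json.dumps(term_search_body("data", "the"), indent=2))
```

The printed JSON can be passed to curl with -d in place of the inline document.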
Add some more data to the indexed_collection
Put a collection of data into the collection to be indexed
iput -r ./books indexed_collection/books1
An indexing event is asynchronously scheduled because of the existing metadata tag on indexed_collection
iqstat | wc -l
Configuring a collection for metadata indexing
Create a subcollection which indexes metadata
imkdir indexed_collection/metadata_indexing
imeta set -C indexed_collection/metadata_indexing irods::indexing::index metadata_index::metadata elasticsearch
Any data put into this new collection will also be scheduled for full text indexing, as its parent holds that metadata tag
Add some data to the new metadata collection
iput -r books indexed_collection/metadata_indexing/books2
Add some metadata to an object in the collection
imeta add -d indexed_collection/metadata_indexing/books2/33.txt attr0 val0 units0
Check the delayed execution queue
~$ iqstat | wc -l
100
Query the newly indexed metadata
Search for a match to the attribute field
curl -X GET -H'Content-Type: application/json' 'http://localhost:9200/metadata_index/text/_search?pretty=true' -d '
{
    "from": 0, "size" : 500,
    "_source" : ["object_path", "attribute", "value", "units"],
    "query" : {
        "term" : {"attribute" : "attr0"}
    }
}'
Add additional indexed metadata
Add some metadata to an object in the collection
imeta add -d indexed_collection/metadata_indexing/books2/33.txt attr1 val1 units1
imeta add -d indexed_collection/metadata_indexing/books2/33.txt attr2 val2 units2
imeta add -d indexed_collection/metadata_indexing/books2/33.txt attr3 val3 units3
Additional Queries
Wildcard query to match all the attribute fields
curl -X GET -H'Content-Type: application/json' 'http://localhost:9200/metadata_index/text/_search?pretty=true' -d '
{
    "from": 0, "size" : 500,
    "_source" : ["object_path", "attribute", "value", "units"],
    "query" : {
        "wildcard": {
            "attribute": {
                "value": "attr*",
                "boost": 1.0,
                "rewrite": "constant_score"
            }
        }
    }
}'
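To see which of the attributes added earlier the pattern "attr*" would select, the matching can be approximated locally. This is only a sketch: Python's `fnmatchcase` stands in for the Elasticsearch wildcard query and is not identical to it (it also accepts bracket expressions, and Elasticsearch matches against indexed terms). The attribute "other" and the helper name `matching_attributes` are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def matching_attributes(attributes: Iterable[str], pattern: str) -> List[str]:
    """Return the attributes that match a shell-style wildcard pattern."""
    return [a for a in attributes if fnmatchcase(a, pattern)]


if __name__ == "__main__":
    # attr0..attr3 were added with imeta above; "other" is a made-up non-match.
    print(matching_attributes(["attr0", "attr1", "attr2", "attr3", "other"], "attr*"))
```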
Configuring Indexing Resources
An administrator may wish to restrict indexing activities to particular resources, for example when automatically ingesting data.
In order to indicate a resource is available for indexing it may be annotated with metadata:
imeta add -R <resource name> irods::indexing::index true
If no resource is tagged, it is assumed that all resources are available for indexing.
If the tag exists on any resource in the system, it is assumed that every resource available for indexing has been tagged, so untagged resources are not used.
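The two rules above can be sketched as a small selection function. This illustrates the stated behaviour only and is not the plugin's code; `indexing_resources` and the resource names in the example are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# The attribute/value pair used to tag a resource for indexing.
TAG = ("irods::indexing::index", "true")


def indexing_resources(resource_avus: Dict[str, Set[Tuple[str, str]]]) -> List[str]:
    """Return the resources considered available for indexing.

    If at least one resource carries the tag, only tagged resources are
    returned; if none does, every resource is returned.
    """
    tagged = sorted(name for name, avus in resource_avus.items() if TAG in avus)
    return tagged if tagged else sorted(resource_avus)


if __name__ == "__main__":
    # Hypothetical zone with two resources, one of them tagged.
    print(indexing_resources({"ingestResc": {TAG}, "archiveResc": set()}))
```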
Implementing the Document Type Policy
Edit /etc/irods/document_type.re
irods_policy_indexing_document_type_elastic(
    *object_path, *source_resource, *document_type) {
    # do something terribly interesting with external services
    writeLine("serverLog", "Document Type [*object_path]")
    *document_type = "text"
}
The Document Type is used as part of the index and is referenced in the URL for the search. So far we have only indexed data under the default 'text' document type.
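The role of the document type in the search URL can be sketched as follows. The helper `search_url` is hypothetical; the host, index names and the 'text' type are the ones used in the curl commands in this deck.

```python
def search_url(host: str, index_name: str, document_type: str = "text") -> str:
    """Compose <host>/<index name>/<document type>/_search."""
    return f"{host.rstrip('/')}/{index_name}/{document_type}/_search"


if __name__ == "__main__":
    print(search_url("http://localhost:9200/", "full_text_index"))
    print(search_url("http://localhost:9200/", "metadata_index"))
```

A policy that returned a different document type would therefore change the path segment that searches must use.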
Configure the new rule base
Edit /etc/irods/server_config.json
{
    "instance_name": "irods_rule_engine_plugin-irods_rule_language-instance",
    "plugin_name": "irods_rule_engine_plugin-irods_rule_language",
    "plugin_specific_configuration": {
        "re_data_variable_mapping_set": [
            "core"
        ],
        "re_function_name_mapping_set": [
            "core"
        ],
        "re_rulebase_set": [
            "document_type",
            "core"
        ],
Remove the Document Type Plugin
Edit /etc/irods/server_config.json and remove the document_type plugin instance from rule_engines
"rule_engines": [
    {
        "instance_name": "irods_rule_engine_plugin-indexing-instance",
        "plugin_name": "irods_rule_engine_plugin-indexing",
        "plugin_specific_configuration": {
        }
    },
    ...
    {
        "instance_name": "irods_rule_engine_plugin-document_type-instance",
        "plugin_name": "irods_rule_engine_plugin-document_type",
        "plugin_specific_configuration": {
        }
    },
    ...
Test the Document Type Policy
Trigger a new indexing event
iput VERSION.json indexed_collection/file0
iqstat
Check for debug message
grep "Document Type" log/rodsLog*
<SNIP> inString = Document Type [/tempZone/home/rods/indexed_collection/file3]
Overriding the Indexing Policy
Policy Signatures - Implement these four policies to provide service to a new technology
irods_policy_indexing_object_index_<technology>(
    *object_path, *source_resource, *index_name, *index_type)
irods_policy_indexing_object_purge_<technology>(
    *object_path, *source_resource, *index_name, *index_type)
irods_policy_indexing_metadata_index_<technology>(
    *object_path, *attribute, *value, *unit, *index_name)
irods_policy_indexing_metadata_purge_<technology>(
    *object_path, *attribute, *value, *unit, *index_name)
Indexing Policy
The Indexing Policy provides a framework that reacts to metadata attributes. Once the indexing technology policy is invoked, it may provide any implementation desired.
For instance, given a document type, a Solr implementation can implement geographic indexing rather than full text for the "full_text" type and ignore the "metadata" type.
An implementation for Jena would ignore the "full_text" type and only implement the metadata policies.
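The idea that each technology decides which index types it serves can be sketched as a dispatch table. Everything here is hypothetical and for illustration only: the technology keys, the handler functions and `dispatch` do not exist in the plugin, and the handlers return a description instead of contacting any service.

```python
from typing import Callable, Dict, Optional

Handler = Callable[[str, str], str]


def index_full_text(object_path: str, index_name: str) -> str:
    return f"full_text indexed {object_path} into {index_name}"


def index_metadata(object_path: str, index_name: str) -> str:
    return f"metadata indexed {object_path} into {index_name}"


# Hypothetical technologies: one serves both index types, one only metadata.
HANDLERS: Dict[str, Dict[str, Handler]] = {
    "full_service": {"full_text": index_full_text, "metadata": index_metadata},
    "metadata_only": {"metadata": index_metadata},
}


def dispatch(technology: str, index_type: str,
             object_path: str, index_name: str) -> Optional[str]:
    """Invoke the handler for a technology, or return None if it ignores the type."""
    handler = HANDLERS.get(technology, {}).get(index_type)
    if handler is None:
        return None  # this technology does not implement this index type
    return handler(object_path, index_name)


if __name__ == "__main__":
    print(dispatch("metadata_only", "full_text", "/tempZone/home/rods/a.txt", "idx"))
    print(dispatch("metadata_only", "metadata", "/tempZone/home/rods/a.txt", "idx"))
```

In this sketch the "metadata_only" entry plays the role described for Jena above: the full_text request is ignored and only the metadata request is served.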
Questions?
CINES 2020 - Indexing
CINES 2020 Training Module