Advanced Training:
Indexing
May 28-31, 2024
iRODS User Group Meeting 2024
Amsterdam, Netherlands
Alan King, Senior Software Developer
Martin Flores, Software Developer
iRODS Consortium
iRODS Capabilities
Indexing
A policy framework that provides an asynchronous, scalable full-text and metadata indexing service driven by metadata assigned to collections.
The indexing technology of choice is reached by delegating the policy implementation to a technology-specific plugin.
Indexing Policy Components
Indexing Policy Implementation
irods_policy_indexing_object_index_<technology>
irods_policy_indexing_object_purge_<technology>
irods_policy_indexing_metadata_index_<technology>
irods_policy_indexing_metadata_purge_<technology>
<technology> is directly derived from metadata and is used to delegate the policy invocation.
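As a rough illustration of the naming scheme above, the following Python sketch derives a policy name from a target, an action, and a technology string. The function `policy_name` and its validation are assumptions made for this example only; nothing here is taken from the plugin's source.

```python
# Illustrative only: mirrors the naming scheme on the slide, not plugin code.
VALID_TARGETS = ("object", "metadata")
VALID_ACTIONS = ("index", "purge")


def policy_name(target: str, action: str, technology: str) -> str:
    """Build irods_policy_indexing_<target>_<action>_<technology>."""
    if target not in VALID_TARGETS:
        raise ValueError(f"unknown target: {target!r}")
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    if not technology:
        raise ValueError("technology must be a non-empty string")
    return f"irods_policy_indexing_{target}_{action}_{technology}"


if __name__ == "__main__":
    # The technology string is read from metadata, so it is passed in here.
    print(policy_name("object", "index", "elasticsearch"))
```

The point of the sketch is only that the framework never needs to know the technology in advance: the suffix is data, and the matching policy is looked up by name.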
Core Competencies
Policy
Capabilities
Indexing Overview
Example Implementation
Getting Started
Installing the Indexing Plugin packages
As the ubuntu user, install the indexing rule engine plugin packages.
sudo apt-get install -y \
    irods-rule-engine-plugin-indexing \
    irods-rule-engine-plugin-elasticsearch
Configuring the Indexing Plugins
As the irods user, edit /etc/irods/server_config.json.
"rule_engines": [
    {
        "instance_name": "irods_rule_engine_plugin-indexing-instance",
        "plugin_name": "irods_rule_engine_plugin-indexing",
        "plugin_specific_configuration": {
        }
    },
    {
        "instance_name": "irods_rule_engine_plugin-elasticsearch-instance",
        "plugin_name": "irods_rule_engine_plugin-elasticsearch",
        "plugin_specific_configuration": {
            "hosts": ["http://localhost:9200"],
            "bulk_count": 100,
            "read_size": 4194304
        }
    },
The elasticsearch plugin supports HTTPS and Basic authentication.
Standing up elasticsearch
As the ubuntu user, install and configure docker (if not already done).
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install -y docker-ce
sudo usermod -aG docker $USER
Exit the shell and log back in so the new group membership takes effect.
Run docker ps to ensure you can do so without sudo.
$ docker ps
CONTAINER ID   IMAGE   COMMAND   CREATED   STATUS   PORTS   NAMES
Launching elasticsearch
As the ubuntu user, launch the Docker container.
docker run \
    -d \
    --name elasticsearch \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    -e "xpack.security.http.ssl.enabled=false" \
    elasticsearch:8.12.1
Wait for the container to start (~30 seconds) ...
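Rather than waiting a fixed 30 seconds, the endpoint can be polled until it answers. The sketch below is one way to do that from Python using only the standard library; the helper names `elasticsearch_is_up` and `wait_until`, the 60 second timeout, and the 2 second interval are assumptions for this example.

```python
import time
import urllib.error
import urllib.request


def elasticsearch_is_up(url: str = "http://localhost:9200") -> bool:
    """Return True once the HTTP endpoint answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or reset while the container is still starting.
        return False


def wait_until(probe, timeout=60.0, interval=2.0,
               sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call probe() until it returns True or `timeout` seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    ready = wait_until(elasticsearch_is_up)
    print("elasticsearch ready" if ready else "timed out waiting")
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.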
Create the full_text index
As the ubuntu user, run the following.
curl -X PUT -H 'Content-Type: application/json' http://localhost:9200/full_text_index -d '{
    "mappings": {
        "properties": {
            "absolutePath": {"type": "keyword"},
            "data": {"type": "text"}
        }
    }
}'
Create the metadata index
As the ubuntu user, run the following.
curl -X PUT -H 'Content-Type: application/json' http://localhost:9200/metadata_index -d '{
    "mappings": {
        "properties": {
            "url": {"type": "text"},
            "zoneName": {"type": "keyword"},
            "absolutePath": {"type": "keyword"},
            "fileName": {"type": "text"},
            "parentPath": {"type": "text"},
            "isFile": {"type": "boolean"},
            "dataSize": {"type": "long"},
            "mimeType": {"type": "keyword"},
            "lastModifiedDate": {"type": "date", "format": "epoch_second"},
            "metadataEntries": {
                "type": "nested",
                "properties": {
                    "attribute": {"type": "keyword"},
                    "value": {"type": "text"},
                    "unit": {"type": "keyword"}
                }
            }
        }
    }
}'
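To make the mapping more concrete, the sketch below checks a candidate document against a simplified view of the metadata_index field types. This is not how the plugin builds its documents. The sample values (size, timestamp) are invented, and the rule that unknown fields are flagged is a choice made for this example rather than Elasticsearch behaviour.

```python
# Simplified view of the metadata_index mapping: field name -> Elasticsearch type.
MAPPING = {
    "url": "text", "zoneName": "keyword", "absolutePath": "keyword",
    "fileName": "text", "parentPath": "text", "isFile": "boolean",
    "dataSize": "long", "mimeType": "keyword", "lastModifiedDate": "date",
    "metadataEntries": "nested",
}

# Python types accepted for each Elasticsearch type in this sketch.
PYTHON_TYPES = {
    "text": str, "keyword": str, "boolean": bool,
    "long": int, "date": int,  # epoch_second dates are plain integers here
    "nested": list,
}


def mismatched_fields(document: dict) -> list:
    """Return the names of fields that are unknown or have the wrong type."""
    bad = []
    for name, value in document.items():
        es_type = MAPPING.get(name)
        if es_type is None:
            bad.append(name)  # sketch-only rule: unknown fields are flagged
            continue
        expected = PYTHON_TYPES[es_type]
        # bool is a subclass of int in Python, so reject it for long/date.
        if isinstance(value, bool) and expected is not bool:
            bad.append(name)
        elif not isinstance(value, expected):
            bad.append(name)
    return bad


# Hypothetical document; size and timestamp are invented for illustration.
SAMPLE = {
    "zoneName": "tempZone",
    "absolutePath": "/tempZone/home/rods/indexed_collection/file0",
    "fileName": "file0",
    "isFile": True,
    "dataSize": 1024,
    "lastModifiedDate": 1716854400,
    "metadataEntries": [{"attribute": "attr0", "value": "val0", "unit": "units0"}],
}

if __name__ == "__main__":
    print(mismatched_fields(SAMPLE))
```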
Tagging collections for indexing
Indexing metadata takes the form:
A: irods::indexing::index
V: <index_name>::<index_type>
U: <technology>
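The AVU layout above can be read as a small record. The following sketch parses one such AVU into its three parts; the `IndexTag` type and `parse_index_tag` function are hypothetical names introduced for this example.

```python
from typing import NamedTuple

INDEXING_ATTRIBUTE = "irods::indexing::index"


class IndexTag(NamedTuple):
    index_name: str
    index_type: str
    technology: str


def parse_index_tag(attribute: str, value: str, unit: str) -> IndexTag:
    """Split an indexing AVU into index name, index type, and technology."""
    if attribute != INDEXING_ATTRIBUTE:
        raise ValueError(f"not an indexing attribute: {attribute!r}")
    index_name, separator, index_type = value.partition("::")
    if not separator or not index_name or not index_type:
        raise ValueError(f"expected <index_name>::<index_type>, got {value!r}")
    if not unit:
        raise ValueError("the unit must name the indexing technology")
    return IndexTag(index_name, index_type, unit)


if __name__ == "__main__":
    print(parse_index_tag(INDEXING_ATTRIBUTE,
                          "full_text_index::full_text",
                          "elasticsearch"))
```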
As the irods user ...
Download some data.
wget https://cdn.patricktriest.com/data/books.zip
unzip books.zip
Create a collection to be indexed.
imkdir indexed_collection
Put a directory of files into the collection to be indexed.
iput -r ./books indexed_collection/books0
Set the metadata on indexed_collection for full_text.
imeta set -C indexed_collection \
    irods::indexing::index full_text_index::full_text elasticsearch
$ iqstat
id     name
10222  {"collection-name":"/tempZone/home/rods/indexed_collection","index-name":"full_text_index","index-type":"full_text","indexer":"elastic","rule-engine-instance-name":"irods_rule_engine_plugin-indexing-instance","rule-engine-operation":"irods_policy_indexing_collection_index","user-name":"rods"}
A delayed execution job is scheduled which will eventually scan the collection and schedule the individual indexing jobs.
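Because each queued job is described by a JSON document, the queue can be inspected programmatically. The sketch below assumes each entry is printed as an id followed by a JSON object, as in the output above; the exact iqstat layout may differ between versions, and `parse_iqstat_line` is a hypothetical helper.

```python
import json


def parse_iqstat_line(line: str):
    """Split '<id> <json>' into an integer id and the decoded job description."""
    rule_id, payload = line.strip().split(None, 1)
    return int(rule_id), json.loads(payload)


# Entry copied from the iqstat output shown on the slide.
SAMPLE_LINE = (
    '10222 {"collection-name":"/tempZone/home/rods/indexed_collection",'
    '"index-name":"full_text_index","index-type":"full_text",'
    '"indexer":"elastic",'
    '"rule-engine-instance-name":"irods_rule_engine_plugin-indexing-instance",'
    '"rule-engine-operation":"irods_policy_indexing_collection_index",'
    '"user-name":"rods"}'
)

if __name__ == "__main__":
    job_id, job = parse_iqstat_line(SAMPLE_LINE)
    print(job_id, job["rule-engine-operation"], job["collection-name"])
```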
id     name
10232  {"attribute":"","index-name":"full_text_index","index-type":"full_text","indexer":"elastic","object-path":"/tempZone/home/rods/indexed_collection/books0/120.txt","rule-engine-instance-name":"irods_rule_engine_plugin-indexing-instance","rule-engine-operation":"irods_policy_indexing_object_index","source-resource":"EMPTY_RESOURCE_NAME","units":"","user-name":"rods","value":""}
...
Each indexing job will batch upload data for the full_text indexing type.
Count the indexing jobs.
$ iqstat | wc -l
103
Inspect the full_text_index in elasticsearch
Search the index for all object paths which contain books0.
curl -X GET -H 'Content-Type: application/json' 'http://localhost:9200/full_text_index/_search?pretty=true' -d '{
    "from": 0,
    "size": 500,
    "_source": ["absolutePath"],
    "query": {
        "wildcard": {
            "absolutePath": {
                "value": "*/books0/*",
                "boost": 1.0,
                "rewrite": "constant_score"
            }
        }
    }
}'
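The same search can be prepared from Python. The sketch below builds the wildcard query body and an HTTP request object with the standard library. The helper names are assumptions for this example, and it uses POST rather than GET because Elasticsearch accepts either method for _search and POST is the conventional way to send a body.

```python
import json
import urllib.request


def wildcard_path_query(fragment: str, size: int = 500) -> dict:
    """Query body matching absolutePath values that contain /<fragment>/."""
    return {
        "from": 0,
        "size": size,
        "_source": ["absolutePath"],
        "query": {
            "wildcard": {
                "absolutePath": {
                    "value": f"*/{fragment}/*",
                    "boost": 1.0,
                    "rewrite": "constant_score",
                }
            }
        },
    }


def search_request(index: str, body: dict,
                   host: str = "http://localhost:9200") -> urllib.request.Request:
    """Prepare (but do not send) a _search request for the given index."""
    return urllib.request.Request(
        f"{host}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = search_request("full_text_index", wildcard_path_query("books0"))
    # Sending requires the Elasticsearch container from the earlier step.
    with urllib.request.urlopen(request) as response:
        hits = json.load(response)["hits"]["hits"]
    print(len(hits), "matching documents")
```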
Search the index for all objects whose contents contain the term "the".
curl -X GET -H 'Content-Type: application/json' 'http://localhost:9200/full_text_index/_search?pretty=true' -d '{
    "from": 0,
    "size": 500,
    "_source": ["absolutePath"],
    "query": {
        "term": {"data": "the"}
    }
}'
Add some more data to the indexed_collection
Put a collection of data into the collection to be indexed.
iput -r ./books indexed_collection/books1
iqstat | wc -l
An indexing event was asynchronously scheduled given the existing metadata tag on indexed_collection.
Configuring a collection for metadata indexing
Create a sub-collection which indexes metadata.
imkdir indexed_collection/metadata_indexing
imeta set -C indexed_collection/metadata_indexing \
    irods::indexing::index metadata_index::metadata elasticsearch
Any data put to this new collection will also be scheduled for full text indexing as its parent holds that metadata tag.
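The inheritance described here, where an object is subject to the tags on its own collection and on every parent, can be pictured as a walk up the path. The sketch below is a model of that idea only, using an in-memory dictionary in place of the catalog; `applicable_tags` is a hypothetical helper and says nothing about how the plugin actually performs the lookup.

```python
import posixpath


def applicable_tags(object_path: str, tags_by_collection: dict) -> list:
    """Collect tags from the object's collection and all of its ancestors."""
    found = []
    collection = posixpath.dirname(object_path)
    while True:
        found.extend(tags_by_collection.get(collection, []))
        parent = posixpath.dirname(collection)
        if parent == collection:  # reached the root
            break
        collection = parent
    return found


# Stand-in for the collection metadata set in this walkthrough.
TAGS = {
    "/tempZone/home/rods/indexed_collection":
        ["full_text_index::full_text"],
    "/tempZone/home/rods/indexed_collection/metadata_indexing":
        ["metadata_index::metadata"],
}

if __name__ == "__main__":
    path = "/tempZone/home/rods/indexed_collection/metadata_indexing/books2/33.txt"
    print(applicable_tags(path, TAGS))
```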
Add some data to the new metadata collection
iput -r books indexed_collection/metadata_indexing/books2
Add some metadata to an object in the collection.
imeta add -d indexed_collection/metadata_indexing/books2/33.txt \
    attr0 val0 units0
Check the delayed execution queue.
$ iqstat | wc -l
100
Query the newly indexed metadata
Search for a match to the attribute field.
curl -X GET -H 'Content-Type: application/json' 'http://localhost:9200/metadata_index/_search?pretty=true' -d '{
    "from": 0,
    "size": 500,
    "_source": ["absolutePath", "metadataEntries"],
    "query": {
        "nested": {
            "path": "metadataEntries",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"metadataEntries.attribute": "attr0"}}
                    ]
                }
            }
        }
    }
}'
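For scripted use, the nested query and the handling of its response can be sketched as follows. The function names are assumptions, and the response shown is a hand-written, trimmed stand-in with the usual hits.hits[]._source shape, not captured output.

```python
def nested_attribute_query(attribute: str, size: int = 500) -> dict:
    """Query body matching documents with a metadata entry for `attribute`."""
    return {
        "from": 0,
        "size": size,
        "_source": ["absolutePath", "metadataEntries"],
        "query": {
            "nested": {
                "path": "metadataEntries",
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"metadataEntries.attribute": attribute}}
                        ]
                    }
                },
            }
        },
    }


def paths_from_response(response: dict) -> list:
    """Extract absolutePath from every hit in a search response."""
    hits = response.get("hits", {}).get("hits", [])
    return [hit["_source"]["absolutePath"] for hit in hits]


# Hypothetical, trimmed response used only to exercise the helper.
EXAMPLE_RESPONSE = {
    "hits": {
        "hits": [
            {"_source": {
                "absolutePath": "/tempZone/home/rods/indexed_collection/"
                                "metadata_indexing/books2/33.txt",
                "metadataEntries": [
                    {"attribute": "attr0", "value": "val0", "unit": "units0"}
                ],
            }}
        ]
    }
}

if __name__ == "__main__":
    print(nested_attribute_query("attr0")["query"]["nested"]["path"])
    print(paths_from_response(EXAMPLE_RESPONSE))
```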
Add additional indexed metadata
Add some metadata to an object in the collection.
imeta add -d indexed_collection/metadata_indexing/books2/844.txt attr1 val1 units1
imeta add -d indexed_collection/metadata_indexing/books2/844.txt attr2 val2 units2
imeta add -d indexed_collection/metadata_indexing/books2/844.txt attr3 val3 units3
Additional Queries
Wildcard query to match every attribute beginning with attr.
curl -X GET -H 'Content-Type: application/json' 'http://localhost:9200/metadata_index/_search?pretty=true' -d '{
    "_source": ["absolutePath", "metadataEntries"],
    "query": {
        "nested": {
            "path": "metadataEntries",
            "query": {
                "wildcard": {
                    "metadataEntries.attribute": "attr*"
                }
            }
        }
    }
}'
Configuring Indexing Resources
An administrator may wish to restrict indexing activities to particular resources, for example when automatically ingesting data.
In order to indicate a resource is available for indexing it may be annotated with metadata.
imeta add -R <resource_name> irods::indexing::index true
If no resource is tagged it is assumed that all resources are available for indexing.
If the tag exists on any resource in the system, only the tagged resources are considered available for indexing.
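The two rules above amount to a simple selection: no tags means every resource is eligible, otherwise only tagged ones are. A minimal sketch of that logic follows. The function name is an assumption, and the resource name ingestResc is hypothetical.

```python
def resources_available_for_indexing(all_resources, tagged_resources):
    """Apply the tagging rule: untagged system -> all, otherwise tagged only."""
    tagged = set(tagged_resources)
    selected = [name for name in all_resources if name in tagged]
    # If nothing is tagged, every resource is assumed to be available.
    return selected if selected else list(all_resources)


if __name__ == "__main__":
    resources = ["demoResc", "ingestResc"]  # ingestResc is a made-up name
    print(resources_available_for_indexing(resources, []))
    print(resources_available_for_indexing(resources, ["demoResc"]))
```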
Implementing the Document Type Policy
Edit /etc/irods/document_type.re
irods_policy_indexing_document_type_elastic(
    *object_path, *source_resource, *document_type) {
    # do something terribly interesting with external services
    writeLine("serverLog", "Document Type [*object_path]")
    *document_type = "text"
}
The Document Type is used as part of the index, which is referenced in the URL for the search. So far, data has only been indexed under the default 'text' document type.
Configure the new rule base
Edit /etc/irods/server_config.json
{
    "instance_name": "irods_rule_engine_plugin-irods_rule_language-instance",
    "plugin_name": "irods_rule_engine_plugin-irods_rule_language",
    "plugin_specific_configuration": {
        "re_data_variable_mapping_set": [
            "core"
        ],
        "re_function_name_mapping_set": [
            "core"
        ],
        "re_rulebase_set": [
            "document_type",
            "core"
        ],
Remove the Document Type Plugin
Edit /etc/irods/server_config.json and remove the irods_rule_engine_plugin-document_type-instance entry.
"rule_engines": [
    {
        "instance_name": "irods_rule_engine_plugin-indexing-instance",
        "plugin_name": "irods_rule_engine_plugin-indexing",
        "plugin_specific_configuration": {
        }
    },
    ...
    {
        "instance_name": "irods_rule_engine_plugin-document_type-instance",
        "plugin_name": "irods_rule_engine_plugin-document_type",
        "plugin_specific_configuration": {
        }
    },
    ...
Test the Document Type Policy
Trigger a new indexing event
iput version.json indexed_collection/file0
iqstat
Check for debug message
$ grep "Document Type" /var/log/irods/irods.log | jq '.log_message'
"writeLine: inString = Document Type [/tempZone/home/rods/indexed_collection/file0]\n"
Overriding the Indexing Policy
Policy Signatures - Implement these four policies to provide service to a new technology.
irods_policy_indexing_object_index_<technology>(
    *object_path, *source_resource, *index_name, *index_type)
irods_policy_indexing_object_purge_<technology>(
    *object_path, *source_resource, *index_name, *index_type)
irods_policy_indexing_metadata_index_<technology>(
    *object_path, *attribute, *value, *unit, *index_name)
irods_policy_indexing_metadata_purge_<technology>(
    *object_path, *attribute, *value, *unit, *index_name)
Indexing Policy
The Indexing Policy provides a framework that reacts to metadata attributes. Once the technology-specific indexing policy is invoked, it may provide any implementation desired.
For instance, a Solr implementation can implement geographic indexing rather than full text for the full_text type and ignore the metadata type.
An implementation for Jena would ignore the full_text type and only implement the metadata policies.
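The idea that a technology may implement only some of the four policies can be modelled in a few lines. The sketch below is a Python analogy, not iRODS rule code: the class names, the `invoke` helper, and the technology key "jena_like" are all hypothetical, and the argument order simply follows the policy signatures.

```python
class IndexingTechnology:
    """Default behaviour: ignore every request and return None."""

    def object_index(self, object_path, source_resource, index_name, index_type):
        return None

    def object_purge(self, object_path, source_resource, index_name, index_type):
        return None

    def metadata_index(self, object_path, attribute, value, unit, index_name):
        return None

    def metadata_purge(self, object_path, attribute, value, unit, index_name):
        return None


class MetadataOnlyTechnology(IndexingTechnology):
    """Implements only the metadata policies, like the Jena example."""

    def __init__(self):
        self.entries = []

    def metadata_index(self, object_path, attribute, value, unit, index_name):
        self.entries.append((object_path, attribute, value, unit))
        return "indexed"

    def metadata_purge(self, object_path, attribute, value, unit, index_name):
        self.entries.remove((object_path, attribute, value, unit))
        return "purged"


def invoke(registry, technology, operation, *args):
    """Delegate an operation to whichever handler is registered for a technology."""
    return getattr(registry[technology], operation)(*args)


if __name__ == "__main__":
    registry = {"jena_like": MetadataOnlyTechnology()}
    path = "/tempZone/home/rods/indexed_collection/file0"
    print(invoke(registry, "jena_like", "object_index",
                 path, "demoResc", "full_text_index", "full_text"))
    print(invoke(registry, "jena_like", "metadata_index",
                 path, "attr0", "val0", "units0", "metadata_index"))
```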
Questions?