Capabilities

Indexing and Publishing

Jason M. Coposky

@jason_coposky

Executive Director, iRODS Consortium


June 25-28, 2019

iRODS User Group Meeting

University of Utrecht, NL

iRODS Capabilities

  • Packaged and supported solutions
  • Require configuration not code
  • Derived from the majority of use cases observed in the user community

Policy Composition and Capabilities

For example - Storage Tiering

  • Data Access Time
  • Identifying Violating Objects
  • Data Replication
  • Data Verification
  • Data Retention

The storage tiering capability is implemented as a composite that delegates each requirement to a separate policy.

Policy Composition and Capabilities

Policies composed into a Capability framework delegate by naming convention:

  • irods_policy_access_time
  • irods_policy_data_movement
  • irods_policy_data_replication
  • irods_policy_data_verification

Each policy may be overridden by another rule engine or rule base to accommodate future use cases or technologies

Each policy may now be reused and combined into new Capabilities
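A minimal Python sketch of delegation by naming convention, for illustration only. The policy names come from the list above; the `PolicyTable` dictionary, the placeholder implementations, and the example path are assumptions standing in for a real rule base, not the iRODS implementation.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a rule base: policy name -> implementation.
PolicyTable = Dict[str, Callable[[str], str]]

# Policy names taken from the list above.
TIERING_POLICIES: List[str] = [
    "irods_policy_access_time",
    "irods_policy_data_movement",
    "irods_policy_data_replication",
    "irods_policy_data_verification",
]


def default_policies() -> PolicyTable:
    """Build one placeholder implementation per policy name."""
    return {name: (lambda path, n=name: f"{n}({path})") for name in TIERING_POLICIES}


def run_capability(policies: PolicyTable, object_path: str) -> List[str]:
    """Invoke each policy by name, in order; the capability holds no logic of its own."""
    return [policies[name](object_path) for name in TIERING_POLICIES]


if __name__ == "__main__":
    table = default_policies()
    # Override a single policy without touching the others.
    table["irods_policy_data_verification"] = lambda path: f"checksum({path})"
    for line in run_capability(table, "/tempZone/home/alice/file0"):
        print(line)
```

Because the capability only looks policies up by name, replacing one entry changes that step alone and leaves the rest of the composition untouched.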

Indexing

A policy framework that provides an asynchronous, scalable full text and metadata indexing service driven by collection metadata

  • Indexing technology of choice is reached by delegating policy implementation


  • Document Type identification is delegated to a policy invocation

Indexing Policy Components

  • Document Type

  • Indexing Policy Implementation

    • irods_policy_indexing_object_index_<technology>

    • irods_policy_indexing_object_purge_<technology>

    • irods_policy_indexing_metadata_index_<technology>

    • irods_policy_indexing_metadata_purge_<technology>

<technology> is directly derived from metadata and is used to delegate the policy invocation

Indexing Overview

[Diagram: Core Competencies, Policy, Capabilities]

Configuring Indexing Plugins

        "rule_engines": [

            ...

            {
                "instance_name": "irods_rule_engine_plugin-indexing-instance",
                "plugin_name": "irods_rule_engine_plugin-indexing",
                "plugin_specific_configuration": {
                }
            },
            {
                "instance_name": "irods_rule_engine_plugin-elasticsearch-instance",
                "plugin_name": "irods_rule_engine_plugin-elasticsearch",
                "plugin_specific_configuration": {
                    "hosts" : ["http://localhost:9200/"],
                    "bulk_count" : 100,
                    "read_size" : 4194304
                }
            },
            {
                "instance_name": "irods_rule_engine_plugin-document_type-instance",
                "plugin_name": "irods_rule_engine_plugin-document_type",
                "plugin_specific_configuration": {
                }
            },

Edit /etc/irods/server_config.json
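As a hedged sanity check, the Python sketch below reports which of the three indexing plugins are missing from a parsed configuration. The plugin names are those shown in the fragment above; the assumption that `rule_engines` sits under a `plugin_configuration` key, and the helper name `missing_plugins`, are mine and should be checked against your own server_config.json.

```python
import json
from typing import List

# Plugin names as they appear in the configuration fragment above.
REQUIRED_PLUGINS = [
    "irods_rule_engine_plugin-indexing",
    "irods_rule_engine_plugin-elasticsearch",
    "irods_rule_engine_plugin-document_type",
]


def missing_plugins(server_config: dict) -> List[str]:
    """Return required plugin names absent from the rule_engines array."""
    # Assumption: rule_engines lives under "plugin_configuration".
    engines = server_config.get("plugin_configuration", {}).get("rule_engines", [])
    configured = {entry.get("plugin_name") for entry in engines}
    return [name for name in REQUIRED_PLUGINS if name not in configured]


if __name__ == "__main__":
    # Inline sample instead of reading /etc/irods/server_config.json.
    sample = json.loads("""
    {"plugin_configuration": {"rule_engines": [
        {"instance_name": "irods_rule_engine_plugin-indexing-instance",
         "plugin_name": "irods_rule_engine_plugin-indexing",
         "plugin_specific_configuration": {}}
    ]}}
    """)
    print(missing_plugins(sample))
```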

Tagging collections for indexing

Collections are tagged with metadata to indicate they should be indexed

A new AVU applied to a populated collection will schedule all objects for indexing

New objects placed into a collection with one or more indexing AVUs applied will also be indexed

Tagging collections for indexing

Objects that are modified or moved into a collection with one or more indexing AVUs applied will also be indexed

Indexing policy is inherited from parent collections:

a parent collection's indexing metadata also applies to all of its sub-collections
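The inheritance rule can be sketched in Python as a walk from an object's collection up to the zone root, gathering every indexing AVU on the way. The in-memory `catalog` dictionary and the example paths are hypothetical; the real plugin queries the iRODS catalog, and how it orders or deduplicates results is not specified here.

```python
from typing import Dict, List, Tuple

AVU = Tuple[str, str, str]  # (attribute, value, unit)
INDEX_ATTRIBUTE = "irods::indexing::index"


def parent_collections(path: str) -> List[str]:
    """Return the collection holding `path` and each ancestor, nearest first."""
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def inherited_index_avus(path: str, catalog: Dict[str, List[AVU]]) -> List[AVU]:
    """Collect indexing AVUs from every ancestor collection of `path`."""
    found: List[AVU] = []
    for collection in parent_collections(path):
        for avu in catalog.get(collection, []):
            if avu[0] == INDEX_ATTRIBUTE and avu not in found:
                found.append(avu)
    return found


if __name__ == "__main__":
    # Hypothetical catalog: collection path -> AVUs attached to it.
    catalog = {
        "/tempZone/home/alice": [(INDEX_ATTRIBUTE, "text_idx::full_text", "elasticsearch")],
        "/tempZone/home/alice/docs": [(INDEX_ATTRIBUTE, "meta_idx::metadata", "elasticsearch")],
    }
    for avu in inherited_index_avus("/tempZone/home/alice/docs/report.txt", catalog):
        print(avu)
```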

Tagging collections for indexing

Indexing metadata takes the form:

A:  irods::indexing::index
V:  <index name>::<index type>
U:  <technology>

  • index name is specific to your index configuration
  • index type is either: full_text or metadata
  • technology specifies which policy will be invoked to perform the indexing - currently elasticsearch
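To make the AVU layout concrete, here is a small Python sketch that splits an indexing AVU into its three parts and rejects malformed values. The attribute name, the `::` separator, and the two index types come from the slide; the `IndexTag` class, the function name, and the example index name are illustrative assumptions.

```python
from dataclasses import dataclass

INDEX_ATTRIBUTE = "irods::indexing::index"
INDEX_TYPES = ("full_text", "metadata")


@dataclass(frozen=True)
class IndexTag:
    index_name: str
    index_type: str
    technology: str


def parse_index_avu(attribute: str, value: str, unit: str) -> IndexTag:
    """Split <index name>::<index type> and pair it with the technology unit."""
    if attribute != INDEX_ATTRIBUTE:
        raise ValueError(f"not an indexing attribute: {attribute}")
    # Split on the last '::' so the index type is always the final component.
    index_name, separator, index_type = value.rpartition("::")
    if not separator or not index_name or index_type not in INDEX_TYPES:
        raise ValueError(f"expected <index name>::<index type>, got: {value}")
    if not unit:
        raise ValueError("the unit must name the indexing technology")
    return IndexTag(index_name, index_type, unit)


if __name__ == "__main__":
    print(parse_index_avu(INDEX_ATTRIBUTE, "full_text_index::full_text", "elasticsearch"))
```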

Configuring Indexing Resources

An administrator may wish to restrict indexing activities to particular resources, for example when automatically ingesting data.

In order to indicate a resource is available for indexing it may be annotated with metadata:

imeta add -R <resource name> irods::indexing::index true

If no resource is tagged, it is assumed that all resources are available for indexing.

If the tag exists on any resource in the system, it is assumed that every resource available for indexing is tagged, so untagged resources are not used.
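The selection rule reduces to a few lines of Python. This is a sketch under the assumption that resource metadata is available as a plain dictionary; the resource names are made up, and only the tag attribute and its `true` value come from the `imeta` command above.

```python
from typing import Dict, List

RESOURCE_TAG = "irods::indexing::index"


def indexing_resources(resource_metadata: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the resources that may be used for indexing.

    If at least one resource carries the tag, only tagged resources qualify;
    if none does, every resource qualifies.
    """
    tagged = [
        name
        for name, avus in resource_metadata.items()
        if avus.get(RESOURCE_TAG) == "true"
    ]
    return sorted(tagged) if tagged else sorted(resource_metadata)


if __name__ == "__main__":
    # Hypothetical resources: only one of them is tagged.
    resources = {"demoResc": {}, "indexResc": {RESOURCE_TAG: "true"}}
    print(indexing_resources(resources))
```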

Overriding the Indexing Policy

Policy Signatures - Implement these four policies to provide service to a new technology

irods_policy_indexing_object_index_<technology>(
    *object_path, *source_resource, *index_name, *index_type)

irods_policy_indexing_object_purge_<technology>(
    *object_path, *source_resource, *index_name, *index_type)

irods_policy_indexing_metadata_index_<technology>(
    *object_path, *attribute, *value, *unit, *index_name)

irods_policy_indexing_metadata_purge_<technology>(
    *object_path, *attribute, *value, *unit, *index_name)

Indexing Policy

The Indexing Policy provides a framework that reacts to metadata attributes.  Once the indexing technology policy is invoked, it may provide any implementation desired.

For instance, given a document type, a Solr implementation can implement geographic indexing rather than full text for the "full_text" type and ignore the "metadata" type.

An implementation for Jena would ignore the "full_text" type and only implement the metadata policies.
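The freedom each technology has can be pictured as a lookup keyed on technology and index type, where a missing entry means the type is ignored. The handler functions and their return strings below are purely hypothetical; no Solr or Jena integration is implied to exist in this form.

```python
from typing import Callable, Dict, Optional, Tuple

Handler = Callable[[str, str], str]


def solr_full_text(object_path: str, index_name: str) -> str:
    # Hypothetical: Solr reinterprets "full_text" as geographic indexing.
    return f"solr geographic index of {object_path} into {index_name}"


def jena_metadata(object_path: str, index_name: str) -> str:
    # Hypothetical: Jena only implements the metadata policies.
    return f"jena metadata index of {object_path} into {index_name}"


# (technology, index type) -> handler; absent pairs are ignored.
HANDLERS: Dict[Tuple[str, str], Handler] = {
    ("solr", "full_text"): solr_full_text,
    ("jena", "metadata"): jena_metadata,
}


def dispatch(technology: str, index_type: str,
             object_path: str, index_name: str) -> Optional[str]:
    """Invoke the handler for this pair, or return None when the type is ignored."""
    handler = HANDLERS.get((technology, index_type))
    if handler is None:
        return None
    return handler(object_path, index_name)


if __name__ == "__main__":
    print(dispatch("solr", "full_text", "/tempZone/home/alice/map.tif", "geo_idx"))
    print(dispatch("jena", "full_text", "/tempZone/home/alice/map.tif", "geo_idx"))
```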

Questions?

Publishing

A policy framework that provides an asynchronous, scalable data publishing service driven by metadata

  • Publishing technology of choice is reached by delegating policy implementation
  • Persistent identifier generation is delegated to a policy invocation

Publishing Policy Components

  • Persistent Identifier
  • Publishing Policy Implementation
    • irods_policy_publishing_object_publish_<technology>
    • irods_policy_publishing_object_purge_<technology>
    • irods_policy_publishing_collection_publish_<technology>
    • irods_policy_publishing_collection_purge_<technology>

<technology> is directly derived from metadata and is used to delegate the policy invocation

Publishing Overview

[Diagram: Core Competencies, Policy, Capabilities]

Configuring Publishing Plugins

        "rule_engines": [

            ...

            {
                "instance_name": "irods_rule_engine_plugin-dataworld-instance",
                "plugin_name": "irods_rule_engine_plugin-dataworld",
                "plugin_specific_configuration": {
                }
            },
            {
                "instance_name": "irods_rule_engine_plugin-publishing-instance",
                "plugin_name": "irods_rule_engine_plugin-publishing",
                "plugin_specific_configuration": {
                }
            },

            {
                "instance_name": "irods_rule_engine_plugin-persistent_identifier-instance",
                "plugin_name": "irods_rule_engine_plugin-persistent_identifier",
                "plugin_specific_configuration": {
                }
            },

Edit /etc/irods/server_config.json

Tagging collections for publishing

Collections and Data Objects are tagged with metadata to indicate they should be published

A new AVU applied to a populated collection will schedule all objects for publication

New objects cannot be placed into a collection that has a publishing AVU applied, nor can the objects in that collection be modified with POSIX operations.

Tagging for publication

Publishing metadata takes the form:

A:  irods::publishing::publish
V:  <service>

The service name is applied directly to the policy name template, which dictates which policies are invoked.
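A short Python sketch of how a service name could expand into the four publishing policy names listed under Publishing Policy Components. The template string and function name are assumptions; `dataworld` is the service value used elsewhere in this deck.

```python
from typing import List

# Assumed template, modelled on the policy names listed in this deck.
TEMPLATE = "irods_policy_publishing_{scope}_{action}_{service}"


def publishing_policy_names(service: str) -> List[str]:
    """Expand a service name into its object and collection policy names."""
    return [
        TEMPLATE.format(scope=scope, action=action, service=service)
        for scope in ("object", "collection")
        for action in ("publish", "purge")
    ]


if __name__ == "__main__":
    for name in publishing_policy_names("dataworld"):
        print(name)
```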

Immutability of Published Content

Users cannot modify or delete published content

irm -f published_file0

remote addresses: 127.0.0.1 ERROR: rmUtil: rm error for /tempZone/home/irodsconsortium/published_file0, status = -35000 status = -35000 SYS_INVALID_OPR_TYPE
Level 0: object is published and now immutable [/tempZone/home/irodsconsortium/file3]

Users cannot remove publication metadata

imeta rm -d file3 irods::publishing::publish dataworld

remote addresses: 127.0.0.1 ERROR: Level 0: publishing metadata tags are immutable [/tempZone/home/irodsconsortium/file3]
remote addresses: 127.0.0.1 ERROR: rcModAVUMetadata failed with error -35000 SYS_INVALID_OPR_TYPE
Level 0: publishing metadata tags are immutable [/tempZone/home/irodsconsortium/file3]
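The two refusals on this slide can be modelled as simple guards, sketched here in Python. The error code and message texts mirror the log output; the `PolicyError` class, the function names, and the way AVUs are passed in are illustrative assumptions, not the plugin's code.

```python
from typing import Iterable, Tuple

SYS_INVALID_OPR_TYPE = -35000  # error code shown in the log output
PUBLISH_ATTRIBUTE = "irods::publishing::publish"


class PolicyError(Exception):
    """Hypothetical error type carrying an iRODS-style status code."""

    def __init__(self, code: int, message: str) -> None:
        super().__init__(message)
        self.code = code


def check_remove_object(path: str, avus: Iterable[Tuple[str, str]]) -> None:
    """Refuse to remove an object that carries publishing metadata."""
    if any(attribute == PUBLISH_ATTRIBUTE for attribute, _ in avus):
        raise PolicyError(SYS_INVALID_OPR_TYPE,
                          f"object is published and now immutable [{path}]")


def check_remove_avu(path: str, attribute: str) -> None:
    """Refuse to remove the publishing metadata itself."""
    if attribute == PUBLISH_ATTRIBUTE:
        raise PolicyError(SYS_INVALID_OPR_TYPE,
                          f"publishing metadata tags are immutable [{path}]")


if __name__ == "__main__":
    try:
        check_remove_avu("/tempZone/home/irodsconsortium/file3", PUBLISH_ATTRIBUTE)
    except PolicyError as error:
        print(error.code, error)
```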

Overriding the Persistent Identifier Policy

The data.world publication policy delegates the generation of persistent identifiers.

By default, the identifier is a base64-encoded UUID

irods_policy_publishing_persistent_identifier(
    *object_path, *service_name, *pid) {
    # Log the request, then return a fixed placeholder identifier
    writeLine("serverLog", "Persistent Identifier - [*object_path]");
    *pid = "ABC123";
}

Edit /etc/irods/persistent_identifier.re
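For comparison with the override, a default of the kind described ("a base64 encoded UUID") might look like the Python sketch below. It assumes a random version 4 UUID whose 16 raw bytes are encoded with the standard base64 alphabet; the plugin may well choose a different UUID version or encode the textual form instead.

```python
import base64
import uuid


def default_persistent_identifier() -> str:
    """Return a base64 encoding of a freshly generated UUID."""
    raw = uuid.uuid4().bytes  # 16 random bytes (assumed UUID version)
    return base64.b64encode(raw).decode("ascii")


if __name__ == "__main__":
    print(default_persistent_identifier())
```

Sixteen bytes always encode to 24 base64 characters, so identifiers produced this way have a fixed length.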

Overriding the Publishing Policy

Policy Signatures - Implement these four policies to provide integration to a new publishing service

irods_policy_publishing_object_publish_<service>(
    *object_path, *user_name, *service_name)

irods_policy_publishing_object_purge_<service>(
    *object_path, *user_name, *service_name)

irods_policy_publishing_collection_publish_<service>(
    *collection_name, *user_name, *service_name)

irods_policy_publishing_collection_purge_<service>(
    *collection_name, *user_name, *service_name)

Publishing Policy

The Publishing Policy provides a framework that reacts to metadata attributes.  Once the publishing service policy is invoked, it may provide any implementation desired.

For instance, some services may simply need a URI to the data set whereas others may require the data be uploaded, such as data.world.

The publishing service may require a specific submission package format, additional metadata or other requirements which would require the publishing job to wait until these needs are met.

Future Work - New services to support

Indexing

  • Solr - geographic indexing
  • Semantic indexing technologies
  • Tika data typing

Publishing

  • Dataverse
  • Life science catalogs
  • Handle
  • DOI
  • Minid

This should be a community discussion

Questions?
