Capabilities
Indexing and Publishing
Jason M. Coposky
@jason_coposky
Executive Director, iRODS Consortium
June 25-28, 2019
iRODS User Group Meeting
University of Utrecht, NL
iRODS Capabilities
- Packaged and supported solutions
- Require configuration, not code
- Derived from the majority of use cases observed in the user community
Policy Composition and Capabilities
For example - Storage Tiering
- Data Access Time
- Identifying Violating Objects
- Data Replication
- Data Verification
- Data Retention
The storage tiering capability is implemented as a composite that delegates each requirement to a separate policy.
Policies composed into a Capability framework delegate by naming convention:
- irods_policy_access_time
- irods_policy_data_movement
- irods_policy_data_replication
- irods_policy_data_verification
Each policy may be overridden by another rule engine or rule base to accommodate future use cases or technologies
Each policy may now be reused and combined into new Capabilities
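The delegation-by-name idea can be sketched outside iRODS. The following is a minimal Python illustration, not the iRODS implementation: `PolicyRegistry`, `run_capability`, and the object path are hypothetical, and the real storage tiering capability does not necessarily invoke its policies in this simple sequence. Only the four policy names come from the list above.

```python
from typing import Callable, Dict, List

# A policy takes an object path and returns a short description of what it did.
PolicyFn = Callable[[str], str]


class PolicyRegistry:
    """Hypothetical stand-in for the rule engine's name-based policy lookup."""

    def __init__(self) -> None:
        self._policies: Dict[str, PolicyFn] = {}

    def register(self, name: str, fn: PolicyFn) -> None:
        # Registering an existing name replaces it, which models an override
        # supplied by another rule engine or rule base.
        self._policies[name] = fn

    def invoke(self, name: str, object_path: str) -> str:
        if name not in self._policies:
            raise KeyError(f"no policy registered under {name!r}")
        return self._policies[name](object_path)


POLICY_NAMES = [
    "irods_policy_access_time",
    "irods_policy_data_movement",
    "irods_policy_data_replication",
    "irods_policy_data_verification",
]


def run_capability(registry: PolicyRegistry, object_path: str) -> List[str]:
    """Composite capability: delegate each step to a policy found by name."""
    return [registry.invoke(name, object_path) for name in POLICY_NAMES]


def make_default(name: str) -> PolicyFn:
    # Default implementations simply report that they ran.
    return lambda object_path: f"{name}: default for {object_path}"


registry = PolicyRegistry()
for policy_name in POLICY_NAMES:
    registry.register(policy_name, make_default(policy_name))

# Override one policy without touching the composite or the other policies.
registry.register(
    "irods_policy_data_verification",
    lambda object_path: f"irods_policy_data_verification: checksum for {object_path}",
)

results = run_capability(registry, "/tempZone/home/alice/file0")
```

Because the composite only knows policy names, replacing one implementation leaves the capability and the remaining policies unchanged.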
Indexing
A policy framework that provides an asynchronous, scalable full text and metadata indexing service driven by collection metadata
- The indexing technology of choice is invoked by delegating the policy implementation
- Document Type identification is delegated to a policy invocation
Indexing Policy Components
- Document Type
- Indexing Policy Implementation
- irods_policy_indexing_object_index_<technology>
- irods_policy_indexing_object_purge_<technology>
- irods_policy_indexing_metadata_index_<technology>
- irods_policy_indexing_metadata_purge_<technology>
<technology> is directly derived from metadata and is used to delegate the policy invocation
Indexing Overview
[Diagram: Core Competencies, Policy, Capabilities]
Configuring Indexing Plugins
Edit /etc/irods/server_config.json
"rule_engines": [
    ...
    {
        "instance_name": "irods_rule_engine_plugin-indexing-instance",
        "plugin_name": "irods_rule_engine_plugin-indexing",
        "plugin_specific_configuration": {
        }
    },
    {
        "instance_name": "irods_rule_engine_plugin-elasticsearch-instance",
        "plugin_name": "irods_rule_engine_plugin-elasticsearch",
        "plugin_specific_configuration": {
            "hosts" : ["http://localhost:9100/"],
            "bulk_count" : 100,
            "read_size" : 4194304
        }
    },
    {
        "instance_name": "irods_rule_engine_plugin-document_type-instance",
        "plugin_name": "irods_rule_engine_plugin-document_type",
        "plugin_specific_configuration": {
        }
    },
    ...
]
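A quick way to confirm that the three indexing plugins are present after editing the file is to load it and compare plugin names. The sketch below is an illustration only: `missing_indexing_plugins` and `check_file` are hypothetical helpers, and it assumes the `rule_engines` array sits under `plugin_configuration`, as in a stock iRODS 4.2 `server_config.json`.

```python
import json
from typing import Set

# Plugin names taken from the configuration fragment above.
REQUIRED_PLUGINS = {
    "irods_rule_engine_plugin-indexing",
    "irods_rule_engine_plugin-elasticsearch",
    "irods_rule_engine_plugin-document_type",
}


def missing_indexing_plugins(server_config: dict) -> Set[str]:
    """Return the required plugin names that are not configured."""
    # Assumption: "rule_engines" is nested under "plugin_configuration".
    engines = server_config.get("plugin_configuration", {}).get("rule_engines", [])
    configured = {engine.get("plugin_name") for engine in engines}
    return REQUIRED_PLUGINS - configured


def check_file(path: str = "/etc/irods/server_config.json") -> Set[str]:
    """Convenience wrapper for a real server; not called in this example."""
    with open(path) as handle:
        return missing_indexing_plugins(json.load(handle))


# In-memory sample that deliberately omits the document_type plugin.
sample = json.loads("""
{
  "plugin_configuration": {
    "rule_engines": [
      {"instance_name": "irods_rule_engine_plugin-indexing-instance",
       "plugin_name": "irods_rule_engine_plugin-indexing",
       "plugin_specific_configuration": {}},
      {"instance_name": "irods_rule_engine_plugin-elasticsearch-instance",
       "plugin_name": "irods_rule_engine_plugin-elasticsearch",
       "plugin_specific_configuration": {}}
    ]
  }
}
""")
missing = missing_indexing_plugins(sample)
```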
Tagging collections for indexing
Collections are tagged with metadata to indicate they should be indexed
A new AVU applied to a populated collection will schedule all objects for indexing
New objects placed into a collection with one or more indexing AVUs applied will also be indexed
Objects that are modified or moved into a collection with one or more indexing AVUs applied will also be indexed
Indexing policy is inherited from parent collections: a parent collection's indexing metadata also applies to all of its sub-collections
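Inheritance from parent collections can be pictured as a walk up the logical path. This is a minimal sketch under stated assumptions: the `catalog` dictionary stands in for the iRODS catalog, `effective_indexing_avus` is a hypothetical helper, and the collection paths and index names are made up for illustration.

```python
from posixpath import dirname
from typing import Dict, List, Tuple

AVU = Tuple[str, str, str]  # (attribute, value, unit)
INDEXING_ATTRIBUTE = "irods::indexing::index"


def effective_indexing_avus(collection: str, catalog: Dict[str, List[AVU]]) -> List[AVU]:
    """Collect indexing AVUs from a collection and every ancestor, nearest first."""
    found: List[AVU] = []
    current = collection
    while True:
        found.extend(
            avu for avu in catalog.get(current, []) if avu[0] == INDEXING_ATTRIBUTE
        )
        parent = dirname(current)
        if parent == current:  # reached the root, nothing further to inherit
            return found
        current = parent


# Hypothetical catalog contents: collection path -> AVUs on that collection.
catalog = {
    "/tempZone/home/alice": [
        (INDEXING_ATTRIBUTE, "full_text_index::full_text", "elasticsearch"),
    ],
    "/tempZone/home/alice/project": [
        (INDEXING_ATTRIBUTE, "metadata_index::metadata", "elasticsearch"),
        ("owner", "alice", ""),
    ],
}

inherited = effective_indexing_avus("/tempZone/home/alice/project/raw", catalog)
```

Here the untagged sub-collection `raw` picks up both the metadata index from `project` and the full text index from `alice`.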
Indexing metadata takes the form:
A: irods::indexing::index
V: <index name>::<index type>
U: <technology>
- index name is specific to your index configuration
- index type is either: full_text or metadata
- technology specifies which policy will be invoked to perform the indexing - currently elasticsearch
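The AVU form above carries everything needed to pick a policy. The sketch below parses such an AVU and derives a policy name from it. `IndexingTag`, `parse_indexing_avu`, and `index_policy_name` are hypothetical names, and the mapping of `full_text` to the object policies and `metadata` to the metadata policies is an assumption based on the policy names in this deck, not a statement about the plugin's internals.

```python
from typing import NamedTuple

INDEXING_ATTRIBUTE = "irods::indexing::index"
VALID_INDEX_TYPES = {"full_text", "metadata"}


class IndexingTag(NamedTuple):
    index_name: str
    index_type: str
    technology: str


def parse_indexing_avu(attribute: str, value: str, unit: str) -> IndexingTag:
    """Split V on the last '::' into index name and type; U is the technology."""
    if attribute != INDEXING_ATTRIBUTE:
        raise ValueError(f"not an indexing AVU: {attribute!r}")
    index_name, separator, index_type = value.rpartition("::")
    if not separator or not index_name:
        raise ValueError(f"expected <index name>::<index type>, got {value!r}")
    if index_type not in VALID_INDEX_TYPES:
        raise ValueError(f"unknown index type: {index_type!r}")
    if not unit:
        raise ValueError("the unit must name the indexing technology")
    return IndexingTag(index_name, index_type, unit)


def index_policy_name(tag: IndexingTag) -> str:
    # Assumption: full_text maps to the object policy, metadata to the metadata policy.
    target = "object" if tag.index_type == "full_text" else "metadata"
    return f"irods_policy_indexing_{target}_index_{tag.technology}"


# Hypothetical index names; "elasticsearch" is the technology named above.
full_text_tag = parse_indexing_avu(
    INDEXING_ATTRIBUTE, "full_text_index::full_text", "elasticsearch"
)
metadata_tag = parse_indexing_avu(
    INDEXING_ATTRIBUTE, "metadata_index::metadata", "elasticsearch"
)
```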
Configuring Indexing Resources
An administrator may wish to restrict indexing activities to particular resources, for example when automatically ingesting data.
In order to indicate a resource is available for indexing it may be annotated with metadata:
imeta add -R <resource name> irods::indexing::index true
If no resource is tagged, all resources are assumed to be available for indexing.
If the tag exists on any resource in the system, only the tagged resources are considered available for indexing.
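The two rules for resource selection reduce to a small decision. A minimal sketch, assuming resource metadata has already been fetched into a dictionary; `resources_available_for_indexing` and the resource names are hypothetical.

```python
from typing import Dict, List

RESOURCE_TAG = "irods::indexing::index"


def resources_available_for_indexing(resource_metadata: Dict[str, Dict[str, str]]) -> List[str]:
    """Apply the rule: no tags means every resource, otherwise only tagged ones."""
    tagged = [
        name
        for name, metadata in resource_metadata.items()
        if metadata.get(RESOURCE_TAG) == "true"
    ]
    # Fall back to every known resource when nothing carries the tag.
    return sorted(tagged) if tagged else sorted(resource_metadata)


# Hypothetical resources: none tagged, then one tagged.
untagged = {"demoResc": {}, "archiveResc": {}}
one_tagged = {"demoResc": {RESOURCE_TAG: "true"}, "archiveResc": {}}

all_resources = resources_available_for_indexing(untagged)
only_tagged = resources_available_for_indexing(one_tagged)
```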
Overriding the Indexing Policy
Policy Signatures - Implement these four policies to provide service to a new technology
irods_policy_indexing_object_index_<technology>(
    *object_path, *source_resource, *index_name, *index_type)
irods_policy_indexing_object_purge_<technology>(
    *object_path, *source_resource, *index_name, *index_type)
irods_policy_indexing_metadata_index_<technology>(
    *object_path, *attribute, *value, *unit, *index_name)
irods_policy_indexing_metadata_purge_<technology>(
    *object_path, *attribute, *value, *unit, *index_name)
Indexing Policy
The Indexing Policy provides a framework that reacts to metadata attributes. Once the policy for an indexing technology is invoked, it may provide any implementation desired.
For instance, given a document type, a Solr implementation can implement geographic indexing rather than full text for the "full_text" type and ignore the "metadata" type.
An implementation for Jena would ignore the "full_text" type and only implement the metadata policies.
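The freedom left to each technology can be illustrated with a backend that accepts the object policies but does nothing with them. This is a Python sketch only: real policies would be written for an iRODS rule engine, the `jena` technology string, the `POLICIES` table, and the `invoke` helper are hypothetical, and the argument order follows the policy signatures shown earlier in the deck.

```python
from typing import Callable, Dict, List

indexed: List[str] = []  # records what the hypothetical backend stored


def object_index_jena(object_path, source_resource, index_name, index_type) -> None:
    # This backend has no full text support, so the request is ignored.
    return None


def object_purge_jena(object_path, source_resource, index_name, index_type) -> None:
    return None


def metadata_index_jena(object_path, attribute, value, unit, index_name) -> None:
    indexed.append(f"{index_name}: {object_path} {attribute}={value}")


def metadata_purge_jena(object_path, attribute, value, unit, index_name) -> None:
    entry = f"{index_name}: {object_path} {attribute}={value}"
    if entry in indexed:
        indexed.remove(entry)


POLICIES: Dict[str, Callable[..., None]] = {
    "irods_policy_indexing_object_index_jena": object_index_jena,
    "irods_policy_indexing_object_purge_jena": object_purge_jena,
    "irods_policy_indexing_metadata_index_jena": metadata_index_jena,
    "irods_policy_indexing_metadata_purge_jena": metadata_purge_jena,
}


def invoke(event: str, technology: str, *args: str) -> bool:
    """Look up irods_policy_indexing_<event>_<technology>; report whether it exists."""
    policy = POLICIES.get(f"irods_policy_indexing_{event}_{technology}")
    if policy is None:
        return False
    policy(*args)
    return True


path = "/tempZone/home/alice/file0"
handled_object = invoke("object_index", "jena", path, "demoResc", "triples", "full_text")
handled_metadata = invoke("metadata_index", "jena", path, "author", "alice", "", "triples")
handled_unknown = invoke("metadata_index", "solr", path, "author", "alice", "", "triples")
```

Both Jena calls are dispatched, but only the metadata call leaves a record; the unknown technology is simply not found.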
Questions?
Publishing
A policy framework that provides an asynchronous, scalable data publishing service driven by metadata
- The publishing technology of choice is invoked by delegating the policy implementation
- Persistent identifier generation is delegated to a policy invocation
Publishing Policy Components
- Persistent Identifier
- Publishing Policy Implementation
- irods_policy_publishing_object_publish_<technology>
- irods_policy_publishing_object_purge_<technology>
- irods_policy_publishing_collection_publish_<technology>
- irods_policy_publishing_collection_purge_<technology>
<technology> is directly derived from metadata and is used to delegate the policy invocation
Publishing Overview
[Diagram: Core Competencies, Policy, Capabilities]
Configuring Publishing Plugins
Edit /etc/irods/server_config.json
"rule_engines": [
    ...
    {
        "instance_name": "irods_rule_engine_plugin-dataworld-instance",
        "plugin_name": "irods_rule_engine_plugin-dataworld",
        "plugin_specific_configuration": {
        }
    },
    {
        "instance_name": "irods_rule_engine_plugin-publishing-instance",
        "plugin_name": "irods_rule_engine_plugin-publishing",
        "plugin_specific_configuration": {
        }
    },
    {
        "instance_name": "irods_rule_engine_plugin-persistent_identifier-instance",
        "plugin_name": "irods_rule_engine_plugin-persistent_identifier",
        "plugin_specific_configuration": {
        }
    },
    ...
]
Tagging collections for publishing
Collections and Data Objects are tagged with metadata to indicate they should be published
A new AVU applied to a populated collection will schedule all objects for publication
New objects cannot be placed into a collection that has a publishing AVU applied, nor can the objects in it be modified with POSIX operations.
Tagging for publication
Publishing metadata takes the form:
A: irods::publishing::publish
V: <service>
The service name is applied directly to the policy name template, which determines which policies are invoked.
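Applying the service name to the policy name template is a plain string substitution. A minimal sketch: `publishing_policy_names` is a hypothetical helper, the template follows the four publishing policy names listed earlier in the deck, and the character check on the service name is an added assumption, not a documented iRODS rule.

```python
from typing import Dict

PUBLISHING_ATTRIBUTE = "irods::publishing::publish"


def publishing_policy_names(service: str) -> Dict[str, str]:
    """Expand irods_policy_publishing_<target>_<action>_<service> for one service."""
    # Assumption: service names are restricted to letters, digits, and underscores.
    if not service or not service.replace("_", "").isalnum():
        raise ValueError(f"unusable service name: {service!r}")
    return {
        f"{target}_{action}": f"irods_policy_publishing_{target}_{action}_{service}"
        for target in ("object", "collection")
        for action in ("publish", "purge")
    }


# "dataworld" is the service value used elsewhere in this deck.
names = publishing_policy_names("dataworld")
```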
Immutability of Published Content
Users cannot modify or delete published content
    irm -f published_file0
    remote addresses: 127.0.0.1 ERROR: rmUtil: rm error for /tempZone/home/irodsconsortium/published_file0, status = -35000 status = -35000 SYS_INVALID_OPR_TYPE
    Level 0: object is published and now immutable [/tempZone/home/irodsconsortium/file3]
Users cannot remove publication metadata
    imeta rm -d file3 irods::publishing::publish dataworld
    remote addresses: 127.0.0.1 ERROR: Level 0: publishing metadata tags are immutable [/tempZone/home/irodsconsortium/file3]
    remote addresses: 127.0.0.1 ERROR: rcModAVUMetadata failed with error -35000 SYS_INVALID_OPR_TYPE
    Level 0: publishing metadata tags are immutable [/tempZone/home/irodsconsortium/file3]
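The immutability rules amount to two guards that run before a removal is allowed. The sketch below models them in Python. It is an illustration of the behaviour in the output above, not the plugin's code: `ImmutableContentError`, the `check_*` functions, and `attempt` are hypothetical, while the error code and messages are copied from that output.

```python
from typing import Callable, Iterable, Tuple

SYS_INVALID_OPR_TYPE = -35000  # error code reported in the output above
PUBLISHING_ATTRIBUTE = "irods::publishing::publish"


class ImmutableContentError(Exception):
    def __init__(self, message: str) -> None:
        super().__init__(message)
        self.code = SYS_INVALID_OPR_TYPE


def is_published(avus: Iterable[Tuple[str, str]]) -> bool:
    return any(attribute == PUBLISHING_ATTRIBUTE for attribute, _value in avus)


def check_remove_object(object_path: str, avus: Iterable[Tuple[str, str]]) -> None:
    # Guard 1: published objects cannot be deleted.
    if is_published(avus):
        raise ImmutableContentError(
            f"object is published and now immutable [{object_path}]"
        )


def check_remove_metadata(object_path: str, attribute: str) -> None:
    # Guard 2: the publishing tag itself cannot be removed.
    if attribute == PUBLISHING_ATTRIBUTE:
        raise ImmutableContentError(
            f"publishing metadata tags are immutable [{object_path}]"
        )


def attempt(check: Callable[..., None], *args) -> str:
    """Run a guard and report either 'allowed' or the error code and message."""
    try:
        check(*args)
        return "allowed"
    except ImmutableContentError as error:
        return f"{error.code} {error}"


file3 = "/tempZone/home/irodsconsortium/file3"
outcome_rm = attempt(check_remove_object, file3, [(PUBLISHING_ATTRIBUTE, "dataworld")])
outcome_tag = attempt(check_remove_metadata, file3, PUBLISHING_ATTRIBUTE)
outcome_other = attempt(check_remove_metadata, file3, "owner")
```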
Overriding the Persistent Identifier Policy
The data.world publication policy delegates the generation of persistent identifiers.
By default it is a base64 encoded UUID
Edit /etc/irods/persistent_identifier.re
irods_policy_publishing_persistent_identifier(
    *object_path, *service_name, *pid) {
    # Example override: log the request and return a fixed identifier
    writeLine("serverLog", "Persistent Identifier - [*object_path]");
    *pid = "ABC123";
}
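For comparison with the fixed override, an identifier in the style of the default can be produced with the Python standard library. This is a sketch under an assumption: the deck only says the default is a base64 encoded UUID, so encoding the 16 raw UUID bytes (rather than the 36-character string form) is a guess, and `default_style_pid` is a hypothetical name.

```python
import base64
import uuid


def default_style_pid() -> str:
    """Return a random UUID's 16 raw bytes encoded as base64 text."""
    # Assumption: the raw bytes are encoded, which yields 24 characters
    # ending in "==" padding.
    return base64.b64encode(uuid.uuid4().bytes).decode("ascii")


pid = default_style_pid()
```

Decoding the identifier with `base64.b64decode` and passing the result to `uuid.UUID(bytes=...)` recovers the original UUID.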
Overriding the Publishing Policy
Policy Signatures - Implement these four policies to provide integration to a new publishing service
irods_policy_publishing_object_publish_<service>(
    *object_path, *user_name, *service_name)
irods_policy_publishing_object_purge_<service>(
    *object_path, *user_name, *service_name)
irods_policy_publishing_collection_publish_<service>(
    *collection_name, *user_name, *service_name)
irods_policy_publishing_collection_purge_<service>(
    *collection_name, *user_name, *service_name)
Publishing Policy
The Publishing Policy provides a framework that reacts to metadata attributes. Once the policy for a publishing service is invoked, it may provide any implementation desired.
For instance, some services may simply need a URI to the data set whereas others may require the data be uploaded, such as data.world.
The publishing service may require a specific submission package format, additional metadata or other requirements which would require the publishing job to wait until these needs are met.
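A publishing job that waits for its requirements can be sketched as a readiness check followed by either a publish or a deferral. The attribute names (`title`, `license`), the helper names, and the returned status strings are all hypothetical; a real service policy would define its own requirements and reschedule the job through iRODS.

```python
from typing import Dict, List, Set


def missing_requirements(metadata: Dict[str, str], required: Set[str]) -> List[str]:
    """List required attributes that are absent or empty."""
    return sorted(attribute for attribute in required if not metadata.get(attribute))


def publish_or_defer(object_path: str, metadata: Dict[str, str], required: Set[str]) -> str:
    missing = missing_requirements(metadata, required)
    if missing:
        # The job is not failed; it waits until the missing metadata appears.
        return f"deferred: waiting for {', '.join(missing)}"
    return f"published: {object_path}"


# Hypothetical requirements for an imagined publishing service.
required = {"title", "license"}
path = "/tempZone/home/alice/dataset"

first_try = publish_or_defer(path, {"title": "Survey 2019"}, required)
second_try = publish_or_defer(path, {"title": "Survey 2019", "license": "CC-BY-4.0"}, required)
```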
Future Work - New services to support
Indexing
- Solr - geographic indexing
- Semantic indexing technologies
- Tika data typing
Publishing
- Dataverse
- Life science catalogs
- Handle
- DOI
- Minid
This should be a community discussion
Questions?