Tracking provenance with the PROV standard

DATA71011 Understanding Data and their Environment

 

Stian Soiland-Reyes

Intended Learning Outcomes

  1. Understanding why and when to use provenance standards
  2. Ability to distinguish PROV concepts
  3. Knowledge of considerations for modelling choices
  4. Ability to write in a machine-readable metadata language
  5. Skill of modelling processes in a formal language (in lab, assessment)

Motivation

Why use a standard for machine-readable provenance?

Why use a Standard?

Standards enhance data interoperability, transparency, and reproducibility across various domains.

 

Existing tooling and guidance can be used directly.

 

(Meta)data can be moved between systems or combined

 

Try to use existing standards!

Using provenance standards

Examples from industry and academia

<?xml version="1.0" encoding="UTF-8"?>
<order:orderMessage
    xmlns:order="urn:gs1:ecom:order:xsd:3"
    xmlns:sh="http://www.unece.org/cefact/namespaces/StandardBusinessDocumentHeader"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:gs1:ecom:order:xsd:3 ../Schemas/gs1/ecom/Order.xsd">
    <sh:StandardBusinessDocumentHeader>
        <sh:HeaderVersion>1.0</sh:HeaderVersion>
        <sh:Sender>
            <!-- Retailer Information : Ex. SuperStore -->
            <sh:Identifier Authority="GS1"/>
        </sh:Sender>
        <sh:Receiver>
            <!-- Shipper information -->
            <sh:Identifier Authority="GS1"/>
        </sh:Receiver>
        <sh:DocumentIdentification>
            <sh:Standard>GS1</sh:Standard>
            <sh:TypeVersion>3.3</sh:TypeVersion>
            <sh:InstanceIdentifier>100002</sh:InstanceIdentifier>
            <sh:Type>order</sh:Type>
            <sh:MultipleType>false</sh:MultipleType>
            <sh:CreationDateAndTime>2011-04-08T14:58:56.591Z</sh:CreationDateAndTime>
        </sh:DocumentIdentification>
    </sh:StandardBusinessDocumentHeader>
    <!-- ********************************************************************************** -->
    <!-- NOTE : Comments for a field appear AFTER the field -->
    <!-- This is a Purchase Order, submitted from a Retailer to a Supplier.
         Currently, retailer sends a copy ("shadows") of the PO, in this format to the Blockchain.
         Every field here, unless indicated as OPTIONAL are Mandatory!  -->
    <!-- ********************************************************************************** -->
    <order>
        <creationDateTime>2011-04-08T14:58:56.591Z</creationDateTime>
        <!-- MANDATORY: Purchase Order CreationDate And Time : Created by Retailer
             UTC time (ISO 8601) when the PO was created. -->
        <documentStatusCode>ORIGINAL</documentStatusCode>
        <!-- Do not change. -->
        <orderIdentification>
            <entityIdentification>urn:epcglobal:cbv:bt:5412345000037:3352</entityIdentification>
            <!--MANDATORY-->
            <!-- urn:epcglobal:cbv:bt:ShipToGLN:PONumber -->
            <!-- Retailer Purchase Order Number: The format for this is urn:epcglobal:cbv:bt:<gln>:<po-number>,
                 where <gln> is the "shipTo" GLN (following) AND  <po-number> should NOT contain a ":" character.  
                 Using this notation allows a supplier to put in a reference to this PO (using the same format)
                 from other EPCIS events and Business Txn documents.
                 Reference: https://www.gs1.org/sites/default/files/docs/epc/CBV-Standard-1-2-1-r-2017-05-05.pdf  [Section 8.5.2] -->
            <!-- IBM Blockchain Transparent Supply Transaction ID format: urn:ibm:ift:bt:<Company Prefix>.<Location Reference>.<Transaction Id>-->
            <!-- where <Company Prefix>.<Location Reference> are for the "shipTo" location-->
        </orderIdentification>
        <orderTypeCode>220</orderTypeCode>
        <!-- Code for buyer to order (220 is default).
             For other codes, refer: http://apps.gs1.org/GDD/Pages/clDetails.aspx?semanticURN=urn:gs1:gdd:cl:OrderTypeCode&release=2 -->
        <buyer>
            <gln>5412345000013</gln>
            <!-- MANDATORY: Retailer Corporate Identity GLN -->
        </buyer>
        <seller>
            <gln>4098765000010</gln>
            <!-- Seller Corporate Identity gln to be communicated from the seller (shipper) to the buyer (retailer).  
            MANDATORY for the buyer to provide visibility of the PO to the seller; OPTIONAL otherwise (seller will
            not be able to see the PO) -->
        </seller>
        <!-- NOTE: <seller>, </seller> tags should not be omitted even if <gln> is omitted. -->
        <orderLogisticalInformation>
            <shipFrom>
                <gln>4098765000010</gln>
                <!-- OPTIONAL: Shipper Dispatch location GLN (factory). This is mandatory in the associated DA(s) -->
            </shipFrom>
            <shipTo>
                <gln>5412345000037</gln>
                <!-- MANDATORY: Retailer Receiving location gln (Distribution Centre) -->
            </shipTo>
            <orderLogisticalDateInformation>
                <requestedDeliveryDateTime>
                    <date>2011-04-11</date>
                    <!-- MANDATORY: Requested Delivery date (ISO8601 i.e. yyyy-mm-dd) at Retailer Receiving location when the PO was created. -->
                    <time>10:32:56.321Z</time>
                    <!-- OPTIONAL: Requested Delivery time (ISO8601 i.e. hh:mm:ss.sssZ) . GS1 DateOptionalTime Type-->
                </requestedDeliveryDateTime>
            </orderLogisticalDateInformation>
        </orderLogisticalInformation>
        <!-- NOTE: We will also reflect the orderLogisticalInformation at a LineItem level for future/other retailers. -->
        <referencedOrder>
            <entityIdentification>urn:epcglobal:cbv:bt:5412345000037:PO4487</entityIdentification>
            <!-- MANDATORY: Top-level referenced purchase order identifier -->
            <!-- urn:epcglobal:cbv:bt:<gln>:<po-number> -->
            <!-- IBM Blockchain Transparent Supply Transaction ID format: urn:ibm:ift:bt:<Company Prefix>.<Location Reference>.<Transaction Id>-->
            <lineItemNumber>2</lineItemNumber>
            <!-- OPTIONAL: Related line item number -->
            <orderRelationship>RELATED</orderRelationship>
            <!-- MANDATORY: Relationship between the purchase orders. -->
            <!-- Must be one of code values from http://apps.gs1.org/GDD/Pages/clDetails.aspx?semanticURN=urn:gs1:gdd:cl:OrderRelationshipTypeCode -->
        </referencedOrder>
        <!-- OPTIONAL: Reference to a related purchase order. -->
        <extension>
            <isReturnOrder>true</isReturnOrder>
        </extension>
        <!-- OPTIONAL: "true" indicates the purchase order is a return order.-->
        <orderLineItem>
            <lineItemNumber>1</lineItemNumber>
            <!-- MANDATORY: Numerical Sequential number for items in the PO -->
            <requestedQuantity measurementUnitCode="EA">48</requestedQuantity>
            <!-- MANDATORY: item requested/ordered Quantity by the Retailer with measurement Unit Attribute-->
            <!-- Two or three-character codes from UN/CEFACT Recommendation 20.-->
            <!-- Examples: EA (each), LBR (pound), CS (case), KGM (kilogram).-->
            <!-- See https://www.unece.org/fileadmin/DAM/cefact/recommendations/rec20/rec20_rev3_Annex2e.pdf and-->
            <!-- https://www.unece.org/fileadmin/DAM/cefact/recommendations/rec20/rec20_rev3_Annex3e.pdf.-->
            <itemPriceBaseQuantity measurementUnitCode="KGM">48</itemPriceBaseQuantity>
            <!-- OPTIONAL: item requested/ordered price base quantity with measurement Unit Attribute. -->
            <transactionalTradeItem>
                <gtin>40987650000223</gtin>
                <!-- MANDATORY: GS1-14 representation of item ordered by the Retailer-->
            </transactionalTradeItem>
        </orderLineItem>
        <orderLineItem>
            <lineItemNumber>2</lineItemNumber>
            <!-- Numerical Sequential number for next item in the PO -->
            <requestedQuantity measurementUnitCode="EA">24</requestedQuantity>
            <!-- item requested/ordered Quantity by the Retailer with measurement Unit Attribute-->
            <transactionalTradeItem>
                <gtin>40987650000346</gtin>
                <!-- GS1-14 representation of item ordered by the Retailer-->
            </transactionalTradeItem>
            <referencedOrder>
                <entityIdentification>urn:epcglobal:cbv:bt:5412345000037:PO4488</entityIdentification>
                <!-- MANDATORY: Line-level referenced purchase order identifier -->
                <!-- urn:epcglobal:cbv:bt:<gln>:<po-number> -->
                <!-- IBM Blockchain Transparent Supply Transaction ID format: urn:ibm:ift:bt:<Company Prefix>.<Location Reference>.<Transaction Id>-->
                <lineItemNumber>2</lineItemNumber>
                <!-- OPTIONAL: Related line item number -->
                <orderRelationship>RELATED</orderRelationship>
                <!-- MANDATORY: Relationship between the purchase orders. -->
                <!-- Must be one of code values from http://apps.gs1.org/GDD/Pages/clDetails.aspx?semanticURN=urn:gs1:gdd:cl:OrderRelationshipTypeCode -->
            </referencedOrder>
            <!-- OPTIONAL: Reference to a related purchase order. Overrides top-level referencedOrder for line item if it exists. -->
            <returnReasonCode>27</returnReasonCode>
            <!-- OPTIONAL: The reason code for returning items. -->
            <!-- Must be one of code values from http://www.unece.org/fileadmin/DAM/trade/untdid/d18a/tred/tred7007.htm -->
            <extension>
                <epcList>
                    <epc>urn:epc:id:sgtin:0614141.107346.2017</epc>
                    <epc>urn:epc:id:sgtin:0614141.107346.2018</epc>
                </epcList>
                <!--OPTIONAL: List of instance-level objects (SSCC, SGTIN) expected to be returned.-->
                <quantityList>
                    <quantityElement>
                        <epcClass>urn:epc:class:lgtin:0614141.107346.101</epcClass>
                        <!--MANDATORY for quantityElement. Class-level EPCs like LGTINs. -->
                        <quantity>10</quantity>
                        <!--OPTIONAL for quantityElement.-->
                        <!--Meaning: 10 cases of LGTIN '0614141.107346' belonging to lot '101'-->
                        <uom>CS</uom>
                        <!--OPTIONAL for quantityElement. Item quantity unit of measurement. "CS" = Case.-->
                        <!--Two or three-character codes from UN/CEFACT Recommendation 20.-->
                        <!--Other examples: EA (each), LBR (pound), KGM (kilogram).-->
                        <!--See https://www.unece.org/fileadmin/DAM/cefact/recommendations/rec20/rec20_rev3_Annex2e.pdf and-->
                        <!--https://www.unece.org/fileadmin/DAM/cefact/recommendations/rec20/rec20_rev3_Annex3e.pdf.-->
                    </quantityElement>
                    <quantityElement>
                        <epcClass>urn:epc:class:lgtin:0614141.107346.102</epcClass>
                        <quantity>20</quantity>
                        <uom>CS</uom>
                    </quantityElement>
                </quantityList>
                <!--OPTIONAL: List of class-level objects like LGTINS (GTIN+lot) which are expected to be returned.-->
            </extension>
            <!-- OPTIONAL: extension for return order details-->
        </orderLineItem>
    </order>
</order:orderMessage>

Using XML as a standard

 

Syntax is interoperable

 

...but each XML schema is effectively a custom domain model

→ the meaning is not interoperable
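
The point above can be sketched in a few lines of Python. The first fragment reuses the buyer element from the GS1 order; the second vocabulary (`purchase`, `customer`, `id`) is hypothetical and invented only for this illustration. Both fragments parse with the same XML parser, yet each needs its own extraction code.

```python
import xml.etree.ElementTree as ET

# Fragment in the style of the GS1 order shown earlier.
gs1_order = "<order><buyer><gln>5412345000013</gln></buyer></order>"

# HYPOTHETICAL second vocabulary, invented for this illustration only.
other_order = '<purchase><customer id="5412345000013"/></purchase>'


def buyer_from_gs1(xml_text):
    """In the GS1-style fragment the buyer is the text of buyer/gln."""
    return ET.fromstring(xml_text).findtext("buyer/gln")


def buyer_from_other(xml_text):
    """In the hypothetical vocabulary the buyer is an attribute."""
    return ET.fromstring(xml_text).find("customer").get("id")


# Same parser and same syntax rules, but one extractor per vocabulary.
print(buyer_from_gs1(gs1_order))
print(buyer_from_other(other_order))

# The GS1 extractor parses the other document without error
# but finds no buyer in it.
print(buyer_from_gs1(other_order))
```

A shared provenance model avoids writing one such extractor per partner format.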

IBM ProvLake

Recording machine learning provenance

https://research.ibm.com/projects/provlake

Distributed provenance chains

Fast Healthcare Interoperability Resources (FHIR)

PROV as part of a standard for healthcare data exchange

https://hl7.org/fhir/provenance.html

Using PROV programmatically

PROV is a conceptual model with several machine-readable formats (PROV-O, PROV-N, PROV-JSON, etc.)

import prov.model as prov
import datetime

# Create a document and declare its namespaces
document = prov.ProvDocument()
document.set_default_namespace('http://anotherexample.org/')
document.add_namespace('ex', 'http://example.org/')

# An entity with a PROV type and three application-specific attributes
e2 = document.entity('e2', (
    (prov.PROV_TYPE, "File"),
    ('ex:path', "/shared/crime.txt"),
    ('ex:creator', "Alice"),
    ('ex:content', "There was a lot of crime in London last month"),
))
# An activity that starts now; no end time is recorded
a1 = document.activity('a1', datetime.datetime.now(), None, {prov.PROV_TYPE: "edit"})
# Relations accept record objects (e2, a1) or identifiers ('a1', 'ag2')
document.wasGeneratedBy(e2, a1, None, {'ex:fct': "save"})
document.wasAssociatedWith('a1', 'ag2', None, None, {prov.PROV_ROLE: "author"})
document.agent('ag2', {prov.PROV_TYPE: 'prov:Person', 'ex:name': "Bob"})

# Serialise to PROV-N; the result looks like this:
print(document.get_provn())
document
  default <http://anotherexample.org/>
  prefix ex <http://example.org/>
  entity(e2, [prov:type="File", ex:creator="Alice",
              ex:content="There was a lot of crime in London last month",
              ex:path="/shared/crime.txt"])
  activity(a1, 2014-07-09T16:39:38.795839, -, [prov:type="edit"])
  wasGeneratedBy(e2, a1, -, [ex:fct="save"])
  wasAssociatedWith(a1, ag2, -, [prov:role="author"])
  agent(ag2, [prov:type="prov:Person", ex:name="Bob"])
endDocument
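
To illustrate that the same statements can be carried by another format, here is a minimal sketch of the document above written by hand as PROV-JSON, using only the standard library. The key layout follows the PROV-JSON member submission as understood here and has not been validated against its schema; the relation identifiers `_:wGB1` and `_:wAW1` are arbitrary labels chosen for this example.

```python
import json

# The same five statements as in the PROV-N output, grouped by statement type.
prov_json = {
    "prefix": {
        "default": "http://anotherexample.org/",
        "ex": "http://example.org/",
    },
    "entity": {
        "e2": {
            "prov:type": "File",
            "ex:path": "/shared/crime.txt",
            "ex:creator": "Alice",
            "ex:content": "There was a lot of crime in London last month",
        }
    },
    "activity": {
        "a1": {
            "prov:startTime": "2014-07-09T16:39:38.795839",
            "prov:type": "edit",
        }
    },
    "agent": {"ag2": {"prov:type": "prov:Person", "ex:name": "Bob"}},
    # Relation identifiers are arbitrary labels (an assumption of this sketch).
    "wasGeneratedBy": {
        "_:wGB1": {"prov:entity": "e2", "prov:activity": "a1", "ex:fct": "save"}
    },
    "wasAssociatedWith": {
        "_:wAW1": {"prov:activity": "a1", "prov:agent": "ag2", "prov:role": "author"}
    },
}

# Round trip through text, as would happen when exchanging the file.
text = json.dumps(prov_json, indent=2)
restored = json.loads(text)
print(text)
```

In practice a PROV library would normally generate this rather than a hand-built dictionary; the sketch only shows that the model stays the same while the syntax changes.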

Visualising PROV

document
  prefix ex <http://example.com/back-to-the-future/>
  
  entity(ex:results)
  entity(ex:data)
  entity(ex:interviews)

  wasDerivedFrom(ex:results, ex:data)
  wasDerivedFrom(ex:data, ex:interviews)
  wasDerivedFrom(ex:interviews, ex:results)
endDocument

$ provconvert -infile test.provn -outfile test.svg
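
To see what such a drawing contains, the sketch below turns the three derivations into Graphviz DOT text. It is not how provconvert works internally and it is not a PROV-N parser: the two regular expressions are assumptions that only recognise the simple `entity(...)` and two-argument `wasDerivedFrom(...)` statements used in this example.

```python
import re

provn = """
document
  prefix ex <http://example.com/back-to-the-future/>
  entity(ex:results)
  entity(ex:data)
  entity(ex:interviews)
  wasDerivedFrom(ex:results, ex:data)
  wasDerivedFrom(ex:data, ex:interviews)
  wasDerivedFrom(ex:interviews, ex:results)
endDocument
"""

# Only these two statement shapes are recognised (assumption of this sketch).
ENTITY = re.compile(r"entity\(\s*([^,\s)]+)")
DERIVED = re.compile(r"wasDerivedFrom\(\s*([^,\s)]+)\s*,\s*([^,\s)]+)")


def to_dot(text):
    """Return Graphviz DOT with one node per entity and one edge per derivation."""
    lines = ["digraph prov {"]
    for name in ENTITY.findall(text):
        lines.append(f'  "{name}" [shape=ellipse];')
    # Edges point from the derived entity back to the entity it came from.
    for derived, source in DERIVED.findall(text):
        lines.append(f'  "{derived}" -> "{source}" [label="wasDerivedFrom"];')
    lines.append("}")
    return "\n".join(lines)


dot = to_dot(provn)
print(dot)
```

The printed text can be rendered with Graphviz, for example `dot -Tsvg`. The three edges form a cycle (results, data, interviews, results), which is easy to spot in a picture and easy to miss in the text.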

PROV specifications

Three views for provenance modelling

Responsibility view – who was responsible for what?

  Entity → Agent

  Activity → Agent


Data flow view – how did the information move from one piece of data to another?
  Entity → Entity


Process view – what activity consumed/produced the data?
  Activity ↔ Entity
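
The three views can be sketched as a lookup from PROV relation names to views. The grouping is an interpretation of the slide rather than part of the standard: `wasAttributedTo`, `wasAssociatedWith`, `wasDerivedFrom`, `used` and `wasGeneratedBy` match the arrows shown above, while placing `actedOnBehalfOf` (Agent → Agent) and `wasInformedBy` (Activity → Activity) is an assumption, since the slide does not list them.

```python
# Relation names are PROV starting-point relations; the grouping is a sketch.
VIEWS = {
    "responsibility": {"wasAttributedTo", "wasAssociatedWith", "actedOnBehalfOf"},
    "data flow": {"wasDerivedFrom"},
    "process": {"used", "wasGeneratedBy", "wasInformedBy"},
}


def view_of(relation):
    """Return the view a relation belongs to, or None if it is not listed."""
    for view, relations in VIEWS.items():
        if relation in relations:
            return view
    return None


# Relations used in the examples earlier in this lecture.
for relation in ["wasGeneratedBy", "wasAssociatedWith", "wasDerivedFrom"]:
    print(relation, "->", view_of(relation))
```

When modelling, it can help to pick one view first and add relations from the other views only where they answer a question someone will actually ask.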

In practical class

  1. PROV-N intro
  2. PROV-N practical group exercise
  3. PROV Tools: Demo and help with installation
  4. PROV Q&A
  5. Intro and Q&A on PROV Assessed Work

 

Week 4 To Do List before practical (see Canvas)

  1. Try following the instructions for installing PROVToolbox for Windows or for macOS on your computer (you can ask for help in the Q&A)