Open Refine

Dimitri Brosens & Peter Desmet

Belgian Biodiversity Platform & Research Institute for Nature and Forest

Open Refine

Dimitri Brosens & Peter Desmet

Belgian Biodiversity Platform & Research Institute for Nature and Forest

Based on:

https://docs.google.com/presentation/d/16mbF9J6gJgEEEFDOLNbEU00sqjGyTsY8gXVQGvSn4FQ/edit (Desmet & Brosens)

 

http://www.datacarpentry.org/OpenRefine-ecology-lesson/00-getting-started.html

(Data Carpentry)

 

Open Refine Tutorial 

(Desmet & Brosens)

 

https://drawingbynumbers.org/sites/drawingbynumbers.org/files/workthroughs/googlerefine/index.html#/step-1

Drawing by numbers

 

https://github.com/OpenRefine/OpenRefine/wiki/Recipes

Open Refine

Google Refine is a power tool which  can do all kinds of fancy
things such as merging your data with other datasets,
linking it to databases and converting data into other formats.

 

Google Refine is a tool used to clean up datasets by finding and correcting errors and inconsistencies in the data.

Open Refine

For Biodiversity Data

@bartox: "Damn! Wish I had this 5 years ago! RT @swiertz nice tools ! Format & clean your data with Google Refine http://goo.gl/UniR6 #cleanup #tools" view tweet

 

@Musebrarian: "YIPEEEE! Google Refine works with OAI-PMH XML out of the box. This is going to make my life much easier." view tweet

 

@kb: "It’s kind of ridiculous how exciting I find this: https://code.google.com/p/google-refine/" view tweet

 

@litcritter: "I rarely feel the desire to kiss a corporation on the mouth, but Google Refine is making me come close http://goo.gl/8pvKB #datageek" view tweet

 

@Salesient: "Google refine is awesome. Never before have I been home this early." view tweet

 

@roolio: "Today google #refine saved my afternoon. Every #data #hacker should try it" view tweet

 

 

Open Refine

For Biodiversity Data

  • Data is often messy: Open Refine makes cleaning a lot easier.
  • All steps are saved: your data cleaning becomes repeatable and reversible.
  • Complex tasks made easy, e.g. clustering, facetting...
  • You only work on what you have!
  • Supported by an open source community: lots of tutorials + plugins available
  • Works quite well to up to 100.000 rows of data.
  • More powerful then a spreadsheet;
  • Interactive and a visual interface -> scripting
  • Open Source

 

 

Open Refine

For Biodiversity Data

Open Refine

For Biodiversity Data

Installation:

  • Go to http://openrefine.org, download and install
    • Choose Google Refine 2.5 (stable version)

  • Run open refine in a modern browser
  • No internet connection needed (except when running services)

Open Refine

For Biodiversity Data

You should have something like this:

Open Refine

For Biodiversity Data

Learning Objectives

  • Create a new OpenRefine project
  • How facets can be used to have an overview of the data
  • Use sort and include/exclude to identify and understand possible mistakes in the data
  • Apply clustering algorithms to correct multiple typos at once
  • Split columns into multiple columns, rename and reorder columns
  • Transform columns
  • Add collumns based on ther original column
  • Add columns by fetching urls  
  • Georeference your data
  • Extract Species information
  • Extract useable values from JSON
  • Reconciliation services, add url's to species information
  • Basic use of General Expression Language (GREL)

Open Refine

For Biodiversity Data

Creating a Projct

  • Start the program
  • To create a project,
      - click Create... and it will bring you to "Get data from this computer."
      - Click Browse, find occurrence.txt (Your data)
      - Click next to open occurrence.txt
      - Refine gives you a preview (define parsing)
      - Set uncoding to UTF-8
      - If all looks well, click _Create Project._

Open Refine

For Biodiversity Data

Open Refine

For Biodiversity Data

Facetting

accessing information organized according to a faceted classification system

 

- Scroll over to the scientificName column
  - Click the down arrow and choose > Facet > Text facet
  - Allows targetted editing of your data (i.e. only on a subset)
  - Facet on (Text, Time, Date..).

 

 

Open Refine

For Biodiversity Data

Clustering

Allows to automatically group different but similar values
- Scroll over to the scientificNameAuthorship column
  - Click the down arrow and choose > Facet > Text facet
  - Click on CLuster
  - Correct the wrong or similar values

 

Linnaeus, 1758 --> (Linnaeus, 1758) 

Changed for the complete dataset

 

 

Open Refine

For Biodiversity Data

EDIT CELLS

Allows to ´transform` values (i.e. to number, lowercase, text, date...) in a column 
- Scroll over to the eventDate column
  - Click the down arrow and choose > EDIT CELLS > Choose common transform
  - Transform eventData to type ´Date`
  - Common transforms > Trim leading and trailing whitespace
-Transforms > Using GREL

 

Changed for the complete dataset

 

 

Open Refine

For Biodiversity Data

EDIT Columns

Allows to ´split` columns (by separator or field lenght)
- Scroll over to the eventDate column
  - Click the down arrow and choose > EDIT COLUMNS > Choose split in several columns
  - Choose separator or field length
Allows to ´add` columns based on this column or by fetching URLs
Allows to ´rename, remove or reorder` columns 

Open Refine

For Biodiversity Data

RECONCILIATION 

Allows to ´COMPARE` columns with similar columns from other databases
- Scroll over to the scientificName column
  - Click the down arrow and choose > RECONCILE > Choose 'start reconciling'
  - Add or choose reconciliation service (Add EOL reconciliation service) 
-Choose 'start reconciliation' and wait (coffee)
-!!! Actions (in this case) Match each cell to its best candidate

EOL Service http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_eol.php

http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_eol.php

 

http://iphylo.blogspot.com/2012/02/using-google-refine-and-taxonomic.html

Open Refine

For Biodiversity Data

GENERAL REFINE expression language

Allows to add or change columns by using 'code'
More info: https://github.com/OpenRefine/OpenRefine

Choose 'add column based on this column'

Type your expression 'cell.recon.match.id'


Some expressions:

if(value.length() > 20, "long", "short")

replace(value,'!','')

if(row.starred, "yes", "no")
value.splitByLengths(6,4,2,5)
cells["col1"].value + ", " + cells["col2"].value

https://github.com/OpenRefine/OpenRefine/wiki/Recipes

Open Refine

For Biodiversity Data

GENERAL REFINE expression language

Allows to add or change columns by using 'code'
More info: https://github.com/OpenRefine/OpenRefine
https://github.com/OpenRefine/OpenRefine/wiki/Recipes

Convert to Decimal Latitude or Longitude

For a string holding longitude/latitude information in the format XX degrees, XX minutes, XX.XX seconds the following can be used.

   forNonBlank(value[0,2], v, v.toNumber(), 0) +
   forNonBlank(value[2,4], v, v.toNumber()/60, 0) +
   forNonBlank(value[4,6], v, v.toNumber()/3600, 0) +
   forNonBlank(value[6,8], v, v.toNumber()/360000, 0)

Open Refine

For Biodiversity Data

ADD SPECIES URL (Encyclopedia of LIFE) 

Allows to insert a weblink to EOL species page for each species
- We extracted the scientificName ID's from the EOL database
- Add 'general link' to this value (EDIT CELLS --> TRANSFORM)

"http://eol.org/pages/"+ value + "/overview/

 

Open Refine

For Biodiversity Data

GET SPECIES INFORMATION BY FETCHING URLs

Allows to retrieve species information from external databases
- - Scroll over to the scientificName column, split in 2 columns (" " as separator)
  - Click the down arrow on [scientificName1] and choose > add column by fetching URLs
  - Expression: "http://api.gbif.org/v1/species?name="+value+"%20"+cells["nameColumn2"]  
  - (We use the GBIF api, webservices)
  - Set throttle delay on 500 ms
-GBIF will return JSON by scientificName

Open Refine

For Biodiversity Data

GET SPECIES INFORMATION BY FETCHING URLs

Allows to retrieve species information from external databases (Alt for EOL)
- - Click the down arrow on [scientificName] {reconciled} and choose > add column by fetching URLs
  - Expression: "http://eol.org/api/pages/1.0.json?batch=false&id=" + cell.recon.match.id  
  - (We use the EOL api, webservices)
  - Set throttle delay on 500 ms
-GBIF will return JSON by scientificName

Open Refine

For Biodiversity Data

GET SPECIES INFORMATION BY FETCHING URLs

-Now we need to filter out the correct information 
  - Add column based on the GBIF JSON columns
  - Use the correct parsing expression: value.parseJson()["results"][0]["kingdom"]
  - Retreive the information on the scientificAuthoship
  - Retreive the information with a specific datasetKey (&datasetkey=39653f3e-8d6b-4a94-a202-859359c164c5) (Belgian Species Checklist)

 

 

 

Open Refine

For Biodiversity Data

GEOREFERENCE  BY FETCHING URLs

Allows to georeference geographic information using google API (or openstraatmaps)
- Click the down arrow on what you want to georeference [ex municipality] and choose > add column by fetching URLs
  - Expression: "https://maps.googleapis.com/maps/api/geocode/json?address=" + value + "&key=AIzaSyDY2Z6wehbIqIPrHIb9ljC62pwRqEHOous" (You need a Developper Key from Google)
  - (We use the Google api, webservices)
  - Set throttle delay on 500 ms
-GOOGLE will return JSON by scientificName
- Parse Google JSON in new column

Open Refine

For Biodiversity Data

INVERSE GEOREFERENCE  BY FETCHING URLs

Allows to georeference geographic information using google API (or openstraatmaps)
- Click the down arrow on what you want to georeference [ex municipality] and choose > add column by fetching URLs
  - Expression: 
  https://maps.googleapis.com/maps/api/geocode/json?latlng=40.714224,-73.961452&key=YOUR_API_KEY
 (You need a Developper Key from Google)
  - (We use the Google api, webservices)
  - Set throttle delay on 500 ms
-GOOGLE will return JSON by decimal lat & long
- Parse Google JSON in new column

Open Refine

For Biodiversity Data

CROSS referencing

Allows to cross reference data from one to another dataset.
  • Merge data from the two projects by creating a new column from values from an existing column within one project that are used to index into a similar column in the other project

    • cell.cross("datasetname.csv","scientificName").cells["order"].value[0]

Open Refine

For Biodiversity Data

Extract operation history

 

Extract and save parts of your operation history as JSON that you can apply to this or other projects in the future.

 

Export your data

 

Export your data by using the Export button, choose the Custom Tabular Exporter

- Order and Select columns you wan't to export

- Select the Data/time format and options for ID

- Choose the download format & Download

 

Open Refine

For Biodiversity Data

Standardization to Darwin Core

  1. Upload your messy dataset  

  2. Use to DarwinCore JSON script (Make sure you have the initial column" 

[
  {
    "op": "core/column-addition",
    "description": "Create column type at index 3 based on column To DwC using expression grel:\"Event\"",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "type",
    "columnInsertIndex": 3,
    "baseColumnName": "To DwC",
    "expression": "grel:\"Event\"",
    "onError": "set-to-blank"
  },
  {
    "op": "core/column-addition",
    "description": "Create column language at index 4 based on column type using expression grel:\"en\"",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "language",
    "columnInsertIndex": 4,
    "baseColumnName": "type",
    "expression": "grel:\"en\"",
    "onError": "set-to-blank"
  },
  {
    "op": "core/column-addition",
    "description": "Create column license at index 5 based on column language using expression grel:\"http://creativecommons.org/publicdomain/zero/1.0/\"",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "license",
    "columnInsertIndex": 5,
    "baseColumnName": "language",
    "expression": "grel:\"http://creativecommons.org/publicdomain/zero/1.0/\"",
    "onError": "set-to-blank"
  },
  {
    "op": "core/column-addition",
    "description": "Create column rightsHolder at index 6 based on column license using expression grel:\"INBO\"",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "rightsHolder",
    "columnInsertIndex": 6,
    "baseColumnName": "license",
    "expression": "grel:\"INBO\"",
    "onError": "set-to-blank"
  },
  {
    "op": "core/column-addition",
    "description": "Create column accessRights at index 7 based on column rightsHolder using expression grel:\"http://www.inbo.be/en/norms-for-data-use\"",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "accessRights",
    "columnInsertIndex": 7,
    "baseColumnName": "rightsHolder",
    "expression": "grel:\"http://www.inbo.be/en/norms-for-data-use\"",
    "onError": "set-to-blank"
  },
  {
    "op": "core/column-addition",
    "description": "Create column institutionCode at index 8 based on column accessRights using expression grel:\"INBO\"",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "institutionCode",
    "columnInsertIndex": 8,
    "baseColumnName": "accessRights",
    "expression": "grel:\"INBO\"",
    "onError": "set-to-blank"
  },
  {
    "op": "core/column-addition",
    "description": "Create column datasetName at index 9 based on column institutionCode using expression grel:\"insertDatasetName\"",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "datasetName",
    "columnInsertIndex": 9,
    "baseColumnName": "institutionCode",
    "expression": "grel:\"insertDatasetName\"",
    "onError": "set-to-blank"
  },
  {
    "op": "core/column-addition",
    "description": "Create column ownerInstitutionCode at index 10 based on column datasetName using expression grel:\"INBO\"",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "ownerInstitutionCode",
    "columnInsertIndex": 10,
    "baseColumnName": "datasetName",
    "expression": "grel:\"INBO\"",
    "onError": "set-to-blank"
  },
  {
    "op": "core/column-addition",
    "description": "Create column basisOfREcord at index 11 based on column ownerInstitutionCode using expression grel:\"http://rs.tdwg.org/dwc/terms/#basisOfRecord\"",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "basisOfREcord",
    "columnInsertIndex": 11,
    "baseColumnName": "ownerInstitutionCode",
    "expression": "grel:\"http://rs.tdwg.org/dwc/terms/#basisOfRecord\"",
    "onError": "set-to-blank"
  },
  {
    "op": "core/column-addition",
    "description": "Create column countryCode at index 12 based on column basisOfREcord using expression grel:\"http://rs.tdwg.org/dwc/terms/#countryCode\"",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "countryCode",
    "columnInsertIndex": 12,
    "baseColumnName": "basisOfREcord",
    "expression": "grel:\"http://rs.tdwg.org/dwc/terms/#countryCode\"",
    "onError": "set-to-blank"
  },
  {
    "op": "core/column-addition",
    "description": "Create column geodeticDatum at index 13 based on column countryCode using expression grel:\"WGS84\"",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "geodeticDatum",
    "columnInsertIndex": 13,
    "baseColumnName": "countryCode",
    "expression": "grel:\"WGS84\"",
    "onError": "set-to-blank"
  },
  {
    "op": "core/column-addition",
    "description": "Create column kingdom at index 14 based on column geodeticDatum using expression grel:\"http://rs.tdwg.org/dwc/terms/#kingdom\"",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "kingdom",
    "columnInsertIndex": 14,
    "baseColumnName": "geodeticDatum",
    "expression": "grel:\"http://rs.tdwg.org/dwc/terms/#kingdom\"",
    "onError": "set-to-blank"
  },
  {
    "op": "core/column-addition",
    "description": "Create column nomenclaturalCode at index 15 based on column kingdom using expression grel:\"http://rs.tdwg.org/dwc/terms/#nomenclaturalCode\"",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "newColumnName": "nomenclaturalCode",
    "columnInsertIndex": 15,
    "baseColumnName": "kingdom",
    "expression": "grel:\"http://rs.tdwg.org/dwc/terms/#nomenclaturalCode\"",
    "onError": "set-to-blank"
  }
]

Open Refine

For Biodiversity Data

REUSE operation history

 

Start a new project

Add the column where youre initial script is based on (To DwC)

Import parts of your operation history as JSON and apply to this or another

ready!!

 

 

TDWG Refine 4 DwC

By Dimitri Brosens

TDWG Refine 4 DwC

Short introduction on Google Refine, on biodiversity data

  • 392
Loading comments...

More from Dimitri Brosens