Open Refine
Dimitri Brosens & Peter Desmet
Belgian Biodiversity Platform & Research Institute for Nature and Forest
Open Refine
Dimitri Brosens & Peter Desmet
Belgian Biodiversity Platform & Research Institute for Nature and Forest
Based on:
https://docs.google.com/presentation/d/16mbF9J6gJgEEEFDOLNbEU00sqjGyTsY8gXVQGvSn4FQ/edit (Desmet & Brosens)
http://www.datacarpentry.org/OpenRefine-ecology-lesson/00-getting-started.html
(Data Carpentry)
Open Refine Tutorial
(Desmet & Brosens)
https://drawingbynumbers.org/sites/drawingbynumbers.org/files/workthroughs/googlerefine/index.html#/step-1
Drawing by numbers
https://github.com/OpenRefine/OpenRefine/wiki/Recipes
Open Refine
Google Refine is a power tool which can do all kinds of fancy
things such as merging your data with other datasets,
linking it to databases and converting data into other formats.
Google Refine is a tool used to clean up datasets by finding and correcting errors and inconsistencies in the data.
Open Refine
For Biodiversity Data
@bartox: "Damn! Wish I had this 5 years ago! RT @swiertz nice tools ! Format & clean your data with Google Refine http://goo.gl/UniR6 #cleanup #tools" view tweet
@Musebrarian: "YIPEEEE! Google Refine works with OAI-PMH XML out of the box. This is going to make my life much easier." view tweet
@kb: "It’s kind of ridiculous how exciting I find this: https://code.google.com/p/google-refine/" view tweet
@litcritter: "I rarely feel the desire to kiss a corporation on the mouth, but Google Refine is making me come close http://goo.gl/8pvKB #datageek" view tweet
@Salesient: "Google refine is awesome. Never before have I been home this early." view tweet
@roolio: "Today google #refine saved my afternoon. Every #data #hacker should try it" view tweet
Open Refine
For Biodiversity Data
- Data is often messy: Open Refine makes cleaning a lot easier.
- All steps are saved: your data cleaning becomes repeatable and reversible.
- Complex tasks made easy, e.g. clustering, facetting...
- You only work on what you have!
- Supported by an open source community: lots of tutorials + plugins available
- Works quite well to up to 100.000 rows of data.
- More powerful then a spreadsheet;
- Interactive and a visual interface -> scripting
- Open Source
Open Refine
For Biodiversity Data
Open Refine
For Biodiversity Data
Installation:
- Go to http://openrefine.org, download and install
-
Choose Google Refine 2.5 (stable version)
-
- Run open refine in a modern browser
- No internet connection needed (except when running services)
Open Refine
For Biodiversity Data
You should have something like this:
Open Refine
For Biodiversity Data
Learning Objectives
- Create a new OpenRefine project
- How facets can be used to have an overview of the data
- Use sort and include/exclude to identify and understand possible mistakes in the data
- Apply clustering algorithms to correct multiple typos at once
- Split columns into multiple columns, rename and reorder columns
- Transform columns
- Add collumns based on ther original column
- Add columns by fetching urls
- Georeference your data
- Extract Species information
- Extract useable values from JSON
- Reconciliation services, add url's to species information
- Basic use of General Expression Language (GREL)
Open Refine
For Biodiversity Data
Creating a Projct
- Start the program
-
To create a project, - click Create... and it will bring you to "Get data from this computer." - Click Browse, find occurrence.txt (Your data) - Click next to open occurrence.txt - Refine gives you a preview (define parsing) - Set uncoding to UTF-8 - If all looks well, click _Create Project._
Open Refine
For Biodiversity Data
Open Refine
For Biodiversity Data
Facetting
accessing information organized according to a faceted classification system
- Scroll over to the scientificName column - Click the down arrow and choose > Facet > Text facet - Allows targetted editing of your data (i.e. only on a subset) - Facet on (Text, Time, Date..).
Open Refine
For Biodiversity Data
Clustering
Allows to automatically group different but similar values - Scroll over to the scientificNameAuthorship column - Click the down arrow and choose > Facet > Text facet - Click on CLuster - Correct the wrong or similar values
Linnaeus, 1758 --> (Linnaeus, 1758)
Changed for the complete dataset
Open Refine
For Biodiversity Data
EDIT CELLS
Allows to ´transform` values (i.e. to number, lowercase, text, date...) in a column - Scroll over to the eventDate column - Click the down arrow and choose > EDIT CELLS > Choose common transform - Transform eventData to type ´Date` - Common transforms > Trim leading and trailing whitespace
-Transforms > Using GREL
Changed for the complete dataset
Open Refine
For Biodiversity Data
EDIT Columns
Allows to ´split` columns (by separator or field lenght) - Scroll over to the eventDate column - Click the down arrow and choose > EDIT COLUMNS > Choose split in several columns - Choose separator or field length Allows to ´add` columns based on this column or by fetching URLs
Allows to ´rename, remove or reorder` columns
Open Refine
For Biodiversity Data
RECONCILIATION
Allows to ´COMPARE` columns with similar columns from other databases - Scroll over to the scientificName column - Click the down arrow and choose > RECONCILE > Choose 'start reconciling' - Add or choose reconciliation service (Add EOL reconciliation service) -Choose 'start reconciliation' and wait (coffee)
-!!! Actions (in this case) Match each cell to its best candidate
EOL Service http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_eol.php
http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_eol.php
http://iphylo.blogspot.com/2012/02/using-google-refine-and-taxonomic.html
Open Refine
For Biodiversity Data
GENERAL REFINE expression language
Allows to add or change columns by using 'code'
More info: https://github.com/OpenRefine/OpenRefine Choose 'add column based on this column'
Type your expression 'cell.recon.match.id'
Some expressions:
if(value.length() > 20, "long", "short")
replace(value,'!','')
if(row.starred, "yes", "no")
value.splitByLengths(6,4,2,5)
cells["col1"].value + ", " + cells["col2"].value
https://github.com/OpenRefine/OpenRefine/wiki/Recipes
Open Refine
For Biodiversity Data
GENERAL REFINE expression language
Allows to add or change columns by using 'code'
More info: https://github.com/OpenRefine/OpenRefine https://github.com/OpenRefine/OpenRefine/wiki/Recipes
Convert to Decimal Latitude or Longitude
For a string holding longitude/latitude information in the format XX degrees, XX minutes, XX.XX seconds the following can be used.
forNonBlank(value[0,2], v, v.toNumber(), 0) + forNonBlank(value[2,4], v, v.toNumber()/60, 0) + forNonBlank(value[4,6], v, v.toNumber()/3600, 0) + forNonBlank(value[6,8], v, v.toNumber()/360000, 0)
Open Refine
For Biodiversity Data
ADD SPECIES URL (Encyclopedia of LIFE)
Allows to insert a weblink to EOL species page for each species - We extracted the scientificName ID's from the EOL database
- Add 'general link' to this value (EDIT CELLS --> TRANSFORM)
"http://eol.org/pages/"+ value + "/overview/
Open Refine
For Biodiversity Data
GET SPECIES INFORMATION BY FETCHING URLs
Allows to retrieve species information from external databases - - Scroll over to the scientificName column, split in 2 columns (" " as separator) - Click the down arrow on [scientificName1] and choose > add column by fetching URLs - Expression: "http://api.gbif.org/v1/species?name="+value+"%20"+cells["nameColumn2"] - (We use the GBIF api, webservices)
- Set throttle delay on 500 ms
-GBIF will return JSON by scientificName
Open Refine
For Biodiversity Data
GET SPECIES INFORMATION BY FETCHING URLs
Allows to retrieve species information from external databases (Alt for EOL)
- - Click the down arrow on [scientificName] {reconciled} and choose > add column by fetching URLs
- Expression: "http://eol.org/api/pages/1.0.json?batch=false&id=" + cell.recon.match.id
- (We use the EOL api, webservices)
- Set throttle delay on 500 ms
-GBIF will return JSON by scientificName
Open Refine
For Biodiversity Data
GET SPECIES INFORMATION BY FETCHING URLs
-Now we need to filter out the correct information
- Add column based on the GBIF JSON columns
- Use the correct parsing expression: value.parseJson()["results"][0]["kingdom"]
- Retreive the information on the scientificAuthoship - Retreive the information with a specific datasetKey (&datasetkey=39653f3e-8d6b-4a94-a202-859359c164c5) (Belgian Species Checklist)
Open Refine
For Biodiversity Data
GEOREFERENCE BY FETCHING URLs
Allows to georeference geographic information using google API (or openstraatmaps)
- Click the down arrow on what you want to georeference [ex municipality] and choose > add column by fetching URLs
- Expression: "https://maps.googleapis.com/maps/api/geocode/json?address=" + value + "&key=AIzaSyDY2Z6wehbIqIPrHIb9ljC62pwRqEHOous" (You need a Developper Key from Google)
- (We use the Google api, webservices)
- Set throttle delay on 500 ms
-GOOGLE will return JSON by scientificName
- Parse Google JSON in new column
Open Refine
For Biodiversity Data
INVERSE GEOREFERENCE BY FETCHING URLs
Allows to georeference geographic information using google API (or openstraatmaps) - Click the down arrow on what you want to georeference [ex municipality] and choose > add column by fetching URLs - Expression:
https://maps.googleapis.com/maps/api/geocode/json?latlng=40.714224,-73.961452&key=YOUR_API_KEY
(You need a Developper Key from Google)
- (We use the Google api, webservices)
- Set throttle delay on 500 ms
-GOOGLE will return JSON by decimal lat & long
- Parse Google JSON in new column
Open Refine
For Biodiversity Data
CROSS referencing
Allows to cross reference data from one to another dataset.
-
Merge data from the two projects by creating a new column from values from an existing column within one project that are used to index into a similar column in the other project
-
cell.cross("datasetname.csv","scientificName").cells["order"].value[0]
-
Open Refine
For Biodiversity Data
Extract operation history
Extract and save parts of your operation history as JSON that you can apply to this or other projects in the future.
Export your data
Export your data by using the Export button, choose the Custom Tabular Exporter
- Order and Select columns you wan't to export
- Select the Data/time format and options for ID
- Choose the download format & Download
Open Refine
For Biodiversity Data
Standardization to Darwin Core
-
Upload your messy dataset
-
Use to DarwinCore JSON script (Make sure you have the initial column"
[
{
"op": "core/column-addition",
"description": "Create column type at index 3 based on column To DwC using expression grel:\"Event\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "type",
"columnInsertIndex": 3,
"baseColumnName": "To DwC",
"expression": "grel:\"Event\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column language at index 4 based on column type using expression grel:\"en\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "language",
"columnInsertIndex": 4,
"baseColumnName": "type",
"expression": "grel:\"en\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column license at index 5 based on column language using expression grel:\"http://creativecommons.org/publicdomain/zero/1.0/\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "license",
"columnInsertIndex": 5,
"baseColumnName": "language",
"expression": "grel:\"http://creativecommons.org/publicdomain/zero/1.0/\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column rightsHolder at index 6 based on column license using expression grel:\"INBO\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "rightsHolder",
"columnInsertIndex": 6,
"baseColumnName": "license",
"expression": "grel:\"INBO\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column accessRights at index 7 based on column rightsHolder using expression grel:\"http://www.inbo.be/en/norms-for-data-use\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "accessRights",
"columnInsertIndex": 7,
"baseColumnName": "rightsHolder",
"expression": "grel:\"http://www.inbo.be/en/norms-for-data-use\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column institutionCode at index 8 based on column accessRights using expression grel:\"INBO\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "institutionCode",
"columnInsertIndex": 8,
"baseColumnName": "accessRights",
"expression": "grel:\"INBO\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column datasetName at index 9 based on column institutionCode using expression grel:\"insertDatasetName\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "datasetName",
"columnInsertIndex": 9,
"baseColumnName": "institutionCode",
"expression": "grel:\"insertDatasetName\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column ownerInstitutionCode at index 10 based on column datasetName using expression grel:\"INBO\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "ownerInstitutionCode",
"columnInsertIndex": 10,
"baseColumnName": "datasetName",
"expression": "grel:\"INBO\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column basisOfREcord at index 11 based on column ownerInstitutionCode using expression grel:\"http://rs.tdwg.org/dwc/terms/#basisOfRecord\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "basisOfREcord",
"columnInsertIndex": 11,
"baseColumnName": "ownerInstitutionCode",
"expression": "grel:\"http://rs.tdwg.org/dwc/terms/#basisOfRecord\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column countryCode at index 12 based on column basisOfREcord using expression grel:\"http://rs.tdwg.org/dwc/terms/#countryCode\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "countryCode",
"columnInsertIndex": 12,
"baseColumnName": "basisOfREcord",
"expression": "grel:\"http://rs.tdwg.org/dwc/terms/#countryCode\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column geodeticDatum at index 13 based on column countryCode using expression grel:\"WGS84\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "geodeticDatum",
"columnInsertIndex": 13,
"baseColumnName": "countryCode",
"expression": "grel:\"WGS84\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column kingdom at index 14 based on column geodeticDatum using expression grel:\"http://rs.tdwg.org/dwc/terms/#kingdom\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "kingdom",
"columnInsertIndex": 14,
"baseColumnName": "geodeticDatum",
"expression": "grel:\"http://rs.tdwg.org/dwc/terms/#kingdom\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column nomenclaturalCode at index 15 based on column kingdom using expression grel:\"http://rs.tdwg.org/dwc/terms/#nomenclaturalCode\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "nomenclaturalCode",
"columnInsertIndex": 15,
"baseColumnName": "kingdom",
"expression": "grel:\"http://rs.tdwg.org/dwc/terms/#nomenclaturalCode\"",
"onError": "set-to-blank"
}
]
Open Refine
For Biodiversity Data
REUSE operation history
Start a new project
Add the column where youre initial script is based on (To DwC)
Import parts of your operation history as JSON and apply to this or another
ready!!
TDWG Refine 4 DwC
By Dimitri Brosens
TDWG Refine 4 DwC
Short introduction on Google Refine, on biodiversity data
- 3,615