Dimitri Brosens & Peter Desmet
Belgian Biodiversity Platform & Research Institute for Nature and Forest
Dimitri Brosens & Peter Desmet
Belgian Biodiversity Platform & Research Institute for Nature and Forest
Based on:
https://docs.google.com/presentation/d/16mbF9J6gJgEEEFDOLNbEU00sqjGyTsY8gXVQGvSn4FQ/edit (Desmet & Brosens)
http://www.datacarpentry.org/OpenRefine-ecology-lesson/00-getting-started.html
(Data Carpentry)
Open Refine Tutorial
(Desmet & Brosens)
https://drawingbynumbers.org/sites/drawingbynumbers.org/files/workthroughs/googlerefine/index.html#/step-1
Drawing by numbers
https://github.com/OpenRefine/OpenRefine/wiki/Recipes
Google Refine is a power tool which can do all kinds of fancy
things such as merging your data with other datasets,
linking it to databases and converting data into other formats.
Google Refine is a tool used to clean up datasets by finding and correcting errors and inconsistencies in the data.
For Biodiversity Data
@bartox: "Damn! Wish I had this 5 years ago! RT @swiertz nice tools ! Format & clean your data with Google Refine http://goo.gl/UniR6 #cleanup #tools" view tweet
@Musebrarian: "YIPEEEE! Google Refine works with OAI-PMH XML out of the box. This is going to make my life much easier." view tweet
@kb: "It’s kind of ridiculous how exciting I find this: https://code.google.com/p/google-refine/" view tweet
@litcritter: "I rarely feel the desire to kiss a corporation on the mouth, but Google Refine is making me come close http://goo.gl/8pvKB #datageek" view tweet
@Salesient: "Google refine is awesome. Never before have I been home this early." view tweet
@roolio: "Today google #refine saved my afternoon. Every #data #hacker should try it" view tweet
For Biodiversity Data
For Biodiversity Data
For Biodiversity Data
Installation:
For Biodiversity Data
You should have something like this:
For Biodiversity Data
For Biodiversity Data
To create a project, - click Create... and it will bring you to "Get data from this computer." - Click Browse, find occurrence.txt (Your data) - Click next to open occurrence.txt - Refine gives you a preview (define parsing) - Set uncoding to UTF-8 - If all looks well, click _Create Project._
For Biodiversity Data
For Biodiversity Data
accessing information organized according to a faceted classification system
- Scroll over to the scientificName column - Click the down arrow and choose > Facet > Text facet - Allows targetted editing of your data (i.e. only on a subset) - Facet on (Text, Time, Date..).
For Biodiversity Data
Allows to automatically group different but similar values - Scroll over to the scientificNameAuthorship column - Click the down arrow and choose > Facet > Text facet - Click on CLuster - Correct the wrong or similar values
Linnaeus, 1758 --> (Linnaeus, 1758)
Changed for the complete dataset
For Biodiversity Data
Allows to ´transform` values (i.e. to number, lowercase, text, date...) in a column - Scroll over to the eventDate column - Click the down arrow and choose > EDIT CELLS > Choose common transform - Transform eventData to type ´Date` - Common transforms > Trim leading and trailing whitespace
-Transforms > Using GREL
Changed for the complete dataset
For Biodiversity Data
Allows to ´split` columns (by separator or field lenght) - Scroll over to the eventDate column - Click the down arrow and choose > EDIT COLUMNS > Choose split in several columns - Choose separator or field length Allows to ´add` columns based on this column or by fetching URLs
Allows to ´rename, remove or reorder` columns
For Biodiversity Data
Allows to ´COMPARE` columns with similar columns from other databases - Scroll over to the scientificName column - Click the down arrow and choose > RECONCILE > Choose 'start reconciling' - Add or choose reconciliation service (Add EOL reconciliation service) -Choose 'start reconciliation' and wait (coffee)
-!!! Actions (in this case) Match each cell to its best candidate
EOL Service http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_eol.php
http://iphylo.blogspot.com/2012/02/using-google-refine-and-taxonomic.html
For Biodiversity Data
Allows to add or change columns by using 'code'
More info: https://github.com/OpenRefine/OpenRefine Choose 'add column based on this column'
Type your expression 'cell.recon.match.id'
Some expressions:
if(value.length() > 20, "long", "short")
replace(value,'!','')
if(row.starred, "yes", "no")
value.splitByLengths(6,4,2,5)
cells["col1"].value + ", " + cells["col2"].value
https://github.com/OpenRefine/OpenRefine/wiki/Recipes
For Biodiversity Data
Allows to add or change columns by using 'code'
More info: https://github.com/OpenRefine/OpenRefine https://github.com/OpenRefine/OpenRefine/wiki/Recipes
For a string holding longitude/latitude information in the format XX degrees, XX minutes, XX.XX seconds the following can be used.
forNonBlank(value[0,2], v, v.toNumber(), 0) + forNonBlank(value[2,4], v, v.toNumber()/60, 0) + forNonBlank(value[4,6], v, v.toNumber()/3600, 0) + forNonBlank(value[6,8], v, v.toNumber()/360000, 0)
For Biodiversity Data
Allows to insert a weblink to EOL species page for each species - We extracted the scientificName ID's from the EOL database
- Add 'general link' to this value (EDIT CELLS --> TRANSFORM)
"http://eol.org/pages/"+ value + "/overview/
For Biodiversity Data
Allows to retrieve species information from external databases - - Scroll over to the scientificName column, split in 2 columns (" " as separator) - Click the down arrow on [scientificName1] and choose > add column by fetching URLs - Expression: "http://api.gbif.org/v1/species?name="+value+"%20"+cells["nameColumn2"] - (We use the GBIF api, webservices)
- Set throttle delay on 500 ms
-GBIF will return JSON by scientificName
For Biodiversity Data
Allows to retrieve species information from external databases (Alt for EOL)
- - Click the down arrow on [scientificName] {reconciled} and choose > add column by fetching URLs
- Expression: "http://eol.org/api/pages/1.0.json?batch=false&id=" + cell.recon.match.id
- (We use the EOL api, webservices)
- Set throttle delay on 500 ms
-GBIF will return JSON by scientificName
For Biodiversity Data
-Now we need to filter out the correct information
- Add column based on the GBIF JSON columns
- Use the correct parsing expression: value.parseJson()["results"][0]["kingdom"]
- Retreive the information on the scientificAuthoship - Retreive the information with a specific datasetKey (&datasetkey=39653f3e-8d6b-4a94-a202-859359c164c5) (Belgian Species Checklist)
For Biodiversity Data
Allows to georeference geographic information using google API (or openstraatmaps)
- Click the down arrow on what you want to georeference [ex municipality] and choose > add column by fetching URLs
- Expression: "https://maps.googleapis.com/maps/api/geocode/json?address=" + value + "&key=AIzaSyDY2Z6wehbIqIPrHIb9ljC62pwRqEHOous" (You need a Developper Key from Google)
- (We use the Google api, webservices)
- Set throttle delay on 500 ms
-GOOGLE will return JSON by scientificName
- Parse Google JSON in new column
For Biodiversity Data
Allows to georeference geographic information using google API (or openstraatmaps) - Click the down arrow on what you want to georeference [ex municipality] and choose > add column by fetching URLs - Expression:
https://maps.googleapis.com/maps/api/geocode/json?latlng=40.714224,-73.961452&key=YOUR_API_KEY
(You need a Developper Key from Google)
- (We use the Google api, webservices)
- Set throttle delay on 500 ms
-GOOGLE will return JSON by decimal lat & long
- Parse Google JSON in new column
For Biodiversity Data
Allows to cross reference data from one to another dataset.
Merge data from the two projects by creating a new column from values from an existing column within one project that are used to index into a similar column in the other project
cell.cross("datasetname.csv","scientificName").cells["order"].value[0]
For Biodiversity Data
Extract and save parts of your operation history as JSON that you can apply to this or other projects in the future.
Export your data by using the Export button, choose the Custom Tabular Exporter
- Order and Select columns you wan't to export
- Select the Data/time format and options for ID
- Choose the download format & Download
For Biodiversity Data
Upload your messy dataset
Use to DarwinCore JSON script (Make sure you have the initial column"
[
{
"op": "core/column-addition",
"description": "Create column type at index 3 based on column To DwC using expression grel:\"Event\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "type",
"columnInsertIndex": 3,
"baseColumnName": "To DwC",
"expression": "grel:\"Event\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column language at index 4 based on column type using expression grel:\"en\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "language",
"columnInsertIndex": 4,
"baseColumnName": "type",
"expression": "grel:\"en\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column license at index 5 based on column language using expression grel:\"http://creativecommons.org/publicdomain/zero/1.0/\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "license",
"columnInsertIndex": 5,
"baseColumnName": "language",
"expression": "grel:\"http://creativecommons.org/publicdomain/zero/1.0/\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column rightsHolder at index 6 based on column license using expression grel:\"INBO\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "rightsHolder",
"columnInsertIndex": 6,
"baseColumnName": "license",
"expression": "grel:\"INBO\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column accessRights at index 7 based on column rightsHolder using expression grel:\"http://www.inbo.be/en/norms-for-data-use\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "accessRights",
"columnInsertIndex": 7,
"baseColumnName": "rightsHolder",
"expression": "grel:\"http://www.inbo.be/en/norms-for-data-use\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column institutionCode at index 8 based on column accessRights using expression grel:\"INBO\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "institutionCode",
"columnInsertIndex": 8,
"baseColumnName": "accessRights",
"expression": "grel:\"INBO\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column datasetName at index 9 based on column institutionCode using expression grel:\"insertDatasetName\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "datasetName",
"columnInsertIndex": 9,
"baseColumnName": "institutionCode",
"expression": "grel:\"insertDatasetName\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column ownerInstitutionCode at index 10 based on column datasetName using expression grel:\"INBO\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "ownerInstitutionCode",
"columnInsertIndex": 10,
"baseColumnName": "datasetName",
"expression": "grel:\"INBO\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column basisOfREcord at index 11 based on column ownerInstitutionCode using expression grel:\"http://rs.tdwg.org/dwc/terms/#basisOfRecord\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "basisOfREcord",
"columnInsertIndex": 11,
"baseColumnName": "ownerInstitutionCode",
"expression": "grel:\"http://rs.tdwg.org/dwc/terms/#basisOfRecord\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column countryCode at index 12 based on column basisOfREcord using expression grel:\"http://rs.tdwg.org/dwc/terms/#countryCode\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "countryCode",
"columnInsertIndex": 12,
"baseColumnName": "basisOfREcord",
"expression": "grel:\"http://rs.tdwg.org/dwc/terms/#countryCode\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column geodeticDatum at index 13 based on column countryCode using expression grel:\"WGS84\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "geodeticDatum",
"columnInsertIndex": 13,
"baseColumnName": "countryCode",
"expression": "grel:\"WGS84\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column kingdom at index 14 based on column geodeticDatum using expression grel:\"http://rs.tdwg.org/dwc/terms/#kingdom\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "kingdom",
"columnInsertIndex": 14,
"baseColumnName": "geodeticDatum",
"expression": "grel:\"http://rs.tdwg.org/dwc/terms/#kingdom\"",
"onError": "set-to-blank"
},
{
"op": "core/column-addition",
"description": "Create column nomenclaturalCode at index 15 based on column kingdom using expression grel:\"http://rs.tdwg.org/dwc/terms/#nomenclaturalCode\"",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"newColumnName": "nomenclaturalCode",
"columnInsertIndex": 15,
"baseColumnName": "kingdom",
"expression": "grel:\"http://rs.tdwg.org/dwc/terms/#nomenclaturalCode\"",
"onError": "set-to-blank"
}
]
For Biodiversity Data
Start a new project
Add the column where youre initial script is based on (To DwC)
Import parts of your operation history as JSON and apply to this or another
ready!!