Digital Methods for Art Historians

Cleaning Data + APIs

ryan clement

2020-02-24

data services librarian

Who am I?

  • I'm Ryan Clement, the Data Services Librarian
  • he/him/his
  • I work with economics, geography, sociology, anthropology, and philosophy
  • But I help people from all over campus when they're working with data!
  • You can find my contact info and schedule time to meet at go/ryan/

What are we doing today?

  • What is dirty data?
  • What is clean data?
  • Cleaning data using OpenRefine
  • Break!
  • What is an API?
  • How can you use an API to get data from databases?
  • Using OpenRefine to access the Harvard Art Museums API
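
As a rough preview of the API part of the session, here is a minimal sketch of how a request URL for an API can be put together. The base URL and the parameter names (apikey, title, size) are assumptions based on the public Harvard Art Museums API documentation, and "YOUR-API-KEY" is a placeholder; check the current documentation before relying on any of them. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Assumed base URL for the "object" endpoint; verify against the API docs.
BASE_URL = "https://api.harvardartmuseums.org/object"


def build_object_url(api_key, title, size=10):
    """Return a URL asking for up to `size` object records matching `title`."""
    # Parameter names here are assumptions, not confirmed by these slides.
    params = {"apikey": api_key, "title": title, "size": size}
    return f"{BASE_URL}?{urlencode(params)}"


# "YOUR-API-KEY" is a placeholder; a real key is requested from the museum.
url = build_object_url("YOUR-API-KEY", "rabbit", size=5)
print(url)
```

A URL like this is the kind of thing OpenRefine can fetch for each row when adding a column from URLs.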

Dirty data?

What are some sources of dirty data?

  • Open fields in data entry/collection process
    • surveys
    • typing to enter data
  • Digitization errors
  • Chained data errors: working from a copy instead of the original data (or from a source without reputable provenance)
  • Time

What are some kinds of errors in data?

  • Incorrect spelling or punctuation
  • Missing data
  • Data in the wrong field
  • Incomplete data
  • Duplicated data
  • Non-standardized data
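
Some of the error types above can be spotted automatically. The following is a minimal sketch, not a complete validator: the records, field names, and checks are all made up for illustration. It flags exact duplicate rows, empty values, and a year field that does not look like a number (a hint that data may be in the wrong field).

```python
# Hypothetical records; an empty string stands for a missing value.
records = [
    {"artist": "Mary Cassatt", "year": "1893"},
    {"artist": "mary cassatt ", "year": "1893"},
    {"artist": "", "year": "1880"},
    {"artist": "Mary Cassatt", "year": "1893"},
    {"artist": "1874", "year": "Berthe Morisot"},
]


def find_problems(rows):
    """Return (row_index, problem) pairs for a few simple checks."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        key = tuple(sorted(row.items()))
        if key in seen:
            problems.append((i, "duplicate row"))
        seen.add(key)
        if any(value == "" for value in row.values()):
            problems.append((i, "missing value"))
        if row["year"] and not row["year"].isdigit():
            problems.append((i, "year is not numeric"))
    return problems


problems = find_problems(records)
for index, problem in problems:
    print(index, problem)
```

Note that the second record is not flagged: a different case and a trailing space make it look like a distinct value, which is why non-standardized data is hard to catch with exact comparisons.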

Empty responses

Open response fields lead to messy data

"Against Cleaning" (Rawson & Muñoz, 2016)

"Color Survey Results" (Munroe, 2010)

Color coding

Leading and trailing space

Unnecessary information in the data values
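
A minimal sketch of fixing both problems in one pass, assuming that "unnecessary information" means a trailing note in parentheses. That assumption will not hold for every dataset, and the function name is made up for illustration.

```python
import re


def clean_value(value):
    """Drop a trailing parenthetical note and tidy up whitespace."""
    # Remove a note in parentheses at the end of the value (an assumption
    # about what counts as unnecessary in this hypothetical dataset).
    value = re.sub(r"\s*\([^)]*\)\s*$", "", value)
    # Collapse runs of whitespace inside the value to a single space.
    value = re.sub(r"\s+", " ", value)
    # Remove leading and trailing spaces.
    return value.strip()


print(repr(clean_value("  Oil on  canvas (see notes) ")))
```

It is worth keeping the original column alongside the cleaned one, since a note that looks unnecessary now may matter later.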

What is 'tidy' data?

  1. Each variable you measure should be in one column
  2. Each different observation of that variable should be in a different row
  3. There should be one table for each “kind” of variable
  4. If you have multiple tables, each should include a column that allows them to be linked (i.e. a unique ID)
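
To make the first two rules concrete, here is a minimal sketch that reshapes a "wide" table, where each year is its own column, into a tidy one with one observation per row. The table, column names, and function name are invented for illustration.

```python
# Hypothetical "wide" table: one column per year of visitor counts.
wide = [
    {"museum_id": "M1", "2018": 1200, "2019": 1350},
    {"museum_id": "M2", "2018": 800, "2019": 950},
]


def to_long(rows, id_column, variable_name, value_name):
    """Turn every non-ID column into its own row (one observation per row)."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return long_rows


tidy = to_long(wide, "museum_id", "year", "visitors")
for row in tidy:
    print(row)
```

The museum_id column plays the role of the unique ID in rule 4: it is what would link these rows to a separate table describing each museum.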

More data cleaning steps...

Other Considerations:

  • Survey data should have the questions associated with the data
  • You may want an associated “Read Me” file or code book explaining the meaning of the variables.
  • Spell checking
  • Removing duplicates
  • Normalizing case
  • Normalizing dates and times
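
The last three items can be sketched in a few lines. This is one possible approach under stated assumptions: the list of accepted date formats is invented, and "%m/%d/%Y" assumes US-style month-first dates. Duplicates are detected by ignoring case and extra spaces, and the first spelling encountered is the one kept.

```python
from datetime import datetime

# Assumed input formats; extend or reorder to match the data at hand.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def dedupe(values):
    """Remove duplicates that differ only in case or spacing."""
    seen = set()
    kept = []
    for value in values:
        tidy_value = " ".join(value.split())
        key = tidy_value.casefold()
        if key not in seen:
            seen.add(key)
            kept.append(tidy_value)
    return kept


print(normalize_date("02/24/2020"))
print(dedupe(["Mary Cassatt", "mary cassatt ", "MARY  CASSATT", "Berthe Morisot"]))
```

Returning None for an unrecognized date, rather than guessing, leaves the decision about that value to a person.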

Cleaning data + ethics

  1. “Against Cleaning” (Rawson & Muñoz, 2016)
    1. Preserving data
    2. Preserving labor
    3. Differences between research data and administrative data
  2. Data cleaning to protect privacy
    1. Top coding/bottom coding
      1. Greeking - http://www.datamasker.com/DataMasking_WhatYouNeedToKnow.pdf
  3. Permission to share
  4. Thinking about sharing - who are you helping, who are you hurting?
  5. Shared data - how much do you know, so how much can you clean?
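
Top coding and bottom coding, mentioned in the list above, replace extreme values with a threshold so that unusual individuals are harder to identify. A minimal sketch, with thresholds and ages chosen purely for illustration:

```python
def top_bottom_code(values, lower, upper):
    """Replace values below `lower` with `lower` and above `upper` with `upper`."""
    return [min(max(value, lower), upper) for value in values]


# Hypothetical ages; the thresholds 18 and 90 are illustrative only.
ages = [17, 34, 52, 91, 103]
coded = top_bottom_code(ages, lower=18, upper=90)
print(coded)
```

After coding, 90 means "90 or older" and 18 means "18 or younger", so the code book should say so.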

More data cleaning resources...

harc0355 data cleaning and APIs

By Ryan Clement
