Digital Methods for Art Historians
Cleaning Data + APIs
ryan clement
2020-02-24
data services librarian
Who am I?
- I'm Ryan Clement, the Data Services Librarian
- he/him/his
- I work with economics, geography, sociology, anthropology, and philosophy
- But I help people from all over campus when they're working with data!
- You can find my contact info and schedule time to meet at go/ryan/
What are we doing today?
- What is dirty data?
- What is clean data?
- Cleaning data using OpenRefine
- Break!
- What is an API?
- How can you use an API to get data from databases?
- Using OpenRefine to access the Harvard Art Museums API
Dirty data?
What are some sources of dirty data?
- Open fields in data entry/collection process
- surveys
- typing to enter data
- Digitization errors
- Chaining of data errors: not working from the original data (or from a copy with reputable provenance)
- Time
What are some kinds of errors in data?
- Incorrect spelling or punctuation
- Missing data
- Data in the wrong field
- Incomplete data
- Duplicated data
- Non-standardized data
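One way to spot several of these errors at once is to count the distinct values in a column, much like a text facet in OpenRefine. The Python sketch below is only an illustration: the `medium` column and its values are made up, and `profile` is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Hypothetical values from a free-text "medium" column (made up for illustration).
medium = ["Oil on canvas", "oil on canvas", "Oil on canvas ", "",
          "Oil on canvsa", "Bronze", "Bronze"]

def profile(values):
    """Count exact values, blanks, and groups that differ only by case or spacing."""
    counts = Counter(values)
    missing = sum(1 for v in values if not v.strip())
    groups = {}
    for v in values:
        if v.strip():
            # Group values by a normalized key (trimmed, lowercased).
            groups.setdefault(v.strip().lower(), set()).add(v)
    variants = {k: sorted(g) for k, g in groups.items() if len(g) > 1}
    return counts, missing, variants

counts, missing, variants = profile(medium)
print(counts.most_common())
print("missing:", missing)
print("variants:", variants)
```

Note that this simple grouping will not catch the misspelling "Oil on canvsa"; errors like that need clustering or a manual look.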
Empty responses
Open response fields lead to messy data
"Against Cleaning" (Rawson & Muñoz, 2016)
"Color Survey Results" (Munroe, 2010)
Color coding
Leading and trailing space
Unnecessary information in the data values
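Leading and trailing spaces and extra notes inside a value can be removed with a couple of regular expressions. This is a minimal Python sketch, assuming the unnecessary information is a trailing note in parentheses; `clean_value` and the sample value are hypothetical.

```python
import re

def clean_value(raw):
    """Trim outer whitespace, collapse inner whitespace, drop a trailing parenthetical."""
    # Collapse any run of whitespace to one space, then trim both ends.
    value = re.sub(r"\s+", " ", raw).strip()
    # Assumption: the unnecessary information is a trailing note in parentheses.
    value = re.sub(r"\s*\([^)]*\)$", "", value)
    return value

print(repr(clean_value("  Oil on   canvas (see notes) ")))
```

It is usually safer to write the cleaned value to a new column and keep the original, so nothing is lost if the rule turns out to be too aggressive.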
What is 'tidy' data?
- Each variable you measure should be in one column
- Each different observation of that variable should be in a different row
- There should be one table for each “kind” of variable
- If you have multiple tables, they should include a column in the table that allows them to be linked (i.e. a unique ID)
From The Elements of Data Analytic Style (Leek, 2015)
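To make these rules concrete, here is a small Python sketch that reshapes a "wide" table (one column per year) into a tidy one (one observation per row). The museums, counts, column names, and the `to_tidy` helper are all made up for illustration.

```python
# Hypothetical "wide" table: one row per museum, one column per year (made-up counts).
wide = [
    {"museum_id": 1, "museum": "Museum A", "2018": 120, "2019": 135},
    {"museum_id": 2, "museum": "Museum B", "2018": 80, "2019": 95},
]

def to_tidy(rows, id_cols, var_name, value_name):
    """Turn each non-id column into its own row: one observation per row."""
    tidy = []
    for row in rows:
        for col, val in row.items():
            if col in id_cols:
                continue
            # Copy the identifying columns, then add the variable and its value.
            record = {k: row[k] for k in id_cols}
            record[var_name] = col
            record[value_name] = val
            tidy.append(record)
    return tidy

tidy = to_tidy(wide, ["museum_id", "museum"], "year", "acquisitions")
for record in tidy:
    print(record)
```

The `museum_id` column plays the role of the unique ID that would let this table be linked to another one.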
More data cleaning steps...
Other Considerations:
- Survey data should have the questions associated with the data
- You may want an associated “Read Me” file or code book explaining the meaning of the variables.
- Spell checking
- Removing duplicates
- Normalizing case
- Normalizing dates and times
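The last three steps above can be sketched in Python. This is an illustration under stated assumptions: the rows are made up, the three date formats listed are assumed to be the only ones present, and `normalize_date` and `normalize_rows` are hypothetical helpers.

```python
from datetime import datetime

# Assumption: these are the only date formats present in the hypothetical data.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y"]

def normalize_date(text):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def normalize_rows(rows):
    """Normalize case and dates, then drop rows that become exact duplicates."""
    seen, cleaned = set(), []
    for artist, date in rows:
        key = (artist.strip().title(), normalize_date(date))
        if key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned

rows = [
    ("mary cassatt", "02/24/2020"),
    ("Mary Cassatt", "2020-02-24"),
    ("MARY CASSATT", "February 24, 2020"),
]
print(normalize_rows(rows))
```

Two cautions: title-casing changes names such as "van Gogh", and a date like 02/03/2020 is ambiguous between month-first and day-first conventions, so check how the data was entered before choosing a format.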
Cleaning data + ethics
- “Against Cleaning” (Rawson & Muñoz, 2016)
- Preserving data
- Preserving labor
- Differences between research data and administrative data
- Data cleaning to protect privacy
- Top coding/bottom coding
- Permission to share
- Thinking about sharing - who are you helping, who are you hurting?
- Shared data: how much do you know about it, and so how much can you clean?
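Top coding replaces every value above a chosen threshold with the threshold itself, and bottom coding does the same for values below a lower threshold, so that unusual individuals are harder to identify. A minimal Python sketch, with made-up ages and thresholds and a hypothetical helper name:

```python
def top_bottom_code(values, lower, upper):
    """Clamp each value to the range [lower, upper] so extreme values are hidden."""
    return [min(max(v, lower), upper) for v in values]

ages = [17, 34, 52, 91, 103]  # made-up example values
print(top_bottom_code(ages, lower=18, upper=90))
```

Where to set the thresholds is a judgment call that depends on the data and on who could be identified from it.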
More data cleaning resources...
- OpenRefine Recipes: https://github.com/OpenRefine/OpenRefine/wiki/Recipes
- OpenRefine documentation: http://openrefine.org/documentation.html
- Regex Browser: https://regexr.com/
harc0355 data cleaning and APIs
By Ryan Clement