Why all the different assignments?
- I select data for its problems
- Different data has different needs
- Human readable vs too big to read
- General principles, not specific mechanics (so you can ask Prof. Google later)
Explore
Prepare
Analyze
Present
Data work is a feedback loop
- Exploration will lead to questions
- Exploration reveals preparation issues
- Preparation differs depending on the question
- Content model enables certain questions
- The content model can change depending on the question
Abstraction
- Everything has a structure (table of contents, data table)
- The same structure can be represented in different ways (fruit baskets, data visualization)
- Data is a representation, not the thing
- Sections of the data can be represented abstractly
- (website urls, battleship, ranges, facets, filters, parts of text)

Primary vs secondary data


BAD
GOOD
Best Practices
- Keep an original, clean version
- Keep a backup off-site
- Keep your data in a non-proprietary format (CSV good, Excel and Access bad)
- One sheet, one purpose
- One column, one purpose
- Zero is not the same as absent
- Formatting is not data
- Break up dates
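Two of these practices, "zero is not the same as absent" and "break up dates", can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the column names (`children`, `born`), and the helper `parse_row` are all hypothetical, and the dates are assumed to be in ISO `YYYY-MM-DD` form.

```python
import csv
import io
from datetime import date

# Hypothetical sample data: the empty "children" cell means "not recorded", not 0.
RAW = """name,children,born
Ada,0,1808-03-14
Ben,,1809-11-02
"""

def parse_row(row):
    """Return a cleaned copy of one CSV row (hypothetical helper)."""
    born = date.fromisoformat(row["born"])
    return {
        "name": row["name"],
        # Keep absent values as None instead of silently turning them into 0.
        "children": int(row["children"]) if row["children"] != "" else None,
        # Break the date into one column per part.
        "born_year": born.year,
        "born_month": born.month,
        "born_day": born.day,
    }

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(RAW))]
```

Keeping the absent value as `None` means a later average or count can decide explicitly whether to skip it, rather than treating it as a real zero.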
Data -> Questions -> Data
- Abstraction means stepping back to ask what you're trying to do, not getting lost in the weeds
- Manipulation is not a dirty word!
- Data type determines what you can do with it
- Text vs Numbers, Discrete vs Continuous, Categorical vs unstructured
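A small sketch of why data type matters: the same values behave differently when they are stored as text than when they are treated as numbers. The list below is a made-up example of a numeric column that was read in as text.

```python
# Hypothetical column of ages that arrived as text (for example, from a CSV).
ages_as_text = ["9", "10", "100"]

# Sorting text compares character by character, so "10" comes before "9".
sorted_as_text = sorted(ages_as_text)

# Converting to integers for the comparison gives the numeric order.
sorted_as_numbers = sorted(ages_as_text, key=int)

# Arithmetic only makes sense after converting the text to numbers.
total = sum(int(a) for a in ages_as_text)
```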
Right Tool for the Right Job
Spreadsheets: good for entry, bad for cleaning and analysis
OpenRefine: good for cleaning, bad for entry and analysis
Tableau: good for analysis, bad for entry and cleaning
Python: good for analysis and cleaning, bad for visualization
D3: good for visualization, bad for entry and analysis
Principles of Cleaning
- Keep an untouched original
- Keep a record of your cleaning (export and save steps when done)
- Create a controlled vocabulary in categorical text
- Explore and find outliers: are they typos or real data?
- One column, one purpose
- One row, one purpose
- One cell, one purpose
- Different questions will need different cleaning
- Only you know when you're done!
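As a rough sketch of two of these principles (a controlled vocabulary for categorical text, and a record of the cleaning), the Python below normalizes a made-up "status" column. The values, the `VOCAB` mapping, and the `normalize` helper are all assumptions for illustration; a tool like OpenRefine does the same job interactively.

```python
from collections import Counter

# Hypothetical raw values for a categorical "status" column.
raw_status = ["Married", "married ", "MARRIED", "Single", "single", "Sngle"]

# Controlled vocabulary: every accepted spelling maps to one canonical term.
# The mapping itself is an assumption made for this example.
VOCAB = {"married": "married", "single": "single"}

# A record of what was changed, to save alongside the cleaned data.
cleaning_log = []

def normalize(value):
    """Map a raw value onto the controlled vocabulary, logging every change."""
    key = value.strip().lower()
    if key in VOCAB:
        cleaned = VOCAB[key]
        if cleaned != value:
            cleaning_log.append((value, cleaned))
        return cleaned
    # Unknown values are left alone and flagged: typo or real data?
    cleaning_log.append((value, "UNRESOLVED"))
    return value

cleaned_status = [normalize(v) for v in raw_status]
counts = Counter(cleaned_status)
```

The unresolved value is kept rather than guessed at, which leaves the "typo or data?" decision to a person.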
Thinking with Data
Content type
Discrete: a category or type of thing. Gender, occupation, educational level, and race are all discrete data because they have separate sub-types that do not overlap.
Continuous: a spectrum which is connected. Dates, ages, counts, and money are all continuous data because data can fall on any arbitrary point of a spectrum.
Nominal, Textual, Qualitative or Dimensional
- Textual
- Usually categorical
- Usually mutually exclusive
- Can't be quantified
- Can have a controlled vocabulary
- Can't usually be ordered
- Ex: Fruit boxes, marriage status, item type, hair color, gender
Ordinal, Numeric, Quantitative or Measurable
- Numeric
- Usually can be counted or measured
- Can't have a controlled vocabulary
- Orderable
- Ex: Age, month, amount
Discrete Measure
- Limited number of possible values
- Whole, non-divisible items
- Countable but not measurable
- Usually orderable
- Dates (1808 vs 1809)
- Counts of items (5 children, 4 beavers, 3 skittles, 1 keg)
- Can find a median, average, and sum
Continuous Measure
- Numeric
- Infinite number of possible values
- Can't have a controlled vocabulary
- Always orderable
- Weight, financial cost, distance, time
- Can find a median or average, but a sum is often not meaningful (average weight of people in class vs total weight of people in class)
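The summary operations mentioned for discrete and continuous measures can be sketched with Python's standard `statistics` module. The two lists are invented sample data, not values from any assignment.

```python
import statistics

# Hypothetical discrete counts: whole, non-divisible items.
children_per_family = [5, 4, 3, 1, 2]

# Hypothetical continuous measurements: any value on a spectrum.
weights_kg = [61.5, 72.0, 80.25, 55.0]

# Discrete measure: median, average, and sum all answer sensible questions.
count_median = statistics.median(children_per_family)
count_mean = statistics.mean(children_per_family)
count_total = sum(children_per_family)

# Continuous measure: median and average describe the group.
weight_median = statistics.median(weights_kg)
weight_mean = statistics.mean(weights_kg)
```

Python will happily compute `sum(weights_kg)` too; whether that total means anything depends on the question being asked.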
Stop and ask
What the fuck am I trying to do?
deck
By mkane