Why all the different assignments???

  • I select data for its problems
  • Different data has different needs
  • Human readable vs too big to read
  • General principles, not specific mechanics (so you can ask Prof. Google later)

Explore

Prepare

Analyze

Present

Data work is a feedback loop

  • Exploration will lead to questions
  • Exploration reveals preparation issues
  • Preparation differs depending on the question
  • Content model enables certain questions
  • The content model can change depending on the question

Abstraction

  • Everything has a structure (table of contents, data table)
  • The same structure can be represented in different ways (fruit baskets, data visualization)
  • Data is a representation, not the thing
  • Sections of the data can be represented abstractly
    • (website urls, battleship, ranges, facets, filters, parts of text)
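As a rough illustration of the last point, the Python sketch below describes sections of a small table as a range, a filter, and a facet. The rows and field names are invented for the example and are not from any course data set.

```python
from collections import Counter

# Hypothetical rows; the field names are illustrative only.
rows = [
    {"item": "apple", "type": "fruit", "year": 1808},
    {"item": "pear", "type": "fruit", "year": 1809},
    {"item": "carrot", "type": "vegetable", "year": 1809},
    {"item": "plum", "type": "fruit", "year": 1810},
]

# Range: a section described by position (rows 1 up to, but not including, 3).
middle_rows = rows[1:3]

# Filter: a section described by a condition.
from_1809 = [row for row in rows if row["year"] == 1809]

# Facet: the table summarized by the distinct values of one column.
type_facet = Counter(row["type"] for row in rows)
```

Each of the three names refers to part of the data without retyping it, which is the sense of "abstractly" used here.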

Primary vs secondary data

BAD

GOOD

Best Practices

  • Keep an original, clean version
  • Keep a backup off-site
  • Keep your data in a non-proprietary format (CSV good, Excel and Access bad)
  • One sheet, one purpose
  • One column, one purpose
  • Zero is not the same as absent
  • Formatting is not data
  • Break up dates

Data -> Questions -> Data

  • Abstraction means stepping back to ask what you're trying to do, not getting lost in the weeds
  • Manipulation is not a dirty word!
  • Data type determines what you can do with it
    • Text vs Numbers, Discrete vs Continuous, Categorical vs unstructured
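A minimal Python sketch of why the data type matters, using invented ages: the same values sort differently as text than as numbers, and an average only makes sense once they are numbers.

```python
from statistics import mean

# Hypothetical ages, first as they might arrive (text), then converted.
ages_as_text = ["9", "10", "25"]
ages_as_numbers = [int(age) for age in ages_as_text]

# Text sorts character by character, so "10" lands before "9".
text_sorted = sorted(ages_as_text)

# Numbers sort by value.
number_sorted = sorted(ages_as_numbers)

# An average is only meaningful for the numeric version.
average_age = mean(ages_as_numbers)
```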

Right Tool for the Right Job

  • Spreadsheets: good for entry, bad for cleaning and analysis
  • OpenRefine: good for cleaning, bad for entry and analysis
  • Tableau: good for analysis, bad for entry and cleaning
  • Python: good for analysis and cleaning, bad for visualization
  • D3: good for visualization, bad for entry and analysis

Principles of Cleaning

  • Keep an untouched original
  • Keep a record of your cleaning (export and save steps when done)
  • Create a controlled vocabulary in categorical text
  • Explore and find outliers: are they typos or real data?
  • One column, one purpose
  • One row, one purpose
  • One cell, one purpose
  • Different questions will need different cleaning
  • Only you know when you're done!
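Several of these principles can be sketched together in Python. The rows, the `VOCABULARY` mapping, and the variable names below are hypothetical. The sketch works on a copy so the original stays untouched, maps known variants onto a controlled vocabulary, logs every change, and sets aside anything unrecognized for a human to judge.

```python
import copy

# Hypothetical data, for illustration only.
original = [
    {"status": "Married "},
    {"status": "single"},
    {"status": "maried"},
    {"status": "widowed"},
]

# Controlled vocabulary: each known variant maps to one approved term.
VOCABULARY = {"married": "married", "maried": "married", "single": "single"}

cleaned = copy.deepcopy(original)  # work on a copy; the original stays untouched
log = []                           # a record of every change made
unresolved = []                    # values to inspect by hand: typo or data?

for index, row in enumerate(cleaned):
    key = row["status"].strip().lower()
    if key in VOCABULARY:
        new_value = VOCABULARY[key]
        if new_value != row["status"]:
            log.append((index, row["status"], new_value))
            row["status"] = new_value
    else:
        unresolved.append((index, row["status"]))
```

Whether "widowed" belongs in the vocabulary is a decision for the person doing the cleaning, which is why the sketch reports it instead of guessing.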

Thinking with Data

Content type

Discrete: a category or type of thing. Gender, occupation, educational level, and race are all discrete data because there are separate sub-types that do not overlap.

Continuous: a spectrum which is connected. Dates, ages, counts, and money are all continuous data because data can fall on any arbitrary point of a spectrum.

Nominal, Textual, Qualitative or Dimensional

  • Textual
  • Usually categorical
  • Usually mutually exclusive
  • Can't be quantified
  • Can have a controlled vocabulary
  • Can't usually be ordered
  • Ex: Fruit boxes, marital status, item type, hair color, gender

Ordinal, Numeric, Quantitative or Measurable

  • Numeric
  • Usually can be counted or measured
  • Can't have a controlled vocabulary
  • Orderable
  • Ex: Age, month, amount

Discrete Measure

  • Limited number of possible values
  • Whole, non-divisible items
  • Countable but not measurable
  • Usually orderable
  • Dates (1808 vs 1809)
  • Counts of items (5 children, 4 beavers, 3 skittles, 1 keg)
  • Can find a median, average, and sum
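The last point can be sketched with Python's standard `statistics` module. The counts below are invented for the example.

```python
from statistics import mean, median

# Hypothetical counts of children per household.
children = [5, 0, 2, 3, 2]

total = sum(children)      # 12 children in all
middle = median(children)  # middle value of the sorted counts: 2
average = mean(children)   # 12 / 5 = 2.4
```

The average of 2.4 is not itself a possible count, even though every individual value is a whole number.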

Continuous Measure

  • Numeric
  • Infinite number of possible values
  • Can't have a controlled vocabulary
  • Always orderable
  • Weight, financial cost, distance, time
  • Can find a median, average, and sum, though the average is often the more useful figure (average weight of people in class vs total weight of people in class)

Stop and ask

What the fuck am I trying to do?

deck

By mkane
