Coming Up

  • Mondays in person, Wednesdays on Zoom
  • Wed March 8: No class meeting, Hum 354 11:30-1 Shirts Powdered Red discussion
    • Zoom stream and recording available
  • Final project: Think about the process and methods you enjoy, not the topic

Why all the different assignments???

  • I select data for its problems
  • Different data has different needs
  • Human readable vs too big to read
  • General principles, not specific mechanics (so you can ask Prof. Google later)





Data work is a feedback loop

  • Exploration will lead to questions
  • Exploration reveals preparation issues
  • Preparation differs depending on the question
  • Content model enables certain questions
  • The content model can change depending on the question


  • Everything has a structure (HTML, data table)
  • The same structure can be represented in different ways (CSS, data visualization, fruit baskets)
  • Data is a representation, not the thing
  • Sections of the data can be represented abstractly
    • (File paths, battleship, ranges, facets, filters, parts of text)
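
A minimal sketch of three ways to point at a section of data: a range (by position), a filter (by condition), and a facet (by grouping). The table, column names, and values are made up for illustration.

```python
# Hypothetical sample table: each row is a dict (all values are illustrative).
rows = [
    {"item": "apple", "type": "fruit", "count": 5},
    {"item": "pear", "type": "fruit", "count": 3},
    {"item": "carrot", "type": "vegetable", "count": 4},
    {"item": "beet", "type": "vegetable", "count": 1},
]

# Range: select by position, like naming a block of cells in a spreadsheet.
first_two = rows[0:2]

# Filter: select by a condition on the values.
fruit_only = [r for r in rows if r["type"] == "fruit"]

# Facet: group the rows by the distinct values of one column.
facets = {}
for r in rows:
    facets.setdefault(r["type"], []).append(r["item"])

print(first_two)
print(fruit_only)
print(facets)
```

Each selection describes which part of the data you mean without listing the rows one by one.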

Data -> Questions -> Data

  • Abstraction means stepping back to ask what you're trying to do, not getting lost in the weeds
  • Data type determines what you can do with it
    • Text vs Numbers, Discrete vs Continuous, Categorical vs unstructured
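
A small sketch of how the same values behave differently as text and as numbers. The values are invented for illustration.

```python
# The same three values, stored once as text and once as numbers.
as_text = ["10", "9", "100"]
as_numbers = [int(v) for v in as_text]

# Text sorts alphabetically, character by character.
sorted_text = sorted(as_text)

# Numbers sort by magnitude.
sorted_numbers = sorted(as_numbers)

# Numbers can be added; text can only be joined end to end.
total = sum(as_numbers)
joined = "".join(as_text)

print(sorted_text)     # ['10', '100', '9']
print(sorted_numbers)  # [9, 10, 100]
print(total)           # 119
print(joined)          # 109100
```

If a numeric column sorts like the first result, it is probably stored as text.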

Principles of Cleaning

  • Keep an untouched original
  • Keep a record of your cleaning (export and save steps when done)
  • Create a controlled vocabulary in categorical text
  • Explore and find outliers--typos or data?
  • One column, one purpose
  • One row, one purpose
  • One cell, one purpose
  • Different questions will need different cleaning
  • Only you know when you're done!
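
A rough sketch of several of these principles together: work on a copy, apply a controlled vocabulary, and keep a record of each change. The rows and the vocabulary mapping are hypothetical; a real project would build the mapping from its own data.

```python
import copy
from collections import Counter

# Hypothetical raw data with inconsistent category spellings.
original = [
    {"name": "A", "status": "Married"},
    {"name": "B", "status": "married "},
    {"name": "C", "status": "Single"},
    {"name": "D", "status": "singel"},
]

# Hypothetical controlled vocabulary: known variants mapped to one term.
VOCAB = {"married": "married", "single": "single", "singel": "single"}

# Keep an untouched original: clean a deep copy instead.
cleaned = copy.deepcopy(original)

# Keep a record of your cleaning: (row name, before, after).
log = []

for row in cleaned:
    before = row["status"]
    key = before.strip().lower()
    # Unknown values are left as-is so a person can decide: typo or data?
    after = VOCAB.get(key, before)
    if after != before:
        row["status"] = after
        log.append((row["name"], before, after))

counts = Counter(row["status"] for row in cleaned)
print(counts)
print(log)
```

Counting the values before and after cleaning is a quick way to spot leftover variants.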

Intro to Dataviz

Nominal, Textual, Qualitative or Dimensional

  • Textual
  • Usually categorical
  • Usually mutually exclusive
  • Can't be quantified
  • Can have a controlled vocabulary
  • Can't usually be ordered
  • Fruit boxes, marriage status, item type, hair color
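
A short sketch of what can be done with nominal data: counting how often each category appears and finding the most common one. The hair color values are invented for illustration.

```python
from collections import Counter
from statistics import mode

# Hypothetical categorical column.
hair_colors = ["brown", "black", "brown", "red", "black", "brown"]

# Categories can be counted...
counts = Counter(hair_colors)

# ...and the most frequent category (the mode) can be found,
# but there is no meaningful "average hair color".
most_common = mode(hair_colors)

print(counts)
print(most_common)
```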

Ordinal, Numeric, Quantitative or Measurable

  • Numeric
  • Usually can be counted or measured
  • Can't have a controlled vocabulary
  • Orderable

Discrete Measure

  • Limited number of possible values
  • Whole, non-divisible items
  • Countable but not measurable
  • Usually orderable
  • Dates (1808 vs 1809)
  • Counts of items (5 children, 4 beavers, 3 skittles, 1 keg)
  • Can find a median, average, and sum
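
A quick sketch of the summary statistics available for a discrete measure, using a hypothetical count of children in four households.

```python
from statistics import mean, median

# Hypothetical counts: children per household (whole numbers only).
children = [5, 4, 3, 1]

total = sum(children)        # 13
average = mean(children)     # 3.25
middle = median(children)    # 3.5

print(total, average, middle)
```

Note that the average and median can be fractional even though no single household can have 3.25 children.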

Continuous Measure

  • Numeric
  • Infinite number of possible values
  • Can't have a controlled vocabulary
  • Always orderable
  • Weight, financial cost, distance, time
  • Can find a median or average; a sum is possible but often less meaningful (average weight of people in class vs total weight of people in class)


By mkane

