SO FAR
Coming Up
- Mondays in person, Wednesdays on Zoom
- Wed, March 8: No class meeting; Shirts Powdered Red discussion in Hum 354, 11:30-1
- Zoom stream and recording available
- Final project: Think about the process and methods you enjoy, not the topic
Why all the different assignments???
- I select data for its problems
- Different data has different needs
- Human readable vs too big to read
- General principles, not specific mechanics (so you can ask Prof. Google later)
Prepare
Explore
Analyze
Present
Data work is a feedback loop
- Exploration will lead to questions
- Exploration reveals preparation issues
- Preparation differs depending on the question
- Content model enables certain questions
- The content model can change depending on the question
Abstraction
- Everything has a structure (HTML, data table)
- The same structure can be represented in different ways (CSS, data visualization, fruit baskets)
- Data is a representation, not the thing
- Sections of the data can be represented abstractly
- (File paths, battleship, ranges, facets, filters, parts of text)
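A minimal sketch of addressing sections of a table abstractly, assuming a toy table stored as a list of rows. The data, the column names, and the helper names (cell, row_range, filter_rows, facet) are invented for illustration and do not come from any particular tool.

```python
# Hypothetical toy table; rows and column names are invented for illustration.
rows = [
    {"item": "apple", "type": "fruit", "count": 5},
    {"item": "pear", "type": "fruit", "count": 2},
    {"item": "kale", "type": "vegetable", "count": 3},
    {"item": "plum", "type": "fruit", "count": 7},
]

def cell(table, row_number, column):
    """Battleship-style lookup: one row number (counted from 1) plus one column name."""
    return table[row_number - 1][column]

def row_range(table, first, last):
    """Range: rows first..last, counted from 1 and inclusive."""
    return table[first - 1:last]

def filter_rows(table, column, value):
    """Filter: every row whose column equals the given value."""
    return [row for row in table if row[column] == value]

def facet(table, column):
    """Facet: group rows by the distinct values found in one column."""
    groups = {}
    for row in table:
        groups.setdefault(row[column], []).append(row)
    return groups
```

Each helper names a part of the data without copying it by hand: cell(rows, 2, "item") points at one value, while facet(rows, "type") describes every group at once.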
Data -> Questions -> Data
- Abstraction means stepping back to ask what you're trying to do, not getting lost in the weeds
- Data type determines what you can do with it
- Text vs Numbers, Discrete vs Continuous, Categorical vs Unstructured
Principles of Cleaning
- Keep an untouched original
- Keep a record of your cleaning (export and save steps when done)
- Create a controlled vocabulary in categorical text
- Explore and find outliers: typos or data?
- One column, one purpose
- One row, one purpose
- One cell, one purpose
- Different questions will need different cleaning
- Only you know when you're done!
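A few of the principles above can be sketched in code: keep the original untouched, keep a record of each change, and map variant spellings onto a controlled vocabulary. This is only an illustration under stated assumptions. The rows, the item_type column, the VOCABULARY mapping, and the clean function are all hypothetical.

```python
import copy

# Hypothetical raw rows; the values are invented for illustration.
raw_rows = [
    {"item_type": "Shirt "},
    {"item_type": "shirt"},
    {"item_type": "shrit"},
    {"item_type": "Coat"},
]

# Controlled vocabulary: normalized spelling -> approved term (assumed mapping).
VOCABULARY = {"shirt": "shirt", "shrit": "shirt", "coat": "coat"}

def clean(rows, column, vocabulary):
    """Return cleaned copies of the rows plus a log of every change made."""
    cleaned = copy.deepcopy(rows)  # the original list is never modified
    log = []
    for number, row in enumerate(cleaned, start=1):
        before = row[column]
        key = before.strip().lower()
        # Values missing from the vocabulary are left as-is for a human to review.
        after = vocabulary.get(key, before)
        if after != before:
            row[column] = after
            log.append((number, column, before, after))
    return cleaned, log

cleaned_rows, change_log = clean(raw_rows, "item_type", VOCABULARY)
```

The change log plays the role of the exported cleaning steps: anyone can see which row changed, from what, to what. Whether "shrit" is a typo or real data is still a judgment call made when writing the mapping.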
Intro to Dataviz
Nominal, Textual, Qualitative or Dimensional
- Textual
- Usually categorical
- Usually mutually exclusive
- Can't be quantified
- Can have a controlled vocabulary
- Usually can't be ordered
- Fruit boxes, marital status, item type, hair color
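As a small sketch of what nominal data does allow: the values themselves cannot be averaged, but the rows in each category can be counted. The hair color list below is invented for illustration.

```python
from collections import Counter

# Hypothetical nominal values; invented for illustration.
hair_colors = ["brown", "black", "brown", "red", "blond", "brown"]

# Count how many times each category appears.
counts = Counter(hair_colors)

# The most frequent category and its count.
most_common_value, most_common_count = counts.most_common(1)[0]

# The distinct categories, which is a starting point for a controlled vocabulary.
categories = sorted(counts)
```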
Ordinal, Numeric, Quantitative or Measurable
- Numeric
- Usually can be counted or measured
- Can't have a controlled vocabulary
- Orderable
Discrete Measure
- Limited number of possible values
- Whole, non-divisible items
- Countable but not measurable
- Usually orderable
- Dates (1808 vs 1809)
- Counts of items (5 children, 4 beavers, 3 skittles, 1 keg)
- Can find a median, average, and sum
Continuous Measure
- Numeric
- Infinite number of possible values
- Can't have a controlled vocabulary
- Always orderable
- Weight, financial cost, distance, time
- Can find a median or average; a sum can be computed but is often less meaningful (average weight of people in class vs total weight of people in class)
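A short sketch of the summaries named in the two lists above, using Python's standard statistics module. The numbers are invented for illustration, and the variable names are hypothetical.

```python
from statistics import mean, median

# Hypothetical values; invented for illustration.
children_per_family = [5, 2, 3, 2]       # discrete: whole counts
weights_kg = [61.5, 70.2, 82.0, 58.3]    # continuous: measurements

# Discrete counts: median, average, and sum.
discrete_summary = {
    "median": median(children_per_family),
    "average": mean(children_per_family),
    "sum": sum(children_per_family),
}

# Continuous measurements: median and average.
continuous_summary = {
    "median": median(weights_kg),
    "average": mean(weights_kg),
}
```

Note that the median and average of whole counts need not be whole numbers, even though no single family can have that many children.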
deck
By mkane