SO FAR
Coming Up
Mondays in person, Wednesdays on Zoom
Wed March 8: No class meeting, Hum 354 11:30-1
Shirts Powdered Red
discussion
Zoom stream and recording available
Final project: Think about the process and methods you enjoy, not the topic
Why all the different assignments???
I select data for its problems
Different data has different needs
Human-readable vs too big to read
General principles, not specific mechanics (so you can ask Prof. Google later)
Prepare
Explore
Analyze
Present
Data work is a feedback loop
Exploration will lead to questions
Exploration reveals preparation issues
Preparation differs depending on the question
Content model enables certain questions
The content model can change depending on the question
Abstraction
Everything has a structure (HTML, data table)
The same structure can be represented in different ways (CSS, data visualization, fruit baskets)
Data is a representation, not the thing
Sections of the data can be represented abstractly
(File paths, battleship, ranges, facets, filters, parts of text)
Data -> Questions -> Data
Abstraction means stepping back to ask what you're trying to do, not getting lost in the weeds
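The idea that sections of data can be described abstractly (ranges, filters, facets, battleship-style addresses) can be sketched in a few lines of Python. This is only an illustration: the table, its column names, and its values are made up, not taken from any class dataset.

```python
# Hypothetical table: a list of rows, each row a dict of column -> value.
rows = [
    {"year": 1808, "item": "shirt"},
    {"year": 1809, "item": "coat"},
    {"year": 1810, "item": "shirt"},
]

# Range: select rows by position (here, the first two).
first_two = rows[0:2]

# Filter: select rows that meet a condition.
shirts = [r for r in rows if r["item"] == "shirt"]

# Facet: the distinct values found in one column.
item_facet = sorted({r["item"] for r in rows})

# Battleship-style address: a row index plus a column name picks one cell.
cell = rows[1]["year"]

print(item_facet)  # ['coat', 'shirt']
print(cell)        # 1809
```

Each selection describes a part of the data without copying it out by hand, which is the point of stepping back from the weeds.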
Data type determines what you can do with it
Text vs Numbers, Discrete vs Continuous, Categorical vs Unstructured
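A small sketch of why data type matters, assuming two year values that arrive as text (as they often do when read from a spreadsheet or CSV file). The values are illustrative only.

```python
# Hypothetical values as they might arrive from a file: everything is text.
years_as_text = ["1808", "1809"]

# Text can be joined end to end, but it cannot be added as a number.
joined = years_as_text[0] + years_as_text[1]

# Converting to a numeric type enables arithmetic and comparison.
years = [int(y) for y in years_as_text]
difference = years[1] - years[0]

print(joined)      # 18081809
print(difference)  # 1
```

The same characters support different operations depending on whether they are treated as text or as numbers.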
Principles of Cleaning
Keep an untouched original
Keep a record of your cleaning (export and save steps when done)
Create a controlled vocabulary in categorical text
Explore and find outliers: typos or real data?
One column, one purpose
One row, one purpose
One cell, one purpose
Different questions will need different cleaning
Only you know when you're done!
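Several of the cleaning principles above (untouched original, a record of steps, a controlled vocabulary) can be sketched together. This is a minimal illustration, not the tool used in class: the rows, the column names, the vocabulary tables, and the `clean` helper are all hypothetical.

```python
import copy

# Hypothetical raw rows; values and column names are made up for illustration.
original = [
    {"item": "Shirt ", "color": "red"},
    {"item": "shirt", "color": "Red"},
    {"item": "SHIRTS", "color": "rde"},
]

# Controlled vocabulary: every known variant maps to one approved term.
ITEM_VOCAB = {"shirt": "shirt", "shirts": "shirt"}
COLOR_VOCAB = {"red": "red", "rde": "red"}  # "rde" judged to be a typo

def clean(rows):
    """Return cleaned copies of the rows plus a log of the changes made."""
    cleaned = copy.deepcopy(rows)  # work on a copy; the original stays untouched
    log = []
    for row in cleaned:
        for column, vocab in (("item", ITEM_VOCAB), ("color", COLOR_VOCAB)):
            before = row[column]
            key = before.strip().lower()
            after = vocab.get(key, before)  # unknown values are left alone
            if after != before:
                row[column] = after
                log.append(f"{column}: {before!r} -> {after!r}")
    return cleaned, log

cleaned, log = clean(original)
for entry in log:
    print(entry)
```

Whether "rde" is a typo or real data is a judgment call made by the person cleaning, which is why it is written into the vocabulary table where it can be reviewed later.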
Intro to Dataviz
Nominal, Textual, Qualitative or Dimensional
Textual
Usually categorical
Usually mutually exclusive
Can't be quantified
Can have a controlled vocabulary
Can't usually be ordered
Fruit boxes, marriage status, item type, hair color
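For nominal data like the examples above, the usual summary is a count per category. A minimal sketch, assuming a made-up list of hair-color observations:

```python
from collections import Counter

# Hypothetical hair-color observations.
hair = ["brown", "black", "brown", "red", "blond", "brown"]

# Categories can be counted, and the most frequent one can be found.
counts = Counter(hair)
most_common_value, most_common_count = counts.most_common(1)[0]

print(counts["brown"])    # 3
print(most_common_value)  # brown
```

There is no meaningful average of "brown" and "red"; counting is what the data type allows.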
Ordinal, Numeric, Quantitative or Measurable
Numeric
Usually can be counted or measured
Can't have a controlled vocabulary
Orderable
Discrete Measure
Limited number of possible values
Whole, non-divisible items
Countable but not measurable
Usually orderable
Dates (1808 vs 1809)
Counts of items (5 children, 4 beavers, 3 skittles, 1 keg)
Can find a median, average, and sum
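A short sketch of the three summaries for a discrete measure, using Python's standard `statistics` module. The counts are hypothetical.

```python
import statistics

# Hypothetical counts of children in five families.
children = [5, 2, 3, 0, 2]

total = sum(children)                 # 12
median = statistics.median(children)  # 2
mean = statistics.mean(children)      # 2.4

print(total, median, mean)
```

Note that the mean (2.4 children) is not itself a value any family could have, even though it is a valid summary of the group.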
Continuous Measure
Numeric
Infinite number of possible values
Can't have a controlled vocabulary
Always orderable
Weight, financial cost, distance, time
Can find a median or average; a sum can be computed but is often not meaningful (average weight of people in class vs total weight of people in class)
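A sketch of ordering and averaging a continuous measure, again with the standard `statistics` module. The weights are hypothetical and the unit (kilograms) is an assumption for the example.

```python
import statistics

# Hypothetical weights in kilograms; any value in between is also possible.
weights_kg = [61.5, 72.0, 58.25, 80.0]

ordered = sorted(weights_kg)            # continuous values can always be ordered
mean = statistics.mean(weights_kg)      # 67.9375
median = statistics.median(weights_kg)  # 66.75

print(ordered)
print(mean, median)
```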