- Not just the first step; it is iterative
- Encompasses a truly breathtaking number of tasks
- A comprehensive overview of the concepts and tools could easily fill two or three graduate courses.
- Data viz and statistical modeling (e.g. hypothesis testing) get the most focus in teaching but represent only 20% of the work
- Students are not prepared for original research or real world experience
Source: xkcd
“[Of statistics/data analysis] Total disconnect between what people need to actually understand data and what was being taught.” - Hadley Wickham
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
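The three rules above can be illustrated with a small reshaping sketch. It assumes pandas is available; the table, its column names (`parcel_id`, `delinquent_tax`) and its values are made up for illustration. The messy version stores one variable (year) in the column headers, and `melt` moves it into its own column.

```python
import pandas as pd

# Hypothetical messy table: one row per parcel, one column per year.
# The year is a variable, but here it is hidden in the column names.
messy = pd.DataFrame({
    "parcel_id": ["A1", "B2"],
    "2019": [100.0, 0.0],
    "2020": [150.0, 25.0],
})

# Melt the year columns into a single "year" variable so that
# each column is one variable and each row is one observation.
tidy = messy.melt(
    id_vars=["parcel_id"],
    var_name="year",
    value_name="delinquent_tax",
)
tidy["year"] = tidy["year"].astype(int)
tidy = tidy.sort_values(["parcel_id", "year"]).reset_index(drop=True)

print(tidy)
```

The result has four rows, one per (parcel, year) observation, and three columns.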
Hadley Wickham - "Like families, tidy datasets are all alike but every messy dataset is messy in its own way."
Taken from Hadley Wickham's Tidy Data
Aggregate: collapsing multiple values into a single value (e.g., by summing or taking means), for example total current tax delinquency.
Transform and aggregate: for example, total current delinquency for owners with an LLC. These techniques are often combined or chained together.
^ Another example of multiple values in one column
Taken from Hadley Wickham's Tidy Data
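The aggregate and transform-then-aggregate operations described earlier can be sketched as follows. This assumes pandas; the owners, parcels and amounts are invented, and treating "name ends in LLC" as the test for an LLC owner is a simplifying assumption.

```python
import pandas as pd

# Hypothetical tidy table: one row per (owner, parcel).
taxes = pd.DataFrame({
    "owner": ["Acme LLC", "Jane Doe", "Acme LLC", "Oak Holdings LLC"],
    "parcel_id": ["A1", "B2", "C3", "D4"],
    "current_delinquency": [1200.0, 300.0, 800.0, 500.0],
})

# Aggregate: collapse many values into a single value.
total = taxes["current_delinquency"].sum()

# Transform: derive a new variable from an existing one
# (assumption: an owner is an LLC if the name ends in "LLC").
taxes["is_llc"] = taxes["owner"].str.upper().str.endswith("LLC")

# Chain the steps: keep LLC owners, then sum delinquency per owner.
llc_totals = (
    taxes[taxes["is_llc"]]
    .groupby("owner", as_index=False)["current_delinquency"]
    .sum()
)

print(total)
print(llc_totals)
```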
Joins
Left Join
Multiple keys
Concat
- There are many kinds of joins. I will discuss the two most common: inner and left.
- INNER JOIN
An inner join only returns those records that have “matches” in both tables.
- LEFT JOIN
A left join returns all the records in the “left” table (T1), whether or not they have a match in the right table.
- If the table you are merging has multiple rows per ID, all of them will be joined.
- This is one reason why data should be tidy.
- Excel cannot do many-to-many joins
- Example of Join on Multiple Columns
- e.g. Owner Occupancy per year.
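A minimal sketch of the joins described above, assuming pandas and using made-up tables (`parcels`, `violations`, `occupancy`, `taxes` are all hypothetical). It shows an inner join, a left join where one ID has multiple rows, and a join on multiple key columns.

```python
import pandas as pd

parcels = pd.DataFrame({
    "parcel_id": ["A1", "B2", "C3"],
    "address": ["1 Main St", "2 Oak Ave", "3 Elm Rd"],
})
# A1 has two rows; C3 has no violations; Z9 has no matching parcel.
violations = pd.DataFrame({
    "parcel_id": ["A1", "A1", "B2", "Z9"],
    "violation": ["weeds", "trash", "roof", "fence"],
})

# Inner join: only parcels with a match in both tables.
inner = parcels.merge(violations, on="parcel_id", how="inner")

# Left join: every parcel is kept; unmatched ones get missing values.
left = parcels.merge(violations, on="parcel_id", how="left")

# Join on multiple columns, e.g. owner occupancy per year.
taxes = pd.DataFrame({
    "parcel_id": ["A1", "A1", "B2", "B2"],
    "year": [2019, 2020, 2019, 2020],
    "delinquent_tax": [100.0, 150.0, 0.0, 25.0],
})
occupancy = pd.DataFrame({
    "parcel_id": ["A1", "A1", "B2"],
    "year": [2019, 2020, 2019],
    "owner_occupied": [True, False, True],
})
by_year = taxes.merge(occupancy, on=["parcel_id", "year"], how="left")

print(len(inner), len(left), len(by_year))
```

Here `inner` has 3 rows and `left` has 4: parcel A1 appears twice in both because it matches two violation rows, and C3 appears only in the left join.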
Get coordinates for street addresses
Identify purpose
Choose geocoder (Texas A&M or Geocodio fail on datasets larger than 2500 records)
Preprocess data
Geocode
Verify output
Import to viz tool
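The preprocess, geocode and verify steps above might look roughly like this. It is a sketch only: `fake_geocoder` is a hypothetical stand-in that calls no real service, the batch size of 2500 is taken from the note about large datasets, and the bounding box is an invented study area.

```python
import re
from typing import Iterable, Iterator

BATCH_SIZE = 2500  # assumption, based on the "> 2500" note above


def preprocess(addresses: Iterable[str]) -> list[str]:
    """Collapse whitespace, upper-case, and drop blanks and duplicates."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for raw in addresses:
        addr = re.sub(r"\s+", " ", raw).strip().upper()
        if addr and addr not in seen:
            seen.add(addr)
            cleaned.append(addr)
    return cleaned


def batches(items: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive chunks no longer than `size`."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_geocoder(batch: list[str]) -> dict[str, tuple[float, float]]:
    """Hypothetical stand-in for a real geocoding service."""
    return {addr: (29.76, -95.37) for addr in batch}


def in_bbox(lat: float, lon: float, bbox: tuple[float, float, float, float]) -> bool:
    """Verify step: is the point inside (south, west, north, east)?"""
    south, west, north, east = bbox
    return south <= lat <= north and west <= lon <= east


addresses = preprocess([" 1 main st ", "1  Main St", "", "2 Oak Ave"])
results: dict[str, tuple[float, float]] = {}
for batch in batches(addresses):
    results.update(fake_geocoder(batch))

STUDY_AREA_BBOX = (29.5, -95.8, 30.1, -95.0)  # hypothetical
suspect = [a for a, (lat, lon) in results.items()
           if not in_bbox(lat, lon, STUDY_AREA_BBOX)]
print(addresses, suspect)
```

Deduplicating before geocoding reduces the number of requests, and checking results against a bounding box is a cheap way to flag addresses that were matched to the wrong place.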
"Allow you to combine information from different tables by using spatial relationships as the join key."
"Much of what we think of as “standard GIS analysis” can be expressed as spatial joins" - PostGIS Documentation
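To make the idea of a spatial relationship as the join key concrete, here is a pure-Python sketch of a point-in-polygon left join. The neighborhoods, parcels and coordinates are invented, and the ray-casting test is a simplified illustration; in practice a tool such as PostGIS or a GIS library would do this work.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical polygons (lists of vertices) and points.
neighborhoods = {
    "North": [(0, 5), (10, 5), (10, 10), (0, 10)],
    "South": [(0, 0), (10, 0), (10, 5), (0, 5)],
}
properties = [
    {"parcel_id": "A1", "x": 2.0, "y": 7.0},
    {"parcel_id": "B2", "x": 8.0, "y": 1.0},
    {"parcel_id": "C3", "x": 20.0, "y": 20.0},
]


def spatial_left_join(points, polygons):
    """Keep every point; attach the first polygon that contains it, else None."""
    joined = []
    for p in points:
        match = next(
            (name for name, poly in polygons.items()
             if point_in_polygon(p["x"], p["y"], poly)),
            None,
        )
        joined.append({**p, "neighborhood": match})
    return joined


joined = spatial_left_join(properties, neighborhoods)
print(joined)
```

Instead of matching on an equal ID, each property is matched to the neighborhood that contains it; C3 falls outside both polygons and is kept with no match, as in a left join.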
(Besides the feeling of superiority it gives you.)
You only need one: "Python as Super Glue for the Modern Scientific Workflow" Data analysis requires you to port data to and from different programs and tools. Python can (almost) do it all.
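As a small illustration of the "glue" idea, the sketch below moves the same data through three formats (CSV text, a SQL database, JSON) using only the standard library. The table and column names are made up for the example.

```python
import csv
import io
import json
import sqlite3

# Hypothetical CSV export from another tool.
csv_text = "parcel_id,delinquent_tax\nA1,100.0\nB2,25.5\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Load the rows into an in-memory SQLite database and query them with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxes (parcel_id TEXT, delinquent_tax REAL)")
conn.executemany(
    "INSERT INTO taxes VALUES (:parcel_id, :delinquent_tax)", rows
)
total = conn.execute("SELECT SUM(delinquent_tax) FROM taxes").fetchone()[0]
conn.close()

# Hand the result to a web or viz tool as JSON.
payload = json.dumps({"total_delinquent_tax": total})
print(payload)
```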
Excel Hell: "A place of torment and misery caused by using Excel as your primary data manipulation tool."