(1) Please make and display your name cards!

 

(2) Make sure you're on Slack:

https://ledeprogram.slack.com/

 

(3) Keep PollEverywhere open

https://pollev.com/dmil

 

If you have any questions, just raise your hand 🖐!

Welcome! Let's get rolling!

Reporting II  - Day 5

  • Merging Data Sources
  • Exploratory Data Visualization

Announcements

  • Deadline Policy Reminder
     
  • Project Deliverables [tonight]
  • Upcoming Project Deliverables
    • [Part 2] - Reporting Check-In (around 9/16)
    • [Part 3] - Final Draft (around 9/21)
       
  • Reminder about emojis that require a response or rewrite... (⛔️, ❓, 🤯)

Project Timeline

  • Upcoming Project Deliverables
    • [Part 2] 10/16 Reporting Check-In / Data Viz
    • Aim for Rough Draft around 10/18-10/19
    • [Part 3] - Final Draft (around 10/21) 

<--- Rough Draft ---->

<--- Final Draft ---->

<- Respond to Edits ->

<--- Reporting / Data Analysis ---->

Pitch Approval

Interviews

  • Interviews are an important part of data reporting
    • Domain experts
    • Experts in your dataset
    • People impacted by the issue
    • etc...
  • They help you know where to look

Pitch

Report

Produce

Learning Objectives:

Today

  • Lecture: Merging Data Sources
     
  • Assignment 3
    • Discussion
    • Guest Speaker (Derek Willis of ProPublica)
       
  • Lecture: Exploratory Data Visualization
     
  • Project Time
    • some in-class time to get started on your project work

Merging Data

 

 

1. Working with relational data from one source

2. Merging data from different sources

What Is Relational Data?

 

  1. Data is stored in tables 
    (rows & columns)

     
  2. Each row is a record
     
  3. Records are identified by primary keys
     
  4. Tables are related, or connected to one another, by those primary keys

raw-polls.csv

An Example Database Schema

Poll

  • poll_id
  • start_date
  • end_date
  • pollster_id 🔑
  • partisan 
  • ...

 

 

 

Pollster

  • pollster_id  🔑
  • website_url

 

Question

  • question_id  
  • poll_id 🔑
  • race_id 🔑

Race

  • race_id
  • election_date
  • location

Answer

  • answer_id
  • candidate_id 🔑
  • percent

Candidate

  • candidate_id
  • name
  • party
  • gender
  • birthday...

Questions Answers

  • answer_id 🔑
  • question_id 🔑

Types of Relations

 

one-to-one

one-to-many

many-to-many

 

One-To-One

Poll

  • poll_id
  • start_date
  • end_date
  • pollster_id 🔑
  • partisan 
  • ...

 

 

 

Pollster

  • pollster_id  🔑
  • website_url

 

Question

  • question_id  
  • poll_id 🔑
  • race_id 🔑

Race

  • race_id
  • election_date
  • location

Answer

  • answer_id
  • candidate_id 🔑
  • percent

Candidate

  • candidate_id
  • name
  • party
  • biography_id🔑

Questions Answers

  • answer_id 🔑
  • question_id 🔑

Biography

  • biography_id
  • candidate_id🔑
  • gender
  • birthday

One-To-Many

Poll

  • poll_id
  • start_date
  • end_date
  • pollster_id 🔑
  • partisan 
  • ...

 

 

 

Pollster

  • pollster_id  🔑
  • website_url

 

Question

  • question_id  
  • poll_id 🔑
  • race_id 🔑

Race

  • race_id
  • election_date
  • location

Answer

  • answer_id
  • candidate_id 🔑
  • percent

Candidate

  • candidate_id
  • name
  • party
  • biography_id🔑

Questions Answers

  • answer_id 🔑
  • question_id 🔑

Biography

  • biography_id
  • candidate_id🔑
  • gender
  • birthday

Poll

  • poll_id
  • start_date
  • end_date
  • pollster_id 🔑
  • partisan 
  • ...

 

 

 

Pollster

  • pollster_id  🔑
  • website_url

 

Question

  • question_id  
  • poll_id 🔑
  • race_id 🔑

Race

  • race_id
  • election_date
  • location

Answer

  • answer_id
  • candidate_id 🔑
  • percent

Candidate

  • candidate_id
  • name
  • party
  • biography_id🔑

Questions Answers

  • answer_id 🔑
  • question_id 🔑

Biography

  • biography_id
  • candidate_id🔑
  • gender
  • birthday

One-To-Many

Poll

  • poll_id
  • start_date
  • end_date
  • pollster_id 🔑
  • partisan 
  • ...

 

 

 

Pollster

  • pollster_id  🔑
  • website_url

 

Question

  • question_id  
  • poll_id 🔑
  • race_id 🔑

Race

  • race_id
  • election_date
  • location

Answer

  • answer_id
  • candidate_id 🔑
  • percent

Candidate

  • candidate_id
  • name
  • party
  • biography_id🔑

Questions Answers

  • answer_id 🔑
  • question_id 🔑

Biography

  • biography_id
  • candidate_id🔑
  • gender
  • birthday

One-To-Many

(Through)

Poll

  • poll_id
  • start_date
  • end_date
  • pollster_id 🔑
  • partisan 
  • ...

 

 

 

Pollster

  • pollster_id  🔑
  • website_url

 

Question

  • question_id  
  • poll_id 🔑
  • race_id 🔑

Race

  • race_id
  • election_date
  • location

Answer

  • answer_id
  • candidate_id 🔑
  • percent

Candidate

  • candidate_id
  • name
  • party
  • biography_id🔑

Questions Answers

  • answer_id 🔑
  • question_id 🔑

Biography

  • biography_id
  • candidate_id🔑
  • gender
  • birthday

Many-To-Many

Referencing Relations in Google Sheets / Excel...

 

Documentation:

https://support.google.com/docs/answer/3093318?hl=en

 

VLOOKUP(search_key, range, index, [is_sorted])

 

Bonus Learning Objective

- [ ✔️ ] Reading technical documentation

Bonus Learning Objective

- [ ✔️ ] Reading Technical Blog Posts (Like StackOverflow)

 

(╯°□°)╯︵ ┻━┻

Joins

 

 

  • Left Join
  • Right Join
  • Inner Join
  • Outer Join

Left Outer Join

https://docs.microsoft.com/en-us/power-query/merge-queries-left-outer

Excel / Google Sheets

=VLOOKUP(B1, 'Right Table!B2:C4, 2 ,FALSE)"

SQL

SELECT * FROM left_table lt LEFT JOIN right_table rt ON lt.CountryID = rt.ID

Python / Pandas

left_table.merge(right_table, how='left', left_on='CountryID', right_on='ID')

R / TidyR

left_table %>% inner_join(right_table, by = c("CountryId" = "Country")

 

Left Outer Join

Right Outer Join

https://docs.microsoft.com/en-us/power-query/merge-queries-right-outer

Inner Join

 

 

https://docs.microsoft.com/en-us/power-query/merge-queries-inner

Full Outer Join

 

https://docs.microsoft.com/en-us/power-query/merge-queries-full-outer

Merging Data

Standardized Identifers

Within a Dataset

  • race_id
  • pollster_id
  • etc...

 

 

Across Datasets

Did Democratic House Members Spend More Than Republican House Members On Office Expenses in 2021 Q2?

 

Cities Spend Millions On Police Misconduct Every Year. Here’s Why It’s So Difficult to Hold Departments Accountable.
 

Story

https://fivethirtyeight.com/features/police-misconduct-costs-cities-millions-every-year-but-thats-where-the-accountability-ends/

 

 

Data + Caveats ⚠️

https://github.com/fivethirtyeight/police-settlements/

Non-Standard Identifiers

Atlanta

From Laura Bronner's slides about quant editing

What the heck is Atlanta?

Standardized Identifers

Within a Dataset

  • race_id
  • pollster_id
  • etc...

 

 

Across Datasets

Other things to think about when merging data...

  • What is the "universe" of each dataset?
     
  • Are there null values after merging?
     
  • Did I drop any observations? (always count)

tidylog (R), pandas-log (python)

This is NOT

Homework Review

ft. Derek Willis (formerly @ ProPublica)

Exploratory Data Visualization 📊

 





  • There are lots of observations
     
  • You want to see the "shape" of a variable
     
  • There are multiple variables you want to compare and a pivot table just isn't doing the job...

When to do exploratory viz?

When it's worth it...for me, almost always.

Types of Data

Types of Plots

Chart Taxonomy

Type of Data

Which one do I need?

Type of Plot

Inspect Columns

  • Do I know what each column means?
    • write down the unanswered questions
  • How was it collected?
  • What kinds of errors should I be looking out for? 👀
  • Are there any blanks? Null values? Repeated values?

Inspect Columns

  • What are the types of data in each column?
    • Categorical Columns
      • What are the categories?
      • Could they have been grouped differently?
         
    • Continuous Columns
      • range (min / max)
      • distribution (plot)
      • central tendency (mean / median)
      • variability (stdev / IQR)

Overall Description

  • Categorical variables
    • frequencies per category
       
  • Continuous variables
    • Summary stas
    • Distributions

Asking Questions

  • Does the bias vary by election type?
  • Are polls more or less biased over time?
  • Are polls more biased in some years than others?

Plotting 1 column

  • What are the types of data in each column?
    • Categorical Columns
      • pivot table w/ count
      • bar chart
    • Continuous Columns
      • dot plot
      • histogram
      • box and whiskers

Plotting 2 columns

  • What are the types of data in each column?
    • Categorical vs Categorical
      • pivot table
      • XY heatmap
         
    • Continuous vs Continuous
      • scatter plot (with trendline?)
         
    • Continuous vs Categorical
      • ​categorical variable on axis
      • categorical variable on facet

Plotting many columns

  • X axis
  • Y axis
  • color
  • fill
  • size
  • facet
  • facet_x, facet_y
  • shape
  • opacity

Do you need a model?

Is it regression time?

  • Continuous variables

 

 

  • Categorical variables

?

Reusing Visual Encodings

  • X axis
  • Y axis
  • color
  • fill
  • size
  • facet
    • facet_x
    • facet_y
  • shape
  • opacity

Annotations

  • Part to whole
  • Split into buckets

 

 

Annotate with text labels OR

 

  • color
  • fill
  • size
  • facet
  • shape
  • opacity
  • etc...

Inspect Columns

  • What are the types of data in each column?
    • Categorical Columns
      • "bar" in a bar chart
      • "facets" in a small multiples chart
      • colors
      •  
      • range (min / max)
      • distribution (plot)
      • central tendency (mean / median)
      • variability (stdev / IQR)
  • GeoSpatial data
    • does distance / size matter? 

Project Time!

Anything you want to run by Dhrumil & Aseem?

Homework

 

  1. The Wall Street Journal Guide to Information Graphics
    
  2. The Quartz guide to bad data
    
  3. Financial Times Visual Vocabulary
    https://ft.com/vocabulary
  4. Very short Datawrapper assignment (Assignment 3.5❓👀❓👀)

Project Work [Tonight-ish]

Project Reporting / Data Analysis

Reporting II  - Day 5

By Dhrumil Mehta

Reporting II  - Day 5

  • 280