Pitfalls in data
and how to avoid them
Maarten Lambrechts
Pitfalls in
metadata
statistics
visualisation
PITFALLS
Metadata
"... in Brussels, nearly 62 percent is of foreign origin"
Metadata = data about the data
Collected by whom?
Collected how?
Collected why?
Collected when?
Definitions?
Metadata
Determine validity of conclusions
Describe limitations on the use of the data
Determine comparability
Without correct units, numbers are meaningless
PITFALLS
Statistics
Percentages & percentage points
"2.1% of mask wearers tested positive for COVID-19, versus 1.8% of non-mask-wearers. The difference is only 0.3%."
This is a difference of 0.3 percentage points
Or:
(1.8 - 2.1)/2.1 ≈ -14%: a 14% decrease in positive tests
% - % = percentage point
(new - old)/old = % change
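The two rules above can be checked in a few lines of Python. This is a minimal sketch using the figures from the quote. The function names are made up for illustration, and treating 2.1% as the "old" value and 1.8% as the "new" one is an assumption about the direction of the comparison.

```python
def percentage_point_diff(old_pct, new_pct):
    """Difference between two percentages, in percentage points."""
    return new_pct - old_pct


def percent_change(old_pct, new_pct):
    """Relative change from old to new, in percent."""
    return (new_pct - old_pct) / old_pct * 100


# Figures from the quote; the old/new direction is an assumption.
pp_diff = percentage_point_diff(2.1, 1.8)
rel_change = percent_change(2.1, 1.8)

print(f"{pp_diff:+.1f} percentage points")  # -0.3 percentage points
print(f"{rel_change:+.0f}%")                # -14%
```

Same two numbers, two very different headlines: "0.3" and "14".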
That's not normal
Top 5 EU electricity consumers
Country | Electricity consumption (GWh) |
---|---|
1. Germany | 517,377 |
2. France | 442,372 |
3. UK | 303,903 |
4. Italy | 286,027 |
5. Spain | 232,515 |
Congrats: a population ranking!
Please divide by the number of people
Top 5 EU electricity consumers
Country | Electricity consumption (MWh/cap) |
---|---|
1. Iceland | 49.7 |
2. Norway | 21.5 |
3. Finland | 14.7 |
4. Sweden | 12.6 |
5. Luxembourg | 10.6 |
Especially relevant for maps
Make numbers comparable (= normalising):
per capita, per surface area, ...
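A rough sketch of what normalising does to a ranking. The two regions and all their figures are hypothetical, chosen only to show the effect; they are not real statistics. The helper name `mwh_per_capita` is also made up for this example.

```python
# Hypothetical figures, for illustration only (not real statistics).
regions = {
    "Region A": {"consumption_gwh": 500_000, "population": 80_000_000},
    "Region B": {"consumption_gwh": 20_000, "population": 400_000},
}


def mwh_per_capita(consumption_gwh, population):
    """Convert total consumption in GWh to MWh per person."""
    return consumption_gwh * 1_000 / population  # 1 GWh = 1,000 MWh


per_capita = {name: mwh_per_capita(**data) for name, data in regions.items()}

# Ranking by raw totals versus ranking by normalised values.
ranking_total = sorted(
    regions, key=lambda name: regions[name]["consumption_gwh"], reverse=True
)
ranking_per_capita = sorted(per_capita, key=per_capita.get, reverse=True)

print(ranking_total)       # ['Region A', 'Region B']
print(ranking_per_capita)  # ['Region B', 'Region A']
```

The big region leads on totals, the small one leads per person: the ranking flips once population is divided out.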
Mean vs median
How many soulmates do you think a person can have?
The median?
Rank the data: the median is the middle value
The median is less sensitive to outliers. Use it!
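A small sketch of how one outlier moves the mean but not the median. The answers below are invented for illustration (they are not survey results), using Python's standard `statistics` module.

```python
from statistics import mean, median

# Invented answers to the soulmate question; one respondent says 1000.
answers = [1, 1, 1, 2, 2, 3, 1000]

avg = mean(answers)
mid = median(answers)

print(f"mean:   {avg:.1f}")  # mean:   144.3
print(f"median: {mid}")      # median: 2
```

Six out of seven people answered 3 or fewer, yet the mean suggests about 144. The median stays at 2.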
Distributions
Summary statistics rarely describe somebody's lived experience and never ring true for the whole population
Data are much more than averages
Use distributions whenever you can
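To see why an average alone can mislead, here is a minimal sketch with two invented groups that share the same mean but look nothing alike. The values and the helper `binned_counts` are assumptions made for this example.

```python
from collections import Counter
from statistics import mean

# Two invented groups with identical means.
group_a = [50, 50, 50, 50, 50, 50]
group_b = [10, 10, 10, 90, 90, 90]


def binned_counts(values, bin_width=20):
    """Count values per bin; keys are the lower edge of each bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return {start: counts[start] for start in sorted(counts)}


print(mean(group_a), mean(group_b))  # 50 50
print(binned_counts(group_a))        # {40: 6}
print(binned_counts(group_b))        # {0: 3, 80: 3}
```

Nobody in the second group is anywhere near the "average" of 50. Only the distribution shows that.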
Correlation vs causality
Correlation. Is. Not. Causation.
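One common way a correlation appears without causation is a hidden third factor. The sketch below simulates that situation with made-up data: `z` drives both `x` and `y`, while `x` and `y` never influence each other. All names and parameters are hypothetical.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated data: z is the hidden factor behind both x and y.
z = [rng.uniform(0, 30) for _ in range(1000)]
x = [2 * zi + rng.gauss(0, 5) for zi in z]
y = [zi + rng.gauss(0, 5) for zi in z]

r_xy = pearson(x, y)
print(f"correlation between x and y: {r_xy:.2f}")
```

The correlation between `x` and `y` comes out strongly positive, even though by construction neither one causes the other.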
Confidence intervals
"The margin of error is 3,2 percent."
Uncertainty is inherent to survey results. Keep the margins of error in mind
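For a feel of where such a margin comes from, here is a minimal sketch of the textbook approximation for a survey proportion. It assumes a simple random sample, a 95% confidence level and the normal approximation; real polls often use more elaborate methods. The survey size and result are hypothetical.

```python
from math import sqrt


def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample and the normal approximation;
    z=1.96 corresponds to a 95% confidence level.
    """
    return z * sqrt(p * (1 - p) / n)


# Hypothetical survey: 1,000 respondents, 50% give a certain answer.
moe = margin_of_error(p=0.5, n=1000)
low, high = 0.5 - moe, 0.5 + moe

print(f"50% ± {moe * 100:.1f} points, so between {low:.1%} and {high:.1%}")
```

With about a thousand respondents, a reported 50% is compatible with anything from roughly 47% to 53%. A one-point shift between two such polls is not news.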
Big & small chances
People who ate 76 grams of red and processed meat per day had a 20% higher chance of developing colorectal cancer compared to others, who ate about 21 grams a day.
Of 10,000 people in the study who ate 21 grams of red or processed meat each day, 40 developed colorectal cancer. Among those who ate 76 grams per day, 48 did so.
21 grams => 40/10,000 = 0.4%
76 grams => 48/10,000 = 0.48%
+20%, or +0.08 percentage points
+20% of a small chance is still a small chance
The researchers said B.1.1.7 led to 227 deaths in a sample of 54,906 patients. That compares with 141 deaths in roughly the same number of patients who were infected with other strains.
Earlier strains: 141/54,906 = 0.26%
UK strain: 227/54,906 = 0.41%
But still 86 more deaths
But relatively small differences can be meaningful
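Both examples above can be run through the same arithmetic. This is a sketch with a made-up helper, `risk_comparison`; for the strain example it assumes both groups contain exactly 54,906 patients, although the quote only says "roughly the same number".

```python
def risk_comparison(cases_a, cases_b, group_size):
    """Compare two equally sized groups: baseline a versus group b."""
    risk_a = cases_a / group_size
    risk_b = cases_b / group_size
    return {
        "risk_a_pct": risk_a * 100,
        "risk_b_pct": risk_b * 100,
        "relative_change_pct": (risk_b - risk_a) / risk_a * 100,
        "difference_pp": (risk_b - risk_a) * 100,
        "extra_cases": cases_b - cases_a,
    }


meat = risk_comparison(40, 48, 10_000)
# Assumption: both strain groups are exactly 54,906 patients.
strain = risk_comparison(141, 227, 54_906)

for name, result in [("meat", meat), ("strain", strain)]:
    print(
        f"{name}: {result['relative_change_pct']:+.0f}% relative, "
        f"{result['difference_pp']:+.2f} percentage points, "
        f"{result['extra_cases']} extra cases"
    )
```

Reporting the relative change, the percentage-point difference and the absolute count side by side lets readers judge for themselves how big the effect is.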
Apples & oranges
Compare
regions to regions
countries to countries
apples to apples
oranges to oranges
Exponential growth
Exponential ≠ skyrocketing
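"Exponential" only means growing by a constant percentage per period; how fast that is depends entirely on the rate. A minimal sketch with two hypothetical growth rates (1% and 30% per period), both equally exponential:

```python
from math import log


def grow(start, rate, periods):
    """Value after compounding at a constant rate per period."""
    return start * (1 + rate) ** periods


def doubling_time(rate):
    """Number of periods needed to double at a constant growth rate."""
    return log(2) / log(1 + rate)


# Hypothetical rates: both series are exponential, only the rate differs.
slow = [round(grow(100, 0.01, t)) for t in range(0, 31, 10)]
fast = [round(grow(100, 0.30, t)) for t in range(0, 31, 10)]

print(slow)  # [100, 110, 122, 135]
print(fast)
print(f"doubling time at 1%:  {doubling_time(0.01):.1f} periods")
print(f"doubling time at 30%: {doubling_time(0.30):.1f} periods")
```

At 1% per period the series needs almost 70 periods to double. It is exponential, but it is not skyrocketing.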
Percent & percentage points
That's not normal
Mean vs median
Distributions
Correlation vs causality
Confidence intervals
Big & small chances
Apples and oranges
Exponential growth
PITFALLS
Visualisation
Keep the pies for dessert
Don't cut bars
Respect proportions
Don't cut time axes
Compare apples to apples
Scale circles by surface area
Don't do 3D
Double the axes, double the mischief
Every map is a lie
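On scaling circles by surface area: a circle's area grows with the square of its radius, so the radius should follow the square root of the value. A minimal sketch, with a made-up helper name and made-up values:

```python
from math import pi, sqrt


def radius_for_value(value, max_value, max_radius):
    """Radius such that the circle's AREA is proportional to value."""
    return max_radius * sqrt(value / max_value)


# Made-up values: one data point is 4 times the other.
r_big = radius_for_value(400, 400, 40.0)
r_small = radius_for_value(100, 400, 40.0)
area_ratio = (pi * r_big ** 2) / (pi * r_small ** 2)

print(r_big, r_small)        # 40.0 20.0
print(round(area_ratio, 6))  # 4.0
```

A value four times as large gets a radius twice as large. Scaling the radius directly by the value would make the circle look sixteen times bigger.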
Keep the pies for dessert
Don't cut bars
Respect proportions
Don't cut time axes
Compare apples to apples
Scale circles by surface area
Don't do 3D
Avoid double axes
Every map is a lie
Thanks!
slides.com/maartenzam/pitfalls-mediahuis