Introduction to 

Statistics

Javier Sajuria
UCL Political Science

Session 1
25 January 2014

Agenda

  1. Administrative information
  2. Why statistics?
  3. What is statistics?
  4. Statistical Terminology
  5. Types of Variables
  6. Relationships and Hypothesis testing
  7. Sampling
  8. Maths and statistical notation
  9. Descriptive statistics
  10. Frequency distributions
  11. Measures of central tendency
  12. Measures of dispersion


Administrative information


Course tutor:
Javier Sajuria
j.sajuria@ucl.ac.uk

  • The course will be taught in 4-hours session, for four Saturdays since 25/1.
  • Each session consists in 2 hours of lecture and 2 hours of seminars (flexible)
  • Three different rooms:
    • 25/1 - Malet Street Building Room 457
    • 1/2 - Malet Street Building Room 536
    • 8/2 & 15/2 Malet Street Building 422

Readings


The readings are not required, but STRONGLY recommended. You should choose only ONE of these:

  • Agresti, Alan, and Barbara Finlay. 2009. Statistical Methods for the Social Sciences. New Jersey: Pearson Education. International edition, 4th (or 3rd) ed. 
  • Kellstedt, Paul M., and Guy D. Whitten.  2013.  The Fundamentals of Political Science Research.  Cambridge:  Cambridge University Press.

For the seminars, we will use:

  • Acock, Alan C. 2010. A Gentle Introduction to Stata. College Station, Texas: Stata Press.  3rd ed.  

Readings


  • People might be afraid of mathematical notations, and the readings can be a bit heavy on those. Hence, don't focus on the equations!  

  • It's ok if you don't understand the readings the first time. It's not ok if you continue not understanding after you finished this course, at least in part.

  • If you don't have time for all of them, prioritise Agresti & Finlay or Kelstedt & Whitten

practical seminars

  • We will use Stata, mainly for pedagogical reasons. If someone is interested in other packages (e.g. R), please let me know, so I can give you some guidance.

  • The datasets are hosted  in...

  • We will try to cover a lot of ground in four weeks, so it's extremely important to practice on your own.

  • Don't be afraid of Stata! It doesn't bite! It can a be a bit complicated at the beginning, just like some people ;)

Why Statistics?

  • It's one of the most widely used methods for analysing (quantitative) data

  • It will be useful if you are planning to use statistics for your own research.

  • If you are not planning to use stats, you will most certainly find yourself reading papers using stats, and you need to understand and evaluate what are they doing

  • It's fun! (Not joking). But requires some hard work.

Why statistics?

  • Some people are afraid of learning stats, but you shouldn't. This is just like learning a new language, but sometimes even simpler.

  • Don't focus on the equations. My job is to help you navigate though them.

  • I can't do the job for you. I'm here to facilitate your own learning process, not to learn for you. It doesn't matter how many hours I stand in front of the room, you need to read, work and ask questions (!)

Why statistics?

  • Do I need Maths?
    • Some
    • If you feel unsure, visit this website (particularly sections 1-4)

  • As with everything related to maths, repetition is key.

what is statistics?

Three main facets:

  1. Design
    1. How do we collect data?

  2. Description
    1. Describing the data
    2. Summarising the data

  3. Inference
    1. Making predictions to a wider population or the future

statistical terminology


Population:

The entire possible set of elements we wish study. That can be people, states, organisations, regions, etc.

Sample:

The subset of elements we actually study

statistical terminology


Parameter:

A numerical summary of the population
We don't usually know population parameters.

Statistic(s):

A numerical summary of our sample

This summary usually contains two buts of information:

  • Measures of central tendency
  • Measures of variability


statistical terminology


Variable:

Anything we can measure about the subjects on our sample


Variables vary (d'oh!). That can be across the sample, and/or across time.

Statistical terminology

Dependent Variable:

  • The variable to be explained, described, understood.
  • Is often referred as the outcome variable, and denoted with the letter Y
  • It should be dependent upon something, but must not affect the independent variables
  • If the dependent variable does not vary (i.e. remains constant), then we cannot explain the effect of other variables on it

statistical terminology

Independent Variable:

  • Should be independent from  the dependent variable, but not necessarily from other independent variables

  • Presumed as the determinant of an effect or an outcome

  • Usually denoted as X, explanatory or predictor variable

Types of variables

Quantitative and Qualitative/Categorical Variables:

  • Quantitative variables occur naturally in a numerical way
    • We usualy look it in terms of ranges from low to high, which also correspond to high and low values in reality
    • e.g. age, income, years of education, GDP,  etc.

  • Categorical variables might be expressed in numbers, but they do not mean any quantities, just codes
    • e.g. gender, colours, marital status, etc.

Types of variables

Discrete and Continuous variables:

  • Discrete variables are much like categorical variables
  • They are usually counts, which can be expressed in integer numbers (e.g. number of conflicts, count of people in a meeting, etc.)

  • Continuous variables have a theoretical infinite number of outcomes(e.g. height and weight)

Types of variables

Nominal Variables:

  • Values that communicate differences between subjects on the characteristic being measured
  • They refer to discrete set of categories to distinguish units of analysis --> categorical variables.
  • e.g. race, gender, parti affiliation
  • The labels need to be exclusive.
  • When they are dichotomous, they are often called dummy or binary variables.

Types of variables

Ordinal Variables:

  • They communicate relative differences between subjects, in an ordered way (e.g. more or less). Usually related to qualities
  • The values do not occur naturally, and they do not reflect quantity
  • Examples:
    • Educational level (not years)
    • Strongly agree - Agree - Neither Agree nor Disagree - Disagree - Strongly Disagree

types of variables

Interval Variables (Continuous):

  • Given values that communicate exact differences between subjects on the measured characteristic
  • The intervals between categories have exact meaning
  • Can determine how much smaller or larger the difference is between units
  • Are usually in a naturally occurring scale of some kind
  • The range of values is usually fairly large (age in years, feeling thermometers, etc.)

types of variables

Ratio variables (Continuous):

  • Similar to interval, but the zero in ratio variables actually denotes the absence of some quality
    • E.g. Zero degrees Celsius does not mean there is no air temperature, it just denotes a different level of warmness or coolness. However, temperature expressed in Kelvin is a ratio scale because zero denotes the absence of any thermal movement
  • Ratio variables and interval variables are typically collapsed into the term “continuous” variables, and sometimes both are called interval-level

A quick note on levels of measurement

We usually use the most precise level of measurement. However, sometimes recoding is needed. 

When doing it, make sure you have a good theoretical or empirical reason to lose information.

The type of the dependent variable almost always determine the type of analysis you should/can perform.

Relationships and Hypothesis Testing

Relationships are present in most research endeavours. In statistics, we are usually interested in things that causality/covariation.

Hypothesis testing is more relevant for deductive approaches.

Simple/concise relationships between two variables are preferred to more complex ones (principle or parsimony).

Hypotheses



Hypotheses are statements of relationships or difference that can be tested empirically. Hypotheses should be falsifiable ("how do you know you are wrong?")

Hypotheses can be stated in terms of if/then, more/less, etc.


Null and research hypothesis


Researchers set out to disprove their null hypothesis. They establish no relationship/no difference. It is not possible to distinguish between no relationship or a relationship based on chance.

We NEVER prove a null hypothesis, we only fail to reject it. Also, failing to reject a null hypothesis does not mean that there is no effect/relationship, it's just that we cannot establish it with our data/methods.

Null hypotheses are usually denoted as H0.

Research hypothesis


The statement of relationship or difference that the researcher is interested on. They are usually called "alternative hypotheses" and denoted H1, H2, etc.

 


If the null hypothesis can be rejected, we can claim that the research hypothesis is supported.



Sampling

There are several ways in which we can sample:

  • stratified
  • simple random
  • multistage
  • snowballing

In order to draw inferences, statistics needs random samples. In that way, we assume that the error in our data is also random, and can be estimated.

If the sample is not random, then the error is not as well. Hard to estimate points and errors.

random sampling

By simple random we mean that every subject in the population has the same probability of being selected for the sample.


First, we need to establish a sampling frame: the overall population from which we will draw our sample.


Random sample is usually conducted using computer programs.

bias vs random error

Random sample assumes that the error will also be random. Hence, we can estimate it with a certain level of certainty given the sample size and variation. In this way, we can now how accurate are our estimates about the general population.


Bias refers to systematic differences between the population parameters and our estimates. This can be minimised with random sampling, but is not always eliminated.



Math refresher


Please Excuse My Dear Aunt Sally

P – Parentheses
E – Exponents
M – Multiply
D – Divide
A – Add
S – Subtract

math refresher


statistical notation

Like learning a language, signs have meanings:


± means add and subtract

  square root

 less than or equal to

 more than or equal to   

n usually refers to sample size. Capital N is usually reserved for population size

 (sigma) to sum everything that comes after it

Π (pi) to multiply what comes after it   

statistical notation


x is the independent variable

y is the dependent variable

x̄ is the mean of x, the independent variable
ȳ is the mean of y, the dependent variable

x1 is the symbol for a particular observation in the independent variable


Descriptive statistics?


Univariate

  • Analysis of only one variable on some characteristic
  • Frequency distributions – essentially a count or distribution of values on some single variable
  • Other descriptive statistics – some summary measure that describes the data in a way not obvious by looking at the frequency distribution



descriptive statistics?


Bivariate

  • Analysis of two variables – can be simple scatter plots or histograms


Multivariate

  • Analysis of three or more variables

descriptive statistics


They are not used to draw inferences, and their analytical power is low.


They are a good starting point when we first look at the data, and can complement inferential analysis in our research.


"It does what it says on the tin" i.e. it describes the data.

Frequency Distributions

Can be represented numerically or graphically.

IQ scores: 94, 97, 97, 100, 100, 100, 112, 118, 123, 134


frequency distributions

Sometimes, displaying data graphically is not very useful. Why?


Frequency distributions


measures of central tendency and dispersion

They are the most common way to measure quantitative data. They tell us where our data is centred and how spread out it is (dispersion).

Different levels of data (nominal, ordinal, interval) require different measures of central tendency and dispersion

central tendency


Central tendency tells us the "typical" observation in our sample. It tells us where our data is centred or clustered.


They can be helpful to get comparisons between groups, and to understand what is "normal" or "expectable" from our sample.

central tendency

Each level of data requires a different measure of central tendency


Nominal: Mode

The most common observation

Ordinal: Median

The middle observation when the data is placed in numerical order

Interval: Mean

The average across observations


central tendency: nominal

To find the central tendency of nominal data we look for the modal category

It doesn’t make sense to look for the mean w/ nominal data. If in the IQ example there were 10 observations, 3 males, coded 1, and 7 females, coded 2, the mean is 1.7, which means nothing other than there were more females in the study.

We know that there are more females in the study more easily by finding the mode; requires no calculation


central tendency: nominal

Let’s consider the following example from a larger sample dataset containing information about nurses.  Suppose we want to know the typical gender of nurses.  We can easily look for the modal category by running a frequency distribution. 

.

central tendency: nominal

We can also look at 3 or more categories. Consider the following survey statement:

"Are we spending too much, about right, or too little on foreign aid?"
Too much: 35% of respondents (coded 1)
About right: 45% of respondents (coded 2)
Too little :20% of respondents (coded 3)

Modal response is ‘about right’ = the most commonly cited category

central tendency: ordinal

The mode becomes less helpful when looking at ordinal data


It can still provide us with information, but since the categories are ranked, or ordered, we can learn more with the median. The median divides the data in half; half the observations are above the median, half are below.


The median requires the data to be ordered from lowest to highest by the variable of interest.

central tendency: ordinal


The median can be calculated through this formula:


If our dataset contains 15 observations, we add 1 and divide by 2 or (16/2 = 8) --> The 8th observation is the median

If our dataset contains 10 observations (an even number), we add 1 and divide by 2 or (11/2 = 5.5)

We look at observations 5 and 6 and find the average of those two numbers


central tendency: interval

The most common measure of central tendency is the mean, or average. We use it for interval, or continuous-level, data

The sum of all the values divided by the number of observations


The mean can be subject to extreme scores, so it is important to look at a frequency distribution.
Unlike the mode and median, the mean can be a value not contained in the data.

central tendency: interval

Suppose we have the following exam scores:
54, 67, 71, 62, 50, 59, 73, 60, 64, 57
These added give us: 617
Then divide by the number of observations (n=10)
617/10; mean is: 61.7
The mean does not represent any actual score, but tells us across all the observations, what is most“typical”

Subject to outliers above and below mean

measures of dispersion

Measures of dispersion (variability) measure the opposite of central tendency. They look for how spread out the data is, how much it varies.


Again, we use different measures of dispersion for different levels of measurement.


  • Nominal: categorical percentages
  • Ordinal:  range and inter-quartile range
  • Interval:  standard deviation and variance


dispersion: nominal

We look at the modal category (and the other categories) to see what proportion or percentage of the scores fall in that category

In the foreign aid spending example from earlier the dispersion was

  • Too much: 35%
  • About right: 45%
  • Too little:  20%

What would the percentages be if we had perfect dispersion across the categories?

What would the percentages be if we had no dispersion at all?


dispersion: nominal



There is no absolute value that we are looking for: values can range from perfectly dispersed – the same percent of observations in each category, to no dispersion – all the observations in one category.


dispersion: ordinal

When we look at ordered categories, we are interested in measures of the range. Range is the lowest observation subtracted from the observation with the highest value

If we use the previous example of exam scores:

  • 54, 67, 71, 62, 50, 59, 73, 60, 64, 57
  • The range would be: 73 – 50 = 23

Therefore, the range depends upon the units in which the variables are measured.

Again, no absolute value that indicates a wide or narrow range, it's all in relative terms. 

dispersion: ordinal

We can also look at the interquartile range. This can be done for interval-level data as well, but is not as informative as our other measures of dispersion for interval data.
  • The interquartile range eliminates the upper and lower 25% of observations.
  • We are left with the middle 50% (line in box is median; 50th percentile
  • Eliminates outliers in skewed distributions

dispersion: interval



When we have interval-level data, we can use two primary tools for discussing the dispersion: standard deviation and variance.



standard deviation


  • Standard deviation tells us how much any observation is likely to deviate from the mean
  • Or, on average, how much does an observation differ from the average
  • Dependent upon the scale in which the variable is measured
    • Smaller numbers indicate less average deviation from the average
    • Larger numbers indicate more average deviation from the average


Standard deviation formula


In words: The square root of the sum of each observation of x minus the mean of x, squared, and divided by n minus 1.

We can work this out by hand for small samples; but for large samples, Stata easily does this for us.

You may see this in slightly different forms depending upon the source.


Calculating standard deviation step by step

1. Calculate mean

2. Subtract mean from each observation 

3.  Square each residual

4.  Sum all residuals

5.  Divide by n-1 

6.  Take the square root of this number

standard deviation example

Let’s look at the following 5 IQ scores:


109, 100, 98, 126, 131


First, we find the mean: 112.8

Then we subtract the mean from each observation:

standard deviation example



a note on n-1

Sometimes you will see the formula for standard deviation with only in the denominator, but technically, this is used for populations.

We use n-1 in samples to note that we are missing information in our sample.

What happens when you increase the number in the denominator of a fraction?  Decrease it?

12/4 = 3    or 12/3 = 4

n – 1 makes the denominator smaller and helps us estimate error more accurately (cautiously)




Feeling thermometer example


Work through this one one your own and interpret output.


You have the following 10 feeling thermometer scores for David Cameron:

50, 25, 25, 75, 65, 100, 45, 0, 35, 90

Find the standard deviation…

thinking graphically


dispersion: variance


Variance is very similar to standard deviation but, rather than being the average distance from the average, variance measure the overall distance of the observations from the mean

It is not interpreted in terms of each or any observations, but ALL observations.

In some ways, less informative than standard deviation

No absolute value to look for: larger numbers mean more overall variance in the data, smaller numbers mean less variance

formula for variance




  • How is this different than the formula for standard deviation?
  • The process for finding the variance is the same as the process for finding the standard deviation except for the last step
  • Sometimes the variance is called the “squared standard deviation” hence the notation

variance example


Feeling thermometer scores again:

50, 25, 25, 75, 65, 100, 0, 45, 35, 90


Find the variance and interpret the results

choosing the appropriate measure


Sometimes the mean can be misleading if there are some outliers
A good example is income
Income distribution: £70K, 40K, 30K, 30K, 25K, 25K, 25K
Mean = 35K
Median= 30K
Mode= 25K
Income distribution: £200K, 40K, 30K, 30K, 25K, 25K, 25K
Mean = 54K
Median= 30K
Mode= 25K

why stata?



Why Stata?

  • It is more powerful than other programs, such as SPSS

  • It is less programming oriented than other programs like R

  • Will take a few seminars, but is actually quite straightforward  
Always try to attend seminar

  • Make use of Acock! And the help menus!

Intro to Stats - Session 1

By Javier Sajuria

Intro to Stats - Session 1

Prepared for the Birkbeck Workshop in Social Research Methods

  • 2,128