Summarising Data

Recap: List of Topics

Descriptive Statistics

Different types of data

Different types of plots

Measures of centrality and spread

Probability Theory

Sample spaces, events, axioms

Discrete and continuous RVs

Bernoulli, Uniform, Normal dist.

Inferential Statistics

Sampling strategies

Interval Estimators

Hypothesis testing (z-test, t-test)

ANOVA, Chi-square test

Linear Regression

Learning Objectives

What is the effect of transformations on percentiles?

What are percentiles?

What are the different measures of spread?

How do you compute the percentile rank of a value in the data?

What are box plots and how can they be used to visualise some measures of centrality and spread?

What is the effect of transformations on measures of spread?

What are some frequently used percentiles?

What are percentiles?

Intuition: Percentiles

Example

Suppose you scored 45 out of 100 on a test. How would you rate your performance? Good or bad?

Is it bad? (because you scored less than 50%)

But ...

What if the questions were really hard?

What if the time provided was insufficient?

Intuition: Percentiles

Example

Suppose you scored 45 out of 100 on a test. Out of 100 students, only 2 scored greater than 45. How do you rate your performance?

Does it look good now?

Yes, it does ...

You can proudly say that you are at the 98th percentile of your class (the score of 98% of the students was less than or equal to your score)

Percentiles

Example

A university conducts a written test for 25 students and decides to interview those students whose score is above the 70th percentile

44, 43, 37, 68, 55, 46, 19, 59, 34, 46, 51, 62, 47, 52, 44, 28, 36, 56, 65, 60, 55, 66, 54, 48, 62

Can you identify which students will be called for the interview?

Percentiles

25 students (sorted scores)

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66, 68

The p-th percentile of a sample is a value such that p percent of the values in the data are less than or equal to this value

For example, 70% of the sorted data values lie at or below the 70th percentile

Percentiles

25 students (sorted scores)

Sort the data

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66, 68

Compute location of the p-th percentile

L_p = \frac{p}{100} (n + 1) = \frac{70}{100} (25 + 1) = 18.2

The 70th percentile lies at location 18.2!

Percentiles

25 students (sorted scores)

Where is the position 18.2?

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55 (first 17 elements), 56 (position 18), 59 (position 19), 60, 62, 62, 65, 66, 68 (last 6 elements)

18.2 is between 18 and 19, closer to 18

The 70th percentile should be between 56 and 59, greater than 56 but closer to 56

56 + 0.2 * (59 - 56) = 56.6

Percentiles

What is the overall procedure?

Sort the data

Compute location of the p-th percentile

L_p = \frac{p}{100} (n + 1) = \frac{70}{100} (25 + 1) = 18.2

i_p = integer part of L_p = 18

f_p = fractional part of L_p = 0.2

Compute p-th percentile as

Y_p = x_{i_p} + f_p * (x_{i_p + 1} - x_{i_p})
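The three steps of this procedure can be sketched in Python. This is a minimal sketch rather than a reference implementation: the function name `percentile_n_plus_1` is hypothetical, p is assumed to be a whole number (so the integer and fractional parts of L_p can be found with exact integer arithmetic), and locations that fall before the first or after the last value are clamped to the nearest data value, a case the slides do not cover.

```python
def percentile_n_plus_1(data, p):
    """p-th percentile using location L_p = p/100 * (n + 1) and linear interpolation."""
    xs = sorted(data)                      # step 1: sort the data
    n = len(xs)
    # step 2: L_p = p * (n + 1) / 100, split into integer part i_p and fractional part f_p
    i_p, rem = divmod(p * (n + 1), 100)
    f_p = rem / 100
    # assumption (not in the slides): clamp locations outside 1..n to the nearest value
    if i_p < 1:
        return xs[0]
    if i_p >= n:
        return xs[-1]
    # step 3: Y_p = x_{i_p} + f_p * (x_{i_p + 1} - x_{i_p}); positions are 1-based
    lower, upper = xs[i_p - 1], xs[i_p]
    return lower + f_p * (upper - lower)


scores = [44, 43, 37, 68, 55, 46, 19, 59, 34, 46, 51, 62, 47,
          52, 44, 28, 36, 56, 65, 60, 55, 66, 54, 48, 62]
y_70 = percentile_n_plus_1(scores, 70)
print(round(y_70, 2))  # 56.6
```

Using `divmod` on whole numbers avoids deciding whether a floating-point L_p "is an integer", which matters for the special case where f_p is 0.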

Percentiles (some more intuition)

Y_p = x_{i_p} + f_p * (x_{i_p + 1} - x_{i_p})

Y_p = x_{i_p} + f_p * x_{i_p + 1} - f_p * x_{i_p}

Y_p = (1 - f_p) * x_{i_p} + f_p * x_{i_p + 1}

If f_p is high then the weight given to x_{i_p} will be lower than that given to x_{i_p + 1}, and vice versa
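A small numeric check of this rearrangement, as a sketch: the two helper names are hypothetical, and the second value of f_p (0.8) applied to the pair 56 and 59 is an illustrative assumption, not a case taken from the slides.

```python
def interpolate(x_lo, x_hi, f_p):
    """Difference form: x_lo + f_p * (x_hi - x_lo)."""
    return x_lo + f_p * (x_hi - x_lo)


def weighted_average(x_lo, x_hi, f_p):
    """Weighted form: (1 - f_p) * x_lo + f_p * x_hi."""
    return (1 - f_p) * x_lo + f_p * x_hi


# f_p = 0.2 puts most of the weight on 56; f_p = 0.8 puts most of it on 59
low_f = weighted_average(56, 59, 0.2)
high_f = weighted_average(56, 59, 0.8)
print(round(low_f, 2), round(high_f, 2))  # 56.6 58.4
```

Both forms give the same number up to floating-point rounding; the weighted form just makes it visible that the result is pulled toward whichever neighbour has the larger weight.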

Percentiles

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66, 68

p = 70

L_p = \frac{p}{100} (n + 1) = \frac{70}{100} (25 + 1) = 18.2

i_p = 18, f_p = 0.2

Y_p = x_{i_p} + f_p * (x_{i_p + 1} - x_{i_p})

Y_{70} = x_{18} + 0.2 * (x_{19} - x_{18}) = 56 + 0.2 * (59 - 56) = 56.6

Percentiles

Example

A university conducts a written test for 25 students and decides to interview those students whose score is above the 70th percentile

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66, 68

Y_{70} = 56.6

The university will invite only those 7 students whose score was greater than 56.6
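The shortlist can be reproduced with a one-line filter. This is only an illustrative sketch: the variable names are hypothetical and the cut-off 56.6 is taken from the computation shown earlier rather than recomputed here.

```python
scores = [44, 43, 37, 68, 55, 46, 19, 59, 34, 46, 51, 62, 47,
          52, 44, 28, 36, 56, 65, 60, 55, 66, 54, 48, 62]
cutoff = 56.6  # Y_70, the 70th percentile computed above

# keep only the scores strictly above the cut-off
invited = sorted(s for s in scores if s > cutoff)
print(invited)       # [59, 60, 62, 62, 65, 66, 68]
print(len(invited))  # 7
```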

Percentiles

Example

Suppose the university changes its decision and now wants to invite only those students who scored above the 80th percentile (p = 80)

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59 (first 19 elements), 60 (position 20), 62 (position 21), 62, 65, 66, 68 (last 4 elements)

L_p = \frac{p}{100} (n + 1) = \frac{80}{100} (25 + 1) = 20.8

i_p = 20, f_p = 0.8

Y_{80} = x_{20} + 0.8 * (x_{21} - x_{20}) = 60 + 0.8 * (62 - 60) = 61.6

Percentiles

L_p = \frac{p}{100} (n + 1) = \frac{80}{100} (25 + 1) = 20.8

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59 (first 19 elements), 60 (position 20), 62 (position 21), 62, 65, 66, 68 (last 4 elements)

Why did we have to compute Y_p? Wasn't knowing L_p enough to identify the shortlisted students?

Yes, it was, but the university may also be required to declare the cut-off score, hence we need to compute Y_p as well

Percentiles (special case: f_p = 0)

Example

Suppose there were only 24 students and p = 80

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59 (first 19 elements), 60 (position 20), 62, 62, 65, 66 (last 4 elements)

L_p = \frac{p}{100} (n + 1) = \frac{80}{100} (24 + 1) = 20

i_p = 20, f_p = 0

Y_{80} = x_{20} + 0 * (x_{21} - x_{20}) = 60 + 0*(62-60) = 60

In such cases the percentile corresponds to a value that is actually in the data
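The special case can be checked directly. The sketch below assumes the 24 scores are the original 25 with the top score removed (as the example suggests) and uses hypothetical variable names.

```python
# 24 sorted scores: the original 25 without the highest value
xs = [19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48,
      51, 52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66]
n = len(xs)

# L_80 = 80 * (n + 1) / 100, split exactly into integer and fractional parts
i_p, rem = divmod(80 * (n + 1), 100)
f_p = rem / 100

# positions are 1-based, Python lists are 0-based
y_80 = xs[i_p - 1] + f_p * (xs[i_p] - xs[i_p - 1])
print(i_p, f_p, y_80)  # 20 0.0 60.0
```

Because f_p is 0, the interpolation term vanishes and the percentile is exactly the value at position 20.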

What are some alternative methods found in textbooks?

Alternative 1 (we already saw this)

Sort the data

Compute location of the p-th percentile

L_p = \frac{p}{100} (n + 1)

i_p = integer part of L_p, f_p = fractional part of L_p

If L_p is an integer:

Y_{p} = x_{i_{p}}

If L_p is not an integer:

Y_p = x_{i_p} + f_p * (x_{i_p+1} - x_{i_p})

Alternative 2

Sort the data

Compute location of the p-th percentile

L_p = \frac{p}{100} (n)

(note the use of n instead of n+1)

i_p = integer part of L_p

If L_p is an integer:

Y_{p} = \frac{x_{L_p} + x_{L_p+1}}{2}

If L_p is not an integer:

Y_{p} = x_{i_p+1}

Alternative 2

Example

Marks of 25 students and p=70 or p=80

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66, 68

L_{70} = \frac{p}{100} (n) = \frac{70}{100}*25=17.5

L_{70} is not an integer:

Y_{70} = x_{18} = 56

L_{80} = \frac{p}{100} (n) = \frac{80}{100}*25=20

L_{80} is an integer:

Y_{80} = \frac{x_{20} + x_{21}}{2}=\frac{60+62}{2}=61
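A minimal Python sketch of Alternative 2 follows. The function name `percentile_alt2` is hypothetical, p is assumed to be a whole number, and the clamping of out-of-range locations is an added assumption that the slides do not discuss.

```python
def percentile_alt2(data, p):
    """Alternative 2: location L_p = p/100 * n, no interpolation."""
    xs = sorted(data)
    n = len(xs)
    # integer part i_p and remainder of L_p = p * n / 100 (exact integer arithmetic)
    i_p, rem = divmod(p * n, 100)
    # assumption (not in the slides): clamp locations outside 1..n-1
    if i_p < 1:
        return xs[0]
    if i_p >= n:
        return xs[-1]
    if rem == 0:
        # L_p is an integer: average of x_{L_p} and x_{L_p + 1} (1-based positions)
        return (xs[i_p - 1] + xs[i_p]) / 2
    # L_p is not an integer: take x_{i_p + 1}
    return xs[i_p]


scores = [44, 43, 37, 68, 55, 46, 19, 59, 34, 46, 51, 62, 47,
          52, 44, 28, 36, 56, 65, 60, 55, 66, 54, 48, 62]
print(percentile_alt2(scores, 70))  # 56
print(percentile_alt2(scores, 80))  # 61.0
```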

Alternative 2: intuition

The p-th percentile is that value in the data such that at least p percent of the values are less than or equal to it and at least (100-p) percent of the values are greater than or equal to it

L_{70} = \frac{p}{100} (n) = \frac{70}{100}*25=17.5

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66, 68

At least 17.5 values should be less than or equal to it (so the location should be 18 or higher)

At least 7.5 values should be greater than or equal to it (so the location should be 18 or lower)

Location 18 (i.e., i_p + 1) is the only location which satisfies both conditions

Alternative 2: intuition

The p-th percentile is that value in the data such that at least p percent of the values are less than or equal to it and at least (100-p) percent of the values are greater than or equal to it

L_{80} = \frac{p}{100} (n) = \frac{80}{100}*25=20

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66, 68

At least 20 values should be less than or equal to it (so the location should be 20 or higher)

At least 5 values should be greater than or equal to it (so the location should be 21 or lower)

Both locations 20 and 21 (i.e., L_p and L_p + 1) satisfy the above conditions, so just take the average of these two values

Y_{80} = \frac{x_{20} + x_{21}}{2}=\frac{60+62}{2}=61
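The two conditions can be checked by brute force over all locations. This is a sketch under a simplifying assumption: it counts by position in the sorted list (position k has k entries at or before it and n - k + 1 entries at or after it), which is how the slides argue, and it ignores the effect of tied values. The function name is hypothetical.

```python
def qualifying_positions(n, p):
    """1-based positions k in sorted data of size n that satisfy both conditions."""
    return [
        k for k in range(1, n + 1)
        if 100 * k >= p * n                      # at least p% of the entries are at or before k
        and 100 * (n - k + 1) >= (100 - p) * n   # at least (100-p)% are at or after k
    ]


print(qualifying_positions(25, 70))  # [18]
print(qualifying_positions(25, 80))  # [20, 21]
```

When L_p is not an integer exactly one position qualifies; when it is an integer two adjacent positions qualify, which is why Alternative 2 averages them.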

Alternative 3

Sort the data

Compute location of the p-th percentile

L_p = \frac{p}{100} (n + 1)

(same as alternative 1)

i_p = integer part of L_p

If L_p is an integer:

Y_{p} = x_{i_{p}}

(same as alternative 1)

If L_p is not an integer:

Y_p = x_{i_p} + 0.5 * (x_{i_p+1} - x_{i_p})

(same as alternative 1 except that 0.5 is used instead of f_p)
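Alternative 3 can be sketched the same way. The function name `percentile_alt3` is hypothetical, p is assumed to be a whole number, and the clamping of out-of-range locations is an added assumption.

```python
def percentile_alt3(data, p):
    """Alternative 3: location L_p = p/100 * (n + 1), midpoint instead of f_p."""
    xs = sorted(data)
    n = len(xs)
    i_p, rem = divmod(p * (n + 1), 100)
    # assumption (not in the slides): clamp locations outside 1..n
    if i_p < 1:
        return xs[0]
    if i_p >= n:
        return xs[-1]
    if rem == 0:
        # L_p is an integer: the percentile is the data value at that position
        return xs[i_p - 1]
    # L_p is not an integer: midpoint of the two neighbouring values
    return xs[i_p - 1] + 0.5 * (xs[i_p] - xs[i_p - 1])


scores = [44, 43, 37, 68, 55, 46, 19, 59, 34, 46, 51, 62, 47,
          52, 44, 28, 36, 56, 65, 60, 55, 66, 54, 48, 62]
print(percentile_alt3(scores, 70))  # 57.5
print(percentile_alt3(scores, 80))  # 61.0
```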

Comparison

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66, 68

Alternative 1: L_p = \dfrac{p}{100}*(n+1)

Case 1 (L_p is integer): Y_p = x_{L_p}

Case 2 (L_p is not integer): Y_p = x_{i_p} + f_p*(x_{i_p+1} - x_{i_p})

P = 70: L_{70} = 18.2, Y_{70} = 56.6

P = 80: L_{80} = 20.8, Y_{80} = 61.6

Alternative 2: L_p = \dfrac{p}{100}*n

Case 1 (L_p is integer): Y_p = \dfrac{x_{L_p} + x_{L_p+1}}{2}

Case 2 (L_p is not integer): Y_p = x_{i_p+1}

P = 70: L_{70} = 17.5, Y_{70} = 56

P = 80: L_{80} = 20, Y_{80} = 61

Alternative 3: L_p = \dfrac{p}{100}*(n+1)

Case 1 (L_p is integer): Y_p = x_{L_p}

Case 2 (L_p is not integer): Y_p = x_{i_p} + 0.5*(x_{i_p+1} - x_{i_p})

P = 70: L_{70} = 18.2, Y_{70} = 57.5

P = 80: L_{80} = 20.8, Y_{80} = 61

What are some frequently used percentiles?

Quartiles

19, 28, 34, 36, 37 (first 25% of the data)

Q_1 (25th percentile)

43, 44, 44, 46, 46 (second 25% of the data)

Q_2 (50th percentile)

47, 48, 51, 52, 54 (third 25% of the data)

Q_3 (75th percentile)

55, 55, 56, 59, 60 (last 25% of the data)

Quartiles divide the data into four equal parts

Quartiles: Example

Shikhar Dhawan T20I scores (50 sorted scores)

0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10, 11, 13, 14, 15, 16, 23, 23, 24, 26, 29, 30, 30, 32, 33, 35, 41, 42, 43, 46, 47, 51, 55, 60, 72, 74, 76, 80, 90, 92

L_{25} = \frac{25}{100}*(50 + 1) = 12.75

Q_1 = Y_{25} = x_{12} + 0.75 * (x_{13} - x_{12}) = 5 + 0.75 * (5 - 5) = 5

L_{50} = \frac{50}{100}*(50 + 1) = 25.5

Q_2 = Y_{50} = x_{25} + 0.5 * (x_{26} - x_{25}) = 15 + 0.5 * (16 - 15) = 15.5

L_{75} = \frac{75}{100}*(50 + 1) = 38.25

Q_3 = Y_{75} = x_{38} + 0.25 * (x_{39} - x_{38}) = 42 + 0.25 * (43 - 42) = 42.25
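These quartiles can be cross-checked with Python's standard library. As far as the author of this note is aware, `statistics.quantiles` with its default `method='exclusive'` places the cut points at positions based on n + 1 and interpolates linearly, which matches the formula used here; treat that correspondence as an assumption to verify against the library documentation.

```python
import statistics

# Shikhar Dhawan T20I scores (50 sorted scores) from the example above
dhawan = [0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10,
          10, 11, 13, 14, 15, 16, 23, 23, 24, 26, 29, 30, 30, 32, 33, 35,
          41, 42, 43, 46, 47, 51, 55, 60, 72, 74, 76, 80, 90, 92]

# n=4 asks for the three cut points Q_1, Q_2, Q_3
q1, q2, q3 = statistics.quantiles(dhawan, n=4)
print(q1, q2, q3)  # 5.0 15.5 42.25
```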

Median is same as Q2

median = \underbrace{x_{\frac{n+1}{2}}}_{n~is~odd}~or~ \underbrace{\frac{x_{\frac{n}{2}} + x_{\frac{n}{2} + 1}}{2}}_{n~is~even}

L_{50} = \frac{50}{100}*(n + 1) = \frac{n+1}{2}

Q_2 = Y_{50} = x_{i_p} + f_p * (x_{i_p+1} - x_{i_p})

Are they the same?

(of course, they are!)

But why do the formulae look different?

Median is same as Q2

L_{50} = \frac{50}{100}*(n + 1) = \frac{n+1}{2}

Case 1: n is odd

L_{50}~will~be~an~integer

(\because n+1~is~even)

i_p=L_p = \frac{n+1}{2}, f_p = 0

Q_2 = Y_{50} = x_{i_p} + 0 * (x_{i_p+1} - x_{i_p}) = x_{\frac{n+1}{2}}

Case 2: n is even

L_{50}= \frac{n}{2} + \frac{1}{2} = i_p + 0.5

i_p= \frac{n}{2}, f_p = 0.5

Q_2 = Y_{50} = x_{i_p} + 0.5 * (x_{i_p+1} - x_{i_p}) = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1} }{2}

Median is same as Q2

Q_2 = Y_{50} = x_{i_p} + 0.5 * (x_{i_p+1} - x_{i_p})

= x_{\frac{n}{2}} + 0.5*(x_{\frac{n}{2}+1} - x_{\frac{n}{2}})~~(\because i_p = \frac{n}{2})

= x_{\frac{n}{2}} + 0.5*x_{\frac{n}{2}+1} - 0.5*x_{\frac{n}{2}}

= 0.5*x_{\frac{n}{2}} + 0.5*x_{\frac{n}{2}+1}

= \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1} }{2}
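The equivalence can also be checked numerically for one odd and one even sample. This is a sketch: the helper name `q2` is hypothetical, and the two small samples are simply the first five and first six sorted test scores, chosen only for illustration.

```python
import statistics


def q2(data):
    """Q_2 via L_50 = 50 * (n + 1) / 100 and linear interpolation."""
    xs = sorted(data)
    n = len(xs)
    i_p, rem = divmod(50 * (n + 1), 100)
    f_p = rem / 100          # 0 when n is odd, 0.5 when n is even
    if f_p == 0:
        return xs[i_p - 1]   # Case 1: the middle value itself
    # Case 2: halfway between the two middle values
    return xs[i_p - 1] + f_p * (xs[i_p] - xs[i_p - 1])


odd_sample = [19, 28, 34, 36, 37]        # n = 5
even_sample = [19, 28, 34, 36, 37, 43]   # n = 6
print(q2(odd_sample), statistics.median(odd_sample))    # 34 34
print(q2(even_sample), statistics.median(even_sample))  # 35.0 35.0
```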

Quintiles

19, 28, 34, 36 (first 20% of the data)

P_1 (20th percentile)

37, 43, 44, 44 (second 20% of the data)

P_2 (40th percentile)

46, 46, 47, 48 (third 20% of the data)

P_3 (60th percentile)

51, 52, 54, 55 (fourth 20% of the data)

P_4 (80th percentile)

55, 56, 59, 60 (fifth 20% of the data)

Quintiles divide the data into five equal parts

Quintiles: Example

Shikhar Dhawan T20I scores (50 sorted scores)

0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10, 11, 13, 14, 15, 16, 23, 23, 24, 26, 29, 30, 30, 32, 33, 35, 41, 42, 43, 46, 47, 51, 55, 60, 72, 74, 76, 80, 90, 92

L_{60} = \frac{60}{100}*(50 + 1) = 30.6

P_3 = Y_{60} = x_{30} + 0.6 * (x_{31} - x_{30}) = 26 + 0.6 * (29 - 26) = 27.8
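All four quintiles can be computed with the same location formula. This is a minimal sketch with a hypothetical helper name; it assumes whole-number p and does not handle locations outside the data, which cannot occur for these four values of p with n = 50.

```python
def percentile(xs_sorted, p):
    """Y_p = x_{i_p} + f_p * (x_{i_p + 1} - x_{i_p}) with L_p = p/100 * (n + 1)."""
    n = len(xs_sorted)
    i_p, rem = divmod(p * (n + 1), 100)
    f_p = rem / 100
    # 1-based positions i_p and i_p + 1 map to list indices i_p - 1 and i_p
    return xs_sorted[i_p - 1] + f_p * (xs_sorted[i_p] - xs_sorted[i_p - 1])


dhawan = [0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10,
          10, 11, 13, 14, 15, 16, 23, 23, 24, 26, 29, 30, 30, 32, 33, 35,
          41, 42, 43, 46, 47, 51, 55, 60, 72, 74, 76, 80, 90, 92]

# P_1..P_4 are the 20th, 40th, 60th and 80th percentiles
quintiles = {p: percentile(dhawan, p) for p in (20, 40, 60, 80)}
print(round(quintiles[60], 2))  # 27.8
```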