Recap: List of Topics

Descriptive Statistics

Probability Theory

Inferential Statistics

Different types of data

Different types of plots

Measures of centrality and spread

Sample spaces, events, axioms

Discrete and continuous RVs

Bernoulli, Uniform, Normal dist.

Sampling strategies

Interval Estimators

Hypothesis testing (z-test, t-test)

ANOVA, Chi-square test

Linear Regression

What is the effect of transformations on percentiles?

Learning Objectives

What are percentiles?

What are the different measures of spread?

How do you compute the percentile rank of a value in the data?

What are box plots and how to use them to visualise some measures of centrality and spread?

What is the effect of transformations on measures of spread?

What are some frequently used percentiles?

What are percentiles?

Intuition: Percentiles

Suppose you scored 45 out of 100 on a test. How would you rate your performance? Good or bad?

Example

 

Is it bad? (because you scored less than 50%)

But ... ...

What if the questions were really hard?

What if the time provided was insufficient?

Intuition: Percentiles

Suppose you scored 45 out of 100 on a test. Out of 100 students, only 2 scored greater than 45. How do you rate your performance?

Example

 

Does it look good now? 

Yes, it does ... ...

You can proudly say that you lie at the 98th percentile of your class (the scores of 98% of the students were less than or equal to your score)

 

Percentiles

A university conducts a written test for 25 students and decides to call those students for an interview whose score is above the 70th percentile

Example

 

44, 43, 37, 68, 55, 46, 19, 59, 34, 46, 51, 62, 47, 52, 44, 28, 36, 56, 65, 60, 55, 66, 54, 48, 62

 

Can you identify which students will be called for the interview?

Percentiles

25 students (sorted scores)

 

The p-th percentile of a sample is a value such that p percent of the values in the data are less than or equal to this value

\underbrace{~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~}_{\textit{70\% of the values in the data}}

70-th percentile

 

sorted data values

 

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66, 68

 

Percentiles

25 students (sorted scores)

 

Sort the data

 

Compute location of the p-th percentile

 
L_p = \frac{p}{100} (n + 1) = \frac{70}{100} (25 + 1) = 18.2

The 70th percentile lies at location 18.2!

 

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66, 68

 

Percentiles

25 students (sorted scores)

 

Where is the position 18.2?

The first 17 elements are 19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55; the 18th value is 56 and the 19th value is 59; the last 6 elements are 60, 62, 62, 65, 66, 68.

18.2 is between 18 and 19, closer to 18.

The 70th percentile should therefore be between 56 and 59, greater than 56 but closer to 56:

Y_{70} = 56 + 0.2 * (59 - 56) = 56.6

Percentiles

What is the overall procedure?

 

Sort the data

 

Compute location of the p-th percentile

 
L_p = \frac{p}{100} (n + 1) = \frac{70}{100} (25 + 1) = 18.2

i_p = integer part of L_p = 18

f_p = fractional part of L_p = 0.2

 

Compute p-th percentile as

 
Y_p = x_{i_p} + f_p * (x_{i_{p+1}} - x_{i_p})
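A minimal Python sketch of this procedure (the function name percentile_nplus1 is ours, not from the lecture):

def percentile_nplus1(data, p):
    """p-th percentile via L_p = (p/100)*(n+1) with linear interpolation."""
    xs = sorted(data)
    n = len(xs)
    L = p * (n + 1) / 100          # location of the p-th percentile
    i, f = int(L), L - int(L)      # integer part i_p, fractional part f_p
    if i < 1:                      # guard: location falls before x_1
        return xs[0]
    if i >= n:                     # guard: location falls at or after x_n
        return xs[-1]
    return xs[i - 1] + f * (xs[i] - xs[i - 1])   # 1-based locations -> 0-based indices

scores = [44, 43, 37, 68, 55, 46, 19, 59, 34, 46, 51, 62, 47,
          52, 44, 28, 36, 56, 65, 60, 55, 66, 54, 48, 62]
print(percentile_nplus1(scores, 70))   # 56.6

If NumPy is available, np.percentile(scores, 70, method='weibull') should agree, since that method follows the same (n+1) rule (the method keyword needs NumPy 1.22 or newer).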

Percentiles (some more intuition)

Y_p = x_{i_p} + f_p * (x_{i_{p+1}} - x_{i_p})
Y_p = x_{i_p} + f_p * x_{i_{p+1}} - f_p * x_{i_p}
Y_p = (1 - f_p) * x_{i_p} + f_p * x_{i_{p+1}}

If f_p is high then the weight given to x_{i_p} will be lower than that given to x_{i_{p+1}}, and vice versa.

Percentiles

Y_p = x_{i_p} + f_p * (x_{i_{p+1}} - x_{i_p})

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66, 68

 
p = 70
L_p = \frac{p}{100} (n + 1) = \frac{70}{100} (25 + 1) = 18.2
i_p = 18, f_p = 0.2
Y_{70} = x_{18} + 0.2 * (x_{19} - x_{18}) = 56.6

Percentiles

A university conducts a written test for 25 students and decides to call those students for an interview whose score is above the 70th percentile

Example

 

The university will invite only those 7 students whose scores were greater than 56.6

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66, 68

 
Y_{70} = 56.6

Percentiles

Suppose the university changes its decision and now wants to invite only those students who scored above the 80th percentile

Example

 
(p = 80)
L_p = \frac{p}{100} (n + 1) = \frac{80}{100} (25 + 1) = 20.8
i_p = 20, f_p = 0.8
The first 19 elements are 19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59; the 20th value is 60 and the 21st value is 62; the last 4 elements are 62, 65, 66, 68.

Y_{80} = x_{20} + 0.8 * (x_{21} - x_{20}) = 60 + 0.8 * (62 - 60) = 61.6


Percentiles

L_p = \frac{p}{100} (n + 1) = \frac{80}{100} (25 + 1) = 20.8

Why did we have to compute Y_p? Wasn't knowing L_p enough to identify the shortlisted students?

Yes, it was, but the university may also be required to declare the cut-off score, hence we need to compute Y_p also

Percentiles (special case: f_p = 0)

Suppose there were only 24 students and p = 80

Example

L_p = \frac{p}{100} (n + 1) = \frac{80}{100} (24 + 1) = 20

i_p = 20, f_p = 0

The first 19 elements are 19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59; the 20th value is 60; the last 4 elements are 62, 62, 65, 66.

Y_{80} = 60 + 0*(62-60) = 60

In such cases the percentile actually corresponds to a value in the data

What are some alternative methods found in textbooks?

Alternative 1

Sort the data

Compute location of the p-th percentile

L_p = \frac{p}{100} (n + 1)

(i_p is the integer part of L_p)

If~L_p~is~an~integer: Y_{p} = x_{L_p}

If~L_p~is~not~an~integer: Y_p = x_{i_p} + f_p * (x_{i_{p+1}} - x_{i_p})

Alternative 2

Sort the data

Compute location of the p-th percentile

L_p = \frac{p}{100} (n)

(note the use of n instead of n+1; i_p is the integer part of L_p)

If~L_p~is~an~integer: Y_{p} = \frac{x_{L_p} + x_{L_p+1}}{2}

If~L_p~is~not~an~integer: Y_{p} = x_{i_p+1}

Alternative 2

Marks of 25 students, with p = 70 or p = 80

Example

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66, 68

L_{70} = \frac{70}{100}*25=17.5;~~L_{70} is not an integer, so Y_{70} = x_{18} = 56

L_{80} = \frac{80}{100}*25=20;~~L_{80} is an integer, so Y_{80} = \frac{x_{20} + x_{21}}{2}=\frac{60+62}{2}=61

Alternative 2: intuition

The p-th percentile is that value in the data such that at least p percent of the values are less than or equal to it and at least (100-p) percent of the values are greater than or equal to it

L_{70} = \frac{p}{100} (n) = \frac{70}{100}*25=17.5

At least 17.5 values should be less than or equal to it (so the location should be 18 or higher)

At least 7.5 values should be greater than or equal to it (so the location should be 18 or lower)

Location 18 (i.e., i_p + 1) is the only location which satisfies both conditions

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66, 68

 

Alternative 2: intuition

The p-th percentile is that value in the data such that at least p percent of the values are less than or equal to it and at least (100-p) percent of the values are greater than or equal to it

L_{80} = \frac{p}{100} (n) = \frac{80}{100}*25=20

Both locations 20 and 21 (i.e., L_p and L_p + 1) satisfy the above conditions, so just take the average of these two values

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66, 68

 
Y_{80} = \frac{x_{20} + x_{21}}{2}=\frac{60+62}{2}=61

Alternative 3

Sort the data

 

Compute location of the p-th percentile

 

L_p = \frac{p}{100} (n + 1)

(same as alternative 1; i_p is the integer part of L_p)

If~L_p~is~an~integer: Y_{p} = x_{i_{p}} (same as alternative 1)

If~L_p~is~not~an~integer: Y_p = x_{i_p} + 0.5 * (x_{i_{p+1}} - x_{i_p}) (same as alternative 1 except that 0.5 is used instead of f_p)

Comparison

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66, 68

 

                              Alternative 1                                  Alternative 2                       Alternative 3
Location                      L_p = \dfrac{p}{100}*(n+1)                     L_p = \dfrac{p}{100}*n              L_p = \dfrac{p}{100}*(n+1)
Case 1: L_p is integer        Y_p = x_{L_p}                                  Y_p = \dfrac{x_{L_p}+x_{L_p+1}}{2}  Y_p = x_{L_p}
Case 2: L_p is not integer    Y_p = x_{i_p} + f_p*(x_{i_{p+1}} - x_{i_p})    Y_p = x_{i_{p+1}}                   Y_p = x_{i_p} + 0.5*(x_{i_{p+1}} - x_{i_p})
p = 70                        L_{70} = 18.2, Y_{70} = 56.6                   L_{70} = 17.5, Y_{70} = 56          L_{70} = 18.2, Y_{70} = 57.5
p = 80                        L_{80} = 20.8, Y_{80} = 61.6                   L_{80} = 20, Y_{80} = 61            L_{80} = 20.8, Y_{80} = 61
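The three alternatives are easy to compare in code; a minimal sketch (the function names alt1, alt2, alt3 are ours):

xs = [19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51,
      52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66, 68]

def alt1(xs, p):                       # L_p = (p/100)(n+1), interpolate by f_p
    L = p * (len(xs) + 1) / 100
    i, f = int(L), L - int(L)
    return xs[i - 1] + f * (xs[i] - xs[i - 1]) if f else xs[i - 1]

def alt2(xs, p):                       # L_p = (p/100)n
    L = p * len(xs) / 100
    i = int(L)
    return (xs[i - 1] + xs[i]) / 2 if L.is_integer() else xs[i]

def alt3(xs, p):                       # like alt1 but with 0.5 instead of f_p
    L = p * (len(xs) + 1) / 100
    i, f = int(L), L - int(L)
    return xs[i - 1] + 0.5 * (xs[i] - xs[i - 1]) if f else xs[i - 1]

for p in (70, 80):
    print(p, alt1(xs, p), alt2(xs, p), alt3(xs, p))
# 70 56.6 56 57.5
# 80 61.6 61.0 61.0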

What are some frequently used percentiles?

Quartiles

Quartiles divide the data into four equal parts:

19, 28, 34, 36, 37 | 43, 44, 44, 46, 46 | 47, 48, 51, 52, 54 | 55, 55, 56, 59, 60
(first 25% | second 25% | third 25% | last 25% of the data)

Q_1 = 25th percentile, Q_2 = 50th percentile, Q_3 = 75th percentile

Quartiles: Example

Shikhar Dhawan T20I scores (50 sorted scores)

 

0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10, 11, 13, 14, 15, 16, 23, 23, 24, 26, 29, 30, 30, 32, 33, 35, 41, 42, 43, 46, 47, 51, 55, 60, 72, 74, 76, 80, 90, 92

 
L_{25} = \frac{25}{100}*(50 + 1) = 12.75,~~Q_1 = Y_{25} = x_{12} + 0.75 * (x_{13} - x_{12}) = 5
L_{50} = \frac{50}{100}*(50 + 1) = 25.5,~~Q_2 = Y_{50} = x_{25} + 0.5 * (x_{26} - x_{25}) = 15.5
L_{75} = \frac{75}{100}*(50 + 1) = 38.25,~~Q_3 = Y_{75} = x_{38} + 0.25 * (x_{39} - x_{38}) = 42.25
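The quartiles are just the 25th, 50th and 75th percentiles, so the same (n+1) rule applies; pct below is a compact restatement of the earlier sketch:

def pct(xs, p):                        # xs must already be sorted
    L = p * (len(xs) + 1) / 100
    i, f = int(L), L - int(L)
    return xs[i - 1] + f * (xs[i] - xs[i - 1])

dhawan = [0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10,
          10, 11, 13, 14, 15, 16, 23, 23, 24, 26, 29, 30, 30, 32, 33,
          35, 41, 42, 43, 46, 47, 51, 55, 60, 72, 74, 76, 80, 90, 92]
for name, p in [("Q1", 25), ("Q2", 50), ("Q3", 75)]:
    print(name, pct(dhawan, p))        # Q1 5.0, Q2 15.5, Q3 42.25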

Median is same as Q2

L_{50} = \frac{50}{100}*(n + 1) = \frac{n+1}{2}
Q_2 = Y_{50} = x_{i_p} + 0.5 * (x_{i_p+1} - x_{i_p})
Q_2 (from the percentile formula) vs the familiar median:

median = \underbrace{x_{\frac{n+1}{2}}}_{n~is~odd}~~or~~\underbrace{\frac{x_{\frac{n}{2}} + x_{\frac{n}{2} + 1}}{2}}_{n~is~even}

Are they the same?

 

(of course, they are!)

 

But why do the formulae look different?

 

Median is same as Q2

L_{50} = \frac{50}{100}*(n + 1) = \frac{n+1}{2}

Case 1: n is odd

 
L_{50}~will~be~an~integer
(\because n+1~is~even)
i_p=L_p = \frac{n+1}{2}, f_p = 0
Q_2 = Y_{50} = x_{i_p} + 0 * (x_{i_p+1} - x_{i_p}) = x_{\frac{n+1}{2}}

Case 2: n is even

L_{50}= \frac{n}{2} + \frac{1}{2} = i_p + 0.5
i_p = \frac{n}{2},~~f_p = 0.5
Q_2 = Y_{50} = x_{i_p} + 0.5 * (x_{i_p+1} - x_{i_p}) = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1} }{2}

Median is same as Q2

Q_2 = Y_{50} = x_{i_p} + 0.5 * (x_{i_p+1} - x_{i_p})
= x_{\frac{n}{2}} + 0.5*(x_{\frac{n}{2}+1} - x_{\frac{n}{2}})~~(\because i_p = \frac{n}{2}) \\~\\
= x_{\frac{n}{2}} + 0.5*x_{\frac{n}{2}+1} - 0.5*x_{\frac{n}{2}}
= 0.5*x_{\frac{n}{2}} + 0.5*x_{\frac{n}{2}+1}
= \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1} }{2}

Quintiles

Quintiles divide the data into five equal parts:

19, 28, 34, 36 | 37, 43, 44, 44 | 46, 46, 47, 48 | 51, 52, 54, 55 | 55, 56, 59, 60
(first 20% | second 20% | third 20% | fourth 20% | fifth 20% of the data)

P_1 = 20th percentile, P_2 = 40th percentile, P_3 = 60th percentile, P_4 = 80th percentile

Quintiles: Example

Shikhar Dhawan T20I scores (50 sorted scores)

 

0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10, 11, 13, 14, 15, 16, 23, 23, 24, 26, 29, 30, 30, 32, 33, 35, 41, 42, 43, 46, 47, 51, 55, 60, 72, 74, 76, 80, 90, 92

 
L_{60} = \frac{60}{100}*(50 + 1) = 30.6
P_3 = Y_{60} = x_{30} + 0.3 * (x_{31} - x_{30}) = 26.9

Similarly you can compute the other three quintiles!

 

Deciles

Deciles divide the data into ten equal parts:

19, 28 | 34, 36 | 37, 43 | 44, 44 | 46, 46 | 47, 48 | 51, 52 | 54, 55 | 55, 56 | 59, 60

D_1 = 10th percentile, D_2 = 20th percentile, ..., D_9 = 90th percentile

Deciles: Example

Shikhar Dhawan T20I scores (50 sorted scores)

 

0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10, 11, 13, 14, 15, 16, 23, 23, 24, 26, 29, 30, 30, 32, 33, 35, 41, 42, 43, 46, 47, 51, 55, 60, 72, 74, 76, 80, 90, 92

 
L_{30} = \frac{30}{100}*(50 + 1) = 15.3
D_3 = Y_{30} = x_{15} + 0.3 * (x_{16} - x_{15}) = 5.3

Similarly you can compute the other 8 deciles!

 

How to compute the percentile rank of a value in the data?

44, 43, 37, 68, 55, 46, 19, 59, 34, 46, 51, 62, 47, 52, 44, 28, 36, 56, 65, 60, 55, 66, 54, 48, 62

 

Percentile Rank

Compared to other students, how do you rate the performance of the student who scored 44?

 

What is the percentile rank of the student who scored 44?

 

OR

 

The percentile rank of a value is the percentage of data values that are less than or equal to it

Percentile Rank: Example 1

PR_{s} = \frac{c_s+0.5*f_s}{n}*100

where PR_s = \textit{percentile rank of the score s}, c_s = \textit{number of values less than s}, and f_s = \textit{number of values equal to s}

19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51, 52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66, 68

 
PR_{44} = \frac{6+0.5*2}{25}*100 = 28
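A direct Python sketch of this formula (the name percentile_rank is ours):

def percentile_rank(data, s):
    c = sum(1 for x in data if x < s)     # c_s: values strictly below s
    f = sum(1 for x in data if x == s)    # f_s: values equal to s
    return (c + 0.5 * f) / len(data) * 100

scores = [19, 28, 34, 36, 37, 43, 44, 44, 46, 46, 47, 48, 51,
          52, 54, 55, 55, 56, 59, 60, 62, 62, 65, 66, 68]
print(percentile_rank(scores, 44))        # 28.0 (up to floating point)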

Percentile Rank: Example 2

PR_{s} = \frac{c_s+0.5*f_s}{n}*100
PR_{32} = \frac{37+0.5*2}{59}*100 = 64.40

Shikhar Dhawan T20I scores (59 sorted scores)

 

0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10, 11, 13, 14, 15, 16, 19, 23, 23, 23, 24, 26, 29, 30, 30, 31, 32, 32, 33, 35, 36, 40, 41, 41, 42, 43, 46, 47, 51, 52, 55, 60, 72, 74, 76, 80, 90, 92

 

We typically round it up to the next whole number (65 in this case)

 

What is the effect of transformations on percentiles?

Transformations

Scaling and Shifting

 
x_{new} = a*x + c

(a = 5/9, c = -160/9)

 

Temperature in Fahrenheit:

 

[22.46, 23.54, 24.26, 27.86, 30.2, 30.74, 34.52, 35.96, 40.46, 44.06, 52.7, 54.68, 56.66, 57.56, 59.54, 61.52, 62.06, 65.66, 67.46, 70.88, 76.46, 82.4, 83.12, 84.38, 93.02, 94.28, 95.72, 96.44, 108.86, 109.58]

 

Temperature in Celsius:

 

[-5.3, -4.7, -4.3, -2.3, -1.0, -0.7, 1.4, 2.2, 4.7, 6.7, 11.5, 12.6, 13.7, 14.2, 15.3, 16.4, 16.7, 18.7, 19.7, 21.6, 24.7, 28.0, 28.4, 29.1, 33.9, 34.6, 35.4, 35.8, 42.7, 43.1]

 


 

Effect of transformations (on percentiles)

x_{new} = a*x + c
L_p^{new} = \frac{p}{100} (n + 1) = L_p
i_p^{new} = i_p,~~~~f_p^{new} = f_p
Y_p^{new} = x_{i_p}^{new} + f_p * (x_{i_{p+1}}^{new} - x_{i_p}^{new})
= (a*x_{i_p} + c) + f_p * ((a*x_{i_{p+1}} + c) - (a * x_{i_p} + c))
= a*x_{i_p} + c + f_p*a*(x_{i_{p+1}} - x_{i_p})
= a*(x_{i_p} + f_p*(x_{i_{p+1}} - x_{i_p})) + c
= a*Y_p + c

Effect of transformations (on percentiles)

Y_p^{new} = a*Y_p + c
L_{40} = \frac{40}{100}*(30+1) = 12.4
Y_{40} = x_{12} + 0.4 * (x_{13} - x_{12}) = 55.47
Y_{40}^{new} = \frac{5}{9} * 55.47 - \frac{160}{9} = 13.04

Temperature in Fahrenheit:

 

[22.46, 23.54, 24.26, 27.86, 30.2, 30.74, 34.52, 35.96, 40.46, 44.06, 52.7, 54.68, 56.66, 57.56, 59.54, 61.52, 62.06, 65.66, 67.46, 70.88, 76.46, 82.4, 83.12, 84.38, 93.02, 94.28, 95.72, 96.44, 108.86, 109.58]

 

Temperature in Celsius:

 

[-5.3, -4.7, -4.3, -2.3, -1.0, -0.7, 1.4, 2.2, 4.7, 6.7, 11.5, 12.6, 13.7, 14.2, 15.3, 16.4, 16.7, 18.7, 19.7, 21.6, 24.7, 28.0, 28.4, 29.1, 33.9, 34.6, 35.4, 35.8, 42.7, 43.1]
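A quick numerical check of Y_p^{new} = a*Y_p + c on the Fahrenheit data above (pct is the same sketch used earlier):

def pct(xs, p):
    xs = sorted(xs)
    L = p * (len(xs) + 1) / 100
    i, f = int(L), L - int(L)
    return xs[i - 1] + f * (xs[i] - xs[i - 1])

fahrenheit = [22.46, 23.54, 24.26, 27.86, 30.2, 30.74, 34.52, 35.96, 40.46,
              44.06, 52.7, 54.68, 56.66, 57.56, 59.54, 61.52, 62.06, 65.66,
              67.46, 70.88, 76.46, 82.4, 83.12, 84.38, 93.02, 94.28, 95.72,
              96.44, 108.86, 109.58]
a, c = 5 / 9, -160 / 9
celsius = [a * x + c for x in fahrenheit]     # the same transformation as above

y40 = pct(fahrenheit, 40)
print(y40)                        # ~55.47
print(a * y40 + c)                # ~13.04
print(pct(celsius, 40))           # ~13.04: transform-then-compute agrees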

 


 

Summary

\underbrace{~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~}_{\textit{p\% of the values in the data}}

p-th percentile

 

sorted data values

 

Percentile (definition)

How to compute percentiles?

L_p = \frac{p}{100} (n + 1)
Y_p = x_{i_p} + f_p * (x_{i_{p+1}} - x_{i_p})

(alternative methods not recommended)

 

Summary

Frequently used percentiles

How to compute percentile rank of a value?

Quartiles

Quintiles

Deciles

PR_{s} = \frac{c_s+0.5*f_s}{n}*100

What is the effect of transformations on percentiles?

x_{new} = a*x + c
Y_p^{new} = a*Y_p + c

What are the measures of spread?

Motivation: Measures of spread

Sample 1: 81, 81, 82, 82, 83, 83, 84, 84, 85, 85  Mean: 83

Median: 83

 

Sample 2: 31, 41, 61, 72, 83, 83, 94, 105, 125, 135  Mean: 83

Median: 83

 

All values are very close to the mean & median (low variability in data)

 

Some values are far from the mean & median (high variability in data)

 

Measures of centrality don't tell us anything about the spread and variability in the data

 

Measures of spread (range)

Sample 1: 81, 81, 82, 82, 83, 83, 84, 84, 85, 85  Mean: 83

Median: 83

 

Sample 2: 31, 41, 61, 72, 83, 83, 94, 105, 125, 135  Mean: 83

Median: 83

 

Range clearly tells us that the second sample has more variability/spread than the first

 

Range: (max value - min value)  = 85 - 81 = 4

 

Range:  (max value - min value)  = 135 - 31 = 104

 

Measures of spread (range)

Farm yields of wheat (in bushels): 40.1, 40.9, 41.8, 44, 46.8, 47.2, 48.6, 49.3, 49.4, 51.9, 53.8, 55.9, 57.3, 58.1, 60.2, 60.7, 61.1, 61.4, 62.8, 633

 

Just like the mean, the range is very sensitive to outliers!

 
 


 

 

Range: (max value - min value)  = 633 - 40.1 = 592.9

 

Note that most values lie between 40 and 63; the range, however, is blown up by one outlier (633)

40.1, 40.9, 41.8, 44, 46.8 | 47.2, 48.6, 49.3, 49.4, 51.9 | 53.8, 55.9, 57.3, 58.1, 60.2 | 60.7, 61.1, 61.4, 62.8, 633
(first 25% | second 25% | third 25% | last 25% of the data)

Q_1 = 25th percentile, Q_3 = 75th percentile
 


 
Inter~Quartile~Range~(IQR) = Q_3 - Q_1

Measures of spread (IQR)

 


 
Inter~Quartile~Range
IQR = Q_3 - Q_1
L_{25} = \frac{25}{100}*(20 + 1) = 5.25
Q_{1} = Y_{25} = x_{5} + 0.25(x_{6} - x_{5}) = 46.9
L_{75} = \frac{75}{100}*(20 + 1) = 15.75
Q_{3} = Y_{75} = x_{15} + 0.75(x_{16} - x_{15}) = 60.575
IQR = 60.575 - 46.9 = 13.675


Measures of spread (IQR)

Clearly not very sensitive to outliers (i.e., it barely changes if we drop the outlier)

Dropping the outlier (633) leaves n = 19:

L_{75} = \frac{75}{100}*(19 + 1) = 15
L_{25} = \frac{25}{100}*(19 + 1) = 5
IQR^{new} = x_{15} - x_5 = 60.2 - 46.8 = 13.4

(compared with IQR = 13.675 when the outlier is included)

40.1, 40.9, 41.8, 44, 46.8, 47.2, 48.6, 49.3, 49.4, 51.9, 53.8, 55.9, 57.3, 58.1, 60.2, 60.7, 61.1, 61.4, 62.8, 633
 

Measures of spread (variance)

Question: How different are the values in the data from the typical value (mean) in the data?

 

Possible Solution: Compute the sum or average deviation of all points from the mean

 
\sum_{i=1}^{n}(x_{i} - \bar{x})

Issue: We already know that the sum of deviations from the mean is 0

 
\sum_{i=1}^{n}(x_{i} - \bar{x}) = 0

Motivation: Measures of spread

Sample 1: 81, 81, 82, 82, 83, 83, 84, 84, 85, 85  Mean: 83

Median: 83

 

Sample 2: 31, 41, 61, 72, 83, 83, 94, 105, 125, 135  Mean: 83

Median: 83

 

Sum of deviations = 0

 

The sum of deviations does not tell us anything about the spread of the data

 

Sum of deviations = 0

 

Measures of spread (variance)

Observation: We do not care about the sign of the deviation (Both positive and negative deviations contribute to the spread in the data and hence we do want them to cancel each other)

 

Mean: 6.0

Deviations on the left side = Deviations on the right side

(figure: the data on a number line, with the deviations on either side of the mean balancing out)

Measures of spread (variance)

Issue: The sum of deviations from the mean is 0

 
\frac{1}{n}\sum_{i=1}^{n}(x_{i} - \bar{x}) = 0

Reason: The positive deviations cancel the negative deviations

 

Solution 1: Use absolute values

\frac{1}{n}\sum_{i=1}^{n}|x_{i} - \bar{x}|

Solution 2: Use squared values

\frac{1}{n}\sum_{i=1}^{n}(x_{i} - \bar{x})^2

(preferred solution)

Measures of spread (variance)

Variance:

 

(if computed from a sample)

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i} - \bar{x})^2

(if computed from the entire population)

\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_{i} - \mu)^2

Why is there a difference in the formula?

 

We will clarify this later once we introduce probability theory
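A sketch of the sample variance computed from first principles, cross-checked against Python's standard statistics module:

import statistics

def sample_variance(xs):
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)   # note the n-1

sample1 = [81, 81, 82, 82, 83, 83, 84, 84, 85, 85]
sample2 = [31, 41, 61, 72, 83, 83, 94, 105, 125, 135]
print(sample_variance(sample1), statistics.variance(sample1))  # ~2.22 both
print(sample_variance(sample2), statistics.variance(sample2))  # ~1127.33 both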

 

Measures of spread (variance)

The two tables below show x, |x - \bar{x}| and (x - \bar{x})^2 for each of the two samples (both have \bar{x} = 83):

Sample 1:
x:                81   81   82   82   83   83   84   84   85   85
|x - \bar{x}|:     2    2    1    1    0    0    1    1    2    2
(x - \bar{x})^2:   4    4    1    1    0    0    1    1    4    4

Sample 2:
x:                31    41    61    72   83   83   94    105   125   135
|x - \bar{x}|:    52    42    22    11    0    0   11    22    42    52
(x - \bar{x})^2:  2704  1764  484   121   0    0   121   484   1764  2704

The deviations, and especially the squared deviations, are far larger for Sample 2, reflecting its much higher spread.

Measures of spread (variance)

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i} - \bar{x})^2

Observation: Variance is not measured in the same units as the data

 
if~the~unit~of~data~is~km
then~the~unit~of~variance~is~km^2

Measures of spread (standard deviation)

s=\sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i} - \bar{x})^2}

(if computed from a sample)

Observation: The standard deviation is measured in the same units as the data

(if computed from the entire population)

\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_{i} - \mu)^2}

Recap of notations

            Sample       Population
Size        n            N
Mean        \bar{x}      \mu
Variance    s^2          \sigma^2
Std. dev.   s            \sigma

A slight detour ... ...

Why do we square the deviations?

(figure: |x - \bar{x}| and (x - \bar{x})^2 plotted on one set of axes, with x - \bar{x} on the x-axis ranging from -4 to 4)

 

Reason1: The square function has better properties than the absolute function

 

Observation 1: The square function is a smooth function and hence differentiable everywhere

Observation 2: The absolute function is not differentiable at x_i - \bar{x} = 0

Why do we square the deviations?

Reason1: The square function has better properties than the absolute function

 

Why do we care about differentiability?

 

In many applications (especially in ML) we need functions which are differentiable

 

(same plot as on the previous slide)

 

Why do we square the deviations?

Reason 2: The square function magnifies the contribution of outliers

 

Why do we want to magnify the contribution of outliers?

 

(same plot as on the previous slide)

 

Why do we square the deviations?

Reason 2: The square function magnifies the contribution of outliers

 

Why do we want to magnify the contribution of outliers?

 

 0.1, 0.2, 0.3, 0.3, 0.5, 0.1, 0.4, 0.2, 0.6, 10.2 

Example: Toxic content in a fertiliser

 
\bar{x} = 1.29
avg.~abs.~dev. = \frac{1}{n}\sum_{i=1}^{n}|x_{i} - \bar{x}| = 1.78
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i} - \bar{x})^2 = 9.83

The outlier (10.2) inflates the variance far more than the average absolute deviation (see the sketch below).
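A small sketch comparing the two measures on this data; the squared deviations let the single outlier dominate:

xs = [0.1, 0.2, 0.3, 0.3, 0.5, 0.1, 0.4, 0.2, 0.6, 10.2]
m = sum(xs) / len(xs)                                  # ~1.29
aad = sum(abs(x - m) for x in xs) / len(xs)            # average absolute deviation, ~1.78
var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)    # sample variance, ~9.83
print(m, aad, var)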

What does the variance tell us about the data?

Variance: a measure of consistency

Alistair Cook: 2, 7, 7, 10, 14, 16, 37, 39, 244

 

Joe Root: 1, 9, 14, 15, 51, 58, 61, 67, 83

 

Ashes 2017-18 series (runs scored)

 

Cook: Mean = 41.78

Root: Mean = 39.89

 

Does the variance capture this?

 

Observation: Root was more consistent than Cook

 

Variance: a measure of consistency

Computing the variance and standard deviation for both (see the sketch below):

Cook: s^2 \approx 5921.4,~~s \approx 76.95
Root: s^2 \approx 905.9,~~s \approx 30.10
 

Indeed the lower variance shows that Root was more consistent than Cook
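A quick check in Python (the statistics module is in the standard library):

import statistics

cook = [2, 7, 7, 10, 14, 16, 37, 39, 244]
root = [1, 9, 14, 15, 51, 58, 61, 67, 83]
for name, runs in [("Cook", cook), ("Root", root)]:
    print(name, round(statistics.mean(runs), 2), round(statistics.stdev(runs), 2))
# Cook 41.78 76.95  -- high spread: one huge innings of 244
# Root 39.89 30.1   -- much lower spread: more consistent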

 

Variance: a measure of consistency

The primary objective of manufacturing industries is to ensure that there is little variance in their products

 

Desirable to have almost 0 variance in

 

length of sleeves

radius of tyres

weight of dumbbells

... ... ...

What is the effect of transformations on measures of spread?

Transformations

Scaling and Shifting

 
x_{new} = a*x + c

(a = 5/9, c = -160/9)

 

\degree F: [22.46, 23.54, 24.26, 27.86, 30.2, 30.74, 34.52, 35.96, 40.46, 44.06, 52.7, 54.68, 56.66, 57.56, 59.54, 61.52, 62.06, 65.66, 67.46, 70.88, 76.46, 82.4, 83.12, 84.38, 93.02, 94.28, 95.72, 96.44, 108.86, 109.58]

\degree C: [-5.3, -4.7, -4.3, -2.3, -1.0, -0.7, 1.4, 2.2, 4.7, 6.7, 11.5, 12.6, 13.7, 14.2, 15.3, 16.4, 16.7, 18.7, 19.7, 21.6, 24.7, 28.0, 28.4, 29.1, 33.9, 34.6, 35.4, 35.8, 42.7, 43.1]


 
range = max - min
range_{new} = max_{new} - min_{new}
= (a*max + c) - (a*min + c)
= a*(max - min)
= a*range

(assuming a > 0; for negative a the factor is |a|)

Effect of transformations (on range)

\degree F: [22.46, 23.54, 24.26, 27.86, 30.2, 30.74, 34.52, 35.96, 40.46, 44.06, 52.7, 54.68, 56.66, 57.56, 59.54, 61.52, 62.06, 65.66, 67.46, 70.88, 76.46, 82.4, 83.12, 84.38, 93.02, 94.28, 95.72, 96.44, 108.86, 109.58]

\degree C: [-5.3, -4.7, -4.3, -2.3, -1.0, -0.7, 1.4, 2.2, 4.7, 6.7, 11.5, 12.6, 13.7, 14.2, 15.3, 16.4, 16.7, 18.7, 19.7, 21.6, 24.7, 28.0, 28.4, 29.1, 33.9, 34.6, 35.4, 35.8, 42.7, 43.1]
range = 109.58 - 22.46 = 87.12
range_{new} = \frac{5}{9} * 87.12 = 48.4

Effect of transformations (on range)


 

Recap: 

 
x_{new} = a*x + c
Y_p^{new} = a*Y_p + c
Q_1^{new}=Y_{25}^{new} = a*Y_{25} + c = a*Q_1 + c
Q_3^{new}=Y_{75}^{new} = a*Y_{75} + c = a*Q_3 + c
IQR^{new}=Q_3^{new} - Q_1^{new}
IQR^{new}=(a*Q_3 + c) - (a*Q_1 + c)
IQR^{new}=a*(Q_3 - Q_1) = a*IQR

Effect of transformations (on IQR)

Effect of transformations (on IQR)

\degree F: [22.46, 23.54, 24.26, 27.86, 30.2, 30.74, 34.52, 35.96, 40.46, 44.06, 52.7, 54.68, 56.66, 57.56, 59.54, 61.52, 62.06, 65.66, 67.46, 70.88, 76.46, 82.4, 83.12, 84.38, 93.02, 94.28, 95.72, 96.44, 108.86, 109.58]

\degree C: [-5.3, -4.7, -4.3, -2.3, -1.0, -0.7, 1.4, 2.2, 4.7, 6.7, 11.5, 12.6, 13.7, 14.2, 15.3, 16.4, 16.7, 18.7, 19.7, 21.6, 24.7, 28.0, 28.4, 29.1, 33.9, 34.6, 35.4, 35.8, 42.7, 43.1]
IQR = Q_3 - Q_1 = 83.435 - 35.6 = 47.835
IQR_{new} = \frac{5}{9} * 47.835 = 26.575


 

Exercise: Compute the first and third quartile for the transformed data and verify that the new IQR is indeed 26.575 (a sketch follows below)
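A sketch that both computes the numbers above and verifies IQR_new = a*IQR (pct is the same sketch used earlier):

def pct(xs, p):
    xs = sorted(xs)
    L = p * (len(xs) + 1) / 100
    i, f = int(L), L - int(L)
    return xs[i - 1] + f * (xs[i] - xs[i - 1])

def iqr(xs):
    return pct(xs, 75) - pct(xs, 25)

fahrenheit = [22.46, 23.54, 24.26, 27.86, 30.2, 30.74, 34.52, 35.96, 40.46,
              44.06, 52.7, 54.68, 56.66, 57.56, 59.54, 61.52, 62.06, 65.66,
              67.46, 70.88, 76.46, 82.4, 83.12, 84.38, 93.02, 94.28, 95.72,
              96.44, 108.86, 109.58]
a, c = 5 / 9, -160 / 9
celsius = [a * x + c for x in fahrenheit]
print(iqr(fahrenheit))            # ~47.835
print(a * iqr(fahrenheit))        # ~26.575
print(iqr(celsius))               # ~26.575 -- matches a * IQR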

 


 

Effect of transformations (on variance)

s^2_{new} = \frac{1}{n-1}\sum_{i=1}^n (x_i^{new} - \bar{x}^{new})^2
= \frac{1}{n-1}\sum_{i=1}^n ((a*x_i + c) - (a*\bar{x} + c))^2
= \frac{1}{n-1}\sum_{i=1}^n [a(x_i - \bar{x})]^2
= a^2*\underbrace{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}_{s^2}
= a^2s^2

Effect of transformations (on std. dev.)

s^2_{new} = a^2*s^2
s_{new}=\sqrt{s^2_{new}} = \sqrt{a^2*s^2} = a*s~~~(\textit{for}~a>0;~\textit{in general}~|a|*s)

Effect of transformations (on variance)

\degree F: [22.46, 23.54, 24.26, 27.86, 30.2, 30.74, 34.52, 35.96, 40.46, 44.06, 52.7, 54.68, 56.66, 57.56, 59.54, 61.52, 62.06, 65.66, 67.46, 70.88, 76.46, 82.4, 83.12, 84.38, 93.02, 94.28, 95.72, 96.44, 108.86, 109.58]

\degree C: [-5.3, -4.7, -4.3, -2.3, -1.0, -0.7, 1.4, 2.2, 4.7, 6.7, 11.5, 12.6, 13.7, 14.2, 15.3, 16.4, 16.7, 18.7, 19.7, 21.6, 24.7, 28.0, 28.4, 29.1, 33.9, 34.6, 35.4, 35.8, 42.7, 43.1]

Exercise: Compute the variance and standard deviation for the original and transformed data and verify that:

 
s^2_{new} = a^2*s^2~~~and~~~s_{new} = a*s
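A numerical check on a synthetic sample (any data would do; random.gauss just generates one):

import random
import statistics

random.seed(0)                                  # reproducible sample
xs = [random.gauss(50, 10) for _ in range(100)]
a, c = 5 / 9, -160 / 9
ys = [a * x + c for x in xs]
print(statistics.variance(ys), a ** 2 * statistics.variance(xs))  # equal up to fp error
print(statistics.stdev(ys), a * statistics.stdev(xs))             # equal up to fp error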


 

Summary

 

Effect of the transformation x_{new} = a*x + c

Measures of centrality:

\bar{x}_{new} = a*\bar{x} + c
median_{new} = a*median + c
mode_{new} = a*mode + c

Measures of spread:

range_{new}= a*range
IQR_{new} = a*IQR
s^2_{new} = a^2*s^2
s_{new} = a*s

How do you use mean and variance to standardise data?

Standardising data

Question: How many standard deviations away from the mean is a given value x?

Intuition: Instead of expressing distances in absolute values, express them in units of standard deviations

Mean: 6.0, Std. dev: 2.97

(figure: the data on a number line, with markers at -2 SD, -1 SD, +1 SD and +2 SD around the mean)


 

Standardising data

If \bar{x} is the mean and s is the std. dev., then the point which is one std. dev. away from the mean is

x = \bar{x} + s

what about 2 standard deviations? x = \bar{x} + 2*s

what about z standard deviations? x = \bar{x} + z*s

Standardising data

We can express any point in the data as

x_i = \bar{x} + z_i*s

where

z_i = \frac{x_i - \bar{x}}{s}

z_i is called the z-score and tells us the number of standard deviations that the point is away from the mean
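A sketch standardising one column of data (the income column, in lakhs, taken from the patient-data example a few slides ahead):

import statistics

incomes = [2, 5, 10, 8, 4, 7, 1, 20, 7, 3]
m, s = statistics.mean(incomes), statistics.stdev(incomes)
z = [(x - m) / s for x in incomes]
print([round(v, 2) for v in z])         # -0.86, -0.31, 0.6, 0.24, ...
print(round(statistics.mean(z), 12))    # 0.0 (up to floating point)
print(round(statistics.stdev(z), 12))   # 1.0 (up to floating point)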

Standardising data

(table: each x_i alongside its z-score z_i = (x_i - \bar{x})/s, with the mean and std. dev. of the data shown at the bottom)
 

Standardising data (Usage in ML)

A sample of 10 patients with 3 attributes: annual income (INR), weight (kg) and height (feet), fed to an ML system that predicts Health Risk / No Health Risk. The range of each column is written below it.

 

Income (INR)   Weight (Kgs)   Height (Feet)
2L             65             5.8
5L             60             5.5
10L            75             6.0
8L             70             5.3
4L             54             5.2
7L             60             5.3
1L             50             5.3
20L            72             6.2
7L             82             6.1
3L             67             5.9

Range:   20L - 1L = 19L   |   82 - 50 = 32 Kgs   |   6.2 - 5.2 = 1.0 ft

(diagram: ML System taking Income, Weight and Height as inputs and predicting Health Risk / No Health Risk)

Standardising data (Usage in ML)

In addition to the table on the previous slide, the same three columns in standardised form:

Income (INR)  Weight (Kgs)  Height (Feet)  |  z(Income)  z(Weight)  z(Height)
2L            65            5.8            |  -0.86      -0.05       0.37
5L            60            5.5            |  -0.31      -0.56      -0.42
10L           75            6.0            |   0.60       0.97       0.89
8L            70            5.3            |   0.24       0.46      -0.95
4L            54            5.2            |  -0.49      -1.18      -1.21
7L            60            5.3            |   0.05      -0.56      -0.95
1L            50            5.3            |  -1.04      -1.59      -0.95
20L           72            6.2            |   2.44       0.66       1.42
7L            82            6.1            |   0.05       1.69       1.16
3L            67            5.9            |  -0.68       0.15       0.63

What is the mean and standard deviation of the standardised data?

Standardising data (Usage in ML)

What is the mean and standard deviation of the standardised data?

All three columns have Mean: 0 and Std: 1 (one Std shows as 0.9998 only because the z-scores were rounded to two decimals)

Is this always the case?

Standardising data (Usage in ML)

Prove that the mean of the standardised data is 0

 

Proof:

 
z_i = \frac{x_i - \bar{x}}{s}
\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i
\bar{z} = \frac{1}{n}\sum_{i=1}^{n} \frac{x_i - \bar{x}}{s}
\bar{z} = \frac{1}{s}(\frac{1}{n}\sum_{i=1}^{n} x_i - \frac{1}{n}\sum_{i=1}^{n}\bar{x})
\bar{z} = \frac{1}{s}(\bar{x} - \frac{1}{n}*n*\bar{x}) = 0

Standardising data (Usage in ML)

Prove that the standard deviation of the standardised data is 1

 

Proof:

 
z_i = \frac{x_i - \bar{x}}{s}
s^2_z = \frac{1}{n-1}\sum_{i=1}^{n} (z_i - \bar{z})^2 = \frac{1}{n-1}\sum_{i=1}^{n} z_i^2
s^2_z = \frac{1}{n-1}\sum_{i=1}^{n} (\frac{x_i - \bar{x}}{s})^2
s^2_z = \frac{1}{s^2}*\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
s^2_z = \frac{1}{s^2}*s^2 = 1

Summary (measures of spread)

Measures of spread

 
range = max - min
IQR = Q_3 - Q_1
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2~~,~~\sigma^2=\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}~~,~~\sigma=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}

Except IQR all measures are sensitive to outliers

 

Measure of consistency in the data

 

Except variance, all measures have the same unit as the original data

 

Summary (measures of spread)

Effect of transformations

 
x_{new} = a*x + c
s_{new} = a*s
s^2_{new} = a^2*s^2
IQR_{new} = a*IQR
range_{new}= a*range

Standardising data

 

After standardising the data has zero mean and unit variance

 
z_i = \frac{x_i - \bar{x}}{s}

What are box plots?

Box plots are used for visualising spread, median and outliers in the data

 

Box Plots

We say that x is an outlier if

x < Q_1 - 1.5 * IQR

or if

x > Q_3 + 1.5 * IQR

(a formal definition of outliers)
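A sketch that flags outliers using these fences (pct is the same sketch used earlier; the name outliers is ours):

def pct(xs, p):
    xs = sorted(xs)
    L = p * (len(xs) + 1) / 100
    i, f = int(L), L - int(L)
    return xs[i - 1] + f * (xs[i] - xs[i - 1])

def outliers(data):
    q1, q3 = pct(data, 25), pct(data, 75)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

dhawan59 = [0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10,
            10, 11, 13, 14, 15, 16, 19, 23, 23, 23, 24, 26, 29, 30, 30, 31,
            32, 32, 33, 35, 36, 40, 41, 41, 42, 43, 46, 47, 51, 52, 55, 60,
            72, 74, 76, 80, 90, 92]
print(outliers(dhawan59))   # [] -- no outliers, as on the next slide

For plotting, matplotlib's plt.boxplot(data) draws such a plot and uses the same 1.5 * IQR whisker convention by default, though its quartile rule may differ slightly from the (n+1) rule used here.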

 

Box Plots

(figure: a box plot; the box spans the IQR with the median marked inside it, the whiskers extend 1.5 * IQR beyond the box, and points beyond the whiskers are outliers)

Box Plots

Shikhar Dhawan T20I scores (59 sorted scores)

 

0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10, 11, 13, 14, 15, 16, 19, 23, 23, 23, 24, 26, 29, 30, 30, 31, 32, 32, 33, 35, 36, 40, 41, 41, 42, 43, 46, 47, 51, 52, 55, 60, 72, 74, 76, 80, 90, 92

 

Median = 23

 

Q1 = 5, Q3 = 41

 

IQR = 36

 

Q1 - 1.5 IQR = -49

 

Q3 + 1.5 IQR = 95

 

(no outliers on either side)

 

Box Plots

(figure: box plot of the above data)

Median = 23; Q1 = 5, Q3 = 41; IQR = 36; Q1 - 1.5 IQR = -49; Q3 + 1.5 IQR = 95

Box Plots (Variant 1)

Median = 32

 

Q1 = 27.75, Q3 = 38.25

 

IQR = 10.5

 

Q1 - 1.5 IQR = 12

 

Q3 + 1.5 IQR = 54

 

max value = 45

min value = 16

Box Plots (Variant 1)

Median = 32

Q1 = 27.75, Q3 = 38.25

IQR = 10.5

Q1 - 1.5 IQR = 12

Q3 + 1.5 IQR = 54

max value = 45

min value = 16

If there are no outliers, move the whisker to the max/min value

(figure: Variant 1 box plot of this data, with the whiskers at the min and max)

Box Plots (Variant 2)

(same data as the previous slide)

Median = 32

Q1 = 27.75, Q3 = 38.25

IQR = 10.5

Q1 - 1.5 IQR = 12

Q3 + 1.5 IQR = 54

max value = 45

min value = 16

five number summary: min, Q1, median, Q3, max

(figure: Variant 2 box plot of this data, with the five numbers marked on the plot)

Box Plots (Variant 3)

(same data as the previous slide)

Median = 32

Q1 = 27.75, Q3 = 38.25

Q1 - 1.5 IQR = 12

Q3 + 1.5 IQR = 54

max value (excluding outliers) = 45

min value (excluding outliers) = 16

Move the whisker to the max/min value obtained after excluding all outliers

(figure: Variant 3 box plot of this data)

(figure: box plots for left-skewed, symmetric and right-skewed data, with the position of the median highlighted in each)

Box Plots: skewness in the data

(diagram: an ML system classifying reviews as positive or negative, as in an earlier lecture)

 

Box Plots (Usage in ML)

Evaluation:

 
p_1, p_2, p_3, ...., p_{200}
n_1, n_2, n_3, ...., n_{200}

(scores assigned by the system to + reviews)

 

(scores assigned by the system to - reviews)

 

(diagram: a sample review fed to the ML System, which assigns it a score)

Box Plots (Usage in ML)

Ideal box plot

 

(No overlap in the IQR of the scores assigned to positive and negative reviews)

 


 

Box Plots (Usage in ML)

Box plots can be used to visually compare the performance of ML systems

 

 - M1 does poorly: high overlap in the IQRs of the scores for the +ve and -ve classes

 - M2 does poorly: many outliers coinciding with the opposite class's IQR

 - M3 does well in separating the +ve and -ve classes

(figure: box plots comparing the score distributions that the three systems assign to positive and negative reviews)

Measures of Centrality

Age   Height   Weight   Cholesterol   Sugar level   ...
32    165      75       124           108           ...
24    172      81       112           98            ...
...   ...      ...      ...           ...           ...

Which is the 7th most grown crop in the country?

 

Hard to answer

 

Nominal attributes

Recall

 

Question: What is the typical value of an attribute in our dataset?

How many runs does Sachin Tendulkar typically score in a match?

 

How many balls does Sachin Tendulkar typically face in a match?

 


Motivation: Summarise Big Data

(same table and questions as on the previous slide)

Frequency Plots in ML

Typically we feed words from the review to the ML system

Which words to give as input?

Use of histograms in ML

1. Identifying discriminatory features

2. Designing features for a ML system

(diagram: an ML system predicting Health Risk / No Health Risk from Income, Height and Weight)