Recap: List of Topics

Descriptive Statistics:

Different types of data

Different types of plots

Measures of centrality and spread

Probability Theory:

Sample spaces, events, axioms

Discrete and continuous RVs

Bernoulli, Uniform, Normal dist.

Inferential Statistics:

Sampling strategies

Interval Estimators

Hypothesis testing (z-test, t-test)

ANOVA, Chi-square test

Linear Regression

Learning Objectives

Why do we need measures of centrality and spread?

What are the different measures of centrality?

What are some characteristics of these measures?

What do the measures of centrality look like for different types of distributions?

How do you compute these measures from histograms?

What is the effect of certain transformations on these measures?

Why do we need measures of centrality and spread?

Motivation: Summarise Big Data

Age Height Weight Cholesterol Sugar level .... .....
32 165 75 124 108 ...
24 172 81 112 98 ...
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...

Imagine a million such rows!

Drawing plots can give a good visual summary

 

In some situations, we want an even more succinct summary (say, a single/few numbers)

 

Motivation: Summarise Big Data

Recall

A parameter is any numeric property of the entire population under study

A statistic is any numeric property of a sample drawn from the population (used as an estimate of the corresponding population parameter)

 

Motivation: Summarise Big Data

Use summary statistics for quantitative data

Measures of centrality (mean, median, mode)

 

Percentiles (quartiles, quintiles, deciles)

 

Measures of spread (range, IQR, variance, standard deviation)

 

What are the different measures of centrality?

Measures of Centrality

Question: What is the typical value of an attribute in our dataset?

How many runs does Sachin Tendulkar typically score in a match?

 

How many balls does Sachin Tendulkar typically face in a match?

 

Measures of Centrality: mean

Notation:

n data points: x_1, x_2, x_3, \dots, x_n

Mean of sample: \bar{x}

Mean of population: \mu

x_1 = 0, x_2 = 0, x_3 = 36, x_4 = 10, x_5 = 20, x_6 = 19, x_7 = 31, x_8 = 36, x_9 = 53, x_{10} = 30, x_{11} = 0, x_{12} = 4, x_{13} = 53, x_{14} = 52, x_{15} = 22, \dots, x_{452} = 52

Measures of Centrality: mean

\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n}x_i

Measures of Centrality: mean

\bar{x} = \frac{x_1 + x_2 + x_3 + x_4 + \dots + x_{452}}{452} = \frac{0 + 0 + 36 + 10 + \dots + 52}{452} = 40.76
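The formula for the mean can be sketched in a few lines of Python. This is only an illustration: the function name `mean` and the variable `first_scores` are assumptions, and only the first 15 scores shown on the slide are used because the full list of 452 values is not given.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


# First 15 scores listed on the slide (x_1 to x_15); the remaining
# scores up to x_452 are elided in the source, so they are not used here.
first_scores = [0, 0, 36, 10, 20, 19, 31, 36, 53, 30, 0, 4, 53, 52, 22]

print(round(mean(first_scores), 2))
```

The result is the mean of these 15 values only, so it will not match the 40.76 computed over all 452 matches.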

Measures of Centrality: median

Shikhar Dhawan T20I scores (59 matches)

 

5, 32, 30, 0, 1, 33, 3, 11, 5, 42, 26, 9, 51, 46, 2, 1, 16, 60, 1, 6, 23, 13, 23, 15, 2, 80, 1, 6, 72, 24, 47, 90, 55, 8, 35, 10, 74, 4, 10, 5, 3, 43, 92, 76, 41, 29, 30, 5, 14, 1, 23, 3, 40, 36, 41, 31, 19, 32, 52

 

Median is the value which appears at the centre of the data when the data is sorted.

 

Measures of Centrality: median

Shikhar Dhawan T20I scores (59 sorted scores)

 
center~location = \frac{n+1}{2} = \frac{59+1}{2} = 30 \quad (\because n = 59~is~odd)

0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10, 11, 13, 14, 15, 16, 19, 23, 23, 23, 24, 26, 29, 30, 30, 31, 32, 32, 33, 35, 36, 40, 41, 41, 42, 43, 46, 47, 51, 52, 55, 60, 72, 74, 76, 80, 90, 92

 

Measures of Centrality: median

Shikhar Dhawan T20I scores (59 sorted scores)

 

First 29 elements: 0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10, 11, 13, 14, 15, 16, 19, 23

Mid-point (30th element): 23

Last 29 elements: 23, 24, 26, 29, 30, 30, 31, 32, 32, 33, 35, 36, 40, 41, 41, 42, 43, 46, 47, 51, 52, 55, 60, 72, 74, 76, 80, 90, 92

There are an equal number of elements on either side of the central location

Measures of Centrality: median

Shikhar Dhawan T20I scores (59 sorted scores)

 

When n is odd, the median is the value at the central location (or mid-point) which is 23 in this case

 


Measures of Centrality: median

Shikhar Dhawan T20I scores (50 sorted scores)

 

What happens when n is even? (say, if we had data for only 50 T20Is)

 

0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10, 11, 13, 14, 15, 16, 23, 23, 24, 26, 29, 30, 30, 32, 33, 35, 41, 42, 43, 46, 47, 51, 55, 60, 72, 74, 76, 80, 90, 92

 

Measures of Centrality: median

Shikhar Dhawan T20I scores (50 sorted scores)

 
First 24 elements: 0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10, 11, 13, 14

2 mid-points (25th and 26th elements): 15, 16

Last 24 elements: 23, 23, 24, 26, 29, 30, 30, 32, 33, 35, 41, 42, 43, 46, 47, 51, 55, 60, 72, 74, 76, 80, 90, 92

There are now two mid-points, with the same number of elements on either side of them

 

Measures of Centrality: median

Shikhar Dhawan T20I scores (50 sorted scores)

 
center~locations = \frac{n}{2}~and~\frac{n}{2}+1 = 25~and~26 \quad (\because n = 50~is~even)

Measures of Centrality: median

When n is even, the median is the average of the values at the two central locations (or mid-points) which is (15 + 16)/2 = 15.5 in this case

 


Measures of Centrality: median

Summary

Data:~x_1, x_2, x_3, \dots, x_n (sorted in ascending order)

if~n~is~odd: median = x_{\frac{n+1}{2}} (the element at position \frac{n+1}{2})

if~n~is~even: median = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} (the mean of the elements at positions \frac{n}{2} and \frac{n}{2}+1)
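The two cases in the summary can be sketched as a small Python function. The function name `median` and the variable names are assumptions; the two lists are the sorted 59-score and 50-score examples from the slides. Note that the slide formulas use 1-indexed positions while Python lists are 0-indexed.

```python
def median(values):
    """Median of a dataset: the middle value of the sorted data (odd n),
    or the mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # 1-indexed position (n+1)/2 is 0-indexed position n//2
        return ordered[mid]
    # 1-indexed positions n/2 and n/2 + 1 are 0-indexed mid-1 and mid
    return (ordered[mid - 1] + ordered[mid]) / 2


scores_59 = [0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9,
             10, 10, 11, 13, 14, 15, 16, 19, 23, 23, 23, 24, 26, 29, 30,
             30, 31, 32, 32, 33, 35, 36, 40, 41, 41, 42, 43, 46, 47, 51,
             52, 55, 60, 72, 74, 76, 80, 90, 92]
scores_50 = [0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10,
             10, 11, 13, 14, 15, 16, 23, 23, 24, 26, 29, 30, 30, 32, 33,
             35, 41, 42, 43, 46, 47, 51, 55, 60, 72, 74, 76, 80, 90, 92]

print(median(scores_59))  # odd n: the 30th element
print(median(scores_50))  # even n: mean of the 25th and 26th elements
```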

Measures of Centrality: mode

The mode is defined as the most frequently occurring value in the dataset

 

Mode = 1

 

Shikhar Dhawan T20I scores (59 sorted scores)

 

0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10, 11, 13, 14, 15, 16, 19, 23, 23, 23, 24, 26, 29, 30, 30, 31, 32, 32, 33, 35, 36, 40, 41, 41, 42, 43, 46, 47, 51, 52, 55, 60, 72, 74, 76, 80, 90, 92

 

Measures of Centrality: mode

Single mode: (only 1 most frequent value)

5, 32, 30, 0, 1, 33, 3, 11, 5, 42, 26, 9, 51, 46, 2, 1, 16, 60, 1, 6, 23, 13, 23, 15, 2, 80, 1, 6, 72, 24, 47, 90, 55, 8, 35, 10, 74, 4, 10, 5, 3, 43, 92, 76, 41, 29, 30, 5, 14, 1, 23, 3, 40, 36, 41, 31, 19, 32, 52

the mode is 1

Multiple modes: (more than 1 most frequent value)

1,2,2,2,3,4,5,5,5,5,5,6,6,7,7,12,12,13,14,15,15,15,15,15,17,18,19,19

the modes are 5 & 15

No modes: (all values appear exactly once)

1,2,3,7,8,10,13,23,32,43,55,61,65,68,77,85,91,93
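The three cases (single mode, multiple modes, no mode) can be sketched with a frequency count. The function name `modes` is an assumption, and returning an empty list when every value appears exactly once follows the "no modes" convention used on the slide rather than any library's behaviour.

```python
from collections import Counter


def modes(values):
    """Return all most frequent values, sorted; an empty list means no mode."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Every value appears exactly once: treated as "no mode" here
        return []
    return sorted(v for v, c in counts.items() if c == highest)


single = [0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9,
          10, 10, 11, 13, 14, 15, 16, 19, 23, 23, 23, 24, 26, 29, 30,
          30, 31, 32, 32, 33, 35, 36, 40, 41, 41, 42, 43, 46, 47, 51,
          52, 55, 60, 72, 74, 76, 80, 90, 92]
multiple = [1, 2, 2, 2, 3, 4, 5, 5, 5, 5, 5, 6, 6, 7, 7, 12, 12, 13, 14,
            15, 15, 15, 15, 15, 17, 18, 19, 19]
none = [1, 2, 3, 7, 8, 10, 13, 23, 32, 43, 55, 61, 65, 68, 77, 85, 91, 93]

print(modes(single), modes(multiple), modes(none))
```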

 

Measures of Centrality: summary

Summary

 

 

Median is the value which appears at the centre of the data when the data is sorted (slight difference when n is odd v/s when n is even)

 

 

Mean is the sum of all the elements in the data divided by the total number of elements

 

 

Mode is the most frequent value appearing in the data

 

What are some characteristics of measures of centrality?

Mean is the centre of gravity

Data:~x_1, x_2, x_3, \dots, x_n
Mean:\bar{x}

The deviation of a point from the mean is defined as the difference between this point and the mean

 
Deviation:x_i - \bar{x}

Mean is the centre of gravity

The sum of the deviations of all points from the mean is 0

sum of deviations = \sum_{i=1}^{n} (x_i -\bar{x})
= (x_1 - \bar{x}) + (x_2 - \bar{x}) + \dots + (x_n - \bar{x})
= (x_1 + x_2 + x_3 + \dots + x_n) - (\bar{x} + \bar{x} + \dots n~times)
= \sum_{i=1}^{n} x_i - n\bar{x}
= \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} x_i = 0 \quad (\because \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i)
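The zero-sum property of the deviations can be checked numerically. The dataset below is a small illustrative one chosen for this sketch (it is not from the slides), and the variable names are assumptions.

```python
import math

# Illustrative data chosen for this sketch (not from the slides); mean is 6.0
data = [1, 4, 5, 6, 14]

x_bar = sum(data) / len(data)
deviations = [x - x_bar for x in data]

# Deviations to the left of the mean are negative, those to the right positive
left_total = -math.fsum(d for d in deviations if d < 0)
right_total = math.fsum(d for d in deviations if d > 0)
total = math.fsum(deviations)

print(x_bar, deviations, total)
print(left_total, right_total)
```

The total magnitude of the deviations on the left equals that on the right, which is what makes the overall sum vanish.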

Mean is the centre of gravity

sum of deviations from the mean = 0

What is the physical interpretation of the above result?

 

Mean is the centre of gravity

Imagine

Number line as a seesaw

Data points as weights on the seesaw

Weights proportional to deviations from \bar{x}

[Figure: seesaw on a number line from 1 to 10, with left side values and right side values balanced about the fulcrum at Mean: 6.0]

Mean is the centre of gravity

The mean is thus also called the centre of gravity of the data

 

Deviations on left side = Deviations on right side

[Figure: number line from 1 to 10 balanced at Mean: 6.0]

Sensitivity to outliers

Informally, we define an outlier as any point which is far off from the other values in the data (a formal definition will follow later) 

 

[Figure: histogram (x-axis: Scores, y-axis: Frequency) with an outlier marked]

 
 

Sensitivity to outliers

Ashes 2017-18 series (runs scored)

Alistair Cook: 2, 7, 7, 10, 14, 16, 37, 39, 244 (outlier: 244)

Mean = 41.78, Median = 14

Joe Root: 1, 9, 14, 15, 51, 58, 61, 67, 83

Mean = 39.89, Median = 51

Which player performed better in the series?

 
 

Sensitivity to outliers

Except for the one high score (the outlier), Cook performed poorly, whereas Root was more consistent (this is reflected in the median but not in the mean)

 
 

Sensitivity to outliers

Alistair Cook: 2, 7, 7, 10, 14, 16, 37, 39, 244 (outlier: 244)

Old Mean = 41.78, Old Median = 14

What if we drop the outlier?

New Mean = 16.5, New Median = 12

The mean is very sensitive to outliers whereas the median is much less sensitive
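The effect of dropping the outlier can be reproduced with Python's standard `statistics` module. The variable names are assumptions; the data is Cook's list of scores from the slide.

```python
from statistics import mean, median

cook = [2, 7, 7, 10, 14, 16, 37, 39, 244]

# Drop the single outlier (244) and recompute both measures
cook_without_outlier = [x for x in cook if x != 244]

old_mean, old_median = mean(cook), median(cook)
new_mean, new_median = mean(cook_without_outlier), median(cook_without_outlier)

print(round(old_mean, 2), old_median)
print(round(new_mean, 2), new_median)
```

The mean moves by more than 25 runs while the median moves by only 2, which is the sense in which the median is less sensitive.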

 

Sensitivity to outliers (trimmed mean)

Trimmed mean is computed by dropping the k most extreme elements from each end of the sorted data (note that we need to drop the same number of elements from both ends)

To reduce the sensitivity to outliers, it is advisable to compute the trimmed mean

Alistair Cook: 2, 7, 7, 10, 14, 16, 37, 39, 244

Mean = 41.78, Trimmed Mean = 18.57

Joe Root: 1, 9, 14, 15, 51, 58, 61, 67, 83

Mean = 39.89, Trimmed Mean = 39.29

(dropping k = 1 extreme value on either side)
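A minimal sketch of the trimmed mean, assuming k elements are dropped from each end of the sorted data. The function name `trimmed_mean` and its parameter `k` are assumptions made for illustration.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must be non-negative and leave at least one value")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


cook = [2, 7, 7, 10, 14, 16, 37, 39, 244]
root = [1, 9, 14, 15, 51, 58, 61, 67, 83]

print(round(trimmed_mean(cook, 1), 2))
print(round(trimmed_mean(root, 1), 2))
```

With k = 0 the function reduces to the ordinary mean, so the same code covers both cases.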

 

Sensitivity to outliers (trimmed mean)

Student salaries (INR lakhs) at a top university

9.1, 9.4, 10.5, 10.5, 11.5, 11.7, 12.3, 12.7, 12.8, 13.7, 13.8, 14.9, 14.9, 15.3, 16.2, 17.5, 17.6, 18.5, 18.6, 19.3, 19.9, 20.8, 23.6, 23.6, 24.3, 24.4, 32.1, 35.3, 45.5, 98.3, 133.1

Mean = 24.57

Median = 17.5

Trimmed Mean = 18.96 (dropping 2 extreme values on either side)

Sensitivity to outliers (mode)

The mode is not sensitive to outliers (unless the mode itself is the outlier)

Shikhar Dhawan T20I scores: 0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10, 11, 13, 14, 15, 16, 19, 23, 23, 23, 24, 26, 29, 30, 30, 31, 32, 32, 33, 35, 36, 40, 41, 41, 42, 43, 46, 47, 51, 52, 55, 60, 72, 74, 76, 80, 90, 92

Mode = 1

Sample: 8, 11, 12, 13, 14, 17, 19, 20, 21, 23, 24, 27, 28, 29, 30, 31, 33, 35, 64, 64

Mode = 64 (here the mode itself is the outlier)

 

Sensitivity to outliers

Summary

Mean is sensitive to outliers whereas median and mode are not

It is often a good idea to compute a trimmed mean by dropping the same number of elements from both extremes

What do the measures of centrality look like for different types of distributions?

Perfectly symmetric distribution

If x is the central location in the data then for every element (x-i) in the data, there will also be a corresponding element (x+i)

 

Can we say something interesting about the mean, median and mode?

 

Perfectly symmetric distribution

mean = median = mode

 

mode corresponds to the tallest bar

 

median also corresponds to the tallest bar with equal no. of elements on either side

 

What about the mean?

 

Perfectly symmetric distribution

Toy data: 3,3,3,4,4,4,4,4,5,5,5

 

What about the mean? (Informal proof)

 

Let x be the central value (median, 4 here)

 

Since the data is symmetric, for any element x-i (on the left) there will also be an element x+i (on the right)

 
\bar{x} = \frac{3+3+3+4+4+4+4+4+5+5+5}{11}
\bar{x} = \frac{(4-1)+(4-1)+(4-1)+4+4+4+4+4+(4+1)+(4+1)+(4+1)}{11}
\bar{x} = \frac{11*4}{11} = 4

Perfectly symmetric distribution

Toy data: 1,2,3,3,3,4,4,4,4,4,5,5,5,6,7

 

What about the mean? (Informal proof)

 

Let x be the central value (median, 4 here)

 

Since the data is symmetric, for any element x-i (on the left) there will also be an element x+i (on the right)

\bar{x} = \frac{1+2+3+3+3+4+4+4+4+4+5+5+5+6+7}{15}
\bar{x} = \frac{(4-3)+(4-2)+(4-1)+(4-1)+(4-1)+4+4+4+4+4+(4+1)+(4+1)+(4+1)+(4+2)+(4+3)}{15}
\bar{x} = \frac{15*4}{15} = 4

Perfectly symmetric distribution

What about the mean? (Informal proof)

 

The seesaw will be balanced when the fulcrum is placed at the median. Hence

 

mean = median = mode

 

Perfectly symmetric distribution

1,2,2,3,3,3,3,4,4,5,6,6,7,7,7,7,8,8,9

 

mean = median != mode

 

Mean = 5, Median=5, Mode = 3,7
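The symmetric bimodal example can be verified with the standard `statistics` module (`multimode` requires Python 3.8 or later). The symmetry check and the variable names are assumptions added for this sketch.

```python
from collections import Counter
from statistics import mean, median, multimode

data = [1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 7, 7, 7, 7, 8, 8, 9]

centre = median(data)
counts = Counter(data)

# Symmetric about the centre: value centre - i occurs as often as centre + i
is_symmetric = all(counts[centre - i] == counts[centre + i] for i in range(1, 5))

print(mean(data), centre, multimode(data), is_symmetric)
```

The mean and median coincide at the centre of symmetry, but the two modes sit on either side of it.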

 

What about bimodal distributions?

 

Perfectly symmetric distribution

mean = median

 

n is even

 

Other examples of multimodal distributions

 

Skewed Distributions

Left-skewed distribution: has a long tail to the left (also called negatively skewed)

 

Right-skewed distribution: has a long tail to the right (also called positively skewed)

 

Left-skewed

 

Right-skewed

 

Left Skewed Distributions

Without computing it, can you say where the mean would be? (towards the left or the right)

[Hint: mean is the centre of gravity]

 

Mean

Left Skewed Distributions

What about the median and the mode?

 

Observation: mean < median < mode

 

Mean 

Median

Mode

Left Skewed Distributions

What if the tail is very long?

 

Observation: mean < median < mode

 

Mean: 8.91 

Median: 9.055

Mode: 9.5

Mean : 8.93

Median: 9.05

Mode: 9.5

Mean: 8.935 

Median: 9.06

Mode: 9.5

Right Skewed Distributions

Without computing it, can you say where the mean would be? (towards the left or the right)

[Hint: mean is the centre of gravity]

 

Mean

Right Skewed Distributions

What about the median and the mode?

 

Observation: mean > median > mode

 

Mean 

Median

Mode

Skewed Distributions

Left skewed: mean < median < mode

Right skewed: mean > median > mode

Is this always true?

Almost always but not always (counter-example on next slide)

[Figure: left-skewed and right-skewed histograms with the mean, median and mode marked]

Skewed Distributions (with heavy and long tail)

Is this always true?

(not always!)

Not true for some skewed distributions which have a heavy tail on the other side

Left skewed but mean > median!

[Figure: histogram with a long tail on one side and a heavy tail on the other; median and mean marked]

Skewed Distributions (with heavy and long tail)

mean < median

(same as the generic rule)

[Figure: histogram with a long tail and a heavy tail; mean and median marked]

Skewed Distributions (bimodal)

Left skewed but mean > median!

[Figure: bimodal left-skewed histogram; mean and median marked]

Skewed Distributions

Summary

Left skewed: mean < median < mode

Right skewed: mean > median > mode

(Almost always true, except for some cases where there is a heavy tail on the other side or if the distribution is bimodal)

 


How do we compute mean, median, mode from histograms?

Histograms (computing measures of centrality)

What if we only have access to histograms and not the actual data?

Can we still compute measures of centrality?

[Figure: histogram, x-axis: Scores, y-axis: Frequency]

 
 

Histograms (computing measures of centrality)

Can we still compute measures of centrality?

 
Interval Frequency
0 - 10 126
10-20 59
20-30 44
30-40 47
40-50 31
50-60 22
60-70 28
70-80 10
80-90 18
90-100 18
100-110 12
110-120 12
120-130 9
130-140 4
140-150 7
150-160 1
160-170 1
170-180 1
180-190 1
190-200 0
200-210 1

Compute median from a histogram

Interval Frequency Cumulative Frequency
0-10 126 126
10-20 59 185
20-30 44 229
30-40 47 276
40-50 31 307
50-60 22 329
60-70 28 357
70-80 10 367
80-90 18 385
90-100 18 403
100-110 12 415
110-120 12 427
120-130 9 436
130-140 4 440
140-150 7 447
150-160 1 448
160-170 1 449
170-180 1 450
180-190 1 451
190-200 0 451
200-210 1 452

Compute cumulative frequency (find n)

Compute central location: \underbrace{\frac{n+1}{2}}_{n~\textit{is odd}}~~or~~\underbrace{\frac{n}{2}, \frac{n}{2}+1}_{n~\textit{is even}}

Find the interval containing the centre

Estimate median = mid-point of this interval
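The four steps can be sketched as follows. The function name `median_from_histogram` and the `(lower, upper, frequency)` bin representation are assumptions. One further assumption goes beyond the slides: when n is even and the two central locations fall in different intervals, this sketch averages the two mid-points.

```python
def median_from_histogram(bins):
    """Estimate the median from (lower, upper, frequency) bins sorted by lower."""
    n = sum(freq for _, _, freq in bins)
    if n == 0:
        raise ValueError("histogram is empty")
    # 1-indexed central location(s)
    centres = [(n + 1) // 2] if n % 2 == 1 else [n // 2, n // 2 + 1]
    midpoints = []
    for centre in centres:
        cumulative = 0
        for lower, upper, freq in bins:
            cumulative += freq
            if centre <= cumulative:
                # First interval whose cumulative frequency reaches the centre
                midpoints.append((lower + upper) / 2)
                break
    # Assumption: average the mid-points if the two centres straddle intervals
    return sum(midpoints) / len(midpoints)


frequencies = [126, 59, 44, 47, 31, 22, 28, 10, 18, 18,
               12, 12, 9, 4, 7, 1, 1, 1, 1, 0, 1]
bins = [(10 * i, 10 * i + 10, f) for i, f in enumerate(frequencies)]

print(median_from_histogram(bins))
```

For the table above, n = 452, so the central locations are 226 and 227; both fall in the interval 20-30, whose cumulative frequency is 229.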

 
 

Compute median from a histogram

Interval Frequency Cumulative Frequency
0 - 10 126 126
10-20 59 185
20-30 44 229

Why does the above procedure make sense?

0-20: 185 elements

20-30: 44 elements (contains the median, since the central locations 226 and 227 lie between 186 and 229)

30-210: 452 - 229 = 223 elements

 

Compute median from a histogram

It is possible that the 44 elements in the interval 20-30 were

20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 23, 23, 23, 24, 24, 24, 24, 25, 25, 25, 25, 26, 26, 27, 27, 27, 27, 27, 27, 28, 28, 28, 28, 28, 28, 28, 28, 29, 29

but it is also possible that the 44 elements were different (we don't know what the 44 values are)

20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 27, 28, 29

 

Compute median from a histogram

Since we do not know the actual values in the class interval, the best guess we can make is that the median is the mid-point of the class interval (we won't be very wrong)

True median: 28

Estimated median: 25

Error: \frac{28-25}{28}*100=10.71\%

 

Compute median from a histogram

What if the class intervals are bigger?

[Figure: histogram with Bin Size: 10000, x-axis: Total yield, y-axis: # Farms]

 

Compute median from a histogram

What if the class intervals are bigger?

 

n = 1611

centre = (1611+1)/2 = 806

class interval containing centre = 90000-100000

Estimated median = 95000

True median: 96080

Error: \frac{96080-95000}{96080}*100=1.12\%

The absolute error may be high due to larger values in the data but the relative error may still be reasonable
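The error figures quoted for the two examples can be reproduced with a one-line helper. The function name `percent_error` is an assumption; the formula is the one used on the slides, (true - estimate) / true * 100.

```python
def percent_error(true_value, estimate):
    """Relative error of an estimate, as a percentage of the true value."""
    return (true_value - estimate) / true_value * 100


# Runs example: absolute error 3, relative error about 10.7%
print(28 - 25, round(percent_error(28, 25), 2))

# Farm yield example: absolute error 1080, relative error about 1.1%
print(96080 - 95000, round(percent_error(96080, 95000), 2))
```

The second example has a far larger absolute error but a smaller relative error, which is the point made in the text.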

 

Compute mean from a histogram

Interval Frequency
0 - 10 126
10-20 59
20-30 44
30-40 47
40-50 31
50-60 22
60-70 28
70-80 10
80-90 18
90-100 18
100-110 12
110-120 12
120-130 9
130-140 4
140-150 7
150-160 1
160-170 1
170-180 1
180-190 1
190-200 0
200-210 1

[Figure: histogram, x-axis: Runs scored, y-axis: # Matches]

 
 

How do you compute the mean from a histogram?

 

Compute mean from a histogram

Interval Frequency Mid-point Mid-point * frequency
0-10 126 5 630
10-20 59 15 885
20-30 44 25 1100
30-40 47 35 1645
40-50 31 45 1395
50-60 22 55 1210
60-70 28 65 1820
70-80 10 75 750
80-90 18 85 1530
90-100 18 95 1710
100-110 12 105 1260
110-120 12 115 1380
120-130 9 125 1125
130-140 4 135 540
140-150 7 145 1015
150-160 1 155 155
160-170 1 165 165
170-180 1 175 175
180-190 1 185 185
190-200 0 195 0
200-210 1 205 205

Compute the mid-point of each interval

Multiply the mid-point by the frequency of the interval

Sum up the resulting product for all intervals

Divide by the no. of data points

sum = 18880

mean = \frac{18880}{452} = 41.769
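The procedure for estimating the mean can be sketched in the same style. The function names and the `(lower, upper, frequency)` bin representation are assumptions made for illustration; the frequencies are those of the table.

```python
def midpoint_weighted_sum(bins):
    """Sum of (mid-point * frequency) over all (lower, upper, frequency) bins."""
    return sum((lower + upper) / 2 * freq for lower, upper, freq in bins)


def mean_from_histogram(bins):
    """Estimate the mean by treating every value as its interval's mid-point."""
    n = sum(freq for _, _, freq in bins)
    if n == 0:
        raise ValueError("histogram is empty")
    return midpoint_weighted_sum(bins) / n


frequencies = [126, 59, 44, 47, 31, 22, 28, 10, 18, 18,
               12, 12, 9, 4, 7, 1, 1, 1, 1, 0, 1]
bins = [(10 * i, 10 * i + 10, f) for i, f in enumerate(frequencies)]

print(midpoint_weighted_sum(bins))
print(round(mean_from_histogram(bins), 2))
```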

 
 

Compute mean from a histogram

True mean = 40.76

Estimated mean = 41.77

Error = \frac{40.76-41.77}{40.76}*100 = -2.48\%


Compute mean from a histogram

What is the intuition behind this procedure?

 
\bar{x}=\frac{\sum_{i=1}^{n} x_i}{n}

\sum_{i=1}^{n} x_i = sum of elements in 1st interval
+ sum of elements in 2nd interval
+ sum of elements in 3rd interval
+ \dots
+ sum of elements in last interval

Compute mean from a histogram

sum of elements in 8th interval (70-80, frequency 10) = sum of 10 elements

Problem: We do not know what these 10 values are

Solution: Assume each value is equal to the mid-point (this over-estimates some values and under-estimates others)

Example: suppose the 10 values were 70, 70, 71, 72, 73, 74, 77, 77, 78, 79

mid-point = 75, True sum = 741

Approximation: add 75 10 times = 75 * 10 = 750
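The mid-point approximation for a single interval can be checked directly with the ten example values. The variable names are assumptions made for this sketch.

```python
# Example values for the 70-80 interval, as listed on the slide
values = [70, 70, 71, 72, 73, 74, 77, 77, 78, 79]
lower, upper = 70, 80

midpoint = (lower + upper) / 2
true_sum = sum(values)
approx_sum = midpoint * len(values)

# Per-value errors: positive where the mid-point over-estimates the value
errors = [midpoint - v for v in values]

print(true_sum, approx_sum, sum(errors))
```

The over-estimates and under-estimates partly cancel, so the approximate sum is off by only 9 out of 741.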

 

What if the class intervals are bigger?

 

True mean: 135812.14

 

Estimated mean: 136837.37

 

Error:

 
\frac{135812.14-136837.37}{135812.14}*100=-0.75\%

The absolute error may be high due to larger values in the data but the relative error may still be reasonable

 

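Compute mode from a histogram

It is not possible to determine the exact mode from a histogram if the bin size is greater than 1 (a bin only tells us how many values fall in it, not how often each individual value occurs)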

 

Compute mode from a histogram

2, 2, 11, 13, 14, 15, 16, 19, 20, 22, 23, 24, 26, 29, 31, 32, 35, 36, 37, 38, 39, 41, 42, 43, 44, 48, 49, 50, 52, 53, 54, 55, 56, 58, 60, 61, 62, 63, 64, 65, 66, 68, 69

 

Compute mode from a histogram

Of course if the bin size is 1 it is trivial to compute the mode

 

Modes: 0 and 1

 
 

What is the effect of transformations on the measures of centrality?

Transformations

Scaling: x_{new} = a*x

Example: kilometres to metres (a = 1000)

Original Data: (Distance km)

[50.52, 62.935, 50.888, 62.94, 62.929, 37.8, 36.687, 39.122, 63.453, 44.845]

Scaled Data: (Distance m)

[50520.0, 62935.0, 50888.0, 62940.0, 62929.0, 37800.0, 36687.0, 39122.0, 63453.0, 44845.0]

Example: pounds to kilograms (a = 0.4535)

Original Data: (weight lbs)

[13, 29, 21, 34, 30, 33, 11, 31, 15, 20]

Scaled Data: (weight kgs)

[5.89, 13.15, 9.52, 15.41, 13.60, 14.96, 4.98, 14.05, 6.80, 9.07]

 

Transformations

Shifting: x_{new} = x + c

Example: flat 50 INR off on shirts (c = -50)

Original Data: (Pre discount)

[699, 599, 549, 1499, 799, 999, 1150, 850, 899, 1099]

Shifted Data: (Post Discount)

[649, 549, 499, 1449, 749, 949, 1100, 800, 849, 1049]

Example: flat 5 INR packing charge per item (c = 5)

Original Data:

{'Veg Burger': 35, 'Cheese Maggi':45, 'Masala Dosa':80, 'Fried Rice': 75, 'Pizza': 129}

Shifted Data: (Post Packing)

{'Veg Burger': 40, 'Cheese Maggi':50, 'Masala Dosa':85, 'Fried Rice': 80, 'Pizza': 134}

 

Transformations

Scaling and Shifting

 
x_{new} = a*x + c

(a = 5/9, c = -160/9)

 

Temperature in Fahrenheit:

 

[75.25, 71 , 55.15, 58.28, 69.71, 44.4 , 38.77, 44.96, 80.7 , 73.76]

 

Temperature in Celsius:

 

[24.03, 21.67, 12.86, 14.6 , 20.95, 6.89, 3.76, 7.2 , 27.06, 23.2]

 

Transformations

Summary

 

Scaling and Shifting: x_{new} = a*x + c

Special cases:

c=0: x_{new} = a*x (Only scaling)

a=1: x_{new} = x+c (Only shifting)

Effect of transformations on mean

Prove that if x_{new} = a*x + c then, \bar{x}_{new} = a*\bar{x} + c

Proof:

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\bar{x}_{new} = \frac{1}{n}\sum_{i=1}^{n} x_{new,i}
= \frac{1}{n}\sum_{i=1}^{n} (a x_i + c)
= \frac{1}{n}(\sum_{i=1}^{n} a x_i + \sum_{i=1}^{n} c )
= a*\frac{1}{n}\sum_{i=1}^{n} x_i + \frac{1}{n} * nc
= a\bar{x} + c

Effect of transformations on mean

Temperature in Fahrenheit:

[75.25, 71 , 55.15, 58.28, 69.71, 44.4 , 38.77, 44.96, 80.7 , 73.76]

\bar{x} = 61.2

Temperature in Celsius: (a = 5/9, c = -160/9)

[24.03, 21.67, 12.86, 14.6 , 20.95, 6.89, 3.76, 7.2 , 27.06, 23.2]

\bar{x}_{new} = a*\bar{x} + c = \frac{5}{9}*61.2 + \frac{-160}{9} = 16.2

 

Effect of transformations (on median)

x_{new} = a*x + c
median_{new} = a*median + c

Temperature in Fahrenheit:

[38.77, 44.4 , 44.96, 55.15, 58.28, 69.71, 71]

Temperature in Celsius:

[3.76, 6.89, 7.2 , 12.86, 14.6 , 20.95, 21.67]

The position of the median in the sorted data does not change (its value gets scaled and shifted)

12.86 = \frac{5}{9}*55.15 + \frac{-160}{9}

Effect of transformations (on mode)

x_{new} = a*x + c
mode_{new} = a*mode + c

Temperature in Fahrenheit:

[38.77, 44.44 , 44.44, 55.15, 58.28, 69.71, 71. , 73.76, 75.25, 80.7]

Temperature in Celsius:

[3.76, 6.91, 6.91, 12.86, 14.6 , 20.95, 21.67, 23.2 , 24.03, 27.06]

The scaled and shifted value of the mode will be the new mode

6.91 = \frac{5}{9}*44.44 + \frac{-160}{9}
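The effect of x_new = a*x + c on all three measures can be checked numerically. This is a sketch using the Fahrenheit list from the mode example with a = 5/9 and c = -160/9; the helper name `affine` is an assumption. Floating-point results are compared with `math.isclose` rather than exact equality.

```python
import math
from statistics import mean, median, mode


def affine(x, a, c):
    """Scale by a, then shift by c."""
    return a * x + c


a, c = 5 / 9, -160 / 9
fahrenheit = [38.77, 44.44, 44.44, 55.15, 58.28, 69.71, 71.0, 73.76, 75.25, 80.7]
celsius = [affine(x, a, c) for x in fahrenheit]

mean_ok = math.isclose(mean(celsius), affine(mean(fahrenheit), a, c))
median_ok = math.isclose(median(celsius), affine(median(fahrenheit), a, c))
mode_ok = math.isclose(mode(celsius), affine(mode(fahrenheit), a, c))

print(mean_ok, median_ok, mode_ok)
print(round(mode(celsius), 2))
```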

Summary

mean: \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

median: x_{\frac{n+1}{2}}~or~ \frac{x_{\frac{n}{2}} + x_{\frac{n}{2} + 1}}{2}

mode: most freq. element

Mean is the centre of gravity of the data

Mean is sensitive to outliers but median is not

 

Summary

Effect of Skewness (almost always):

Left skewed: mean < median < mode

Right skewed: mean > median > mode

Symmetric: mean = median = mode (for a symmetric distribution with a single mode)

Mean and median can be approximately computed from histograms

Effect of Transformations:

x_{new} = a*x + c
\bar{x}_{new} = a*\bar{x} + c
median_{new} = a*median + c
mode_{new} = a*mode + c

Typical trends in histograms

Left-skewed histogram: Most of the short bars are towards the left of the histogram

[Figures: three histograms; axes: Units vs Frequency, Average Strike Rate vs Frequency, Units vs Frequency]

 
 

Measures of Centrality

Age Height Weight Cholesterol Sugar level .... .....
32 165 75 124 108 ...
24 172 81 112 98 ...
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...

Which is the 7th most grown crop in the country?

 

Hard to answer

 

Nominal attributes

Recall

 

Question: What is the typical value of an attribute in our dataset?

How many runs does Sachin Tendulkar typically score in a match?

 

How many balls does Sachin Tendulkar typically face in a match?

 




Summarising Data - Part 1

By One Fourth Labs


PadhAI One: FDS Week 3 (MK)
