Recap: List of Topics

Descriptive Statistics

Different types of data

Different types of plots

Measures of centrality and spread

Probability Theory

Sample spaces, events, axioms

Discrete and continuous RVs

Bernoulli, Uniform, Normal dist.

Inferential Statistics

Sampling strategies

Interval Estimators

Hypothesis testing (z-test, t-test)

ANOVA, Chi-square test

Linear Regression

Learning Objectives

Why do we need measures of spread and centrality?

What are the different measures of centrality?

What are some characteristics of these measures?

What do the measures of centrality look like for different types of distributions?

How do you compute these measures from histograms?

What is the effect of certain transformations on these measures?

Why do we need measures of centrality and spread?

Motivation: Summarise Big Data

Age	Height	Weight	Cholesterol	Sugar level	.... .....
32	165	75	124	108	...
24	172	81	112	98	...
...	...	...	...	...	...
...	...	...	...	...	...
...	...	...	...	...	...

Imagine a million such rows!

Drawing plots can give a good visual summary

In some situations, we want an even more succinct summary (say, a single/few numbers)


Recall

A parameter is any numeric property of the entire population under study

A statistic is any numerical property of a sample drawn from the population (used as an estimate of the corresponding population parameter)


Use summary statistics for quantitative data

Measures of centrality (mean, median, mode)

Percentiles (quartiles, quintiles, deciles)

Measures of spread (range, IQR, variance, standard deviation)

What are the different measures of centrality?

Measures of Centrality

Question: What is the typical value of an attribute in our dataset?

How many runs does Sachin Tendulkar typically score in a match?

How many balls does Sachin Tendulkar typically face in a match?

Measures of Centrality: mean

Notation:

Data points: x_1, x_2, x_3, \dots, x_n

Mean of sample: \bar{x}

Mean of population: \mu

Example data:

x_1 = 0, x_2 = 0, x_3 = 36, x_4 = 10, x_5 = 20, x_6 = 19, x_7 = 31, x_8 = 36, x_9 = 53, x_{10} = 30, x_{11} = 0, x_{12} = 4, x_{13} = 53, x_{14} = 52, x_{15} = 22, \dots, x_{452} = 52

Measures of Centrality: mean

\bar{x} = \frac{x_1 + x_2 + x_3 + \dots+x_n}{n}

= \frac{1}{n}\sum_{i=1}^{n}x_i
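The formula above translates almost directly into code. The following is a minimal sketch, assuming the data is held in a plain Python list; the function name `mean` and the variable `first_15_scores` are illustrative. Only the first 15 of the 452 data points are listed on the slide, so the result below is the mean of that partial sample, not the mean of the full dataset.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("mean of an empty list is undefined")
    return sum(values) / len(values)


# x_1 ... x_15 as listed on the slide (the remaining 437 values are not shown).
first_15_scores = [0, 0, 36, 10, 20, 19, 31, 36, 53, 30, 0, 4, 53, 52, 22]

partial_mean = mean(first_15_scores)
print(partial_mean)  # mean of the 15 listed values only
```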


Measures of Centrality: mean

\bar{x} = \frac{x_1 + x_2 + x_3 + x_4 + \dots+x_{452}}{452}

= \frac{0 + 0 + 36 + 10 + \dots+52}{452}

= 40.76

Measures of Centrality: median

Shikhar Dhawan T20I scores (59 matches)

5, 32, 30, 0, 1, 33, 3, 11, 5, 42, 26, 9, 51, 46, 2, 1, 16, 60, 1, 6, 23, 13, 23, 15, 2, 80, 1, 6, 72, 24, 47, 90, 55, 8, 35, 10, 74, 4, 10, 5, 3, 43, 92, 76, 41, 29, 30, 5, 14, 1, 23, 3, 40, 36, 41, 31, 19, 32, 52

Median is the value which appears at the centre of the data when the data is sorted.

Measures of Centrality: median

Shikhar Dhawan T20I scores (59 sorted scores)

0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10, 11, 13, 14, 15, 16, 19, 23, 23, 23, 24, 26, 29, 30, 30, 31, 32, 32, 33, 35, 36, 40, 41, 41, 42, 43, 46, 47, 51, 52, 55, 60, 72, 74, 76, 80, 90, 92

center~location = \frac{n+1}{2} = \frac{59+1}{2} = 30~~(\because n = 59~is~odd)

First 29 elements: 0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10, 11, 13, 14, 15, 16, 19, 23

Mid-point (30th element): 23

Last 29 elements: 23, 24, 26, 29, 30, 30, 31, 32, 32, 33, 35, 36, 40, 41, 41, 42, 43, 46, 47, 51, 52, 55, 60, 72, 74, 76, 80, 90, 92

There are an equal number of elements on either side of the central location

When n is odd, the median is the value at the central location (or mid-point), which is 23 in this case

Measures of Centrality: median

What happens when n is even? (say, if we had data for only 50 T20Is)

Shikhar Dhawan T20I scores (50 sorted scores)

0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10, 11, 13, 14, 15, 16, 23, 23, 24, 26, 29, 30, 30, 32, 33, 35, 41, 42, 43, 46, 47, 51, 55, 60, 72, 74, 76, 80, 90, 92

center~locations = \frac{n}{2}~and~\frac{n}{2}+1 = 25~and~26~~(\because n = 50~is~even)

First 24 elements: 0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10, 11, 13, 14

2 mid-points (25th and 26th elements): 15, 16

Last 24 elements: 23, 23, 24, 26, 29, 30, 30, 32, 33, 35, 41, 42, 43, 46, 47, 51, 55, 60, 72, 74, 76, 80, 90, 92

There are two mid-points now, such that the number of elements on either side is the same

When n is even, the median is the average of the values at the two central locations (or mid-points), which is (15 + 16)/2 = 15.5 in this case

Measures of Centrality: median

Summary

Data (sorted):~x_1, x_2, x_3, \dots, x_n

if~n~is~odd:

median = x_{\frac{n+1}{2}}

(the element at position \frac{n+1}{2})

if~n~is~even:

median = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}

(the mean of the elements at positions \frac{n}{2}~\&~\frac{n}{2}+1)
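The odd and even cases can be sketched in a few lines of Python. This is an illustrative implementation, not a prescribed one: the function name `median` and the variable names are assumptions, and the only subtlety is converting the 1-based positions used on the slides into Python's 0-based indices.

```python
def median(values):
    """Median: sort, then take the middle element (odd n) or the
    mean of the two middle elements (even n)."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # 1-based position (n+1)/2 is 0-based index n // 2
        return ordered[n // 2]
    # 1-based positions n/2 and n/2 + 1 are 0-based indices n//2 - 1 and n//2
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


# Shikhar Dhawan T20I scores from the slides (59 matches, unsorted).
dhawan_59 = [5, 32, 30, 0, 1, 33, 3, 11, 5, 42, 26, 9, 51, 46, 2, 1, 16, 60,
             1, 6, 23, 13, 23, 15, 2, 80, 1, 6, 72, 24, 47, 90, 55, 8, 35, 10,
             74, 4, 10, 5, 3, 43, 92, 76, 41, 29, 30, 5, 14, 1, 23, 3, 40, 36,
             41, 31, 19, 32, 52]

# The 50-score variant used for the even case (already sorted on the slide).
dhawan_50 = [0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10,
             11, 13, 14, 15, 16, 23, 23, 24, 26, 29, 30, 30, 32, 33, 35, 41,
             42, 43, 46, 47, 51, 55, 60, 72, 74, 76, 80, 90, 92]

print(median(dhawan_59))  # odd n: value at the 30th position
print(median(dhawan_50))  # even n: mean of the 25th and 26th values
```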

Measures of Centrality: mode

The mode is defined as the most frequently occurring value in the dataset

Mode = 1

Shikhar Dhawan T20I scores (59 sorted scores)

0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10, 11, 13, 14, 15, 16, 19, 23, 23, 23, 24, 26, 29, 30, 30, 31, 32, 32, 33, 35, 36, 40, 41, 41, 42, 43, 46, 47, 51, 52, 55, 60, 72, 74, 76, 80, 90, 92

Measures of Centrality: mode

Single mode: (only 1 most frequent value)

5, 32, 30, 0, 1, 33, 3, 11, 5, 42, 26, 9, 51, 46, 2, 1, 16, 60, 1, 6, 23, 13, 23, 15, 2, 80, 1, 6, 72, 24, 47, 90, 55, 8, 35, 10, 74, 4, 10, 5, 3, 43, 92, 76, 41, 29, 30, 5, 14, 1, 23, 3, 40, 36, 41, 31, 19, 32, 52

the mode is 1

Multiple modes: (more than 1 most frequent value)

1,2,2,2,3,4,5,5,5,5,5,6,6,7,7,12,12,13,14,15,15,15,15,15,17,18,19,19

the modes are 5 & 15

No modes: (all values appear exactly once)

1,2,3,7,8,10,13,23,32,43,55,61,65,68,77,85,91,93
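The three cases (single mode, multiple modes, no mode) can be handled by counting frequencies. The sketch below is one possible way to do it; the function name `modes` is an assumption, and returning an empty list when every value appears exactly once follows the "no modes" convention used on this slide (other conventions exist).

```python
from collections import Counter


def modes(values):
    """Return the list of most frequent values, sorted in increasing order.

    Returns an empty list when several distinct values all appear exactly
    once (the "no modes" case).
    """
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1 and len(counts) > 1:
        return []
    return sorted(value for value, count in counts.items() if count == highest)


dhawan_scores = [5, 32, 30, 0, 1, 33, 3, 11, 5, 42, 26, 9, 51, 46, 2, 1, 16,
                 60, 1, 6, 23, 13, 23, 15, 2, 80, 1, 6, 72, 24, 47, 90, 55, 8,
                 35, 10, 74, 4, 10, 5, 3, 43, 92, 76, 41, 29, 30, 5, 14, 1, 23,
                 3, 40, 36, 41, 31, 19, 32, 52]
two_modes = [1, 2, 2, 2, 3, 4, 5, 5, 5, 5, 5, 6, 6, 7, 7, 12, 12, 13, 14,
             15, 15, 15, 15, 15, 17, 18, 19, 19]
all_unique = [1, 2, 3, 7, 8, 10, 13, 23, 32, 43, 55, 61, 65, 68, 77, 85, 91, 93]

print(modes(dhawan_scores))
print(modes(two_modes))
print(modes(all_unique))
```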

Measures of Centrality: summary

Summary

Median is the value which appears at the centre of the data when the data is sorted (slight difference when n is odd v/s when n is even)

Mean is the sum of all the elements in the data divided by the total number of elements

Mode is the most frequent value appearing in the data

What are some characteristics of measures of centrality?

Mean is the centre of gravity

Data:~x_1, x_2, x_3, \dots, x_n

Mean:\bar{x}

The deviation of a point from the mean is defined as the difference between this point and the mean

Deviation:x_i - \bar{x}

Mean is the centre of gravity

The sum of the deviations of all points from the mean is 0

sum of deviations

=\sum_{i=1}^{n} (x_i -\bar{x})

= (x_1 - \bar{x}) + (x_2 - \bar{x}) + \dots + (x_n - \bar{x})

= (x_1 + x_2 + x_3 + \dots + x_n) - (\bar{x} + \bar{x} + \dots n~times)

= \sum_{i=1}^{n} x_i - n\bar{x}

= \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} x_i = 0

(\because \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i)
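The result can be checked numerically. The following sketch uses a small made-up list (the values are illustrative and do not come from the slides); with real-valued data the sum may differ from zero by a tiny floating-point error, so the check uses a tolerance.

```python
import math

# Hypothetical data, chosen only to illustrate the identity.
data = [2, 4, 4, 5, 10]

x_bar = sum(data) / len(data)
deviations = [x - x_bar for x in data]
sum_of_deviations = sum(deviations)

print(x_bar)              # the mean
print(deviations)         # negative on the left of the mean, positive on the right
print(math.isclose(sum_of_deviations, 0.0, abs_tol=1e-9))
```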

Mean is the centre of gravity

What is the physical interpretation of the above result (sum of deviations from the mean = 0)?

Imagine

Number line as a seesaw

Data points as weights on the seesaw

Weights proportional to deviations from \bar{x}

[Figure: seesaw on a number line from 1 to 10 with left side values, right side values and the fulcrum at Mean: 6.0]

Deviations on left side = Deviations on right side

The mean is thus also called the centre of gravity of the data

Sensitivity to outliers

Informally, we define an outlier as any point which is far off from the other values in the data (a formal definition will follow later)

[Figure: histogram of scores (x-axis: Scores, y-axis: Frequency) with one bar marked as the outlier]

Sensitivity to outliers

Ashes 2017-18 series (runs scored)

Alastair Cook: 2, 7, 7, 10, 14, 16, 37, 39, 244 (outlier: 244)

Mean = 41.78, Median = 14

Joe Root: 1, 9, 14, 15, 51, 58, 61, 67, 83

Mean = 39.89, Median = 51

Which player performed better in the series?
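The means and medians for the two players can be reproduced with Python's standard `statistics` module. This is a small sketch for checking the numbers; the variable names `cook` and `root` are illustrative.

```python
import statistics

# Ashes 2017-18 series scores as listed on the slide.
cook = [2, 7, 7, 10, 14, 16, 37, 39, 244]
root = [1, 9, 14, 15, 51, 58, 61, 67, 83]

for name, scores in (("Cook", cook), ("Root", root)):
    print(name,
          "mean =", round(statistics.mean(scores), 2),
          "median =", statistics.median(scores))
```

The means are close to each other, while the medians are far apart: the single score of 244 pulls Cook's mean up but leaves his median untouched.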

Sensitivity to outliers

Except for the one high score (the outlier, 244), Cook performed poorly, whereas Root was more consistent (this is reflected in the median but not in the mean)

Sensitivity to outliers

What if we drop the outlier?

Alastair Cook: 2, 7, 7, 10, 14, 16, 37, 39, 244 (outlier: 244)

Old Mean = 41.78, New Mean = 16.5

Old Median = 14, New Median = 12

The mean is very sensitive to outliers whereas the median is not so sensitive

Sensitivity to outliers (trimmed mean)

To account for the sensitivity to outliers, it is advisable to compute the trimmed mean

Trimmed mean is computed by dropping k extreme elements from either side (note that we need to drop the same number of elements from both sides)

Alastair Cook: 2, 7, 7, 10, 14, 16, 37, 39, 244

Mean = 41.78

Trimmed Mean = 18.57 (dropping 1 extreme value on either side)

Joe Root: 1, 9, 14, 15, 51, 58, 61, 67, 83

Mean = 39.89

Trimmed Mean = 39.29 (dropping 1 extreme value on either side)

Sensitivity to outliers (trimmed mean)

Student salaries (INR lakhs) at a top university

9.1, 9.4, 10.5, 10.5, 11.5, 11.7, 12.3, 12.7, 12.8, 13.7, 13.8, 14.9, 14.9, 15.3, 16.2, 17.5, 17.6, 18.5, 18.6, 19.3, 19.9, 20.8, 23.6, 23.6, 24.3, 24.4, 32.1, 35.3, 45.5, 98.3, 133.1

Mean = 24.57

Median = 17.5

Trimmed Mean = 18.96 (dropping 2 extreme values on either side)
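A trimmed mean can be sketched as "sort, drop k values from each end, then average what remains". The function name `trimmed_mean` and the parameter `k` below are illustrative; the data are the cricket scores and salaries listed above.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    if k < 0:
        raise ValueError("k must be non-negative")
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large: no values would remain")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


cook = [2, 7, 7, 10, 14, 16, 37, 39, 244]
root = [1, 9, 14, 15, 51, 58, 61, 67, 83]
salaries = [9.1, 9.4, 10.5, 10.5, 11.5, 11.7, 12.3, 12.7, 12.8, 13.7, 13.8,
            14.9, 14.9, 15.3, 16.2, 17.5, 17.6, 18.5, 18.6, 19.3, 19.9, 20.8,
            23.6, 23.6, 24.3, 24.4, 32.1, 35.3, 45.5, 98.3, 133.1]

print(round(trimmed_mean(cook, 1), 2))      # one value dropped on each side
print(round(trimmed_mean(root, 1), 2))
print(round(trimmed_mean(salaries, 2), 2))  # two values dropped on each side
```

With k = 0 the function reduces to the ordinary mean, which makes it easy to compare the two on the same data.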

Sensitivity to outliers (mode)

The mode is not sensitive to outliers (unless the mode itself is the outlier)

Shikhar Dhawan T20I scores: 0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 9, 10, 10, 11, 13, 14, 15, 16, 19, 23, 23, 23, 24, 26, 29, 30, 30, 31, 32, 32, 33, 35, 36, 40, 41, 41, 42, 43, 46, 47, 51, 52, 55, 60, 72, 74, 76, 80, 90, 92

Mode = 1

Sample (the mode itself is the outlier): 8, 11, 12, 13, 14, 17, 19, 20, 21, 23, 24, 27, 28, 29, 30, 31, 33, 35, 64, 64

Mode = 64

Sensitivity to outliers

Summary

Mean is sensitive to outliers whereas median and mode are not

It is often a good idea to compute a trimmed mean by dropping the same number of elements from both the extremes

What do the measures of centrality look like for different types of distributions?

Perfectly symmetric distribution

If x is the central location in the data then for every element (x-i) in the data, there will also be a corresponding element (x+i)

Can we say something interesting about the mean, median and mode?

Perfectly symmetric distribution

mean = median = mode

mode corresponds to the tallest bar

median also corresponds to the tallest bar with equal no. of elements on either side

What about the mean?

Perfectly symmetric distribution

Toy data: 3,3,3,4,4,4,4,4,5,5,5

What about the mean? (Informal proof)

Let x be the central value (median, 4 here)

Since the data is symmetric, for any element x-i (on the left) there will also be an element x+i (on the right)

\bar{x} = \frac{3+3+3+4+4+4+4+4+5+5+5}{11}

\bar{x} = \frac{(4-1)+(4-1)+(4-1)+4+4+4+4+4+(4+1)+(4+1)+(4+1)}{11}

\bar{x} = \frac{11*4}{11} = 4

Perfectly symmetric distribution

Toy data: 1,2,3,3,3,4,4,4,4,4,5,5,5,6,7

What about the mean? (Informal proof)

Let x be the central value (median, 4 here)

Since the data is symmetric, for any element x-i (on the left) there will also be an element x+i (on the right)

\bar{x} = \frac{1+2+3+3+3+4+4+4+4+4+5+5+5+6+7}{15}

\bar{x} = \frac{(4-3)+(4-2)+(4-1)+(4-1)+(4-1)+4+4+4+4+4+(4+1)+(4+1)+(4+1)+(4+2)+(4+3)}{15}

\bar{x} = \frac{15*4}{15} = 4

Perfectly symmetric distribution

What about the mean? (Informal proof)

The seesaw will be balanced when the fulcrum is placed at the median. Hence

mean = median = mode

Perfectly symmetric distribution

What about bimodal distributions?

1,2,2,3,3,3,3,4,4,5,6,6,7,7,7,7,8,8,9

Mean = 5, Median = 5, Modes = 3 and 7

mean = median != mode
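Both symmetric examples can be checked with the standard `statistics` module. The sketch below uses the 15-element toy data and the bimodal data from these slides; the helper `is_symmetric` is a hypothetical function written for this illustration (it tests that the i-th smallest and i-th largest values always add up to the same total).

```python
import statistics


def is_symmetric(values):
    """True if the sorted data mirrors itself around its centre."""
    ordered = sorted(values)
    target = ordered[0] + ordered[-1]
    return all(low + high == target
               for low, high in zip(ordered, reversed(ordered)))


toy = [1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 6, 7]
bimodal = [1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 7, 7, 7, 7, 8, 8, 9]

for data in (toy, bimodal):
    print(is_symmetric(data),
          statistics.mean(data),
          statistics.median(data),
          statistics.multimode(data))  # multimode needs Python 3.8+
```

For the unimodal toy data all three measures coincide; for the bimodal data the mean and median still coincide, but the modes sit on either side of them.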

Perfectly symmetric distribution

mean = median

n is even

Other examples of multimodal distributions

Skewed Distributions

Left-skewed distribution: has a long tail to the left (also called negatively skewed)

Right-skewed distribution: has a long tail to the right (also called positively skewed)

Left-skewed

Right-skewed

Left Skewed Distributions

Without computing it, can you say where the mean would be? (towards the left or the right)

[Hint: the mean is the centre of gravity]

Mean

Left Skewed Distributions

What about the median and the mode?

Observation: mean < median < mode

Mean

Median

Mode

Left Skewed Distributions

What if the tail is very long?

Observation: mean < median < mode

Mean: 8.91, Median: 9.055, Mode: 9.5

Mean: 8.93, Median: 9.05, Mode: 9.5

Mean: 8.935, Median: 9.06, Mode: 9.5

Right Skewed Distributions

Without computing it, can you say where the mean would be? (towards the left or the right)

[Hint: the mean is the centre of gravity]

Mean

Right Skewed Distributions

What about the median and the mode?

Observation: mean > median > mode

Mean

Median

Mode

Skewed Distributions

Left skewed: mean < median < mode

Right skewed: mean > median > mode

Is this always true?

Almost always but not always (counter-example on next slide)

[Figure: left-skewed and right-skewed histograms with the mean, median and mode marked on each]
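The typical ordering can be seen on a tiny example. The two lists below are made up for illustration (they are not from the slides): `right_skewed` has a long tail to the right, and `left_skewed` is its mirror image.

```python
import statistics

# Hypothetical data with a long tail to the right.
right_skewed = [1, 1, 1, 2, 2, 3, 4, 10]
# Mirror image of the same data: long tail to the left.
left_skewed = [10 - x for x in right_skewed]

for name, data in (("right", right_skewed), ("left", left_skewed)):
    print(name,
          "mean =", statistics.mean(data),
          "median =", statistics.median(data),
          "mode =", statistics.mode(data))
```

The long tail drags the mean furthest, the median less, and leaves the mode where the data is densest. As the slides note, this ordering holds for most but not all skewed data.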

Skewed Distributions (with heavy and long tail)

Is this always true? (not always!)

Not true for some skewed distributions which have a heavy tail on the other side

[Figure: left-skewed histogram with a long tail and a heavy tail, mean and median marked] Left skewed but mean > median!

[Figure: histogram with a long tail and a heavy tail, mean and median marked] mean < median (same as the generic rule)

Skewed Distributions (bimodal)

[Figure: left-skewed bimodal histogram, mean and median marked] Left skewed but mean > median!

Summary

Skewed Distributions

Left skewed: mean < median < mode

Right skewed: mean > median > mode

(Almost always true, except for some cases where there is a heavy tail on the other side or if the distribution is bimodal)


How do we compute mean, median, mode from histograms?

Histograms (computing measures of centrality)

What if we only have access to histograms and not the actual data?

Can we still compute measures of centrality?

[Figure: histogram of scores (x-axis: Scores, y-axis: Frequency)]

Histograms (computing measures of centrality)

Interval	Frequency
0-10	126
10-20	59
20-30	44
30-40	47
40-50	31
50-60	22
60-70	28
70-80	10
80-90	18
90-100	18
100-110	12
110-120	12
120-130	9
130-140	4
140-150	7
150-160	1
160-170	1
170-180	1
180-190	1
190-200	0
200-210	1

Compute median from a histogram

Interval	Frequency	Cumulative Frequency
0-10	126	126
10-20	59	185
20-30	44	229
30-40	47	276
40-50	31	307
50-60	22	329
60-70	28	357
70-80	10	367
80-90	18	385
90-100	18	403
100-110	12	415
110-120	12	427
120-130	9	436
130-140	4	440
140-150	7	447
150-160	1	448
160-170	1	449
170-180	1	450
180-190	1	451
190-200	0	451
200-210	1	452

Compute cumulative frequency (find n)

Compute central location

\underbrace{\frac{n+1}{2}}_{n~\textit{is odd}}~~or~~\underbrace{\frac{n}{2}, \frac{n}{2}+1}_{n~\textit{is even}}

Find the interval containing the centre

Estimate median = mid-point of this interval

Here n = 452 is even, so the central locations are 226 and 227; both fall in the interval 20-30 (cumulative frequency 229), so the estimated median is 25
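The procedure above can be sketched in Python as follows. The function name `histogram_median` and the `(lower, upper, frequency)` bin representation are assumptions made for this illustration. The slides do not say what to do when n is even and the two central locations fall in different intervals; averaging the two mid-points in that case is an assumption of this sketch.

```python
def histogram_median(bins):
    """Estimate the median from bins given as (lower, upper, frequency),
    listed in increasing order of the interval."""
    n = sum(freq for _, _, freq in bins)
    if n == 0:
        raise ValueError("histogram is empty")
    # 1-based central location(s), as in the procedure above.
    positions = [(n + 1) // 2] if n % 2 == 1 else [n // 2, n // 2 + 1]

    midpoints = []
    for position in positions:
        cumulative = 0
        for lower, upper, freq in bins:
            cumulative += freq
            if position <= cumulative:  # this interval contains the centre
                midpoints.append((lower + upper) / 2)
                break
    # Assumption: if the two centres land in different bins, average them.
    return sum(midpoints) / len(midpoints)


# Frequencies for the intervals 0-10, 10-20, ..., 200-210 from the table.
frequencies = [126, 59, 44, 47, 31, 22, 28, 10, 18, 18,
               12, 12, 9, 4, 7, 1, 1, 1, 1, 0, 1]
bins = [(10 * i, 10 * (i + 1), freq) for i, freq in enumerate(frequencies)]

print(histogram_median(bins))  # mid-point of the interval containing the centre
```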

Compute median from a histogram

Interval	Frequency	Cumulative Frequency
0-10	126	126
10-20	59	185
20-30	44	229

Why does the above procedure make sense?

[Figure: the sorted data split into three groups: 0-20 (185 elements), 20-30 (44 elements, contains the median), 30-200 (452 - 229 = 223 elements)]

We don't know what the 44 values in the interval 20-30 are. They could be:

20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 23, 23, 23, 24, 24, 24, 24, 25, 25, 25, 25, 26, 26, 27, 27, 27, 27, 27, 27, 28, 28, 28, 28, 28, 28, 28, 28, 29, 29

but it is also possible that the 44 elements were different:

20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 25, 24, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 27, 28, 29

Since we do not know the actual values in the class interval, the best guess we can make is that the median is the mid-point of the class interval (we won't be very wrong)

True median: 28

Estimated median: 25

Error:

\frac{28-25}{28}*100=10.71\%

Compute median from a histogram

What if the class intervals are bigger?

[Figure: histogram of farm yields (x-axis: Total yield, y-axis: # Farms), Bin Size: 10000]

n = 1611

centre = (1611+1)/2 = 806

class interval containing centre = 90000-100000

True median: 96080

Estimated median = 95000

Error:

\frac{96080-95000}{96080}*100=1.12\%

The absolute error may be high due to larger values in the data but the relative error may still be reasonable

Compute mean from a histogram

How do you compute the mean from a histogram?

[Figure: histogram (x-axis: Runs scored, y-axis: # Matches)]

Interval	Frequency
0-10	126
10-20	59
20-30	44
30-40	47
40-50	31
50-60	22
60-70	28
70-80	10
80-90	18
90-100	18
100-110	12
110-120	12
120-130	9
130-140	4
140-150	7
150-160	1
160-170	1
170-180	1
180-190	1
190-200	0
200-210	1

Compute mean from a histogram

Compute the mid-point of each interval

Multiply the mid-point by the frequency of the interval

Sum up the resulting product for all intervals

Divide by the no. of data points

Interval	Frequency	Mid-point	Mid-point * frequency
0-10	126	5	630
10-20	59	15	885
20-30	44	25	1100
30-40	47	35	1645
40-50	31	45	1395
50-60	22	55	1210
60-70	28	65	1820
70-80	10	75	750
80-90	18	85	1530
90-100	18	95	1710
100-110	12	105	1260
110-120	12	115	1380
120-130	9	125	1125
130-140	4	135	540
140-150	7	145	1015
150-160	1	155	155
160-170	1	165	165
170-180	1	175	175
180-190	1	185	185
190-200	0	195	0
200-210	1	205	205

sum = 18880

mean = \frac{18880}{452} = 41.77
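The four steps above can be sketched in a few lines of Python. The function name `histogram_mean` and the `(lower, upper, frequency)` bin representation are assumptions made for this illustration; the frequencies are those of the table.

```python
def histogram_mean(bins):
    """Estimate the mean from bins given as (lower, upper, frequency):
    treat every value in a bin as if it were equal to the bin's mid-point."""
    n = sum(freq for _, _, freq in bins)
    if n == 0:
        raise ValueError("histogram is empty")
    weighted_sum = sum((lower + upper) / 2 * freq for lower, upper, freq in bins)
    return weighted_sum / n


# Frequencies for the intervals 0-10, 10-20, ..., 200-210 from the table.
frequencies = [126, 59, 44, 47, 31, 22, 28, 10, 18, 18,
               12, 12, 9, 4, 7, 1, 1, 1, 1, 0, 1]
bins = [(10 * i, 10 * (i + 1), freq) for i, freq in enumerate(frequencies)]

# Sum of (mid-point * frequency), i.e. the last column of the table.
total = sum((lower + upper) / 2 * freq for lower, upper, freq in bins)

print(total)
print(round(histogram_mean(bins), 2))
```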

Compute mean from a histogram

True mean = 40.76

Estimated mean = 41.77

Error

=\frac{40.76-41.77}{40.76}*100\\=-2.48\%

What is the intuition behind this procedure?

Compute mean from a histogram

What is the intuition behind this procedure?

\bar{x}=\frac{\sum_{i=1}^{n} x_i}{n}

\sum_{i=1}^{n} x_i =

sum of elements in 1st interval

+ sum of elements in 2nd interval

+ sum of elements in 3rd interval

... ... ... ...

+ sum of elements in last interval

Compute mean from a histogram

sum of elements in 8th interval (70-80, frequency 10)

= sum of 10 elements

Problem: We do not know what these 10 values are

Solution: Assume each value is equal to the mid-point (over-estimate some values, underestimate some values)

Example: 70, 70, 71, 72, 73, 74, 77, 77, 78, 79

mid-point = 75, True sum = 741

Approximation: add 75 10 times = 75 * 10 = 750

Compute mean from a histogram

What if the class intervals are bigger?

True mean: 135812.14

Estimated mean: 136837.37

Error:

\frac{135812.14-136837.37}{135812.14}*100=-0.75\%

The absolute error may be high due to larger values in the data but the relative error may still be reasonable

Compute mode from a histogram

It is not possible to compute the mode from a histogram if the bin size is greater than 1

2, 2, 11, 13, 14, 15, 16, 19, 20, 22, 23, 24, 26, 29, 31, 32, 35, 36, 37, 38, 39, 41, 42, 43, 44, 48, 49, 50, 52, 53, 54, 55, 56, 58, 60, 61, 62, 63, 64, 65, 66, 68, 69

Compute mode from a histogram

Of course, if the bin size is 1, it is trivial to compute the mode

Modes: 0 and 1

What is the effect of transformations on the measures of centrality?

Transformations

Scaling:

x_{new} = a*x

Example: kilometres to metres (a = 1000)

Original Data: (Distance km)

[50.52, 62.935, 50.888, 62.94, 62.929, 37.8, 36.687, 39.122, 63.453, 44.845]

Scaled Data: (Distance m)

[50520.0, 62935.0, 50888.0, 62940.0, 62929.0, 37800.0, 36687.0, 39122.0, 63453.0, 44845.0]

Example: pounds to kilograms (a = 0.4535)

Original Data: (weight lbs)

[13, 29, 21, 34, 30, 33, 11, 31, 15, 20]

Scaled Data: (weight kgs)

[5.89, 13.15, 9.52, 15.41, 13.60, 14.96, 4.98, 14.05, 6.80, 9.07]

Transformations

Shifting:

x_{new} = x + c

Example: flat 50 INR off on shirts (c = -50)

Original Data: (Pre Discount)

[699, 599, 549, 1499, 799, 999, 1150, 850, 899, 1099]

Shifted Data: (Post Discount)

[649, 549, 499, 1449, 749, 949, 1100, 800, 849, 1049]

Example: flat 5 INR packing charge per item (c = 5)

Original Data:

{'Veg Burger': 35, 'Cheese Maggi':45, 'Masala Dosa':80, 'Fried Rice': 75, 'Pizza': 129}

Shifted Data: (Post Packing)

{'Veg Burger': 40, 'Cheese Maggi':50, 'Masala Dosa':85, 'Fried Rice': 80, 'Pizza': 134}

Transformations

Scaling and Shifting

x_{new} = a*x + c

(a = 5/9, c = -160/9)

Temperature in Fahrenheit:

[75.25, 71 , 55.15, 58.28, 69.71, 44.4 , 38.77, 44.96, 80.7 , 73.76]

Temperature in Celsius:

[24.03, 21.67, 12.86, 14.6 , 20.95, 6.89, 3.76, 7.2 , 27.06, 23.2]

Transformations

Summary

Scaling and Shifting:

x_{new} = a*x + c

Special cases:

c=0: x_{new} = a*x (Only scaling)

a=1: x_{new} = x+c (Only shifting)

Effect of transformations on mean

Prove that if

x_{new} = a*x + c

then,

\bar{x}_{new} = a*\bar{x} + c

Proof:

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

\bar{x}_{new} = \frac{1}{n}\sum_{i=1}^{n} x_{new,i}

= \frac{1}{n}\sum_{i=1}^{n} (a x_i + c)

= \frac{1}{n}(\sum_{i=1}^{n} a x_i + \sum_{i=1}^{n} c )

= a*\frac{1}{n}\sum_{i=1}^{n} x_i + \frac{1}{n} * nc

= a\bar{x} + c

Effect of transformations on mean

Temperature in Fahrenheit:

[75.25, 71 , 55.15, 58.28, 69.71, 44.4 , 38.77, 44.96, 80.7 , 73.76]

\bar{x} = 61.2

Temperature in Celsius:

[24.03, 21.67, 12.86, 14.6 , 20.95, 6.89, 3.76, 7.2 , 27.06, 23.2]

\bar{x}_{new} = a*\bar{x} + c

(a = 5/9, c = -160/9)

\bar{x}_{new} = \frac{5}{9}*61.2 + \frac{-160}{9} = 16.2
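The result can be checked numerically on the temperature data. The sketch below converts every Fahrenheit reading to Celsius and compares the mean of the converted values with the converted mean; the variable names are illustrative.

```python
import math

fahrenheit = [75.25, 71, 55.15, 58.28, 69.71, 44.4, 38.77, 44.96, 80.7, 73.76]

# Fahrenheit to Celsius written as x_new = a*x + c.
a, c = 5 / 9, -160 / 9
celsius = [a * x + c for x in fahrenheit]

mean_f = sum(fahrenheit) / len(fahrenheit)
mean_c = sum(celsius) / len(celsius)

print(round(mean_f, 1), round(mean_c, 1))
# The mean of the transformed data equals the transformed mean
# (up to floating-point rounding).
print(math.isclose(mean_c, a * mean_f + c))
```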

Effect of transformations (on median)

x_{new} = a*x + c

median_{new} = a*median + c

Temperature in Fahrenheit:

[38.77, 44.4 , 44.96, 55.15, 58.28, 69.71, 71]

Temperature in Celsius:

[3.76, 6.89, 7.2 , 12.86, 14.6 , 20.95, 21.67]

The location of the median in the sorted data does not change (only its value gets scaled and shifted)

12.86 = \frac{5}{9}*55.15 + \frac{-160}{9}

Effect of transformations (on mode)

x_{new} = a*x + c

mode_{new} = a*mode + c

Temperature in Fahrenheit:

[38.77, 44.44 , 44.44, 55.15, 58.28, 69.71, 71. , 73.76, 75.25, 80.7]

Temperature in Celsius:

[3.76, 6.91, 6.91, 12.86, 14.6 , 20.95, 21.67, 23.2 , 24.03, 27.06]

The transformed value of the mode will be the new mode

6.91 = \frac{5}{9}*44.44 + \frac{-160}{9}
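The same kind of numerical check works for the median and the mode. The sketch below uses the two Fahrenheit lists from the slides and the standard `statistics` module; the helper name `to_celsius` is an assumption made for this illustration.

```python
import math
import statistics

a, c = 5 / 9, -160 / 9


def to_celsius(x):
    """Apply x_new = a*x + c (Fahrenheit to Celsius)."""
    return a * x + c


# Data used on the median slide (7 readings).
f_for_median = [38.77, 44.4, 44.96, 55.15, 58.28, 69.71, 71]
# Data used on the mode slide (44.44 appears twice).
f_for_mode = [38.77, 44.44, 44.44, 55.15, 58.28, 69.71, 71.0, 73.76, 75.25, 80.7]

median_new = statistics.median([to_celsius(x) for x in f_for_median])
mode_new = statistics.mode([to_celsius(x) for x in f_for_mode])

print(round(median_new, 2), round(mode_new, 2))
print(math.isclose(median_new, to_celsius(statistics.median(f_for_median))))
print(math.isclose(mode_new, to_celsius(statistics.mode(f_for_mode))))
```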

Summary

mean

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

median

x_{\frac{n+1}{2}}~or~ \frac{x_{\frac{n}{2}} + x_{\frac{n}{2} + 1}}{2}

mode

most freq. element

Mean is the centre of gravity of the data

Mean is sensitive to outliers but median is not

Summary

Effect of Skewness (almost always)

Left skewed: mean < median < mode

Right skewed: mean > median > mode

Symmetric: mean = median = mode

Mean and median can be approximately computed from histograms

Effect of Transformations:

x_{new} = a*x + c

\bar{x}_{new} = a*\bar{x} + c

median_{new} = a*median + c

mode_{new} = a*mode + c

Typical trends in histograms

Left-skewed histogram: Most of the short bars are towards the left of the histogram

[Figure: three histograms (x-axes: Units, Average Strike Rate, Units; y-axis: Frequency)]

Measures of Centrality

Age	Height	Weight	Cholesterol	Sugar level	.... .....
32	165	75	124	108	...
24	172	81	112	98	...
...	...	...	...	...	...
...	...	...	...	...	...
...	...	...	...	...	...

Which is the 7th most grown crop in the country?

Hard to answer

Nominal attributes

Recall

Question: What is the typical value of an attribute in our dataset?

How many runs does Sachin Tendulkar typically score in a match?

How many balls does Sachin Tendulkar typically face in a match?

Summarising Data - Part 1

By One Fourth Labs

Summarising Data - Part 1

PadhAI One: FDS Week 3 (MK)

One Fourth Labs

We deliver courseware in AI and related areas
