Chapter 11 Language of Descriptive Statistics

Section 11.3 Statistical Measures

11.3.2 Robust Measures


The measures presented in this section are robust with respect to outliers: large deviations of single data values do not affect these measures (or affect them only slightly).
Consider an original list

\[ x = (x_1, x_2, \ldots, x_n) \]

for a sample of size $n$. Let the data $x_i$ be the property values of a quantitative property $X$.
Info 11.3.7
 

The list $x_{(\cdot)} = (x_{(1)}, x_{(2)}, \ldots, x_{(n)})$ obtained by sorting the original list in ascending order,

\[ x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}, \]

is called an ordered list or ordered sample (of the original list $x$). The $i$-th entry $x_{(i)}$ of the ordered list is the $i$-th smallest value in the original list.

Example 11.3.8
Let us again consider the original list $x = (x_1, x_2, \ldots, x_{20})$ for the sample of size $n = 20$ from the examples above. Sorting in ascending order results in the following ordered sample $x_{(\cdot)} = (x_{(1)}, x_{(2)}, \ldots, x_{(20)})$:

7, 9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 22


Info 11.3.9
 
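The (empirical) median $\tilde{x}$ of $x_1, x_2, \ldots, x_n$ is defined as

\[
\tilde{x} =
\begin{cases}
x_{\left(\frac{n+1}{2}\right)} & \text{for } n \text{ odd}, \\[4pt]
\frac{1}{2} \cdot \left( x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right) & \text{for } n \text{ even}.
\end{cases}
\]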


In contrast to the arithmetic mean, the (empirical) median is not sensitive to outliers. For example, the largest value in the ordered list can be made arbitrarily large without changing the median.
Example 11.3.10
In the example above, the sample size $n = 20$ is even. Thus, the median is

\[ \tilde{x} = \frac{1}{2} \cdot \left( x_{(10)} + x_{(11)} \right) = \frac{1}{2} \cdot (11 + 11) = 11. \]
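
The case distinction for the median can be checked with a few lines of Python. This is only a minimal sketch: the function name `median`, the variable names, and the inflated value 2200 are illustrative assumptions and do not come from the course text. The list `sample` holds the 20 values of the ordered sample from the example.

```python
def median(values):
    """Empirical median, following the odd/even case distinction above."""
    ordered = sorted(values)  # ordered sample x_(1) <= ... <= x_(n)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    if n % 2 == 1:
        # n odd: middle entry x_((n+1)/2); Python lists are 0-based
        return ordered[(n + 1) // 2 - 1]
    # n even: average of x_(n/2) and x_(n/2 + 1)
    return 0.5 * (ordered[n // 2 - 1] + ordered[n // 2])


sample = [7, 9, 9, 9, 9, 10, 10, 10, 10, 11,
          11, 11, 11, 12, 12, 12, 12, 13, 13, 22]

# Hypothetical variant: the largest value 22 is replaced by 2200.
inflated = sample[:-1] + [2200]

print(median(sample))    # 11.0
print(median(inflated))  # 11.0, the median is unchanged

# The arithmetic mean, by contrast, reacts strongly to the outlier.
print(sum(sample) / len(sample))      # 11.15
print(sum(inflated) / len(inflated))  # 120.05
```

Replacing the largest value changes the arithmetic mean from 11.15 to 120.05, while the median stays at 11.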


Approximately half of the values in the original list are less than or equal to the median $\tilde{x}$, and approximately half are greater than or equal to it. This principle can be generalised to define quantiles. For this purpose, take an original list $x = (x_1, x_2, \ldots, x_n)$ for a sample of size $n$ of a quantitative property $X$.
Info 11.3.11
 
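Let

\[ x_{(\cdot)} = (x_{(1)}, x_{(2)}, \ldots, x_{(n)}) \]

be the corresponding ordered sample and

\[ \alpha \in (0,1) \quad \text{and} \quad k = \operatorname{floor}(n \cdot \alpha) = \lfloor n \cdot \alpha \rfloor. \]

Then

\[
\tilde{x}_\alpha =
\begin{cases}
x_{(k+1)} & \text{if } n \cdot \alpha \notin \mathbb{N}, \\[4pt]
\frac{1}{2} \cdot \left( x_{(k)} + x_{(k+1)} \right) & \text{if } n \cdot \alpha \in \mathbb{N}
\end{cases}
\]

is called a sample $\alpha$-quantile or simply $\alpha$-quantile of $x_1, x_2, \ldots, x_n$.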

The 0.25-quantile is also called the lower quartile. It separates approximately the lowest 25% of the data values from the highest 75%. Accordingly, the 0.75-quantile is called the upper quartile. For $\alpha = 0.5$ we obtain the median, i.e. $\tilde{x} = \tilde{x}_{0.5}$. In general, for $\alpha \in (0,1)$ the ordered list $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ is split so that approximately $\alpha \cdot 100\%$ of the data values are less than or equal to $\tilde{x}_\alpha$ and approximately $(1-\alpha) \cdot 100\%$ of the data values are greater than or equal to $\tilde{x}_\alpha$.
Example 11.3.12
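Consider again the original list $x = (x_1, x_2, \ldots, x_{20})$ for the sample of size $n = 20$ from the examples above together with the ordered sample $x_{(\cdot)} = (x_{(1)}, x_{(2)}, \ldots, x_{(20)})$:

7, 9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 22

For $\alpha = 0.25$ we have $n \cdot \alpha = \frac{20}{4} = 5$, which is an integer, so $k = 5$ and the lower quartile is

\[ \tilde{x}_{0.25} = \frac{1}{2} \cdot \left( x_{(5)} + x_{(6)} \right) = \frac{1}{2} \cdot (9 + 10) = \frac{19}{2} = 9.5. \]

For the upper quartile, we set $\alpha = 0.75$ and obtain $n \cdot \alpha = \frac{20 \cdot 3}{4} = 15$, hence

\[ \tilde{x}_{0.75} = \frac{1}{2} \cdot \left( x_{(15)} + x_{(16)} \right) = \frac{1}{2} \cdot (12 + 12) = 12. \]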
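
The quantile definition can be sketched in Python as follows. The function name `quantile` and the tolerance used to decide whether $n \cdot \alpha$ is an integer are assumptions made for this illustration. The product is computed in floating-point arithmetic, so an exact integer test would be unreliable for some values of $\alpha$.

```python
import math


def quantile(values, alpha):
    """Sample alpha-quantile following the two-case definition above."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie in the open interval (0, 1)")
    ordered = sorted(values)  # ordered sample x_(1) <= ... <= x_(n)
    n = len(ordered)
    if n == 0:
        raise ValueError("the quantile of an empty sample is undefined")

    n_alpha = n * alpha
    nearest = round(n_alpha)
    # Assumption: n*alpha counts as an integer if it is within a small
    # tolerance of one, to absorb floating-point rounding.
    is_integer = (1 <= nearest <= n - 1
                  and math.isclose(n_alpha, nearest, abs_tol=1e-9))

    if is_integer:
        k = nearest
        # average of x_(k) and x_(k+1); lists are 0-based
        return 0.5 * (ordered[k - 1] + ordered[k])
    k = math.floor(n_alpha)
    return ordered[k]  # x_(k+1) in 1-based notation


sample = [7, 9, 9, 9, 9, 10, 10, 10, 10, 11,
          11, 11, 11, 12, 12, 12, 12, 13, 13, 22]

print(quantile(sample, 0.25))  # 9.5  (lower quartile)
print(quantile(sample, 0.5))   # 11.0 (median)
print(quantile(sample, 0.75))  # 12.0 (upper quartile)
```

Library routines for quantiles often use other conventions, such as interpolation between neighbouring values, so their results can differ from the definition used here.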


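Again, let a sample of size $n$ of a quantitative property $X$ be given, with the corresponding ordered sample

\[ x_{(\cdot)} = (x_{(1)}, x_{(2)}, \ldots, x_{(n)}) \]

and

\[ \alpha \in [0, 0.5) \quad \text{and} \quad k = \operatorname{floor}(n \cdot \alpha) = \lfloor n \cdot \alpha \rfloor. \]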


Info 11.3.13
 
The $\alpha$-trimmed (or $\alpha$-truncated) sample mean is defined as

\[ \bar{x}_\alpha = \frac{1}{n - 2k} \cdot \sum_{j=k+1}^{n-k} x_{(j)} = \frac{1}{n - 2k} \cdot \left( x_{(k+1)} + \cdots + x_{(n-k)} \right). \]


The $\alpha$-trimmed mean is an arithmetic mean that discards the $k$ smallest and the $k$ largest data points, i.e. approximately $\alpha \cdot 100\%$ of the data at each end. Thus, it is a flexible tool for protecting against outliers at the boundaries of the data range. However, we must not forget that we no longer take all data into account when we use this tool.
Example 11.3.14
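In the data set considered throughout this section, the ordered sample $x_{(\cdot)} = (x_{(1)}, x_{(2)}, \ldots, x_{(20)})$ is given by

7, 9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 22,

and for $\alpha = 0.12$ and $k = \lfloor 20 \cdot 0.12 \rfloor = \lfloor 2.4 \rfloor = 2$ we obtain for the 12%-trimmed mean of the sample

\[ \bar{x}_{0.12} = \frac{1}{16} \cdot \sum_{j=3}^{18} x_{(j)} = \frac{1}{16} \cdot 172 = 10.75. \]

It is less than the arithmetic mean $\bar{x} = 11.15$ since outliers, such as $x_{(20)} = 22$, were ignored.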
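
The trimmed mean can be reproduced with the following minimal Python sketch. The function name `trimmed_mean` is an assumption made for this illustration. Note that $n \cdot \alpha$ is evaluated in floating-point arithmetic, so for values of $\alpha$ where the product should be a whole number, rounding may shift $k$ by one.

```python
import math


def trimmed_mean(values, alpha):
    """alpha-trimmed mean: drop the k smallest and k largest values."""
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must lie in the interval [0, 0.5)")
    ordered = sorted(values)  # ordered sample x_(1) <= ... <= x_(n)
    n = len(ordered)
    if n == 0:
        raise ValueError("the mean of an empty sample is undefined")

    k = math.floor(n * alpha)  # number of values cut at each end
    kept = ordered[k:n - k]    # x_(k+1), ..., x_(n-k)
    return sum(kept) / len(kept)


sample = [7, 9, 9, 9, 9, 10, 10, 10, 10, 11,
          11, 11, 11, 12, 12, 12, 12, 13, 13, 22]

print(trimmed_mean(sample, 0.12))  # 10.75 (k = 2, 16 values remain)
print(trimmed_mean(sample, 0.0))   # 11.15 (ordinary arithmetic mean)
```

For $\alpha = 0$ nothing is trimmed and the function returns the ordinary arithmetic mean, which matches the value 11.15 quoted above.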