Introduction
Extreme value statistics deals with estimating the probability that rare events occur. We shall
illustrate this with an example taken from [Castillo88]:
Consider a civil engineer designing a jetty on a sea. The civil engineer spends several days measuring the
heights of waves at the site and gets a sequence of measurements ranging between 2ft and 5ft, with an average of
3ft and a standard deviation of 2.5ft. The jetty itself is being built using
pylons which are 20ft in height - any wave higher than that will wash away the pylon and destroy the jetty.
What is the probability that a wave of more than 20ft will appear at this site, based on the available data?
A second variant of this problem is as follows. Assume that there are 10 pylons, and the jetty can survive the
loss of up to 4 pylons. So now we are interested in knowing the probability of four 20ft waves appearing at this
site.
A third problem is as follows. Imagine that there are n different traffic sources transmitting data into a
network where there is a single bottleneck buffer of capacity B. This buffer is subject to a limit check: if the
total amount of data exceeds B, then the excess data is dropped. Note that in this case we are looking at a
different type of variable, one which is the sum of random variables; of course, it is also random. If the
random variables are independent and identically distributed, then the sequence of partial sums, once the mean
has been subtracted from each term, is a martingale, a random process with very interesting
properties. We shall review some properties of martingales in this document as well.
The problems above are further complicated by the fact that knowledge of the underlying process can
be very limited. In many cases, the probability density function of the underlying process is not known. More
or less all we can know about the underlying process is what can be estimated by direct observation, i.e. the
mean and potentially the variance. In statistics, it is typical to use the Gaussian distribution in such cases.
However, the Gaussian is a good approximation only when we are considering values close to the observed mean,
not when we are interested in rare events. Also, the penalty for using the wrong cdf can be very severe.
In the following sections, we shall see how to estimate the probabilities of these so-called 'large deviation' or
'rare' events.
Definitions
A random variable is the quantified outcome of an experiment. Thus, if the experiment consists of drawing
a straight line on the ground and dropping a pin onto it, the outcome is the position of the pin on the line.
The quantification of this outcome would be, for example, the distance of the pin from one end-point of the
line.
The cumulative distribution function or cdf of a random variable X is the function
F(k) = P(X \le k)
Some simple estimates
We start with some very simple estimates. Given a non-negative random variable X with mean \mu, the Markov
inequality P(X \ge a) \le \mu/a, valid for a > 0, provides a bound on the probability of the
exceedance of a. When the standard deviation \sigma is also known, a bound that is usually tighter is provided by
Chebyshev's inequality P(\vert X - \mu \vert \ge \alpha \sigma) \le \frac{1}{\alpha^2}. Note that these estimates require
knowledge of only the mean and standard deviation, which can be directly estimated from a sample of the random
variable.
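As a rough illustration, the two bounds can be evaluated for the jetty example from the introduction (mean 3ft, standard deviation 2.5ft, pylon height 20ft). This is only a sketch: the function names are made up for this page, and the Chebyshev bound is applied with \alpha = (a - \mu)/\sigma, which bounds the two-sided deviation and therefore also the one-sided exceedance.

```python
def markov_bound(mean: float, a: float) -> float:
    """Markov: P(X >= a) <= mean / a, for non-negative X and a > 0."""
    if a <= 0:
        raise ValueError("a must be positive")
    return min(1.0, mean / a)


def chebyshev_bound(mean: float, std: float, a: float) -> float:
    """Chebyshev: P(|X - mean| >= alpha * std) <= 1 / alpha**2.

    With alpha = (a - mean) / std this also bounds P(X >= a).
    """
    alpha = (a - mean) / std
    if alpha <= 0:
        return 1.0  # the inequality says nothing useful at or below the mean
    return min(1.0, 1.0 / alpha ** 2)


# Figures from the jetty example: mean 3 ft, std 2.5 ft, pylon height 20 ft.
markov = markov_bound(3.0, 20.0)
chebyshev = chebyshev_bound(3.0, 2.5, 20.0)
print(f"Markov bound:    {markov:.4f}")     # 0.1500
print(f"Chebyshev bound: {chebyshev:.4f}")  # 0.0216
```

Both numbers are upper bounds that hold for any distribution with these moments, so the true probability may be far smaller.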
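Next consider the sum S = \sum_{i=0}^{n-1} X_i of n independent random variables with zero mean and
\vert X_i \vert \le 1. A strong limit for S is provided by Chernoff's inequality
P(\vert S \vert \ge \alpha \sigma) \le 2 \exp(-\alpha^2/4), where \sigma is the standard deviation of S and
0 \le \alpha \le 2\sigma; it holds if the absolute value of X_i is no larger than 1.0 for all samples.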
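The bound can be checked against a case where the tail is known exactly. The sketch below assumes the X_i are fair +1/-1 steps (so each has zero mean, absolute value 1 and variance 1, and S has standard deviation \sqrt{n}); this choice of distribution and the function names are illustrative assumptions, not part of the original text.

```python
import math


def exact_tail(n: int, alpha: float) -> float:
    """Exact P(|S| >= alpha * sigma) for S = sum of n fair +/-1 steps."""
    sigma = math.sqrt(n)  # Var(S) = n because each step has variance 1
    threshold = alpha * sigma
    favourable = 0
    for heads in range(n + 1):
        s = 2 * heads - n  # value of S when `heads` of the steps are +1
        if abs(s) >= threshold:
            favourable += math.comb(n, heads)
    return favourable / 2 ** n


def chernoff_bound(alpha: float) -> float:
    """Upper bound 2 * exp(-alpha**2 / 4) on P(|S| >= alpha * sigma)."""
    return 2.0 * math.exp(-alpha ** 2 / 4.0)


n = 100
for alpha in (1.0, 2.0, 3.0, 4.0):
    print(f"alpha={alpha}: exact={exact_tail(n, alpha):.6f}, "
          f"bound={chernoff_bound(alpha):.6f}")
```

For small \alpha the bound exceeds 1 and is uninformative; it becomes useful only as \alpha grows, which is exactly the rare-event regime.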
Order statistics
Consider a random variable X with cumulative distribution function F(k) = P(X \le k). If n samples of X are
sorted in increasing order, the rth smallest value is called the rth order statistic; the minimum and the maximum
are the first and the nth order statistics.
An interesting example of the use of order statistics comes from World War II. The German army used
to put serial numbers on tanks. Using these serial numbers, the Allies tried to estimate the number of German
tanks produced each month. The data in the following table show that the statistical estimates gave
surprisingly accurate values:
| Month | Statistical estimate | Intelligence estimate | Actual data (verified from German records) |
| June 1940 | 169 | 1000 | 122 |
| June 1941 | 244 | 1500 | 271 |
| August 1942 | 327 | 1550 | 342 |
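The text above does not say which estimator the Allies used. A common textbook choice for this kind of problem, assuming serial numbers run from 1 to N and the observed tanks are a uniform sample without replacement, is N ≈ m + m/k - 1, where m is the largest serial number seen and k is the number of tanks observed. The sketch below uses made-up serial numbers purely for illustration.

```python
from typing import Sequence


def estimate_population_size(serials: Sequence[int]) -> float:
    """Estimate N from observed serial numbers drawn from 1..N.

    Uses the maximum (the largest order statistic) plus the average gap
    between observed serials: m + m / k - 1.
    """
    if not serials:
        raise ValueError("need at least one observed serial number")
    k = len(serials)
    m = max(serials)
    return m + m / k - 1


# Hypothetical observations, not historical data.
observed = [19, 40, 42, 60]
estimate = estimate_population_size(observed)
print(estimate)  # 74.0
```

The estimate depends on the sample only through its maximum and its size, which is why the problem is a natural example of order statistics.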
Estimators of order statistics
Some common formulae for the distribution functions of order statistics are given below:
| Distribution function | Formula |
| cdf of the rth order statistic | r \binom{n}{r} \int_0^{F(x)} u^{r-1} (1-u)^{n-r} du = I_{F(x)}(r, n-r+1) |
| cdf of the maximum among n samples | {F(x)}^n |
| cdf of the minimum among n samples | 1 - (1-F(x))^n |
| cdf of the range | n \int_{-\infty}^{\infty} f(u) {[F(u+w) - F(u)]}^{n-1} du |
Note that I_p(a,b) is the regularized incomplete beta function.
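The first three rows of the table can be cross-checked numerically. The sketch below relies on the fact that the rth order statistic is at most x exactly when at least r of the n samples are at most x, so its cdf is a binomial tail sum in p = F(x); this sum equals I_p(r, n-r+1). The function names and the test values p = 0.7, n = 5 are arbitrary choices for illustration.

```python
import math


def order_statistic_cdf(p: float, r: int, n: int) -> float:
    """P(X_(r) <= x) where p = F(x): at least r of n samples are <= x."""
    return sum(
        math.comb(n, j) * p ** j * (1.0 - p) ** (n - j)
        for j in range(r, n + 1)
    )


def max_cdf(p: float, n: int) -> float:
    """cdf of the maximum among n samples: F(x)^n."""
    return p ** n


def min_cdf(p: float, n: int) -> float:
    """cdf of the minimum among n samples: 1 - (1 - F(x))^n."""
    return 1.0 - (1.0 - p) ** n


p, n = 0.7, 5
# r = n is the maximum and r = 1 is the minimum.
print(order_statistic_cdf(p, n, n), max_cdf(p, n))
print(order_statistic_cdf(p, 1, n), min_cdf(p, n))
```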
The above formulae have one very large limitation: they involve high powers of the cumulative
distribution function. When we say that we "know" the cdf of a process, we mean that we either have a theoretical
model of the process or we have done an empirical study and fitted a model to it. In either case, the cdf is
an approximation; even more importantly, the act of approximation typically consists of fitting the mean and
other central moments, not of fitting the cdf itself, especially for larger values. Now consider that we
have used a cdf F(), where the correct cdf was G(). Let us say that the two cdfs are very close to
each other at some point a: F(a) = 0.98 and G(a) = 0.99. However, if we are considering the cdf of the maximum
of 100 samples, then the difference between F^{100}(a) = 0.98^{100} = 0.133 and G^{100}(a) = 0.99^{100} = 0.37
is actually very large! Thus, a very small error of estimation is blown up significantly. To avoid this problem,
we use limit distributions.
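The arithmetic above is easy to reproduce, and it is instructive to see how the gap grows with the number of samples. The short sketch below uses the same two values, F(a) = 0.98 and G(a) = 0.99; the sample sizes other than 100 are added only for illustration.

```python
def max_cdf_at(p: float, n: int) -> float:
    """cdf of the maximum of n samples, evaluated where the parent cdf is p."""
    return p ** n


f_a, g_a = 0.98, 0.99  # assumed cdf and correct cdf at the point a
for n in (1, 10, 100, 1000):
    f_max = max_cdf_at(f_a, n)
    g_max = max_cdf_at(g_a, n)
    print(f"n={n:5d}  F^n={f_max:.6f}  G^n={g_max:.6f}  ratio={g_max / f_max:.1f}")
```

A relative error of about 1% in the parent cdf becomes a factor of nearly 3 for n = 100, and keeps growing as n increases.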
Limit distributions
There are many cases when either the number of samples n is very large, or the cdf of the parent process is not
known. In these situations, the table given above is of no use. Rather, we use limit distributions to estimate
the cdf of the maximum or minimum values. There are three limit distributions, shown in the table below.
| Distribution | Maxima limit | Minimum Limit |
| Frechet | H_{1,\gamma}(x) = \exp(-x^{-\gamma}), x > 0 | L_{1,\gamma}(x) = 1 - \exp(-(-x)^{-\gamma}), x < 0 |
| Weibull | H_{2,\gamma}(x) = \exp(-(-x)^{\gamma}), x \le 0 | L_{2,\gamma}(x) = 1 - \exp(-x^{\gamma}), x > 0 |
| Gumbel | H_{3,0}(x) = \exp(-\exp(-x)) | L_{3,0}(x) = 1 - \exp(-\exp(x)) |
The applicability of each limit distribution depends on the nature of the cdf of the process and some other factors. Also, for the first two distributions, a value of \gamma has to be computed. The interested reader is invited to see section 3 of [Castillo88]. The following table summarizes some results for some well-known distribution functions.
| cdf of X | Limit Distribution to use | value of \gamma |
| Cauchy distribution | Frechet | 1 |
| Uniform distribution | Weibull | 1 |
| Exponential distribution | Gumbel for maxima, Weibull for minima | 1 (minima only) |
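The entries in this table can be illustrated numerically by comparing the exact cdf of a suitably rescaled extreme with the corresponding limit form. In the sketch below the Cauchy and uniform rows are checked for the maximum and the exponential row for the minimum. The rescaling constants (n/\pi for the Cauchy maximum, n(M - 1) for the uniform maximum on (0,1), n times the minimum for the unit-rate exponential) are standard choices assumed here, not taken from the text, and the function names are illustrative.

```python
import math


def frechet_max(x: float, gamma: float) -> float:
    """Frechet limit for maxima: exp(-x^(-gamma)) for x > 0."""
    return math.exp(-(x ** -gamma)) if x > 0 else 0.0


def weibull_max(x: float, gamma: float) -> float:
    """Weibull limit for maxima: exp(-(-x)^gamma) for x <= 0."""
    return math.exp(-((-x) ** gamma)) if x <= 0 else 1.0


def weibull_min(x: float, gamma: float) -> float:
    """Weibull limit for minima: 1 - exp(-x^gamma) for x > 0."""
    return 1.0 - math.exp(-(x ** gamma)) if x > 0 else 0.0


def cauchy_max_cdf(x: float, n: int) -> float:
    """Exact P(M_n <= (n / pi) * x) for n standard Cauchy samples."""
    parent = 0.5 + math.atan(n * x / math.pi) / math.pi
    return parent ** n


def uniform_max_cdf(x: float, n: int) -> float:
    """Exact P(n * (M_n - 1) <= x) for n Uniform(0,1) samples, -n <= x <= 0."""
    return (1.0 + x / n) ** n


def exponential_min_cdf(x: float, n: int) -> float:
    """Exact P(n * m_n <= x) for the minimum m_n of n Exp(1) samples, x >= 0."""
    return 1.0 - math.exp(-x / n) ** n


for n in (10, 1000, 100000):
    print(n,
          cauchy_max_cdf(1.0, n), frechet_max(1.0, 1.0),
          uniform_max_cdf(-1.0, n), weibull_max(-1.0, 1.0),
          exponential_min_cdf(1.0, n), weibull_min(1.0, 1.0))
```

As n grows, each exact value approaches its limit form with \gamma = 1; for the exponential minimum the agreement is exact for every n, up to rounding.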
References
[Castillo88] Enrique Castillo, "Extreme Value Theory in Engineering", Academic Press, 1988
[Williams] David Williams, "Probability with Martingales", Cambridge University Press
Maintainer: abheek.saha@hsc.com