HSCTechnicalWiki



Extreme Value Statistics



Introduction

Extreme value statistics deals with estimating the probability that rare events occur. We illustrate this with an example taken from [Castillo88]:

Consider a civil engineer designing a jetty on the sea. The engineer spends several days measuring the heights of waves at the site and obtains a sequence of measurements ranging from 2 ft to 5 ft, with an average of 3 ft and a standard deviation of 2.5 ft. The jetty itself is being built on pylons which are 20 ft in height; any wave higher than that will wash away the pylon and destroy the jetty. What is the probability that a wave of more than 20 ft will appear at this site, based on the available data?

A second variant of this problem is as follows. Assume that there are 10 pylons, and that the jetty can survive the loss of up to 4 of them. We are now interested in the probability that more than four waves exceeding 20 ft appear at this site.

A third problem is as follows. Imagine that there are n different traffic sources transmitting data into a network with a single bottleneck buffer of capacity B. This buffer is subject to a limit check: if the total amount of data exceeds B, the excess data is dropped. Note that in this case we are looking at a different type of variable, one which is the sum of random variables; of course, it is also random. If the random variables are independent and have zero mean (which can be arranged by subtracting the mean from each), then the sequence of partial sums is a martingale, a stochastic process with very interesting properties. We shall review some properties of martingales in this document as well.

The problems above are further complicated by the fact that knowledge of the underlying process can be very limited. In many cases, the probability density function of the underlying process is not known. More or less all we can know about the underlying process is what can be estimated by direct observation, i.e. the mean and possibly the variance. In statistics, it is typical to use the Gaussian distribution in such cases. However, the Gaussian is a good approximation only when we are considering values close to the observed mean, not when we are interested in rare events. Moreover, the penalty for using the wrong cdf can be very severe.

In the following sections, we shall see how to estimate the probabilities of these so-called 'large deviation' or 'rare' events.

Definitions

A random variable is the quantified outcome of an experiment. Thus, if the experiment consists of drawing a straight line on the ground and dropping a pin onto it, the outcome is the position of the pin on the line. The quantification of this outcome would be, for example, the distance of the pin from one end-point of the line.

The cumulative distribution function or cdf of a random variable X is the function F(k) = P(X \le k). The probability that X exceeds k is therefore 1 - F(k).

Some simple estimates

We start with some very simple estimates. Given a non-negative random variable X with mean \mu and any a > 0, the Markov inequality P(X \ge a) \le \mu/a provides a bound on the probability of exceeding a. If X also has a standard deviation \sigma, a bound that is usually tighter far from the mean is provided by Chebyshev's inequality P(\vert X - \mu \vert \ge \alpha \sigma) \le \frac{1}{\alpha^2}, which holds for any \alpha > 0. Note that these estimates require knowledge of only the mean and standard deviation, which can be directly estimated from a sample of the random variable.
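As a rough illustration, the two bounds can be applied to the jetty example from the introduction (mean 3 ft, standard deviation 2.5 ft, pylon height 20 ft). The function names below are illustrative only, and the sketch assumes wave heights are non-negative so that the Markov inequality applies.

```python
def markov_bound(mean, a):
    """Upper bound on P(X >= a) for a non-negative X with the given mean."""
    if a <= 0:
        raise ValueError("a must be positive")
    return min(1.0, mean / a)


def chebyshev_bound(mean, std, a):
    """Upper bound on P(|X - mean| >= a - mean); it also bounds P(X >= a) when a > mean."""
    if a <= mean:
        raise ValueError("a must exceed the mean")
    alpha = (a - mean) / std  # distance from the mean in standard deviations
    return min(1.0, 1.0 / alpha ** 2)


# Figures from the jetty example in the introduction.
mean_ft, std_ft, pylon_ft = 3.0, 2.5, 20.0
print("Markov bound:   ", markov_bound(mean_ft, pylon_ft))             # 0.15
print("Chebyshev bound:", chebyshev_bound(mean_ft, std_ft, pylon_ft))  # about 0.0216
```

Both numbers are upper bounds rather than estimates: they hold for any distribution with the stated mean and standard deviation, which is why they are so loose.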

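Next, consider the sum S = \sum_{i=0}^{n-1} X_i of n independent random variables with E[X_i] = 0 and \vert X_i \vert \le 1, and let \sigma be the standard deviation of S. A much stronger bound on S is provided by Chernoff's inequality P(\vert S \vert \ge \alpha \sigma) \le 2 \exp(-\alpha^2/4), which holds for 0 \le \alpha \le 2\sigma. Unlike the Chebyshev bound, it decays exponentially in \alpha.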
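The following sketch compares the Chernoff bound, in the form 2 \exp(-\alpha^2/4) for a sum of independent zero-mean terms bounded by 1 in absolute value, with a simulated tail probability. The choice of uniform terms on [-1, 1], the number of trials and the seed are all assumptions made for the illustration.

```python
import math
import random


def chernoff_bound(alpha):
    """Chernoff bound on P(|S| >= alpha * sigma), valid for 0 <= alpha <= 2 * sigma."""
    return min(1.0, 2.0 * math.exp(-alpha ** 2 / 4.0))


def empirical_tail(n, alpha, trials=20000, seed=1):
    """Simulated P(|S| >= alpha * sigma) for S = sum of n uniform(-1, 1) terms."""
    rng = random.Random(seed)
    sigma = math.sqrt(n / 3.0)  # each uniform(-1, 1) term has variance 1/3
    hits = 0
    for _ in range(trials):
        s = sum(rng.uniform(-1.0, 1.0) for _ in range(n))
        if abs(s) >= alpha * sigma:
            hits += 1
    return hits / trials


for alpha in (1.0, 2.0, 3.0):
    print(alpha, chernoff_bound(alpha), empirical_tail(50, alpha))
```

The simulated frequencies should sit well below the bound, since the bound has to hold for every distribution that satisfies the conditions.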

Order statistics

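Consider a random variable X with cumulative distribution function F(k) = P(X \le k). Given n independent samples of X sorted in increasing order, the rth smallest value is called the rth order statistic; the minimum and the maximum are the cases r = 1 and r = n. An interesting example of the use of order statistics comes from World War II. The German army used to put serial numbers on its tanks. Using these serial numbers, the Allies tried to estimate the number of German tanks produced each month. The data in the following table show that the statistical estimates were surprisingly accurate, and far closer to the actual figures than the intelligence estimates.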

Month | Statistical estimate | Intelligence estimate | Actual data (verified from German records)
June 1940 | 169 | 1000 | 122
June 1941 | 244 | 1500 | 271
August 1942 | 327 | 1550 | 342
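The page does not state which estimator was used. A standard choice for this problem, assuming serial numbers run from 1 to N and the observed sample is drawn uniformly without replacement, is m + m/k - 1, where m is the largest serial number seen and k is the sample size. The sketch below uses made-up serial numbers purely for illustration.

```python
def estimate_population_size(serials):
    """Estimate N from a sample of serial numbers, assumed drawn uniformly
    without replacement from 1..N, using the estimator m + m/k - 1."""
    if not serials:
        raise ValueError("need at least one serial number")
    k = len(serials)  # sample size
    m = max(serials)  # largest serial number observed (the top order statistic)
    return m + m / k - 1


# Hypothetical serial numbers, not historical data.
observed = [19, 40, 42, 60]
print(estimate_population_size(observed))  # 74.0
```

The estimate depends on the sample only through its maximum, which is what makes this an order-statistics problem.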

Estimators of ordered statistics

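Some common formulae for the distributions of order statistics are given below, for n independent samples drawn from a parent distribution with cdf F(x) and probability density function f(x):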

Distribution function | Formula
cdf of the rth order statistic | r \binom{n}{r} \int_0^{F(x)} u^{r-1} (1-u)^{n-r} du = I_{F(x)}(r, n-r+1)
cdf of the maximum among n samples | {F(x)}^n
cdf of the minimum among n samples | 1 - (1-F(x))^n
cdf of the range w | n \int_{-\infty}^{\infty} f(u) {[F(u+w) - F(u)]}^{n-1} du

Note that I_p(a,b) is the regularized incomplete beta function.
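For integer r and n, the incomplete beta expression I_{F(x)}(r, n-r+1) equals the probability that at least r of the n samples fall at or below x, which is a binomial tail sum. The sketch below uses that identity so that it needs only the standard library; the function names are illustrative.

```python
from math import comb


def order_stat_cdf(F_x, r, n):
    """cdf of the rth order statistic of n iid samples, given F_x = F(x).

    Equals P(at least r of the n samples are <= x).
    """
    if not 1 <= r <= n:
        raise ValueError("need 1 <= r <= n")
    return sum(comb(n, j) * F_x ** j * (1.0 - F_x) ** (n - j)
               for j in range(r, n + 1))


def max_cdf(F_x, n):
    """cdf of the maximum: all n samples are <= x."""
    return F_x ** n


def min_cdf(F_x, n):
    """cdf of the minimum: not all n samples are > x."""
    return 1.0 - (1.0 - F_x) ** n


# The general formula reduces to the max and min formulas at r = n and r = 1.
print(order_stat_cdf(0.7, 5, 5), max_cdf(0.7, 5))
print(order_stat_cdf(0.7, 1, 5), min_cdf(0.7, 5))
```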

The above formulae have one very large limitation: they involve high powers of the cumulative distribution function. When we say that we "know" the cdf of a process, we mean that we either have a theoretical model of the process or have done an empirical study and fitted a model to it. In either case, the cdf is an approximation; more importantly, the act of approximation typically consists of fitting the mean and other central moments, not of fitting the cdf itself, especially for larger values. Now suppose that we have used a cdf F() where the correct cdf was G(), and that the two cdfs are very close to each other, say F(a) = 0.98 and G(a) = 0.99. If we consider the cdf of the maximum of 100 samples, the difference between F^{100}(a) = {0.98}^{100} = 0.133 and G^{100}(a) = {0.99}^{100} = 0.37 is actually very large! Thus, a very small error of estimation is blown up significantly. To avoid this problem, we use limit distributions.
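The arithmetic in this example can be checked directly, and repeating it for several sample sizes shows how the gap changes as n grows. This is a small sketch with an illustrative helper name.

```python
def max_cdf_pair(f_a, g_a, n):
    """cdf of the maximum of n samples at a, under the fitted cdf F and the true cdf G."""
    return f_a ** n, g_a ** n


f_a, g_a = 0.98, 0.99  # values of F(a) and G(a) from the text
for n in (1, 10, 100):
    f_max, g_max = max_cdf_pair(f_a, g_a, n)
    print(f"n={n:4d}  F^n(a)={f_max:.3f}  G^n(a)={g_max:.3f}  ratio={g_max / f_max:.2f}")
```

At n = 1 the two values differ by about one percent; at n = 100 the true value is nearly three times the fitted one.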

Limit distributions

There are many cases when either the number of samples n is very large, or the cdf of the parent process is not known. In these situations, the table given above is of no use. Rather, we use limit distributions to estimate the cdf of the maximum or minimum values. There are three limit distributions, shown in the table below.

Distribution | Maxima limit | Minima limit
Frechet | H_{1,\gamma}(x) = \exp(-x^{-\gamma}), x > 0 | L_{1,\gamma}(x) = 1 - \exp(-(-x)^{-\gamma}), x < 0
Weibull | H_{2,\gamma}(x) = \exp(-(-x)^{\gamma}), x \le 0 | L_{2,\gamma}(x) = 1 - \exp(-x^{\gamma}), x \ge 0
Gumbel | H_{3,0}(x) = \exp(-\exp(-x)) | L_{3,0}(x) = 1 - \exp(-\exp(x))
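A minimal sketch of the three limit cdfs follows, using the standard forms in which the Weibull maximum and the Frechet minimum take the negated argument (-x). The function names are illustrative, and the minima limits are obtained from the maxima limits through the relation L(x) = 1 - H(-x), which follows from min(X) = -max(-X).

```python
import math


def frechet_max(x, gamma):
    """Frechet limit cdf for maxima: exp(-x^(-gamma)) for x > 0, else 0."""
    return math.exp(-x ** (-gamma)) if x > 0 else 0.0


def weibull_max(x, gamma):
    """Weibull limit cdf for maxima: exp(-(-x)^gamma) for x < 0, else 1."""
    return math.exp(-(-x) ** gamma) if x < 0 else 1.0


def gumbel_max(x):
    """Gumbel limit cdf for maxima: exp(-exp(-x)) for all real x."""
    return math.exp(-math.exp(-x))


def min_from_max(max_cdf, x, *params):
    """Minima limit cdf derived from a maxima limit cdf: L(x) = 1 - H(-x)."""
    return 1.0 - max_cdf(-x, *params)


print(frechet_max(1.0, 2.0), weibull_max(-1.0, 2.0), gumbel_max(0.0))
print(min_from_max(gumbel_max, 0.0))
```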

The applicability of each limit distribution depends on the nature of the cdf of the process and on some other factors. Also, for the first two distributions, a value of \gamma has to be computed. The interested reader is invited to see section 3 of [Castillo88]. The following table summarizes some results for some well-known distribution functions.

cdf of X | Limit distribution to use | Value of \gamma
Cauchy distribution | Frechet | 1
Uniform distribution | Weibull | 1
Exponential distribution | Weibull (for minima; the limit for maxima is Gumbel) | 1
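As a hedged numerical check of how a limit distribution arises, consider the maximum of n independent unit-rate exponential samples. For maxima of an exponential parent the limit is the Gumbel distribution (the Weibull form applies to its minima): the exact cdf of the maximum shifted by ln n is (1 - e^{-x}/n)^n, which approaches \exp(-\exp(-x)) as n grows. The unit rate and the ln n centering are assumptions of this sketch.

```python
import math


def exp_max_cdf(x, n):
    """Exact P(max of n iid Exp(1) samples - ln(n) <= x)."""
    t = x + math.log(n)  # undo the centering to get the raw threshold
    return (1.0 - math.exp(-t)) ** n if t > 0 else 0.0


def gumbel_cdf(x):
    """Gumbel limit cdf for maxima."""
    return math.exp(-math.exp(-x))


for n in (10, 1000, 100000):
    gap = max(abs(exp_max_cdf(x, n) - gumbel_cdf(x)) for x in (-1.0, 0.0, 1.0, 3.0))
    print(f"n={n:6d}  largest gap at the test points: {gap:.2e}")
```

The gap shrinks as n increases, which is the sense in which the Gumbel form can replace the n-th power of a parent cdf that is only approximately known.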

References

[Castillo88] Enrique Castillo, "Extreme Value Theory in Engineering", Academic Press, 1988

[Williams] David Williams, "Probability with Martingales", Cambridge University Press

Maintainer: abheek.saha@hsc.com


Page last modified on November 16, 2009, at 05:40 AM