HSC welcomes all external visitors to this site, especially students and members of the academic community. Please use the comments box at the bottom of each page to record any comments or suggestions for improvement.
Introduction
This article discusses 'rare' events and the special science of predicting and accounting for them in system
design. Understanding rare events is of interest in many disciplines of telecommunications and software
engineering, from predicting packet losses in a G/1 queue to doing reliability design.
Definitions
A rare event is an event which is caused by a culmination of multiple tiny improbable events. Since there are
multiple events which contribute, and the probabilities of
each of these events are low, the probability of
their concurrent occurence is lower still. For example, the probability of each and every member of Systique
having a birthday on the same day is a candidate, since this calls for multiple unrelated birthdays to all be
on one particular day. However, the event that a person wins a lottery is, in a sense, not a candidate for a
large deviation process, because it simply consists of one improbable event and cannot be broken down.
How does one estimate the probability of a rare event? The brute force means would be to find all different
permutations of the constituent events which cause this rare event, estimate the probability of each and then
go on to estimate the probability of the rare event. Not only is this cumbersome, but it runs into the old
problem of having to deal with large powers of fractions; any errors in the fraction value is magnified
significantly. Also, the knowledge of the probability distribution functions of the constituent events are
required, which is a big constraint.
An even more complex issue is that the rare event can come about by a large number of paths. In some cases, the
number of these paths can be infinite. Consider the sum of two random numbers S = X_1 + X_2. The event that
\{ S \ge c \} for some fixed c can occur if X_1 \ge c or X_2 \ge c or if
X_1 = c-k \text{ and } X_2 \ge k . k can take an infinite number of values, and for each possible value
of k, we have a different 'path' by which the rare event we are trying to estimate can be actualized. Large
deviation theory steps into the rescue here too - if there is one path, which has a "cost" significantly lower
than other paths, it is overwhelmingly the case that the rare event will always be actualized by that path
alone.
The large deviation principle is an alternate means to this end. Large deviation theory allows us, under certain
conditions, to find the rate function for the random variable in question. This allows us to directly
estimate the event probabilities asymptotically.
What is the large deviation principle?
Consider a process X , for example, the birthdate (day/month). For each student, we have recorded their
birthdates; in other words, we have taken a sequence \{X_n\} = \{X_1, X_2, \ldots, X_n \} of samples of
the process X. Let us take a random variable Z_n which is some function of \{X_n\}. Now consider
the event that \{Z_n < c \} for some real number c. Information Theory ? tells us that the amount
of information contained in this event is the log of the probability of the event. i.e. the rarer the event,
the lower its probability and the higher the information content.
Thus, for a particular class of 8 students, the samples read {11th Jan, 18th Jan, 11th March, 30th June,
29th July, 8th August, 22nd September, 30th Nov}. The large deviation principle establishes the conditions by
which an event can be treated as a large deviation event and allows us to estimate, in the limit, the
probability of that event.
Formal statement of the large deviation principle
The large deviation principle requires that we find a positive, semi-continuous and smooth function I(c) ,
such that
\lim_{n \rightarrow \infty} \frac{1}{n} \log(\text{Pr}\{Z_n \le c\}) \equiv -I(c) \forall c \in \mathbb{R}
.
As the equation indicates, the probability of the event can be estimated by knowledge of the event itself (which
is stored in the variable c) and the size of sample available. Due to the structure of the formulation, it is
clear that as n increases, the right hand side of the equation will converge for a good rate function. c here
indicates the size of deviation we are interested in; as c increases, the event presumably gets rarer and rarer.
Chernoff's theorem
One of the most common events we are interested in is the nth empirical mean Z_n = \frac{\sum_k^n x_k}{n}
of independent random variables. The term independent is emphasized here, because the proof of this theorem uses
the scaling property of the moment generating functions for independent random variables. This theorem is widely
used, for example, in queueing theory, to estimate the average size of the q, given n different sources of data.
Chernoff's theorem (also known as Cramer's theorem) gives the rate function for this to be
I(c) = \max_\theta (\theta c - \log(M(\theta)))
where
M(\theta) = \mathbb{E} e^{\theta x}
is the
well-known moment-generating function. If we
know the distribution of x_i and it is differentiable, we can easily solve this equation by using the
differentiation technique for maximizing for theta.
The table below gives the rate function for various distributions.
| Distribution | Rate function |
| Bernoulli | I(c) = c \log(c/p) (1-c)\log((1-c)/(1-p)) |
| Gaussian | I(c) = 0.5 \frac{ (c - \mu)^2 }{ \sigma^2 } |
| Exponential | I(c) = c \lambda - 1 - \log(c \lambda) |
| Poisson | I(c) = c \log(\frac{c}{\lambda}) + \lambda - c |
where all symbols have the usual meanings.
''What happens if the sequence of random variables is not independent? The Gartner-Ellis theorem is used in this
case. For more information, the reader can consult [addref]''
Chernoff's theorem extends in a more general way to arbitrary functions of \{X_1,\ldots,X_k\} as follows:
\lim_{n \rightarrow \infty} \frac{1}{n} \log(P \{f(X_1,X_2,\ldots,X_n) < c\} \equiv -I(c) \forall c \in \mathbb{R}
where I(x) is the Fenchel Legendre transform
I(c) = \max_\lambda (c \lambda - M(\lambda))
of
M(a) = \log(\sum_{i \in \mathbb{R}}P(X_i)e^{af(X_i)})
.
Clearly, the former equation is a special case of this equation.
Empirical measures and Sanov's theorem.
The section above assumes that the distribution of the random variable in question is known. We extend this to
the case where the distribution is not known, by creating an empirical distribution function. Consider that we
have a sequence of independent samples {1, 2.1, -0.9, 13, 7, 9, 4.2, 6.3}. We can define the empirical
distribution function F(k) = P \{X \le k\} by counting the number of samples which are less than or equal
to k and then dividing it by the total number of samples. The distribution function for sample set above is
shown in
.
Kolmogorov's strong law of large numbers tells us that the empirical distribution computed here converges to the
"true" distribution function as n grows large (the rate of convergence is O(\sqrt(n)) , so it is kind of
slow).
The distribution function in turn induces an empirical probability measure for all sets within the domain i.e.
P_n(a \le x \le b) = \frac{\text{samples }X \vert a \le X \le b}{\text{total samples} }
.
Sanov's theorem tells us that we can now use a reference measure \mu and create a rate function
I(\mu) = \mu \sum_{\mathbb{R}} \log(\frac{P_n}{\mu}
where
\begin{eqnarray}
\lim_{n \rightarrow \infty} \frac{1}{n} P(P_n \in C) \le - \min_{P_n \in C} I(P_n) \\
C = \mu: \sum_{k=1}^n P_k(x_k) \ge c
\end{eqnarray}
Since there may be multiple possible reference measures, we limit \mu to those measures which will have
make the event true. This is the set C in the equation above.
Effective bandwidth
Let us now look at an application of rare events for communication network design; the concept of effective
bandwidths. We shall define and derive the effective bandwidth for the simplest case, following the work of
[Kelly]. Much work has been done on effective bandwidths, the interested reader is invited to see [Walrand],
[Connell] and others.
Problem definition
Consider a router having buffer space B (in bytes/packets, whatever). When there is data in the buffer, the
router forwards that data using a fixed transmission rate C. Source traffic enters the buffer such that the
number of packets generated in the time interval t is given by a(t) . The size of the queue is given by
q(t) and it is clear that q(t+1) = \max(q(t) + a(t) - c),0) .
Treating q(t) as a random variable, we can find the rate function I(c) using the general form of
Chernoff's theorem discussed above such that Pr\{q \ge x\} \equiv e^{-\theta x} where
M(\theta) = c \theta . The term c(\alpha) = \frac{M(\alpha)} {\alpha} is called the effective
bandwidth.
Significance of the effective bandwidth
To understand the significance of the effective bandwidth, let us assume that a packet is dropped everytime the
queue size is greater than the buffer (this is not strictly true; the difference between q(t) and B
may be larger than one; a more exact expression can be worked out separately) and rewrite the first equation as
-\log\{PER\} = \frac{M(\alpha) B }{ c \alpha }
.
This expression gives a relationship between the log of the packet drop rate (PER), the buffer size available (B)
, the transmission rate c and the function M(a) . From the expression given above, the function M(a)
is actually a measure of the disorderliness or entropy of the arrival function, compared to the clearing rate
c . In other words, the effective bandwidth gives us a relationship between the likely packet drop rate
that a packet stream experiences given the buffer available, the clearing rate c and the statistical
'disorderliness' of the packet stream. The larger the buffer, the more the ability to handle disorderliness
i.e. lower packet drop rate. The more the disorderliness, the easier it is to drop packets, because of buffer overflow.
An empirical formula for effective bandwidth
Since it is not always possible to compute the entropy of a packet stream, there are empirical ways to compute
the entropy of a packet stream. The simplest method known to the author is the index of dispersion metric,
discussed in [Courcoubetis]. Ilka Norros gives a more sophisticated formulation using the Hurst parameter in
[Norros]
Uses of the effective bandwidth formula
The effective bandwidth formula can be used in multiple areas of traffic management and flow control. Some
problems can be created by the independence assumption - there is a large body of work which indicates that even
aggregated Internet traffic is not independent, but heavily correlated. Replacing the independence assumption
by using the Gartner Ellis theorem has been discussed in [Connell], [Chang] and others. We shall outline some
of the uses of the effective bandwidth formula; these will be explained in more detail in a future article.
- Bandwidth allocation for incoming traffic sources. The general theory can be expanded to handle bandwidth
allocation at an aggregate level, and further into allocation of bandwidth for multi-class traffic sources,
offering different QoS models. See [Walrand] and others. Note that the bandwidth allocation may be both driven
by factors independent of the incoming source i.e. fixed allocations, opportunistic scheduling, etc. and may
also be matched to the incoming data. The effective bandwidth theory can be used for optimal matching of
allocation patterns.
- Partitioning of buffers. The concept is similar to the one above, but this relates more to input regulation -
for example appropriate settings for the Random Early Dropping protocol.
- Cascaded, multiplexed and demultiplexed flows. The statistical description of individual flows can be combined
to predict the statistical description of the same flows at an aggregate level as well as the behaviour of the
flows after passing through a router. This is of use in the analysis of multi-stage networks.
References
[Weiss] Alan Weiss, "An Introduction to Large Deviations for communication networks", IEEE Journal on Selected
Areas of Communication, 1995
[Stanford] Course material at Stanford University http://www-stat.stanford.edu/~amir/large-deviations
[Umich]Tutorial at University of Michigan, http://www.cscs.umich.edu/~crshalizi/notebooks/large-deviations.html
[Kelly] F.P.Kelly, "Effective Bandwidths at Multi-class queues"
[Walrand] Kesidis, Walrand, Chang, "Effective bandwidths for multiclass markov fluids and other atm sources"
IEEE JSAC, August 1995
[Connell] N. Connell, "Large Deviations in Queueing Networks", April 1995
[Chang] C.S.Chang, "Effective bandwidth in high-speed digital networks" IEEE JSAC, August 1995
[Courcoubetis] C. Courcoubetis, J. Fouskas, Weber, "On the Performance of an Effective Bandwidths Formula",
Technical Report, 1997
[Norros] I.Norros, "On the use of Fractional Brownian Motion in the theory of connectionless networks",
IEEE Journal on Selected Areas of Communication, August 1995
Maintained by: abheek.saha@hsc.com
Comments