Statistical Methods
A quantitative variable is a variable that measures a numerical property of a given phenomenon. It can be either
discrete or continuous. For example, the number of children in a family is a discrete quantitative variable.
A categorical variable is a variable that places each observation into one of a set of types or ranges. The
categories may be ordered or unordered. For example, blood type is an unordered categorical variable.
A quantitative variable can be converted to a categorical variable by dividing its range into intervals and
assigning a category to each interval.
Statistical Measures
The mean is the average of the values in a sample. The standard deviation (the square root of the variance) is a
measure of the dispersion of the data around the mean. It is computed as
SD = \sqrt{\frac{\sum_n (X_n - \bar{X})^2}{N-1}}
where \bar{X} is the sample mean and N is the number of values. The
standard deviation is a valid measure of the dispersion of the data around the mean if the original data
distribution is Gaussian or close to Gaussian. For such a variable, the mean +/- two times the SD covers about
95% of the distribution and the mean +/- three times the SD covers 99.73%. The table below gives a set of
confidence intervals for given multiples of the standard deviation.
| Multiple of standard deviation | Confidence interval |
| 0.674 | 50% |
| 0.7 | 52% |
| 0.8 | 58% |
| 0.9 | 63% |
| 1.0 | 68% |
| 1.2 | 77% |
| 1.4 | 84% |
| 1.5 | 87% |
| 1.645 | 90% |
| 1.8 | 92.8% |
| 1.9 | 94.3% |
| 1.96 | 95% |
| 2.2 | 97.2% |
| 2.4 | 98.4% |
| 3.0 | 99.7% |
| 3.291 | 99.9% |
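The entries in the table above can be reproduced from the standard normal distribution. The following is a minimal sketch using only the Python standard library; the function names `coverage` and `multiple_for` are illustrative and not part of the original text.

```python
from statistics import NormalDist


def coverage(multiple: float) -> float:
    """Fraction of a normal distribution lying within +/- `multiple` SDs of the mean."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.cdf(multiple) - std_normal.cdf(-multiple)


def multiple_for(confidence: float) -> float:
    """Multiple of the SD that encloses the central `confidence` fraction."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


if __name__ == "__main__":
    for k in (0.674, 1.0, 1.645, 1.96, 3.0):
        print(f"{k:5.3f} SD -> {100 * coverage(k):.1f}%")
    print(f"95% needs {multiple_for(0.95):.3f} SD")
```

The second function goes the other way: given a desired confidence level, it returns the multiple of the standard deviation to use.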
The median is the point in the range of the random variable X such that half of the measurements lie above it and half lie below. The median is a much more robust parameter than the mean, because it is not affected by extremely anomalous readings. For example, if one reading was measured erroneously as 1800 instead of 1.800, it will affect the mean badly, but not the median. Other similar measures are the first-quartile and third-quartile points etc.
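The robustness of the median can be seen in a small sketch. The readings below are hypothetical values invented for illustration; one of them is then replaced by the erroneous 1800 mentioned in the text.

```python
from statistics import mean, median

# Hypothetical readings, all close to 1.8
readings = [1.795, 1.800, 1.802, 1.798, 1.805]

# The same readings with 1.800 mis-recorded as 1800
corrupted = readings.copy()
corrupted[1] = 1800.0

if __name__ == "__main__":
    print("clean:    ", mean(readings), median(readings))
    print("corrupted:", mean(corrupted), median(corrupted))
```

The mean of the corrupted list moves by several hundred, while the median only shifts to the neighbouring reading.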
Populations and Samples
A sample is a set of measurements made from the process under observation. The number of points in the sample is given by Ns. A sample is taken in order to estimate various characteristics of the process in question.
A sample is unbiased if it is representative of the overall process. A sample is said to be precise if it is repeatable.
Note that since a sample is made randomly from the process under observation, there is going to be some mismatch between the measurements made on the sample and the measurements for the actual process. Let us consider the mean as computed from the sample of size Ns. The standard deviation of the sample is given as SDs. The standard error of the mean of a sample is computed as
SEM = \frac{SD_s}{\sqrt{N_s}}
. What does the standard error mean? It means that there is roughly a 95% chance
that the actual mean of the process under observation is within mean +/- 2*SEM. In other words, it is an estimate
of the accuracy of the sampling process and of the results obtained from it.
The above can be summarized as follows: We have a sample of some size and we obtain an estimate of the mean of the
actual process by taking the mean of the sample. The standard error of the mean is obtained from the measured
standard deviation of the sample. It provides a range around the measured mean within which the actual mean would
lie with some degree of probability.
The mean plus/minus 1.96 times its SEM is the 95% confidence interval. There is a 95% confidence that the actual
mean lies within this range. Similarly we can derive confidence intervals for any desired degree of confidence,
P, using the table of multiples of the standard deviation above. Note that the confidence intervals tighten as
the sample size goes up, because the width of the confidence interval is a multiple of the SEM, which itself is
inversely proportional to the square root of the sample size.
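As an illustration, the standard error of the mean and the resulting confidence interval can be computed as in the sketch below. The function name `mean_confidence_interval` and the sample values are hypothetical; the default multiple of 1.96 assumes a large sample from a roughly normal process.

```python
from math import sqrt
from statistics import mean, stdev


def mean_confidence_interval(sample, multiple=1.96):
    """Return (low, high) = sample mean -/+ multiple * SEM."""
    n = len(sample)
    m = mean(sample)
    sem = stdev(sample) / sqrt(n)  # stdev uses the N-1 divisor
    return m - multiple * sem, m + multiple * sem


if __name__ == "__main__":
    sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
    low, high = mean_confidence_interval(sample)
    print(f"95% confidence interval for the mean: {low:.3f} to {high:.3f}")
```

Passing a different `multiple` gives the interval for another confidence level.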
Testing the null hypothesis
What is the null hypothesis
Suppose we take two samples. Can we say that the two samples are representative of the same process? The hypothesis that they are is known as the null hypothesis.
The standard error of the difference between the means of two samples with sample sizes N1 and N2 and standard deviations SD1 and SD2 is given by:
SE_{diff} = \sqrt{\frac{{SD_1}^2}{N_1} + \frac{{SD_2}^2}{N_2}}
The standard error of the difference is interpreted as follows. Suppose the two samples have means M1 and M2, with
a difference of k. Even though the hypothesis is that the two samples are measuring the same process, due to
chance k will be non-zero. The question is whether this value is significant. The important ratio here is that
of k with respect to the standard error of the difference. If k is greater than 1.96*SEdiff, a difference this
large would arise by chance less than 5% of the time if the two samples were indeed from the same process, so the
null hypothesis is rejected at the 5% level. Similar limits may be set for different confidence levels.
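The comparison can be sketched as follows. The function name `difference_in_se_units` and the sample data are hypothetical, and the 1.96 threshold assumes samples large enough for the normal approximation to hold.

```python
from math import sqrt
from statistics import mean, stdev


def difference_in_se_units(sample1, sample2):
    """Ratio of the difference in means, k, to its standard error SEdiff."""
    se_diff = sqrt(stdev(sample1) ** 2 / len(sample1)
                   + stdev(sample2) ** 2 / len(sample2))
    k = abs(mean(sample1) - mean(sample2))
    return k / se_diff


if __name__ == "__main__":
    a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0]  # hypothetical readings
    b = [5.4, 5.6, 5.3, 5.7, 5.5, 5.2, 5.6, 5.4]
    ratio = difference_in_se_units(a, b)
    verdict = "significant" if ratio > 1.96 else "not significant"
    print(f"k / SEdiff = {ratio:.2f} ({verdict} at the 5% level)")
```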
Type I and Type II errors
The error of assuming two samples to be from different processes when in fact they are from the same process is
called a Type I error. The opposite error, assuming two samples to be from the same process when in fact they are
from different processes, is known as a Type II error.
Percentages and Paired Alternatives
We look now at a study where we are trying to establish the percentage of a population that has a particular
attribute by doing random sampling. For example, let us say we are trying to measure the percentage of males in
the population of India. By taking a random sample we measure a percentage p. The standard error of the
percentage is given by
SE(p) = \sqrt{\frac{p (100 - p) }{N_s}}
where p is measured as a percentage. From this we can get the 95% confidence interval as we have shown before
i.e. the 95% confidence interval is from (p - 1.96*SE(p))% to (p + 1.96*SE(p))%.
Similarly, the standard error of the difference between percentages, used to test the null hypothesis for two
random samples which yield two percentages p1 and p2, is given by:
SE_{diff} = \sqrt{\frac{p_1(100 - p_1)}{N_1} + \frac{p_2(100 - p_2)}{N_2}}
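The two formulas for percentages translate directly into code. This is a minimal sketch; the function names and the survey figures in the usage section are hypothetical.

```python
from math import sqrt


def se_percentage(p, n):
    """Standard error of a percentage p (0 to 100) measured on a sample of size n."""
    return sqrt(p * (100 - p) / n)


def percentage_interval(p, n, multiple=1.96):
    """Confidence interval for a percentage; 1.96 gives the 95% interval."""
    se = se_percentage(p, n)
    return p - multiple * se, p + multiple * se


def se_diff_percentages(p1, n1, p2, n2):
    """Standard error of the difference between two sampled percentages."""
    return sqrt(p1 * (100 - p1) / n1 + p2 * (100 - p2) / n2)


if __name__ == "__main__":
    # Hypothetical surveys: 52% of 1000 people in one, 48% of 800 in the other
    print(percentage_interval(52, 1000))
    diff = abs(52 - 48)
    print(diff / se_diff_percentages(52, 1000, 48, 800))
```

As with means, a difference larger than about 1.96 times its standard error is significant at the 5% level.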
Student's T test
Student's t test is used to test the null hypothesis that the measured mean of a sample is equal to the real
mean of a population, when the sample size is small i.e. less than 50. In the section on testing the null
hypothesis above, we evaluate the null hypothesis by comparing the difference between the two quantities against
the standard error. The problem with a small sample size is that the error in estimating the standard error is
higher, which means that the uncertainty in evaluating the null hypothesis is also higher. Gosset studied this
problem and came up with a distribution called the Student's distribution (he used Student as a pseudonym), which
is used to make the null-hypothesis evaluation when the sample size is small. The Student's distribution is used
to establish the multiple of the standard error above which a measured difference is significant. We know that
for large sample sizes, the 95% confidence interval around the mean is given by 1.96 times the standard error.
For small sample sizes, the figure 1.96 is replaced by a corresponding figure from Student's table which is
dependent on the sample size. The Student's table is given here. Note that it is
indexed by the degrees of freedom. For a case where we are comparing the output of a single survey against an
established mean, the degrees of freedom is one less than the number of observations in the sample. When we are
comparing two surveys, the overall degrees of freedom is the sum of the degrees of freedom of the two surveys.
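A one-sample version of the test can be sketched as below. The function name `one_sample_t`, the sample and the reference mean are hypothetical. The small dictionary holds a few two-sided 5% critical values copied from a standard Student's t table; a full table would be needed for other degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 5% critical values of Student's t for a few degrees of freedom
T_CRIT_5PCT = {4: 2.776, 7: 2.365, 9: 2.262, 19: 2.093, 29: 2.045}


def one_sample_t(sample, reference_mean):
    """Return (t, degrees of freedom) for a sample against an established mean."""
    n = len(sample)
    sem = stdev(sample) / sqrt(n)
    return (mean(sample) - reference_mean) / sem, n - 1


if __name__ == "__main__":
    sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
    t, dof = one_sample_t(sample, reference_mean=4.0)
    significant = abs(t) > T_CRIT_5PCT[dof]
    print(f"t = {t:.3f} with {dof} degrees of freedom, significant: {significant}")
```

Note how the threshold for 7 degrees of freedom, 2.365, is noticeably larger than the large-sample figure of 1.96.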
Chi-squared tests
Up to now, we have been discussing tests for quantitative variables. The Chi-squared test (pronounced with a hard 'ch', like the 'k' in kite) is used to test hypotheses about categorical variables. See the first section for a definition of a categorical variable, if you have forgotten.
Suppose we have a team with 50 members. The director produces a compensation plan for the team, in which the
team is broken up into 4 categories, CA,CB,CC and CD (in increasing order) according to their compensation
packages. Then the senior technical member of the team is asked to classify the members of the team into four
categories, A,B,C,D (in increasing order) according to their value to the team. Obviously there will be
differences in the number of people in the categories in the two measures. The question that the chi-squared
test wishes to answer is : is the difference significant? If it is, there is obviously some discrepancy between
the perception of the senior management (who set the compensation packages) and the perception of the technical
head.
The technique for conducting a chi-squared test is pretty simple.
- Write the counts for each category in a table. For the situation above, we can write up a table such as given
below
| Category | Compensation | Value |
| A | 4 | 10 |
| B | 26 | 17 |
| C | 14 | 15 |
| D | 6 | 8 |
- Construct a null hypothesis out of the table. For example, we could hypothesise that the differences between
the two columns are not statistically significant. The null hypothesis then becomes that the proportions in each
category for the two sets of observations are really the same. The table then gets converted as follows:
| Category | Compensation | Value | Totals | Proportion of first col to sum |
| A | 4 | 10 | 14 | 0.2857 |
| B | 26 | 17 | 43 | 0.6047 |
| C | 14 | 15 | 29 | 0.4828 |
| D | 6 | 8 | 14 | 0.4286 |
- Based on the above table, we compute the expected number for each cell. The expected number is the reading
that we would expect if the null hypothesis is true. Here the null hypothesis is that the proportions in each
row are the same for both columns (Compensation and Value). Since both columns have the same overall total of
50, each row total from the Totals column is split equally between them. Thus for row B, the total is 43, so both
columns should have the figure 21.5 in row B. The last pair of columns holds (O-E)^2/E, the squared difference
divided by the expected value. After this computation, the table looks like the following
| Category | Measured values | | Expected values | | O-E | | (O-E)^2/E | |
| | Comp | Value | Comp | Value | Comp | Value | Comp | Value |
| A | 4 | 10 | 7 | 7 | -3 | 3 | 1.286 | 1.286 |
| B | 26 | 17 | 21.5 | 21.5 | 4.5 | -4.5 | 0.942 | 0.942 |
| C | 14 | 15 | 14.5 | 14.5 | -0.5 | 0.5 | 0.017 | 0.017 |
| D | 6 | 8 | 7 | 7 | -1 | 1 | 0.143 | 0.143 |
| Totals | | | | | | | 2.388 | 2.388 |
- The result is the sum of the two column totals, which is approximately 4.78. Note that the computation in this
case is simplified since the sample sizes are the same.
- In general, having computed the expected number for each cell, the chi-squared value is given by
\chi^2 = \sum_{\text{all cells}}\frac{(O_i - E_i)^2}{E_i}
- Having computed the chi-squared value, we can match it against the standard chi-table and, given the degrees
of freedom, determine the significance level. The computation of degrees of freedom is a little tricky.
A standard rule of thumb is that it is given as (num columns - 1)*(num rows - 1). The theory is that the number
of degrees of freedom is the number of entries that must be provided in addition to all the totals (row wise and
column wise) in order for someone to complete the table. In our case, with two columns and four rows, it is 3.
The standard chi-table is given here
Entering the table at three degrees of freedom, we can see that our value of chi-squared corresponds to a
probability between 0.5 and 0.1. This means that, if the null hypothesis were true, discrepancies at least this
large would arise by chance in more than 10% of cases. This is well above the conventional 5% threshold, so the
differences between the perception of the senior management and that of the technical lead are not statistically
significant. The null hypothesis is not disproved.
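The steps above can be sketched in a few lines of code. The function name `chi_squared` is illustrative; the counts are those of the compensation example, and 7.815 is the 5% critical value for three degrees of freedom from a standard chi-squared table.

```python
# Observed counts for categories A, B, C and D
compensation = [4, 26, 14, 6]
value = [10, 17, 15, 8]

CRITICAL_5PCT_3_DOF = 7.815  # from a standard chi-squared table


def chi_squared(columns):
    """Return (chi-squared, degrees of freedom) for a table given as a list of columns."""
    row_totals = [sum(row) for row in zip(*columns)]
    col_totals = [sum(col) for col in columns]
    grand_total = sum(col_totals)
    chi2 = 0.0
    for col, col_total in zip(columns, col_totals):
        for observed, row_total in zip(col, row_totals):
            # Expected count if the row proportions were the same in every column
            expected = row_total * col_total / grand_total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(columns) - 1) * (len(row_totals) - 1)
    return chi2, dof


if __name__ == "__main__":
    chi2, dof = chi_squared([compensation, value])
    print(f"chi-squared = {chi2:.3f} with {dof} degrees of freedom")
    print("significant at 5%:", chi2 > CRITICAL_5PCT_3_DOF)
```

Computing the expected counts from both row and column totals means the same function also works when the two columns do not have equal totals.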
For a two-by-two table i.e. two rows of two columns, with entries a and b in the first row and c and d in the
second row, the chi-squared value can be computed directly as follows:
\chi^2 = \frac{(ad - bc)^2 (a+b+c+d) } { (a+b)(c+d)(b+d)(a+c) }
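As a check, the shortcut can be compared with the general sum of (O-E)^2/E over the four cells. This is a sketch with hypothetical function names and made-up counts; a, b are taken as the first row and c, d as the second.

```python
def chi_squared_2x2(a, b, c, d):
    """Shortcut formula for a two-by-two table with rows (a, b) and (c, d)."""
    n = a + b + c + d
    return (a * d - b * c) ** 2 * n / ((a + b) * (c + d) * (b + d) * (a + c))


def chi_squared_cellwise(a, b, c, d):
    """The same quantity computed cell by cell as the sum of (O - E)^2 / E."""
    n = a + b + c + d
    # Each tuple is (observed, row total, column total)
    cells = [(a, a + b, a + c), (b, a + b, b + d),
             (c, c + d, a + c), (d, c + d, b + d)]
    total = 0.0
    for observed, row_total, col_total in cells:
        expected = row_total * col_total / n
        total += (observed - expected) ** 2 / expected
    return total


if __name__ == "__main__":
    print(chi_squared_2x2(10, 20, 30, 40))
    print(chi_squared_cellwise(10, 20, 30, 40))
```

Both functions should print the same value up to rounding, with one degree of freedom.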
Measuring skew - non parametric tests
Up to now we have considered surveys and samples which are assumed to have a normal distribution. However, we may have to execute similar tests, especially null-hypothesis tests, on distributions that cannot safely be assumed to be normal.
Wilcoxon signed rank sum test
In this section we discuss the Wilcoxon signed rank sum test, which can be used to evaluate null hypotheses on data that are not normally distributed.
The first test we describe is used to evaluate a null hypothesis regarding the differences between two sets of data, one taken from a control group and the other taken from the experimental group. Alternatively, it could be the same group, with measurements taken in a 'normal' state and subsequently in the 'experimental' state. It is important that the pairs be taken so that each is independent of the other pairs and that the distribution of the differences, even if not normal, be symmetric. The steps are then as follows:
- Compute the difference in readings for each pair, keeping a note of its sign.
- Rank all differences in increasing order of magnitude, ignoring the sign. If the absolute values of two or more differences are the same, give them the same rank by averaging over the individual ranks i.e. if there are three equal readings and they fall in ranks 3, 4 and 5, set them all to (3+4+5)/3 = 4.
- Now, sign each rank, by taking the sign of the corresponding difference.
- Add up the positive ranks and the negative ranks separately and then take the smaller of the two sums, ignoring its sign. Count the number of pairs and consult the table given below. If the smaller sum is less than or equal to the tabulated value, the difference between the two sets is significant at that level. For example, if the smaller rank sum is 8 for 12 pairs, the difference is significant at the 5% level (8 is below 14), but not at the 1% level (8 is above 7).
| Number of pairs | Critical value at 5% level | Critical value at 1% level |
| 7 | 2 | 0 |
| 8 | 2 | 0 |
| 9 | 6 | 2 |
| 10 | 8 | 3 |
| 11 | 11 | 5 |
| 12 | 14 | 7 |
| 13 | 17 | 10 |
| 14 | 21 | 13 |
| 15 | 25 | 16 |
| 16 | 30 | 19 |
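The ranking procedure can be sketched as follows. The function name `signed_rank_statistic` and the paired readings are hypothetical. Pairs with a zero difference are discarded before ranking, which is a common convention but an assumption here, since the text does not say how to treat them. The critical values are copied from the table above.

```python
# Number of pairs -> (critical value at 5% level, critical value at 1% level)
CRITICAL = {7: (2, 0), 8: (2, 0), 9: (6, 2), 10: (8, 3), 11: (11, 5),
            12: (14, 7), 13: (17, 10), 14: (21, 13), 15: (25, 16), 16: (30, 19)}


def signed_rank_statistic(before, after):
    """Return (smaller rank sum, number of non-zero differences)."""
    # Assumption: zero differences are dropped
    diffs = sorted((a - b for b, a in zip(before, after) if a != b), key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        # Extend j over a run of differences with the same magnitude
        while j + 1 < len(diffs) and abs(diffs[j + 1]) == abs(diffs[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    positive = sum(r for d, r in zip(diffs, ranks) if d > 0)
    negative = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(positive, negative), len(diffs)


if __name__ == "__main__":
    before = [10, 12, 14, 16, 18, 20, 22]  # hypothetical paired readings
    after = [11, 14, 13, 20, 23, 26, 29]
    smaller_sum, pairs = signed_rank_statistic(before, after)
    at_5, at_1 = CRITICAL[pairs]
    print(smaller_sum, pairs, smaller_sum <= at_5, smaller_sum <= at_1)
```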
The second test is used when the distribution for the study sample is non-symmetric. The technique is as follows:
- We pool the results from both sets, sort them in increasing order and rank them as described above.
- Then we add up the ranks for the two sets independently and choose the smaller of the two sums.
- We consult the table here, taking the entry at the row and
column for the number of rows in the study. Here it is assumed that the number of entries in both studies is
equal. The table gives the appropriate significance level.
Mann and Whitney Test
If the numbers of entries in the two sets of data, n1 and n2, are not equal, we use the Mann and Whitney U test
as follows. All readings are pooled and ranked as before, and T1 is set to the sum of the ranks of the smaller
sample, whose size is n1.
If there are few tied readings, we can compute z as
z = \frac{\vert T_1 - n_1(n_1 + n_2 + 1)/2 \vert}{\sqrt{n_1 n_2 (n_1 + n_2 + 1) /12}}
z is approximately normally distributed and we can use the table of multiples of the standard deviation in the
Statistical Measures section to arrive at the appropriate confidence level.
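The computation of z can be sketched as below. The helper names `average_ranks` and `rank_sum_z` and the data are hypothetical. The samples in the usage section are far too small for the normal approximation to be trusted; they only show the mechanics.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_z(sample_a, sample_b):
    """z statistic from the rank sum T1 of the smaller sample."""
    small, large = sorted((list(sample_a), list(sample_b)), key=len)
    n1, n2 = len(small), len(large)
    ranks = average_ranks(small + large)
    t1 = sum(ranks[:n1])  # ranks of the smaller sample come first
    expected = n1 * (n1 + n2 + 1) / 2
    spread = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return abs(t1 - expected) / spread


if __name__ == "__main__":
    print(rank_sum_z([4, 5, 6, 7], [1, 2, 3]))
```

A value of z above 1.96 corresponds to significance at the 5% level.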
Correlation and Regression
Correlation
Correlation is a measure of linear association. If we have two sets of data, the correlation coefficient
(or Pearson's correlation coefficient) takes the value 1.0 (the maximum) if an increase in one set is always
matched by a proportional increase in the other. It takes the value -1.0 if the correlation is perfectly
negative, i.e. an increase in one set is always matched by a proportional decrease in the other.
A good way to get a visual indication of a correlation is a scatter diagram. This consists of a graph, where
the X-axis and Y-axis represent the ranges of the first and second sets of data respectively. The points in the
graph correspond to each pair of data points for the two sets. A scatter diagram with points scattered all over
indicates low correlation. High correlation is shown by points lining up close to a straight line, rising for a
positive correlation and falling for a negative one.
The correlation coefficient for two sets of data xi and yi of sample size n is given by
r = \frac{\sum_i xy -n Avg(X) Avg(Y)} {(n-1) SD(X) SD(Y) }
To test whether the correlation is significant one uses the Student's t test described above. The value of t is
computed as
t = r \sqrt{\frac{n-2}{1 - r^2}}
The number of degrees of freedom is given by n-2.
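The correlation coefficient and its t value can be computed as in the following sketch. The function names and the data are hypothetical; `t_for_r` is undefined for a perfect correlation (r equal to 1 or -1) because of the division by 1 - r^2.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data xs, ys."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (sum_xy - n * mean(xs) * mean(ys)) / ((n - 1) * stdev(xs) * stdev(ys))


def t_for_r(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5]  # hypothetical paired readings
    ys = [2, 4, 5, 4, 5]
    r = pearson_r(xs, ys)
    print(f"r = {r:.4f}, t = {t_for_r(r, len(xs)):.4f}, dof = {len(xs) - 2}")
```

The resulting t is then compared with the entry of Student's table for n - 2 degrees of freedom.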
Pearson's correlation coefficient assumes a normal distribution of the two data sets. Spearman's rank
correlation can be used to compute the correlation coefficient of non-normal data sets. The technique, as before,
is to rank the readings in each set separately and then compute d_i, the difference between the two ranks of
each pair. The correlation coefficient is measured as
r_s = 1 - \frac{6 \sum_i {d_i}^2}{n(n^2 - 1)}
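A sketch of the rank correlation is given below, using the standard form r_s = 1 - 6*sum(d_i^2)/(n(n^2 - 1)). The function names and data are hypothetical, and the simple ranking helper assumes there are no tied readings within either set.

```python
def simple_ranks(values):
    """Rank values from 1 upwards. Assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def spearman_rs(xs, ys):
    """Spearman's rank correlation coefficient for paired data without ties."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2
                    for rx, ry in zip(simple_ranks(xs), simple_ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


if __name__ == "__main__":
    xs = [10, 20, 30, 40, 50]  # hypothetical paired readings
    ys = [1, 3, 2, 5, 4]
    print(spearman_rs(xs, ys))
```

Identical rankings give 1.0 and exactly reversed rankings give -1.0, matching the range of Pearson's coefficient.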
Regression
Whereas correlation is a symmetrical measure i.e. the correlation of X to Y is the same as the correlation of Y to X, regression attempts to describe one set of data points, the dependent variable Y, as a linear function of the other set of data points X, which are taken to be independent. The function takes the form Y = \alpha + \beta X
It can be shown that
\beta = \frac{ \sum_i xy -n Avg(X) Avg(Y) } { (n-1){SD(X)}^2 }
and
\alpha = Avg(Y) - \beta Avg(X).
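The residual standard deviation of Y about the fitted line is given by
s_{res} = \sqrt{{SD(Y)}^2(1-r^2)\frac{n-1}{n-2}}
and the standard error of the slope coefficient beta is obtained from it as
SE(\beta) = \frac{s_{res}}{SD(X)\sqrt{n-1}}
. We
can use this to estimate the confidence interval for the slope in the same way as for the mean in the section on Populations and Samples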
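The regression coefficients can be computed as in the sketch below. The function name `fit_line` and the data are hypothetical. The intercept is taken as Avg(Y) - beta*Avg(X), and the slope's standard error is taken as the residual standard deviation divided by SD(X)*sqrt(n-1), which is the usual textbook form and an assumption as far as this page is concerned.

```python
from math import sqrt
from statistics import mean, stdev


def fit_line(xs, ys):
    """Return (alpha, beta, standard error of beta) for the line Y = alpha + beta * X."""
    n = len(xs)
    sd_x, sd_y = stdev(xs), stdev(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    beta = (sum_xy - n * mean(xs) * mean(ys)) / ((n - 1) * sd_x ** 2)
    alpha = mean(ys) - beta * mean(xs)
    r = beta * sd_x / sd_y  # correlation coefficient implied by the slope
    residual_sd = sqrt(sd_y ** 2 * (1 - r ** 2) * (n - 1) / (n - 2))
    se_beta = residual_sd / (sd_x * sqrt(n - 1))
    return alpha, beta, se_beta


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5]  # hypothetical paired readings
    ys = [2, 4, 5, 4, 5]
    alpha, beta, se_beta = fit_line(xs, ys)
    print(f"Y = {alpha:.3f} + {beta:.3f} X, SE(beta) = {se_beta:.4f}")
```

A confidence interval for the slope is then beta plus/minus the appropriate multiple of SE(beta), taken from Student's table with n - 2 degrees of freedom for small samples.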
References
Everything in this tutorial can be found in any basic book on statistics. One suggested reading is
"Basic Econometrics", by Damodar N. Gujarati
Maintainer: abheek.saha@hsc.com