HSCTechnicalWiki


Statistical Methods

A quantitative variable measures a numerical property of a given phenomenon. It can be either discrete or continuous; for example, the number of children in a family is a discrete quantitative variable. A categorical variable assigns each observation to one of a set of types or ranges. The categories may be ordered or unordered; blood types, for example, form an unordered categorical variable. A quantitative variable can be converted to a categorical one by dividing its range into intervals and assigning a category to each interval.

Statistical Measures

The mean is the average of a number of samples. The standard deviation (the square root of the variance) is a measure of the dispersion about the mean. It is computed as

SD = \sqrt{\frac{\sum_n (X_n - \bar{X})^2}{N-1}}

where \bar{X} is the mean and N is the number of samples. The standard deviation is a valid measure of the dispersion of the data around the mean if the original data distribution is Gaussian or close to it, as is often the case for quantities that are sums or averages of many independent effects (the central limit theorem). For such a variable, the mean +/- two times the SD covers about 95% of the distribution and the mean +/- three times the SD covers 99.73%. The table below (Table 1) gives the coverage for given multiples of the standard deviation.

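| Multiple of standard deviation | Confidence interval |
|---|---|
| 0.674 | 50% |
| 0.7 | 52% |
| 0.8 | 58% |
| 0.9 | 63% |
| 1.0 | 68% |
| 1.2 | 77% |
| 1.4 | 84% |
| 1.5 | 87% |
| 1.645 | 90% |
| 1.8 | 92.8% |
| 1.9 | 94.3% |
| 1.96 | 95% |
| 2.2 | 97.2% |
| 2.4 | 98.4% |
| 3.0 | 99.7% |
| 3.291 | 99.9% |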
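
The coverage figures for a Gaussian variable can be reproduced from the error function: the fraction of the distribution within k standard deviations of the mean is erf(k/sqrt(2)). The following is a minimal sketch using only the Python standard library; the readings are invented purely for illustration.

```python
import math
import statistics

def coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

# Hypothetical readings, used only to show the calculation.
readings = [4.1, 3.9, 4.4, 4.0, 3.6, 4.3, 4.2, 3.8]
mean = statistics.mean(readings)
sd = statistics.stdev(readings)  # sample SD, divides by N - 1

print(f"mean = {mean:.3f}, SD = {sd:.3f}")
for k in (0.674, 1.0, 1.2, 1.645, 1.96, 3.0):
    print(f"mean +/- {k} SD covers {100 * coverage(k):.1f}%")
```
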

The median is the point in the range of the random variable X such that half of the measurements lie above it and half lie below. The median is a much more robust parameter than the mean, because it is hardly affected by extremely anomalous readings. For example, if one reading was recorded erroneously as 1800 instead of 1.800, the mean is badly affected but the median is not. Similar measures are the first-quartile and third-quartile points, below which a quarter and three quarters of the measurements lie.
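
The robustness of the median can be seen in a small sketch. The readings below are hypothetical; the second list repeats the first with one value mis-recorded as 1800 instead of 1.800.

```python
import statistics

good = [1.795, 1.800, 1.802, 1.798, 1.805]
bad = [1.795, 1800, 1.802, 1.798, 1.805]  # one reading mis-recorded

mean_good, median_good = statistics.mean(good), statistics.median(good)
mean_bad, median_bad = statistics.mean(bad), statistics.median(bad)

# The mean jumps by orders of magnitude; the median barely moves.
print(f"mean:   {mean_good:.3f} -> {mean_bad:.3f}")
print(f"median: {median_good:.3f} -> {median_bad:.3f}")
```
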

Populations and Samples

A sample is a set of measurements made on the process under observation. The number of points in the sample is denoted Ns. A sample is taken in order to estimate various characteristics of the process in question. A sample is unbiased if it is representative of the overall process, and precise if it is repeatable. Since a sample is drawn randomly from the process under observation, there will be some mismatch between the measurements made on the sample and the true values for the process. Consider the mean computed from a sample of size Ns, whose standard deviation is SDs. The standard error of the mean of the sample is computed as

SEM = \frac{SD_s}{\sqrt{N_s}}

What does the standard error mean? It is an estimate of the accuracy of the sampling process and of the results obtained from it: there is roughly a 95% chance that the actual mean of the process under observation lies within the sample mean +/- 2*SEM.

To summarize: we have a sample of some size, and we estimate the mean of the actual process by taking the mean of the sample. The standard error of the mean is obtained from the measured standard deviation of the sample. It provides a range around the measured mean within which the actual mean lies with some degree of probability. The mean plus/minus 1.96 times its SEM is the 95% confidence interval: there is 95% confidence that the actual mean lies within this range. Similarly, we can derive confidence intervals for any desired degree of confidence, P, using Table 1 above. Note that the confidence intervals tighten as the sample size goes up, because the width of the confidence interval is a multiple of the SEM, which is itself inversely proportional to the square root of the sample size.
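
As a rough illustration of the standard error and the 95% confidence interval, the sketch below computes both for a hypothetical set of 100 readings, and for the first 25 of them to show how the interval tightens with sample size. The data and the function name `mean_with_interval` are invented for this example.

```python
import math
import statistics

def mean_with_interval(sample, multiple=1.96):
    """Return (mean, SEM, lower, upper) for the given multiple of the SEM."""
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, sem, mean - multiple * sem, mean + multiple * sem

# Hypothetical readings scattered around 50 (deterministic, for repeatability).
sample = [50 + ((i * 37) % 21 - 10) * 0.5 for i in range(100)]

mean_first, sem_first, low_first, high_first = mean_with_interval(sample[:25])
mean_all, sem_all, low_all, high_all = mean_with_interval(sample)

print(f"N=25:  mean {mean_first:.2f}, SEM {sem_first:.3f}, 95% CI {low_first:.2f} to {high_first:.2f}")
print(f"N=100: mean {mean_all:.2f}, SEM {sem_all:.3f}, 95% CI {low_all:.2f} to {high_all:.2f}")
```
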

Testing the null hypothesis

What is the null hypothesis

Suppose we take two samples. Can we say that the two samples are representative of the same process? The hypothesis that they are is known as the null hypothesis. The standard error of the difference between two samples with sample sizes N1 and N2 and standard deviations SD1 and SD2 is given by:

SE_{diff} = \sqrt{\frac{{SD_1}^2}{N_1} + \frac{{SD_2}^2}{N_2}}

The standard error of the difference is interpreted as follows. Suppose the two samples have means M1 and M2, with a difference of k = M1 - M2. Even if the two samples are measuring the same process, k will be non-zero due to chance. The question is whether its value is significant. The important ratio here is that of |k| to the standard error of the difference. If |k| is greater than 1.96*SEdiff, a difference this large would arise by chance less than 5% of the time if the two samples were indeed from the same process, so the null hypothesis is rejected at the 5% level. Similar limits may be set for other confidence levels.
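
The comparison of two sample means against the standard error of their difference can be sketched as follows. The two samples are hypothetical, and the function names are chosen for this example only; the 1.96 multiple assumes reasonably large samples.

```python
import math
import statistics

def se_diff(sample1, sample2):
    """Standard error of the difference between two sample means."""
    v1 = statistics.variance(sample1)  # SD1 squared
    v2 = statistics.variance(sample2)  # SD2 squared
    return math.sqrt(v1 / len(sample1) + v2 / len(sample2))

def differs_at_5_percent(sample1, sample2):
    """True if the difference in means exceeds 1.96 standard errors."""
    k = abs(statistics.mean(sample1) - statistics.mean(sample2))
    return k > 1.96 * se_diff(sample1, sample2)

# Hypothetical samples of 70 readings each; b is shifted upwards by 0.5.
a = [10.0 + (i % 7) * 0.1 for i in range(70)]
b = [10.5 + (i % 7) * 0.1 for i in range(70)]

print("SE of difference:", round(se_diff(a, b), 4))
print("a vs b significant at 5%:", differs_at_5_percent(a, b))
print("a vs a significant at 5%:", differs_at_5_percent(a, a))
```
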

Type I and Type II errors

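The error of assuming two samples to be from different processes when in fact they are from the same process (rejecting a true null hypothesis) is called a Type I error. The converse error, assuming the samples to be from the same process when in fact they are from different processes, is known as a Type II error.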

Percentages and Paired Alternatives

We now look at a study where we are trying to establish the percentage of a population that has a particular attribute by random sampling. For example, let us say we are trying to measure the percentage of males in the population of India. By taking a random sample of size Ns we measure a percentage p. The standard error of the percentage is given by

SE(p) = \sqrt{\frac{p (100 - p) }{N_s}}

where p is measured as a percentage. From this we can get the 95% confidence interval as shown before, i.e. the 95% confidence interval is from (p - 1.96*SE(p))% to (p + 1.96*SE(p))%. Similarly, to test the null hypothesis for two random samples which yield two percentages p1 and p2, the standard error of the difference between the percentages is given by:

SE_{diff} = \sqrt{\frac{p_1(100 - p_1)}{N_1} + \frac{p_2(100 - p_2)}{N_2}}
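
The two formulas for percentages can be sketched as below. The survey figures are hypothetical, and the 95% interval uses the 1.96 multiple of the standard error.

```python
import math

def se_percent(p, n):
    """Standard error of a percentage p measured on a sample of size n."""
    return math.sqrt(p * (100 - p) / n)

def ci95_percent(p, n):
    """95% confidence interval for a percentage: p +/- 1.96 * SE(p)."""
    se = se_percent(p, n)
    return p - 1.96 * se, p + 1.96 * se

def se_diff_percent(p1, n1, p2, n2):
    """Standard error of the difference between two percentages."""
    return math.sqrt(p1 * (100 - p1) / n1 + p2 * (100 - p2) / n2)

# Hypothetical survey: 50% of a sample of 400 have the attribute.
low, high = ci95_percent(50, 400)
print(f"SE = {se_percent(50, 400):.2f}%, 95% CI = {low:.1f}% to {high:.1f}%")

# Hypothetical second survey: 56% of 400. Is the 6-point gap significant?
gap = abs(56 - 50)
limit = 1.96 * se_diff_percent(50, 400, 56, 400)
print(f"difference {gap} vs. limit {limit:.2f}: significant = {gap > limit}")
```
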

Student's T test

Student's t test is used to test the null hypothesis that the measured mean of a sample is equal to the real mean of a population when the sample size is small, i.e. less than 50. As in the section on testing the null hypothesis above, we evaluate the null hypothesis by comparing the difference between the two quantities against the standard error. The problem with a small sample size is that the error in measuring the standard error itself is higher, which means that the uncertainty in evaluating the null hypothesis is also higher. Gosset studied this problem and came up with a distribution called Student's distribution (he used Student as a pseudonym), which is used to make the null-hypothesis evaluation when the sample size is small.

Student's distribution is used to establish the multiple of the standard error above which a measured difference is significant. We know that for large sample sizes, the 95% confidence interval around the mean is given by 1.96 times the standard error. For small sample sizes, the figure 1.96 is replaced by a corresponding, larger figure from Student's table, which depends on the sample size. Student's table is indexed by the degrees of freedom. When we are comparing the output of a single survey against an established mean, the degrees of freedom is one less than the number of observations in the sample. When we are comparing two surveys, the overall degrees of freedom is the sum of the degrees of freedom of the two surveys.
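
A one-sample t test against an established mean might be sketched as follows. The readings and the established mean of 5.0 are hypothetical, and only a handful of the standard two-sided 5% critical values of Student's t are reproduced here; a full table would be needed in practice.

```python
import math
import statistics

# Two-sided 5% critical values of Student's t, indexed by degrees of freedom.
T_CRIT_5PCT = {4: 2.776, 5: 2.571, 6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}

def one_sample_t(sample, reference_mean):
    """Return (t, degrees of freedom) for a sample against a reference mean."""
    n = len(sample)
    sem = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - reference_mean) / sem, n - 1

# Hypothetical small sample compared with an established mean of 5.0.
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.7, 5.5]
t, dof = one_sample_t(sample, 5.0)

print(f"t = {t:.3f} with {dof} degrees of freedom")
print("significant at 5%:", abs(t) > T_CRIT_5PCT[dof])
```

Note that the limit used for 7 degrees of freedom is 2.365 rather than 1.96, reflecting the extra uncertainty of a small sample.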

Chi-squared tests

Up to now, we have been discussing tests for quantitative variables. The chi-squared test (pronounced with a hard 'ch', like the 'k' in kite) is used to test categorical variables. See the introduction for the definition of a categorical variable. Suppose we have a team with 50 members. The director produces a compensation plan for the team, in which the team is broken up into four categories, CA, CB, CC and CD (in increasing order) according to their compensation packages. Then the senior technical member of the team is asked to classify the members of the team into four categories, A, B, C and D (in increasing order) according to their value to the team. There will naturally be differences between the two classifications in the number of people in each category. The question that the chi-squared test answers is: is the difference significant? If it is, there is some discrepancy between the perception of the senior management (who set the compensation packages) and the perception of the technical head. The technique for conducting a chi-squared test is fairly simple.

  1. Write the counts for each category in a table. For the situation above, we can write up a table such as the one given below.

| Category | Compensation | Value |
|---|---|---|
| A | 4 | 10 |
| B | 26 | 17 |
| C | 14 | 15 |
| D | 6 | 8 |

  2. Construct a null hypothesis out of the table. For example, we could hypothesise that the differences between the two columns are not statistically significant. The null hypothesis then becomes that the proportions in each category are really the same for the two sets of observations. The table is then extended as follows:

| Category | Compensation | Value | Totals | Proportion of first column to total |
|---|---|---|---|---|
| A | 4 | 10 | 14 | 0.2857 |
| B | 26 | 17 | 43 | 0.6046 |
| C | 14 | 15 | 29 | 0.4827 |
| D | 6 | 8 | 14 | 0.4285 |

  3. Based on the above table, compute the expected number for each cell. The expected number is the reading we would expect if the null hypothesis were true, i.e. if each category accounted for the same proportion of both columns (Compensation and Value). That proportion comes from the Totals column. Thus row B holds 43 of the 100 entries, a proportion of 0.43, so both columns (of 50 entries each) should have the figure 21.5 in row B.

  4. Having computed the expected numbers, compute the chi-squared value from

\chi^2 = \sum_{\text{all cells}}\frac{(O_i - E_i)^2}{E_i}

where O_i is the observed (measured) count and E_i the expected count in cell i. After this computation, the table looks like the following:

| Category | Measured Comp | Measured Value | Expected Comp | Expected Value | O-E Comp | O-E Value | (O-E)^2/E Comp | (O-E)^2/E Value |
|---|---|---|---|---|---|---|---|---|
| A | 4 | 10 | 7 | 7 | -3 | 3 | 1.286 | 1.286 |
| B | 26 | 17 | 21.5 | 21.5 | 4.5 | -4.5 | 0.942 | 0.942 |
| C | 14 | 15 | 14.5 | 14.5 | -0.5 | 0.5 | 0.017 | 0.017 |
| D | 6 | 8 | 7 | 7 | -1 | 1 | 0.143 | 0.143 |
| Totals | | | | | | | 2.388 | 2.388 |

The result is the sum of the two columns, which is 4.775. Note that the computation in this case is simplified since the two column totals are the same.

  5. Having computed the chi-squared value, match it against the standard chi-squared table and, given the degrees of freedom, determine the significance level. The computation of degrees of freedom is a little tricky. A standard rule of thumb is that it is given as (number of columns - 1)*(number of rows - 1). The theory is that the number of degrees of freedom is the number of entries that must be provided, in addition to all the row and column totals, for someone to complete the table. In our case it is (2 - 1)*(4 - 1) = 3. Entering the standard chi-squared table at three degrees of freedom, we see that our value corresponds to a probability between 0.5 and 0.1. In other words, if the null hypothesis were true, a discrepancy at least this large would arise by chance in more than 10% of cases. This is well above the conventional 5% threshold, so the discrepancies between the perception of the senior management and that of the technical lead are not statistically significant. The null hypothesis is not disproved.
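
The whole procedure can be checked with a short sketch that recomputes the expected counts and the chi-squared value from the team counts used in this section (A: 4 and 10, B: 26 and 17, C: 14 and 15, D: 6 and 8). With these counts the (O-E)^2/E contributions sum to about 4.78. The 5% critical value for three degrees of freedom, 7.815, is taken from a standard chi-squared table.

```python
# Observed counts per category: (compensation, value).
observed = {"A": (4, 10), "B": (26, 17), "C": (14, 15), "D": (6, 8)}

col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]
grand_total = sum(col_totals)

chi_squared = 0.0
for category, row in observed.items():
    row_total = sum(row)
    for j, o in enumerate(row):
        # Expected count under the null hypothesis of equal proportions.
        expected = row_total * col_totals[j] / grand_total
        chi_squared += (o - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(col_totals) - 1)
CRITICAL_5PCT_3DOF = 7.815  # standard table value for 3 degrees of freedom

print(f"chi-squared = {chi_squared:.3f}, degrees of freedom = {dof}")
print("significant at 5%:", chi_squared > CRITICAL_5PCT_3DOF)
```
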

For a two-by-two table, i.e. two rows and two columns, with entries a and b in the first row and c and d in the second, the chi-squared value can be computed directly as follows:

\chi^2 = \frac{(ad - bc)^2 (a+b+c+d) } { (a+b)(c+d)(b+d)(a+c) }
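
As a sanity check, the shortcut can be compared with the general sum of (O-E)^2/E over all four cells. The counts below are hypothetical, with a and b in the first row and c and d in the second.

```python
def chi_squared_2x2(a, b, c, d):
    """Shortcut formula for a two-by-two table [[a, b], [c, d]]."""
    n = a + b + c + d
    return (a * d - b * c) ** 2 * n / ((a + b) * (c + d) * (b + d) * (a + c))

def chi_squared_general(table):
    """Sum of (O - E)^2 / E over all cells of a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            total += (o - e) ** 2 / e
    return total

shortcut = chi_squared_2x2(10, 20, 30, 40)
general = chi_squared_general([[10, 20], [30, 40]])
print(f"shortcut = {shortcut:.5f}, general = {general:.5f}")
```
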

Measuring skew - non parametric tests

Up to now we have considered surveys and samples which are assumed to have a normal distribution. However, we may have to carry out similar tests, especially null-hypothesis tests, on distributions that cannot safely be assumed to be normal.

Wilcoxon signed rank sum test

In this section we discuss the Wilcoxon signed rank sum test, which can be used to evaluate null hypotheses on non-normal distributions. The first test we describe is used to evaluate a null hypothesis regarding the differences between two paired sets of data, one taken from a control group and the other from the experimental group. Alternatively, it could be the same group, with measurements taken in a 'normal' state and subsequently in the 'experimental' state. It is important that each pair be independent of the other pairs and that the distribution of the differences, even if not normal, be symmetric. The steps are then as follows:

  1. Compute the difference in readings for each pair, keeping track of its sign.
  2. Rank all differences in increasing order of magnitude, ignoring the sign. If the absolute values of two or more differences are the same, give them the same rank by averaging over the individual ranks, i.e. if there are three equal readings and they fall in ranks 3, 4 and 5, set them all to (3+4+5)/3 = 4.
  3. Now sign each rank by taking the sign of the corresponding difference.
  4. Add up the positive ranks and the negative ranks separately and take the smaller of the two totals, ignoring its sign. Count the number of rows and consult the table given below. The difference is significant at a given level if the smaller total is less than or equal to the tabulated value. For example, if the smaller rank sum is 8 for 12 rows, the difference between the two sets of results is significant at the 5% level (8 is below 14) but not at the 1% level (8 is above 7).
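
| Number of rows | 5% level | 1% level |
|---|---|---|
| 7 | 2 | 0 |
| 8 | 2 | 0 |
| 9 | 6 | 2 |
| 10 | 8 | 3 |
| 11 | 11 | 5 |
| 12 | 14 | 7 |
| 13 | 17 | 10 |
| 14 | 21 | 13 |
| 15 | 25 | 16 |
| 16 | 30 | 19 |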
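
The signed rank procedure can be sketched as follows. The paired readings are hypothetical, the function names are chosen for this example, and dropping pairs with a zero difference is an assumption that the text above does not address. The critical values are those of the table in this section.

```python
def average_ranks(values):
    """Rank values in increasing order; tied values share the mean of their ranks."""
    ordered = sorted(values)
    # Ties occupy ranks index+1 .. index+count, whose mean is index + (count + 1) / 2.
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def signed_rank_statistic(control, experiment):
    """Return (smaller rank total, number of non-zero differences)."""
    # Assumption: pairs with zero difference are dropped.
    diffs = [e - c for c, e in zip(control, experiment) if e != c]
    ranks = average_ranks([abs(d) for d in diffs])
    positive = sum(r for r, d in zip(ranks, diffs) if d > 0)
    negative = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(positive, negative), len(diffs)

# Critical values: rows -> (5% level, 1% level).
CRITICAL = {7: (2, 0), 8: (2, 0), 9: (6, 2), 10: (8, 3), 11: (11, 5),
            12: (14, 7), 13: (17, 10), 14: (21, 13), 15: (25, 16), 16: (30, 19)}

def significance(statistic, rows):
    five, one = CRITICAL[rows]
    if statistic <= one:
        return "1%"
    if statistic <= five:
        return "5%"
    return "not significant"

control = [10, 12, 9, 14, 11, 13, 8, 15]      # hypothetical readings
experiment = [13, 17, 8, 20, 18, 21, 10, 19]  # hypothetical readings
statistic, rows = signed_rank_statistic(control, experiment)
print(f"smaller rank total = {statistic} for {rows} rows:", significance(statistic, rows))
```
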

The second test is used when the distribution for the study sample is non-symmetric. The technique is as follows:

  1. Take all the results from both sets together, sort them in increasing order and rank them as described above.
  2. Add up the ranks for the two sets independently and choose the smaller of the two totals.
  3. Consult the rank sum table, taking the entry at the row and column for the number of rows in the study. Here it is assumed that the number of entries in both studies is equal. The table gives the appropriate significance level.

Mann and Whitney Test

If the numbers of entries in the two sets of data, n1 and n2, are not equal, we use the Mann and Whitney U test as follows. T1 is set to the rank sum of the smaller sample, of size n1. If the readings are more or less unique (i.e. there are few ties), we can compute z as

z = \frac{\vert T_1 - n_1(n_1 + n_2 + 1)/2 \vert}{\sqrt{n_1 n_2 (n_1 + n_2 + 1) /12}}

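Under the null hypothesis z is approximately normally distributed with zero mean and unit standard deviation, so we can use Table 1 in the Statistical Measures section to arrive at the appropriate confidence level.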
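
A minimal sketch of the rank sum and the z computation is given below. Both samples are hypothetical and deliberately well separated; with samples this small the normal approximation is only rough.

```python
import math

def average_ranks(values):
    """Rank values in increasing order; tied values share the mean of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def rank_sum_z(sample1, sample2):
    """Return (T1, z), where T1 is the rank sum of the smaller sample."""
    if len(sample1) > len(sample2):
        sample1, sample2 = sample2, sample1  # make sample1 the smaller one
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))
    t1 = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2
    spread = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return t1, abs(t1 - expected) / spread

small = [1, 2, 3, 4, 5]               # hypothetical, n1 = 5
large = [6, 7, 8, 9, 10, 11, 12]      # hypothetical, n2 = 7
t1, z = rank_sum_z(small, large)
print(f"T1 = {t1}, z = {z:.3f}, beyond 1.96: {z > 1.96}")
```
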

Correlation and Regression

Correlation

Correlation is a measure of linear association. If we have two sets of data, the correlation coefficient (or Pearson's correlation coefficient) equals 1.0 (the maximum) if an increase in one set is always matched by a proportional increase in the other. It equals -1.0 if the association is perfectly negative, i.e. an increase in one set is matched by a proportional decrease in the other. Values near zero indicate little or no linear association. A good way to get a visual indication of correlation is a scatter diagram. This consists of a graph where the X-axis and Y-axis represent the ranges of the first and second sets of data respectively. The points in the graph correspond to each pair of data points for the two sets. A scatter diagram with points scattered all over indicates low correlation. High correlation is shown by points lining up close to a straight line, rising for positive correlation (such as the line x=y) and falling for negative correlation (such as the line x+y=k). The correlation coefficient for two sets of data x_i and y_i of sample size n is given by

r = \frac{\sum_i x_i y_i - n Avg(X) Avg(Y)} {(n-1) SD(X) SD(Y) }

To test whether the correlation is significant one uses the Student's t test described above. The value of t is computed as

t = r \sqrt{\frac{n-2}{1 - r^2}}

The number of degrees of freedom is given by n-2. Pearson's correlation coefficient assumes that the two data sets are normally distributed. Spearman's rank correlation can be used to compute the correlation coefficient of non-normal data sets. The technique, as before, is to rank the readings within each set and then compute d_i, the difference between the two ranks of each pair. The correlation coefficient is computed as

r_s = 1 - \frac{6 \sum_i {d_i}^2}{n(n^2 - 1)}
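
The correlation formulas can be sketched as below, using Spearman's coefficient in its usual form r_s = 1 - 6*sum(d_i^2)/(n(n^2 - 1)). The five pairs of readings are hypothetical and the function names are chosen for this example.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson's correlation coefficient from sums, means and sample SDs."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = sum_xy - n * statistics.mean(xs) * statistics.mean(ys)
    return numerator / ((n - 1) * statistics.stdev(xs) * statistics.stdev(ys))

def t_for_r(r, n):
    """Student's t for testing a correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))

def average_ranks(values):
    """Rank values in increasing order; tied values share the mean of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def spearman_rs(xs, ys):
    """Spearman's rank correlation from the rank differences d_i."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(average_ranks(xs), average_ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical readings
ys = [2.0, 1.0, 4.0, 3.0, 5.0]  # hypothetical readings
r = pearson_r(xs, ys)
t = t_for_r(r, len(xs))
rs = spearman_rs(xs, ys)
print(f"r = {r:.3f}, t = {t:.3f} on {len(xs) - 2} degrees of freedom, r_s = {rs:.3f}")
```
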

Regression

Whereas correlation is a symmetrical measure, i.e. the correlation of X to Y is the same as the correlation of Y to X, regression attempts to describe one set of data points, the dependent variable Y, as a linear function of the other set of data points X, which are taken to be independent. The function takes the form Y = \alpha + \beta X. It can be shown that

\beta = \frac{ \sum_i x_i y_i - n Avg(X) Avg(Y) } { (n-1){SD(X)}^2 }

and \alpha = Avg(Y) - \beta Avg(X). The standard deviation of the residuals about the fitted line is given by

\sqrt{{SD(Y)}^2(1-r^2)\frac{n-1}{n-2}}

and the standard error of the slope coefficient \beta is this quantity divided by SD(X)\sqrt{n-1}, i.e.

SE(\beta) = \frac{SD(Y)}{SD(X)} \sqrt{\frac{1-r^2}{n-2}}

We can use this to estimate a confidence interval for \beta in the same way as for the mean in the Populations and Samples section, using Student's t with n-2 degrees of freedom when the sample is small.
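
The regression coefficients can be sketched as follows, with the intercept taken as \alpha = Avg(Y) - \beta Avg(X) and the standard error of the slope as the residual standard deviation divided by SD(X)*sqrt(n-1). The data are hypothetical and the function name is chosen for this example.

```python
import math
import statistics

def linear_regression(xs, ys):
    """Return (alpha, beta, standard error of beta) for Y = alpha + beta * X."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    cross = sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y
    beta = cross / ((n - 1) * sd_x ** 2)
    alpha = mean_y - beta * mean_x
    r = cross / ((n - 1) * sd_x * sd_y)
    # Standard deviation of the residuals about the fitted line.
    residual_sd = math.sqrt(sd_y ** 2 * (1 - r ** 2) * (n - 1) / (n - 2))
    se_beta = residual_sd / (sd_x * math.sqrt(n - 1))
    return alpha, beta, se_beta

xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical readings
ys = [2.0, 1.0, 4.0, 3.0, 5.0]  # hypothetical readings
alpha, beta, se_beta = linear_regression(xs, ys)
print(f"Y = {alpha:.2f} + {beta:.2f} X, SE(beta) = {se_beta:.4f}")
```

The ratio beta / SE(beta) for these readings equals the t value obtained from the correlation coefficient of the same data, which is a useful consistency check.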

References

Everything in this tutorial can be found in any basic book on statistics. One suggested reading is "Basic Econometrics", by Damodar N. Gujarati

Maintainer: abheek.saha@hsc.com

Page last modified on February 23, 2011, at 03:25 AM