## Testing whether two distributions are different

Use the chi-square test to test whether two distributions are different.

The chi-square statistic is

$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$

where:

$O_i$ = observed count in bin $i$

$E_i$ = expected count in bin $i$

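The above can be used directly when comparing a set of observations with a known (expected) distribution. In this case the number of degrees of freedom is equal to the number of bins, or to (number of bins - 1) if the expected counts have been scaled so that their total matches the total number of observations.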
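As a minimal sketch of the statistic above, the following Python computes $\chi^2$ for one set of binned observations against a known distribution. The function name `chi_square_statistic` and the die-roll counts are illustrative assumptions, not data from any real experiment.

```python
def chi_square_statistic(observed, expected):
    """Return sum((O_i - E_i)^2 / E_i) over all bins."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of bins")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 rolls of a die, compared with a fair die (10 per face).
observed = [8, 12, 9, 11, 6, 14]
expected = [10.0] * 6
chi2 = chi_square_statistic(observed, expected)
print(chi2)  # about 4.2
```

Every expected count must be non-zero, otherwise the corresponding term is undefined.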

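Given two sets of binned data A and B, with counts $A_i$ and $B_i$ in bin $i$, the expected count in each bin of each set is the pooled count for that bin, scaled by that set's share of the total number of samples, i.e.: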

$E_i^A = \frac{A_i + B_i}{N_A + N_B} N_A$

$E_i^B = \frac{A_i + B_i}{N_A + N_B} N_B$

where $N_A$ is the total number of samples in set A, etc.

Thus the chi-square statistic is

$\chi^2 = \sum_i \left[ \frac{(A_i - E_i^A)^2}{E_i^A} + \frac{(B_i - E_i^B)^2}{E_i^B} \right]$

which can also be written:

$\chi^2 = \sum_i \frac{(N_B A_i - N_A B_i)^2}{N_A N_B (A_i + B_i)}$

If the total number of samples in each set is the same, i.e. $N_A = N_B$, then this simplifies down to:

$\chi^2 = \sum_i \frac{(A_i - B_i)^2}{A_i + B_i}$

The number of degrees of freedom is (number of bins – 1).
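As a sanity check on the two-sample formulas above, the following Python sketch computes the statistic both from the expected counts $E_i^A$, $E_i^B$ and from the compact form, which should agree. The function names and example counts are illustrative assumptions. Skipping bins that are empty in both sets is also an assumption, made here to avoid a 0/0 term.

```python
def chi_square_from_expected(a, b):
    """Two-sample chi-square via the expected counts E_i^A and E_i^B."""
    n_a, n_b = sum(a), sum(b)
    chi2 = 0.0
    for a_i, b_i in zip(a, b):
        pooled = a_i + b_i
        if pooled == 0:
            continue  # assumption: a bin empty in both sets is ignored
        e_a = pooled * n_a / (n_a + n_b)
        e_b = pooled * n_b / (n_a + n_b)
        chi2 += (a_i - e_a) ** 2 / e_a + (b_i - e_b) ** 2 / e_b
    return chi2


def chi_square_compact(a, b):
    """Same statistic using (N_B A_i - N_A B_i)^2 / (N_A N_B (A_i + B_i))."""
    n_a, n_b = sum(a), sum(b)
    chi2 = 0.0
    for a_i, b_i in zip(a, b):
        if a_i + b_i == 0:
            continue  # same assumption as above
        chi2 += (n_b * a_i - n_a * b_i) ** 2 / (n_a * n_b * (a_i + b_i))
    return chi2


# Hypothetical binned counts with different totals (60 and 80 samples).
a = [10, 20, 30]
b = [20, 20, 40]
print(chi_square_from_expected(a, b), chi_square_compact(a, b))  # both about 1.944
```

If bins are skipped in this way, it seems reasonable to count only the remaining bins when working out the degrees of freedom.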

### Testing against a significance level

Choose a confidence level and look up the inverse chi-square cumulative distribution for the given number of degrees of freedom. For example, at 95% confidence and 1 degree of freedom, the threshold is $\chi^2_t = 3.84$. If $\chi^2 > \chi^2_t$, then it can be said with the given level of confidence that the distributions differ.
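Instead of looking up a threshold, one can compute the probability that a chi-square variable exceeds the observed value and compare it with 1 minus the confidence level, which is an equivalent test. The sketch below does this with only the Python standard library, using a series expansion of the regularised incomplete gamma function. The names `chi_square_sf` and `differs_at` are hypothetical, and the series is only intended for moderate values of $\chi^2$.

```python
import math


def chi_square_sf(x, dof):
    """Probability that a chi-square variable with `dof` degrees of freedom exceeds x."""
    if x <= 0:
        return 1.0
    a = dof / 2.0
    z = x / 2.0
    # Regularised lower incomplete gamma function:
    # P(a, z) = z^a e^{-z} / Gamma(a + 1) * sum_{n>=0} z^n / ((a+1)(a+2)...(a+n))
    term = 1.0
    total = 1.0
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= z / (a + n)
        total += term
    log_prefactor = a * math.log(z) - z - math.lgamma(a + 1.0)
    return max(0.0, 1.0 - math.exp(log_prefactor) * total)


def differs_at(chi2, dof, significance=0.05):
    """True if chi2 exceeds the threshold for the given significance (1 - confidence)."""
    return chi_square_sf(chi2, dof) < significance


print(chi_square_sf(3.84, 1))  # close to 0.05, matching the 95% threshold quoted above
```

For very large $\chi^2$ the subtraction loses precision, so a library routine is a better choice in production code.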

Strictly, the chi-square distribution describes the sum of the squares of standard normal random variables, so this test should only be used when there are enough samples for the count in each bin to be approximately normally distributed. As a rule of thumb, it will normally be acceptable so long as no more than 10% of the bins have expected frequencies below 5. Where there is only 1 degree of freedom, the approximation is not reliable if expected frequencies are below 10.

### Code pointers

Octave – chisquare_inv, chisquare_test_homogeneity

Perl – Statistics::Distributions

### References

Numerical Recipes

Chi Square Distribution

Chi Square Test

## Determining whether two means come from the same distribution

Useful when: you have two sets of measurements, and want to know if there has been a shift in mean value.

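Use Student’s t-test for significantly different means. The form given here pools the two sample variances, so it assumes that both samples have the same underlying variance.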

Sample variance for sample A:

$s_A^2 = \frac{1}{N_A - 1}\sum_{i}(a_i - \bar{a})^2$

Similarly for sample B. Then

$t = \frac{\bar{a} - \bar{b}}{s_D}$

where

$s_D = \sqrt{ \frac{(N_A - 1)s_A^2 + (N_B - 1) s_B^2}{N_A + N_B - 2} \left(\frac{1}{N_A} + \frac{1}{N_B}\right) }$
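The formulas above translate fairly directly into code. The following Python sketch computes the sample variances, the pooled standard error $s_D$ and $t$. The function name `pooled_t_statistic` and the sample values are illustrative assumptions.

```python
import math


def pooled_t_statistic(a, b):
    """Return (t, degrees of freedom) for two samples, pooling their variances."""
    n_a, n_b = len(a), len(b)
    mean_a = sum(a) / n_a
    mean_b = sum(b) / n_b
    # Sample variances with the (N - 1) denominator.
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    s_d = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / s_d, dof


# Hypothetical measurements: means 3 and 5, both with sample variance 2.5.
t, dof = pooled_t_statistic([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
print(t, dof)  # t is about -2.0 with 8 degrees of freedom
```

Each sample needs at least two values, and the samples must not both be constant, otherwise $s_D$ is zero.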

### Finding the significance level

Use the t-distribution with $N_A + N_B - 2$ degrees of freedom to compute the significance level, which is the probability that $|t|$ could be at least this large, by chance, for distributions of equal means. Thus a significance of 0.05 suggests that the means are different with 95% confidence.
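One way to obtain this significance without a statistics library is to integrate the t-distribution density numerically. The Python sketch below uses Simpson's rule over $[0, |t|]$ and the symmetry of the density. The names `t_pdf` and `t_significance` and the default step count are assumptions chosen for illustration, and a library routine would normally be preferable.

```python
import math


def t_pdf(x, dof):
    """Density of Student's t-distribution with `dof` degrees of freedom."""
    log_norm = (math.lgamma((dof + 1) / 2.0) - math.lgamma(dof / 2.0)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2.0 * math.log1p(x * x / dof))


def t_significance(t, dof, steps=2000):
    """Two-sided probability P(|T| >= |t|); `steps` must be even."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    # Simpson's rule on [0, |t|]: h/3 * (f0 + 4 f1 + 2 f2 + ... + 4 f_{n-1} + f_n)
    total = t_pdf(0.0, dof) + t_pdf(upper, dof)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, dof)
    central = 2.0 * total * h / 3.0  # P(|T| <= |t|) by symmetry
    return 1.0 - central


print(t_significance(2.228, 10))  # close to 0.05
```

A result below 0.05 corresponds to the means being different with 95% confidence, in the sense described above.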

### Testing against a significance level

For a test at, say, the 95% confidence level, find the two-tailed threshold value of t at a significance of 0.05 from the t-distribution with the same number of degrees of freedom. If the computed $|t|$ exceeds the threshold, the means are considered different at that level of confidence.

### Code pointers

Octave – t_test

Perl – Statistics::TTest