<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Jon Schutz Technical Notes and Recommendations &#187; Statistics</title>
	<atom:link href="http://notes.jschutz.net/topics/statistics/feed/" rel="self" type="application/rss+xml" />
	<link>http://notes.jschutz.net</link>
	<description>Useful snippets technical info and recommendations</description>
	<lastBuildDate>Wed, 02 Mar 2011 12:01:56 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Testing whether two distributions are different</title>
		<link>http://notes.jschutz.net/2007/11/testing-whether-two-distributions-are-different/</link>
		<comments>http://notes.jschutz.net/2007/11/testing-whether-two-distributions-are-different/#comments</comments>
		<pubDate>Thu, 08 Nov 2007 11:37:20 +0000</pubDate>
		<dc:creator>jon</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[chi-square test]]></category>
		<category><![CDATA[distributions]]></category>

		<guid isPermaLink="false">http://notes.jschutz.net/6/statistics/testing-whether-two-distributions-are-different</guid>
		<description><![CDATA[Use the chi-square test to test whether two distributions are different.
The chi-square test is
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}
where:
O_i = observed data in bin i
E_i = expected data in bin i
The above can be used directly when comparing a set of observations with a known (expected) distribution.  In this case the number of degrees of freedom is equal to the number [...]]]></description>
			<content:encoded><![CDATA[<p>Use the chi-square test to test whether two distributions are different.</p>
<p>The chi-square test is</p>
<p><img src='http://notes.jschutz.net/wp-content/latex/3e9/3e96b86664a179679785450090c0a9a2-FFFFFF000000.png' alt='\chi^2 = \sum_i {\frac{(O_i - E_i)^2}{E_i}}' title='\chi^2 = \sum_i {\frac{(O_i - E_i)^2}{E_i}}' class='latex' /></p>
<p>where:</p>
<p><img src='http://notes.jschutz.net/wp-content/latex/78e/78e7614d2e26a28f9b82f0c05a91175e-FFFFFF000000.png' alt='O_i = ' title='O_i = ' class='latex' /> observed data in bin <em>i</em></p>
<p><img src='http://notes.jschutz.net/wp-content/latex/98c/98c2a07b8a49208981ed2f3de8903935-FFFFFF000000.png' alt='E_i = ' title='E_i = ' class='latex' /> expected data in bin <em>i</em></p>
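<p>The above can be used directly when comparing a set of observations with a known (expected) distribution.  In this case the number of degrees of freedom is equal to the number of bins if the expected values are fixed independently of the data, or (number of bins &#8211; 1) if they are scaled so that their total matches the total number of observations.</p>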
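<p>As a minimal sketch of the statistic above (the function name and the die-roll counts are made up for illustration), in Python:</p>

```python
def chi_square_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all bins."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need the same number of bins")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 rolls of a die, compared with a fair die (10 per face).
observed = [8, 12, 9, 11, 6, 14]
expected = [10.0] * 6
chi2 = chi_square_statistic(observed, expected)
print(chi2)
```

<p>Here the expected counts add up to the observed total of 60, so this example would be tested with 5 degrees of freedom.</p>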
<p>Given two sets of binned data <em>A</em> and <em>B</em>, the expected value in each bin of each set is its proportion of the total, i.e.:</p>
<p><img src='http://notes.jschutz.net/wp-content/latex/9d5/9d538d4047b2cf481f68f5b615e12653-FFFFFF000000.png' alt='E_i^A =  \frac{A_i + B_i}{N_A + N_B} N_A ' title='E_i^A =  \frac{A_i + B_i}{N_A + N_B} N_A ' class='latex' /></p>
<p><img src='http://notes.jschutz.net/wp-content/latex/a79/a79c82be76e2ca4f77d288b86395705c-FFFFFF000000.png' alt='E_i^B =  \frac{A_i + B_i}{N_A + N_B} N_B ' title='E_i^B =  \frac{A_i + B_i}{N_A + N_B} N_B ' class='latex' /></p>
<p>where  <img src='http://notes.jschutz.net/wp-content/latex/486/486baa809f43ec0c6f0252cb16e8026e-FFFFFF000000.png' alt='N_A ' title='N_A ' class='latex' /> is the total number of samples in set <em>A</em>, etc.</p>
<p>Thus the chi-square statistic is</p>
<p><img src='http://notes.jschutz.net/wp-content/latex/999/999bd282f7f8c5a8c3bcb76ea041ef08-FFFFFF000000.png' alt='\chi^2 = \sum_i {\frac{(A_i - E_i^A)^2}{E_i^A} + \frac{(B_i - E_i^B)^2}{E_i^B}}' title='\chi^2 = \sum_i {\frac{(A_i - E_i^A)^2}{E_i^A} + \frac{(B_i - E_i^B)^2}{E_i^B}}' class='latex' /></p>
<p>which can also be written:</p>
<p><img src='http://notes.jschutz.net/wp-content/latex/9f7/9f73641ecc088f7b5fc7a05f9a17194f-FFFFFF000000.png' alt='\chi^2 = \sum_i {\frac{( N_B A_i - N_A B_i)^2}{N_A N_B (A_i + B_i)}} ' title='\chi^2 = \sum_i {\frac{( N_B A_i - N_A B_i)^2}{N_A N_B (A_i + B_i)}} ' class='latex' /></p>
<p>If the total number of samples in each set is the same, i.e. <img src='http://notes.jschutz.net/wp-content/latex/23d/23d24c5818b025aa3f0991005b6753d3-FFFFFF000000.png' alt='N_A = N_B ' title='N_A = N_B ' class='latex' />, then this simplifies down to:</p>
<p><img src='http://notes.jschutz.net/wp-content/latex/c4d/c4d7de51eb46c6bae9ec6d0f99695fc4-FFFFFF000000.png' alt='\chi^2 = \sum_i {\frac{(A_i - B_i)^2}{A_i + B_i}} ' title='\chi^2 = \sum_i {\frac{(A_i - B_i)^2}{A_i + B_i}} ' class='latex' /></p>
<p>The number of degrees of freedom is (number of bins &#8211; 1).</p>
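<p>The two-sample statistic can be sketched in Python as follows. The function names and counts are hypothetical, and skipping bins that are empty in both sets is an assumption of this sketch (such bins carry no information, so the degrees of freedom should be reduced to match). The second function computes the same value from the expected values, as a check that the two forms agree.</p>

```python
def two_sample_chi_square(a, b):
    """Chi-square statistic for two sets of binned counts (compact form)."""
    if len(a) != len(b):
        raise ValueError("both data sets need the same bins")
    n_a, n_b = sum(a), sum(b)
    chi2 = 0.0
    for a_i, b_i in zip(a, b):
        if a_i + b_i == 0:
            continue  # assumption: bins empty in both sets are skipped
        chi2 += (n_b * a_i - n_a * b_i) ** 2 / (n_a * n_b * (a_i + b_i))
    return chi2


def two_sample_chi_square_from_expected(a, b):
    """Same statistic, computed from the expected values E_i^A and E_i^B."""
    n_a, n_b = sum(a), sum(b)
    chi2 = 0.0
    for a_i, b_i in zip(a, b):
        if a_i + b_i == 0:
            continue
        e_a = (a_i + b_i) / (n_a + n_b) * n_a
        e_b = (a_i + b_i) / (n_a + n_b) * n_b
        chi2 += (a_i - e_a) ** 2 / e_a + (b_i - e_b) ** 2 / e_b
    return chi2


# Hypothetical counts with N_A = N_B = 100, so the statistic also equals
# the sum of (A_i - B_i)^2 / (A_i + B_i).
a = [20, 30, 50]
b = [30, 30, 40]
print(two_sample_chi_square(a, b))
print(two_sample_chi_square_from_expected(a, b))
```

<p>With three bins, this example has 2 degrees of freedom.</p>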
<p><strong>Testing against a significance level</strong></p>
<p>Choose a confidence level and look up the inverse chi-square cumulative distribution for the given number of degrees of freedom; e.g. at 95% confidence and 1 degree of freedom, the threshold is <img src='http://notes.jschutz.net/wp-content/latex/f96/f96937a1d2c1578b3a6501937104f625-FFFFFF000000.png' alt='\chi^2_t = 3.84 ' title='\chi^2_t = 3.84 ' class='latex' />.  If <img src='http://notes.jschutz.net/wp-content/latex/f8f/f8f347910ccf2934a4a92070dd386ede-FFFFFF000000.png' alt='\chi^2 &gt; \chi^2_t ' title='\chi^2 &gt; \chi^2_t ' class='latex' />, then it can be said with the given level of confidence that the distributions differ.</p>
<p>Strictly, the chi-square distribution describes the sum of the squares of independent standard <em>normal</em> random variables, and the threshold comes from the probability that such a sum would exceed a given value.  The test should therefore only be used when there are enough samples in each bin for the counts to be approximately normally distributed.  It will normally be acceptable so long as no more than 10% of the bins have expected frequencies below 5. Where there is only 1 degree of freedom, the approximation is not reliable if expected frequencies are below 10.</p>
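<p>Where no table or library routine is to hand, the comparison against the threshold can be done equivalently with the tail probability. The following is a rough sketch using only the Python standard library; the function names are made up here, and the series used for the incomplete gamma function is a choice made for this sketch (accurate for moderate values, not tuned for extreme ones).</p>

```python
import math


def chi_square_sf(x, dof):
    """Probability that a chi-square variable with dof degrees of freedom exceeds x."""
    if not x > 0:
        return 1.0
    a = dof / 2.0
    half_x = x / 2.0
    # Series expansion of the regularized lower incomplete gamma function P(a, x/2).
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= half_x / (a + n)
        total += term
        n += 1
    log_prefactor = a * math.log(half_x) - half_x - math.lgamma(a)
    return 1.0 - total * math.exp(log_prefactor)


def distributions_differ(chi2, dof, confidence=0.95):
    """True if chi2 exceeds the threshold for the given confidence level."""
    return 1.0 - confidence > chi_square_sf(chi2, dof)


# At 1 degree of freedom, the tail probability at 3.84 is close to 0.05.
print(chi_square_sf(3.84, 1))
# Hypothetical statistics with 5 degrees of freedom.
print(distributions_differ(4.2, 5))
print(distributions_differ(12.0, 5))
```

<p>Testing whether the tail probability is below (1 &#8211; confidence) gives the same decision as testing whether the statistic exceeds the threshold.</p>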
<p><strong>Code Pointers </strong></p>
<p>Octave &#8211; chisquare_inv,  chisquare_test_homogeneity</p>
<p>Perl &#8211;  Statistics::Distributions</p>
<p>Spreadsheet &#8211; chiinv</p>
<p><strong>References</strong></p>
<p><a href="http://www.nrbook.com/a/bookcpdf/c14-3.pdf">Numerical Recipes</a></p>
<p><a href="http://en.wikipedia.org/wiki/Chi_square" target="_blank">Chi Square Distribution</a></p>
<p><a href="http://en.wikipedia.org/wiki/Pearson%27s_chi-square_test" target="_blank">Chi Square Test </a></p>
]]></content:encoded>
			<wfw:commentRss>http://notes.jschutz.net/2007/11/testing-whether-two-distributions-are-different/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Determining whether two means come from the same distribution</title>
		<link>http://notes.jschutz.net/2007/11/determining-whether-two-means-come-from-the-same-distribution/</link>
		<comments>http://notes.jschutz.net/2007/11/determining-whether-two-means-come-from-the-same-distribution/#comments</comments>
		<pubDate>Thu, 08 Nov 2007 07:00:33 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[means]]></category>
		<category><![CDATA[t-test]]></category>

		<guid isPermaLink="false">http://notes.jschutz.net/archives/3</guid>
		<description><![CDATA[Useful when: you have two sets of measurements, and want to know if there has been a shift in mean value.
Use Student&#8217;s t-test for significantly different means.
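Sample variance for sample A:
s_A^2 = \frac{1}{N_A - 1}\sum_{i}(a_i - \bar{a})^2
Similarly for sample B.  Then
t = \frac{\bar{a} - \bar{b}}{s_D}
where
s_D = \sqrt{ \frac{(N_A - 1)s_A^2 + (N_B - 1) s_B^2}{N_A + N_B - 2} (\frac{1}{N_A} + \frac{1}{N_B}) }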
Finding the significance level 
Use the t-distribution with N_A + N_B - 2 degrees of freedom to compute the significance level, which is the [...]]]></description>
			<content:encoded><![CDATA[<p>Useful when: you have two sets of measurements, and want to know if there has been a shift in mean value.</p>
<p>Use Student&#8217;s t-test for significantly different means.  The pooled form given here assumes that both samples share the same underlying variance.</p>
<p>Sample variance for sample A:</p>
<p><img src='http://notes.jschutz.net/wp-content/latex/125/1250435c79ff98515bc6071bd1d09a42-FFFFFF000000.png' alt='s_A^2 = \frac{1}{N_A - 1}\sum_{i}(a_i - \bar{a})^2' title='s_A^2 = \frac{1}{N_A - 1}\sum_{i}(a_i - \bar{a})^2' class='latex' /></p>
<p>Similarly for sample B.  Then</p>
<p><img src='http://notes.jschutz.net/wp-content/latex/a11/a11ab783a4bd05ef9192bea42e92a250-FFFFFF000000.png' alt='t = \frac{\bar{a} - \bar{b}}{s_D} ' title='t = \frac{\bar{a} - \bar{b}}{s_D} ' class='latex' /></p>
<p>where</p>
<p><img src='http://notes.jschutz.net/wp-content/latex/8ac/8acd54e69a89d0a493a0fa5960946beb-FFFFFF000000.png' alt='s_D = \sqrt{ \frac{(N_A - 1)s_A^2 + (N_B - 1) s_B^2}{N_A + N_B - 2} (\frac{1}{N_A} + \frac{1}{N_B}) }' title='s_D = \sqrt{ \frac{(N_A - 1)s_A^2 + (N_B - 1) s_B^2}{N_A + N_B - 2} (\frac{1}{N_A} + \frac{1}{N_B}) }' class='latex' /></p>
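<p>A short Python sketch of the calculation above, using the standard library only. The function name and sample values are hypothetical; <code>statistics.variance</code> is the sample variance with divisor N &#8211; 1.</p>

```python
import math
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Return (t, degrees of freedom) for two samples, using the pooled standard error."""
    n_a, n_b = len(a), len(b)
    var_a, var_b = variance(a), variance(b)  # sample variances, divisor N - 1
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    s_d = math.sqrt(pooled * (1.0 / n_a + 1.0 / n_b))
    t = (mean(a) - mean(b)) / s_d
    return t, n_a + n_b - 2


# Hypothetical measurements: the second set is shifted up by 1.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 3.0, 4.0, 5.0, 6.0]
t, dof = pooled_t_statistic(a, b)
print(t, dof)
```

<p>For these values both sample variances are 2.5, the pooled standard error is 1, and t is -1 with 8 degrees of freedom.</p>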
<p><strong>Finding the significance level </strong></p>
<p>Use the t-distribution with <img src='http://notes.jschutz.net/wp-content/latex/868/868aa3ce2eb383583f175fb62c629abc-FFFFFF000000.png' alt='N_A + N_B - 2' title='N_A + N_B - 2' class='latex' /> degrees of freedom to compute the significance level, which is the probability that <img src='http://notes.jschutz.net/wp-content/latex/e8f/e8fe2606f603ea96e3dd9e6faccdc16a-FFFFFF000000.png' alt='|t|' title='|t|' class='latex' /> could be this large or larger, by chance, for distributions of equal means.  Thus a significance of 0.05 suggests that the means are different with 95% confidence.</p>
<p><strong>Testing against a significance level </strong></p>
<p>For a test at e.g. the 95% confidence level, find the two-tailed threshold value of t at a significance of 0.05 from the t-distribution with the same number of degrees of freedom.  If the computed |t| exceeds the threshold, the means are considered different to that level of confidence.</p>
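<p>If no t-distribution routine is available, the two-tailed significance can be approximated by integrating the t density numerically. The sketch below uses Simpson's rule with the Python standard library; the function names and the step count are choices made for this sketch, not part of any library.</p>

```python
import math


def t_pdf(x, dof):
    """Density of Student's t-distribution with dof degrees of freedom."""
    log_norm = (math.lgamma((dof + 1) / 2.0) - math.lgamma(dof / 2.0)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2.0 * math.log1p(x * x / dof))


def two_tailed_significance(t, dof, steps=2000):
    """Probability that |t| could be at least this large by chance."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    area = t_pdf(0.0, dof) + t_pdf(upper, dof)
    for k in range(1, steps):
        area += (4 if k % 2 else 2) * t_pdf(k * h, dof)
    area *= h / 3.0  # integral of the density from 0 to |t|
    return 1.0 - 2.0 * area


def means_differ(t, dof, confidence=0.95):
    """True if the significance is below 1 - confidence."""
    return 1.0 - confidence > two_tailed_significance(t, dof)


# Hypothetical t values.
print(two_tailed_significance(2.228, 10))  # close to 0.05
print(means_differ(-1.0, 8))
print(means_differ(3.0, 10))
```

<p>Comparing the significance with (1 &#8211; confidence) gives the same decision as comparing |t| with the threshold value.</p>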
<p><strong>Code pointers</strong></p>
<p>Octave &#8211; t_test</p>
<p>Perl &#8211;  Statistics::TTest</p>
<p>Spreadsheet &#8211; TTEST, TDIST, TINV</p>
<p><strong>References:</strong></p>
<p><a href="http://en.wikipedia.org/wiki/Student's_t-test">http://en.wikipedia.org/wiki/Student&#8217;s_t-test</a></p>
<p><a href="http://www.nrbook.com/a/bookcpdf/c14-2.pdf" title="Numerical Recipes">Numerical Recipes</a></p>
]]></content:encoded>
			<wfw:commentRss>http://notes.jschutz.net/2007/11/determining-whether-two-means-come-from-the-same-distribution/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

