Supplementary Notes 4.3

Iain Pardoe

Lesson 4.3: Testing Independence in Two-way Tables

Chi-Square Test for Independence

We first encountered the idea of independence of two categorical variables in Lesson 1.4. When subject responses are cross-classified according to row variable and column variable categories, the resulting frequency (or percent) table is called a two-way table or contingency table. A chi-square test of independence tests whether two categorical variables measured on the same population are independent.

Hypotheses

H₀: The variables are independent.
H_A: The variables are not independent.

Assumptions and Conditions

We check the same conditions as we did for the chi-square goodness-of-fit test in Lesson 4.2, and we use a similar chi-square test statistic. When the conditions below are met and the null hypothesis is true, $\chi^2 = \sum \dfrac{(Obs-Exp)^2}{Exp}$ follows a chi-square distribution with $(R-1) \times (C-1)$ degrees of freedom (df), where R is the number of rows and C is the number of columns in the contingency table (not counting the totals).
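As a quick check of the degrees-of-freedom formula, consider the smoking status by blood group table (Table 1 in these notes), which has R = 2 rows and C = 4 columns once the totals are set aside:

$$ df = (R-1) \times (C-1) = (2-1) \times (4-1) = 3 $$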

Counted Data Condition: The data must be counts (frequencies) for combinations of the categories of two categorical variables.
Independence: The counts in the cells should be independent of each other.
Random: The sample is random.
Expected Cell Frequency Condition: All expected cell frequencies must be at least five. If a small number of cells have expected counts slightly less than five, you can proceed with caution. However, if it makes sense to do so, it is better to combine cells with expected counts less than five. There is no similar cell frequency condition for the observed counts.

Calculating Expected Counts Under Independence

Independence again? Way back in Supplementary Notes 1.4, we calculated the expected frequencies that would make smoking status and blood group (type A, B, O, or AB) independent. The calculations are reproduced below, starting from the row and column totals in Table 1.

Given the row and column totals in Table 1, what would the individual cell frequencies have to be if smoking status and blood group were perfectly independent?

Table 1 Smoking Status by Blood Group
		Blood Group
		A	B	O	AB	Total
Smoking Status	Yes	?	?	?	?	200
Smoking Status	No	?	?	?	?	800
	Total	400	110	450	40	1,000

First, notice that 200/1000 = 20% are smokers, and 80% are not. So for smoking status and blood group to be perfectly independent, we must preserve this 20% smoker ratio across all blood groups.

In blood group A, the number of smokers must be 20% of 400 = 80.
In blood group B, the number of smokers must be 20% of 110 = 22.
In blood group O, the number of smokers must be 20% of 450 = 90.
In blood group AB, the number of smokers must be 20% of 40 = 8.

Then, use subtraction to fill in the other cell frequencies:

In blood group A, the number of non-smokers must be 400 − 80 = 320.
In blood group B, the number of non-smokers must be 110 − 22 = 88.
In blood group O, the number of non-smokers must be 450 − 90 = 360.
In blood group AB, the number of non-smokers must be 40 − 8 = 32.

Table 2 Smoking Status by Blood Group
		Blood Group
		A	B	O	AB	Total
Smoking Status	Yes	80	22	90	8	200
Smoking Status	No	320	88	360	32	800
	Total	400	110	450	40	1,000

We can also use our knowledge of independence from Lesson 2.2 to achieve the same results. For example, let x = the expected number of smokers in blood group A. For “A” and “Yes” to be independent, we require P(Yes|A) = P(Yes). From Table 2, P(Yes|A) = x/400 and P(Yes) = 200/1000. Set these probabilities equal to get x/400 = 200/1000, which we can rearrange to solve for x: x = (200)(400) / 1000 = 80.

Note that the expected counts (Exp) needed to make the variables independent are obtained by (row total)(column total) / overall total, i.e., $Exp = \dfrac{\text{row total} \times \text{column total}}{\text{overall total}}$ .

For a second example, let y = the expected number of smokers in blood group B, so y = (200)(110) / 1000 = 22.

The other expected cell frequencies can be calculated similarly.
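The row-total-times-column-total rule can be applied to every cell at once. The following is a minimal Python sketch (the notes themselves do not use Python, and the function name `expected_counts` is hypothetical) that rebuilds Table 2 from the totals in Table 1.

```python
def expected_counts(row_totals, col_totals):
    """Expected cell counts under independence, from the marginal totals.

    Each cell is (row total * column total) / overall total.
    """
    overall = sum(row_totals)
    if overall != sum(col_totals):
        raise ValueError("row and column totals must add to the same overall total")
    return [[r * c / overall for c in col_totals] for r in row_totals]


# Totals from Table 1: smoking status (Yes, No) by blood group (A, B, O, AB).
table2 = expected_counts([200, 800], [400, 110, 450, 40])
for label, row in zip(["Yes", "No"], table2):
    print(label, row)
```

The two printed rows should agree with the Yes and No rows of Table 2.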

Example: Chi-Square Test for Independence

Below are data on eye colour and handedness for a group of students. Using a chi-square test, test the claim that a student's handedness is independent of their eye colour. State the null and alternative hypotheses, give the chi-square test statistic value, degrees of freedom, and p-value, and state your conclusion about the hypotheses.

Table 3 Student Eye Colour and Handedness
		Handedness
		Left	Right	Total
Eye Colour	Brown	6	36	42
	Blue	7	26	33
	Green	2	21	23
	Other	4	12	16
	Total	19	95	114

Hypotheses

H₀: Handedness and eye colour are independent.
H_A: There is an association between handedness and eye colour (they are not independent).

Model Conditions

Counted Data Condition: We have counts of combinations of two categorical variables.
Independence: No reason to doubt independence with the information given.
Random: Although the sample may not be random, it may be representative of all students.
Expected Cell Frequency Condition: The expected counts are shown in brackets next to the observed counts in Table 4:

Table 4 Observed and Expected Counts of Eye Colour by Handedness
		Handedness
		Left	Right	Total
Eye Colour	Brown	6 (7)	36 (35)	42
	Blue	7 (5.5)	26 (27.5)	33
	Green	2 (3.83)	21 (19.17)	23
	Other	4 (2.67)	12 (13.33)	16
	Total	19	95	114

The expected count for the first cell (shown in brackets in Table 4) is obtained by: (42)(19) / 114 = 7. The expected counts for left-handed students with green eyes (3.83) and with other eye colours (2.67) are less than 5, so we combine the Green category with the Other category by adding their observed counts. After combining them in Table 5 (where the merged row is labelled Other), all expected counts are greater than 5:

Table 5 Observed and Expected Counts of Eye Colour by Handedness
(Green Colour and Other Combined)
		Handedness
		Left	Right	Total
Eye Colour	Brown	6 (7)	36 (35)	42
	Blue	7 (5.5)	26 (27.5)	33
	Other	6 (6.5)	33 (32.5)	39
	Total	19	95	114
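The check-then-combine step can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `expected_from_observed` and `small_cells` are hypothetical, and the threshold of 5 is the one given in the Expected Cell Frequency Condition.

```python
def expected_from_observed(observed):
    """Expected counts under independence for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    overall = sum(row_totals)
    return [[r * c / overall for c in col_totals] for r in row_totals]


def small_cells(expected, minimum=5):
    """(row, column) positions whose expected count is below the minimum."""
    return [(i, j)
            for i, row in enumerate(expected)
            for j, value in enumerate(row)
            if value < minimum]


# Table 3 observed counts; rows are Brown, Blue, Green, Other; columns are Left, Right.
observed = [[6, 36], [7, 26], [2, 21], [4, 12]]
flagged = small_cells(expected_from_observed(observed))
print(flagged)  # positions of the Green/Left and Other/Left cells

# Combine the Green and Other rows by adding their observed counts (Table 5).
combined = observed[:2] + [[g + o for g, o in zip(observed[2], observed[3])]]
flagged_after = small_cells(expected_from_observed(combined))
print(combined[2], flagged_after)
```

Note that the expected counts are recomputed from the combined observed counts rather than adjusted by hand, which keeps the row and column totals consistent.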

We will calculate the chi-square test statistic value with the help of Table 6.

Table 6 Calculations of Chi-Square Statistic Value
Observed	Expected	Residual (Obs–Exp)	(Obs–Exp)²	Component (Obs–Exp)²/Exp
6	7	-1	1	0.1429
36	35	1	1	0.0286
7	5.5	1.5	2.25	0.4091
26	27.5	-1.5	2.25	0.0818
6	6.5	-0.5	0.25	0.0385
33	32.5	0.5	0.25	0.0077
Total				0.7085

χ² = 0.7085 (the sum of the unrounded components), df = $(3-1)(2-1) = 2$ .
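The arithmetic in Table 6 can be reproduced with a few lines of Python. This is a minimal sketch using the observed and expected counts from the table; the variable names are illustrative only.

```python
# Observed and expected counts from Table 6, in the same row order.
observed = [6, 36, 7, 26, 6, 33]
expected = [7, 35, 5.5, 27.5, 6.5, 32.5]

# Each component is (Obs - Exp)^2 / Exp; the statistic is their sum.
components = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(components)

# Degrees of freedom for the combined table: 3 rows by 2 columns.
rows, cols = 3, 2
df = (rows - 1) * (cols - 1)

print([round(c, 4) for c in components])
print(round(chi_square, 4), df)
```

Summing the unrounded components avoids small discrepancies that can appear when rounded values are added instead.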

P-value calculation (in R): 1 - pchisq(0.7085, df=2) ≈ 0.7017.
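The p-value can also be checked without statistical software in the special case df = 2, where the chi-square upper-tail probability has the closed form exp(−x/2). The Python sketch below relies on that special case only; the function name `chi_square_p_value_df2` is hypothetical, and the formula does not hold for other degrees of freedom.

```python
import math


def chi_square_p_value_df2(statistic):
    """Upper-tail probability P(X > statistic) for a chi-square with 2 df.

    Valid only for df = 2, where the distribution is exponential with mean 2.
    """
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return math.exp(-statistic / 2)


p_value = chi_square_p_value_df2(0.7085)
print(round(p_value, 4))  # about 0.7017, in line with the pchisq result
```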

Conclusion

With a p-value as large as 0.7017, there is insufficient evidence to reject the null hypothesis. There is no evidence of any association between handedness and eye colour.

Alternatively, use the critical value approach: reject H₀ in favour of H_A if the test statistic falls in the rejection region (greater than the critical value), and do not reject H₀ otherwise. At the 5% significance level, the critical value in the eye colour and handedness example is 5.9915, the 95th percentile of the chi-square distribution with two degrees of freedom. Since the test statistic, χ² = 0.7085, is less than 5.9915, it is not in the rejection region, so we do not reject H₀.
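The critical value can be verified by hand in this example because, for two degrees of freedom only, the upper-tail probability of the chi-square distribution is $e^{-c/2}$ . Setting that tail probability equal to 0.05 and solving for c gives:

$$ e^{-c/2} = 0.05 \quad \Longrightarrow \quad c = -2 \ln(0.05) \approx 5.9915 $$

For other degrees of freedom there is no such simple formula, and the critical value is read from a table or computed with software.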

License

Introduction to Probability and Statistics Copyright © 2023 by Thompson Rivers University is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
