Lesson 4.3: Testing Independence in Two-way Tables
Supplementary Notes 4.3
Chi-Square Test for Independence
We first encountered the idea of independence of two categorical variables in Lesson 1.4. When subject responses are cross-classified according to row variable and column variable categories, the resulting frequency (or percent) table is called a two-way table or contingency table. A chi-square test of independence tests whether two categorical variables measured on the same population are independent.
Hypotheses
- H0: The variables are independent.
- HA: The variables are not independent.
Assumptions and Conditions
We check the same conditions as we did for the chi-square goodness-of-fit test in Lesson 4.2, and we use a similar chi-square test statistic: χ² = Σ (Obs − Exp)² / Exp, where the sum is taken over all cells of the table.
When the conditions below are met and the null hypothesis is true, χ² follows a chi-square model with (R − 1)(C − 1) degrees of freedom (df), where R is the number of rows and C is the number of columns in the contingency table.
- Counted Data Condition: The data must be counts (frequencies) for combinations of the categories of two categorical variables.
- Independence: The counts in the cells should be independent of each other.
- Random: The sample is random.
- Expected Cell Frequency Condition: All expected cell frequencies must be at least five. If a small number of cells have expected counts slightly less than five, you can proceed with caution. However, if it makes sense to do so, it is better to combine cells with expected counts less than five. There is no similar cell frequency condition for the observed counts.
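The quantities described above can be sketched in a few lines of R. This is a minimal illustration, not the course's own code: the function name `chisq_independence` is hypothetical, and the 2 × 2 counts used to exercise it are made up purely for demonstration.

```r
# Hypothetical helper (not from the lesson): computes the chi-square
# statistic, its degrees of freedom, and the expected counts for a
# two-way table of observed counts.
chisq_independence <- function(observed) {
  observed <- as.matrix(observed)
  n <- sum(observed)
  # Expected count per cell under independence:
  # (row total)(column total) / overall total
  expected <- outer(rowSums(observed), colSums(observed)) / n
  statistic <- sum((observed - expected)^2 / expected)
  df <- (nrow(observed) - 1) * (ncol(observed) - 1)
  list(
    expected = expected,
    statistic = statistic,
    df = df,
    # Expected Cell Frequency Condition: every expected count at least 5
    all_expected_at_least_5 = all(expected >= 5)
  )
}

# Made-up counts, used only to exercise the function
toy <- matrix(c(10, 20, 30, 40), nrow = 2, byrow = TRUE)
result <- chisq_independence(toy)
result$df
result$all_expected_at_least_5
```

Returning the expected counts alongside the statistic makes it easy to check the Expected Cell Frequency Condition before interpreting the result.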
Calculating Expected Counts Under Independence
Independence again? Way back in Supplementary Notes 1.4, we calculated the expected frequencies so that smoking status and blood group (type A, B, O, or AB) were independent. The calculations are reproduced in Table 1.
Given the row and column totals in Table 1, what would the individual cell frequencies have to be if smoking status and blood group were perfectly independent?
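Table 1: Smoking status by blood group, with only the row and column totals known
| Smoking Status | Blood Group A | Blood Group B | Blood Group O | Blood Group AB | Total |
| Yes | ? | ? | ? | ? | 200 |
| No | ? | ? | ? | ? | 800 |
| Total | 400 | 110 | 450 | 40 | 1,000 |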
First, notice that 200/1000 = 20% are smokers, and 80% are not. So for smoking status and blood group to be perfectly independent, we must preserve this 20% smoker ratio across all blood groups.
- In blood group A, the number of smokers must be 20% of 400 = 80.
- In blood group B, the number of smokers must be 20% of 110 = 22.
- In blood group O, the number of smokers must be 20% of 450 = 90.
- In blood group AB, the number of smokers must be 20% of 40 = 8.
Then, use subtraction to fill in the other cell frequencies:
- In blood group A, the number of non-smokers must be 400 − 80 = 320.
- In blood group B, the number of non-smokers must be 110 − 22 = 88.
- In blood group O, the number of non-smokers must be 450 − 90 = 360.
- In blood group AB, the number of non-smokers must be 40 − 8 = 32.
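Table 2: Expected frequencies if smoking status and blood group are independent
| Smoking Status | Blood Group A | Blood Group B | Blood Group O | Blood Group AB | Total |
| Yes | 80 | 22 | 90 | 8 | 200 |
| No | 320 | 88 | 360 | 32 | 800 |
| Total | 400 | 110 | 450 | 40 | 1,000 |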
We can also use our knowledge of independence from Lesson 2.2 to achieve the same results. For example, let x = the expected number of smokers in blood group A. For “A” and “Yes” to be independent, we require P(Yes|A) = P(Yes). From Table 2, P(Yes|A) = x/400 and P(Yes) = 200/1000. Set these probabilities equal to get x/400 = 200/1000, which we can rearrange to solve for x: x = (200)(400) / 1000 = 80.
Note that the expected count (Exp) needed in each cell to make the variables independent is Exp = (row total)(column total) / (overall total).
For a second example, let y = the expected number of smokers in blood group B, so y = (200)(110) / 1000 = 22.
The other expected cell frequencies can be calculated similarly.
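As a quick check, the whole table of expected counts can be produced in one step from the marginal totals. The sketch below uses base R only; the variable names are illustrative choices, and the totals are those given in Table 1.

```r
# Marginal totals from Table 1
row_totals <- c(Yes = 200, No = 800)
col_totals <- c(A = 400, B = 110, O = 450, AB = 40)
n <- sum(row_totals)  # overall total, 1000

# Expected count for each cell = (row total)(column total) / overall total
expected <- outer(row_totals, col_totals) / n
print(expected)
```

The `outer` call multiplies every row total by every column total, so the result has one entry per cell and should agree with the frequencies worked out by hand above.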
Example: Chi-Square Test for Independence
Below are data about eye colour and handedness for a group of students. Using the data, test the claim that handedness of the subject is independent of the person’s eye colour by performing a chi-square test. State the null and alternative hypotheses, give the chi-square test statistic value, degrees of freedom, and p-value, and state your conclusion about the hypotheses.
Table 3: Observed counts of eye colour by handedness
| Eye Colour | Handedness: Left | Handedness: Right | Total |
| Brown | 6 | 36 | 42 |
| Blue | 7 | 26 | 33 |
| Green | 2 | 21 | 23 |
| Other | 4 | 12 | 16 |
| Total | 19 | 95 | 114 |
Hypotheses
- H0: Handedness and eye colour are independent.
- HA: There is an association between handedness and eye colour (they are not independent).
Model Conditions
- Counted Data Condition: We have counts of combinations of two categorical variables.
- Independence: No reason to doubt independence with the information given.
- Random: Although the sample may not be random, it may be representative of all students.
- Expected Counts Condition: The expected counts are in brackets in Table 4:
Table 4: Observed counts, with expected counts in brackets
| Eye Colour | Handedness: Left | Handedness: Right | Total |
| Brown | 6 (7) | 36 (35) | 42 |
| Blue | 7 (5.5) | 26 (27.5) | 33 |
| Green | 2 (3.83) | 21 (19.17) | 23 |
| Other | 4 (2.67) | 12 (13.33) | 16 |
| Total | 19 | 95 | 114 |
The expected count for the first cell (shown in brackets in Table 4) is obtained by: (42)(19) / 114 = 7. The expected counts for left-handed students with green eyes (3.83) and other (2.67) are too small (less than 5), so we combine the Green category with the Other category. After combining them in Table 5, all expected counts are greater than 5:
Table 5: Observed (expected) counts after combining Green with Other
| Eye Colour | Handedness: Left | Handedness: Right | Total |
| Brown | 6 (7) | 36 (35) | 42 |
| Blue | 7 (5.5) | 26 (27.5) | 33 |
| Other (including Green) | 6 (6.5) | 33 (32.5) | 39 |
| Total | 19 | 95 | 114 |
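The expected counts before and after combining categories can be reproduced with a short R sketch. The helper name `expected_counts` and the object names are assumptions made for illustration; the observed counts are those given for the eye colour and handedness example.

```r
# Observed counts of eye colour by handedness
observed <- matrix(
  c(6, 36,
    7, 26,
    2, 21,
    4, 12),
  ncol = 2, byrow = TRUE,
  dimnames = list(EyeColour = c("Brown", "Blue", "Green", "Other"),
                  Handedness = c("Left", "Right"))
)

# Hypothetical helper: (row total)(column total) / overall total per cell
expected_counts <- function(tab) {
  outer(rowSums(tab), colSums(tab)) / sum(tab)
}

# Some expected counts are below 5 in the original table
round(expected_counts(observed), 2)

# Combine the Green and Other rows into a single Other row
combined <- rbind(
  observed[c("Brown", "Blue"), ],
  Other = observed["Green", ] + observed["Other", ]
)
expected_combined <- expected_counts(combined)
round(expected_combined, 2)
all(expected_combined >= 5)  # Expected Cell Frequency Condition
```

Combining rows changes the question slightly (Green is no longer distinguished from Other), which is why the notes recommend it only when it makes sense to do so.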
We will calculate the chi-square test statistic value with the help of Table 6.
Table 6: Components of the chi-square test statistic
| Observed | Expected | Residual (Obs − Exp) | (Obs − Exp)² | Component (Obs − Exp)² / Exp |
| 6 | 7 | −1 | 1 | 0.1429 |
| 36 | 35 | 1 | 1 | 0.0286 |
| 7 | 5.5 | 1.5 | 2.25 | 0.4091 |
| 26 | 27.5 | −1.5 | 2.25 | 0.0818 |
| 6 | 6.5 | −0.5 | 0.25 | 0.0385 |
| 33 | 32.5 | 0.5 | 0.25 | 0.0077 |
| Total (computed from unrounded components) | | | | 0.7085 |
χ² = 0.7085, df = (R − 1)(C − 1) = (3 − 1)(2 − 1) = 2.
P-value calculation: 1 - pchisq(0.7085, df=2) ≈ 0.7017.
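The hand calculation can be cross-checked with R's built-in `chisq.test`, applied to the combined table. This is a sketch for verification only; the object names are illustrative, and the counts are those from the table with Green and Other combined.

```r
# Observed counts after combining Green with Other
combined <- matrix(
  c(6, 36,
    7, 26,
    6, 33),
  ncol = 2, byrow = TRUE,
  dimnames = list(EyeColour = c("Brown", "Blue", "Other"),
                  Handedness = c("Left", "Right"))
)

# Pearson chi-square test of independence
test <- chisq.test(combined)
test$statistic  # chi-square test statistic
test$parameter  # degrees of freedom
test$p.value    # p-value

# The same p-value from the upper tail of the chi-square distribution
manual_p <- 1 - pchisq(unname(test$statistic), df = unname(test$parameter))
```

`test$expected` also returns the expected counts, which gives a second way to confirm the Expected Cell Frequency Condition.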
Conclusion
With a p-value as large as 0.7017, there is insufficient evidence to reject the null hypothesis. There is no evidence of any association between handedness and eye colour.
Alternatively, we can compare the test statistic with a critical value: reject H0 in favour of HA if the test statistic is in the rejection region (greater than the critical value), and do not reject H0 if it is not (less than the critical value). At the 5% significance level, the critical value in the eye colour and handedness example is 5.9915, the 95th percentile of the chi-square distribution with two degrees of freedom. Since the test statistic, χ² = 0.7085, is less than 5.9915, it is not in the rejection region, so we do not reject H0.
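The critical-value comparison can be sketched in R with `qchisq`. The variable names are illustrative, and the significance level of 0.05 is an assumption implied by the use of the 95th percentile.

```r
alpha <- 0.05        # assumed significance level (95th percentile)
df <- 2              # (3 - 1)(2 - 1) for the combined table
statistic <- 0.7085  # chi-square test statistic from the example

# Critical value: the (1 - alpha) quantile of the chi-square distribution
critical_value <- qchisq(1 - alpha, df = df)
critical_value

# Reject H0 only if the statistic falls in the rejection region
reject_h0 <- statistic > critical_value
reject_h0
```

The p-value approach and the critical-value approach always lead to the same decision at a given significance level.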