Lesson 4.3: Testing Independence in Two-way Tables

Supplementary Notes 4.3

Chi-Square Test for Independence

We first encountered the idea of independence of two categorical variables in Lesson 1.4. When subject responses are cross-classified according to row variable and column variable categories, the resulting frequency (or percent) table is called a two-way table or contingency table. A chi-square test of independence tests whether two categorical variables measured on the same population are independent.

Hypotheses

  • H0: The variables are independent.
  • HA: The variables are not independent.

Assumptions and Conditions

We check the same conditions as we did for the chi-square goodness-of-fit test in Lesson 4.2, and we use a similar chi-square test statistic. What changes is the degrees of freedom (df): when the null hypothesis is true, the test statistic follows a chi-square distribution with df equal to (R-1) \times (C-1), where R is the number of rows and C is the number of columns.

When the conditions below are met, \chi^2 = \sum \dfrac{(Obs-Exp)^2}{Exp} follows a chi-square model with (R-1) \times (C-1) df, where R is the number of rows and C is the number of columns in the contingency table.

  • Counted Data Condition: The data must be counts (frequencies) for combinations of the categories of two categorical variables.
  • Independence: The counts in the cells should be independent of each other.
  • Random: The sample is random.
  • Expected Cell Frequency Condition: All expected cell frequencies must be at least five. If a small number of cells have expected counts slightly less than five, you can proceed with caution. However, if it makes sense to do so, it is better to combine cells with expected counts less than five. There is no similar cell frequency condition for the observed counts.
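The statistic and degrees of freedom described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the course materials: the function names `chi_square_statistic` and `degrees_of_freedom` are made up for this sketch, and the 2 × 2 counts are hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Sum (Obs - Exp)^2 / Exp over every cell of two same-shaped tables."""
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for obs, exp in zip(obs_row, exp_row):
            total += (obs - exp) ** 2 / exp
    return total


def degrees_of_freedom(table):
    """(R - 1) x (C - 1) for a table with R rows and C columns."""
    n_rows = len(table)
    n_cols = len(table[0])
    return (n_rows - 1) * (n_cols - 1)


# Hypothetical 2 x 2 counts, made up for illustration only.
observed = [[10, 20], [30, 40]]
# Expected counts under independence for the same row and column totals.
expected = [[12.0, 18.0], [28.0, 42.0]]

statistic = chi_square_statistic(observed, expected)
df = degrees_of_freedom(observed)
print(round(statistic, 4), df)
```

Both tables are passed as lists of rows, so the same functions work for any number of rows and columns.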

Calculating Expected Counts Under Independence

Independence again? Way back in Supplementary Notes 1.4, we calculated the expected frequencies that would make smoking status and blood group (type A, B, O, or AB) independent. The calculations are reproduced below, starting from the row and column totals in Table 1.

Given the row and column totals in Table 1, what would the individual cell frequencies have to be if smoking status and blood group were perfectly independent?

Table 1 Smoking Status by Blood Group

                        Blood Group
Smoking Status     A      B      O     AB    Total
Yes                ?      ?      ?      ?      200
No                 ?      ?      ?      ?      800
Total            400    110    450     40    1,000

First, notice that 200/1000 = 20% are smokers, and 80% are not. So for smoking status and blood group to be perfectly independent, we must preserve this 20% smoker ratio across all blood groups.

  • In blood group A, the number of smokers must be 20% of 400 = 80.
  • In blood group B, the number of smokers must be 20% of 110 = 22.
  • In blood group O, the number of smokers must be 20% of 450 = 90.
  • In blood group AB, the number of smokers must be 20% of 40 = 8.

Then, use subtraction to fill in the other cell frequencies:

  • In blood group A, the number of non-smokers must be 400 − 80 = 320.
  • In blood group B, the number of non-smokers must be 110 − 22 = 88.
  • In blood group O, the number of non-smokers must be 450 − 90 = 360.
  • In blood group AB, the number of non-smokers must be 40 − 8 = 32.

Table 2 Smoking Status by Blood Group

                        Blood Group
Smoking Status     A      B      O     AB    Total
Yes               80     22     90      8      200
No               320     88    360     32      800
Total            400    110    450     40    1,000

We can also use our knowledge of independence from Lesson 2.2 to achieve the same results. For example, let x = the expected number of smokers in blood group A. For “A” and “Yes” to be independent, we require P(Yes|A) = P(Yes). From Table 2, P(Yes|A) = x/400 and P(Yes) = 200/1000. Set these probabilities equal to get x/400 = 200/1000, which we can rearrange to solve for x: x = (200)(400) / 1000 = 80.

Note that the expected counts (Exp) needed to make the variables independent are obtained by (row total)(column total) / overall total, i.e., Exp = \dfrac{\text{row total} \times \text{column total}}{\text{overall total}}.

For a second example, let y = the expected number of smokers in blood group B, so y = (200)(110) / 1000 = 22.

The other expected cell frequencies can be calculated similarly.
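As a rough sketch of the (row total)(column total) / overall total rule, the following Python snippet rebuilds every expected count for the smoking status and blood group example from the margins alone. The function name `expected_counts` is illustrative, not from the course materials.

```python
def expected_counts(row_totals, col_totals):
    """Expected cell counts under independence, from the margins alone."""
    overall = sum(row_totals)
    return [[r * c / overall for c in col_totals] for r in row_totals]


# Margins from Table 1: smoking status (Yes, No) by blood group (A, B, O, AB).
row_totals = [200, 800]
col_totals = [400, 110, 450, 40]

table = expected_counts(row_totals, col_totals)
for label, row in zip(["Yes", "No"], table):
    print(label, row)
```

The rows printed should match the "Yes" and "No" rows of Table 2.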

Example: Chi-Square Test for Independence

Below are data about eye colour and handedness for a group of students. Using the data, test the claim that handedness of the subject is independent of the person’s eye colour by performing a chi-square test. State the null and alternative hypotheses, give the chi-square test statistic value, degrees of freedom, and p-value, and state your conclusion about the hypotheses.

Table 3 Student Eye Colour and Handedness

                  Handedness
Eye Colour     Left    Right    Total
Brown             6       36       42
Blue              7       26       33
Green             2       21       23
Other             4       12       16
Total            19       95      114

Hypotheses

  • H0: Handedness and eye colour are independent.
  • HA: There is an association between handedness and eye colour (they are not independent).

Model Conditions

  • Counted Data Condition: We have counts of combinations of two categorical variables.
  • Independence: No reason to doubt independence with the information given.
  • Random: Although the sample may not be random, it may be representative of all students.
  • Expected Counts Condition: The expected counts are in brackets in Table 4:

Table 4 Expected Counts of Eye Colour to Handedness

                       Handedness
Eye Colour     Left         Right         Total
Brown          6 (7)        36 (35)          42
Blue           7 (5.5)      26 (27.5)        33
Green          2 (3.83)     21 (19.17)       23
Other          4 (2.67)     12 (13.33)       16
Total          19           95              114

The expected count for the first cell (shown in brackets in Table 4) is obtained by: (42)(19) / 114 = 7. The expected counts for left-handed students with green eyes (3.83) and other (2.67) are too small (less than 5), so we combine the Green category with the Other category. After combining them in Table 5, all expected counts are greater than 5:

Table 5 Expected Counts of Eye Colour to Handedness
(Green Colour and Other Combined)

                       Handedness
Eye Colour     Left         Right         Total
Brown          6 (7)        36 (35)          42
Blue           7 (5.5)      26 (27.5)        33
Other          6 (6.5)      33 (32.5)        39
Total          19           95              114
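The expected-count check and the merging of categories can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `expected_counts` and the dictionary layout are choices made for this sketch, while the counts come from Table 3.

```python
def expected_counts(observed):
    """Expected counts under independence: row total x column total / overall total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    overall = sum(row_totals)
    return [[r * c / overall for c in col_totals] for r in row_totals]


# Observed counts from Table 3; columns are (Left, Right).
observed = {
    "Brown": [6, 36],
    "Blue": [7, 26],
    "Green": [2, 21],
    "Other": [4, 12],
}

# Flag every eye colour whose row contains an expected count below 5.
expected = expected_counts(list(observed.values()))
small = [colour for colour, row in zip(observed, expected) if min(row) < 5]
print(small)

# Merge Green into Other by adding the observed counts cell by cell.
combined = {
    "Brown": observed["Brown"],
    "Blue": observed["Blue"],
    "Other": [g + o for g, o in zip(observed["Green"], observed["Other"])],
}
expected_combined = expected_counts(list(combined.values()))
print(expected_combined)
```

Note that the merge is applied to the observed counts, and the expected counts are then recomputed from the new margins rather than adjusted by hand.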

We will calculate the chi-square test statistic value with the help of Table 6.

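Table 6 Calculations of Chi-Square Statistic Value

Observed   Expected   Residual     (Obs-Exp)^2   Component
                      (Obs-Exp)                  (Obs-Exp)^2/Exp
    6         7         -1            1             0.1429
   36        35          1            1             0.0286
    7         5.5        1.5          2.25          0.4091
   26        27.5       -1.5          2.25          0.0818
    6         6.5       -0.5          0.25          0.0385
   33        32.5        0.5          0.25          0.0077
Total                                               0.7085

The total is computed from the unrounded components, so it can differ in the last digit from the sum of the rounded values shown.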

 

χ² = 0.7085, df = (3-1)(2-1) = 2.

P-value calculation (in R): 1 - pchisq(0.7085, df=2) ≈ 0.7017.
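The same statistic and p-value can be reproduced in Python with only the standard library. This sketch relies on a special case: for df = 2, the upper-tail probability of the chi-square distribution is exp(-x/2). For any other df this shortcut does not apply, and a library routine such as `scipy.stats.chi2.sf` would be needed instead.

```python
import math

# Observed and expected counts from Table 5
# (rows Brown, Blue, Other; columns Left, Right).
observed = [[6, 36], [7, 26], [6, 33]]
expected = [[7.0, 35.0], [5.5, 27.5], [6.5, 32.5]]

# Sum of (Obs - Exp)^2 / Exp over all six cells.
statistic = sum(
    (obs - exp) ** 2 / exp
    for obs_row, exp_row in zip(observed, expected)
    for obs, exp in zip(obs_row, exp_row)
)

# For df = 2 only, the chi-square upper-tail probability is exp(-x / 2).
p_value = math.exp(-statistic / 2)

print(round(statistic, 4), round(p_value, 4))
```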

Conclusion

With a p-value as large as 0.7017, there is insufficient evidence to reject the null hypothesis. The data do not provide evidence of an association between handedness and eye colour.

Alternatively, using the critical value approach at the 5% significance level, reject H0 in favour of HA if the test statistic is in the rejection region (greater than the critical value), and do not reject H0 if it is not (less than the critical value). The critical value in the eye colour and handedness example is 5.9915, the 95th percentile of the chi-square distribution with two degrees of freedom. Since the test statistic, χ² = 0.7085, is less than 5.9915, it is not in the rejection region, so we do not reject H0.
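The critical value comparison can be sketched in Python as well. As an assumption specific to this example, the code uses the closed form that holds only for df = 2: the upper-tail probability is exp(-x/2), so the critical value at significance level alpha is -2 ln(alpha). The variable names are illustrative.

```python
import math

alpha = 0.05          # significance level implied by the 95th percentile
statistic = 0.7085    # test statistic from the example

# For df = 2 only, the upper-tail probability is exp(-x / 2), so the
# critical value solves exp(-x / 2) = alpha, giving x = -2 * ln(alpha).
critical_value = -2 * math.log(alpha)

# Reject H0 only when the statistic falls in the rejection region.
reject_h0 = statistic > critical_value
print(round(critical_value, 4), reject_h0)
```

The p-value approach and the critical value approach always lead to the same decision at a given significance level, so either can be reported.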

License


Introduction to Probability and Statistics Copyright © 2023 by Thompson Rivers University is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
