Lesson 1.4: Summarizing Categorical Data

Supplementary Notes 1.4

Summarizing Categorical Data

Frequency Tables

Consider building a frequency table from the following categorical data table from a student survey:

Table 1 Student Phone Carrier Preference
Student Carrier
1 Fido
2 Rogers
3 Bell
4 Rogers
5 Telus
6 Rogers
7 Rogers
8 Bell
9 Rogers
10 Telus
11 Telus
12 Rogers
13 Rogers
14 Telus
15 Fido
16 Rogers

First, let’s consider the data:

  • Who? Students in a statistics class who own a cell phone.
  • What? Cell phone carrier
  • Where? Capilano College
  • When? September 2013
  • Why? To gain information on student choice of cell phone carriers.
  • How? Students in a statistics class filled out a questionnaire.

Now, from this data table, construct a frequency table by making “frequency counts” for each cell phone carrier, that is, by counting how many students chose each carrier. To display this frequency distribution, you could construct a bar graph, a pie graph, or a segmented bar graph. To draw a pie graph by hand, you would need a protractor to mark out the angles. (Don’t worry: you won’t be asked to do this in the course!)

Frequency table:

Table 2 Student Phone Carrier Preference (Frequency Table) 
Carrier Frequency
Bell 2
Fido 2
Rogers 8
Telus 4
Total 16
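The counting step can also be done in software. The following is a minimal Python sketch, assuming the 16 responses from Table 1 have been typed into a list; the variable names are illustrative and are not part of the course materials.

```python
from collections import Counter

# Carrier responses for students 1 to 16, copied from Table 1.
carriers = [
    "Fido", "Rogers", "Bell", "Rogers", "Telus", "Rogers", "Rogers", "Bell",
    "Rogers", "Telus", "Telus", "Rogers", "Rogers", "Telus", "Fido", "Rogers",
]

# Counter tallies how many times each category appears.
frequency = Counter(carriers)

# Print the frequency table in alphabetical order, followed by the total.
for carrier in sorted(frequency):
    print(f"{carrier:<8}{frequency[carrier]:>3}")
print(f"{'Total':<8}{sum(frequency.values()):>3}")
```

The printed counts should agree with Table 2.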

Suppose we collected another sample of student cell phone carrier preferences and wanted to compare the frequency distributions of the two samples.

Table 3 Student Phone Carrier Preference
(Frequency Table Comparison of Two Samples: Sample 1 n = 40 students and Sample 2 n = 16 students)
Carrier Sample 1 Frequency Sample 2 Frequency
Bell 5 2
Fido 5 2
Rogers 17 8
Telus 13 4
Total 40 16

It’s hard to compare these two distributions in this form since they are based on different sample sizes. The comparison will be much easier if we convert the frequencies to relative frequencies (or percentages).

Table 4 Student Phone Carrier Preferences
(Relative Frequency of Two Samples: Sample 1 n = 40 and Sample 2 n = 16)
Carrier Sample 1 Relative Frequency Sample 1 Percent Sample 2 Relative Frequency Sample 2 Percent
Bell 5/40 = 0.125 12.5% 2/16 = 0.125 12.5%
Fido 5/40 = 0.125 12.5% 2/16 = 0.125 12.5%
Rogers 17/40 = 0.425 42.5% 8/16 = 0.5 50%
Telus 13/40 = 0.325 32.5% 4/16 = 0.25 25%
Total 1.0 100% 1.0 100%

Now the comparison is much clearer. Both samples show the same percentage support for Bell and Fido (12.5% each). Rogers has the highest percentage of support in both samples, although it is somewhat lower in Sample 1 (42.5%) than in Sample 2 (50%), while Telus is higher in Sample 1 (32.5%) than in Sample 2 (25%). The corresponding bar graphs for these two distributions would now be drawn using percent on the vertical scale instead of frequency.
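The conversion from frequencies to relative frequencies is just a division by the sample size. Here is a short Python sketch of that calculation, assuming the counts from the two samples are stored in dictionaries; the function name `relative_frequencies` is made up for this illustration.

```python
# Frequency counts for the two samples (Sample 1 n = 40, Sample 2 n = 16).
sample_1 = {"Bell": 5, "Fido": 5, "Rogers": 17, "Telus": 13}
sample_2 = {"Bell": 2, "Fido": 2, "Rogers": 8, "Telus": 4}


def relative_frequencies(counts):
    """Divide each count by the sample size so the results sum to 1."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


rel_1 = relative_frequencies(sample_1)
rel_2 = relative_frequencies(sample_2)

# Show the two distributions side by side as percentages.
for carrier in sample_1:
    print(f"{carrier:<8}{rel_1[carrier]:>8.1%}{rel_2[carrier]:>8.1%}")
```

Because both columns are now on the same 0% to 100% scale, the different sample sizes no longer get in the way of the comparison.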

Contingency Tables

When subject responses are cross-classified according to row variable and column variable categories, the resulting frequency (or percent) table is called a contingency table.

Building a Contingency Table from a Data Table

In the cell phone survey, students were also asked to rate their satisfaction level with the service provided by the carrier.

Table 5 Student Satisfaction in Phone Carrier
Student Carrier Satisfaction Level
1 Fido Somewhat
2 Rogers Somewhat
3 Bell Somewhat
4 Rogers Very
5 Telus Somewhat
6 Rogers Somewhat
7 Rogers Very
8 Bell Somewhat
9 Rogers Neutral
10 Telus Very
11 Telus Very
12 Rogers Neutral
13 Rogers Neutral
14 Telus Somewhat
15 Fido Neutral
16 Rogers Very

To construct a contingency table for this data, you choose one of the two variables as the row variable and the other as the column variable. Let’s choose “Carrier” as the column variable and “Satisfaction Level” as the row variable. Now you build “frequency counts” for each cell in the table: each cell counts the students who fall into that particular combination of satisfaction level and carrier.

Here’s the resulting contingency table for this data set:

Table 6 Student Satisfaction in Phone Carrier (Contingency Table)
Satisfaction Level Bell Carrier Fido Carrier Rogers Carrier Telus Carrier Row Totals
Very 0 0 3 2 5
Somewhat 2 1 2 2 7
Neutral 0 1 3 0 4
Column Totals 2 2 8 4 16
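The cross-classification can be sketched in Python by counting (carrier, satisfaction level) pairs. This is one possible approach using only the standard library, assuming the rows of Table 5 are entered as tuples; the variable names are illustrative.

```python
from collections import Counter

# (carrier, satisfaction level) for students 1 to 16, copied from Table 5.
responses = [
    ("Fido", "Somewhat"), ("Rogers", "Somewhat"), ("Bell", "Somewhat"),
    ("Rogers", "Very"), ("Telus", "Somewhat"), ("Rogers", "Somewhat"),
    ("Rogers", "Very"), ("Bell", "Somewhat"), ("Rogers", "Neutral"),
    ("Telus", "Very"), ("Telus", "Very"), ("Rogers", "Neutral"),
    ("Rogers", "Neutral"), ("Telus", "Somewhat"), ("Fido", "Neutral"),
    ("Rogers", "Very"),
]

carriers = ["Bell", "Fido", "Rogers", "Telus"]   # column variable
levels = ["Very", "Somewhat", "Neutral"]         # row variable

# Count each (carrier, level) combination; missing combinations count as 0.
cell_counts = Counter(responses)

# One list of cell frequencies per row, in the column order given above.
table = {
    level: [cell_counts[(carrier, level)] for carrier in carriers]
    for level in levels
}
column_totals = [
    sum(table[level][i] for level in levels) for i in range(len(carriers))
]

for level in levels:
    print(level, table[level], "row total:", sum(table[level]))
print("Column totals:", column_totals, "grand total:", sum(column_totals))
```

The rows and totals produced this way should match Table 6.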

Calculating Percentages on a Contingency Table

Caution! Read the questions very carefully: some of these percentage calculations can be tricky. The key is to identify which total belongs in the denominator.

What percentage of students use Rogers and are very satisfied?

3/16 = 0.1875 = 18.75%

Overall, what percentage of users is very satisfied?

5/16 = 0.3125 = 31.25%

What percentage of Rogers users is very satisfied?

3/8 = 0.375 = 37.5%

Notice that this percentage is based on only the 8 students who use Rogers, whereas in the previous two questions the percentages are based on all 16 students. In the language of statistics, we are “conditioning” on the Rogers column to get the conditional percentage of 37.5%. Notice that this conditional percentage of Rogers users who are very satisfied (37.5%) is not equal to the overall percentage of very satisfied users (31.25%). This gives us some evidence of an association between satisfaction level and carrier company.

What percentage of Telus users is very satisfied?

2/4 = 0.5 = 50%

Now this percentage is based on only the 4 students who use Telus. We are conditioning on the Telus column. Since this is not equal to the overall percentage of very satisfied users, this again suggests an association between satisfaction level and carrier company (however, with such a small sample size we shouldn’t overstate the significance of the differences in these percentages).

Of those that are somewhat satisfied, what percentage use Fido?

1/7 ≈ 0.143 = 14.3%

This is just another common phrasing of a conditional percentage calculation where we are now conditioning on the somewhat satisfied row.
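The five calculations above differ only in which total is used as the denominator. The Python sketch below reproduces them from the Table 6 counts; the helper names `row_total` and `column_total` are assumptions made for this example.

```python
# Cell frequencies from Table 6: table[satisfaction level][carrier].
table = {
    "Very":     {"Bell": 0, "Fido": 0, "Rogers": 3, "Telus": 2},
    "Somewhat": {"Bell": 2, "Fido": 1, "Rogers": 2, "Telus": 2},
    "Neutral":  {"Bell": 0, "Fido": 1, "Rogers": 3, "Telus": 0},
}


def row_total(level):
    """Number of students with the given satisfaction level."""
    return sum(table[level].values())


def column_total(carrier):
    """Number of students who use the given carrier."""
    return sum(row[carrier] for row in table.values())


grand_total = sum(row_total(level) for level in table)

# Joint percentage: Rogers AND very satisfied, out of all students (3/16).
joint = table["Very"]["Rogers"] / grand_total
# Marginal percentage: very satisfied, out of all students (5/16).
marginal = row_total("Very") / grand_total
# Conditional on the Rogers column (3/8) and on the Telus column (2/4).
very_given_rogers = table["Very"]["Rogers"] / column_total("Rogers")
very_given_telus = table["Very"]["Telus"] / column_total("Telus")
# Conditional on the Somewhat row (1/7).
fido_given_somewhat = table["Somewhat"]["Fido"] / row_total("Somewhat")

print(f"Rogers and very satisfied:    {joint:.2%}")
print(f"Very satisfied overall:       {marginal:.2%}")
print(f"Very satisfied among Rogers:  {very_given_rogers:.2%}")
print(f"Very satisfied among Telus:   {very_given_telus:.2%}")
print(f"Fido among somewhat satisfied: {fido_given_somewhat:.1%}")
```

Joint and marginal percentages divide by the grand total, while conditional percentages divide by the total of the row or column being conditioned on.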

Independence in Contingency Tables

The concept of independence is very important in probability and statistics, but it’s also a little slippery! The objective at this point in the course is to give you an intuitive introduction to independence in the context of a contingency table. Then, later in the course, we’ll get much more formal in dealing with independence.

Before doing any calculations, let’s see if you already have a feel for the meaning of independence for two variables:

Is a student’s statistics grade independent of the number of hours that they study?

Clearly not! Students who study more are more likely to get better grades.

Is a person’s shoe size independent of their height?

Well, no. People with larger shoe sizes are more likely to be taller.

Is a person’s eye colour independent of their hair colour?

Again no. Dark-eyed people are more likely to have dark hair than, say, blue-eyed people.

Is a person’s smoking status (smoker or not) independent of their education level?

No again. Generally, the higher a person’s education level, the less likely they are to smoke.

Is a person’s smoking status independent of their blood type?

Seems plausible. It’s hard to see how or why these two variables might be associated. (But you never know!)

As the above examples illustrate, it’s actually hard to think of two variables that are clearly independent. The association between the two variables might be weak, and the reason for the association might be unclear. And remember, just because two variables are associated it does not follow that one causes the other. They both could be responding to a third variable lurking in the background.

The next example explores in greater detail what it means when we say two variables in a contingency table are independent.

Checking for Independence in a Contingency Table

In the following contingency table, the row and column totals have been given, but the cell frequencies have not.

What would these cell frequencies be if we assume that smoking status and blood group are perfectly independent?

Table 7 Smoking Status by Blood Type (Contingency Table)
Smoking Status Blood Group A Blood Group B Blood Group O Blood Group AB Row Totals
Yes ? ? ? ? 200
No ? ? ? ? 800
Column Totals 400 110 450 40 1000

First notice that 200/1000 = 20% are smokers, and 80% are not. So for smoking status and blood group to be perfectly independent, we must preserve this 20% smoker ratio across all blood groups.

  • In blood group A the number of smokers must be 20% of 400 = 80.
  • In blood group B the number of smokers must be 20% of 110 = 22.
  • In blood group O the number of smokers must be 20% of 450 = 90.
  • In blood group AB the number of smokers must be 20% of 40 = 8.

We can now fill in all the cell frequencies to produce a frequency table where the row and column variables are perfectly independent. (Of course, such “perfection” rarely occurs in real life!)

Table 8 Smoking Status by Blood Type (Frequency Table)
Smoking Status Blood Group A Blood Group B Blood Group O Blood Group AB Row Totals
Yes 80 22 90 8 200
No 320 88 360 32 800
Column Totals 400 110 450 40 1000

Notice now that the conditional distribution of smoking status (20% Yes, 80% No) is the same for each blood group, and in turn it equals the marginal distribution of smoking status. That’s independence!
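Applying the overall row percentage to each column total is the same as computing (row total × column total) / grand total for every cell. The Python sketch below fills in the table that way, assuming only the totals from Table 7 are known; the variable names are illustrative.

```python
# Row and column totals from Table 7.
row_totals = {"Yes": 200, "No": 800}
column_totals = {"A": 400, "B": 110, "O": 450, "AB": 40}
grand_total = sum(row_totals.values())

# Under perfect independence, each cell is
# (row total * column total) / grand total.
expected = {
    status: {
        group: r_total * c_total / grand_total
        for group, c_total in column_totals.items()
    }
    for status, r_total in row_totals.items()
}

for status, row in expected.items():
    print(status, row)

# Check: the share of smokers within each blood group equals the overall share.
overall_share = row_totals["Yes"] / grand_total
for group, c_total in column_totals.items():
    print(group, expected["Yes"][group] / c_total, overall_share)
```

The resulting cell values should match Table 8, and the share of smokers within every blood group comes out equal to the overall 20%.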

License


Introduction to Probability and Statistics Copyright © 2023 by Thompson Rivers University is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
