Supplementary Notes 1.4

Iain Pardoe

Lesson 1.4: Summarizing Categorical Data

Summarizing Categorical Data

Frequency Tables

Consider building a frequency table from the following categorical data table from a student survey:

Table 1 Student Phone Carrier Preference
Student	Carrier
1	Fido
2	Rogers
3	Bell
4	Rogers
5	Telus
6	Rogers
7	Rogers
8	Bell
9	Rogers
10	Telus
11	Telus
12	Rogers
13	Rogers
14	Telus
15	Fido
16	Rogers

First, let’s consider the data:

Who? Students in a statistics class who own a cell phone.
What? Cell phone carrier
Where? Capilano College
When? September 2013
Why? To gain information on student choice of cell phone carriers.
How? Students in a statistics class filled out a questionnaire.

Now, from this data table, construct a frequency table by making “frequency counts” for each cell phone carrier. To display this frequency distribution, you could construct a bar graph, a pie graph, or a segmented bar graph. To draw a pie graph by hand, you would need a protractor to mark out the angles. (Don’t worry: you won’t be asked to do this in the course!)

Frequency table:

Table 2 Student Phone Carrier Preference (Frequency Table)
Carrier	Frequency
Bell	2
Fido	2
Rogers	8
Telus	4
Total	16
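
The tally above can also be reproduced in code. The following is a minimal Python sketch, not part of the original notes; the variable names `carriers` and `frequency` are illustrative, and the list simply copies the Carrier column of Table 1.

```python
from collections import Counter

# Carrier responses for students 1-16, copied from Table 1.
carriers = [
    "Fido", "Rogers", "Bell", "Rogers", "Telus", "Rogers", "Rogers", "Bell",
    "Rogers", "Telus", "Telus", "Rogers", "Rogers", "Telus", "Fido", "Rogers",
]

# Counter tallies how many times each category appears in the list.
frequency = Counter(carriers)

# Print the frequency table in alphabetical order, followed by the total.
print("Carrier\tFrequency")
for carrier in sorted(frequency):
    print(f"{carrier}\t{frequency[carrier]}")
print(f"Total\t{sum(frequency.values())}")
```

The counts should agree with Table 2, and the total should equal the number of students surveyed.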

Suppose we collected another sample of student cell phone carrier preferences and wanted to compare the frequency distributions of the two samples.

Table 3 Student Phone Carrier Preference
(Frequency Table Comparison of Two Samples: Sample 1 n = 40 students and Sample 2 n = 16 students)
Carrier	Sample 1 Frequency	Sample 2 Frequency
Bell	5	2
Fido	5	2
Rogers	17	8
Telus	13	4
Total	40	16

It’s hard to compare these two distributions in this form since they are based on different sample sizes. The comparison will be much easier if we convert the frequencies to relative frequencies (or percentages).

Table 4 Student Phone Carrier Preferences
(Relative Frequency of Two Samples: Sample 1 n = 40 and Sample 2 n = 16)
Carrier	Sample 1 Relative Frequency	Sample 1 Percent	Sample 2 Relative Frequency	Sample 2 Percent
Bell	5/40 = 0.125	12.5%	2/16 = 0.125	12.5%
Fido	5/40 = 0.125	12.5%	2/16 = 0.125	12.5%
Rogers	17/40 = 0.425	42.5%	8/16 = 0.5	50%
Telus	13/40 = 0.325	32.5%	4/16 = 0.25	25%
Total	1.0	100%	1.0	100%
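
As a rough illustration of the conversion, the sketch below divides each frequency by its sample total. It is an assumed Python rendering, not part of the original notes; the helper name `relative_frequencies` and the dictionary names are hypothetical, and the counts come from Table 3.

```python
# Frequencies from Table 3.
sample_1 = {"Bell": 5, "Fido": 5, "Rogers": 17, "Telus": 13}
sample_2 = {"Bell": 2, "Fido": 2, "Rogers": 8, "Telus": 4}


def relative_frequencies(counts):
    """Divide each category count by the sample total."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


rel_1 = relative_frequencies(sample_1)
rel_2 = relative_frequencies(sample_2)

# Print both samples side by side as percentages.
print("Carrier\tSample 1\tSample 2")
for carrier in sample_1:
    print(f"{carrier}\t{rel_1[carrier]:.1%}\t{rel_2[carrier]:.1%}")
```

Because each sample is divided by its own total, the two columns are on the same scale even though the sample sizes differ.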

Now the comparison is much clearer. Both samples show the same percentage support for Bell and Fido. Rogers has the highest percentage of support in both samples, although somewhat lower in Sample 1 compared to Sample 2. The corresponding bar graphs for these two distributions would now be drawn using percent on the vertical scale instead of frequency.

Contingency Tables

When subject responses are cross-classified according to row variable and column variable categories, the resulting frequency (or percent) table is called a contingency table.

Building a Contingency Table from a Data Table

In the cell phone survey, students were also asked to rate their satisfaction level with the service provided by the carrier.

Table 5 Student Satisfaction in Phone Carrier
Student	Carrier	Satisfaction Level
1	Fido	Somewhat
2	Rogers	Somewhat
3	Bell	Somewhat
4	Rogers	Very
5	Telus	Somewhat
6	Rogers	Somewhat
7	Rogers	Very
8	Bell	Somewhat
9	Rogers	Neutral
10	Telus	Very
11	Telus	Very
12	Rogers	Neutral
13	Rogers	Neutral
14	Telus	Somewhat
15	Fido	Neutral
16	Rogers	Very

To construct a contingency table for this data, you choose one of the two variables as the row variable and the other as the column variable. Let’s choose “Carrier” as the column variable and “Satisfaction Level” as the row variable. Now you build “frequency counts” for each cell in the table.

Here’s the resulting contingency table for this data set:

Table 6 Student Satisfaction in Phone Carrier (Contingency Table)
Satisfaction Level	Bell Carrier	Fido Carrier	Rogers Carrier	Telus Carrier	Row Totals
Very	0	0	3	2	5
Somewhat	2	1	2	2	7
Neutral	0	1	3	0	4
Column Totals	2	2	8	4	16
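
The cell counts can be checked by cross-tabulating the raw responses. The following Python sketch is one possible way to do this and is not part of the original notes; the names `responses`, `table`, `row_totals`, and `column_totals` are illustrative, and the pairs are copied from Table 5.

```python
from collections import Counter

# (carrier, satisfaction level) pairs for students 1-16, copied from Table 5.
responses = [
    ("Fido", "Somewhat"), ("Rogers", "Somewhat"), ("Bell", "Somewhat"),
    ("Rogers", "Very"), ("Telus", "Somewhat"), ("Rogers", "Somewhat"),
    ("Rogers", "Very"), ("Bell", "Somewhat"), ("Rogers", "Neutral"),
    ("Telus", "Very"), ("Telus", "Very"), ("Rogers", "Neutral"),
    ("Rogers", "Neutral"), ("Telus", "Somewhat"), ("Fido", "Neutral"),
    ("Rogers", "Very"),
]

carriers = ["Bell", "Fido", "Rogers", "Telus"]  # column categories
levels = ["Very", "Somewhat", "Neutral"]        # row categories

# Count each (carrier, level) combination; a Counter returns 0 for unseen pairs.
cell_counts = Counter(responses)

# One list of cell frequencies per row, ordered like the carriers list.
table = {level: [cell_counts[(c, level)] for c in carriers] for level in levels}
row_totals = {level: sum(table[level]) for level in levels}
column_totals = [sum(table[level][j] for level in levels)
                 for j in range(len(carriers))]

print("Level\t" + "\t".join(carriers) + "\tRow Total")
for level in levels:
    cells = "\t".join(str(n) for n in table[level])
    print(f"{level}\t{cells}\t{row_totals[level]}")
print("Column Totals\t" + "\t".join(str(n) for n in column_totals)
      + f"\t{sum(column_totals)}")
```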

Calculating Percentages on a Contingency Table

Caution! Read the questions very carefully: some of these percentage calculations can be tricky.

What percentage of students use Rogers and are very satisfied?

3/16 = 0.1875 = 18.75%

Overall, what percentage of users is very satisfied?

5/16 = 0.3125 = 31.25%

What percentage of Rogers users is very satisfied?

3/8 = 0.375 = 37.5%

Notice that this percentage is based on only the 8 students who use Rogers, whereas in the previous two questions the percentages are based on all 16 students. In the language of statistics, we are “conditioning” on the Rogers column to get the conditional percentage of 37.5%. Notice also that this conditional percentage of Rogers users who are very satisfied (37.5%) is not equal to the overall percentage of very satisfied users (31.25%). This gives us some evidence of an association between satisfaction level and carrier company.

What percentage of Telus users is very satisfied?

2/4 = 0.5 = 50%

Now this percentage is based on only the 4 students who use Telus. We are conditioning on the Telus column. Since this is not equal to the overall percentage of very satisfied users, this again suggests an association between satisfaction level and carrier company (however, with such a small sample size we shouldn’t overstate the significance of the differences in these percentages).

Of those that are somewhat satisfied, what percentage use Fido?

1/7 ≈ 0.143 = 14.3%

This is just another common phrasing of a conditional percentage calculation where we are now conditioning on the somewhat satisfied row.
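
The difference between these questions is entirely in the denominator. The Python sketch below, which is an illustration rather than part of the original notes, recomputes each answer from the Table 6 counts; the helper names `cell`, `row_total`, and `column_total` are hypothetical.

```python
# Cell counts from Table 6: rows are satisfaction levels, columns are carriers.
carriers = ["Bell", "Fido", "Rogers", "Telus"]
table = {
    "Very":     [0, 0, 3, 2],
    "Somewhat": [2, 1, 2, 2],
    "Neutral":  [0, 1, 3, 0],
}


def cell(level, carrier):
    """Frequency in one cell of the contingency table."""
    return table[level][carriers.index(carrier)]


def row_total(level):
    return sum(table[level])


def column_total(carrier):
    return sum(row[carriers.index(carrier)] for row in table.values())


grand_total = sum(row_total(level) for level in table)

# Joint percentage: Rogers AND very satisfied, out of all students (3/16).
rogers_and_very = cell("Very", "Rogers") / grand_total

# Marginal percentage: very satisfied, out of all students (5/16).
very_overall = row_total("Very") / grand_total

# Conditional on a column: very satisfied, out of Rogers users only (3/8).
very_given_rogers = cell("Very", "Rogers") / column_total("Rogers")

# Conditional on a column: very satisfied, out of Telus users only (2/4).
very_given_telus = cell("Very", "Telus") / column_total("Telus")

# Conditional on a row: Fido users, out of somewhat satisfied students (1/7).
fido_given_somewhat = cell("Somewhat", "Fido") / row_total("Somewhat")

for label, value in [
    ("Rogers and very satisfied", rogers_and_very),
    ("Very satisfied overall", very_overall),
    ("Very satisfied among Rogers users", very_given_rogers),
    ("Very satisfied among Telus users", very_given_telus),
    ("Fido among somewhat satisfied", fido_given_somewhat),
]:
    print(f"{label}: {value:.2%}")
```

Dividing by the grand total gives joint and marginal percentages; dividing by a row or column total gives a conditional percentage.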

Independence in Contingency Tables

The concept of independence is very important in probability and statistics, but it’s also a little slippery! The objective at this point in the course is to give you an intuitive introduction to independence in the context of a contingency table. Then, later in the course, we’ll get much more formal in dealing with independence.

Before doing any calculations, let’s see if you already have a feel for the meaning of independence for two variables:

Is a student’s statistics grade independent of the number of hours that they study?

Clearly not! Students who study more are more likely to get better grades.

Is a person’s shoe size independent of their height?

Well, no. People with larger shoe sizes are more likely to be taller.

Is a person’s eye colour independent of their hair colour?

Again no. Dark-eyed people are more likely to have dark hair than, say, blue-eyed people.

Is a person’s smoking status (smoker or not) independent of their education level?

No again. Generally, the higher a person’s education level, the less likely they are to smoke.

Is a person’s smoking status independent of their blood type?

Seems plausible. It’s hard to see how or why these two variables might be associated. (But you never know!)

As the above examples illustrate, it’s actually hard to think of two variables that are clearly independent. The association between the two variables might be weak, and the reason for the association might be unclear. And remember, just because two variables are associated it does not follow that one causes the other. They both could be responding to a third variable lurking in the background.

The next example explores in greater detail what it means when we say two variables in a contingency table are independent.

Checking for Independence in a Contingency Table

In the following contingency table, the row and column totals have been given, but the cell frequencies have not.

What would these cell frequencies be if we assume that smoking status and blood group are perfectly independent?

Table 7 Smoking Status by Blood Type (Contingency Table)
Smoking Status	Blood Group A	Blood Group B	Blood Group O	Blood Group AB	Row Totals
Yes	?	?	?	?	200
No	?	?	?	?	800
Column Totals	400	110	450	40	1000

First notice that 200/1000 = 20% are smokers, and 80% are not. So for smoking status and blood group to be perfectly independent, we must preserve this 20% smoker ratio across all blood groups.

In blood group A the number of smokers must be 20% of 400 = 80.
In blood group B the number of smokers must be 20% of 110 = 22.
In blood group O the number of smokers must be 20% of 450 = 90.
In blood group AB the number of smokers must be 20% of 40 = 8.

We can now fill in all the cell frequencies to produce a frequency table where the row and column variables are perfectly independent. (Of course, such “perfection” rarely occurs in real life!)

Table 8 Smoking Status by Blood Type (Frequency Table)
Smoking Status	Blood Group A	Blood Group B	Blood Group O	Blood Group AB	Row Totals
Yes	80	22	90	8	200
No	320	88	360	32	800
Column Totals	400	110	450	40	1000

Notice now that the conditional distribution of smoking status (20% Yes, 80% No) is the same for each blood group, and in turn it equals the marginal distribution of smoking status. That’s independence!
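
Taking 20% of each column total is the same as computing (row total × column total) / grand total for every cell. The following Python sketch applies that rule to the totals in Table 7 and then checks that the smoker share is the same in every blood group. It is a sketch under that reading of the example, not part of the original notes, and the names `expected` and `smoker_share` are illustrative.

```python
# Marginal totals from Table 7.
row_totals = {"Yes": 200, "No": 800}
column_totals = {"A": 400, "B": 110, "O": 450, "AB": 40}
grand_total = sum(row_totals.values())

# Under perfect independence, each cell equals
# (row total * column total) / grand total.
expected = {
    status: {
        group: r_total * c_total / grand_total
        for group, c_total in column_totals.items()
    }
    for status, r_total in row_totals.items()
}

# Conditional share of smokers within each blood group.
smoker_share = {
    group: expected["Yes"][group] / column_totals[group]
    for group in column_totals
}

# Marginal share of smokers across all blood groups.
marginal_share = row_totals["Yes"] / grand_total

for status, cells in expected.items():
    print(status, cells)
print("Smoker share by blood group:", smoker_share)
print("Marginal smoker share:", marginal_share)
```

If the two variables are independent, every entry of `smoker_share` matches `marginal_share`, which is the pattern described for Table 8.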

License

Introduction to Probability and Statistics Copyright © 2023 by Thompson Rivers University is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.