Lesson 1.4: Summarizing Categorical Data
Supplementary Notes 1.4
Frequency Tables
Consider building a frequency table from the following categorical data table from a student survey:
Student | Carrier |
1 | Fido |
2 | Rogers |
3 | Bell |
4 | Rogers |
5 | Telus |
6 | Rogers |
7 | Rogers |
8 | Bell |
9 | Rogers |
10 | Telus |
11 | Telus |
12 | Rogers |
13 | Rogers |
14 | Telus |
15 | Fido |
16 | Rogers |
First, let’s consider the data:
- Who? Students in a statistics class who own a cell phone.
- What? Cell phone carrier
- Where? Capilano College
- When? September 2013
- Why? To gain information on student choice of cell phone carriers.
- How? Students in a statistics class filled out a questionnaire.
Now from this data table, construct a frequency table by making “frequency counts” for each cell phone carrier. To display this frequency distribution, you could construct a bar, pie, or segmented bar graph. To draw a pie graph by hand, you would need a protractor to mark out the angles. (Don’t worry: you won’t be asked to do this in the course!)
Frequency table:
Carrier | Frequency |
Bell | 2 |
Fido | 2 |
Rogers | 8 |
Telus | 4 |
Total | 16 |
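As an illustration only, the frequency counts above can also be tallied in a few lines of Python. This is a minimal sketch using the standard library; the variable names `carriers` and `frequency` are our own, and the list is simply the Carrier column of the data table typed in by hand.

```python
from collections import Counter

# Carrier responses for students 1-16, copied from the data table.
carriers = [
    "Fido", "Rogers", "Bell", "Rogers", "Telus", "Rogers", "Rogers", "Bell",
    "Rogers", "Telus", "Telus", "Rogers", "Rogers", "Telus", "Fido", "Rogers",
]

# Counter tallies how many times each category appears in the list.
frequency = Counter(carriers)

# Print one row per carrier in alphabetical order, then the total.
for carrier in sorted(frequency):
    print(f"{carrier:<8}{frequency[carrier]:>3}")
print(f"{'Total':<8}{sum(frequency.values()):>3}")
```

The total is a useful check: it should equal the number of students surveyed (16).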
Suppose we collected another sample of student cell phone carrier preferences and wanted to compare the frequency distributions of the two samples.
Carrier | Sample 1 Frequency | Sample 2 Frequency |
Bell | 5 | 2 |
Fido | 5 | 2 |
Rogers | 17 | 8 |
Telus | 13 | 4 |
Total | 40 | 16 |
It’s hard to compare these two distributions in this form since they are based on different sample sizes. The comparison will be much easier if we convert the frequencies to relative frequencies (or percentages).
Carrier | Sample 1 Relative Frequency | Sample 1 Percent | Sample 2 Relative Frequency | Sample 2 Percent |
Bell | 5/40 = .125 | 12.5% | 2/16 = .125 | 12.5% |
Fido | 5/40 = .125 | 12.5% | 2/16 = .125 | 12.5% |
Rogers | 17/40 = .425 | 42.5% | 8/16 = .5 | 50% |
Telus | 13/40 = .325 | 32.5% | 4/16 = .25 | 25% |
Total | 1.0 | 100% | 1.0 | 100% |
Now the comparison is much clearer. Both samples show the same percentage support for Bell and Fido (12.5% each). Rogers has the highest percentage of support in both samples, although it is somewhat lower in Sample 1 (42.5%) than in Sample 2 (50%). The corresponding bar graphs for these two distributions would now be drawn using percent on the vertical scale instead of frequency.
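The conversion from frequencies to relative frequencies is just a division by the sample total. Here is a small Python sketch of that step; the helper name `relative_frequencies` and the dictionary layout are our own choices for illustration, not part of the course materials.

```python
# Frequency counts for the two samples, taken from the comparison table.
sample_1 = {"Bell": 5, "Fido": 5, "Rogers": 17, "Telus": 13}
sample_2 = {"Bell": 2, "Fido": 2, "Rogers": 8, "Telus": 4}


def relative_frequencies(counts):
    """Divide each count by the sample total."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


rel_1 = relative_frequencies(sample_1)
rel_2 = relative_frequencies(sample_2)

# Show both samples side by side as percentages.
for carrier in sample_1:
    print(f"{carrier:<8}{rel_1[carrier]:>7.1%}{rel_2[carrier]:>7.1%}")
```

Because each sample is divided by its own total, the two columns are on the same scale even though the samples have different sizes (40 and 16).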
Contingency Tables
When subject responses are cross-classified according to row variable and column variable categories, the resulting frequency (or percent) table is called a contingency table.
Building a Contingency Table from a Data Table
In the cell phone survey, students were also asked to rate their satisfaction level with the service provided by the carrier.
Student | Carrier | Satisfaction Level |
1 | Fido | Somewhat |
2 | Rogers | Somewhat |
3 | Bell | Somewhat |
4 | Rogers | Very |
5 | Telus | Somewhat |
6 | Rogers | Somewhat |
7 | Rogers | Very |
8 | Bell | Somewhat |
9 | Rogers | Neutral |
10 | Telus | Very |
11 | Telus | Very |
12 | Rogers | Neutral |
13 | Rogers | Neutral |
14 | Telus | Somewhat |
15 | Fido | Neutral |
16 | Rogers | Very |
To construct a contingency table for this data, you choose one of the two variables for the row variable and the other for the column variable. Let’s choose “Carrier” as the column variable and “Satisfaction Level” as the row variable. Now you build “frequency counts” for each cell in the table.
Here’s the resulting contingency table for this data set:
Satisfaction Level | Bell Carrier | Fido Carrier | Rogers Carrier | Telus Carrier | Row Totals |
Very | 0 | 0 | 3 | 2 | 5 |
Somewhat | 2 | 1 | 2 | 2 | 7 |
Neutral | 0 | 1 | 3 | 0 | 4 |
Column Totals | 2 | 2 | 8 | 4 | 16 |
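The cell counts in this table can be reproduced by tallying (row, column) pairs. The following Python sketch shows one way to do it with the standard library; the names `cells`, `rows`, `cols`, and `table` are our own, and the two lists are the Carrier and Satisfaction Level columns typed in by hand.

```python
from collections import Counter

# Responses for students 1-16, copied from the data table.
carriers = [
    "Fido", "Rogers", "Bell", "Rogers", "Telus", "Rogers", "Rogers", "Bell",
    "Rogers", "Telus", "Telus", "Rogers", "Rogers", "Telus", "Fido", "Rogers",
]
satisfaction = [
    "Somewhat", "Somewhat", "Somewhat", "Very", "Somewhat", "Somewhat",
    "Very", "Somewhat", "Neutral", "Very", "Very", "Neutral", "Neutral",
    "Somewhat", "Neutral", "Very",
]

# Count each (satisfaction level, carrier) pair; missing pairs count as 0.
cells = Counter(zip(satisfaction, carriers))

rows = ["Very", "Somewhat", "Neutral"]
cols = ["Bell", "Fido", "Rogers", "Telus"]

# One list of cell counts per row, in column order.
table = {row: [cells[(row, col)] for col in cols] for row in rows}

for row in rows:
    print(f"{row:<10}", table[row], "row total:", sum(table[row]))
```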
Calculating Percentages on a Contingency Table
Caution! Read the questions very carefully: some of these percentage calculations can be tricky.
What percentage of students use Rogers and are very satisfied?
3/16 = 0.1875 = 18.75%
Overall, what percentage of users is very satisfied?
5/16 = 0.3125 = 31.25%
What percentage of Rogers users is very satisfied?
3/8 = 0.375 = 37.5%
Notice here this percentage is based on only the 8 students who use Rogers; whereas in the previous two questions, the percentages are based on the full 16 students. In the language of statistics, we are “conditioning” on the Rogers column to get the conditional percentage of 37.5%. Notice that this conditional percentage of Rogers users that are very satisfied (37.5%) is not equal to the overall percentage of very satisfied users (31.25%). This gives us some evidence of an association between satisfaction level and carrier company.
What percentage of Telus users is very satisfied?
2/4 = 0.5 = 50%
Now this percentage is based on only the 4 students who use Telus. We are conditioning on the Telus column. Since this is not equal to the overall percentage of very satisfied users, this again suggests an association between satisfaction level and carrier company (however, with such a small sample size we shouldn’t overstate the significance of the differences in these percentages).
Of those that are somewhat satisfied, what percentage use Fido?
1/7 ≈ 0.143 = 14.3%
This is just another common phrasing of a conditional percentage calculation where we are now conditioning on the somewhat satisfied row.
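The key to every one of these questions is choosing the right denominator: the grand total, a column total, or a row total. The Python sketch below repeats the calculations with the denominators made explicit; the variable names are our own and the counts are copied from the contingency table.

```python
# Cell counts from the contingency table: table[satisfaction][carrier].
table = {
    "Very":     {"Bell": 0, "Fido": 0, "Rogers": 3, "Telus": 2},
    "Somewhat": {"Bell": 2, "Fido": 1, "Rogers": 2, "Telus": 2},
    "Neutral":  {"Bell": 0, "Fido": 1, "Rogers": 3, "Telus": 0},
}

row_total = {level: sum(row.values()) for level, row in table.items()}
col_total = {c: sum(table[level][c] for level in table) for c in table["Very"]}
grand_total = sum(row_total.values())

# Denominator is the grand total: all 16 students.
rogers_and_very = table["Very"]["Rogers"] / grand_total
overall_very = row_total["Very"] / grand_total

# Denominator is a column total: conditioning on the carrier.
very_given_rogers = table["Very"]["Rogers"] / col_total["Rogers"]
very_given_telus = table["Very"]["Telus"] / col_total["Telus"]

# Denominator is a row total: conditioning on the satisfaction level.
fido_given_somewhat = table["Somewhat"]["Fido"] / row_total["Somewhat"]

print(f"Rogers and very satisfied: {rogers_and_very:.2%}")
print(f"Very satisfied overall:    {overall_very:.2%}")
print(f"Very satisfied | Rogers:   {very_given_rogers:.1%}")
print(f"Very satisfied | Telus:    {very_given_telus:.1%}")
print(f"Fido | somewhat satisfied: {fido_given_somewhat:.1%}")
```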
Independence in Contingency Tables
The concept of independence is very important in probability and statistics, but it’s also a little slippery! The objective at this point in the course is to give you an intuitive introduction to independence in the context of a contingency table. Then, later in the course, we’ll get much more formal in dealing with independence.
Before doing any calculations, let’s see if you already have a feel for the meaning of independence for two variables:
Is a student’s statistics grade independent of the number of hours that they study?
Clearly not! Students who study more are more likely to get better grades.
Is a person’s shoe size independent of their height?
Well no. People with larger shoe sizes are more likely to be taller.
Is a person’s eye colour independent of their hair colour?
Again no. Dark-eyed people are more likely to have dark hair than, say, blue-eyed people.
Is a person’s smoking status (smoker or not) independent of their education level?
No again. Generally, the higher a person’s education level, the less likely they are to smoke.
Is a person’s smoking status independent of their blood type?
Seems plausible. It’s hard to see how or why these two variables might be associated. (But you never know!)
As the above examples illustrate, it’s actually hard to think of two variables that are clearly independent. The association between the two variables might be weak, and the reason for the association might be unclear. And remember, just because two variables are associated it does not follow that one causes the other. They both could be responding to a third variable lurking in the background.
The next example explores in greater detail what it means when we say two variables in a contingency table are independent.
Checking for Independence in a Contingency Table
In the following contingency table, the row and column totals have been given, but the cell frequencies have not.
What would these cell frequencies be if we assume that smoking status and blood group are perfectly independent?
Smoking Status | Blood Group A | Blood Group B | Blood Group O | Blood Group AB | Row Totals |
Yes | ? | ? | ? | ? | 200 |
No | ? | ? | ? | ? | 800 |
Column Totals | 400 | 110 | 450 | 40 | 1000 |
First notice that 200/1000 = 20% are smokers, and 80% are not. So for smoking status and blood group to be perfectly independent, we must preserve this 20% smoker ratio across all blood groups.
- In blood group A the number of smokers must be 20% of 400 = 80.
- In blood group B the number of smokers must be 20% of 110 = 22.
- In blood group O the number of smokers must be 20% of 450 = 90.
- In blood group AB the number of smokers must be 20% of 40 = 8.
We can now fill in all the cell frequencies to produce a frequency table where the row and column variables are perfectly independent. (Of course, such “perfection” rarely occurs in real life!)
Smoking Status | Blood Group A | Blood Group B | Blood Group O | Blood Group AB | Row Totals |
Yes | 80 | 22 | 90 | 8 | 200 |
No | 320 | 88 | 360 | 32 | 800 |
Column Totals | 400 | 110 | 450 | 40 | 1000 |
Notice now that the conditional distribution of smoking status (20% Yes, 80% No) is the same for each blood group, and in turn it equals the marginal distribution of smoking status. That’s independence!
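The fill-in procedure can be written compactly: under perfect independence, each cell equals (row total × column total) / grand total, which is the same as applying the 20% (or 80%) row share to each column total. The Python sketch below is one way to carry this out; the names `expected`, `row_totals`, and `col_totals` are our own.

```python
# Marginal totals from the smoking status / blood group table.
row_totals = {"Yes": 200, "No": 800}
col_totals = {"A": 400, "B": 110, "O": 450, "AB": 40}
grand_total = sum(row_totals.values())

# Under perfect independence:
# cell = row total * column total / grand total.
expected = {
    status: {
        group: r_total * c_total / grand_total
        for group, c_total in col_totals.items()
    }
    for status, r_total in row_totals.items()
}

for status, row in expected.items():
    print(status, row)

# Check: the share of smokers within each blood group matches the
# overall share of smokers (the marginal distribution).
overall_share = row_totals["Yes"] / grand_total
for group, c_total in col_totals.items():
    share = expected["Yes"][group] / c_total
    print(f"Group {group}: {share:.0%} smokers (overall {overall_share:.0%})")
```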