Lesson 3.1: Sampling Variability
Software Lab 3.1
Sampling Distributions
This software lab is adapted from the Sampling Distributions lab (OpenIntro, n.d.-a), which is licensed under CC BY-SA 4.0. In this lab, we will investigate the ways in which a statistic from a random sample of data can serve as a point estimate for a population parameter. We’re interested in formulating a sampling distribution of our estimate in order to learn about the properties of the estimate, such as its centre and spread.
As you work through the lab, answer the ungraded exercises in the shaded boxes. Check your answers by consulting the Software Lab 3.1 Solutions.
Remember to complete the graded Software Lab Questions for this section in Moodle.
Getting Started
The Data
A 2019 Gallup blog post reported the following:
The premise that scientific progress benefits people has been embodied in discoveries throughout the ages—from the development of vaccinations to the explosion of technology in the past few decades, resulting in billions of supercomputers now resting in the hands and pockets of people worldwide. Still, not everyone around the world feels science benefits them personally. ( … ) The Wellcome Global Monitor finds that 20% of people globally believe that scientists’ work does not benefit people like them. (Stevens & Dugan, 2019)
In this lab, you will assume this 20% is a true population proportion and learn about how sample proportions can vary from sample to sample by taking small samples from the population. We will first create our population assuming a population size of 100,000. This means 20,000 (20%) of the population believe scientists’ work does not benefit people like them and the remaining 80,000 believe it does.
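For readers who want to see the same population outside jamovi, here is a minimal Python sketch. It is not part of the original lab: the variable names are hypothetical, and the only values taken from the text are the population size (100,000), the 20% proportion, and the two response labels.

```python
from collections import Counter

# Values stated in the lab text.
POPULATION_SIZE = 100_000
P_DOES_NOT_BENEFIT = 0.20

# Hypothetical construction: one list entry per person in the population.
n_doesnt = round(POPULATION_SIZE * P_DOES_NOT_BENEFIT)
n_benefits = POPULATION_SIZE - n_doesnt
population = ["Benefits"] * n_benefits + ["Doesn't benefit"] * n_doesnt

# Frequency table, analogous to a frequency table of scientist_work.
counts = Counter(population)
for response, count in counts.items():
    print(f"{response}: {count} ({count / POPULATION_SIZE:.0%})")
```

The counts should match the 80,000 and 20,000 figures given above.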
Download the global_monitor [CSV file] (OpenIntro, n.d.-b) data frame, and load it into jamovi. The scientist_work variable contains responses to the question: Do you believe that the work scientists do benefits people like you? The relevant values are Benefits or Doesn't benefit.
We can quickly visualize the distribution of these responses using a bar plot. You can find this in the Exploration menu, by clicking Bar plot in Descriptives > Plots. We can also obtain summary statistics to confirm we constructed the data frame correctly by checking the Frequency tables box (confirm counts of 80,000 for Benefits and 20,000 for Doesn't benefit).
The Unknown Sampling Distribution
In this lab, you have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that to understand the properties of the population. If you are interested in estimating the proportion of people who believe scientists’ work doesn’t benefit people like them, you can use the sampling function in jamovi to survey the population.
Click the Data tab, and add a new computed variable by double-clicking the empty column next to the column containing the scientist_work variable. Name this variable sample1 and define it using the function SAMPLE(scientist_work, 50). This command collects a simple random sample of size 50 from the global_monitor dataset, and assigns the result to sample1. This is similar to randomly drawing names from a hat that contains the names of everyone in the population. Working with these 50 names is considerably simpler than working with all 100,000 people in the population.
You won’t see any data show up in the data viewer, but you can get summaries and graphs of this new variable. Create a frequency table of this new variable to see how many people responded in each way: click the Analyses tab; then click Exploration > Descriptives; and select Frequency tables for variable sample1.
If you’re interested in estimating the proportion of all people who believe that scientists’ work doesn’t benefit people like them, but you do not have access to the population data, your best single estimate is the sample proportion.
Depending on which 50 people you selected, your estimate could be a bit above or a bit below the true population proportion. In general, though, the sample proportion turns out to be a pretty good estimate of the true population proportion, and you were able to get it by sampling 50 people out of 100,000, which is just 0.05% of the population.
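As an illustration only, the sampling step and the resulting point estimate can be sketched in Python. This is a stand-in for the jamovi workflow, not a description of how jamovi's SAMPLE function works internally: it uses Python's random.sample, which draws without replacement, and the seed and variable names are arbitrary assumptions.

```python
import random
from collections import Counter

random.seed(31)  # arbitrary seed so the sketch is repeatable

# Population as described in the lab: 80,000 Benefits, 20,000 Doesn't benefit.
population = ["Benefits"] * 80_000 + ["Doesn't benefit"] * 20_000

# Simple random sample of 50 people, drawn without replacement.
sample1 = random.sample(population, 50)

# Frequency table of the sample and the sample proportion (the point estimate).
freq = Counter(sample1)
p_hat = freq["Doesn't benefit"] / len(sample1)

# Share of the population that was actually surveyed.
sampling_fraction = len(sample1) / len(population)

print(dict(freq))
print(f"sample proportion of Doesn't benefit: {p_hat:.2f}")
print(f"sampling fraction: {sampling_fraction:.2%}")
```

Changing the seed gives a different sample and, usually, a slightly different sample proportion, which is the variability this lab goes on to study.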
Create a second sample, sample2, in the same way. How does the sample proportion for sample2 compare with that of sample1? Suppose we took two more samples, one of size 100 and one of size 1,000. Which provides a more accurate estimate of the population proportion?
Not surprisingly, every time you take another random sample, you might get a different sample proportion. It’s useful to get a sense of just how much variability you should expect when estimating the population proportion this way. The distribution of sample proportions, called the sampling distribution of the proportion, can help you understand this variability.
In this lab, because you have access to the population, you could build up the sampling distribution for the sample proportion by repeating the above steps many times. We actually did this by repeating this process 15,000 times. In other words, we generated 15,000 samples of size 50 from the population, calculated the proportion of Doesn't benefit responses in each sample, and saved the results in the file sample_props50 [CSV file] (OpenIntro, n.d.-c). Download and open this dataset to get a sense of what it tells us.
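The repeated-sampling procedure described above can be sketched in Python as follows. This is a hedged illustration, not the code used to produce sample_props50: the seed, the 0/1 coding of responses, and the variable names are assumptions, so the simulated values will not match the file entry by entry.

```python
import random
import statistics

random.seed(2019)  # arbitrary seed so the sketch is repeatable

# Code responses numerically: 1 = "Doesn't benefit", 0 = "Benefits".
population = [1] * 20_000 + [0] * 80_000

N_SAMPLES = 15_000   # number of repeated samples, as in the lab
SAMPLE_SIZE = 50     # people per sample, as in the lab

# One sample proportion per repeated sample, drawn without replacement.
p_hats = [
    sum(random.sample(population, SAMPLE_SIZE)) / SAMPLE_SIZE
    for _ in range(N_SAMPLES)
]

print(f"number of sample proportions: {len(p_hats)}")
print(f"mean of p_hat: {statistics.mean(p_hats):.3f}")
print(f"standard deviation of p_hat: {statistics.stdev(p_hats):.3f}")
```

Each element of p_hats plays the role of one row of the p_hat variable: a single proportion computed from a single sample of 50 people.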
Explore the variable, p_hat, in the sample_props50 dataset. Describe the sampling distribution of p_hat, and specify its centre and spread.
Interlude: Sampling Distributions
The idea behind this process is repetition. Earlier, you took a single sample of size 50 from the population. With this process, you can repeat this sampling procedure many times in order to build a distribution of sample statistics, which is called the sampling distribution.
In practice, we rarely get to build true sampling distributions, because we rarely have access to data from the entire population. Note that each of the 15,000 proportions was computed from a different sample!
Sample Size and the Sampling Distribution
Mechanics aside, let’s return to the reason we generated this data: to compute a sampling distribution. Specifically, we want the sampling distribution based on the proportions from 15,000 samples of 50 people.
The sampling distribution that you computed tells us a lot about estimating the true proportion of people who believe that scientists’ work doesn’t benefit people like them. Because the sample proportion is an unbiased estimator, the sampling distribution is centred at the true population proportion, and the spread of the distribution indicates how much variability is incurred by sampling only 50 people at a time from the population.
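The centre and spread described here can also be checked against theory. The sketch below uses the standard formula for the standard error of a sample proportion, sqrt(p(1 - p)/n). The lab text does not state this formula, so treat it as a supplementary assumption. It ignores the finite population correction, which is negligible when 50 people are drawn from 100,000. The function name is hypothetical.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Theoretical standard error of a sample proportion: sqrt(p(1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


P_TRUE = 0.2      # population proportion assumed in the lab
SAMPLE_SIZE = 50  # people per sample

se = standard_error(P_TRUE, SAMPLE_SIZE)
print(f"centre = {P_TRUE}, standard error = {se:.4f}")
# prints: centre = 0.2, standard error = 0.0566
```

The standard deviation of the p_hat values in sample_props50 should be close to this theoretical value, since both describe the spread of proportions from samples of size 50.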
In the remainder of this section, you will work on getting a sense of the effect that sample size has on the sampling distribution.
Create sampling distributions of the proportion of Doesn't benefit responses from samples of size 10, 50, and 100. Use 5,000 samples, and remember the true proportion of the population who believe scientists’ work doesn’t benefit people like them is 0.2. What does each observation in the sampling distribution represent? How (if at all) do the mean, standard error, and shape of the sampling distribution change as the sample size increases? How (if at all) do the mean, standard error, and shape of the sampling distribution change if you increase the number of samples?
More Practice
So far, we have only focused on estimating the proportion of those who believe scientists’ work doesn’t benefit people like them. Now, you’ll try to estimate the proportion of those who believe it does benefit people like them.
References
CPM Educational Program. (2023). Proportion sampling distribution simulator [Application]. https://stats.cpm.org/propsamples/
OpenIntro. (n.d.-a). Foundations for statistical inference – Sampling distributions. OpenIntro Labs for jamovi. CC BY-SA 4.0. https://openintro.shinyapps.io/sampling_distributions_jamovi/
OpenIntro. (n.d.-b). Global_monitor [Data set]. https://github.com/OpenIntroStat/oilabs-jamovi/raw/main/05a_sampling_distributions/more/global_monitor.csv
OpenIntro. (n.d.-c). Sample_props50 [Data set]. https://github.com/OpenIntroStat/oilabs-jamovi/raw/main/05a_sampling_distributions/more/sample_props50.csv
Stevens, L., & Dugan, A. (2019, November 8). World Science Day: Is knowledge power? Gallup. https://news.gallup.com/opinion/gallup/268121/world-science-day-knowledge-power.aspx