Software Lab 3.1

Iain Pardoe

Lesson 3.1: Sampling Variability

Sampling Distributions

This software lab is adapted from the Sampling Distributions lab (OpenIntro, n.d.-a), licensed under CC BY-SA 4.0. In this lab, we will investigate the ways in which a statistic from a random sample of data can serve as a point estimate for a population parameter. We’re interested in formulating a sampling distribution of our estimate in order to learn about the properties of the estimate, such as its centre and spread.

As you work through the lab, answer the ungraded exercises in the shaded boxes. Check your answers by consulting the Software Lab 3.1 Solutions.

Remember to complete the graded Software Lab Questions for this section in Moodle.

Getting Started

The Data

A 2019 Gallup blog post reported the following:

The premise that scientific progress benefits people has been embodied in discoveries throughout the ages—from the development of vaccinations to the explosion of technology in the past few decades, resulting in billions of supercomputers now resting in the hands and pockets of people worldwide. Still, not everyone around the world feels science benefits them personally. ( … ) The Wellcome Global Monitor finds that 20% of people globally believe that scientists’ work does not benefit people like them. (Stevens & Dugan, 2019)

In this lab, you will assume this 20% is a true population proportion and learn about how sample proportions can vary from sample to sample by taking small samples from the population. We will first create our population assuming a population size of 100,000. This means 20,000 (20%) of the population believe scientists’ work does not benefit people like them and the remaining 80,000 believe it does.
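Although this lab uses jamovi, the construction of the population can also be sketched in a few lines of Python. The sketch below is only an illustration: the names POPULATION_SIZE, P_DOESNT_BENEFIT, and the in-memory list scientist_work are assumptions made for the example and are not part of the lab files.

```python
from collections import Counter

POPULATION_SIZE = 100_000
P_DOESNT_BENEFIT = 0.20  # treated as the true population proportion in this lab

# Split the population into the two response groups.
n_doesnt = round(POPULATION_SIZE * P_DOESNT_BENEFIT)
n_benefits = POPULATION_SIZE - n_doesnt

# Hypothetical in-memory stand-in for the scientist_work column.
scientist_work = ["Benefits"] * n_benefits + ["Doesn't benefit"] * n_doesnt

# Frequency table of the population responses.
counts = Counter(scientist_work)
for response, count in counts.items():
    print(f"{response}: {count} ({count / POPULATION_SIZE:.0%})")
```

The counts printed here should match the frequency table you obtain in jamovi once the data frame is loaded.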

Download the global_monitor [CSV file] (OpenIntro, n.d.-b) data frame, and load it into jamovi. The scientist_work variable contains responses to the question: Do you believe that the work scientists do benefits people like you? The relevant values are Benefits or Doesn't benefit.

We can quickly visualize the distribution of these responses using a bar plot. You can find this in the Exploration menu, by clicking Bar plot in Descriptives > Plots. We can also obtain summary statistics to confirm we constructed the data frame correctly by checking the Frequency tables box (confirm counts of 80,000 for Benefits and 20,000 for Doesn't benefit).

The Unknown Sampling Distribution

In this lab, you have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that to understand the properties of the population. If you are interested in estimating the proportion of people who believe scientists’ work doesn’t benefit people like them, you can use the sampling function in jamovi to survey the population.

Click the Data tab, and add a new computed variable by double-clicking the empty column next to the column containing the scientist_work variable. Name this variable sample1 and define it using the function SAMPLE(scientist_work, 50). This command collects a simple random sample of size 50 from the global_monitor dataset, and assigns the result to sample1. This is similar to randomly drawing names from a hat that contains the names of everyone in the population. Working with these 50 names is considerably simpler than working with all 100,000 people in the population.

You won’t see any data show up in the data viewer, but you can get summaries and graphs of this new variable. Create a frequency table of this new variable to see how many people responded in each way: click the Analyses tab; then click Exploration > Descriptives; and select Frequency tables for variable sample1.
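For readers who want to see the same idea outside jamovi, here is a minimal Python sketch of drawing one simple random sample of size 50 and tabulating it. It assumes sampling without replacement (as in drawing names from a hat), rebuilds the population in memory rather than reading the CSV file, and uses an arbitrary seed; the names population, sample1, and p_hat are illustrative only.

```python
import random
from collections import Counter

random.seed(1)  # arbitrary seed so that rerunning gives the same sample

# Hypothetical in-memory population: 80,000 Benefits, 20,000 Doesn't benefit.
population = ["Benefits"] * 80_000 + ["Doesn't benefit"] * 20_000

# Simple random sample of 50 people, drawn without replacement.
sample1 = random.sample(population, k=50)

# Frequency table and the sample proportion of "Doesn't benefit".
freq = Counter(sample1)
p_hat = freq["Doesn't benefit"] / len(sample1)
print(freq)
print(f"sample proportion (Doesn't benefit): {p_hat:.2f}")
```

Changing the seed gives a different sample, and usually a slightly different sample proportion.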

1. Describe the distribution of responses in this sample. How does it compare to the distribution of responses in the population? Hint: What proportion of the sample believes scientists’ work doesn’t benefit people like them? What proportion of the population believes this? Check your answer by consulting the Software Lab 3.1 Solutions.

If you’re interested in estimating the proportion of all people who believe that scientists’ work doesn’t benefit people like them, but you do not have access to the population data, your best single estimate is the sample proportion.

Depending on which 50 people you selected, your estimate could be a bit above or a bit below the true population proportion. In general, though, the sample proportion turns out to be a pretty good estimate of the true population proportion, and you were able to get it by sampling 50 people out of 100,000, which is just 0.05% of the population.

2. Would you expect the sample proportion to match the sample proportion of another random sample from this population? Why, or why not? If the answer is no, would you expect the proportions to be somewhat similar or very different?

3. Take a second sample, also of size 50, and call it sample2. How does the sample proportion for sample2 compare with that of sample1? Suppose we took two more samples, one of size 100 and one of size 1,000. Which provides a more accurate estimate of the population proportion?

Not surprisingly, every time you take another random sample, you might get a different sample proportion. It’s useful to get a sense of just how much variability you should expect when estimating the population proportion this way. The distribution of sample proportions, called the sampling distribution of the proportion, can help you understand this variability.

In this lab, because you have access to the population, you could build up the sampling distribution for the sample proportion by repeating the above steps many times. We did exactly this, repeating the process 15,000 times. In other words, we generated 15,000 samples of size 50 from the population, calculated the proportion of Doesn't benefit responses in each sample, and saved the results in the file sample_props50 [CSV file] (OpenIntro, n.d.-c). Download and open this dataset to get a sense of what it tells us.
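The repetition described above can be sketched in Python as follows. This is not the procedure used to produce sample_props50; it is an illustrative re-creation under stated assumptions: the population is rebuilt in memory, responses are coded 1 for Doesn't benefit and 0 for Benefits, samples are drawn without replacement, and the seed is arbitrary.

```python
import random
import statistics

random.seed(2)  # arbitrary seed for reproducibility

# Hypothetical coding: 1 = Doesn't benefit, 0 = Benefits.
population = [1] * 20_000 + [0] * 80_000

N_SAMPLES = 15_000
SAMPLE_SIZE = 50

# One sample proportion per repetition; each comes from a different sample.
p_hats = [
    sum(random.sample(population, SAMPLE_SIZE)) / SAMPLE_SIZE
    for _ in range(N_SAMPLES)
]

# Summaries of the simulated sampling distribution.
centre = statistics.mean(p_hats)
spread = statistics.stdev(p_hats)
print(f"number of sample proportions: {len(p_hats)}")
print(f"mean of p_hat: {centre:.3f}")
print(f"standard deviation of p_hat: {spread:.3f}")
```

Each element of p_hats plays the same role as one row of p_hat in the downloaded dataset: it is the proportion from a single sample of 50 people.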

4. We can now visualize the distribution of the proportions with a histogram. Create a histogram of the proportions, p_hat, in the sample_props50 dataset. Describe the sampling distribution of p_hat, and specify its centre and spread.

Interlude: Sampling Distributions

The idea behind this process is repetition. Earlier, you took a single sample of size 50 from the population. With this process, you repeat the sampling procedure many times in order to build a distribution of sample statistics, which is called the sampling distribution.

In practice, we rarely get to build true sampling distributions, because we rarely have access to data from the entire population. Note that for each of the 15,000 times we computed a proportion, we did so from a different sample!

5. To make sure you understand how sampling distributions are built, and exactly what the process does, imagine modifying the process to create a sampling distribution based on 25 sample proportions from samples of size 10. How many observations are there in this distribution? What does each observation represent?

Sample Size and the Sampling Distribution

Mechanics aside, let’s return to the reason we generated this data: to compute a sampling distribution. Specifically, we want the sampling distribution based on the proportions from 15,000 samples of 50 people.

The sampling distribution that you computed tells us a lot about estimating the true proportion of people who believe that scientists’ work doesn’t benefit people like them. Because the sample proportion is an unbiased estimator, the sampling distribution is centred at the true population proportion, and the spread of the distribution indicates how much variability is incurred by sampling only 50 people at a time from the population.
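The spread can also be compared with a theoretical value. The sketch below uses the standard formula for the standard error of a sample proportion, the square root of p(1 - p)/n, which assumes independent draws; with samples of 50 from a population of 100,000, the effect of sampling without replacement is negligible. The lab itself does not state this formula, and the function name standard_error is introduced here only for illustration.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion, assuming independent draws."""
    return math.sqrt(p * (1 - p) / n)


# True population proportion assumed in this lab, and the sample size used so far.
se_50 = standard_error(0.2, 50)
print(f"theoretical standard error for n = 50: {se_50:.4f}")
```

You can compare this value with the spread you observe in the histogram of p_hat, and call the function with other sample sizes when you work through the rest of this section.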

In the remainder of this section, you will work on getting a sense of the effect that sample size has on sampling distribution.

6. Use the Proportion Sampling Distribution Simulator [Application] (CPM Educational Program, 2023) to create sampling distributions of proportions of Doesn't benefit from samples of size 10, 50, and 100. Use 5,000 samples, and remember that the true proportion of the population who believe scientists’ work doesn’t benefit people like them is 0.2. What does each observation in the sampling distribution represent? How (if at all) do the mean, standard error, and shape of the sampling distribution change as the sample size increases? How (if at all) do they change if you increase the number of samples?

More Practice

So far, we have only focused on estimating the proportion of those who believe scientists’ work doesn’t benefit people like them. Now, you’ll try to estimate the proportion of those who believe it does benefit people like them.

7. Adapt the steps above to take a sample of size 15 from the population, and calculate the proportion of people in this sample who believe scientists’ work benefits people like them. Using this sample, what is your best point estimate of the population proportion of people who believe scientists’ work benefits people like them?

8. Using the simulation app from question 6, simulate the sampling distribution of the proportion of those who believe scientists’ work benefits people like them for samples of size 15 based on 2,000 samples. Describe the shape of this sampling distribution. Based on this sampling distribution, what would you estimate is the true proportion of those who believe scientists’ work benefits people like them? What is the true population proportion in this case?

9. Change your sample size from 15 to 150, then simulate the sampling distribution using the same method in question 8. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 15 in question 8. Based on this sampling distribution, what would you estimate to be the true proportion of those who believe scientists’ work benefits people like them?

10. Of the sampling distributions from questions 8 and 9, which has a smaller spread? If you’re concerned with making estimates that are more often close to the true value, would you prefer a sampling distribution with a large or small spread?

References

CPM Educational Program. (2023). Proportion sampling distribution simulator [Application]. https://stats.cpm.org/propsamples/

OpenIntro. (n.d.-a). Foundations for statistical inference – Sampling distributions. OpenIntro Labs for jamovi. CC BY-SA 4.0. https://openintro.shinyapps.io/sampling_distributions_jamovi/

OpenIntro. (n.d.-b). Global_monitor [Data set]. https://github.com/OpenIntroStat/oilabs-jamovi/raw/main/05a_sampling_distributions/more/global_monitor.csv

OpenIntro. (n.d.-c). Sample_props50 [Data set]. https://github.com/OpenIntroStat/oilabs-jamovi/raw/main/05a_sampling_distributions/more/sample_props50.csv

Stevens, L., & Dugan, A. (2019, Nov. 8). World science day: Is knowledge power? Gallup. https://news.gallup.com/opinion/gallup/268121/world-science-day-knowledge-power.aspx

License

Creative Commons Attribution-ShareAlike 4.0 International License