Glossary
- 68-95-99.7 rule
-
Rule of thumb for the probability of falling within 1, 2, and 3 standard deviations of the mean in the normal distribution. Also called the empirical rule.
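The rule's three percentages can be checked against the exact normal probabilities using only the standard library; this is a quick sketch, and the function name `prob_within_k_sd` is ours, not standard terminology.

```python
import math

def prob_within_k_sd(k: float) -> float:
    """P(|Z| < k) for a standard normal Z, via the error function."""
    return math.erf(k / math.sqrt(2))

# The 68-95-99.7 rule versus the exact normal probabilities:
for k in (1, 2, 3):
    print(k, round(prob_within_k_sd(k), 4))  # 0.6827, 0.9545, 0.9973
```

The exact values (68.27%, 95.45%, 99.73%) show why "68-95-99.7" is a rule of thumb rather than an identity.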
- Adjusted R-squared
-
An adjustment to R-squared to account for multiple predictors in a multiple linear regression model. Used as a model comparison tool.
- Alternative hypothesis
-
An alternative claim (to the null hypothesis) under consideration, often represented by a range of possible parameter values. Denoted H_A.
- Analysis of variance
-
A technique that compares variability between two or more group means relative to data variability within the groups and is used to test whether the mean outcome differs across the groups. Also called ANOVA.
- Anecdotal evidence
-
Data collected in a haphazard fashion that may only represent extraordinary cases.
- ANOVA
-
A technique that compares variability between two or more group means relative to data variability within the groups and is used to test whether the mean outcome differs across the groups. Also called Analysis of Variance.
- Associated variables
-
Variables that show an evident relationship or association with one another. Also called dependent variables.
- Bar plot
-
A graph that displays counts or proportions of categorical data in different groups using vertical bars.
- Bias
-
A systematic tendency to over- or under-estimate the true population value of a parameter.
- Biased sample
-
A sample selected non-randomly that systematically misrepresents the population.
- Blinding
-
Ensuring researchers keep the patients uninformed about their treatment.
- Blocking
-
Grouping cases into blocks according to a blocking variable known to affect the response, before randomizing cases within each block to the treatment groups.
- Bonferroni correction
-
Dividing the significance level by the number of comparisons when comparing multiple pairs of means after an Analysis of Variance (ANOVA).
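The arithmetic is simple enough to sketch directly; the helper name `bonferroni_alpha` is ours for illustration.

```python
def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance level after a Bonferroni correction."""
    return alpha / n_comparisons

# Four groups yield 4 * 3 / 2 = 6 pairwise comparisons of means:
groups = 4
comparisons = groups * (groups - 1) // 2
print(round(bonferroni_alpha(0.05, comparisons), 4))  # 0.0083
```

Each pairwise test is then conducted at the stricter level 0.05/6 ≈ 0.0083 so that the family of six tests keeps roughly the original 0.05 error rate.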
- Box plot
-
A graph that summarizes numerical data using a box spanning the first quartile to the third quartile, a line inside the box at the median, and whiskers that extend from the box to the most extreme observations within 1.5 times the interquartile range of the box. Observations beyond the whiskers are plotted individually as outliers.
- Case
-
An individual in a sample dataset on whom one or more variables have been measured. Also called an observational unit.
- Categorical variable
-
A variable with values (levels) that represent categories.
- Causal relationship
-
A relationship between variables in which changing the value of one variable has an effect on the value of another variable.
- Central limit theorem
-
The principle that a sample proportion or a sample mean has a sampling distribution that is approximately normal if the observations are independent and the sample size is sufficiently large.
- Chi-square distribution
-
A probability distribution for a numerical variable that is non-negative and typically right skewed. Characterized by a parameter called its degrees of freedom.
- Chi-square statistic
-
A test statistic used in goodness of fit tests for one-way tables and independence tests for two-way tables and calculated as the sum of squared differences between observed and expected counts divided by expected counts.
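The "sum of squared differences between observed and expected counts divided by expected counts" can be written out directly. A minimal goodness-of-fit sketch (the fair-die counts below are invented data for illustration):

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 60 rolls of a die; a fair die gives an expected count of 10 per face.
observed = [8, 9, 12, 11, 5, 15]
expected = [10] * 6
print(chi_square_statistic(observed, expected))  # 6.0
```

The resulting statistic would be compared to a chi-square distribution with 6 − 1 = 5 degrees of freedom.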
- Chi-square test
-
A hypothesis test that uses a chi-square statistic with a chi-square distribution under the null hypothesis.
- Cluster sample
-
A sample in which the population is grouped into clusters and all cases from a sample of clusters are selected.
- Cohort
-
A group of individuals (cases) who share some characteristic(s) (e.g., all born in the same year).
- Collinear
-
In linear regression, when two numerical predictor variables are highly correlated, and this collinearity complicates model estimation. Also called multicollinearity.
- Complement of an event
-
All outcomes in the sample space that are not in the event.
- Conditional probability
-
The probability of an outcome for one variable or process conditioned on the outcome of another variable or process.
- Confidence interval
-
A range of plausible values around a point estimate where we are likely to find the population parameter. Calculated as a point estimate plus/minus a margin of error.
- Confidence level
-
The level of confidence associated with a confidence interval for a population parameter, e.g., 90%, 95%, ... The higher the confidence level, the greater the confidence we have that the interval contains the population parameter (and hence the wider the interval must be, all else equal).
- Confounding variable
-
A variable correlated with both the explanatory and response variables that can lead to misleading results if left out of the analysis. Also called a lurking variable, confounding factor, or a confounder.
- Contingency table
-
A table that summarizes cross-classified counts of two categorical data variables. Also called a two-way table.
- Continuous variable
-
A numerical variable that can take any numerical value within an interval (e.g., a decimal number between 0 and 1).
- Control group
-
Cases in the sample who do not receive a specific treatment being investigated.
- Controlling
-
Holding constant, as much as possible, factors other than the treatment itself that could cause differences between treatment groups.
- Convenience sample
-
A sample in which easily accessible individuals are more likely to be included in the sample.
- Correlation
-
A measure between –1 and 1 of the strength and direction of the linear relationship between two numerical variables. Also called Pearson's correlation.
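The measure can be computed from its definition as the covariance scaled by the two standard deviations; a stdlib-only sketch with a made-up toy dataset:

```python
import math

def pearson_r(x, y):
    """Pearson's correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 4))   # 1.0  (perfectly increasing)
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 4))   # -1.0 (perfectly decreasing)
```

Values near 0 indicate a weak linear relationship; the sign gives the direction of the association.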
- Data
-
Measurements of one or more variables on a sample of observations.
- Data fishing
-
Informally analyzing data to find patterns without first determining group structure or stating hypotheses to be tested. Can lead to exaggerated claims about significant findings. Also called data snooping.
- Data matrix
-
A common way to organize data in which the matrix rows represent cases (observational units) and the matrix columns represent variables.
- Data snooping
-
Informally analyzing data to find patterns without first determining group structure or stating hypotheses to be tested. Can lead to exaggerated claims about significant findings. Also called data fishing.
- Degrees of freedom
-
A number (or numbers) that characterize a variety of probability distributions, including chi-square, t, and F.
- Dependent variables
-
Variables that show an evident relationship or association with one another. Also called associated variables. [The term "dependent variable" is sometimes used for a "response variable," but that convention is not used in this course.]
- Discrete variable
-
A numerical variable that can only take numerical values with gaps between them (e.g., 0, 1, 2, ...).
- Disjoint events
-
Sets of outcomes that have no outcomes in common. Also called mutually exclusive events.
- Disjoint outcomes
-
Outcomes that cannot happen together. Also called mutually exclusive outcomes.
- Dot plot
-
A one-variable scatterplot that plots the values along the horizontal axis as dots, stacking the dots when there are repeated values.
- Double-blinding
-
Ensuring researchers keep the patients uninformed about their treatment and researchers are also unaware of which patients receive which treatment.
- Empirical rule
-
Rule of thumb for the probability of falling within 1, 2, and 3 standard deviations of the mean in the normal distribution. Also called the 68-95-99.7 rule.
- Error
-
The difference between a sample estimate of a population parameter and the true value of the parameter. Usually unknown since we generally don't know the true value of the parameter.
- Error sum of squares
-
The sum of squared differences between the observed response values and the fitted (predicted) response values. Also called sum of squared errors or residual sum of squares.
- Event
-
A set of outcomes in a random process.
- Expected counts
-
Category counts or frequencies calculated under the null hypothesis in a chi-square test.
- Experimental study
-
A study in which researchers conduct an experiment to investigate the possibility of a causal connection between variables by controlling the values of the explanatory variable(s) for selected cases.
- Explanatory variable
-
When we suspect one variable might causally affect another, we label the first variable the explanatory variable. Also called a predictor variable.
- Extrapolation
-
Using a linear regression model to make a prediction outside the scope of the sample predictor variable values.
- F distribution
-
A probability distribution for a numerical variable that is non-negative and typically right skewed. Characterized by two parameters: numerator degrees of freedom and denominator degrees of freedom.
- F statistic
-
A test statistic used in ANOVA where it is calculated as the mean square between groups divided by the mean square error.
- F test
-
A hypothesis test that uses an F statistic with an F distribution under the null hypothesis.
- First quartile
-
The value for numerical data such that a quarter (25%) of the data fall below this value. Also called lower quartile or 25th percentile. Abbreviated Q1.
- Fitted values
-
The fitted or predicted response values based on a linear model. Also called predicted values.
- Histogram
-
A graph that displays numerical data grouped into intervals or bins using vertical bars.
- Hollow histograms
-
A graph that summarizes numerical data in different groups using outlines of histograms put on the same plot.
- Hypothesis testing
-
A statistical framework used to rigorously evaluate competing ideas and claims.
- Independent events
-
Two events are independent if the events result from independent processes and the probability that both events occur is equal to the product of the probabilities that each event occurs.
- Independent observations
-
Observations are independent if knowing the value(s) of one observation provides no useful information about the value(s) of another observation.
- Independent processes
-
Two random processes are independent if knowing the outcome of one provides no useful information about the outcome of the other.
- Independent variables
-
Two variables that are not associated and have no evident relationship. [Do not mix this up with the concepts of independent processes and independent events in probability. Also the term "independent variable" is sometimes used for an "explanatory variable," but that convention is not used in this course.]
- Indicator variable
-
A binary 0-1 variable that takes the value 1 for one category of a categorical variable and 0 for all other categories. Used as a way to include categorical variables in a linear regression model.
- Influential point
-
In linear regression, an observation that has a particularly large influence on the fit of the model in the sense that its removal would drastically alter the results.
- Interquartile range
-
The difference between the third quartile and the first quartile for numerical data. Encloses the middle 50% of the data. Abbreviated IQR.
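The IQR and the 1.5 × IQR outlier fences can be sketched from scratch. Note that software packages differ in how they compute quartiles; the version below takes Q1 and Q3 as medians of the lower and upper halves, which is one common convention.

```python
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (one convention)."""
    s = sorted(values)
    half = len(s) // 2
    return median(s[:half]), median(s[-half:])

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
q1, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q3, iqr)  # 2.5 7.5 5.0
print([x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr])  # [100]
```

Here 100 lies above the upper fence Q3 + 1.5 × IQR = 15, so it would be flagged as an outlier on a box plot.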
- Joint probability
-
A probability of outcomes for two or more variables or processes.
- Law of large numbers
-
As more observations are collected, the proportion of occurrences with a particular outcome converges to the probability of that outcome.
- Least squares criterion
-
Fitting a linear regression model by minimizing the sum of squared residuals.
- Least squares line
-
The simple linear regression fitted line that results from minimizing the sum of squared residuals.
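For simple linear regression the minimizing slope and intercept have closed forms, which a short sketch can verify on noise-free data (the function name is ours):

```python
def least_squares_line(x, y):
    """Slope and intercept that minimize the sum of squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return slope, intercept

# Data generated from y = 3 + 2x with no noise recovers both coefficients:
x = [0, 1, 2, 3, 4]
y = [3, 5, 7, 9, 11]
print(least_squares_line(x, y))  # (2.0, 3.0)
```

With real data the residuals are nonzero and the fitted line is the best straight-line summary in the least squares sense.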
- Least squares regression
-
A linear regression model fit by minimizing the sum of the squared residuals.
- Levels
-
The possible values of a categorical variable.
- Leverage
-
In linear regression, points with predictor values that are far away from the main distribution of predictor values have high leverage.
- Linear regression
-
A statistical technique for modeling the relationship between a single numerical response variable and one or more numerical or categorical predictor variables using a linear equation. Called simple linear regression (SLR) if there is one numerical predictor or multiple linear regression (MLR) if there are two or more numerical or categorical predictors.
- Linear relationship
-
A relationship between two numerical variables that is linear, i.e., can be described by a straight line in which a fixed increase in one variable is associated with a fixed change in the other variable.
- Margin of error
-
The amount added to or subtracted from a point estimate in the construction of a confidence interval for a population parameter. Calculated by multiplying a percentile from a probability distribution by the standard error of the point estimate.
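As a concrete case, the margin of error for a single proportion multiplies a normal percentile by the standard error of the sample proportion. A sketch under the usual large-sample assumptions (the helper name is ours; 1.96 is the percentile for approximately 95% confidence):

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Point estimate +/- margin of error for a population proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p_hat
    margin = z * se                          # margin of error
    return p_hat - margin, p_hat + margin

lo, hi = proportion_ci(0.5, 100)
print(round(lo, 3), round(hi, 3))  # 0.402 0.598
```

With p̂ = 0.5 and n = 100 the standard error is 0.05, so the margin of error is 1.96 × 0.05 ≈ 0.098.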
- Marginal probability
-
A probability based on a single variable without regard to any other variables.
- Mean
-
A measure of the centre of a distribution of numerical data, calculated by summing the values and dividing by the sample size. Also called the average.
- Median
-
A measure of the centre of a distribution of numerical data, equal to the middle value (for odd sample sizes) or the average of the middle two values (for even sample sizes).
- Mode
-
Value(s) represented by prominent peak(s) in a data distribution.
- Mosaic plot
-
A standardized stacked bar plot for contingency table counts that also displays relative group sizes of the primary variable.
- Multicollinearity
-
In linear regression, when two numerical predictor variables are highly correlated, and this collinearity complicates model estimation.
- Multiple comparisons
-
Comparing multiple pairs of means after an Analysis of Variance (ANOVA).
- Multiple linear regression
-
A statistical technique for modeling the relationship between a single numerical response variable and two or more numerical or categorical predictor variables.
- Multistage sample
-
A sample in which the population is grouped into clusters and random samples of cases from a sample of clusters are selected.
- Mutually exclusive events
-
Sets of outcomes that have no outcomes in common. Also called disjoint events.
- Mutually exclusive outcomes
-
Outcomes that cannot happen together. Also called disjoint outcomes.
- Negative association
-
An association between two numerical variables in which an increase in one variable tends to be associated with a decrease in the other variable.
- Nominal variable
-
A categorical variable in which the levels have no meaningful, natural order.
- Non-response bias
-
Bias introduced into a sample when non-responders are systematically different from responders.
- Nonlinear relationship
-
A relationship between two numerical variables that is not linear, i.e., cannot be described by a straight line and is better summarized by a curved line.
- Normal curve
-
A symmetric, unimodal, bell-shaped distribution for a numerical variable. Also called a normal distribution.
- Normal distribution
-
A symmetric, unimodal, bell-shaped distribution for a numerical variable. Also called a normal curve.
- Null distribution
-
In hypothesis testing, the sampling distribution of a test statistic when the null hypothesis is true.
- Null hypothesis
-
A skeptical perspective or claim to be tested, often represented by a single parameter value. Denoted H_0 (H-nought).
- Null value
-
In hypothesis testing, the population parameter value stated in the null hypothesis to which we compare the sample statistic.
- Numerical variable
-
A variable that can take a range of numerical values, and it is meaningful to add, subtract, or take averages with those values. Also called a quantitative variable.
- Observational study
-
A study in which researchers collect data in a way that does not directly interfere with how the data arise.
- Observational unit
-
An individual in a sample dataset on whom one or more variables have been measured. Also called a case.
- Observed counts
-
Category counts or frequencies observed in the sample data for a chi-square test.
- One-sided hypothesis test
-
A hypothesis test in which the p-value is the tail area in one tail of the null distribution. The alternative hypothesis for a one-sided hypothesis test involving a single population parameter has a "less than" or "greater than" sign. Also called a one-tailed hypothesis test.
- One-way table
-
A table of counts or frequencies representing categories of a single categorical data variable.
- Ordinal variable
-
A categorical variable in which the levels have a meaningful, natural order.
- Outlier
-
A numerical data observation that appears extreme relative to the rest of the data, i.e., is unusually low or high. In the context of a single variable, a value that is less than Q1 − 1.5 × IQR or greater than Q3 + 1.5 × IQR. In the context of linear regression, an observation that falls far from the cloud of points.
- p-value
-
In hypothesis testing, the probability of observing data at least as favourable to the alternative hypothesis as our current data set, if the null hypothesis were true.
- Paired data
-
Two sets of observations are paired if each observation in one set has a special correspondence or connection with exactly one observation in the other set.
- Pearson's correlation
-
A measure between –1 and 1 of the strength and direction of the linear relationship between two numerical variables. Also simply called correlation.
- Percentile
-
The value of a numerical variable such that the specified percentage of the data is less than that value, e.g., 95% of the data is less than the 95th percentile. Also called a quantile.
- Pie chart
-
A graph that displays counts or proportions of categorical data in different groups using wedges of a pie.
- Placebo
-
A "sham" treatment with no known impact on a response variable (e.g., a sugar pill).
- Point estimate
-
A single number used to estimate a population parameter such as a proportion or a mean.
- Pooled proportion
-
An estimate of the population proportion across the entire study when the population is divided into two independent groups.
- Pooled standard deviation
-
An estimate of the population standard deviation across the entire study when the population is divided into two independent groups.
- Population
-
The group of cases that the research questions are about. Also called target population.
- Population mean
-
The mean of the population values for a numerical variable, denoted using the Greek letter "mu."
- Population parameter
-
A summary quantity calculated over the entire population. Usually unknown unless we have a census of the entire population.
- Positive association
-
An association between two numerical variables in which an increase in one variable tends to be associated with an increase in the other variable.
- Post-hoc tests
-
Methods for performing multiple comparisons (of means) after an Analysis of Variance (ANOVA). Includes the Bonferroni correction and Tukey's range test.
- Power
-
In hypothesis testing, the probability of rejecting a null hypothesis when the alternative hypothesis is true. Equal to one minus the probability of making a type 2 error (which is denoted using the Greek letter "beta"). Thus, power = "one minus beta."
- Predicted values
-
The fitted or predicted response values based on a linear model. Also called fitted values.
- Predictor variable
-
When we suspect one variable might causally affect another, we label the first variable the explanatory variable. Also called an explanatory variable.
- Probability
-
The probability of an outcome is the proportion of times the outcome would occur if we observed the random process an infinite number of times.
- Probability distribution
-
A table of all disjoint outcomes of a random process and their associated probabilities.
- Probability tree diagram
-
A graphical tool to organize outcomes and probabilities around the structure of categorical data.
- Prospective study
-
An observational study that identifies individuals and collects information as events unfold.
- Quantile
-
The value of a numerical variable such that the specified proportion of the data is less than that value, e.g., 0.95 (or 95%) of the data is less than the 0.95 quantile. Also called a percentile.
- Quantitative variable
-
A variable that can take a range of numerical values, and it is meaningful to add, subtract, or take averages with those values. Also called a numerical variable.
- R-squared
-
In linear regression, the amount of variation in the response variable that is explained by the fitted least squares equation. Also called the coefficient of determination.
- Randomization
-
Ensuring that patients are randomly placed into treatment groups to account for variables that cannot be controlled.
- Randomized experiment
-
An experiment in which individuals (cases) are randomly assigned to a group (e.g., "treatment" or "control").
- Reference category
-
In linear regression, the category of a categorical variable that has the value 0 for the corresponding indicator variable(s). Used as the "base comparison" for the other categories. Also called reference level.
- Reference level
-
In linear regression, the category of a categorical variable that has the value 0 for the corresponding indicator variable(s). Used as the "base comparison" for the other categories. Also called reference category.
- Regression intercept
-
In simple linear regression, the expected value of the response variable when the predictor variable is 0.
- Regression slope
-
The slope of the simple linear regression fitted line, calculated as "rise over run." In other words, the expected change in the response variable for a one-unit increase in the predictor variable.
- Replication
-
Refers both to collecting a sufficiently large sample of cases (sometimes called replicates) and also replication of an entire study to confirm an earlier finding.
- Residual plot
-
A scatterplot with residuals from a fitted linear model on the vertical axis and either fitted values or a predictor variable on the horizontal axis.
- Residuals
-
Differences between the observed response values and the fitted (predicted) response values.
- Response variable
-
When we suspect one variable might causally affect another, we label the second variable the response variable.
- Retrospective study
-
An observational study that collects data after events have taken place.
- Robust statistics
-
Summary statistics for which extreme observations have little effect on their values, e.g., median and IQR.
- Sample
-
A subset of cases selected from the population and analyzed to provide information about the population.
- Sample mean
-
The mean of the sample values for a numerical variable, denoted by putting a horizontal bar over the variable name.
- Sample size
-
The number of observations in a sample, denoted by n.
- Sample space
-
All possible outcomes of a random process.
- Sample statistic
-
A summary quantity calculated from the sample data. Usually used to estimate the corresponding population parameter.
- Sampling distribution
-
The hypothetical probability distribution of a sample estimate under repeated sampling.
- Sampling error
-
A measure of how much an estimate tends to vary from one sample to the next. Also called sampling uncertainty.
- Sampling uncertainty
-
A measure of how much an estimate tends to vary from one sample to the next. Also called sampling error.
- Scatterplot
-
A graph of two numerical variables in which each sample point is plotted at the intersection of the value of one variable on the horizontal axis and the other variable on the vertical axis.
- Side-by-side bar plot
-
A graph that displays counts or proportions of contingency table counts using side-by-side vertical bars.
- Side-by-side box plot
-
A graph that summarizes numerical data in different groups using side-by-side box plots.
- Significance level
-
In hypothesis testing, the probability of making a type 1 error, that is rejecting a true null hypothesis. Denoted using the Greek letter "alpha" and often set at 0.05.
- Simple linear regression
-
A statistical technique for modeling the relationship between a single numerical response variable and one numerical predictor variable.
- Simple random sample
-
A sample in which each case in the population has an equal chance of being selected and the sample cases have no implied connection.
- Simulation
-
Using a random number generator on a computer to simulate the null distribution of a sample statistic to gauge the unusualness of an observed test statistic.
- Skewed
-
A type of numerical data distribution that is not symmetric but instead has the majority of the data values on one side and a smaller amount of more extreme data trailing off to one side. Left skewed data has a longer left tail, while right skewed data has a longer right tail.
- Stacked bar plot
-
A graph that displays counts or proportions of contingency table counts using stacked vertical bars.
- Standard deviation
-
A measure of the spread of a distribution of numerical data, calculated by the square root of the average squared distance from the mean, i.e., the square root of the variance. Roughly describes how far away the typical observation is from the mean.
- Standard error
-
The standard deviation of a sample estimate based on its sampling distribution.
- Standard normal distribution
-
A normal distribution with a mean of 0 and a standard deviation of 1.
- Statistics
-
The study of how best to collect, analyze, and draw conclusions from data.
- Stratified sample
-
A sample in which the population is divided into groups (strata) of similar cases and cases are sampled from each stratum.
- Sum of squared errors
-
The sum of squared differences between the observed response values and the fitted (predicted) response values. Also called error sum of squares or residual sum of squares.
- Sum of squares total
-
The sum of squared differences between the response values and the sample mean of the response values. Also called total sum of squares.
- Summary statistic
-
A single number summarizing the values of a variable for a sample of observations.
- t distribution
-
A symmetric, unimodal, bell-shaped distribution for a numerical variable that has slightly wider tails than a normal distribution. Characterized by a parameter called its degrees of freedom (df). Its shape becomes closer to a standard normal distribution as df increases. Also called Student's t distribution.
- t statistic
-
A test statistic calculated as the difference between a sample estimate and the hypothesized value divided by the standard error. Also called a T-score.
- t test
-
A hypothesis test that uses a t statistic with a t distribution under the null hypothesis.
- T-score
-
A test statistic calculated as the difference between a sample estimate and the hypothesized value divided by the standard error. Also called a t statistic.
- Third quartile
-
The value for numerical data such that three-quarters (75%) of the data fall below this value. Also called upper quartile or 75th percentile. Abbreviated Q3.
- Total sum of squares
-
The sum of squared differences between the response values and the sample mean of the response values. Also called sum of squares total.
- Transformation
-
A rescaling of numerical data using a mathematical function, e.g., the logarithm.
- Treatment group
-
Cases in the sample who receive a specific treatment being investigated.
- Tukey's range test
-
A method for performing multiple comparisons after an Analysis of Variance (ANOVA). Also called Tukey's HSD.
- Two-sided hypothesis test
-
A hypothesis test in which the p-value is the sum of the tail areas in both tails of the null distribution. The alternative hypothesis for a two-sided hypothesis test involving a single population parameter has a "not equals" sign. Also called a two-tailed hypothesis test.
- Two-way table
-
A table that summarizes cross-classified counts of two categorical data variables. Also called a contingency table.
- Type 1 error
-
In hypothesis testing, rejecting a true null hypothesis.
- Type 2 error
-
In hypothesis testing, failing to reject a null hypothesis when the alternative hypothesis is true.
- Variable
-
A characteristic of an observational unit that has been measured.
- Variance
-
A measure of the spread of a distribution of numerical data, calculated by the average squared distance from the mean.
- Venn diagram
-
A diagram that uses overlapping ovals to show the number of outcomes belonging to particular events for a random process.
- Z-score
-
A standardized numerical data value calculated by subtracting the mean and dividing by the standard deviation. Measures the number of standard deviations that the data value is away from the mean.
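The Z-score calculation is a one-liner; the exam-score numbers below are invented for illustration.

```python
def z_score(x, mean, sd):
    """Standardize a value: how many standard deviations x lies from the mean."""
    return (x - mean) / sd

# A score of 1300 on a test with mean 1100 and standard deviation 200:
print(z_score(1300, 1100, 200))  # 1.0, i.e., one standard deviation above the mean
```

Combined with the 68-95-99.7 rule, a Z-score of 1.0 on normally distributed data places the observation at roughly the 84th percentile.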