Before going into the details of sampling and selection bias, we need to clarify the concepts of population and sample in statistics. Those who are familiar with these concepts can skip the next section.
Population and sample
Put simply, a population is the whole set of statistical units (for example, all people living in the US when conducting a US income survey) about which we would like to draw statistical conclusions.
In practice, however, it is usually impossible to survey or gather data for every statistical unit. Therefore, we often sample a finite number of statistical units and perform statistical inference on this finite group to infer information about the whole population. This finite group is a sample.
Since we are effectively extrapolating conclusions drawn from a sample to a population, a few statistical concepts must be respected, and sampling and selection bias are among them. The ability to extrapolate in this way is called external validity in statistics.
Random sampling
Random sampling constitutes a selection process where every element in the target population possesses an identical probability of inclusion in the sample at each draw. The resulting collection is termed a simple random sample.
This process can be implemented with replacement, where observations return to the population pool after each selection, allowing for potential re-selection in subsequent draws. Alternatively, random sampling can be conducted without replacement, where chosen observations are no longer available for future selections.
Random sampling is the simplest form of sampling. It requires a complete sampling frame (i.e. access to all statistical units), which may or may not be available. A sample obtained this way is, on average, representative of the underlying population.
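As a minimal sketch of the two variants described above, the snippet below draws a sample with and without replacement using Python's standard `random` module. The population of 100 unit IDs, the sample size of 10 and the seed are illustrative assumptions, not part of the original discussion.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: one ID per statistical unit in the population.
population = list(range(1, 101))

# Without replacement: each unit can appear at most once in the sample.
sample_without = rng.sample(population, k=10)

# With replacement: a drawn unit returns to the pool and may be drawn again.
sample_with = rng.choices(population, k=10)

print("without replacement:", sample_without)
print("with replacement:   ", sample_with)
```

Both calls give every unit the same chance of selection at each draw; only the sample drawn with replacement can contain the same unit more than once.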
Selection bias
Selection bias occurs when statistical units are selected (or sampled) from a population in such a way that randomisation is not properly achieved, so the resulting sample is not representative of the population.
There are many types of selection bias in statistics; this article explains a few of them below.
Sampling bias
Sampling bias occurs when certain statistical units are consistently selected more often than other units. Note that sampling bias is one of the most common causes of selection bias.
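To illustrate, here is a small simulation sketch. It assumes a made-up population of incomes from 1 to 10000 and a hypothetical biased procedure in which the chance of being selected grows in proportion to income; neither comes from a real survey.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed for reproducibility

# Hypothetical population: incomes 1..10000 (arbitrary units).
population = list(range(1, 10001))
population_mean = mean(population)

# Simple random sample: every unit is equally likely to be drawn.
random_sample = rng.sample(population, k=1000)

# Biased sample (assumed mechanism): selection probability is
# proportional to income, so high earners are over-represented.
biased_sample = rng.choices(population, weights=population, k=1000)

random_mean = mean(random_sample)
biased_mean = mean(biased_sample)

print("population mean:   ", population_mean)
print("random sample mean:", random_mean)
print("biased sample mean:", biased_mean)
```

The mean of the random sample should land close to the population mean, while the mean of the biased sample is pulled noticeably upwards because high-income units are selected more often.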
Early stopping
Selection bias can arise when an A/B test is terminated early, at a time when its results happen to support the desired conclusion.
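The effect can be sketched with a simulated A/A test, in which both arms share the same true conversion rate, so any "significant" difference is a false positive. The sample size, peeking interval, conversion rate and the use of a pooled two-proportion z-test with a 1.96 threshold are all illustrative assumptions.

```python
import math
import random

rng = random.Random(1)  # fixed seed for reproducibility

def z_statistic(successes_a, successes_b, n):
    """Pooled two-proportion z statistic for two arms of n users each."""
    pooled = (successes_a + successes_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 0.0
    return (successes_a / n - successes_b / n) / se

def run_aa_test(n_per_arm=1000, peek_every=50, rate=0.5, z_crit=1.96):
    """Simulate one A/A test.

    Returns (flagged_at_any_peek, significant_at_end). The first value is
    what an experimenter who stops at the first significant peek would see.
    """
    a = b = 0
    z = 0.0
    flagged_at_any_peek = False
    for i in range(1, n_per_arm + 1):
        a += rng.random() < rate  # conversion in arm A
        b += rng.random() < rate  # conversion in arm B
        if i % peek_every == 0:
            z = z_statistic(a, b, i)
            if abs(z) > z_crit:
                flagged_at_any_peek = True
    # n_per_arm is a multiple of peek_every, so z is the final-look statistic.
    return flagged_at_any_peek, abs(z) > z_crit

n_sims = 400
results = [run_aa_test() for _ in range(n_sims)]
peeking_rate = sum(flagged for flagged, _ in results) / n_sims
fixed_rate = sum(final for _, final in results) / n_sims

print("false positive rate, stop at first significant peek:", peeking_rate)
print("false positive rate, single look at the end:        ", fixed_rate)
```

Under these assumptions, stopping at the first significant peek declares a winner far more often than looking only once at the planned end, even though no real difference exists.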
Data handling
Selection bias can also arise from filtering a sample obtained from a randomised trial based on certain data properties, such as rejecting bad data.
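As a rough sketch of how this can go wrong, the simulation below assumes a randomised trial with no true effect, then rejects low readings as bad data in the treatment arm only. The distributions, the threshold of 40 and the one-sided cleaning rule are hypothetical choices made for illustration.

```python
import random
from statistics import mean

rng = random.Random(7)  # fixed seed for reproducibility

# Hypothetical randomised trial with NO true effect:
# both arms are drawn from the same distribution.
control = [rng.gauss(50, 10) for _ in range(2000)]
treatment = [rng.gauss(50, 10) for _ in range(2000)]

effect_before = mean(treatment) - mean(control)

# Assumed "cleaning" step applied to one arm only: low readings in the
# treatment arm are rejected as bad data.
threshold = 40
treatment_cleaned = [x for x in treatment if x >= threshold]

effect_after = mean(treatment_cleaned) - mean(control)

print("estimated effect before filtering:", round(effect_before, 2))
print("estimated effect after filtering: ", round(effect_after, 2))
```

Randomisation made the two arms comparable, but the filter breaks that comparability, so a spurious positive effect appears after cleaning.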
Cherry-picking
Selection bias can also occur when a subset of a sample is specifically selected to support a ‘conclusion’, or when a set of studies is specifically selected when conducting a meta-analysis.
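The subset case can be sketched as follows. The simulation assumes an experiment with no true effect, split into 20 hypothetical subgroups of equal size, and then reports only the best-looking subgroup; all numbers are illustrative.

```python
import random
from statistics import mean

rng = random.Random(3)  # fixed seed for reproducibility

n_subgroups = 20     # hypothetical segments, e.g. by country or device
users_per_arm = 100  # users per arm within each subgroup

def subgroup_effect():
    """Estimated effect in one subgroup when the true effect is zero."""
    control = [rng.gauss(0, 1) for _ in range(users_per_arm)]
    treatment = [rng.gauss(0, 1) for _ in range(users_per_arm)]
    return mean(treatment) - mean(control)

effects = [subgroup_effect() for _ in range(n_subgroups)]

# Subgroups have equal size, so the average of the subgroup effects
# equals the effect estimated on the whole sample.
overall_effect = mean(effects)

# Cherry-picking: report only the subgroup that looks best.
best_effect = max(effects)

print("overall effect:          ", round(overall_effect, 3))
print("best subgroup, picked ex post:", round(best_effect, 3))
```

The cherry-picked subgroup shows a larger effect than the full sample purely by chance, which is why subgroups should be chosen before looking at the results.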
Summary
While in some cases it is trivial to avoid selection bias, in other cases it is impossible to avoid it. It is important to understand that such bias exists and how it can affect conclusions drawn even from properly designed experiments.