# 3.9: Cluster Sampling

Suppose you want to predict the most popular movie among U.S. high school students. Because your population is every high school student in the U.S., a simple random sample is not feasible: you cannot realistically number each student individually. How, then, could you get a **representative sample** to use for extrapolation?

Look to the end of the lesson for the answer.

## Cluster Sampling

Cluster sampling is well suited to extremely large populations and to populations distributed over a large geographic area. The idea is to use simple random sampling (SRS) to choose a limited number of groups, or **clusters**, from the population, and then to apply SRS again within the chosen clusters to select the individual members of the sample.

Since you complete each step in the cluster sampling process using SRS, the results can be used for extrapolation. However, there is still a danger of ending up with a non-representative sample if the clusters you are choosing from are not each representative of the population. (See "Understanding Errors in Cluster Sampling" below.)

The main benefit of cluster sampling is that it reduces a very large population to something more manageable without ruining your ability to gather a **representative sample**.
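The two-stage procedure described above can be sketched in a few lines of Python. This is only an illustration: the function name `two_stage_cluster_sample`, the dictionary layout, and the made-up population of schools and students are assumptions, not part of the lesson.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Pick n_clusters clusters by SRS, then n_per_cluster members of each by SRS.

    `clusters` maps a cluster name to a list of its members (hypothetical layout).
    """
    rng = random.Random(seed)
    # Stage 1: simple random sample of cluster names, without replacement.
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = {}
    for name in chosen:
        members = clusters[name]
        # Stage 2: simple random sample of members inside the chosen cluster.
        k = min(n_per_cluster, len(members))
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical population: 20 schools with 50 students each.
schools = {
    f"school_{i:02d}": [f"s{i:02d}_{j:02d}" for j in range(50)]
    for i in range(20)
}
picked = two_stage_cluster_sample(schools, n_clusters=4, n_per_cluster=5, seed=1)
```

Only the 4 chosen schools ever need a numbered student list, which is where the savings over a full SRS come from.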

### Using Cluster Sampling

A consumer report journalist wants to publish a blog about the most popular cars in the U.S. She has decided to use publicly available vehicle registration data to identify the most often registered car makes. How could she use cluster sampling to help her build a representative sample of U.S. car owners?

Image credit: Axion23, https://www.flickr.com/photos/gfreeman23/11613922755

One way to get a representative sample of vehicle registrations across the whole country would be to number a list of all of the states in the U.S., and then use a **random number generator (RNG)** to pick out 4 or 5 states. Within each chosen state, she could then number the counties, use an RNG to pick a county or two, and then repeat the process to identify cities or towns. By narrowing down the extremely large initial population in this way, she can maintain the randomness of her sample without needing to number every car owner in the U.S.
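The state, county, and town narrowing described above could be sketched as a recursive walk over a nested dictionary. The helper `multistage_sample`, the placeholder names such as `state_A`, and the number of picks at each level are all hypothetical and chosen only for illustration.

```python
import random

def multistage_sample(tree, picks_per_level, rng):
    """Walk a nested dict, choosing picks_per_level[d] keys by SRS at depth d.

    Returns a list of paths (tuples of keys), one per unit chosen at the last level.
    """
    if not picks_per_level:
        return [()]
    k = min(picks_per_level[0], len(tree))
    paths = []
    # Number (sort) the units at this level, then let the RNG pick k of them.
    for key in rng.sample(sorted(tree), k):
        for rest in multistage_sample(tree[key], picks_per_level[1:], rng):
            paths.append((key,) + rest)
    return paths

# Hypothetical sampling frame: 8 states, 5 counties each, 4 towns per county.
# Each town would hold its registration records; they are left empty here.
frame = {
    f"state_{s}": {
        f"county_{s}{c}": {f"town_{s}{c}{t}": [] for t in range(4)}
        for c in range(5)
    }
    for s in "ABCDEFGH"
}

# Pick 3 states, then 2 counties in each, then 1 town in each county.
paths = multistage_sample(frame, picks_per_level=[3, 2, 1], rng=random.Random(7))
```

Each entry in `paths` is a `(state, county, town)` tuple, so only the registrations in those 6 towns need to be collected.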

### Understanding Errors in Cluster Sampling

Kevin is attempting to create a representative sample of students in his school for a poll asking students' opinions on shortening the school day by 1 hour for students over 18 years old. The results of his survey suggest that over 95% of students think it is a bad idea. Kevin is rather surprised that the results are so overwhelmingly negative, and he wonders if he did something wrong when selecting his sample.

If Kevin chose his sample with the cluster sampling method, and started by clustering the students by grade level, can you see why his results might be suspect?

Recall from the lesson that each cluster should be representative of the population. By clustering his samples by grade level, Kevin opened himself up to bias right away. Given the results he received, it is likely that all of his samples ended up being freshmen (approximately 15 years old) who thought it unfair that older students should have a shorter schedule!
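A small sketch can show how much the estimate depends on which grade is drawn when the clusters are grade levels. The per-grade counts below are invented purely for illustration; they are not data from Kevin's school.

```python
import random

# Hypothetical school: students per grade who oppose the proposal,
# out of 100 students in each grade. These counts are invented.
opposed_by_grade = {9: 95, 10: 90, 11: 60, 12: 10}
GRADE_SIZE = 100

# Build the population as (grade, opposes) records.
population = [
    (grade, i < opposed)
    for grade, opposed in opposed_by_grade.items()
    for i in range(GRADE_SIZE)
]

def share_opposed(records):
    """Fraction of records whose respondent opposes the proposal."""
    return sum(opposes for _, opposes in records) / len(records)

# Share of the whole (hypothetical) school that opposes the proposal.
true_share = share_opposed(population)

# Clustering by grade: everyone polled comes from one randomly chosen grade.
rng = random.Random(0)
chosen_grade = rng.choice(sorted(opposed_by_grade))
cluster_estimate = share_opposed([r for r in population if r[0] == chosen_grade])

# What each possible grade cluster would have reported.
estimate_by_grade = {
    g: share_opposed([r for r in population if r[0] == g])
    for g in opposed_by_grade
}
```

With these made-up counts the school-wide share is 63.75%, yet a single grade cluster reports anywhere from 10% to 95%, because no one grade mirrors the whole school.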

### Understanding How to Use Cluster Sampling

How could you use a cluster sample to estimate the average density of various tree types in a large forest?

A common method for this type of study is to use a map. If you lay a virtual grid over a map of the forest, you can then number the squares and use an RNG to identify a number of square clusters of trees. You can then count the number of each type of tree in each cluster.
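The grid approach might look like the following sketch. The forest dimensions, the species list, the randomly generated tree positions, and the function names are all assumptions made for illustration.

```python
import random
from collections import Counter

def sample_grid_cells(n_rows, n_cols, k, rng):
    """Number the cells 0..n_rows*n_cols-1 and pick k of them by SRS."""
    numbers = rng.sample(range(n_rows * n_cols), k)
    return {divmod(n, n_cols) for n in numbers}  # set of (row, col) pairs

def density_estimate(trees, cells, cell_size):
    """Trees per unit area for each species, counting only trees in sampled cells."""
    counts = Counter(
        species
        for x, y, species in trees
        if (int(y // cell_size), int(x // cell_size)) in cells
    )
    sampled_area = len(cells) * cell_size ** 2
    return {species: n / sampled_area for species, n in counts.items()}

# Hypothetical forest: 1000 m x 1000 m, divided into a 10 x 10 grid of 100 m cells.
rng = random.Random(42)
species_list = ["pine", "oak", "birch"]
trees = [
    (rng.random() * 1000, rng.random() * 1000, rng.choice(species_list))
    for _ in range(5000)
]

cells = sample_grid_cells(10, 10, k=8, rng=rng)
densities = density_estimate(trees, cells, cell_size=100)  # trees per square meter
```

In the field, only the trees inside the 8 chosen squares would actually be counted; the full `trees` list exists here only so the sketch has something to count.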

### Earlier Problem Revisited

*Suppose you want to predict the most popular movie among U.S. high school students. Because your population is every high school student in the U.S., a simple random sample is not feasible: you cannot realistically number each student individually. How, then, could you get a representative sample to use for extrapolation?*

This is an ideal opportunity to use a cluster sample. You could number each state and use an RNG to choose a few states, then repeat to choose a couple of school districts in each state, then a few schools from each district, and finally 1 or 2 classes from each school.

## Examples

For Examples 1-3, explain why each scenario is or is not a cluster sample.

### Example 1

Armand chooses 4 of the 10 buses in front of his school, and polls 10 students from each to see if they think buses are comfortable.

This is a valid cluster sample because it is reasonable to assume that the students in each bus are representative of the population of bus riders.

### Example 2

A cup of milk is selected from 10 of the 50 gallons being studied.

This is not a cluster sample; it is merely an SRS, since each gallon can be considered a single unit, and the cup is just a smaller portion of that unit.

### Example 3

5 dogs are chosen from each breed at the show.

This is a stratified sample, not a cluster sample: dogs are drawn from every breed rather than from a few randomly chosen groups, and no single breed is representative of the population of show dogs.

### Example 4

How could you use the cluster method to select a representative sample of the types of energy drink carried by gas stations in Colorado?

You might start with an overlay of a map of Colorado, and use an RNG to identify a few areas. Then sample the types of drink at one store of each gasoline brand located in the chosen areas (since different stores in the same geographical area from the same company usually carry the same inventory).

## Review

For questions 1-10, decide if each situation is an example of a properly selected cluster sample.

1. 150 light bulbs are evaluated from 1 randomly selected pallet every 30 minutes.
2. 5 light bulbs are evaluated from each case of light bulbs.
3. 10 cars are reviewed from each of 10 randomly selected used-car dealers.
4. 15 candy bars are tested from each shipment.
5. 150 laptops are tested from each company.
6. 100 laptops are evaluated from each of 5 randomly selected dealers.
7. 25 students from each grade were asked the names of their favorite bands.
8. 25 students from each school were asked the names of their favorite bands.
9. Gas prices were sampled from each gas station in town to find the cheapest.
10. 15 gas stations were sampled from each town to find the town with the cheapest.

## Vocabulary

Term | Definition |
---|---|
cluster sampling | Cluster sampling involves choosing representatives which are close to other representatives based on a particular factor such as location, age, color, size, etc. |
clusters | A cluster is a naturally occurring subgroup of a population. |
representative sample | A representative sample is a smaller number of members of a population whose responses to events model those of the entire population. |

## Additional Resources

Practice: Cluster Sampling