# 3.6: Sampling Methods


## Sampling Methods

One of the most important applications of statistics is collecting information. Statistical studies are done for many purposes: A government agency may want to collect data on weather patterns. An advertising firm might seek information about what people buy. A consumer group could conduct a statistical study on gas consumption of cars, or a biologist might study primates to find out more about animal behaviors. All of these applications and many more rely on the collection and analysis of information.

One method of collecting information is to conduct a census. In a census, information is collected on all the members of the population of interest. For example, when a class votes for a class president at school, every person in the class votes, so this is an example of a census. With this method, the whole population is polled.

It’s sensible to include everyone’s opinion when the population is small, like that of a high school. But conducting a census on a very large population can be very time-consuming and expensive. An alternative is to use a sampling method. This means that information is collected from a small sample that represents the population with which the study is concerned. The information from the sample is then extrapolated to the population; that is, we assume the results for the whole population would be about the same as the results for the sample.

### Sampling Methods

The word population in statistics means the group of people we wish to study, as opposed to the population at large. When we use sampling to conduct a statistical study, we first need to decide how to choose the sample. It is essential that the sample be representative of the population we are studying. For example, if we are trying to determine the effect of a drug on teenage girls, it would make no sense to include males or older women in our sample.

There are several ways to choose a population sample from a larger group. The two main types of sampling are random sampling and stratified sampling.

#### Random Sampling

This method simply involves picking people at random from the population we wish to poll. However, this doesn’t mean we can simply ask the first fifty people who walk by in the street. For instance, if you were conducting a survey on people’s eating habits, you’d get different results if you were standing in front of a fast-food restaurant than if you were standing in front of a health food store. In a true random sample, everyone in the population must have the same chance of being chosen. Calling people on the phone, for example, might be a better way of getting a random sample for a survey about eating habits.
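
As a minimal sketch of the idea that everyone in the population must have the same chance of being chosen, the Python snippet below draws a simple random sample from a complete list of the population. The roster of 200 numbered students, the sample size of 50, and the function name `simple_random_sample` are made-up values for illustration only.

```python
import random

# Hypothetical population: 200 students identified by the numbers 1 to 200.
population = list(range(1, 201))

def simple_random_sample(members, size, seed=None):
    """Pick `size` distinct members so that each one is equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(members, size)

sample = simple_random_sample(population, 50, seed=1)
print(sorted(sample))
```

The key point is that the draw is made from a list of the whole population, not from whoever happens to walk by.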

#### Stratified Sampling

This method of sampling actively seeks to poll people from many different backgrounds. The population is first divided into different categories (or strata) and the number of members in each category is determined. Gender and age groups are commonly used strata, but others could include salary, education level or even hair color. Then, a sample is made up by picking members from each category in the same proportion as they are in the population. For example, imagine you are conducting a survey that calls for a sample size of 100 people. If you know that 10% of the population you’re studying are males between the ages of 10 and 25, then you would seek 10 males in that age group to be part of your sample. Once those 10 have responded, no more males between 10 and 25 may take part in the survey.
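
The proportional step described above can be sketched in Python. This is one possible way to do the arithmetic, not a prescribed procedure: the function name `proportional_allocation` and all stratum sizes are hypothetical, chosen so that males aged 10 to 25 make up 10% of the population.

```python
def proportional_allocation(strata_sizes, sample_size):
    """Return how many people to sample from each stratum, in proportion to its size.

    Counts are rounded down first; any leftover places go to the strata
    with the largest fractional parts, so the counts add up to sample_size.
    """
    total = sum(strata_sizes.values())
    exact = {name: sample_size * size / total for name, size in strata_sizes.items()}
    counts = {name: int(share) for name, share in exact.items()}
    leftover = sample_size - sum(counts.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts

# Hypothetical population of 2,000 people split into four strata.
strata = {
    "males 10-25": 200,       # 10% of the population
    "females 10-25": 220,
    "males over 25": 760,
    "females over 25": 820,
}
quotas = proportional_allocation(strata, 100)
print(quotas)
```

With these made-up numbers, the quota for males between 10 and 25 comes out to 10 of the 100 places, matching the proportion in the population.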

#### Sample Size

In order for sampling to work well, the sample size must be large enough that chance alone is unlikely to produce an unrepresentative sample. For example, if you randomly sample 6 children, there is a fairly good chance that most or all of them will be boys. If you randomly sample 6000 children, it’s far more likely that they will be approximately equally spread between boys and girls. Even in stratified sampling (when we would likely poll equal numbers of boys and girls) it’s important to have a large enough sample to include other kinds of different viewpoints.
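
The comparison between 6 and 6000 children can be checked with a short calculation. The sketch below assumes, for illustration, that each sampled child is independently equally likely to be a boy or a girl; the function name `prob_boys_between` and the cut-offs (two thirds or more boys, and between 45% and 55% boys) are choices made here, not part of the text.

```python
from math import comb

def prob_boys_between(n, low, high):
    """Probability that a random sample of n children contains between
    low and high boys (inclusive), assuming boys and girls are equally likely."""
    favorable = sum(comb(n, k) for k in range(low, high + 1))
    return favorable / 2 ** n

# At least two thirds boys in a sample of 6 (4, 5 or 6 boys).
p_small = prob_boys_between(6, 4, 6)
# At least two thirds boys in a sample of 6000 (4000 or more boys).
p_large = prob_boys_between(6000, 4000, 6000)
# Between 45% and 55% boys in a sample of 6000.
p_balanced = prob_boys_between(6000, 2700, 3300)

print(p_small)     # 22/64 = 0.34375
print(p_large)
print(p_balanced)
```

Under this assumption, a sample of 6 is mostly or entirely boys about a third of the time, while for a sample of 6000 the same imbalance is vanishingly unlikely.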

The sample size is determined by the precision we want in our estimate for the population. The larger the sample size is, the more precise the estimate is. However, the larger the sample size, the more expensive and time-consuming the statistical study becomes. In more advanced statistics classes you’ll learn how to use statistical methods to determine the best sample size for a given survey.
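
To give a rough sense of the trade-off between precision and cost, the sketch below uses a common textbook approximation that goes beyond this section: for a proportion estimated from a simple random sample of size n, the 95% margin of error is about 1.96 times the square root of p(1 − p)/n. The function name `margin_of_error`, the default p = 0.5, and the sample sizes are assumptions made for illustration.

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated
    from a simple random sample of size n (normal approximation)."""
    return z * sqrt(p * (1 - p) / n)

for n in (100, 400, 1600, 6400):
    print(n, round(margin_of_error(n), 4))
```

Under this approximation, the sample must be four times as large to cut the margin of error in half, which is why extra precision quickly becomes expensive.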

### Choosing a Sampling Method

For a class assignment, you have been asked to find out whether students in your school are planning to attend university after graduating from high school. Students can respond with “yes”, “no” or “undecided”. How will you choose which students to interview if you want your results to be reliable?

The best method for obtaining a representative sample would be stratified sampling. Students in the upper grades might be more sure of their post-graduation plans than students in the lower grades, so it makes sense to divide your sample by grade level. You’ll need to find out what proportion of the total student population is included in each grade, then interview the same proportion of students from each grade when conducting the survey.

### Identifying Biased Samples

Once we have identified our population, it is important that the sample we choose accurately reflect the spread of people present in the population. If the sample we choose ends up with one or more sub-groups that are either over-represented or under-represented, then the sample is biased. The results of a biased sample might not really represent the entire population, so we want to avoid selecting one. Stratified sampling helps, but it doesn’t always eliminate bias in a sample. Even with a large sample size, we may be consistently picking one group over another.

Some surveyors may deliberately seek a biased sample in order to bolster a particular viewpoint. For example, if a group of students were trying to petition the school to allow eating candy in the classroom, they might try to show that a lot of students support this idea by surveying students immediately before lunchtime when they are all hungry. The practice of polling only those who you believe will support your cause is sometimes referred to as cherry picking.

Many surveys may have a non-response bias. For example, if researchers simply hand out questionnaires on a street corner and ask people to fill them out and then mail them in, most people will just throw the questionnaires away. Only people who are really interested in the subject will bother to send them in, and those might also be the people who are more likely to answer the questions a certain way. (Imagine if the questionnaire asked “Do you care a lot about surveys?” People who cared about surveys would answer it, people who didn’t care wouldn’t bother, and a researcher just looking at the surveys that got sent in would conclude that everybody cares about surveys, because everybody who actually answered the survey said yes!)
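
A small calculation shows how strongly non-response can distort a result. All numbers below are made up for illustration: 1,000 questionnaires are handed out, 100 of them to people who care about surveys, and the two groups mail them back at different rates. The function name `share_yes_among_respondents` is also hypothetical.

```python
def share_yes_among_respondents(n_yes, n_no, response_rate_yes, response_rate_no):
    """Expected share of "yes" answers among the questionnaires that are mailed back."""
    returned_yes = n_yes * response_rate_yes
    returned_no = n_no * response_rate_no
    return returned_yes / (returned_yes + returned_no)

# True share of people who care about surveys in the made-up population.
true_share = 100 / 1000

# Assumed response rates: 80% of those who care reply, 5% of those who don't.
observed_share = share_yes_among_respondents(
    n_yes=100, n_no=900, response_rate_yes=0.80, response_rate_no=0.05
)

# Extreme case from the text: people who don't care never reply.
extreme_share = share_yes_among_respondents(100, 900, 0.80, 0.0)

print(true_share, observed_share, extreme_share)
```

With these assumed rates, only 10% of the population would answer yes, yet about 64% of the returned questionnaires say yes; in the extreme case every returned questionnaire says yes.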

Non-response bias may be reduced by conducting face-to-face interviews. When you talk to people in person, you can get them to agree to answer a question before you tell them what it is, and then the people you get answers from won’t just be the people who care a lot about the question.

Self-selected respondents tend to have stronger opinions on subjects than others and are more motivated to respond. For this reason, phone-in and online polls also tend to be poor representations of the overall population. Even if it looks like both sides are responding, the poll may disproportionately represent extreme viewpoints from both sides, while ignoring more moderate opinions which may, in fact, be the majority view. Self-selected polls are generally regarded as unscientific.

A classic example of a biased sample is often associated with the 1948 presidential election. On election night, the Chicago Tribune printed the headline DEWEY DEFEATS TRUMAN, which turned out to be mistaken. A commonly given explanation is that the paper trusted the results of a phone survey. Many households still did not have a telephone at the time, so the people who had one tended to be wealthier than average; therefore, a sample of people who had telephones was not a representative sample of the population at large.

### Identifying Bias in Samples

Identify each sample as biased or unbiased. If the sample is biased, explain how you would improve your sampling method.

a) Asking people shopping at a farmer’s market if they think locally grown fruit and vegetables are healthier than supermarket fruits and vegetables.

This would be a biased sample because people who shop at farmer’s markets are more likely than the average person to think that locally grown produce is better. The study could be improved by interviewing an equal number of people coming out of a supermarket, or by interviewing people in a more neutral environment such as the post office.

b) You want to find out public opinion on whether teachers get paid a sufficient salary by interviewing the teachers in your school.

This is a biased sample because teachers probably would think they should get a higher salary, but that doesn’t mean everybody else would agree. A better sample could be obtained by constructing a stratified sample with people in different income categories.

c) You want to find out if your school needs to improve its communications with parents by sending home a survey written in English.

This is a biased sample because only English-speaking parents would understand the survey, and parents who don’t speak English would be more likely to find that the school doesn’t communicate with them well. The study could be improved by sending different versions of the survey written in languages spoken at the students’ homes.

### Identify Biased Questions

When you are creating a survey, you must think very carefully about the questions you should ask, how many questions are appropriate and even the order in which the questions should be asked. A biased question is a question that is worded in such a way (whether intentionally or not) that it causes a swing in the way people answer it. Biased questions can lead even a representative, unbiased population sample to answer in a way that does not accurately reflect the larger population.

While biased questions are a bad way to judge the overall mood of a population, they are sometimes used by politicians or advertising companies to falsely suggest that a product or policy is more or less popular than it really is.

There are several ways to spot biased questions:

• They may use polarizing language, that is, words and phrases that people associate with emotions:
  ◦ Is it right that farmers murder animals to feed people?
  ◦ How much of your time do you waste on TV every week?
  ◦ Should we be able to remove a person’s freedom of choice over cigarette smoking?
• They may refer to a majority or to a supposed authority:
  ◦ Would you agree with the American Heart and Lung Association that smoking is bad for your health?
  ◦ The president believes that criminals should serve longer prison sentences. Do you agree?
  ◦ Do you agree with 90% of the public that the car on the right looks better?
• The question may be phrased so as to suggest the person asking the question already knows the answer:
  ◦ It’s OK to smoke so long as you do it on your own, right?
  ◦ You shouldn’t be forced to give your money to the government, should you?
  ◦ You wouldn’t want criminals free to roam the streets, would you?
• The question may be phrased in an ambiguous way (often with double negatives), which may confuse people:
  ◦ Do you reject the possibility that the moon landings never took place?
  ◦ Do you disagree with people who oppose the ban on smoking in public places?

In addition to biased questions, the overall design of a survey can be biased in other ways. In particular, question order can play a role. For example, a survey may contain several questions on people’s attitudes to cigarette smoking. Then, if the question “What, in your opinion, are the three biggest threats to public health today?” is asked at the end of the survey, people will be more likely to give “smoking” as one of their answers than they would be if that question had been asked as part of a different survey, or if it had been placed at the beginning of this survey instead of at the end.

## Example

### Example 1

Suppose you are interested in learning how popular the music streaming service Spotify is at your school. You select a random sample of your friends. Is this sample likely to be representative of your school?

When you select a random sample of your friends, not everyone in your school has an equal chance of being selected. In fact, students who are not your friends have no chance of being selected at all. Therefore, this is not a random sample of students at your school. Your sample may be biased because your circle of friends is likely to share similar interests and may not represent the interests of all students at the school. At best, this sample could represent how popular Spotify is with your friends.

## Review

For 1-6, comment on the way the following samples have been chosen. For the unsatisfactory cases, suggest a way to improve the sample choice.

1. You want to find whether wealthier people have more nutritious diets by interviewing people coming out of a five-star restaurant.
2. You want to find out whether a pedestrian crossing is needed at a certain intersection by interviewing people walking by that intersection.
3. You want to find out if women talk more than men by interviewing an equal number of men and women.
4. You want to find whether students in your school get too much homework by interviewing a stratified sample of students from each grade level.
5. You want to find out whether there should be more public buses running during rush hour by interviewing people getting off the bus.
6. You want to find out whether children should be allowed to listen to music while doing their homework by interviewing a stratified sample of male and female students in your school.

For 7-10, a university wants to know if its statistics course is challenging enough for students. Every semester, the university offers several sections of the course. Explain the type(s) of bias most evident in each sampling technique and/or what sampling method is most evident. Be sure to justify your choice.

7. The first 30 students to buy the textbook at the beginning of the next semester.
8. The name of a color is selected at random, and on a given day, all statistics professors ask students wearing that color their opinion on the statistics course.
9. A flier is passed out on campus, asking students who have taken statistics at the university to reply by mail.
10. Five students are selected at random from each section of the statistics course during a given semester.
11. There are 35 students taking statistics in your school, and you want to choose 10 of them for a survey about their impressions of the course. Use your calculator to select a simple random sample (SRS) of 10 students. (Seed your random number generator with the number 10 before starting.) Assuming the students are assigned numbers from 1 to 35, which students are chosen for the sample?
12. For a class assignment, you have been asked to find out how students get to school. Do they take public transportation, drive themselves, get a ride from their parents, carpool, walk, or bike? You decide to interview a sample of students. How will you choose those you wish to interview if you want your results to be reliable?
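
For the exercise about choosing 10 of the 35 statistics students, the same procedure can be sketched in Python instead of on a calculator. This is only an illustration of the steps: Python’s random number generator is not the same as a calculator’s, so even with the seed set to 10 the students selected here will generally differ from the ones your calculator gives.

```python
import random

random.seed(10)              # seed the generator with 10, as the exercise asks
students = range(1, 36)      # students are numbered 1 to 35

# Draw 10 distinct student numbers, each student equally likely to be chosen.
srs = sorted(random.sample(students, 10))
print(srs)
```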

## Vocabulary

| Term | Definition |
| --- | --- |
| counterexample | A counterexample is an example that disproves a conjecture. |