# Lesson 2: The Statistical Process & Design of Studies

## 1 Lesson Outcomes

By the end of this lesson, you should be able to:

• Distinguish between a categorical and a quantitative variable.
• Distinguish between an observational study and an experiment.
• Distinguish between a population and a sample.
• Distinguish and give an example of each of the following sampling schemes:
  • Simple random sampling
  • Systematic sampling
  • Cluster sampling
  • Stratified sampling
  • Convenience sampling
• Explain the significance of using a random sample.

## 2 Introduction

Statistics are used in every aspect of society. Every statistical analysis follows a pattern we will call the Statistical Process. This process will be introduced in this lesson and will be used throughout the course.

## 3 The Statistical Process and Daniel's Experiment

Stained-glass depiction of Daniel's deliverance from the lions' den
in the old Dominican priory church at Hawkesyard in Staffordshire, England
(Photo credit: Fr Lawrence Lew, O.P. Used by permission.)

The Old Testament prophet Daniel planned one of the earliest recorded scientific research studies. We will use his example to illustrate the following five steps of The Statistical Process.

• Step 1: Design the study
• Step 2: Collect data
• Step 3: Describe the data
• Step 4: Make inferences
• Step 5: Take action

The following icons can help you remember these steps. Notice that each icon has a letter and an image to help you remember the five steps of the Statistical Process.

### 3.1 Step 1: Design the Study

An important step in scientific inquiry or problem solving can be to state a research question such as:

• Will internet advertising increase a company's revenue?
• Does expressing gratitude increase a person’s satisfaction with life in general?
• Does a newly developed vaccine prevent the spread of disease?

Researchers also investigate the background of the situation. What have other people discovered about this situation? How can we find the answer to the research question? What do we need to do? What is the population (or total collection of all individuals) under consideration? What kind of data need to be collected?

Before collecting data, researchers make a hypothesis, or an educated guess about the outcome of their research. A hypothesis is a statement such as the following:

• Using internet advertising will increase the company’s sales revenue.
• People who express gratitude will be more satisfied with life than those who do not.
• A newly-developed vaccine is effective at preventing tuberculosis.

Daniel’s Experiment

Watch the first 6 minutes and 20 seconds of the following video:

After taking Israel captive, Babylon’s King Nebuchadnezzar wanted Israelites to serve in his palace. He asked his chief officer to bring Israelite children who were “well favoured, and skilful in all wisdom, and cunning in knowledge, and understanding science…to stand in the king’s palaces” (Daniel 1:4). To aid their preparation, Nebuchadnezzar planned to feed them his meat and wine for three years (Daniel 1:5).

Daniel did not want to defile himself by partaking of the king’s meat and wine. He asked permission to eat pulse[1] and drink water instead. His supervisor, Melzar, was afraid to displease the king. He thought that after eating pulse and drinking water, the selected Israelites would look worse than their peers, and he would be punished (Daniel 1:8-10).

With an understanding of the background of the situation, Daniel proposed an experiment. He said, “Prove thy servants, I beseech thee, ten days; and let them give us pulse to eat, and water to drink. Then let our countenances be looked upon before thee, and the countenance of the children that eat of the portion of the king’s meat: and as thou seest, deal with thy servants” (Daniel 1:12-13). In short, Daniel’s implied research question can be stated as: Will those who eat pulse and drink water appear healthier than those who eat the king’s meat and drink his wine? Melzar agreed to the experiment.

Answer the following question:
1. What is Daniel’s hypothesis?
Daniel's hypothesis is that the Israelite children who eat pulse and drink water will appear healthier in just ten days, compared to those who eat the king’s meat and drink his wine.

### 3.2 Step 2: Collect Data

When designing a study, much attention is given to the process by which data are observed. When examining data, it is also important to understand the data collection procedures. A sample is a subset (a portion) of a population. How is this sample obtained? How are the observations made?

Daniel’s study design required that data be collected at the end of 10 days. Melzar would compare the appearances of two groups of people: (1) Israelites who ate pulse and drank water versus (2) Israelites who ate the king’s meat and drank his wine.

### 3.3 Step 3: Describe the Data

When we describe data, we use any tools appropriate to the situation. This can include creating graphs or calculating statistics to help understand or visualize the data.

For Daniel’s experiment, the data are described in Daniel 1:15: “And at the end of ten days [the] countenances [of those who ate pulse] appeared fairer and fatter in flesh than all the children which did eat the portion of the king’s meat.”

### 3.4 Step 4: Make Inferences

Inference is the process of using the information contained in a sample from a population to make a general statement (i.e. to infer something) about the entire population. Later in the course we will learn techniques that make this type of analysis possible.

Melzar made an inference. Based on the results of the sample, he determined that (in general) those who eat pulse and drink water will be healthier than those who eat the king’s meat and drink his wine (Daniel 1:15-16).

### 3.5 Step 5: Take Action

The goal of a statistical analysis is to determine which action to take in a particular situation. Actions can include many things: launching an internet ad campaign (or not), expressing gratitude (or not), getting vaccinated (or not), etc.

Melzar took action as described in Daniel 1:16: “Thus Melzar took away the portion of their meat, and the wine that they should drink; and gave [all the Israelite children] pulse.”

Was the experiment a success? “Now at the end of the days that the king had said he should bring them in…the king communed with them; and among them all was found none like Daniel, Hananiah, Mishael, and Azariah… And in all matters of wisdom and understanding, that the king enquired of them, he found them ten times better than all the magicians and astrologers that were in all his realm” (Daniel 1:18-20).

### 3.6 Summary of the Statistical Process

Daniel’s experience can also help you learn the Statistical Process. Look at the first letter of each of the steps in the Statistical Process. You can use the phrase "Daniel Can Discern More Truth" to help you remember the five steps.

The Statistical Process

| Step | Mnemonic | Action |
|---|---|---|
| Step 1 | Daniel | Design the study |
| Step 2 | Can | Collect data |
| Step 3 | Discern | Describe the data |
| Step 4 | More | Make inferences |
| Step 5 | Truth | Take action |

The Statistical Process will be used throughout the course. Take time to memorize the five steps.

The study designed by the Old Testament prophet Daniel provides an ancient example of a designed experiment. Daniel’s experiment included two groups of people: those who had the experimental treatment of eating pulse and drinking water (called the treatment group) and those who ate the standard food, the king’s meat (called the control group). The treatment group receives the experimental procedure. The control group is used for comparison.

Answer the following question:
2. Why was it important that Daniel’s experiment included a control group?
If there were no control group, there would be no way to compare the effect of the diets (the treatments). Having a control group allows a researcher to see the effect of not taking any action. For Daniel, the control group (who ate the king's meat and drank his wine) provided a basis for comparing the effect of the new treatment (i.e., eating pulse and drinking water).

"Piled Higher and Deeper" by Jorge Cham

## 4 Design of Studies

Most research projects can be classified into one of two basic categories: observational studies or designed experiments. In an experiment, researchers control (to some extent) the conditions under which measurements are made. In an observational study, researchers simply observe what happens, without controlling the conditions under which measurements are made. Both types of study follow the five steps of the Statistical Process.

### 4.1 Designed Experiments

In a designed experiment, researchers manipulate the conditions that the participants experience. They often do this by randomly assigning subjects to one of two groups, a "treatment" group and a "control" group. The experiment is conducted by applying some kind of treatment to the subjects in the treatment group, and observing the effect of the treatment. Those in the control group do not receive the treatment and are also observed. In this way researchers can determine the effects of the treatment. The following example illustrates the use of these two groups.

Jonas Salk’s First Polio Vaccine Trial

Answer the questions about the video in each of the following steps of the Statistical Process.

Step 1: Design the study.

1. What year did the Salk polio vaccine trials occur?
2. At that time, parents were afraid to let their children…
3. How many children participated in the trial?
4. The children were separated into two groups. What did the video say would practically guarantee that the two groups would be nearly identical?
5. How was Salk’s vaccine administered?

Step 2: Collect Data

6. What was the primary outcome for the researchers in the Salk trial?
7. How did the researchers collect the data?

Step 3: Describe the Data

8. Complete the quote from the video: “As the Salk trial progressed, it did appear as if the group that got the vaccine had ________ instances of polio.”

Step 4: Make Inferences

9. Complete the quote from the video: “As the researchers tabulated the data, they found that the differences were ________.”

Step 5: Take Action.

10. Which of the following actions is most likely at the conclusion of the Salk trial?
Elizabeth Toy, a former BYU-Idaho teacher, was one of the children in the Salk trials.
Jonas Salk’s vaccine trial was a great example of a designed experiment. However, it did not start out that way. Initially, there were some serious flaws in the study design. Other researchers sharply criticized these errors. Almost 1.1 million children participated in the initial study. Even though the sample size was large, the flawed study design made the data useless!

Jonas Salk’s Second Study

After fixing the design, Jonas Salk enrolled hundreds of thousands of additional children for the second phase of his study. Consider how Jonas Salk applied the Statistical Process in this second study.

Step 1: Design the study

Jonas Salk enrolled hundreds of thousands of children in his second study. The participants in a study are commonly called subjects. Sometimes subjects are called experimental units or simply units. In the Salk trials, the children who participated are the subjects.
The subjects were randomly assigned to one of two groups. The first group was given the experimental vaccine, the treatment. The treatment is the new or experimental condition that is imposed on the subjects. The subjects who receive the treatment make up the treatment group.
The second group was given a placebo. A placebo is a control treatment designed to look like the real treatment but to have no effect. In this study, the placebo was an injection that looked just like the vaccine, but contained a harmless saline solution. The placebo group or control group is made up of the subjects assigned to receive the placebo.
This study was double blind. Neither the children's parents nor their doctors knew whether a particular child received the treatment or the control. Both parties were blinded to this information.
Answer the following questions:
3. Some children can be identified as having a higher risk of developing polio. Would it have been better if they were assigned to the treatment group so they could get the vaccine?
No. The two groups need to be as similar as possible. Specifically, the people in the treatment group need to have the same potential (on average) of contracting polio as the people in the control group. If we put the people who are at a higher risk of developing polio in the treatment group, we run the risk of having more people in the treatment group getting polio simply because they are more likely to get it, whether they are vaccinated or not. Likewise, we might have fewer people in the control group getting polio just because they are less likely to get it, whether they are vaccinated or not.
These two effects would create a bias against the vaccine, by making the vaccine look like it doesn't work, or doesn't work as well as it does. It might also make it appear that people who aren't vaccinated stay healthy and the vaccine is not needed. There is even a chance that people will conclude that the vaccine actually gives people polio.
Randomly assigning subjects to the two groups tends to yield groups with similar characteristics (in this example, similar potential for contracting polio). Randomly assigning subjects to groups therefore defends us against problems like those mentioned in the previous paragraph.
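
The balancing effect of random assignment can be illustrated with a small simulation. The sketch below is not from the Salk trial: the 10,000 subjects, the 20% high-risk share, and the names `randomly_assign` and `high_risk_share` are all assumptions made for illustration.

```python
import random

def randomly_assign(subject_ids, seed=None):
    """Shuffle the subjects and split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical data: 10,000 children, with IDs 0-1999 flagged as high risk.
subjects = range(10_000)
high_risk = set(range(2_000))

treatment, control = randomly_assign(subjects, seed=1954)

def high_risk_share(group):
    """Fraction of a group that belongs to the high-risk set."""
    return sum(1 for s in group if s in high_risk) / len(group)

treatment_share = high_risk_share(treatment)
control_share = high_risk_share(control)
print(f"High-risk share, treatment: {treatment_share:.3f}")
print(f"High-risk share, control:   {control_share:.3f}")
```

Both shares should land close to 0.20, even though nobody looked at risk status when forming the groups.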

4. Why is it important for the subject and those who assess the health of the subject to be unaware of whether or not that child received the vaccine?
Subjects: Suppose a subject in the study thinks they're being treated. It has been documented that subjects with such knowledge tend to show improvement whether they are receiving the treatment or not. To see why, consider how you might feel and act if you were told you had been vaccinated. You might have a more hopeful outlook, leading to healthier living habits such as better hygiene and nutrition. Such changes would tend to reduce your chance of contracting polio whether you've received the vaccine or not. This might make the vaccine look like it works better than it does. It also might make the vaccine look like it works, even if it doesn't.
Now suppose subjects in the control group know they are not being treated. This can also change the way they feel and act, in ways that can make them more likely to contract polio than they would be if they weren't in the study. This could make it look like the incidence of polio among unvaccinated persons is higher than it is, again making the vaccine look like it works better than it does.
To reduce bias caused by such errors, subjects should not know to which group they are assigned.
Researchers: Suppose a researcher assessing the health of a subject is told that the subject is in the control group. It has been documented that in such a case, the researcher is more likely to record that the subject has symptoms even if the subject is not actually in the control group. This makes it look like unvaccinated persons are more likely to get polio than they really are, which makes it look like the vaccine works better than it does.
There are other effects of knowing to which group the subject belongs, such as doctors treating or advising the patient differently than they would without such knowledge. Such differences can make it harder to tell whether the vaccine works, and how well.
To reduce bias caused by such effects, those assessing the health of the subjects should not be told to which group the subject belongs.

The null and alternative hypotheses for this study are:
$H_0$: The proportion of children who develop polio will be the same for the treatment and control groups.
$H_a$: The proportion of children in the treatment group who develop polio will be lower than the proportion of children in the control group who develop polio.

Step 2: Collect data.

The researchers followed up with each child to determine if they contracted polio. They recorded the number of children in each group that developed polio during the study period. Not all of Salk's experiments were double-blind. Here is a summary of the results from the regions where a double-blind study was conducted (Francis et al., 1955; Brownlee, 1955):

Children Who Developed Polio

| Group | Yes | No | Total |
|---|---|---|---|
| Treatment Group | 57 | 200,688 | 200,745 |
| Placebo Group | 142 | 201,087 | 201,229 |

Step 3: Describe the data.

One way to summarize the data is to compute the proportion of children in each group that developed polio. The proportion of children in the treatment group that developed polio during the study period is:
$$\frac{57}{200745} = 0.000~283~9$$
Answer the following questions:
5. Calculate the proportion of children in the placebo group that developed polio during the study period.
$\displaystyle{\frac{142}{201229} = 0.000~705~7}$

6. Compare the two proportions. What do you observe?
The proportion of children in the placebo group that developed polio during the study period was more than double the proportion of children in the treatment group that developed polio during the study period. That suggests that the treatment is effective in reducing the proportion of children that will develop polio.
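
The two proportions and their ratio can also be checked with a few lines of Python. This is only a sketch of the arithmetic; the counts come from the table of results, and the variable names are chosen here for illustration.

```python
# Counts from the double-blind portion of the trial (see the table of results).
polio_cases = {"treatment": 57, "placebo": 142}
group_sizes = {"treatment": 200_745, "placebo": 201_229}

# Proportion of children in each group who developed polio.
proportions = {
    group: polio_cases[group] / group_sizes[group] for group in polio_cases
}

# How many times larger the placebo proportion is than the treatment proportion.
ratio = proportions["placebo"] / proportions["treatment"]

for group, p in proportions.items():
    print(f"{group}: {p:.7f}")
print(f"placebo / treatment: {ratio:.2f}")
```

The ratio comes out a little under 2.5, which is consistent with "more than double."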

Step 4: Make inferences

A hypothesis test was conducted to compare the two proportions. The test computes the probability, assuming the null hypothesis is true, of observing a difference in these proportions (between those who received the vaccine and those who received the placebo) at least as extreme as the one Salk obtained. The probability was 0.00000000093. Because this probability is so small, it is highly unlikely that a difference this large would arise by chance alone if the vaccine had no effect. This probability is called the $P$-value, which will be discussed in more detail later.
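
The lesson does not say which test produced this probability. As a rough check, the sketch below assumes a one-sided two-proportion $z$-test with a pooled standard error, which is one common choice for comparing two proportions; the function name and the choice of test are assumptions, not something stated in the source.

```python
import math

def one_sided_two_proportion_z_test(x1, n1, x2, n2):
    """Test H_a: p1 < p2 using a pooled two-proportion z-test (assumed method)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion under H_0
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_error
    # Lower-tail probability P(Z <= z) for a standard normal variable.
    p_value = 0.5 * math.erfc(-z / math.sqrt(2))
    return z, p_value

# Group 1 is the treatment group; group 2 is the placebo group.
z, p_value = one_sided_two_proportion_z_test(57, 200_745, 142, 201_229)
print(f"z = {z:.2f}, P-value = {p_value:.2e}")
```

Under these assumptions the test statistic is about $-6$ and the $P$-value is on the order of $10^{-9}$, in line with the probability quoted above.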

Step 5: Take action

Once it was clear that the vaccine was effective, children who were unvaccinated or had received the placebo were given Salk’s vaccine. Since 1954, there has been a marked decrease in the number of polio cases worldwide (Offit, 2005). Public health researchers continue to work to eradicate this disease around the world.

### 4.2 Observational Studies

In an observational study researchers observe the responses of the individuals, without controlling the conditions experienced by the individuals. Therefore, they do not assign the participants to treatment or control groups.

Observational studies commonly occur in business settings. One example is a financial audit. The purpose of a financial audit is to assess the accuracy of a company’s financial business practices. ImmunAvance Ltd., a non-government health care organization, hired the Accounting Office at Global Optimization Unlimited to perform an independent audit of their financial practices. ImmunAvance provides inoculation and other preventative health care services in rural African communities.

Imagine you are assigned to serve as a member of the team that will perform this audit. The audit illustrates the five steps of the Statistical Process.

Step 1: Design the study

The volume of financial transactions conducted by ImmunAvance makes it impossible to conduct a census, that is, an examination of the entire collection of ImmunAvance’s financial documents. Instead, you will collect a manageable group of items (called the sample) from the entire collection of financial documents (called the population). A sample is a subset or a portion of a population. The information gained from the sample is used to make an inference (or generalization) about the population.

Auditors typically cannot consider every item in a population, because there are too many. When it is not possible to conduct a census, auditors face sampling risk. Sampling risk is the risk associated with not auditing every item in the population. It is the risk that the sample may not adequately reflect the population. The only way to eliminate sampling risk is to conduct a census, which is usually not practical. Auditors can reduce sampling risk by obtaining a sample randomly. This is called random selection. Another way to reduce sampling risk is to increase the sample size, the number of items sampled.

### 4.3 Sampling Methods

Step 2: Collect data

There are several procedures that can be used to select a sample from a population, including: simple random sampling (SRS), systematic sampling, cluster sampling, stratified sampling, and convenience sampling (or, haphazard sampling). These are examples of sampling methods.
A simple random sample (SRS) is the best method for obtaining a random sample. If there is a list of all items in the population and they are all accessible, an SRS can be collected. For example, suppose there are 2,000 accounts receivable items in the population. Auditors can use a random number generator to choose values between 1 and 2,000 to identify which items are to be audited. Software can be used to create a list of random numbers corresponding to items for the audit. In Excel, the command to obtain a random number between 1 and 2,000 is =RANDBETWEEN(1,2000). Sometimes it is necessary for auditors to renumber the items (1 to 2,000) to help create a random sample. All the statistical procedures in this course assume that simple random sampling has been used.
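
As an alternative to a spreadsheet, a simple random sample can be drawn in Python. The sketch below assumes the 2,000 items are numbered 1 to 2,000 and that 40 of them are to be audited; the sample size of 40 and the function name `simple_random_sample` are chosen here for illustration. Unlike repeated calls to a random number generator, which can return the same value twice, `random.sample` draws without replacement, so no item is selected more than once.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw item numbers 1..population_size without replacement."""
    rng = random.Random(seed)
    item_numbers = range(1, population_size + 1)
    return sorted(rng.sample(item_numbers, sample_size))

# Hypothetical audit: pick 40 of the 2,000 accounts receivable items.
audit_items = simple_random_sample(2000, 40, seed=2024)
print(audit_items)
```

Fixing the seed makes the selection reproducible, which can be useful when an audit's sampling needs to be documented.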
In a systematic sample, auditors select every $k^{\text{th}}$ item in the population, beginning at a random starting point. As an example, suppose there are 2,000 accounts receivable items to be audited, and the auditors want to get a sample of 40 items from this population. Because $2000 \div 40 = 50$, the auditors would sample every 50th item. The auditors will use a random starting point in the first 50 items. So, suppose a random number generator gave 41 as a random number between 1 and 50. The auditors select the 41st item. The next item chosen will be item number 41+50=91. After that, the auditors will choose item number 91+50=141, and so on. Systematic sampling works well when the items are in a random, sequential ordering. If the items are not arranged randomly, a systematic sample can miss important parts of the population.
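
The systematic sampling example can be sketched in Python as follows. The function name `systematic_sample` and its parameters are hypothetical; the starting point of 41 is passed in explicitly to match the example, and a random start is drawn only when none is supplied.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Select every k-th item number, beginning at a start within the first k."""
    k = population_size // sample_size  # sampling interval
    if start is None:
        start = random.Random(seed).randint(1, k)  # random start in 1..k
    return list(range(start, population_size + 1, k))[:sample_size]

# Reproduce the example: 2,000 items, 40 to be sampled, random start of 41.
selected = systematic_sample(2000, 40, start=41)
print(selected[:3], "...", selected[-1])
```

With a start of 41 and an interval of 50, the selection runs 41, 91, 141, and so on up to item 1,991.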
A cluster sample (sometimes called a block sample) consists of taking all items in one or more randomly selected clusters, or blocks. For example, all the accounts receivable items in a few randomly selected months (the blocks) could be sampled. When the variation from one block to another is relatively low, compared to the variation within the block, cluster sampling is a reasonable way to get a sample. However, in many cases, a large number of blocks must be sampled in order to draw a reasonable audit conclusion. For this reason, a cluster sample is usually not recommended for audits.
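
A cluster sample by month might be sketched as below. The ledger here is made up: it assumes 2,000 item numbers spread roughly evenly over 12 months, and the names `cluster_sample` and `ledger` are hypothetical.

```python
import random

def cluster_sample(items_by_cluster, n_clusters, seed=None):
    """Pick clusters at random, then take every item in the chosen clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(items_by_cluster), n_clusters)
    sample = [item for cluster in chosen for item in items_by_cluster[cluster]]
    return chosen, sample

# Hypothetical ledger: items 1..2000 filed in order under months 1..12.
ledger = {month: [] for month in range(1, 13)}
for item in range(1, 2001):
    month = (item - 1) * 12 // 2000 + 1
    ledger[month].append(item)

chosen_months, sampled_items = cluster_sample(ledger, n_clusters=3, seed=7)
print("Months audited:", sorted(chosen_months))
print("Items audited:", len(sampled_items))
```

Note that the randomness applies only to which months are chosen; once a month is selected, every item in it is audited.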

## 7 References

Bible Dictionary, “Pulse” at http://lds.org/scriptures/bd/pulse.

Brownlee, K. A. (1955). Statistics of the 1954 polio vaccine trials. Journal of the American Statistical Association, 50(272), 1005-1013.

Francis, T., et al. (1955). An evaluation of the 1954 poliomyelitis vaccine trials. American Journal of Public Health and the Nation's Health, 45(5).

Offit, P. A. (2005). Why are pharmaceutical companies gradually abandoning vaccines? Health Affairs, 24(3), 622-630. doi:10.1377/hlthaff.24.3.622