Lesson 17: Inference for One Proportion

From BYU-I Statistics Text



1 Lesson Outcomes

By the end of this lesson, you should be able to:

  • Confidence Intervals for a single proportion:
    • Calculate and interpret a confidence interval for a single proportion given a confidence level.
    • Identify a point estimate and margin of error for the confidence interval.
    • Show the appropriate connections between the numerical and graphical summaries that support the confidence interval.
    • Check the requirements for the confidence interval.
    • Calculate a desired sample size given a level of confidence and margin of error (with or without a prior estimate of the population proportion).
  • Hypothesis Testing for a single proportion:
    • State the null and alternative hypothesis.
    • Calculate the test-statistic and p-value of the hypothesis test.
    • Assess the statistical significance by comparing the p-value to the α-level.
    • Check the requirements for the hypothesis test.
    • Show the appropriate connections between the numerical and graphical summaries that support the hypothesis test.
    • Draw a correct conclusion for the hypothesis test.


2 Confidence Interval for One Proportion

2.1 Honesty at Medical School


Frederick Sierles and his colleagues distributed an anonymous survey to students at two American medical schools. The questionnaire was given during class without any prior announcement to students. The authors of the study personally supervised the distribution and collection of the surveys. 95% of the students completed the survey, and students from all four years of medical school training were represented. A total of 428 individuals participated in the survey. Among this group, 249 people indicated that they had cheated in some way during medical school. The results were published in a journal article in 1980.

We want to use the data from this study to generalize to a larger population. We are not usually interested in the particular individuals' responses. The study was conducted to provide an estimate of the true population proportion, $p$. The sample proportion, $\hat p = \frac{249}{428} = 0.582$, is called a point estimate of $p$: it is a single point on the number line that estimates the value of the true proportion.

A point estimate like $\hat p$ is helpful, but it does not give us direct information on how close it is to the true parameter, $p$. We use a confidence interval to find a range of plausible values for the parameter.

2.2 Confidence Intervals

To find a confidence interval for one population proportion, $p$, we follow the same pattern as was done in the estimates for $\mu$ in the lesson titled Inference for One Mean: Sigma Known (Confidence Interval). We start with the point estimate of $p$ and we add and subtract a certain number of standard deviations from this value.

The point estimate for $p$ is $\hat p$. You might want to review the mean and standard deviation of the random variable $\hat p$ in the lesson on Describing Categorical Data: Proportions; Sampling Distribution of a Sample Proportion. Traditionally, people have used these equations to create confidence intervals for the population proportion.

The formula for the confidence interval for one proportion is: $$ \left( \displaystyle {\hat p - z^* \sqrt{\frac{\hat p (1-\hat p)}{n}}, \hat p + z^* \sqrt{\frac{\hat p (1-\hat p)}{n}}} \right) $$ where $\displaystyle{ \hat p = \frac{x}{n} }$.

You can use the normal probability applet to compute $z^*$. Please see the lesson on Inference for One Mean: Sigma Known (Confidence Interval) if you need to review this procedure.

Be sure that you do not round any values until the last step. Please perform this entire computation without rounding.

Remember that for a 95% confidence interval, $z^* = 1.96$. So, the lower bound for the 95% confidence interval for the true proportion $p$ is: $$ \displaystyle { \hat p - z^* \sqrt{\frac{\hat p (1-\hat p)}{n}} = \frac{249}{428} - 1.96 \sqrt{\frac{\frac{249}{428} \left(1-\frac{249}{428}\right)}{428}} = 0.535 } $$ The upper bound for the 95% confidence interval for the true proportion $p$ is: $$ \displaystyle { \hat p + z^* \sqrt{\frac{\hat p (1-\hat p)}{n}} = \frac{249}{428} + 1.96 \sqrt{\frac{\frac{249}{428} \left(1-\frac{249}{428}\right)}{428}} = 0.629 } $$

The 95% confidence interval for the true proportion of medical students who cheat is: $(0.535, 0.629)$. To interpret this interval, we say that we are 95% confident that the true proportion of people who cheat in medical school is between 0.535 and 0.629. This represents the range of plausible values for the true proportion of students who cheat at these medical schools.
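The hand calculation above can be sketched in a few lines of Python. This is only an illustration, not part of the course materials: the function name `proportion_ci` is made up for this example, and $z^*$ is taken from the standard library's `statistics.NormalDist` rather than from the normal probability applet.

```python
import math
from statistics import NormalDist

def proportion_ci(x, n, confidence=0.95):
    """Large-sample z interval for one proportion (hypothetical helper)."""
    p_hat = x / n
    # z* leaves (1 - confidence) / 2 in each tail of the standard normal curve
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Sierles study: x = 249 students reported cheating out of n = 428
lower, upper = proportion_ci(249, 428)
print(round(lower, 3), round(upper, 3))
```

Rounding happens only in the final `print`, which follows the advice above to carry full precision through the computation.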

2.2.1 Requirement

As with other procedures, certain requirements must be checked for this confidence interval to be valid. The confidence interval is valid whenever $n \hat p \ge 10$ and $n(1-\hat p) \ge 10$. For the data on cheating in medical school, we have $n \hat p = 428 \cdot 0.582 = 249$ and $n(1-\hat p) = 428 \cdot (1-0.582) = 179$, which are both at least 10, so this requirement is satisfied.
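Because $\hat p = x/n$, the two quantities in the requirement are simply the number of successes, $x$, and the number of failures, $n - x$. A minimal sketch of the check, with a hypothetical function name:

```python
def ci_requirements_met(x, n, threshold=10):
    """Check n * p_hat >= 10 and n * (1 - p_hat) >= 10 (hypothetical helper).

    n * p_hat equals the success count x, and n * (1 - p_hat)
    equals the failure count n - x.
    """
    successes = x
    failures = n - x
    return successes >= threshold and failures >= threshold

# Sierles study: 249 successes and 428 - 249 = 179 failures
print(ci_requirements_met(249, 428))
```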

2.3 Using Excel to perform these calculations

Finding confidence intervals for one proportion using only a calculator is tedious. An Excel spreadsheet has been created to help you quickly and accurately perform these calculations. You will use this spreadsheet throughout this and other lessons.

To download this file, click here: CategoricalDataAnalysis.xls



For this example we will consider the "Honesty at Medical School" data above.

Step 1: Open the Excel file CategoricalDataAnalysis.xls
Lesson 17 pic 1.PNG
The blue boxes indicate the input spaces. These are the only cells into which you will enter data. For this example we will only be considering the first three blue boxes.


Step 2: Input the appropriate values into the designated cells. We will input the value of $x$ and the value of $n$ from the data above. The third box indicates our desired level of confidence. By default it is set to 0.95, meaning a 95% confidence level.
Lesson 17 pic 4.PNG
You may have noticed that after you input the values of $x$ and $n$ into the blue cells, the values of the other cells changed automatically. Excel performs all of the necessary calculations for you.


Step 3: The 95% confidence interval is given in the output boxes.
Lesson 17 pic 5.PNG
Compare this confidence interval with the one you calculated by hand. They're the same!

2.4 Another Study on Honesty at Medical School

DeWitt C. Baldwin, Jr. and others conducted a larger study to assess how widespread cheating is in medical schools. Elected class officers at 40 schools were invited to distribute a survey to their second-year classmates. Surveys were completed by students from 31 of the 40 schools. Among all students attending the 31 schools, 62% participated in the survey, yielding a total of $n=2426$ surveys. Out of this group, $x=114$ admitted to cheating in medical school. These results were published in Academic Medicine in 1996.

Answer the following questions:
1. Are the requirements for creating a confidence interval satisfied?
Yes, $n \hat p = 2426 \cdot 0.047 = 114$ and $n(1-\hat p) = 2426 \cdot (1-0.047) = 2312$ are both at least 10, so it is appropriate to use this procedure to estimate the true proportion of students who cheat in medical school.


2. What is the value of $\hat p$ in this study?

$ \displaystyle{ \hat p = \frac{114}{2426} = 0.047} $


3. Use Excel to calculate the lower bound for the 95% confidence interval for the true proportion $p$.

$ \displaystyle{\hat p - z^* \sqrt{\frac{\hat p (1-\hat p)}{n}} = \frac{114}{2426} - 1.96 \sqrt{\frac{\frac{114}{2426} \left(1-\frac{114}{2426}\right)}{2426}} = 0.039} $


4. Use Excel to help you find the 95% confidence interval for the true proportion of medical students who cheat based on the data from this larger study.
The upper bound for the 95% confidence interval for the true proportion of students who cheat in medical school is:

$$ \hat p + z^* \sqrt{\frac{\hat p (1-\hat p)}{n}} = \frac{114}{2426} + 1.96 \sqrt{\frac{\frac{114}{2426} \left(1-\frac{114}{2426}\right)}{2426}} = 0.055 $$

So, the 95% confidence interval for the true proportion of students who cheat at medical school is: $(0.039, 0.055)$


5. Compare the confidence intervals obtained from the Sierles study to the confidence interval from Baldwin's study. How do the results compare to each other?
The first study estimated that the true proportion of cheaters in medical school is in the range (0.535, 0.629), while the second study gave a much lower range of plausible values, (0.039, 0.055). The two intervals do not overlap, so it seems quite likely that at least one of the studies is not accurate.


6. What are some possible factors that might explain the discrepancy in these two studies?
Possible factors include: having elected class officers distribute the surveys may have skewed the results; the 31 schools may have less cheating than the original 2; and the absence of the non-participating schools may have skewed the results. Another possibility is that cheating is more prevalent in the later years of medical school, since the second study only examined second-year students.


7. How would you feel if you knew that your doctor cheated in medical school?
Any thoughtful answer is sufficient, but we would likely not be happy!


8. Write a paragraph explaining why it is important to you to be honest in all your dealings with your fellow men--including your academic pursuits. Be sure to include a discussion of your future plans with regard to this issue.
Any thoughtful paragraph is sufficient.

 

3 Sample Size Calculations


Think about it: What happens to the margin of error in a confidence interval if the sample size is increased?



If you can reduce the margin of error by increasing the sample size, then you can achieve a specific margin of error by choosing a large enough sample. So, if you are planning a future study, you can estimate the sample size you need to obtain a desired margin of error, $m$.
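To see the effect numerically, the short sketch below evaluates the margin of error $z^* \sqrt{\hat p (1-\hat p)/n}$ for a few sample sizes. The value $\hat p = 0.5$ and the sample sizes are chosen purely for illustration, and the function name is hypothetical.

```python
import math

def margin_of_error(p_hat, n, z_star=1.96):
    """Margin of error for a one-proportion z interval (hypothetical helper)."""
    return z_star * math.sqrt(p_hat * (1 - p_hat) / n)

# Illustrative values only: p_hat = 0.5 with increasing sample sizes
for n in (100, 400, 1600):
    print(n, round(margin_of_error(0.5, n), 4))
```

Each time the sample size is multiplied by 4, the margin of error is cut in half, because $n$ appears under a square root.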

The formula for the margin of error is: $$ m = z^* \sqrt{\frac{\hat{p} (1- \hat{p})}{n}} $$ If we solve this equation for $n$, we get: $$ n = \left( \frac{z^*}{m} \right)^2 \hat{p} (1-\hat{p}) $$ Note that this equation requires us to know the value of $\hat{p}$. Unless we do a study, we do not know the value of $\hat{p}$. Sometimes we have a prior estimate of the true proportion of successes, denoted $p^*$.
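For readers who want the intermediate algebra, the steps for solving the margin of error formula for $n$ can be written out as follows (square both sides, then isolate $n$): $$ m = z^* \sqrt{\frac{\hat{p} (1- \hat{p})}{n}} \quad\Rightarrow\quad m^2 = (z^*)^2 \, \frac{\hat{p} (1- \hat{p})}{n} \quad\Rightarrow\quad n = \frac{(z^*)^2 \, \hat{p} (1- \hat{p})}{m^2} = \left( \frac{z^*}{m} \right)^2 \hat{p} (1-\hat{p}) $$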

If we have a prior estimate of $p$ (namely $p^*$), we can plug this value into the equation above to compute the sample size required to obtain our desired margin of error: $$ n = \left( \frac{z^*}{m} \right)^2 p^* (1-p^*) $$ where $z^*$ is determined by your confidence level, $m$ is your desired margin of error, and $p^*$ is an estimate of the true proportion of successes. If no prior estimate for $p$ is available, we can use the following formula to compute our sample size: $$ n = \left( \frac{z^*}{2m} \right)^2 $$ The latter formula (where no prior estimate for $p$ is available) will result in excessively large sample sizes if $p$ is small (say, less than 0.3) or large (say, greater than 0.7). Otherwise, the results for the two equations will be fairly similar.
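One way to see where the formula without a prior estimate comes from: the product $p^*(1-p^*)$ is never larger than its value at $p^* = 0.5$, so using $0.5$ gives the largest sample size any value of $p$ could require. $$ p^*(1-p^*) \le 0.5\,(1-0.5) = \frac{1}{4} \quad\Rightarrow\quad n = \left( \frac{z^*}{m} \right)^2 \cdot \frac{1}{4} = \left( \frac{z^*}{2m} \right)^2 $$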

No matter what value you obtain for the sample size, if it is not a whole number, round it up to the next whole number.

3.1 Example

If you want to find the sample size required to get a margin of error of $m=0.03$ with 95% confidence, and previous studies have shown that the true proportion is approximately equal to $p^*=0.82$, then the sample size required would be: $$ \displaystyle { n = \left( \frac{z^*}{m} \right)^2 p^* (1-p^*) = \left( \frac{1.96}{0.03} \right)^2 (0.82) (1-0.82) = 630.02 } $$ We need to round this answer up to the next larger whole number. So, you would need to collect $n=631$ observations to obtain the desired margin of error.
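Both sample size formulas can be wrapped in one small function. This is a sketch under the assumptions of this section; the name `sample_size` and its parameters are hypothetical, and `math.ceil` does the rounding up.

```python
import math

def sample_size(margin, z_star=1.96, p_star=None):
    """Sample size for a desired margin of error (hypothetical helper).

    If p_star (a prior estimate of p) is given, use (z*/m)^2 * p*(1 - p*).
    Otherwise use the formula without a prior estimate, (z*/(2m))^2.
    """
    if p_star is None:
        n = (z_star / (2 * margin)) ** 2
    else:
        n = (z_star / margin) ** 2 * p_star * (1 - p_star)
    # Always round up to the next whole number
    return math.ceil(n)

print(sample_size(0.03, p_star=0.82))  # with the prior estimate from the example
print(sample_size(0.03))               # with no prior estimate
```

With the prior estimate $p^* = 0.82$ this reproduces $n = 631$. Without a prior estimate, the same margin of error calls for $n = 1068$, which illustrates how much larger the sample becomes when $p$ is far from 0.5.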

4 Hypothesis Test for One Proportion

StepsAll.png


4.1 Can You Taste PTC?

Step1.png

The ability to taste the chemical Phenylthiocarbamide (PTC) is hereditary. Some people can taste it, while others cannot. The ability to taste PTC is typically assessed using paper test strips. When a PTC test strip is placed on the tongue, it will either taste like regular paper or else have a bitter taste.


Step2.png

It is believed that 70% of all people are able to taste PTC. Data were collected by Elise Johnson to investigate this claim. Volunteers were provided with PTC test strips and asked if they could taste anything besides paper.


Step3.png

Out of the 118 people who participated in the research, 89 indicated that they can taste PTC. The proportion of people in the sample who could taste PTC is $$ \hat p = \frac{89}{118} = 0.754 $$ In other words, 75.4% of the people surveyed could taste the chemical.

PTC Pie Graph Excel.PNG


Review: For a review of how to make pie graphs and bar charts in Excel, read Describing Categorical Data: Proportions; Sampling Distribution of a Sample Proportion.



Step4.png

The empirical research suggested that the proportion of people who can taste PTC is $\frac{89}{118} = 0.754$, or 75.4%. Is this significantly different from the assumed value of 0.70 (i.e., 70%)? We can test this question using a hypothesis test.

If the following conditions are satisfied:

  • $np \ge 10$
  • $n(1-p) \ge 10$

then the sample size is large enough that the Central Limit Theorem suggests the sample proportion, $\hat p$, is approximately normal. Also, the true mean of $\hat p$ is $p$, and the standard deviation is $\sqrt{\frac{p \cdot (1-p)}{n}}$.

Notice that the requirements are satisfied for the PTC data: $$ \begin{array}{ll} np = 118 \cdot 0.70 = 82.6 \ge 10 & \surd \\ n(1-p) = 118 \cdot (1-0.70) = 35.4 \ge 10 & \surd \end{array} $$

We can use a procedure that mimics the test for a single mean with $\sigma$ known from the lesson titled Inference for One Mean: Sigma Known (Hypothesis Test) to conduct a test for a single proportion.

It is assumed that the true proportion of people who can taste PTC is 0.70. This is the null hypothesis. The alternative hypothesis is that the true proportion is different from 0.70. $$ \begin{align} H_0: & p = 0.70 \\ H_a: & p \ne 0.70 \end{align} $$ We will use the $\alpha=0.05$ level of significance in this test.

If the requirements are satisfied, then $\hat p$ is approximately normal with mean $p$ and standard deviation $\sqrt{\frac{p \cdot (1-p)}{n}}$. The test can be based on the standard normal ($z$) distribution. The test statistic is: $$ z = \frac{\textrm{value}-\textrm{mean}}{\textrm{standard deviation}} = \frac{\hat p - p}{\sqrt{\frac{p(1-p)}{n}}} = \frac{\frac{89}{118} - 0.70}{\sqrt{\frac{0.70(1-0.70)}{118}}} = 1.286 $$

Remember, we assume that the null hypothesis is true, so we use the value given in the null hypothesis for $p$. Using the NormalApplet, you can find the $P$-value. This is a two-tailed test, since the alternative hypothesis includes both values above 0.70 and below 0.70. In the applet, make sure both tails are shaded, then enter the $z$-score of 1.286.

ZShadeBothTails-1-286.png

The combined area in the two tails is 0.1984, which is greater than $\alpha = 0.05$. We fail to reject the null hypothesis.
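The same test can be sketched in Python. This is an illustration rather than the course's method: the function name and the `alternative` labels are hypothetical, and the normal probability applet is replaced by the standard library's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def one_proportion_z_test(x, n, p0, alternative="two-sided"):
    """z test for one proportion (hypothetical helper).

    The standard deviation uses p0, the value from the null hypothesis,
    because the test assumes the null hypothesis is true.
    """
    p_hat = x / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    std_normal = NormalDist()
    if alternative == "less":
        p_value = std_normal.cdf(z)            # left tail
    elif alternative == "greater":
        p_value = 1 - std_normal.cdf(z)        # right tail
    else:
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # both tails
    return z, p_value

# PTC data: 89 of 118 tasters, null hypothesis p = 0.70
z, p_value = one_proportion_z_test(89, 118, 0.70)
print(round(z, 3), round(p_value, 4))
```

Because the code keeps the unrounded $z$-score, the $P$-value it prints may differ from the applet's 0.1984 in the last digit or two. The decision is unchanged, since the $P$-value is well above $\alpha = 0.05$.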


Step5.png

We conclude that there is insufficient evidence to suggest that the true proportion of the population that can taste PTC is different from 0.70. There is no reason to revise existing perspectives on the prevalence of the ability to taste PTC.

Answer the following question:
14. Compare and contrast the test for one mean with $\sigma$ known and the test for one proportion. Give at least two similarities and two differences.
Similarities: There is one population. Both tests are based on the $z$ statistic. Both tests require the use of the normal probability applet.
Differences: The test for means involves quantitative data, the test for a proportion involves categorical data. The formulas for the $z$-score differ.

 

4.2 Using Excel to perform these calculations

The Excel spreadsheet CategoricalDataAnalysis can also be used to perform hypothesis tests for one proportion.

To download this file, click here: CategoricalDataAnalysis.xls


For this example we will consider the "PTC" data above.

Step 1: Open the Excel file CategoricalDataAnalysis
Lesson 17 pic 1.PNG
The blue boxes indicate the input spaces. These are the only cells into which you will enter data. For this example we will be considering all the blue boxes.


Step 2: Input the appropriate values into the designated cells. We will input the value of $x$ and the value of $n$ from the data above. The third box indicates our desired level of confidence for our confidence interval and is also used to give the level of significance, $\alpha$. The level of significance will be $1-\text{value of the cell}$. By default, the cell contains the value 0.95. This means we have a level of significance of $\alpha=1-0.95=0.05$. The fourth cell contains the value of our null hypothesis and is a drop-down list where we can select the type of hypothesis test, i.e. "Greater Than", "Less Than", or "Not Equal To."
Lesson 17 pic 2.PNG
You may have noticed that after you input the values into the blue cells, the values of the other cells changed automatically. Excel performs all of the necessary calculations for you.


Step 3: The results of the hypothesis test are given in the output boxes.
Lesson 17 pic 3.PNG

4.3 Water Quality

Step1.png

Macroinvertebrates are small insects (without an internal skeleton) that live on the bottom of a stream. These insects are ideal for monitoring changes in water quality, because they (1) live nearly all their life in the water, (2) are easy to collect and identify, (3) often live for several years, (4) have a limited ability to migrate, and (5) are influenced by environmental conditions.

In any population of macroinvertebrates, there will be indicators of good health and indicators of poor health. Data are collected by capturing macroinvertebrates and recording whether they indicate good health or poor health for the river. In particular sections of a small river near Bozeman, Montana, about 60% of the indicators observed have historically been associated with good health.


Step2.png

Researchers suspect that the water quality in the area has decreased, suggesting that less than 60% of the indicators will show good health. A random sample of macroinvertebrates was captured from the river.


Step3.png

Among the $n=40$ observed indicators of health, $x=19$ suggested good health. Use this information to answer the following question.

Answer the following question:
15. What is the proportion of the observed indicators that suggested good health? Express your answer as a decimal and a percentage.

$ \displaystyle {\hat p = \frac{x}{n} = \frac{19}{40} = 0.475~\text{or}~47.5\%} $

Water quality pie graph Excel.PNG
Water quality pie graph SPSS.PNG

 


Step4.png

The following questions will guide you through the process of conducting a hypothesis test to determine if the water quality has decreased. Use $\alpha=0.05$ for this test.

Answer the following questions:
16. The two requirements for conducting a hypothesis test for one proportion are

$ \begin{array}{l} np \ge 10 \\ n(1-p) \ge 10 \end{array} $

Are these requirements satisfied?
Yes, the requirements are satisfied.

$ \begin{array}{ll} np = 40 \cdot 0.6 = 24 \ge 10 & \surd \\ n(1-p) = 40 \cdot (1-0.6) = 16 \ge 10 & \surd \end{array} $


17. The null hypothesis is $H_0: p = 0.6$. What is the alternative hypothesis?

$H_a: p < 0.6$


18. Fill in the blanks to compute the $z$-score.

$ \displaystyle{ z = \frac{\hat p - p}{\sqrt{\frac{p(1-p)}{n}}} = \frac{()-0.60}{\sqrt{\frac{0.60(1-0.60)}{40}}} = -1.614} $

The missing value is:

$\displaystyle{\hat p = \frac{19}{40}=0.475}$


19. The $P$-value will be the area under the normal curve to the left of $z$. Why will you only shade the left tail?
We want to test if the water quality has decreased.
The alternative hypothesis is that the proportion of healthy indicators is less than 0.6.


20. Using the Normal Probability Applet, it is determined that the area to the left of $z=-1.614$ is 0.053.
ShadeLeftZ-1-614.png
The shaded area in this figure (0.053) represents the $P$-value for this test. What is the decision for this test: do we reject the null hypothesis or fail to reject the null hypothesis? Give an English sentence summarizing the conclusion.
$P\textrm{-value} = 0.053 > 0.05 = \alpha$.
We fail to reject the null hypothesis.
There is insufficient evidence to suggest that the true proportion of indicators that suggest good health is less than 0.6.
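The steps in questions 16 through 20 can be collected into one short script. It is only a sketch: the variable names are chosen for this example, and `statistics.NormalDist` stands in for the Normal Probability Applet.

```python
import math
from statistics import NormalDist

# Water quality data: x = 19 good-health indicators out of n = 40
n, x = 40, 19
p0 = 0.60      # value of p under the null hypothesis
alpha = 0.05   # level of significance

# Requirements: n * p0 and n * (1 - p0) must both be at least 10
requirements_met = n * p0 >= 10 and n * (1 - p0) >= 10

p_hat = x / n
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# H_a: p < 0.6, so the P-value is the area to the left of z
p_value = NormalDist().cdf(z)
reject = p_value < alpha

print(requirements_met, round(z, 3), round(p_value, 3))
print("reject H0" if reject else "fail to reject H0")
```

The $P$-value is only slightly above $\alpha$, so the comparison is close, but the decision is still to fail to reject the null hypothesis.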

 


Step5.png

Even though the proportion of indicators that suggested good health was less than 60%, it was not statistically significantly less than 60%. Unless future research indicates to the contrary, we cannot say that the water quality in this river has decreased.

5 Summary

Remember...
  • The estimator of $p$ is $\displaystyle{ \hat p = \frac {x}{n}}$, which is used for both confidence intervals and hypothesis testing.
  • You will use the Excel spreadsheet CategoricalDataAnalysis.xls to perform hypothesis testing and calculate confidence intervals for problems involving one proportion.
  • The requirements for a confidence interval are $n \hat p \ge 10$ and $n(1-\hat p) \ge 10$. The requirements for hypothesis tests involving one proportion are $np\ge10$ and $n(1-p)\ge10$.
  • We can determine the sample size we need to obtain a desired margin of error using the formula $\displaystyle{ n=\left(\frac{z^*}{m}\right)^2 p^*(1-p^*)}$ where $p^*$ is a prior estimate of $p$. If no prior estimate is available, the formula $\displaystyle{ n=\left(\frac{z^*}{2m}\right)^2}$ is used.

