Exam Reviews

From BYU-I Statistics Text
1 Exam 1 Review


Here are the summaries for each lesson in unit 1. Reviewing these key points will help you prepare for the exam.

Lesson 01 Recap
  • Each lesson follows the same schedule: Individual and Group Preparation, Class Meeting, and Homework Assignment and Quiz. Understanding this layout will help you successfully manage the workload of this class.
  • In this class you will use the online textbook that has been written for you by your statistics teachers. All of the assignments and quizzes will be based on the readings, so study it well.
  • By doing the work, staying on schedule, and living the Honor Code you can succeed in this class!


Lesson 02 Recap
  • The Statistical Process has five steps: Design the study, Collect the data, Describe the data, Make inferences, Take action.
  • In a designed experiment, researchers control the conditions of the study. In an observational study, researchers don't control the conditions but only observe what happens.
  • There are many sampling methods used to obtain a sample from a population:
      • A simple random sample (SRS) is a random selection taken from a population.
      • A systematic sample is every kth item in the population, beginning at a random starting point.
      • A cluster sample is all items in one or more randomly selected clusters, or blocks.
      • A stratified sample divides data into similar groups and an SRS is taken from each group.
      • A convenience sample is one easily obtained in a less-than-systematic way and should be avoided whenever possible.
  • Quantitative variables represent things that are numeric in nature, such as the value of a car or the number of students in a classroom. Categorical variables represent nonnumerical data that can only be considered as labels, such as colors or brands of shoes.
  • The null hypothesis ($H_0$) is the foundational assumption about a population and represents the status quo. The alternative hypothesis ($H_a$) is a different assumption about a population. Using a hypothesis test, we determine whether it is more likely that the null hypothesis or the alternative hypothesis is true.
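
For a concrete picture of the sampling methods above, here is a minimal Python sketch (not part of the course materials; the population of 1,000 numbered individuals is hypothetical) that draws a simple random sample, a systematic sample, and a stratified sample with NumPy:

  import numpy as np

  rng = np.random.default_rng(seed=42)    # reproducible randomness
  population = np.arange(1000)            # hypothetical population: IDs 0..999

  # Simple random sample (SRS): 50 individuals chosen without replacement
  srs = rng.choice(population, size=50, replace=False)

  # Systematic sample: every k-th individual from a random starting point
  k = 20
  start = rng.integers(0, k)
  systematic = population[start::k]

  # Stratified sample: divide into 10 similar groups (here, by hundreds)
  # and take an SRS of 5 from each group
  strata = np.split(population, 10)
  stratified = np.concatenate([rng.choice(s, size=5, replace=False) for s in strata])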


Lesson 03 Recap
  • A histogram allows us to visually interpret data. Histograms can be left-skewed, right-skewed, or symmetrical and bell-shaped.
  • The mean, median, and mode are measures of the center of a distribution. The mean is the most common measure of center, and is computed by adding up the observed data and dividing by the number of observations in the data set.
  • The standard deviation is a number that describes how spread out the data are. A larger standard deviation means the data are more spread out than data with a smaller standard deviation.
  • A parameter is a true (but usually unknown) number that describes a population. A statistic is an estimate of a parameter obtained from a sample.
  • Quartiles/percentiles, Five-Number Summaries, and Boxplots are tools that help us understand data. The five-number summary of a data set contains the minimum value, the first quartile, the median, the third quartile, and the maximum value. A boxplot is a graphical representation of the five-number summary.
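
To check these summary statistics on a homework-sized data set, here is a short NumPy sketch (the data values are made up; note that NumPy's percentile rule can differ slightly from the textbook's quartile method):

  import numpy as np

  data = np.array([12, 15, 15, 18, 21, 24, 30, 35, 41])   # hypothetical data set

  mean = data.mean()         # center: add up the data, divide by n
  median = np.median(data)
  std = data.std(ddof=1)     # sample standard deviation (divides by n - 1)

  # Five-number summary: minimum, Q1, median, Q3, maximum
  five_number = np.percentile(data, [0, 25, 50, 75, 100])
  print(mean, median, std, five_number)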


Lesson 04 Recap
  • The three rules of probability are:
1. A probability is a number between 0 and 1.

$$ 0 \le P(X) \le 1 $$

2. If you list all the outcomes of a probability experiment (such as rolling a die), the probability that one of these outcomes will occur is 1. In other words, the sum of the probabilities of all possible outcomes in any probability experiment is 1.

$$ \sum P(X) = 1 $$

3. The probability that an outcome will not occur is 1 minus the probability that it will occur.

$$ P(\text{not}~X) = 1 - P(X) $$
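
The three rules can be checked directly for a small experiment. Here is a minimal sketch using a fair six-sided die (the course does not require any programming; this is only an illustration):

  from fractions import Fraction

  # Fair six-sided die: each outcome has probability 1/6
  outcomes = {face: Fraction(1, 6) for face in range(1, 7)}

  # Rule 1: every probability is between 0 and 1
  assert all(0 <= p <= 1 for p in outcomes.values())

  # Rule 2: the probabilities of all possible outcomes sum to 1
  assert sum(outcomes.values()) == 1

  # Rule 3: P(not rolling a 3) = 1 - P(rolling a 3) = 5/6
  print(1 - outcomes[3])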



Lesson 05 Recap
  • A normal density curve is symmetric and bell-shaped. The curve lies above the horizontal axis and the total area under the curve is equal to 1.
  • A standard normal distribution has a mean of 0 and a standard deviation of 1. The 68-95-99.7% rule states that when data are normally distributed, approximately 68% of the data lie within 1 standard deviation from the mean, approximately 95% of the data lie within 2 standard deviations from the mean, and approximately 99.7% of the data lie within 3 standard deviations from the mean.
  • A z-score tells us how many standard deviations away from the mean a given value is. It is calculated as: $\displaystyle{z = \frac{\text{value}-\text{mean}}{\text{standard deviation}} = \frac{x-\mu}{\sigma}}$
  • The probability applet allows us to use z-scores to calculate proportions, probabilities, and percentiles.
  • A Q-Q plot is used to assess whether or not a set of data is normally distributed.
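
The applet is the course tool for these calculations, but SciPy's normal-distribution functions return the same areas if you want to double-check an answer. A sketch with made-up numbers (mean 100, standard deviation 15, value 120):

  from scipy import stats

  mu, sigma = 100, 15             # hypothetical population mean and standard deviation
  x = 120

  z = (x - mu) / sigma            # z-score: standard deviations from the mean
  p_below = stats.norm.cdf(z)     # area to the left of z: proportion below x
  p_above = 1 - p_below           # area to the right of z

  # Percentiles go the other way: the value with 95% of the area below it
  percentile_95 = stats.norm.ppf(0.95, loc=mu, scale=sigma)
  print(z, p_below, p_above, percentile_95)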


Lesson 06 Recap
  • The distribution of sample means is a distribution of all possible sample means ($\bar x$) for a particular sample size. It has a mean of $\mu$ and a standard deviation of $\sigma/\sqrt{n}$.
  • The distribution of sample means is normal when the underlying population is normally distributed or when, thanks to the Central Limit Theorem (CLT), our sample size ($n$) is large.
  • The Law of Large Numbers states that as the sample size ($n$) gets larger, the sample mean ($\bar x$) will get closer to the population mean ($\mu$).
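
A short simulation makes both facts concrete. This sketch (population and sample size chosen arbitrarily for illustration) draws 10,000 samples of size 40 from a right-skewed population; the sample means center on $\mu$ with a standard deviation close to $\sigma/\sqrt{n}$, just as the CLT predicts:

  import numpy as np

  rng = np.random.default_rng(7)
  n = 40                                   # arbitrary sample size

  # A right-skewed population: exponential with mean 5 and standard deviation 5
  samples = rng.exponential(scale=5, size=(10_000, n))
  sample_means = samples.mean(axis=1)      # 10,000 sample means

  print(sample_means.mean())               # close to the population mean, 5
  print(sample_means.std(ddof=1))          # close to sigma / sqrt(n)
  print(5 / np.sqrt(n))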


Lesson 07 Recap
  • When the distribution of sample means is normally distributed, we can use z-scores and the probability applet to calculate proportions and probabilities. A z-score is calculated as: $\displaystyle{z = \frac{\text{value}-\text{mean}}{\text{standard deviation}} = \frac{\bar x-\mu}{\sigma/\sqrt{n}}}$
  • The $P$-value is the probability of getting a test statistic at least as extreme as the one you got, assuming $H_0$ is true. A $P$-value is calculated by finding the area under the normal distribution curve that is more extreme (farther away from the mean) than the z-score.
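
Putting both bullets together, here is a sketch (all numbers hypothetical) that computes the z-score for a sample mean and its two-tailed $P$-value:

  from math import sqrt
  from scipy import stats

  mu, sigma = 3.39, 0.25        # hypothetical null-hypothesis mean and known sigma
  n, xbar = 30, 3.27            # hypothetical sample size and sample mean

  z = (xbar - mu) / (sigma / sqrt(n))      # test statistic
  p_value = 2 * stats.norm.cdf(-abs(z))    # area in both tails beyond |z|
  print(z, p_value)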

2 Exam 2 Review


Here are the summaries for each lesson in unit 2. Reviewing these key points will help you prepare for the exam.

Lesson 09 Recap
  • The null hypothesis ($H_0$) is the foundational assumption about a population and represents the status quo. It is a statement of equality ($=$). The alternative hypothesis ($H_a$) is a different assumption about a population and is a statement of inequality ($<$, $>$, or $\ne$). Using a hypothesis test, we determine whether it is more likely that the null hypothesis or the alternative hypothesis is true.
  • The $P$-value is the probability of getting a test statistic at least as extreme as the one you got, assuming $H_0$ is true. A $P$-value is calculated by finding the area under the normal distribution curve that is more extreme (farther away from the mean) than the z-score. The alternative hypothesis tells us whether we look at both tails or only one.
  • The level of significance ($\alpha$) is the standard for determining whether or not the null hypothesis should be rejected. Typical values for $\alpha$ are $0.05$, $0.10$, and $0.01$. If the $P$-value is less than $\alpha$ we reject the null. If the $P$-value is not less than $\alpha$ we fail to reject the null.
  • A Type I error is committed when we reject a null hypothesis that is, in reality, true. A Type II error is committed when we fail to reject a null hypothesis that is, in reality, not true. The value of $\alpha$ is the probability of committing a Type I error.
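
The decision rule itself is a single comparison, as in this tiny sketch (the $P$-value shown is arbitrary):

  alpha = 0.05        # chosen level of significance
  p_value = 0.013     # hypothetical P-value from a test

  if p_value < alpha:
      print("Reject the null hypothesis")
  else:
      print("Fail to reject the null hypothesis")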


Lesson 10 Recap
  • The margin of error gives an estimate of the variability of responses. It is calculated as $\displaystyle{m=z^*\frac{\sigma}{\sqrt{n}}}$ where $z^*$ represents a calculated z-score corresponding to a particular confidence level.
  • A confidence interval is an interval estimator used to give a range of plausible values for a parameter. The width of a confidence interval depends on the chosen confidence level (and its corresponding value of $z^*$) as well as the sample size ($n$). This is the equation for calculating confidence intervals:

$$\displaystyle{\left(\bar x-z^*\frac{\sigma}{\sqrt{n}},~\bar x+z^*\frac{\sigma}{\sqrt{n}}\right)}$$

  • The sample size formula allows us to estimate the number of observations required to obtain a specific margin of error. $\displaystyle{n=\left(\frac{z^*\sigma}{m}\right)^2}$
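
Here is a sketch of all three formulas with invented numbers ($\bar x = 24.9$, $\sigma = 4$, $n = 36$, 95% confidence):

  from math import sqrt, ceil
  from scipy import stats

  xbar, sigma, n = 24.9, 4.0, 36      # hypothetical sample mean, known sigma, and n
  z_star = stats.norm.ppf(0.975)      # z* for 95% confidence, about 1.96

  m = z_star * sigma / sqrt(n)        # margin of error
  print(xbar - m, xbar + m)           # the confidence interval

  # Sample size needed for a margin of error of 1.0 (always round up)
  print(ceil((z_star * sigma / 1.0) ** 2))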


Lesson 11 Recap
  • In practice we rarely know the true standard deviation $\sigma$ and will therefore be unable to calculate a z-score. Student's t-distribution gives us a new test statistic, $t$, that is calculated using the sample standard deviation ($s$) instead.

$$ \displaystyle{ t = \frac {\bar x - \mu} {s / \sqrt{n}} } $$

  • The $t$-distribution is similar to a normal distribution in that it is bell-shaped and symmetrical, but the exact shape of the $t$-distribution depends on the degrees of freedom ($df$).

$$df=n-1$$

  • You will use Excel to carry out hypothesis testing and create confidence intervals involving $t$-distributions.
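
Excel is the course tool here, but the same test is easy to verify in Python. A sketch with an invented sample, testing $H_0: \mu = 12$:

  from scipy import stats

  # Hypothetical sample of n = 8 observations
  sample = [12.1, 11.6, 13.2, 12.8, 11.9, 12.4, 13.0, 12.2]

  # One-sample t-test of H0: mu = 12, using s in place of sigma (df = n - 1 = 7)
  t_stat, p_value = stats.ttest_1samp(sample, popmean=12)
  print(t_stat, p_value)        # the P-value is two-tailed by default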


Lesson 12 Recap
  • The key characteristic of dependent samples (or matched pairs) is that knowing which subjects will be in group 1 determines which subjects will be in group 2.
  • We use slightly different variables when conducting inference using dependent samples:
$$ \begin{array}{ll} \text{Group 1 values:} & x_1 \\ \text{Group 2 values:} & x_2 \\ \text{Differences:} & d \\ \text{Population mean:} & \mu_d \\ \text{Sample mean:} & \bar d \\ \text{Sample standard deviation:} & s_d \end{array} $$
  • When conducting hypothesis tests using dependent samples, the null hypothesis is always $\mu_d=0$, indicating that there is no change between the first population and the second population. The alternative hypothesis can be left-tailed ($<$), right-tailed ($>$), or two-tailed ($\ne$).
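
For illustration, here is a paired test on made-up before/after measurements; SciPy's ttest_rel is equivalent to a one-sample $t$-test on the differences $d$:

  from scipy import stats

  # Hypothetical before/after measurements on the same six subjects (matched pairs)
  before = [201, 188, 215, 197, 183, 190]
  after  = [192, 185, 210, 189, 180, 187]

  # Paired t-test of H0: mu_d = 0; two-tailed P-value by default
  t_stat, p_value = stats.ttest_rel(before, after)
  print(t_stat, p_value)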


Lesson 13 Recap
  • In contrast to dependent samples, two samples are independent if knowing which subjects are in group 1 tells you nothing about which subjects will be in group 2. With independent samples, there is no pairing between the groups.
  • When conducting inference using independent samples we use $\bar x_1$, $s_1$, and $n_1$ for the mean, standard deviation, and sample size, respectively, of group 1. We use the symbols $\bar x_2$, $s_2$, and $n_2$ for group 2.
  • When working with independent samples it is important to graphically illustrate each sample separately. Combining the groups to create a single graph is not appropriate.
  • When conducting hypothesis tests using independent samples, the null hypothesis is always $\mu_1=\mu_2$, indicating that there is no difference between the two populations. The alternative hypothesis can be left-tailed ($<$), right-tailed ($>$), or two-tailed ($\ne$).
  • Whenever zero is contained in the confidence interval of the difference of the true means we conclude that there is no significant difference between the two populations.
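
An illustrative two-sample test on made-up data; equal_var=False asks SciPy not to assume the two groups share a common variance:

  from scipy import stats

  # Hypothetical independent samples from two groups (no pairing)
  group1 = [23.1, 25.4, 22.8, 26.0, 24.3, 23.7]
  group2 = [21.0, 22.5, 20.8, 23.1, 21.9]

  # Two-sample t-test of H0: mu_1 = mu_2
  t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
  print(t_stat, p_value)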


Lesson 14 Recap
  • ANOVA is used to compare the means for several groups. The hypotheses for the test are always:

$$ \begin{align} H_0: & ~ \textrm{All the means are equal} \\ H_a: & ~ \textrm{At least one of the means differs} \end{align} $$

  • For ANOVA testing we use an $F$-distribution, which is right-skewed. The $P$-value of an ANOVA test is always the area to the right of the $F$-statistic.
  • We can conduct ANOVA testing when the following three requirements are satisfied:
1. The data come from a simple random sample.
2. The data are normally distributed within each group.
  • This is satisfied when Q-Q Plots for the data in each group roughly follow a straight line.
3. The variance is constant.
  • This is satisfied when the largest variance is not more than four times the smallest variance.
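
Here is an illustrative one-way ANOVA on three invented groups, including the variance check from requirement 3:

  import numpy as np
  from scipy import stats

  # Hypothetical measurements from three groups
  g1 = [6.2, 5.9, 6.5, 6.1]
  g2 = [7.0, 6.8, 7.3, 6.9]
  g3 = [6.4, 6.6, 6.2, 6.7]

  # One-way ANOVA of H0: all the means are equal
  f_stat, p_value = stats.f_oneway(g1, g2, g3)
  print(f_stat, p_value)        # P-value is the area to the right of F

  # Requirement 3: largest variance at most four times the smallest
  variances = [np.var(g, ddof=1) for g in (g1, g2, g3)]
  print(max(variances) / min(variances))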

3 Exam 3 Review


Here are the summaries for each lesson in unit 3. Reviewing these key points will help you prepare for the exam.

Lesson 16 Recap
  • Pie charts are used when you want to represent the observations as part of a whole, where each slice (sector) of the pie chart represents a proportion or percentage of the whole.
  • Bar charts present the same information as pie charts and are used when our data represent counts. A Pareto chart is a bar chart where the height of the bars is presented in descending order.
  • $\hat p$ is a point estimator for true proportion $p$. $\displaystyle{\hat p = \frac{x}{n}}$
  • The sampling distribution of $\hat p$ has a mean of $p$ and a standard deviation of $\displaystyle{\sqrt{\frac{p\cdot(1-p)}{n}}}$
  • If $np \ge 10$ and $n(1-p) \ge 10$, you can conduct probability calculations using the Normal Probability Applet. $\displaystyle {z = \frac{\textrm{value} - \textrm{mean}}{\textrm{standard deviation}} = \frac{\hat p - p}{\sqrt{\frac{p \cdot (1-p)}{n}}}}$
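
A sketch of these calculations with invented values of $p$, $n$, and $\hat p$:

  from math import sqrt
  from scipy import stats

  p, n = 0.40, 120                 # hypothetical true proportion and sample size
  assert n * p >= 10 and n * (1 - p) >= 10    # check the normality requirement

  p_hat = 54 / 120                 # hypothetical observed sample proportion
  se = sqrt(p * (1 - p) / n)       # standard deviation of the distribution of p-hat
  z = (p_hat - p) / se
  print(z, stats.norm.cdf(z))      # probability of a sample proportion below p-hat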


Lesson 17 Recap
  • The estimator of $p$ is $\displaystyle{\hat p = \frac{x}{n}}$, which is used for both confidence intervals and hypothesis testing.
  • You will use the Excel spreadsheet to perform hypothesis testing and calculate confidence intervals for problems involving one proportion.
  • The requirements for a confidence interval are $n \hat p \ge 10$ and $n(1-\hat p) \ge 10$. The requirements for hypothesis tests involving one proportion are $np\ge10$ and $n(1-p)\ge10$.
  • We can determine the sample size we need to obtain a desired margin of error using the formula $\displaystyle{ n=\left(\frac{z^*}{m}\right)^2 p^*(1-p^*)}$ where $p^*$ is a prior estimate of $p$. If no prior estimate is available, the formula $\displaystyle{ n=\left(\frac{z^*}{2m}\right)^2}$ is used.
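
The spreadsheet does these computations for you, but the formulas are simple to mirror. A sketch with invented counts ($x = 73$, $n = 200$, 95% confidence, desired margin of error $m = 0.03$):

  from math import sqrt, ceil
  from scipy import stats

  x, n = 73, 200                   # hypothetical successes and sample size
  p_hat = x / n                    # check: n*p_hat and n*(1 - p_hat) are both >= 10
  z_star = stats.norm.ppf(0.975)   # z* for 95% confidence

  m = z_star * sqrt(p_hat * (1 - p_hat) / n)     # margin of error
  print(p_hat - m, p_hat + m)                    # 95% confidence interval

  # Sample size for m = 0.03, with and without a prior estimate p*
  print(ceil((z_star / 0.03) ** 2 * p_hat * (1 - p_hat)))   # using p* = p_hat
  print(ceil((z_star / (2 * 0.03)) ** 2))                   # no prior estimate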


Lesson 18 Recap
  • When conducting hypothesis tests using two proportions, the null hypothesis is always $p_1=p_2$, indicating that there is no difference between the two proportions. The alternative hypothesis can be left-tailed ($<$), right-tailed ($>$), or two-tailed ($\ne$).
  • For a hypothesis test and confidence interval of two proportions, we use the following symbols:

$$ \begin{array}{lcl} \text{Sample proportion for group 1:} & \hat p_1 = \displaystyle{\frac{x_1}{n_1}} \\ \text{Sample proportion for group 2:} & \hat p_2 = \displaystyle{\frac{x_2}{n_2}} \end{array} $$

  • For a hypothesis test only, we use the following symbols:

$$ \begin{array}{lcl} \text{Overall sample proportion:} & \hat p = \displaystyle{\frac{x_1+x_2}{n_1+n_2}} \end{array} $$

  • Whenever zero is contained in the confidence interval of the difference of the true proportions we conclude that there is no significant difference between the two proportions.
  • You will use the Excel spreadsheet to perform hypothesis testing and calculate confidence intervals for problems involving two proportions.
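
Here is a sketch of the pooled two-proportion test with invented counts, mirroring what the spreadsheet computes:

  from math import sqrt
  from scipy import stats

  x1, n1 = 84, 150      # hypothetical successes and sample size, group 1
  x2, n2 = 97, 200      # hypothetical successes and sample size, group 2

  p1_hat, p2_hat = x1 / n1, x2 / n2
  p_hat = (x1 + x2) / (n1 + n2)     # overall sample proportion (hypothesis test only)

  # Test statistic for H0: p1 = p2, with its two-tailed P-value
  z = (p1_hat - p2_hat) / sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
  print(z, 2 * stats.norm.cdf(-abs(z)))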


Lesson 19 Recap
  • The $\chi^2$ hypothesis test is a test of independence between two variables. These variables are either associated or they are not. Therefore, the null and alternative hypotheses are the same for every test:

$$ \begin{array}{lcl} H_0: & \text{The (first variable) and the (second variable) are independent.} \\ H_a: & \text{The (first variable) and the (second variable) are not independent.} \end{array} $$

  • The degrees of freedom ($df$) for a $\chi^2$ test of independence are calculated using the formula $df=(\text{number of rows}-1)(\text{number of columns}-1)$
  • In our hypothesis testing for $\chi^2$ we never conclude that two variables are dependent. Instead, we say that two variables are not independent.
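
An illustrative $\chi^2$ test of independence on an invented two-way table, using SciPy:

  import numpy as np
  from scipy import stats

  # Hypothetical contingency table of observed counts (2 rows x 3 columns)
  observed = np.array([[30, 45, 25],
                       [20, 35, 45]])

  chi2, p_value, df, expected = stats.chi2_contingency(observed)
  print(chi2, p_value, df)      # df = (2 - 1)(3 - 1) = 2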

4 Exam 4 Review


Here are the summaries for each lesson in unit 4. Reviewing these key points will help you prepare for the exam.

Lesson 21 Recap
  • Creating scatterplots of bivariate data allows us to visualize the data by helping us understand its shape (linear or nonlinear), direction (positive, negative, or neither), and strength (strong, moderate, or weak).
  • The correlation coefficient ($r$) is a number between $-1$ and $1$ that tells us the direction and strength of the linear association between two variables. A positive $r$ corresponds to a positive association while a negative $r$ corresponds to a negative association. A value of $r$ closer to $-1$ or $1$ indicates a stronger association than a value of $r$ closer to zero.
  • The covariance is a measure of how two variables vary together. The formula for the covariance is $s_{xy}=r \cdot s_x \cdot s_y$.
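
A sketch with invented data confirming that $s_{xy}=r\cdot s_x\cdot s_y$ agrees with NumPy's covariance:

  import numpy as np

  # Hypothetical bivariate data
  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

  r = np.corrcoef(x, y)[0, 1]               # correlation coefficient
  s_x, s_y = x.std(ddof=1), y.std(ddof=1)   # sample standard deviations

  print(r * s_x * s_y)                      # covariance via s_xy = r * s_x * s_y
  print(np.cov(x, y)[0, 1])                 # NumPy's covariance: same value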


Lesson 22 Recap
  • In statistics, we write the linear regression equation as $\hat Y=b_0+b_1X$ where $b_0$ is the Y-intercept of the line and $b_1$ is the slope of the line. The values of $b_0$ and $b_1$ are calculated using software.
  • Linear regression allows us to predict values of $Y$ for a given $X$. This is done by first calculating the coefficients $b_0$ and $b_1$ and then plugging in the desired value of $X$ to compute the predicted value $\hat Y$.
  • The independent (or explanatory) variable ($X$) is the variable which is not affected by what happens to the other variable. The dependent (or response) variable ($Y$) is the variable which is affected by what happens to the other variable. For example, in the correlation between number of powerboats and number of manatee deaths, the number of deaths is affected by the number of powerboats in the water, but not the other way around. So, we would assign $X$ to represent the number of powerboats and $Y$ to represent the number of manatee deaths.
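
A sketch of the calculation using SciPy's linregress function; the numbers below are invented for illustration, not the real powerboat/manatee data:

  from scipy import stats

  # Hypothetical data: X = powerboats registered, Y = manatee deaths
  x = [447, 460, 481, 498, 513, 526]
  y = [13, 21, 24, 16, 24, 20]

  fit = stats.linregress(x, y)     # software computes b0 and b1 for us
  b0, b1 = fit.intercept, fit.slope

  x_new = 500
  print(b0, b1, b0 + b1 * x_new)   # predicted Y-hat for X = 500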


Lesson 23 Recap
  • The unknown true linear regression line is $Y=\beta_0+\beta_1X$ where $\beta_0$ is the true y-intercept of the line and $\beta_1$ is the true slope of the line.
  • A residual is the difference between the observed value of $Y$ for a given $X$ and the predicted value of $Y$ on the regression line for the same $X$. It can be expressed as:

$$ \text{Residual} = Y - \hat Y = Y - (b_0 + b_1 X) $$

  • To check all the requirements for bivariate inference you will need to create a scatterplot of $X$ and $Y$, a residual plot, and a Q-Q plot of the residuals.
  • We conduct a hypothesis test on bivariate data to determine whether there is a linear relationship between the two variables. To do this, we test whether or not the slope ($\beta_1$) equals zero. The appropriate hypotheses for this test are:

$$ \begin{array}{lcl} H_0: & \beta_1=0 \\ H_a: & \beta_1\ne0 \end{array} $$

  • For bivariate inference we use software to calculate the sample coefficients, residuals, test statistic, $P$-value, and confidence intervals of the true linear regression coefficients.
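
A closing sketch (same invented data as the Lesson 22 example) showing the residuals, the two-tailed $P$-value for the test of $\beta_1=0$, and a 95% confidence interval for the true slope:

  from scipy import stats

  x = [447, 460, 481, 498, 513, 526]     # same hypothetical data as above
  y = [13, 21, 24, 16, 24, 20]

  fit = stats.linregress(x, y)
  residuals = [yi - (fit.intercept + fit.slope * xi) for xi, yi in zip(x, y)]

  # Test of H0: beta_1 = 0 (linregress reports the two-tailed P-value)
  print(fit.slope, fit.pvalue)

  # 95% confidence interval for beta_1, using df = n - 2
  t_star = stats.t.ppf(0.975, df=len(x) - 2)
  print(fit.slope - t_star * fit.stderr, fit.slope + t_star * fit.stderr)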