ANOVA

class: center, middle, inverse, title-slide

# ANOVA
## <br> <br> <br>
### Andrew Farina
### 2020-06-22

---

layout: true

background-image: url(img/hexlogo.png)
background-size: 65px 75px
background-position: 0.5% 1%

---
class: inverse, center, middle

.left[
# Understanding the Data
* ### Loading the Data
* ### Summary Statistics
* ### Visualizing the Data
]
---

## Data: *stalker*

#### Install the `{pl462}` data package and load the tidyverse:

```r
devtools::install_github("A-Farina/pl462", ref = "main")
library(tidyverse) # For data manipulation
library(ggpubr) # For result output
library(rstatix) # Tidy friendly simple models
```

--
#### From `{pl462}`, load the stalker dataset:
.pull-left100[

```r
data(stalker, package = "pl462")
dat <- stalker
glimpse(dat)
```

```
## Rows: 78
## Columns: 3
## $ Therapy <fct> Cruel to be Kind Therapy, Cruel to be Kind Therapy, Cruel to …
## $ AmtTher <fct> 6 Weeks, 6 Weeks, 6 Weeks, 6 Weeks, 6 Weeks, 6 Weeks, 6 Weeks…
## $ Stalker <dbl> 47, 50, 51, 52, 53, 57, 57, 60, 63, 66, 68, 72, 72, 35, 34, 3…
```
]

---

## Data: Categorical Counts

```r
dat %>% 
  group_by(Therapy, AmtTher) %>% 
  select(-Stalker) %>% 
  table()
```

```
##                           AmtTher
## Therapy                    6 Weeks 12 Weeks
##   Cruel to be Kind Therapy      13       13
##   Psychodyshamic Therapy        13       13
##   None                          13       13
```

---

## Data: Continuous Distributions

```r
dat %>% ggplot(aes(x = Stalker, fill = AmtTher)) + 
  geom_density() + facet_grid(Therapy ~ .) +
  labs(title = "Stalking Hour Distributions",
       subtitle = "According to Therapy Amount and Type", 
       fill = "Therapy Amount")
```

![](index_files/figure-html/unnamed-chunk-2-1.png)

---

## Data: Summary Statistics

```r
dat %>% 
  group_by(Therapy, AmtTher) %>% 
  get_summary_stats(Stalker, type = "mean_sd")
```

```
## # A tibble: 6 x 6
##   Therapy                  AmtTher  variable     n  mean    sd
##   <fct>                    <fct>    <chr>    <dbl> <dbl> <dbl>
## 1 Cruel to be Kind Therapy 6 Weeks  Stalker     13  59.1  8.5 
## 2 Cruel to be Kind Therapy 12 Weeks Stalker     13  37.8 14.2 
## 3 Psychodyshamic Therapy   6 Weeks  Stalker     13  60.1  5.01
## 4 Psychodyshamic Therapy   12 Weeks Stalker     13  42.9 14.9 
## 5 None                     6 Weeks  Stalker     13  76.3 17.1 
## 6 None                     12 Weeks Stalker     13  66.2 20.6
```

---

## Data: Visualize the Data

```r
ggboxplot(dat, x = "AmtTher", y = "Stalker",
          color = "Therapy")
```

![](index_files/figure-html/unnamed-chunk-4-1.png)

---

# Research Question:

Dr. Field is interested in seeing the <span style='color: red;'>most effective</span> therapeutic intervention for stalkers. In addition, in an effort to cut back on the number of weeks he spends treating stalkers, he is trying to discover the <span style='color: red;'>minimum amount of time</span> patients need to be in treatment to show improvement. In order to test his hypotheses, he recruits 78 convicted stalkers to participate in either a 6 week or 12 week therapy program. His post intervention measure of success is the <span style='color: blue;'>number of hours a week spent stalking</span>.

---
class: inverse, center, middle
.left[
# Assumption Testing in R
* ### Outliers
* ### Normality of Residuals
* ### Normality by Group
* ### Homogeneity of Variance
]
---
## Outliers

```r
dat %>% 
  group_by(Therapy, AmtTher) %>% 
  identify_outliers(Stalker)
```

```
## # A tibble: 2 x 5
##   Therapy                  AmtTher  Stalker is.outlier is.extreme
##   <fct>                    <fct>      <dbl> <lgl>      <lgl>     
## 1 Cruel to be Kind Therapy 12 Weeks      12 TRUE       FALSE     
## 2 Psychodyshamic Therapy   12 Weeks      80 TRUE       TRUE
```

.footnote[
One extreme outlier exists in these data. This can be due to: 1) data entry errors, measurement errors, or unusual values.

You can include the outlier in the analysis anyway if you do not believe the result will be substantially affected. This can be evaluated by comparing the result of the ANOVA test with and without the outlier.
]

---
## Normality of Residuals- QQ Plot

```r
model <- lm(Stalker ~ Therapy * AmtTher, data = dat)
ggqqplot(residuals(model))
```

.footnote[
The QQ plot compares the theoretical distribution (normal) with these data. All of the points fall roughly along the reference line so we can assume normality of these data.
]

---
## Normality of Residuals- Shapiro-Wilks Test

```r
shapiro_test(residuals(model))
```

```
## # A tibble: 1 x 3
##   variable         statistic p.value
##   <chr>                <dbl>   <dbl>
## 1 residuals(model)     0.983   0.364
```

.footnote[
The P-Value from the Shapiro-Wilk’s test describes these data as not being significantly different than normal.
]

---
## Normality by group- Shapiro-Wilks Test

```r
dat %>% group_by(AmtTher) %>% 
  shapiro_test(Stalker)
```

```
## # A tibble: 2 x 4
##   AmtTher  variable statistic       p
##   <fct>    <chr>        <dbl>   <dbl>
## 1 6 Weeks  Stalker      0.890 0.00115
## 2 12 Weeks Stalker      0.922 0.00974
```

.footnote[
The Shapiro-Wilk’s method is widely recommended for normality tests. It is based on the correlation between the data and the corresponding normal scores.
]

---
## Normal by group- QQ Plot
.pull-left[

```r
ggqqplot(dat, "Stalker", 
         ggtheme = theme_bw()) + 
  facet_grid(Therapy~AmtTher)
```
]
.pull-right[
![](index_files/figure-html/qq1a-1.png)
]

.footnote[
In order to determine normality graphically we can use the output of a normal Q-Q Plot. If the data are normally distributed then the data points will be close to the diagonal line. 
]

---
## Homogeneity of Variance

```r
dat %>% levene_test(Stalker ~ Therapy * AmtTher)
```

```
## # A tibble: 1 x 4
##     df1   df2 statistic       p
##   <int> <int>     <dbl>   <dbl>
## 1     5    72      3.50 0.00687
```
.footnote[
Levene's test is significant (p<.05). Therefore, we cannot assume the homogeneity of variances in the different groups. <span style='color: red;'>R output will be slightly different from SPSS because R uses a median centered approach and SPSS uses a mean centered approach.</span>. To mirror SPSS output, add: `center = mean` to the function. 
]
---
class: inverse, center, middle

# Two-way ANOVA using R

---

## Tests of Between-Subject Effects

```r
(stalker_aov <- dat %>% 
  anova_test(Stalker ~ Therapy * AmtTher))
```

```
## ANOVA Table (type II tests)
## 
##            Effect DFn DFd      F        p p<.05   ges
## 1         Therapy   2  72 19.244 2.02e-07     * 0.348
## 2         AmtTher   1  72 24.722 4.34e-06     * 0.256
## 3 Therapy:AmtTher   2  72  0.988 3.77e-01       0.027
```
.footnote[
We are interested in the Therapy, Amt_Therapy and Therapy x Amt_Therapy rows of the table. We must first look at the Therapy x AmtTher interaction. We can see from the Sig. column that we do not have a statistically significant interaction. We can see from the above table that there was a significant difference in stalker hours between Therapy Types (P < .001) and between therapy amounts (P < .001).
]
---

## Multiple Comparisons: Therapy Type

```r
dat %>% pairwise_t_test(Stalker ~ Therapy, 
                        p.adjust.method = 'bonferroni') %>% 
  select(-`.y.`, -n1, -n2) %>% knitr::kable("html")
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> group1 </th>
   <th style="text-align:left;"> group2 </th>
   <th style="text-align:right;"> p </th>
   <th style="text-align:left;"> p.signif </th>
   <th style="text-align:right;"> p.adj </th>
   <th style="text-align:left;"> p.adj.signif </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Cruel to be Kind Therapy </td>
   <td style="text-align:left;"> Psychodyshamic Therapy </td>
   <td style="text-align:right;"> 0.508000 </td>
   <td style="text-align:left;"> ns </td>
   <td style="text-align:right;"> 1.00e+00 </td>
   <td style="text-align:left;"> ns </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Cruel to be Kind Therapy </td>
   <td style="text-align:left;"> None </td>
   <td style="text-align:right;"> 0.000004 </td>
   <td style="text-align:left;"> **** </td>
   <td style="text-align:right;"> 1.18e-05 </td>
   <td style="text-align:left;"> **** </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Psychodyshamic Therapy </td>
   <td style="text-align:left;"> None </td>
   <td style="text-align:right;"> 0.000048 </td>
   <td style="text-align:left;"> **** </td>
   <td style="text-align:right;"> 1.44e-04 </td>
   <td style="text-align:left;"> *** </td>
  </tr>
</tbody>
</table>

.footnote[
We are interested in the differences between (1) Cruel to be Kind Therapy and Psychodyshamic Therapy, (2) Cruel to be Kind Therapy and None, and (3) Psychodyshamic Therapy and None. From the results we can see that there is a significant difference between Cruel to be Kind Therapy and none and Psychodyshamic Therapy and none, but not Cruel to be Kind and Psychodyshamic Therapies.
]
---

## Multiple Comparisons: Therapy Amount

```r
dat %>% pairwise_t_test(Stalker ~ AmtTher, 
                        p.adjust.method = 'bonferroni') %>% 
  select(-`.y.`, -n1, -n2) %>% knitr::kable("html")
```

.footnote[
We are interested in the differences between 6 Week and 12 Week therapy amounts. From the results we can see that there is a significant difference between 6 Week and 12 Week therapy amounts.
]
---

## Visualizing The Data

```r
stalker %>% group_by(Therapy, AmtTher) %>% 
  summarize(mean = mean(Stalker)) %>% 
  ggplot(aes(x = AmtTher, y = mean, group = Therapy)) + 
  geom_line(aes(color = Therapy)) + 
  geom_point(aes(color = Therapy)) + 
  labs(title = "Estimated Marginal Means of Time Spent Stalking (hours per week)",
       x = "Amount of Therapy",
       y = "Estimated Marginal Means")
```
.center[
![](index_files/figure-html/means2-1.png)
]

---

## Visualizing The Data

The following plot is not of sufficient quality to present in your reports but provides a good graphical illustration of your results. In addition, we can get an idea of whether there is an interaction effect by inspecting whether the lines are parallel or not.

From this plot we can see how our results from the previous table might make sense. Remember that if the lines are not parallel then there is the possibility of an interaction taking place.
<br><br>

.center[
![](index_files/figure-html/means3-1.png)
]
---
## Simple Main Effects

If we observed a statistically significant interaction, we would then test the simple main effects by grouping one of the variables (AmtTher) and conducting one-way ANOVAs between each level of Therapy. In the R code below, we will group the data by AmtTher and analyze the simple main effects of Therapy on hours of stalking. Here is the code we would use:

```r
model <- lm(Stalker ~ Therapy * AmtTher, data = dat)

dat %>% group_by(AmtTher) %>% 
  anova_test(Stalker ~ Therapy, error = model)
```

```
## # A tibble: 2 x 8
##   AmtTher  Effect    DFn   DFd     F          p `p<.05`   ges
##   <fct>    <chr>   <dbl> <dbl> <dbl>      <dbl> <chr>   <dbl>
## 1 6 Weeks  Therapy     2    72  5.89 0.004      *       0.141
## 2 12 Weeks Therapy     2    72 14.3  0.00000572 *       0.285
```

---

## References

- ### These analyses and [xaringan slides](https://github.com/yihui/xaringan) were produced using only R code
- ### The Code was derived from a datanovia tutorial on ANOVA Analyses. Additional explainations and information can be found in this [datanovia tutorial](https://www.datanovia.com/en/lessons/anova-in-r/)