---
title: "Using hypr for linear regression"
author: "Daniel J. Schad & Maximilian M. Rabe"
date: "Oct 9th, 2019"
output:
  html_vignette:
    number_sections: no
    toc: yes
    toc_depth: 3
editor_options: 
  chunk_output_type: console
vignette: >
  %\VignetteIndexEntry{Using hypr for linear regression}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(hypr)
```

## Background

`hypr` is a package for easy translation between experimental (null) hypotheses, hypothesis matrices and contrast matrices, as used for coding factor contrasts in linear regression models. The package can be used to derive contrasts from hypotheses and vice versa. The first step is to define the hypotheses. This step is independent of the package per se and requires some theoretical background knowledge in null hypothesis significance testing (NHST). This vignette shows two examples of deriving contrasts and using them for statistical analyses.

For a general introduction to `hypr`, see the `hypr-intro` vignette:
```{r, eval = FALSE}
vignette("hypr-intro", package = "hypr")
```

## Simulated dataset

For the examples in this vignette, we are using a simulated dataset with one factor `X` with four levels `X1`, `X2`, `X3`, and `X4`:

```{r}
set.seed(123)
M <- c(mu1 = 10, mu2 = 20, mu3 = 10, mu4 = 40) # condition means
N <- 5
SD <- 10
simdat <- do.call(rbind, lapply(names(M), function(x) {
  data.frame(X = x, DV = as.numeric(MASS::mvrnorm(N, unname(M[x]), SD^2, empirical = TRUE)))
}))
simdat$X <- factor(simdat$X)
simdat$id <- 1:nrow(simdat)
simdat
```

## Example: Treatment contrasts

Assume we would like to test three treatments against a baseline. In a typical treatment contrast, we typically test whether any of the treatment conditions $\mu_2$, $\mu_3$ or $\mu_4$ is significantly different from the baseline condition $\mu_1$. Including the baseline intercept (testing the baseline against zero), this allows us to generate four null hypotheses:

\begin{align}
H_{0_1}:& \; \mu_1 = 0 \\
H_{0_2}:& \; \mu_2 = \mu_1 \\
H_{0_3}:& \; \mu_3 = \mu_1 \\
H_{0_4}:& \; \mu_4 = \mu_1
\end{align}

The `hypr()` function accepts any set of such equations as comma-separated arguments:

```{r}
trtC <- hypr(mu1~0, mu2~mu1, mu3~mu1, mu4~mu1)
```

When calling this function, a `hypr` object named `trtC` is generated which contains all four hypotheses from above as well as the hypothesis and contrast matrices derived from those. We can display a summary like any other object in R:

```{r}
trtC
```

We can use this object to set the factor contrasts of `X` in the `simdat` dataframe:

```{r}
contrasts(simdat$X) <- contr.hypothesis(trtC)
contrasts(simdat$X)
```

```{r}
round(coef(summary(lm(DV ~ X, data=simdat))), 3)
```

The linear regression returns the expected estimates: The intercept is the baseline condition and the three main effects are the differences between the baseline and the three conditions.

## Example: Sum contrast coding

A sum contrast, such as used for ANOVA, with four levels could generate the following null hypotheses:

\begin{align}
H_{0_1}:& \; \mu_1 = \frac{\mu_1 + \mu_2 + \mu_3 + \mu_4}{4} \\
H_{0_2}:& \; \mu_2 = \frac{\mu_1 + \mu_2 + \mu_3 + \mu_4}{4} \\
H_{0_3}:& \; \mu_3 = \frac{\mu_1 + \mu_2 + \mu_3 + \mu_4}{4}
\end{align}

We rewrite them into `hypr`:

```{r}
sumC <- hypr(mu1 ~ (mu1+mu2+mu3+mu4)/4, mu2 ~ (mu1+mu2+mu3+mu4)/4, mu3 ~ (mu1+mu2+mu3+mu4)/4)
sumC
```

We next assign the contrast matrix to the factor `X`:

```{r}
contrasts(simdat$X) <- contr.hypothesis(sumC)
contrasts(simdat$X)
```

Without creating the intermediate `hypr` object, you can also set the contrasts directly like this:

```{r}
contrasts(simdat$X) <- contr.hypothesis(
  mu1 ~ (mu1+mu2+mu3+mu4)/4, 
  mu2 ~ (mu1+mu2+mu3+mu4)/4, 
  mu3 ~ (mu1+mu2+mu3+mu4)/4
)
contrasts(simdat$X)
```

Finally, we run the linear regression: 

```{r}
round(coef(summary(lm(DV ~ X, data=simdat))),3)
```