---
title: "From Design to Dataframe"
author: "Maximilian M. Rabe & Reinhold Kliegl"
date: '2019-09-28'
output:
  html_vignette:
    # highlight: pygments
    number_sections: yes
    # theme: cosmo
    toc: yes
    toc_depth: 3
editor_options:
  chunk_output_type: console
vignette: >
  %\VignetteIndexEntry{From Design to Dataframe}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Setup

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(designr)
# Set a "seed" for the random number generator
set.seed(12345)
```

# Experimental Design(s)

In an experimental design, we distinguish between random and fixed factors. The "levels" of the random factors are quasi-random samples from a population of persons (subjects) or material (items). To avoid confusion with levels of fixed factors, we will refer to levels of random factors as _instances_. For fixed factors, usually (quasi-)experimental ones, we must specify whether they are between- or within-subjects and between- or within-items. For a given fixed factor, all four combinations are possible in principle. We also need to decide on a counterbalancing scheme; a common example is a Latin square applied to all or a subset of the factors.

In this vignette, we illustrate how to set up an experiment using subject (Subj) and item (Item) as random factors. In this fictive experiment, the words of a text are presented serially, one at a time, at a _slow_, _medium_, or _fast_ rate (i.e., fixed factor _Speed_ with three levels) at the center of the screen. A second factor is cognitive load, varying whether or not subjects have to keep six digits in memory while reading (i.e., fixed factor _Load_ with two levels _yes_ and _no_).

Typically, in such an experiment, (1) each subject reads different texts (Item) in the 2 x 3 experimental conditions; (2) each subject reads the same number of texts in each condition; (3) across subjects, each text (item) is presented equally often in the six experimental conditions.
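
The three constraints can be illustrated with a cyclic Latin square built in base R. This is only a sketch of the counterbalancing idea and does not use `designr`; the object names `conditions` and `square` are hypothetical, and the square that `designr` generates may be arranged differently.

```r
# Six conditions (3 Speed x 2 Load), numbered 1 to 6 by row of `conditions`.
conditions <- expand.grid(Speed = c("slow", "medium", "fast"),
                          Load  = c("yes", "no"))

# Cyclic Latin square: subject s reads item i in condition ((s + i - 2) mod 6) + 1.
square <- outer(1:6, 1:6, function(s, i) (s + i - 2) %% 6 + 1)
dimnames(square) <- list(Subj = 1:6, Item = 1:6)
square

# Each condition occurs exactly once per subject (row) and once per item (column).
all(apply(square, 1, function(r) setequal(r, 1:6)))
all(apply(square, 2, function(r) setequal(r, 1:6)))

# Look up the condition that subject 2 receives for item 3.
conditions[square[2, 3], ]
```

With six subjects and six items, every subject reads each text once, every subject sees every condition once, and every text appears once in every condition.
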
The final example in this vignette implements this design. However, for didactic reasons, we first show how within/between-subject and within/between-item features of the factors are specified without counterbalancing. These designs are preferred if repeated exposure to the same stimuli in an experimental condition does not have any confounding effects on the measure.

## Complete within-subject and within-item design; no counterbalancing

In the first version of the experimental design, each subject sees each text in each of the 2 x 3 experimental conditions. Thus, the two fixed factors _Speed_ and _Load_ are both within-subject and within-item. A minimum of six subjects and six texts is required for a complete within-subject/within-item design. We summarize the design with the

**design formula:**

```
Load(2) x Speed(3) x 6 Item x 6 Subj
```

This is a completely crossed design. Note the difference between the specification of levels for fixed factors and of instances for random factors. The product of the numbers in the formula gives the number of observations generated by the design. In this case: **216 observations**.

```{r}
# Two fixed factors crossed with two random factors
design1 <-
  fixed.factor("Speed", levels=c("slow", "medium", "fast")) +
  fixed.factor("Load", levels=c("yes", "no")) +
  random.factor("Subj", instances=6) +
  random.factor("Item", instances=6)

# Extract the codes, sort by subject and item, and reorder the columns
codes1 <- arrange(design.codes(design1), Subj, Item)[c(3, 4, 2, 1)]
codes1
tail(codes1, 10)

#xtabs( ~ Subj + Item + Load + Speed, codes1)
xtabs(~ Load + Speed, codes1)
xtabs(~ Subj + Load + Speed, codes1)
xtabs(~ Item + Load + Speed, codes1)
```

The first command generates the list `design1`. The function `design.codes()` extracts the generated variable coding as a dataframe in the tibble format. After sorting the rows and reordering the columns, the codes are available in the long format (i.e., N=216).
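
As an independent cross-check of the design formula, the fully crossed design can also be enumerated with base R alone. The following sketch does not use `designr`; the object name `grid1` is purely illustrative, and its row and column order will differ from that of `codes1`.

```r
# Enumerate Load(2) x Speed(3) x 6 Item x 6 Subj with base R only
# (illustrative cross-check; not part of designr).
grid1 <- expand.grid(
  Speed = c("slow", "medium", "fast"),
  Load  = c("yes", "no"),
  Item  = 1:6,
  Subj  = 1:6
)

# The number of rows equals the product of the numbers in the design formula.
nrow(grid1) == 2 * 3 * 6 * 6

# In a completely crossed design, every Subj x Load x Speed cell holds
# one observation per item.
xtabs(~ Subj + Load + Speed, grid1)
```
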
Obviously, having subjects read each text six times may lead to practice effects that would need to be taken into account by counterbalancing the order in which texts are presented across subjects.

## _Speed_ within-subject/within-item, _Type_ within-subject/between-item; no counterbalancing

In the second example, we replace the factor _Load_ with a factor _Type_ of text. We assume that Items 1 to 3 are simple texts and Items 4 to 6 are complex texts. Subjects read both simple and complex texts; _Type_ of text is a within-subject factor. Each text (item), however, is either simple or complex. Thus, _Type_ is a between-item factor in this design. Such a design is realized by specifying _Type_ with the `groups` argument in the corresponding `random.factor()` command. We generate 3 items (_instances_) within each of the two levels of the factor _Type_; that is, as in the first example, we will again have six different items.

**Design formula:**

```
Type(2) x Speed(3) x 3 Item[Type] x 6 Subj
```

We read the item part of this formula as: "3 Items nested under levels of Type." The total number of different instances for the random factor _Item_ is 3 items x 2 levels of _Type_, that is, 6 items. The design generates **108 observations**; it is no longer completely crossed.

```{r}
# Items are nested under Type via the groups argument
design2 <-
  fixed.factor("Speed", levels=c("slow", "medium", "fast")) +
  fixed.factor("Type", levels=c("simple", "complex")) +
  random.factor("Subj", instances=6) +
  random.factor("Item", groups="Type", instances=3)

codes2 <- arrange(design.codes(design2), Subj, Item)[c(3, 4, 1, 2)]
codes2

xtabs(~ Item + Type, codes2)
xtabs(~ Subj + Type, codes2)
#xtabs( ~ Subj + Item + Type + Speed, codes2)
#xtabs(~ Type + Speed, codes2)
#xtabs(~ Subj + Type + Speed, codes2)
#xtabs(~ Item + Type + Speed, codes2)
```

The tables show that for Items 1 to 3 all available codes for the factor _Type_ are _complex_ and for Items 4 to 6 all codes are _simple_. Thus, _Type_ is varied between items.
Each item is read three times (three levels of _Speed_) by six subjects, yielding 18 codes in each of the 6 non-zero cells of the _Item_ x _Type_ table. Conversely, for all six subjects codes are available for _simple_ and _complex_ items. Thus, _Type_ is varied within subjects. Each text is read three times (i.e., at the three speed rates). Therefore, there are 3 texts x 3 levels of speed = 9 codes in each cell of the _Subj_ x _Type_ table.

The command to specify _Speed_ as a between-item factor would be:

```
random.factor("Item", groups="Speed", instances=2)
```

We need 2 instances within each of the 3 levels of _Speed_ to obtain 6 items in total.

**Design formula:**

```
Type(2) x Speed(3) x 2 Item[Speed] x 6 Subj
```

The total number of items is 2 x 3 = 6. This design generates **72 observations**.

## _Age_ between-subject/within-item, _Speed_ within-subject/within-item; no counterbalancing

In this example, we replace the factor _Load_ (or _Type_) with a between-subject factor _Age_, assuming that half the subjects are young and the other half old.

```{r}
# Subjects are nested under Age via the groups argument
design3 <-
  fixed.factor("Speed", levels=c("slow", "medium", "fast")) +
  fixed.factor("Age", levels=c("young", "old")) +
  random.factor("Item", instances=6) +
  random.factor("Subj", groups="Age", instances=3)

codes3 <- arrange(design.codes(design3), Subj, Item)[c(4, 3, 2, 1)]
codes3

xtabs(~ Subj + Age, codes3)
xtabs(~ Item + Age, codes3)
#xtabs( ~ Subj + Item + Age + Speed, codes3)
#xtabs( ~ Subj + Age + Speed, codes3)
```

The tables show that subjects 1 to 3 are _old_ and subjects 4 to 6 are _young_ (i.e., _Age_ is a between-subject factor) and that all items are read by young and old subjects (i.e., _Age_ is a within-item factor). The formula for this design can be written as `Age(2) x Speed(3) x 6 Item x 3 Subj[Age]`, yielding **108 observations**.

Note that `instances` specifies the number of instances within each level of `groups`, not the total. To generate codes for 25 young and 25 old subjects (i.e., total N=50), we set `instances=25`.
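
The effect of `instances=25` on the size of the design can be checked by hand. The sketch below rebuilds the same nesting with base R rather than `designr`; the names `subjects`, `crossed`, and `grid3` are hypothetical, and the subject numbering need not match the one `designr` produces.

```r
# Subjects nested under Age: 25 instances per level, 50 subjects in total.
subjects <- data.frame(
  Subj = 1:50,
  Age  = rep(c("young", "old"), each = 25)
)

# Speed and Item are crossed with every subject.
crossed <- expand.grid(
  Speed = c("slow", "medium", "fast"),
  Item  = 1:6
)

# merge() with by = NULL returns the Cartesian product of the two data frames.
grid3 <- merge(subjects, crossed, by = NULL)

nrow(grid3)                  # 2 x 3 x 6 x 25 observations
length(unique(grid3$Subj))   # 25 x 2 subjects
```
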

**Design formula:**

```
Age(2) x Speed(3) x 6 Item x 25 Subj[Age]
```

The total number of subjects is 25 x 2 = 50. This design generates **900 observations**.

## _Age_ between-subject/within-item, _Speed_ between-subject/within-item; no counterbalancing

Continuing with the last example, it may also make sense to vary not only _Age_, but also _Speed_ between subjects. Thus, every subject is either _old_ or _young_ (i.e., a quasi-experimental factor) and is randomly assigned to one of the three _Speed_ conditions (i.e., an experimental factor). For this specification, the two factors are passed as a vector to the `groups` argument. For the minimal design we need only 1 instance because 2 x 3 = 6. This means we generate codes for 1 subject in each of the six design cells, but each subject reads each text in this condition (i.e., there are six measures for each subject). To get codes for 10 subjects in each of the 2 x 3 = 6 design cells (i.e., a total of 60 subjects), we set `instances=10`.

**Design formula:**

```
Age(2) x Speed(3) x 6 Item x 10 Subj[Age x Speed]
```

The total number of subjects is 10 x 2 x 3 = 60. This design generates **360 observations**.

```{r}
# Subjects are nested under the Age x Speed cells
design4 <-
  fixed.factor("Speed", levels=c("slow", "medium", "fast")) +
  fixed.factor("Age", levels=c("young", "old")) +
  random.factor("Subj", groups=c("Age", "Speed"), instances=10) +
  random.factor("Item", instances=6)

codes4 <- arrange(design.codes(design4), Subj, Item)[c(3, 4, 2, 1)]
codes4

xtabs( ~ Subj + Age, codes4)
xtabs( ~ Subj + Speed, codes4)
xtabs( ~ Item + Age, codes4)
xtabs( ~ Item + Speed, codes4)
#xtabs( ~ Subj + Item + Age + Speed, codes4)
```

The tables show that _Age_ and _Speed_ indeed vary between subjects and within items.
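
Whether a factor varies between or within the instances of a random factor can also be checked programmatically instead of by inspecting the tables. The following is a minimal sketch using a hand-built data frame as a stand-in for `codes4`; the helper `n_levels_per_unit()` and the data frame `toy` are hypothetical and not part of `designr`.

```r
# Hypothetical helper: number of distinct levels of `factor` that each
# instance of `unit` encounters (1 = between, more than 1 = within).
n_levels_per_unit <- function(data, unit, factor) {
  tab <- table(data[[unit]], data[[factor]])
  rowSums(tab > 0)
}

# Toy stand-in for codes4: 60 subjects, 10 per Age x Speed cell, 6 items each.
cells <- expand.grid(Age   = c("young", "old"),
                     Speed = c("slow", "medium", "fast"))
subj  <- data.frame(Subj = 1:60, cells[rep(1:6, each = 10), ],
                    row.names = NULL)
toy   <- merge(subj, data.frame(Item = 1:6), by = NULL)

nrow(toy)                                          # 6 x 60 observations
all(n_levels_per_unit(toy, "Subj", "Speed") == 1)  # Speed is between subjects
all(n_levels_per_unit(toy, "Item", "Speed") == 3)  # Speed is within items
```

Applied to the output of `design.codes()`, the same helper should return 1 for every subject and the full number of levels for every item, assuming the column names match those used here.
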

# Counterbalancing _Speed_ and _Load_

In this final example, we modify the very first example such that each subject reads a different text in each of the six conditions, respecting the constraint that design cells are counterbalanced (i.e., each text is read equally often in each condition, and each subject reads the same number of texts in each condition). For this implementation we (1) add a third random factor defined as _Subj-by-Item_ and (2) specify factors _Speed_ and _Load_ as varying between _Subj-by-Item_. We start with the minimal design of 6 subjects reading 6 texts.

**Design formula:**

```
Speed(3) x Load(2) x 1 Item[Speed x Load] x 1 Subj[Speed x Load] x (3 x 2) Item-by-Subj[Speed x Load x Item[Speed x Load] x Subj[Speed x Load]]
```

We have 1 item and 1 subject nested under the levels of the _Speed_ x _Load_ design. There are 36 instances of the random factor resulting from the multiplication of the random factors _Item_ and _Subj_. The design generates 3 x 2 x 1 x 1 x (3 x 2) = **36 observations**.

```{r}
# Speed and Load vary between Subj-by-Item combinations
design5 <-
  fixed.factor("Speed", levels=c("slow", "medium", "fast")) +
  fixed.factor("Load", levels=c("yes", "no")) +
  random.factor("Subj", instances=1) +
  random.factor("Item", instances=1) +
  random.factor(c("Subj", "Item"), groups=c("Speed", "Load"))

codes5 <- arrange(design.codes(design5), Subj, Item)[c(3, 4, 1, 2)]
codes5

xtabs(~ Subj + Speed + Load, codes5)
xtabs(~ Item + Speed + Load, codes5)
xtabs( ~ Subj + Item + Load + Speed, codes5)
```

The numbers of subjects and items each increase by six with every unit increment of the corresponding `instances` argument. For example,

```
...
random.factor("Subj", instances=10) +
random.factor("Item", instances= 4) +
...
```

will generate codes for 60 subjects and 24 texts.

```{r}
# Same counterbalanced design with 10 subjects and 4 items per cell
design6 <-
  fixed.factor("Speed", levels=c("slow", "medium", "fast")) +
  fixed.factor("Load", levels=c("yes", "no")) +
  random.factor("Subj", instances=10) +
  random.factor("Item", instances=4) +
  random.factor(c("Subj", "Item"), groups=c("Speed", "Load"))

codes6 <- arrange(design.codes(design6), Subj, Item)[c(3, 4, 1, 2)]
codes6

# Number of subjects, items, and Subj-by-Item combinations
length(unique(codes6$Subj))
length(unique(codes6$Item))
length(unique(paste(codes6$Subj, codes6$Item)))
```

**Design formula:**

```
Speed(3) x Load(2) x 4 Item[Speed x Load] x 10 Subj[Speed x Load] x (3 x 2) Item-by-Subj[Speed x Load x Item[Speed x Load] x Subj[Speed x Load]]
```

The total number of items is 4 x 3 x 2 = 24; the total number of subjects is 10 x 3 x 2 = 60. The total number of instances of _Item-by-Subj_ is 3 x 2 x (3 x 2) x 4 x 10 = 1440. The design yields 3 x 2 x 4 x 10 x (3 x 2) = **1440 observations**.

# Outlook

The examples illustrate some of the basic functionalities. The generalization to a larger number of fixed or random factors, and to a larger number of levels associated with them, should be clear.

The codes generated with the above specifications can be extended with different assignments of presentation orders according to `latin.square` (default), `random.order`, or `williams`. These options will be described in the second vignette.

The function also allows the specification of fixed effects, variance, and correlation parameters to generate input suitable for linear (mixed) models and for the determination of statistical power via simulations from the model. The third vignette is a tutorial about these functionalities.

# Appendix

## Acknowledgement

The development of this package was supported by the German Research Foundation (DFG)/SFB 1287 _Limits of variability in language_ and the Center for Interdisciplinary Research, Bielefeld (ZiF)/Cooperation Group _Statistical models for psychological and linguistic data_.

## Packages

```{r}
sessionInfo()
```