All aggression is instrumental

Aggression is commonly defined as a behavior done with the intent to harm another individual who is motivated to avoid receiving the behavior. Some researchers go further and try to classify aggression as being either “reactive aggression” or “instrumental aggression.” I do not believe this distinction is useful.

Briefly, reactive aggression is supposedly an impulsive aggressive behavior in response to a provocation or instigation and is typically accompanied by feelings of anger or hostility. The supposed goal of reactive aggression is merely to “cause harm” to the recipient of the behavior. Think of snapping at another person in the heat of the moment. Instrumental aggression is supposedly an aggressive behavior that is enacted to achieve a particular goal. Think of a bank robber who shoots the guard while trying to make a getaway.
Several researchers have pointed out that this distinction is difficult, if not impossible, to make (e.g., Bushman & Anderson, 2002; Tedeschi & Quigley, 1999). I agree. With a little thought, one can see that “snapping” at another person can be used to achieve several goals, such as restoring a perceived slight to one’s reputation or exerting social control. Thus, the above example of reactive aggression also can be construed as instrumental. Similarly, one also can see that the bank robber’s shooting of the guard probably was in response to some feature of the situation, such as the perception that the guard was impeding the goal of successfully executing the robbery. Thus, the above example of instrumental aggression can be construed as being in response to something and, thus, reactive.

Wait! Am I saying that snapping at another person is the same as a bank robber shooting the guard? No. These are very different behaviors, but the distinction is not that one is “reactive” and one is “instrumental.”

The argument that the reactive-instrumental distinction is a false distinction is fairly simple. Aggression is, by definition, a behavior that was done intentionally (i.e., non-accidentally). Intentional behaviors are used to achieve social motives. Thus, aggression is one specific type of intentional behavior that is used to achieve social motives. What are some examples of social motives that can be achieved with aggressive behaviors? Protecting oneself, acquiring resources, restoring one’s reputation, enforcing a violated social norm, etc.

Further, the belief that aggression can be done merely “to cause harm” is logically incorrect. Because the definition of aggression requires the aggressive behavior to have been done with intent and with the belief that the recipient wants to avoid the behavior, some believe this definition implies that “causing harm” can be the end goal of the behavior rather than merely a means to achieving some other end. Therefore, “causing harm” can seemingly be the goal behind reactive aggression. Although this is a common belief, it conflates the definitional criteria of aggression with the motive for why an individual would use an aggressive behavior. This is an easy conflation to make because “to cause harm” seems like a reasonable and satisfactory answer to the question “why did this person behave aggressively?” However, it only seems like a satisfactory answer; it is not. One cannot explain the causes of a phenomenon (aggression) merely by referring to a necessary component of the phenomenon (an intentionally caused harmful behavior): A person who behaves aggressively did so, by definition, with the intent to harm the recipient.

I sincerely hope that we can move beyond the reactive-instrumental definition because I do not believe it is a scientifically useful distinction. Aggression is one behavior in our repertoire of behaviors we use to navigate our complex social environments. All aggression is instrumental. 

"Lab-based measure of aggression" are to "real aggression" what college students are to all humans

Aggression is a common feature of social interactions.  Therefore, it is important for social scientists to develop a well-rounded understanding of this phenomenon.  One valuable approach to understanding aggression is laboratory-based research, which requires researchers to have usable and valid methods for measuring aggression in laboratory settings.  However, behaviors that are clearly aggressive, such as one person forcefully striking another person with a weapon, are fraught with ethical and safety considerations for both participants and researchers.  Such behaviors are, therefore, not a viable option for displayed aggression within lab-based research.  For these reasons, aggression researchers have developed a repertoire of tasks that purportedly measure aggression, are believed to be safe for participants and researchers, and are ethically-palatable.  I collectively refer to these tasks as “lab-based aggression paradigms.”  The major concern herein is whether the behaviors exhibited within lab-based measures of aggression are representative of “real” aggression. 

A common definition of aggression is “a behavior done with the intent to harm an individual who is motivated to avoid receiving that behavior” (Baron & Richardson, 1994, p. 7). If one adheres to this definition, a behavior is considered aggressive when both (a) a harmful behavior has occurred and (b) the behavior was done (i) with the intent to harm the target and (ii) with the belief that the target wanted to avoid receiving the behavior. A strength of this definition is the clear demarcation between harmful behaviors that are not aggressive (e.g., a dentist who causes pain in the process of pulling a patient’s tooth; inflicting consensual pain for sexual pleasure, etc.) and harmful behaviors that are aggressive (e.g., punching another person out of anger; yelling at another person and causing a fear response, etc.).
As hinted at above, the degree of “harm” that is permissible within lab-based settings is very mild. In fact, the lower bound of harmfulness at which behaviors become unambiguously aggressive is likely the upper bound of harmfulness that is permissible within laboratory settings.

Extending Baron and Richardson’s (1994) definition, Parrott and Giancola (2007) proposed a taxonomy of how such aggressive behaviors may manifest. Within their taxonomy, aggressive behaviors vary along the orthogonal dimensions of direct versus indirect expression and active versus passive expression. For example, a physical fight would be considered a direct and active form of physical aggression, whereas not correcting knowingly false gossip would be considered an indirect and passive form of verbal aggression (to the extent the individual believes their inaction will indirectly harm a target individual). Because Parrott and Giancola strongly adhere to the definition of aggression proposed by Baron and Richardson, each of these forms of aggression is still required to meet the criteria described in the previous paragraph. The purported usefulness of this taxonomy is that factors that incite one form of aggression may not incite other forms of aggression. Thus, Parrott and Giancola assert that using their taxonomy to classify the different behavioral manifestations of aggression, and which antecedents cause those different behavioral manifestations, will lead to a nuanced understanding of the causes and forms of aggression.

The first dimension of Parrott and Giancola’s (2007) taxonomy is the direct versus indirect nature of the aggressive behavior. In describing the distinction between direct and indirect aggression, Parrott and Giancola state that direct aggression involves “face-to-face interactions in which the perpetrator is easily identifiable by the victim. In contrast, indirect aggression is delivered more circuitously, and the perpetrator is able to remain unidentified and thereby avoid accusation, direct confrontation, and/or counterattack from the target” (p. 287). However, several lab-based aggression paradigms seemingly have features of both direct and indirect forms of aggression. Many of these paradigms involve contrived interactions in which participants communicate with a generic “other participant,” for example, via computer or by evaluating one another’s essays. These contrived interactions are not really face-to-face, and they are not really anonymous. So the behaviors within lab-based aggression paradigms are not cleanly classified as either direct or indirect within Parrott and Giancola’s taxonomy.

Similarly, participants’ behaviors exhibited within lab-based aggression paradigms are often not “directly” transmitted to the recipient of those behaviors. For example, participants do not make physical contact with their interaction partner at any point within these paradigms. The consequences of participants’ behaviors are often transmitted to the recipient via the ostensible features of the study in which they are participating. For example, in one lab-based aggression paradigm, participants’ harmful behavior is selecting how long another participant must submerge their hand in ice water (Pedersen, Vasquez, Bartholow, Grosvenor, & Truong, 2014). Therefore, participants must believe (a) they can harm the recipient by varying how long they tell the experimenter to have the recipient hold their hand in ice water, (b) that a longer period of time causes more harm, and (c) the experimenter will successfully execute the harmful behavior at a later point in time.

Collectively, then, it is questionable whether behaviors within lab-based aggression paradigms should be considered “direct.” Nevertheless, it is clear that these behaviors do not include face-to-face aggression or physical aggression involving direct physical contact. And the timing of many of the behaviors within lab-based aggression paradigms is asynchronous with the (ostensible) delivery of harm to the recipient.

The second dimension of Parrott and Giancola’s (2007) taxonomy is the active versus passive nature of the behavior. Active aggression involves an individual actively engaging in a behavior that harms the recipient. In contrast, passive aggression is characterized by a lack of action that is believed to cause harm to the recipient. All of the major lab-based aggression paradigms involve behaviors that are considered active.

In summary, within lab-based aggression paradigms, the harmfulness of the behaviors is on the extreme low end of the range of possible harmfulness, participants may believe their behaviors will only cause mild amounts of harm, participants may believe the recipient may only be mildly motivated to avoid the behaviors, and the form of participants’ behaviors may only cover a limited amount of the conceptual space of possible forms of aggression. Collectively, the behaviors exhibited in lab-based aggression paradigms seem to be limited and unrepresentative of the multi-faceted nature of aggression.

Is this potential unrepresentativeness a problem? On the one hand, the relationship between the behaviors within lab-based measures of aggression and “real” aggressive behaviors is like the relationship between a convenience sample of college students and “all humans.” The former is not a representative sample of the latter; therefore, generalizations from the former to the latter are potentially biased. On the other hand, to the extent that the behaviors exhibited within lab-based aggression paradigms are valid instances of very mild and specific forms of aggression, lab-based research has a valuable place within a robust science of aggressive behaviors.

A glimpse into my academic writing habits

The other day I was talking to a student who was interested in my approach to academic writing. Where do I write? When do I write? How often do I write? Etc. Later, this student expressed that our conversation was helpful. Here is the gist of my response. I hope you find at least one thing helpful.

I am not a naturally gifted writer, so producing writing that is considered a “scientific contribution” requires my sustained and focused mental effort. So the first thing I do is ensure there is time in my schedule to write. The second thing I do is ensure that I fill that time with cognitively-demanding and mentally-focused writing. Spend enough time doing mentally-focused writing: It’s really that easy. 

Perhaps it is because I come from a family of dairy farmers, or perhaps it is from my time in the military, but I am an early riser. I typically wake around 5 AM (except for holidays, vacations, etc.). From 5’ish until 6’ish I engage in what I call “deep writing” (inspired by the concept of “deep work”: http://calnewport.com/books/deep-work/). My morning writing time takes the same amount of time as drinking one cup of coffee.

Deep writing is not superficial writing. During this time I don’t just make bullet points or do mundane tasks like checking references or formatting a table. I focus intensely on the content of what I am writing. Is my writing clear? Is my writing accurate? Is my writing precise? During this time I am not checking my email or thinking of what is on my schedule for the day. It sometimes feels like a mental fight. The second I sharpen my focus, my mind seems to want me to check my email. Sometimes I am literally staring at the screen, but not doing any deep thinking. If I catch myself being unfocused, I refocus on the task of writing. There is a level of focus that my mind seems to be comfortable at. I try to push myself just past this point of comfort so that I am effortfully immersed in my writing. This is both hard work and extremely satisfying. I imagine this is the same satisfaction that artists get out of engaging in their work.

Given that I only do this for about an hour, and given the level of focus that I try to invest, there are some mornings where I only work on a single paragraph. That’s OK; my only goal is to make at least one substantive change to what I am working on every morning. This goal of one substantive change is sustainable and attainable. It is a small goal, yet it ensures that I am making steady progress on whatever I am writing. No matter what else happens the rest of the day, I know that my current writing project is moving forward. I also find that mornings work best for me to engage in deep writing. My mind tires during the day and I find it harder to really intensify my writing focus as the day drags on.

There are mornings where I wander onto Twitter or I check the news and I don’t make my daily substantive change. These days bother me.

After my deep writing time, I get ready for the rest of my day.

When I get to my office I usually check my email right away. This is probably not an ideal habit, but I am working on improving it. I respond to quick emails and then check my calendar. I have time blocked off for my meetings, classes, conference calls, etc. I also block off time to engage in more deep writing. Some semesters I can only block off a one-hour writing chunk here and a three-hour writing chunk there, but I always put writing time onto my schedule. Always! It is a habit I developed in graduate school and it has served me well ever since. I like different environments for my writing. I typically write in my office. I close out of my email. I don’t play any music. I only use the internet for looking up word definitions, finding articles, etc. Sometimes I write at the library or at Starbucks to minimize unscheduled interruptions. No matter where I am, I try to push myself to focus hard and produce the best writing that I am capable of during this time.

I also block off two hours each week for “professional development” where I read a chapter out of a stats textbook or I try to learn a new skill. Currently I am using my professional development time to learn RMarkdown/knitr. During this time I don’t do anything else but focus on the new skill I am trying to develop. I also block off an hour each day for “busy work” where I do things like scan documents, fill out travel vouchers, clean off my desk, etc. During my “busy work” time I reward myself by playing music (I am currently listening to Bob Dylan).

Blocking off time for these non-writing activities helps me protect my writing time and allows me to mentally focus during my writing time. However, there are days when unexpected tasks arise, when I need more time than I allotted to complete a task, etc. Some days I miss my writing time. These days bother me.

I try to end my work day in the late afternoon. Most days I have invested enough focus on different activities that I am mentally spent. Some evenings I write. Writing in the evenings typically consists of superficial tasks like light editing because I usually don’t have enough mental energy to engage in deep writing. Sometimes in the evenings I am thinking about the overall structure of a manuscript or if there is an apt metaphor that I could incorporate into my writing. But I try to protect my evenings for non-work to the same extent I protect my writing time during the days. I also try to do non-academic reading in the evenings. I find the different style and pace of non-academic writing to be helpful in training my ear to hear well-written prose.

When I describe my writing schedule, people often think that I have the luxury of being able to block off regular writing time as if I am not otherwise busy. However, I feel like it requires a lot of effort to set aside writing time and then selfishly protect it. Without this effort of imposing control over my time, my schedule would be overrun and hectic.

That’s it. No secrets. No magic solution. No neat productivity tricks. Just planning and focus. In the end, am I the best writer? No. Am I the best writer that I can be? I am trying. 

(Fast) Food for Thought

Often I find myself walking to a local fast food establishment for lunch. The staff there is excellent: They keep the place clean, they always greet me with a smile, and they make delicious food. A few years ago, this particular fast food chain had a string of bad press where it was discovered that a very small number of employees were doing some unsavory things to customers’ food.
I felt bad for the staff at my local restaurant. They had no association with these trouble-makers other than that they happened to work for the same restaurant chain, just like thousands of other individual employees. After the news broke, some customers were worried about what was happening behind closed doors in their local restaurant (e.g., Is some bad employee doing something unsanitary to my lunch?). And the staff was probably concerned about being perceived as one of the trouble-makers (e.g., Do my customers think that I am doing something unsanitary to their lunch?). A few bad news stories ruined the whole employee-customer relationship.
 
The response by my local franchisee was simple and effective: They modified the store to have an open-kitchen design (http://business.time.com/2012/08/20/nothing-to-hide-why-restaurants-embrace-the-open-kitchen/). Now, I can order my lunch and watch the employees prepare my food. I can see into the kitchen and see exactly who is handling my food and how they are handling it. It is transparent. I suspect the staff likes the open-kitchen concept too. They know that if they follow the proper procedures, customers will not erroneously suspect them of doing something unsavory to their food. By opening up the food preparation process, the whole employee-customer relationship was improved. Now, customers can receive their lunch with the confidence that it was made properly and the staff can provide customers their lunch with the confidence that customers are not suspicious.
I also suspect the open-kitchen concept had several secondary benefits too. For example, the staff probably keeps the kitchen cleaner and avoids cutting obvious corners when they know they are in plain sight of customers. When I go to a different restaurant that still has a “closed-kitchen” design, I wonder what I would see if I could peer into their kitchen. Consequently, all else being equal, I choose open-kitchen establishments over closed-kitchen establishments. Open-kitchen designs are good for the bottom line.
The parallels between the open-kitchen design and “open science” are obvious. As researchers, we produce information that other people consume and we consume information that other people produce.

Here is some (fast) food for thought. As a producer of research, would you feel comfortable allowing your consumers to transparently see your research workflow? As a consumer of research, if you were given the choice between consuming research from an open-science establishment or a closed-science establishment, which would you choose?  

Preregistration increases the informativeness of your data for theories

Theories predict observations. Observations are either consistent or inconsistent with the theory that implied them. Observations that are consistent with the theory are said to corroborate the theory. Observations that are inconsistent with the theory should cast doubt on the theory or on one of the premises of the theory.
There are a few things that affect the extent to which data inform theory. Below I argue that preregistration can strengthen the informativeness of data for a theory in a few ways.
First, data inform theory only via a chain of auxiliary assumptions. All else being equal, data that inform a theory through fewer auxiliary assumptions are more informative to that theory than data that make contact with the theory through more auxiliary assumptions.
For example, data are informative to a particular theory so long as readers assume the predictor is valid, the outcome is valid, the conditions relating the predictor to the outcome have been realized, the sample was not selected based on the obtained results, the stated hypotheses were not modified to match the obtained results, etc. Sometimes these auxiliary assumptions are not accepted (by some individuals) and the theory is treated (by some individuals) as uninformed by the data.
Preregistration can essentially eliminate some of the assumptions that are required to interpret the data. Readers do not need to take it on faith that the stated hypotheses were indeed a priori. Readers do not need to take it on faith that the sample size was determined as described. Etc. There is a date-stamped document declaring all of these features. The only assumption necessary is that the preregistration is legit. Thus, all else being equal, a preregistered study has fewer links in the chain of auxiliary assumptions linking the observed data to the theory being tested. Thus, all else being equal, data from a preregistered study are more informative to a theory than data from a non-preregistered study.
Second, the informativeness of the data is related to the degree to which the data can be consistent or inconsistent with the theory. If a theory really sticks its neck out there, the data can more strongly corroborate or disconfirm the theory. If a theory does not stick its neck out there, the data are less informative to a theory, regardless of the specific pattern of data.
Preregistration clearly specifies which outcomes are predicted and which are not prior to the data being analyzed. Predictions that were made prior to the data being analyzed are riskier than predictions that were made after the data have been analyzed. Why? Because with preregistration there is no ambiguity in whether the predictions were actually made independently of the results of those predictions. Preregistered predictions stick their neck out. Very specific preregistered predictions really stick their neck out.

I am not saying that you must preregister your studies. I am saying that choosing not to preregister your studies is also choosing not to maximize the informativeness of your data for your theories.

Hostile priming effects seem to be robust

In 1979, Srull and Wyer published a study wherein participants were presented with a series of 4 words from which they had to construct grammatically correct 3-word phrases. Some phrases described aggressive behaviors (e.g., break his leg). Later, participants read a story about a day in the life of a man named Donald. In this story, Donald performed ambiguously aggressive behaviors (e.g., argued with his landlord). Finally, participants provided their judgments of Donald by rating him on a series of traits that were combined into a measure of hostility (e.g., hostile, unfriendly, dislikable, kind (r), considerate (r), and thoughtful (r)). Participants who completed more aggressive phrases in the first task subsequently rated Donald as more hostile. Exposure to hostile-relevant stimuli that affects a subsequent hostile-relevant impression is generically referred to as a “hostile priming” effect.*
So, is there strong evidence for the robustness of a “hostile priming” effect?** If you asked me 6 years ago I would have said “yes.” Why? First, DeCoster and Claypool (2004) performed a meta-analysis on cognitive priming effects that used an impression formation outcome variable and found that, overall, there is an effect of about 1/3 of a standard deviation in the predicted direction (i.e., k = 45, N = 4794, d = 0.35, 95% CI[0.30, 0.41]). Further, several of the studies included in the DeCoster and Claypool meta-analysis primed the construct of “hostility” and had an outcome variable that was relevant to the construct “hostility.” Second, the hostile priming effect has been demonstrated in dozens of published studies.
However, in 2016, I feel there are a few reasons to question the robustness of the hostile priming effect. First, DeCoster and Claypool didn’t investigate the presence of publication bias. That is not a knock on their excellent meta-analysis. But I believe that everybody in 2016 is more cognizant of the potential problems of publication bias than 12+ years ago. And we currently have more tools to detect publication bias than we did 12+ years ago. Second, it seems that cognitive phenomena labeled as a “priming” effect are currently viewed more skeptically. Fair or not, that is my belief about the current perceptions of cognitive priming effects. Third, many of the studies in the DeCoster and Claypool meta-analysis were authored by Diederik Stapel. Obviously, we should interpret the studies authored by Stapel differently in 2016 than in 2004. 
In addition to being interested in the hostile priming effect, I also wanted to force myself to learn some new tools. (This is probably a good place to note that this exercise was mostly a way for me to practice using some new tools, so there may be errors involved or I may describe something in a slightly incorrect way. If you find an error, help me learn.*** Please and thank you!)
I gathered what I believe to be a comprehensive list of all of the publications with (a) an assimilative hostile priming manipulation and (b) some type of a hostile-relevant impression formation task. I found 27 publications with 38 individual studies (please let me know if you are aware of studies I missed).

First, I did a p-curve analysis on all of the studies. Here is a link to the p-curve disclosure table (https://mfr.osf.io/render?url=https://osf.io/ar5cf/?action=download%26mode=render). The analysis reveals these studies contain evidential value, z = -6.02, p < .001. The p-curve analysis estimated the average power of the studies to be 65%.
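For readers who want to see the mechanics, here is a minimal R sketch of the continuous (Stouffer-style) test for right-skew that, as I understand it, underlies the p-curve test of evidential value. The values in p_sig are placeholders for illustration, not the p-values from my disclosure table.

# Sketch of the p-curve test for right-skew (evidential value), based on my
# understanding of the procedure; p_sig holds placeholder p-values.
p_sig <- c(.001, .004, .012, .020, .032, .041)

pp <- p_sig / .05                        # under a true null, pp is uniform on (0, 1)
z  <- qnorm(pp)                          # convert each pp-value to a z-score
stouffer_z <- sum(z) / sqrt(length(z))   # combine across studies
pnorm(stouffer_z)                        # a small p (very negative z) indicates evidential value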

I was pretty liberal with my inclusion criteria for this first p-curve analysis. I next winnowed the studies down to the ones that I believed were most focused on the effect of interest. First, I took out the Stapel studies (for those who are interested, the continuous p-curve analyses implied “evidentiary” value in Stapel’s studies, z = -3.85, p = .0001).****  Next, there were some studies that were not really interested in the effects of hostile priming on impression formation per se.  Instead, these studies were interested in the relation between “hostility” and some other construct they believed was associated with it, and they used an impression formation outcome variable to demonstrate this hypothesized relation.  For example, DeWall and Bushman (2009) proposed that people hold mental associations between the constructs of “hot temperatures” and “hostility.”  This proposition was demonstrated by exposing individuals to words associated with hot temperatures prior to having them report their judgments of Donald.  Thus, the emphasis in studies of this ilk is not so much on impression formation; rather, these studies use impression formation tasks as a tool to test their other hypotheses.  This led me to take out studies that primed “hostility” with hot temperatures (e.g., DeWall & Bushman, 2009; McCarthy, 2014), alcohol (e.g., Bartholow & Heinz, 2006; Pedersen et al., 2014), sex (Mussweiler & Damisch, 2008; Mussweiler & Forster, 2000), and aggressive sports (e.g., Wann & Branscombe, 1990). (Collectively, these omitted studies did not have evidential value, z = -0.71, p = .24.)
A p-curve analysis on the remaining 18 effects still revealed evidentiary value for an effect, z = -4.91, p < .001. The p-curve analysis estimated the average power of the remaining studies to be 74%.

So, this is just a first pass at examining the evidentiary value within these studies.  And, based solely on these p-curve analyses, it looks like there is evidence for a hostile priming effect on subsequent impressions of hostility.  I plan on working through a few other tests such as the Test for Insufficient Variance, a meta-analysis of effect sizes, etc.  Again, this is my way of forcing myself to learn, possibly get some free feedback, identify obvious errors, etc.
* “Hostile priming” needn’t be limited to subsequent impression formation tasks. It could, for example, include a behavioral outcome measure (e.g., Carver et al., 1983). However, for the present purposes, I limit the discussion to outcome variables that broadly fall into the class of impression formation tasks. This choice is purely based on my personal interests and not on any theoretical justification.
** In this blog post I am only referring to an “assimilation effect”: That is, priming effects that cause subsequent judgments to possess more of the primed construct.  There are situations when the primed constructs cause subsequent judgments to possess less of the primed construct. These latter effects are referred to as “contrast effects” and are not discussed herein. Again, this choice is purely based on my personal interests and not on any theoretical justification.

***If you find an error, please find me at the next conference and say “hi.” You are entitled to one free beer/coffee from me (either one at any time of the day/night, I don’t judge). 

**** As Alanis Morissette says “isn’t it ironic” 

Getting a feel for equivalence hypothesis testing

A few weeks ago, Daniel Lakens posted an excellent blog about equivalence hypothesis testing (http://daniellakens.blogspot.com/2016/05/absence-of-evidence-is-not-evidence-of.html). Equivalence hypothesis testing is a method to use frequentist statistical analyses (specifically, p-values) to provide support for a null hypothesis. Briefly, Lakens describes a form of equivalence hypothesis testing wherein the “null hypothesis” is a range of effects that are considered to be smaller than the smallest effect of interest. To provide evidence for a null effect, a researcher performs two one-sided tests: one to determine whether the effect is smaller than the upper boundary of the equivalence range and one to determine whether the effect is larger than the lower boundary of the equivalence range. If the effect is both significantly less than the upper boundary and significantly greater than the lower boundary of the equivalence range, one can classify the effect as too small to be of interest. Of course, because these tests all employ p-values, they are subject to a known long-run error rate (the familiar Type 1 and Type 2 errors). (If this description was too brief, I would encourage you to take the time to read Lakens’ original blog post.)
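To make the two one-sided tests concrete, here is a minimal R sketch under assumptions of my own: a simple two-group design, simulated placeholder data, and an equivalence range of d = -0.4 to 0.4 (the same range I use in the simulations later in this post).

# Two one-sided tests on one simulated dataset; the data and the equivalence
# range of d = +/- 0.4 are illustrative assumptions, not from any real study.
set.seed(1)
n  <- 100                                  # participants per group
x1 <- rnorm(n)                             # group 1
x2 <- rnorm(n)                             # group 2 (true d = 0)

sd_pooled <- sqrt((var(x1) + var(x2)) / 2)
bound     <- 0.4 * sd_pooled               # equivalence bound on the raw scale

p_upper <- t.test(x1, x2, mu =  bound, alternative = "less")$p.value     # effect below the upper bound?
p_lower <- t.test(x1, x2, mu = -bound, alternative = "greater")$p.value  # effect above the lower bound?

# If both p-values are below .05, the effect is statistically smaller than
# the smallest effect of interest (i.e., it falls within the equivalence range).
c(upper = p_upper, lower = p_lower)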

Although I could follow the logic of this equivalence testing procedure, I didn’t have an “intuitive feel” for what it means to use p-values to generate support for a null effect. This is probably due to years of learning the traditional NHST approach to hypothesis testing wherein you can only “reject” or “fail to reject” the null hypothesis.  Here is the process I went through to further my understanding of equivalence testing.

First, to get a feel for how p-values can be used to generate support for a null effect it is useful to get a feel for how p-values behave when there is an effect. Let’s take a simple 2-group design where there is an effect of d = 0.3. Below is the distribution of p-values from 10,000 simulated studies with a population effect of d = 0.3 and with 50 individuals per group. You can see that 31.8% of these p-values are below .05. This figure is merely a visual representation of the statistical power of this design: Given a certain effect (e.g., d = 0.3), a certain sample size (e.g., N = 100), and an alpha level (e.g., .05), you will observe a p-value less than alpha at a known long-run rate. In this case, statistical power is 0.32.

Let’s stick with the example where there is an effect of d = 0.3.  Now suppose the sample size is doubled. In 10,000 simulated studies, increasing the sample size from 50 per group to 100 per group results in more low p-values. As can be seen below, 55.5% of the p-values are below .05. In other words, increasing the sample size increases statistical power. When there is a to-be-detected effect, increasing the sample size increases your chances of correctly detecting that effect by obtaining a p-value below .05. In this case, statistical power is 0.55.

In comparison, let’s look at a scenario where there is no effect in the population, d = 0 (which is the scenario that is most relevant to equivalence testing). With no effect in the population you can only make Type 1 errors.  In 10,000 simulated studies where there is a population effect of zero (i.e., d = 0) and a total N of 100, 5.17% of the p-values were below .05. (If you ran this simulation again you might observe slightly more or slightly fewer p-values below .05. In the long run 5% of the studies will result in p-values below .05).  The 5% of studies with low p-values are all Type 1 errors because there is no effect in the population.

Let’s stick with the scenario where there is a population effect of zero (i.e., d = 0) and double the sample size from a total N of 100 to a total N of 200. The distribution of p-values from 10,000 simulated studies shows that 4.87% of the p-values were less than .05 (again, in the long run this will be exactly 5%). When there is no effect in the population, the distribution of p-values does not change when the sample size changes.
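Here is a minimal R sketch of the kind of simulation behind these numbers; the exact percentages will bounce around a bit from run to run, but they should land near the values reported above.

# Simulate the p-value distribution for a two-group design with true effect d
# and n participants per group (10,000 studies per scenario).
sim_p <- function(d, n, nsim = 10000) {
  replicate(nsim, t.test(rnorm(n, 0), rnorm(n, d))$p.value)
}

mean(sim_p(d = 0.3, n = 50)  < .05)   # ~.32: power with d = 0.3 and 50 per group
mean(sim_p(d = 0.3, n = 100) < .05)   # ~.56: power with d = 0.3 and 100 per group
mean(sim_p(d = 0.0, n = 50)  < .05)   # ~.05: only Type 1 errors when d = 0
mean(sim_p(d = 0.0, n = 100) < .05)   # ~.05: unchanged by the larger sample
hist(sim_p(d = 0.0, n = 100), breaks = 20)   # flat (uniform) distribution when d = 0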

To recap: When there is a to-be-detected effect you can only make Type 2 errors.  With all else being equal, increasing the sample size increases statistical power which, by definition, decreases the likelihood of making a Type 2 error.  When there is no effect you can only make Type 1 errors.  With all else being equal, increasing the sample size does not affect the distribution of observed p-values. In other words, when a null effect is true, your statistical power will simply be your alpha level regardless of sample size.

What does this have to do with equivalence testing?  A lot actually.  If you followed the information above, then understanding equivalence testing is just re-arranging and re-framing this already-familiar information.

For the following simulations we are assuming there is no effect in the population (i.e., d = 0). We also are assuming that you determined that an absolute effect less than d = 0.4 is either too small for you to consider meaningful or too resource-expensive for you to study.  (This effect was chosen only for illustrative purposes; you can use whatever effect you want.)

To provide support for a null effect it is insufficient to merely fail to reject the null hypothesis (i.e., observe a p-value greater than your alpha level) because a non-significant effect can either indicate a null effect or a weakly powered test of a true effect that results in a Type 2 error.  And, as shown above, increasing your sample size does not increase your chances of detecting a true null effect with traditional NHST. However, increasing your sample size can increase the statistical power to detect a null effect with equivalence testing.

Let’s run some simulations. We have already seen that if a null effect is true (d = 0) and your total sample size is 100, traditional NHST will result in 5% of p-values less than .05 in the long run.  I now took these 10,000 simulated studies and tested whether the effects were significantly smaller than d = 0.4 and whether the effects were significantly larger than d = -0.4.  As can be seen below, in these 10,000 simulations, when d = 0 and N = 100, 63.9% of the samples resulted in an effect that was significantly smaller than d = 0.4 and 63.1% of samples resulted in an effect that was significantly larger than d = -0.4 (these percentages are not identical because of randomness in the simulation procedure; in the long run they will be equal).

Some of the samples with effects that are significantly smaller than d = 0.4 actually have effects that are much smaller than d = 0.4.  These samples have effects that are significantly smaller than d = 0.4 but are not significantly larger than d = -0.4.  Likewise, some of the samples with effects that are significantly larger than d = -0.4 have effects that are much larger than d = -0.4.  These samples have effects that are significantly larger than d = -0.4 but are not significantly smaller than d = 0.4.

In equivalence testing, to classify an effect as “null” requires the effect to be both significantly less than the upper bound of the equivalence range and significantly higher than the lower bound of the equivalence range.  In these 10,000 samples, 27.04% of the samples would be considered “null” (i.e., d = -0.4 < observed effect < d = 0.4).

Now comes the real utility of equivalence testing.  If we double the total sample size from N = 100 to N = 200, we increase the statistical power of claiming evidence for the null hypothesis.  As can be seen below, within the 10,000 simulated studies where d = 0 and N = 200, 88.3% of the studies had effects that were significantly smaller than d = 0.4 and 88% had effects that were significantly greater than d = -0.4 (again, differences in these percentages are due to randomness in the data generation process and are not meaningful), and 76.4% of these studies had effects that were both smaller than d = 0.4 and greater than d = -0.4.  Thus, increasing the total sample size from 100 to 200 increased the percentage of studies that would be classified as “null” (i.e., d = -0.4 < observed effect < d = 0.4) from 27.04% to 76.4%.
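Here is a minimal R sketch of this equivalence-testing simulation, written from my description above; again, the exact percentages will vary slightly from run to run.

# For each simulated study, run both one-sided tests against the equivalence
# bounds (d = +/- 0.4 by default) and record whether the effect would be
# classified as "null" (i.e., within the equivalence range).
sim_tost <- function(n, d = 0, bound_d = 0.4, nsim = 10000) {
  replicate(nsim, {
    x1 <- rnorm(n)
    x2 <- rnorm(n, mean = d)
    bound   <- bound_d * sqrt((var(x1) + var(x2)) / 2)
    p_upper <- t.test(x1, x2, mu =  bound, alternative = "less")$p.value
    p_lower <- t.test(x1, x2, mu = -bound, alternative = "greater")$p.value
    max(p_upper, p_lower) < .05          # TRUE if both one-sided tests are significant
  })
}

mean(sim_tost(n = 50))    # ~.27 of studies classified as "null" with total N = 100
mean(sim_tost(n = 100))   # ~.76 of studies classified as "null" with total N = 200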

Here are the major take-home messages.  First, equivalence testing is nice because it allows you to provide evidence for a null effect by using the tools that most researchers are already familiar with (i.e., p-values).  Second, unlike traditional NHST, increasing N can increase the statistical power of detecting a null effect (defined by the equivalence range) when using equivalence testing.

These simulations are how I went about building my understanding of equivalence testing.  I hope this helps others build their understanding too.  The R-code for this post can be accessed here (https://osf.io/ey5wq/).  Feel free to use this code for whatever purposes you want and please point out any errors you find.

Measuring aggression in the lab is hard

Psychologists often study aggression in lab-based settings.  However, some people are unconvinced that commonly-used lab-based aggression paradigms actually demonstrate aggression, which, they claim, limits the evidentiary value of results from studies that use those paradigms.  Rather than dig in my heels, I have tried to think of ways that researchers can frame their criticisms to make these discussions more productive.
I believe that most criticisms of lab-based aggression paradigms take on one of two flavors: The behavior was not believed to have been harmful or the behavior was not believed to have been caused by a cognitive process involving aggressive cognitions. 
The definition of aggression identifies the sources of criticisms
Aggression is a behavior that is done with the intent to harm another individual who is believed to want to avoid receiving the behavior.  Thus, to demonstrate aggression in the lab requires two factors: (a) a harmful behavior and (b) that behavior must be believed to have been caused by a cognitive process that involved an intent to harm and a belief that the recipient wanted to avoid experiencing the behavior (collectively referred to as “aggressive cognitions” herein).  If both factors are present, aggression has occurred; if either factor is absent, aggression has not occurred.  Conceptually simple, yet hard to execute.
Demonstrating harmful behaviors in the lab
Neither the IRB nor most researchers will allow participants to actually harm another person just for the sake of testing a hypothesis.  So, you cannot even demonstrate an unambiguously “harmful” behavior in the lab.  This is a big deal.  It is like being a researcher who is interested in the phenomenon of “eating ice cream” when the IRB won’t allow participants in your lab to actually eat ice cream.  For this reason, aggression researchers must use “ethically palatable” behaviors that minimally meet the criterion of being harmful, but really don’t involve people harming one another.
Some examples of previously-used lab-based behaviors include sending irritating sound blasts to another person (who typically does not exist), selecting how much hot sauce will ostensibly be served to a person who dislikes spicy foods, sticking pins into a voodoo doll of another person to “inflict harm,” choosing how long another person will hold an uncomfortable yoga pose, etc.  It is not that aggression researchers think these are super harmful behaviors; rather, these are reasonable tasks that can be considered a little harmful, are quantifiable, can be done in a lab environment, don’t put anybody in harm’s way, etc.  In other words, these tasks are pragmatic, not ideal.
Some people legitimately doubt whether these behaviors meet the “harmfulness” criterion (e.g., is a sound blast really “harmful”?).  And, I would suspect that most aggression researchers would readily concede that these behaviors are artificial, contrived, and open to debate on whether they are “harmful”.  If the opinion is that these behaviors are not “harmful,” then, by definition, these behaviors cannot be considered aggressive.  I sincerely hear and understand these criticisms.  Nevertheless, researchers obviously cannot allow participants to actually harm another person within a lab environment.  
Inferring the presence of aggressive cognitions in the lab
It is insufficient merely to demonstrate that a harmful behavior has occurred; the cognitive process that causes those behaviors must involve, in some (usually undefined) capacity, aggressive cognitions.  If aggressive cognitions were not involved, then the resulting behavior is not aggression, regardless of how harmful the behavior was.
Aggression researchers attempt to create a context from which aggressive cognitions can be inferred.  For example, researchers may tell participants that a specific behavior (e.g., pressing a button) will cause a specific event (e.g., send an unpleasant noise) that has a specific effect (e.g., another person will experience the unpleasant noise).  Thus, observing the behavior allows the researcher to infer the behavior was done with a known intent and with a known consequence.  If the behavior was harmful and aggressive cognitions were assumed to be involved in the cognitive process that caused those behaviors, then the resulting behavior can be assumed to be aggressive.
Some critics point out that several cognitive processes also can produce the same behavior; thus, there is no reason to favor a cognitive process involving aggressive cognitions over these other cognitive processes. For example, participants may perceive a particular task as competitive (rather than as an opportunity to aggress), participants may engage in “mischievous responding,” or participants may intuit the study’s hypotheses and behave according to what they believe the hypotheses are. 
The argument goes like this.  A cognitive process with “aggressive cognitions” may cause a harmful behavior (if a, then b), but observing a harmful behavior does not necessarily imply the behavior was caused by a cognitive process involving “aggressive cognitions” (b, therefore a) because there are several cognitive processes (e.g., competition, mischievous responding, socially-desirable responding, etc.) that also can cause the same harmful behaviors (if x, then b or if y, then b).  Believing that the presence of a harmful behavior necessarily implies the presence of a cognitive process involving aggressive cognitions is a logical error known as affirming the consequent. 
Perhaps an unappreciated idea is that these criticisms cut both ways.  Just because it is possible that a “non-aggressive” cognitive process can cause a harmful behavior does not mean that it did.  For example, just because it is possible that some participants in some instances may think sending sound blasts to another person is competitive (and not aggressive) does not mean that any specific instance of this behavior does not meet the criteria for aggression.  It is possible those sound blasts in this instance were being sent with the intent to aggress against the recipient and, thus, the behavior would meet the criteria for aggression.  Further, if the same context (e.g., experiencing an insult) both causes a harmful behavior in the lab (e.g., sound blasts) and harmful behavior out of the lab (e.g., punching another person), this may cause one to slightly favor the cognitive process involving aggressive cognitions when observing the behavior in the lab.  Ultimately, researchers need to use their judgment on whether it is plausible to infer that a cognitive process involved aggressive cognitions.  And reasonable people will disagree on what is plausible.
Another common approach to inferring the presence of aggressive cognitions is to ask participants why they exhibited a behavior.  For example, you could ask participants to report whether they sent loud sound blasts to be “aggressive” or not.  If they say “yes,” then the resulting harmful behavior may be considered aggressive. 
As straightforward as this approach appears, it has its own limitations.  First, this approach assumes that participants have introspective access to their cognitive processes (which is not a requirement for the resulting behavior to be considered aggressive).  Second, the abovementioned criticisms of the processes causing harmful behaviors also apply to the processes causing participants’ self-reported motives.  For example, participants may report having done a behavior to be aggressive merely because they are being “mischievous,” or participants may intuit the study hypotheses and “play along” with what they believe the hypotheses are.   Simply put, there are many cognitive processes that can become expressed in a response of “I did that behavior to be aggressive”.
As with the criticisms of the behaviors typically observed in lab-based aggression paradigms, I sincerely hear and understand the critiques about whether aggressive cognitions are involved in the process causing those behaviors.  There is no avoiding the fact that inferring characteristics of cognitive processes is hard to do and that different researchers have different ideas of what would convince them to infer the presence of aggressive cognitions. 
Framing and addressing the critiques
It is hard to demonstrate aggression in a laboratory setting in a way that will result in widespread agreement.  But hard does not mean impossible.  And disagreements need not be permanent.  Here are things that I believe will facilitate discussions about the value of these paradigms.
1.    For those offering critiques, be specific about the target of criticism.  Do you not believe the behavior was harmful?  Or do you believe there was an alternative cognitive explanation for the observed harmful behavior?  Or both?  Clarity in the critique offers clarity in the ways in which researchers can improve their methods.  Those who are unconvinced by current methods should state what methods or evidence would be convincing.  Inconvincibility is a conversation stopper.

2.    For researchers, to demonstrate aggression you need to both (a) demonstrate a harmful behavior and (b) argue that this behavior was caused by a cognitive process involving “aggressive” cognitions.  Thus, you need to argue both why you believe the observed behavior was harmful and why you believe the cognitive process involved aggressive cognitions.  Without both of these things, you cannot claim you have measured aggression.  Keep in mind that people might argue the behavior was not harmful, people might argue there was an alternative cognitive process that caused the behavior, or both.  Such critiques are OK; it’s called science.  Take these criticisms seriously and use them as motivation to improve your methods.

Research Minutiae

I have a confession: While I enjoy learning other people’s ideas about major issues in the field of psychology, I also am interested in the mundane details about psychological research. You know, the minutiae that most people overlook or take for granted when they do psychological research. In this way, I am sort of like a less cool, less funny Jerry Seinfeld (if he happened to be into research). The nice part is that when I learn a tip or a trick that may be useful I often can incorporate these ideas immediately into my work. The research process is slow, so it is not often that we get an immediate payoff.

With that said, I believe that every researcher learns and collects little tips and tricks along the way, but these are not often the focus of conversations. However, I believe these small details are important, helpful, and interesting. So, I put together a list of 5 mundane details that I care about. I call these my research minutiae. These are from my personal experience, so take them with a grain of salt (and a cup of coffee, research should always involve coffee). But, perhaps some of these will be helpful to you.

1) I cannot tell you how often I review an article and the authors report an interaction with the letter ‘x’ and not the multiplication symbol ‘×’. Please keep in mind that ‘x’ != ‘×’.
Incorrect: We observed a Condition x Gender interaction.
Correct: We observed a Condition × Gender interaction.
I don’t believe this detail is picky. First, one is correct and one is incorrect. Further, keep in mind that our audience is other humans whom we must convince that our research is meaningful. These small details help signal to the reader that you (the author) attend to details and that you have taken your time preparing your manuscript. Conversely, if your results are filled with errors, the reader may wonder how careful you were with other aspects of the research design or statistical analyses.

2) When you develop a line of research you often end up measuring the same things in different studies. Label your variables consistently. This allows you to more easily re-use syntax/script that you have already created. This also makes it easy to find common variables across your different datasets. This has the added benefit of nudging you to start thinking of your research less as a collection of individual studies and more of a body of research. In addition to using the same variable names when referring to the same thing, here is a list of common ways to label variables.

Labels with spaces (example: feedback condition): This may work fine for labeling columns in Excel, but most stats programs won’t be able to read variable names with spaces. Don’t do it.
ant.format (example: feedback.condition): This is recommended by the Google style guide for R. It is pretty readable and, from what I can tell, is fairly common.
gooseneck_format (example: feedback_condition): Also very common. In my opinion, this is not visually appealing. For example, mean_of_trait_aggression looks ugly. But that is just my opinion.
camelBack (example: feedbackCondition): This is what I tend to use. It saves a keystroke and a space, which is nice for making concise script. However, if your variable names are long, this format can be tough to read. Also, some people capitalize the first letter of the first word and some don’t. This becomes important when using programs like R, which are case sensitive.

First, be consistent within your personal datasets. Second, if you run a lab or have close collaborators, encourage everybody to use the same variable names and naming style. It is a small thing that helps with communication.
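As a small illustration of the payoff, here is a hypothetical R snippet. The file and variable names (study1.csv, study2.csv, feedbackCondition, traitAggression) are made up, but the point is that consistent naming lets the exact same code run on every dataset.

# Hypothetical example: because every study uses the same variable names,
# one helper function can summarize any of the datasets.
summarize_feedback <- function(file) {
  d <- read.csv(file)
  aggregate(traitAggression ~ feedbackCondition, data = d, FUN = mean)
}

summarize_feedback("study1.csv")
summarize_feedback("study2.csv")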

3) Come up with a good set of demographic questions and try to use it in all of your research. I always collect demographic information, but I used to come up with a new set of questions for each of my studies. Recently I had a reason to examine data from across my different studies (I was comparing demographics of my in-person participants and my online participants). I quickly found my inconsistencies to be frustrating. For example, I sometimes gave participants the option to self-identify as Asian-American or as of Middle Eastern descent, and other times there was no option to self-identify as Middle Eastern. Sometimes I forced people to identify as male or female; other times I gave an NA/other option. I also used different response formats for education level, relationship status, and employment status. This made my cross-study comparisons less than optimal and took much longer than it needed to. Just to be clear, I am not only talking about collecting the same demographic information; I also am talking about using the exact same wording and the exact same response formats. This makes apples-to-apples comparisons across all of your studies a cinch.

If you run a lab, take the time to come up with a good set of demographic items (this is a good task for an undergraduate to learn about question wording and different response formats). At this point, try to be inclusive; answering these questions only adds seconds to a participant’s time. Get the information from them while you can. Try to use those questions in all of your studies. Not only will it allow you to re-use script/syntax (see previous point), but it will facilitate your cross-study comparisons. Also keep in mind that you are not beholden to these questions. If there are problems with some questions, or you need to add/drop questions for a specific study, then do what you need to do.

There are many reasons that the entire field will never adopt a common protocol for the collection of demographic information, but it seems feasible for a small group of researchers in a specific area to do so. I am not aware of that happening, but one could imagine. In my mind, I imagine that would be awesome. Future meta-analysts would thank you.

4) Save your raw data in a separate file and try to never open it again. In fact, if you ever looked at the folders where I keep my files, you would find one datafile that is labeled with “RAW” and then the datafile that I actually use for analyses. For example, when I get survey data entered into a .csv file, I save that file as Study 1 RAW.csv. Then I save the file again as Study 1.csv. I only do my analyses on the Study 1.csv file and NEVER on the Study 1 RAW.csv file. When computing variables, reverse-coding variables, etc., it is too easy to make mistakes. It is always nice to be able to go back to the RAW datafile and start from scratch if you need to.
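Here is a sketch of what this habit looks like in R, using the file names from the example above; the reverse-coded item is a hypothetical stand-in for whatever cleaning a given study needs.

raw <- read.csv("Study 1 RAW.csv")       # read the untouched raw file

d <- raw                                 # all computing/recoding happens on the copy
d$item3r <- 8 - d$item3                  # e.g., reverse-code a hypothetical 1-7 item

write.csv(d, "Study 1.csv", row.names = FALSE)   # analyses only ever use this file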

5) Annotate your script/syntax. People will tell you that this is important because in the age of open science that other people need to be able to open your files and understand what you did. Another researcher should be able to exactly reproduce your analyses merely by looking at your files.

That is all well and good. I love open science and I think those are valid points. But I will make a more pragmatic argument. Annotate your script/syntax so that you will know what you did 3 months or 3 years from now. It may seem like a time waster. After all, while you are doing your analyses you know what your variable names mean and you know why you are running each analysis. It is hard to imagine not knowing what your variables mean or why you did a specific analysis. But trust me, your memory for these things decays quickly. If I look back at my old files, my memory for what I was thinking is surprisingly bad. I am thankful that I am a good note taker. Make detailed notes about what you are doing in your analyses and why you are doing them. Your future self will thank you.
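To show the level of annotation I have in mind, here is a short hypothetical example; the variable names and the exclusion rule are made up for illustration.

# Hypothetical analysis script for Study 1.
d <- read.csv("Study 1.csv")

# Drop participants who failed the attention check (attnCheck == 0); this is
# why the analysis N is smaller than the number of rows in the datafile.
d <- d[d$attnCheck == 1, ]

# Hostility composite: mean of six trait ratings; kindR, considerateR, and
# thoughtfulR are assumed to be already reverse-coded.
d$hostility <- rowMeans(d[, c("hostile", "unfriendly", "dislikable",
                              "kindR", "considerateR", "thoughtfulR")])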

So, these are a few of the research minutiae that I have picked up over the few years that I have been doing research. My hope is that one of these things may help at least one other person. Most of these are things that I now do without even thinking about them. So, if I notice myself doing any other things, I may add to this list in future posts. I also enjoy hearing about other people’s research minutiae. If you have any that you would like to share, please leave a comment or contact me.