Which behavior is more aggressive?

First, some mental stretching

If you compared two aggressive behaviors, would you be able to tell which one was more aggressive?

Let’s test your intuition. Consider these two scenarios.

(a) Person B insults Person A. Person A punches Person B in the face.

(b) Person 2 insults Person 1. Person 1 punches Person 2 in the face.

Which behavior was more aggressive, Person A’s behavior or Person 1’s behavior? Most people would say these are equivalently aggressive behaviors based on the information that is provided. Let’s try a different scenario; same game, different behaviors.

(a) As part of an experiment that is ostensibly about how tactile experiences affect cognitive performance, Person A assigns Person B to hold their hand in ice water for 45 seconds.

(b) As part of an experiment that is ostensibly about how tactile experiences affect cognitive performance, Person 1 assigns Person 2 to hold their hand in ice water for 45 seconds.

Which behavior was more aggressive, Person A’s behavior or Person 1’s behavior? Again, your intuition would probably say these are equivalently aggressive behaviors based off the information that is provided. I mean, 45 seconds is the same as 45 seconds, right? Let’s try a final example.

(a) As part of an experiment that is ostensibly about how tactile experiences affect cognitive performance, Person A assigns Person B to hold their hand in ice water for 45 seconds.

(b) As part of an experiment that is ostensibly about how tactile experiences affect cognitive performance, Person 1 assigns Person 2 to hold their hand in ice water for 30 seconds.

Which behavior was more aggressive, Person A’s behavior or Person 1’s behavior? Your intuition would probably say that Person A’s behavior is more aggressive than Person 1’s. After all, 45 seconds is greater than 30 seconds.

OK, now that your intuition is warmed up, let’s poke and prod these ideas a little bit.

Aggression is the combination of several necessary things

What is aggression? There must be several factors present for aggression to occur. Aggression is commonly defined as a “behavior that is done with the intent to harm another individual who wants to avoid receiving the harm” (Baron & Richardson, 1994). Thus, for aggression to occur, there needs to be a behavior that, if successfully executed, would cause harm. This behavior also must have been done with intent (e.g., it is not an accident) and with the belief the recipient wanted to avoid the behavior.

Each of these features is necessary for aggression to occur; if one of the features is not present, then there is no aggression. For example, there must be a behavior (i.e., it is insufficient to merely desire to harm another individual). Further, a behavior can cause harm and not be aggressive if it is unintentional (e.g., accidentally dropping a hammer on somebody’s foot). And even an intentional behavior can cause harm and not be aggressive if it is believed the recipient does not want to avoid the behavior (e.g., two adults who engage in BDSM can intentionally cause tissue damage to one another as part of consensual sexual activities).

The amount of aggression is not defined based on the extremity of the consequences of the behavior

So what specifically does it mean to say that one of two aggressive behaviors was more aggressive than the other? Does more aggression correspond to a more harmful behavior? Does it merely correspond to more intention to cause harm regardless of the actual harm? Both? As far as I can tell, this is an unanswered question (at least in the social psychology literature on aggression that I am familiar with).

Go back to the scenarios in the warm-up exercise. Suppose you intuited that having both individuals hold their hand in ice water for 45 seconds was equally aggressive and you intuited that having an individual hold their hand in ice water for 45 seconds is more aggressive than having an individual hold their hand in ice water for 30 seconds. If this indeed was your intuition, then it seems like your intuition is that more actual harm (defined here as how long one has to hold their hand in ice water) corresponds to more aggression.

However, it is not difficult to see how this “more actual harm = more aggression” correspondence can break down. Imagine a bar. You know, a place where everybody knows your name. Suppose Norm is really pissed at Sam. Norm throws an empty beer stein at Sam and strikes him in the head. Sam’s head hurts, but is otherwise OK. Now suppose that Woody is only slightly peeved at Cliff. Woody gives Cliff a firm and assertive push. Cliff stumbles backwards, accidentally falls, strikes his head against the bar, and ends up dying before the ambulance arrives. Woody, who only wanted to give a little push, is horrified that he killed Cliff.

Clearly dying is more harmful than a headache; thus, Cliff has clearly been more severely harmed than Sam. However, is it really the case that an assertive push is more aggressive than throwing a heavy glass beer stein at an individual’s head? Probably not. Or at least that’s not what your intuition might say. It’s just that in this scenario the less intuitively aggressive behavior (i.e., the push) resulted in a more harmful consequence than the more intuitively aggressive behavior (i.e., throwing the beer stein).

The previous two paragraphs highlight an important concept to grasp. When comparing two aggressive behaviors, it is not the actual amount of harm, but it is the intended amount of harm, that determines which behavior was more aggressive. This would mean that Norm throwing the beer stein at Sam is more aggressive than Woody pushing Cliff even though Woody’s behavior resulted in more actual harm. Why? Because Norm intended to harm Sam more than Woody intended to harm Cliff.

So Woody’s behavior caused more harm than Norm’s even though Norm’s behavior was more aggressive than Woody’s. But Woody’s behavior only caused more harm because of all of the unintended stuff that happened after the behavior (i.e., Cliff stumbled and fatally hit his head on the bar). We cannot judge which of two behaviors is more aggressive based off the unforeseen and unintended things that happen after an intended behavior is executed. For example, what if Sam takes an aspirin for his headache, but the aspirin is actually a poison that slowly and painfully kills Sam over the course of several miserable days? Would Norm’s behavior now become equally as aggressive as Woody’s because Sam and Cliff both ended up dying? After all, Norm’s behavior caused Sam to take the aspirin/poison.

What does this mean for determining which behavior is more aggressive? 

Because we cannot evaluate the aggressiveness of a behavior based on the actual consequences, the best way to determine which of two behaviors is more aggressive is to compare the level of harm that was intended by the aggressor. There will be a series of events that unfolds once these intentions are manifested as actual behaviors, but only the consequences that are foreseen and intended by the aggressor can be used to evaluate the aggressiveness of the behaviors.

So Norm’s behavior is more aggressive than Woody’s because Norm intended to cause more harm than Woody. Full stop. For the purposes of determining who was more aggressive, it does not matter what unforeseen and unintended consequences follow. These are not relevant for evaluating the aggressiveness of the behavior.

As a mental exercise, you could evaluate Norm and Woody’s behaviors at the conclusion of their intended behaviors to determine who behaved more aggressively. For Norm, you could look at the amount of harm that was caused at the moment the intentionally-thrown beer stein hit Sam’s head. For Woody, you could look at the amount of harm that occurred at the point of the intentional push. If those are the behaviors these two intended, then those are what their aggressiveness ought to be evaluated on. Everything that happens after their intended behavior is executed (e.g., Cliff falls and hits his head; Sam ingests poison) does not matter for evaluating who was more aggressive.

Now let’s return to the final scenario of the warm-up exercise. Here it is so you don’t have to scroll back up.

(a) As part of an experiment that is ostensibly about how tactile experiences affect cognitive performance, Person A assigns Person B to hold their hand in ice water for 45 seconds.

(b) As part of an experiment that is ostensibly about how tactile experiences affect cognitive performance, Person 1 assigns Person 2 to hold their hand in ice water for 30 seconds.

I believe most people would intuit that Person A was behaving more aggressively than Person 1 because 45 seconds is longer than 30 seconds. This would mean that we believe that Person A intended to cause more harm than Person 1. But aggression researchers must be careful not to “affirm the consequent”. That would go something like this.

If Person A intends to cause more harm, then Person A will assign a longer time;

Person A assigns a longer time;

Therefore, Person A intended to cause more harm.

We can only assume that Person A intended to cause more harm than Person 1 if we also assume that their intentions map directly onto their behavior (e.g., there is a linear relationship between their intention and the amount of time selected in the cold water task). Further, when comparing the amount of aggression between individuals, it is necessary to assume that individuals’ intentions are similarly manifested into behaviors. However, it’s possible, for example, that Person A has a high tolerance for pain and assumes that holding your hand in ice water for 45 seconds is not that bad. It’s also possible that Person 1 has a low tolerance for pain and thinks that holding your hand in ice water for 30 seconds would be excruciating. In this case, Person 1’s 30 seconds would be intended to be more harmful than Person A’s 45 seconds.
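To make the pain-tolerance example concrete, here is a toy sketch in R. It assumes, purely for illustration, that intended harm can be expressed as the assigned time relative to the time the assigner believes is tolerable. The function `intended_harm` and every tolerance value below are hypothetical and do not come from any study.

```r
# Toy model (hypothetical): intended harm is the assigned time divided by
# the number of seconds the assigner believes a person can comfortably tolerate.
intended_harm <- function(assigned_sec, believed_tolerable_sec) {
  assigned_sec / believed_tolerable_sec
}

# Person A has a high pain tolerance and believes 90 seconds is tolerable.
harm_A <- intended_harm(assigned_sec = 45, believed_tolerable_sec = 90)

# Person 1 has a low pain tolerance and believes 20 seconds is tolerable.
harm_1 <- intended_harm(assigned_sec = 30, believed_tolerable_sec = 20)

# The longer assigned time does not imply the greater intended harm.
harm_A < harm_1
```

Under these made-up beliefs, the person who assigned the shorter time intended more harm, which is exactly why the assigned time alone cannot settle who was more aggressive.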

What should aggression researchers do?

Option 1: Describe the results in a way that does not refer to participants’ intentions. For example, you could merely say “Person A assigned their recipient to hold their hand in ice water for a longer amount of time than Person 1”. Do not say “Person A was more aggressive than Person 1” because that is making a statement about their intentions.

Option 2: If you want to make claims about aggression, then you must explicitly state your assumptions. For example, you could say “For the ice water task, we are assuming that the amount of time assigned will directly correspond to participants’ intended harm to the recipient.” If you accept this assumption, then you could conclude that the behaviors that cause more actual harm are more aggressive. However, people are not required to accept your assumptions.

Option 3: You can measure aggression within-participants. It may not hold that two separate individuals’ responses directly correspond to their intentions. That is, what one person intends when they, for example, assign another person to hold their hand in ice water for 30 seconds may not be the same thing as another person intends when they also choose 30 seconds. In other words, their intentions can differ even if their behaviors are identical. However, imagine comparing two behaviors by the same individual. It seems way more plausible that the more harmful behavior for that individual is more aggressive than the less harmful behavior for that individual.

Is this effect smaller than the SESOI? Evaluating the hostile priming effect in the Srull & Wyer (1979) RRR

I was recently involved with a Registered Replication Report (RRR) of Srull & Wyer (1979). In this RRR, several independent labs collected data to test the “hostile priming effect”: An effect where exposing participants to stimuli related to the construct of hostility causes participants to subsequently judge ambiguous information as being more hostile. The results from each lab were combined into a random-effects meta-analysis. The result is a high-quality estimate of the hostile priming effect using a particular operationalization of the procedure.  
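For readers who have not seen a random-effects meta-analysis, here is a minimal base-R sketch of how per-lab effects can be pooled. It assumes the DerSimonian-Laird estimator of between-lab variance, which may not be the estimator used in the RRR, and the per-lab effects and variances are invented for illustration. The function name `random_effects_meta` is also hypothetical.

```r
# Hypothetical per-lab standardized mean differences and sampling variances.
lab_d   <- c(0.10, -0.02, 0.08, 0.05, 0.12)
lab_var <- c(0.010, 0.012, 0.008, 0.015, 0.011)

# Random-effects pooling with the DerSimonian-Laird estimate of tau^2
# (an assumption; other estimators exist).
random_effects_meta <- function(d, v) {
  w_fixed  <- 1 / v                                   # inverse-variance weights
  mu_fixed <- sum(w_fixed * d) / sum(w_fixed)         # fixed-effect estimate
  q_stat   <- sum(w_fixed * (d - mu_fixed)^2)         # heterogeneity statistic Q
  c_term   <- sum(w_fixed) - sum(w_fixed^2) / sum(w_fixed)
  tau2     <- max(0, (q_stat - (length(d) - 1)) / c_term)  # between-lab variance
  w_random <- 1 / (v + tau2)                          # random-effects weights
  list(estimate = sum(w_random * d) / sum(w_random),
       se       = sqrt(1 / sum(w_random)),
       tau2     = tau2)
}

pooled <- random_effects_meta(lab_d, lab_var)
pooled
```

The pooled estimate and its standard error are the two numbers that the later confidence intervals and equivalence tests are built on.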

In both the original study and the RRR, participants completed a sentence descrambling task where they saw groups of 4 words (break, hand, his, nose) and had to identify 3 words that created a grammatically-correct phrase (break his nose). Participants in the RRR were randomly assigned to one of two conditions: A condition where 20% of the descrambled sentences described “hostile” behaviors or a condition where 80% of the descrambled sentences described “hostile” behaviors. All participants then (a) read a vignette about a man named Ronald who acted in an ambiguously-hostile manner, (b) reported their judgments of Ronald’s hostility, and (c) reported their judgments of the hostility of a series of unrelated behaviors.

Thus, the RRR had one between-participants condition (i.e., the 20% hostile sentences vs. the 80% hostile sentences) and two outcome variables (i.e., hostility ratings of Ronald and hostility ratings of the behaviors). We expected to observe more hostile ratings from those who were in the 80% hostile condition than those who were in the 20% hostile condition.

The full report of the RRR can be found here.

I want to discuss the result of the meta-analysis for ratings of Ronald’s hostility. On a 0-10 scale, we observed an overall difference of 0.08 points, a pretty small effect. However, because the RRR had so much statistical power, the 95% confidence interval for this effect was 0.004 to 0.16, which excludes zero and was in the predicted direction. This works out to a standardized mean difference of d = 0.06, 95% CI[0.01, 0.12]. What should we make of this effect? Does this result corroborate the “hostile priming effect”? Or is this effect too small to be meaningful?

Here are my thoughts on this effect and my efforts to determine whether it is meaningful. However, just because I was an author on this manuscript should not bestow my opinions with any special privilege. I completely expect people to disagree with me.

First, the argument in favor of the detection of the hostile priming effect

Some people will point to a meta-analytic effect of d = 0.06, 95% CI[0.01, 0.12] and argue this ought to be interpreted as a successful demonstration of the hostile priming effect. The logic of this argument is simple: Because participants were randomly assigned to groups, a nil effect (i.e., an effect of 0.00) is a theoretically meaningful point of comparison. And because the 95% confidence interval does not include zero, one could claim the observed effect is “significantly” different from a nil effect. In other words, the observed effect is significantly greater than zero in the predicted direction.
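This check can be reproduced in a few lines of R. The sketch takes the estimate and standard error from the equivalence-test code at the end of this post (ES = 0.0621, se = 0.0283) and assumes a normal approximation. It is a sketch, not the RRR’s analysis script.

```r
# Meta-analytic estimate and standard error (standardized units),
# as used in the equivalence-test code at the end of this post.
es <- 0.0621
se <- 0.0283

# 95% confidence interval under a normal approximation.
z_crit <- qnorm(0.975)
ci_95  <- es + c(-1, 1) * z_crit * se
round(ci_95, 2)

# Two-sided test against a nil effect of exactly zero.
z_stat      <- es / se
p_two_sided <- 2 * pnorm(-abs(z_stat))
p_two_sided
```

The lower bound sits just above zero, which is what the “significantly different from a nil effect” argument relies on.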

To some, the magnitude of the effect does not matter. It only matters that an effect was detected and was in the predicted direction.

Arguments against the detection of the hostile priming effect

Without arguing about the magnitude of the effect, one can make at least two arguments against the idea that we detected the hostile priming effect. Essentially, these arguments are based on the idea that you can make different decisions about how you construct the confidence intervals, which would affect whether they include zero or not.

First, one could point out that there were two outcome variables and two meta-analyses. If you want to maintain an overall Type 1 error rate of 5%, one ought to adjust for the fact that we conducted two hypothesis tests. In this case, each adjusted confidence interval would be wider than the unadjusted 95% confidence interval. This would make the adjusted 95% confidence interval for the ratings of Ronald’s hostility contain zero, which, by the same logic as described in the previous section, would be interpreted as an effect that is “not significantly” different than zero.  

Second, you could argue that a 95% confidence interval is too lenient. Because of the resources that were invested in this study, perhaps we ought to adopt a more stringent criterion for detecting an effect such as a 99% confidence interval. Adopting, for example, a 99% confidence interval would make the interval wider and would then include zero.  
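Both of these arguments can be checked numerically. The sketch below assumes a normal approximation and a Bonferroni adjustment for the two tests (one possible adjustment among several), again using the estimate and standard error from the equivalence-test code at the end of this post. The helper `normal_ci` is a hypothetical name.

```r
es <- 0.0621
se <- 0.0283

# Confidence interval under a normal approximation for a given confidence level.
normal_ci <- function(estimate, se, conf_level) {
  z_crit <- qnorm(1 - (1 - conf_level) / 2)
  estimate + c(-1, 1) * z_crit * se
}

# Argument 1: Bonferroni adjustment for two tests (alpha = .05 / 2 per test).
ci_bonferroni <- normal_ci(es, se, conf_level = 1 - 0.05 / 2)

# Argument 2: a more stringent 99% confidence interval.
ci_99 <- normal_ci(es, se, conf_level = 0.99)

round(ci_bonferroni, 3)
round(ci_99, 3)
```

In both cases the lower bound drops below zero, whereas the unadjusted 95% interval stays above it.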

It is important to keep in mind that decisions on how to construct confidence intervals should be made a priori. In the RRR, we planned to construct 95% confidence intervals separately for each of the outcome variables. Sticking to our a priori data analysis plan, the 95% confidence interval for the ratings of Ronald’s hostility excludes zero. For this reason, I don’t believe these arguments are very persuasive.

Is the observed effect too small to be meaningful?

Let’s assume that we accept that a hostile priming effect was detected. So what? A separate way to evaluate the effect for Ronald’s hostility is to ask: Is the detected effect meaningful? To answer this question we need to establish what we mean by “meaningful”. In other words, we need to establish what is the Smallest Effect Size of Interest (SESOI).

Once a SESOI is established, one can create a range of effects that would be considered smaller than what is “of interest.” Then we can test whether our observed effect is smaller than what would be “of interest” by conducting two one-sided significance tests against the upper- and lower-bounds of the SESOI region using the TOSTER package (see Lakens, Scheel, & Isager, 2018). If the observed result is significantly smaller than the upper-bound of this range and is significantly larger than the lower-bound of this range, then one can conclude the effect is smaller than the SESOI. Equivalently, one can construct a 90% confidence interval and see whether the 90% confidence interval falls completely between the lower and upper bounds of the SESOI.
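The logic of the two one-sided tests can be sketched in base R. This is an illustration of the procedure under a normal approximation, not TOSTER’s implementation, and the function name `tost_z` is hypothetical. The example call uses the estimate and standard error from the code at the end of this post with bounds of d +/- 0.30.

```r
# Two one-sided z-tests against the lower and upper equivalence bounds.
tost_z <- function(estimate, se, low_bound, high_bound, alpha = 0.05) {
  z_lower <- (estimate - low_bound) / se    # tests H0: effect <= low_bound
  z_upper <- (estimate - high_bound) / se   # tests H0: effect >= high_bound
  p_lower <- pnorm(z_lower, lower.tail = FALSE)
  p_upper <- pnorm(z_upper)
  # Equivalent check: does the (1 - 2 * alpha) CI fall inside the bounds?
  ci <- estimate + c(-1, 1) * qnorm(1 - alpha) * se
  list(z_lower = z_lower, p_lower = p_lower,
       z_upper = z_upper, p_upper = p_upper,
       ci = ci,
       equivalent = max(p_lower, p_upper) < alpha)
}

result <- tost_z(estimate = 0.0621, se = 0.0283,
                 low_bound = -0.30, high_bound = 0.30)
result$equivalent
```

Both one-sided tests must reject for the effect to be declared smaller than the SESOI, so the larger of the two p-values decides the outcome.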

Here are 6 ways that I created the SESOI for the ratings of Ronald’s hostility. [Disclaimer: I constructed these SESOI after knowing the results. Ideally, these decisions should be made prior to knowing the results. This would be a good time to think about what SESOI you would specify before reading what comes next].

1) What is the SESOI based on theory? The first way to determine the SESOI is to look to theory for a guide. As far as I can tell, the theories that are used to predict priming effects merely make directional predictions (e.g., Participants in Group A will have a higher/lower value on the outcome variable than participants in Group B). I cannot see anything in these theories that would allow one to say, for example, effects smaller than a d of X would be inconsistent with the theory. Please let me know if anybody has a theoretically-based SESOI for priming effects.

2) What is the SESOI implied by the original study? A second way to determine the SESOI is to look at what effect was detectable in the original study. Srull and Wyer (1979) included 8 participants per cell in their study. Notably, the original study included several other factors, and seemed to be primarily interested in the interactions among these factors, and the RRR was interested in the difference of two cells. Fair enough. Nevertheless, we could infer the SESOI based on what effect would have produced a significant effect given the sample that was included in the original study.

To determine what effects would not have been significant in the original study, we can estimate what effect would correspond to 50% power. An effect smaller than this would not have been significant, an effect exactly this magnitude would have produced p = .05, and an effect larger than this effect would have been significant in the original study. With n = 8 participants/cell, a one-tailed α = .05, and 1 – β = .50, the original authors would have needed an effect of d +/- 0.86 to find a p-value < α. The effect from the RRR is significantly greater than d = -0.86 (z = 32.58, p < .001) and is significantly less than d = +0.86 (z = -28.19, p < .001).
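One way to approximate this calculation is with `power.t.test()` from base R’s stats package, leaving `delta` unspecified so that R solves for it. This is a sketch that assumes a simple two-group, one-sided t-test with sd = 1 (so `delta` is in d units); I am not claiming it is the exact procedure used to obtain the 0.86 above, although the result should land close to it.

```r
# Solve for the effect size that gives 50% power with 8 participants per cell.
original_design <- power.t.test(n = 8,
                                power = 0.50,
                                sig.level = 0.05,
                                sd = 1,                 # delta is then in d units
                                type = "two.sample",
                                alternative = "one.sided")

# Smallest effect that would have been significant in a design of this size.
original_design$delta
```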

3) What is the SESOI based on prior research? A third way to determine the SESOI is to look at the previous literature. In 2004, DeCoster and Claypool conducted a meta-analysis on priming effects with an impression formation outcome variable (an interesting side note: the effect size computed for Srull & Wyer [1979] was d > 5 and was deemed a statistical outlier in this meta-analysis). The meta-analysis concluded there is a hostile priming effect of about a third of a standard deviation, d = 0.35, 95% CI[0.30, 0.41] (more interesting side notes: This meta-analysis did not account for publication bias and also includes several studies that were authored by Stapel and were later retracted for fraud. Due to these two factors, it seems likely that this effect size is upwardly biased). Nevertheless, we can at least point to a number to create an SESOI and know where it came from. Perugini, Gallucci, and Costantini (2014) suggest using the lower limits of a previous meta-analysis to be conservative.

The lower limit of the 95% CI for the DeCoster and Claypool meta-analysis is d = 0.30. The effect from the RRR is significantly greater than d = -0.30 (z = 12.8, p < .001) and is significantly less than d = +0.30 (z = -8.41, p < .001).

4) What is the SESOI based on my subjective opinion? A fourth way to determine the SESOI is to merely ask yourself “what is the smallest effect size that I think would be meaningful?” To me, in the context of an impression formation task using a 0-10 scale, I would put my estimate to be somewhere around one-quarter of one point on the rating scale. In other words, I would consider a mean difference of 0.25 points to be the minimally-interesting difference. Of course, people can disagree with me on this.

The standard deviation for ratings of Ronald’s hostility was 1.44 units, which means that 0.25 units is an effect of d = 0.25/1.44 = 0.17. The effect from the RRR is significantly greater than d = -0.17 (z = 8.20, p < .001) and is significantly less than d = +0.17 (z = -3.81, p < .001).
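The conversion between raw scale points and standardized units is simple enough to script, which makes it easy to try your own subjective SESOI. This sketch assumes the single standard deviation of 1.44 reported above is the appropriate standardizer; the helper names `raw_to_d` and `d_to_raw` are hypothetical.

```r
sd_ronald <- 1.44  # SD of the hostility ratings of Ronald, from the text

# Convert a raw difference on the 0-10 scale to d, and back again.
raw_to_d <- function(raw_diff, sd) raw_diff / sd
d_to_raw <- function(d, sd) d * sd

round(raw_to_d(0.25, sd_ronald), 2)  # 0.17, the subjective SESOI in d units
round(d_to_raw(0.10, sd_ronald), 2)  # 0.14, a d of 0.10 expressed in scale points
```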

5) What is the SESOI that represents the amount of resources that others are likely to invest in their future studies? A fifth way to determine the SESOI is to ask “how large of an effect would be needed to be routinely detectable by future researchers?” The answer to this question comes from determining the resources that future researchers would be likely to invest in detecting this effect. For me, I think that researchers would be willing to invest 1,000 participants into a study to try to detect the hostile priming effect. Effects that require more than 1,000 participants would likely be deemed too expensive to routinely study. That is based on my gut and people are free to disagree.

If researchers were willing to collect n = 500 participants/cell, then they would be able to detect an effect as small as d = 0.16 with the minimum recommended level of statistical power (using a one-tailed α = .05, because of the directional prediction, and 1 – β = .80). The effect from the RRR is significantly greater than d = -0.16 (z = 7.85, p < .001) and is significantly smaller than d = +0.16 (z = -3.46, p < .001).
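If your gut says a different budget than 1,000 participants, the detectable effect is easy to recompute. The sketch below uses `power.t.test()` from base R and assumes a two-group, one-sided t-test with sd = 1; change `n` to your own per-cell sample size.

```r
# Smallest effect detectable with 80% power and 500 participants per cell.
future_design <- power.t.test(n = 500,
                              power = 0.80,
                              sig.level = 0.05,
                              sd = 1,                 # delta is then in d units
                              type = "two.sample",
                              alternative = "one.sided")

round(future_design$delta, 2)
```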

6) What is the SESOI based on an arbitrarily small effect size? Finally, we can determine the SESOI by using an arbitrary convention like Cohen’s suggestion that a d = 0.20 represents a small effect. Or, more stringently, we could follow Maxwell, Lau, and Howard (2015)’s suggestion to consider a d +/- 0.10 to be trivially small.

The effect from the RRR is significantly greater than d = -0.10 (z = 5.73, p < .001) and is NOT significantly less than d = +0.10 (z = -1.34, p = 0.09).

A Summary of the SESOI analyses

Let’s put it all together into one visualization. Look at the figure below. The blue diamond on the bottom represents the meta-analytic effect of d = 0.06 for the hostile priming effect. The vertical blue dashed lines represent the 90% confidence interval for the hostile priming effect. Notice that the 90% confidence interval just excludes zero.

The horizontal red lines represent the “ranges of equivalence” that I specified above. Each of the horizontal red lines is centered around zero. If the red line is wider than both vertical dashed blue lines, then we would conclude that the observed effect is smaller than the SESOI.

[Figure: equivalence testing figure]

Consistent with the analyses in the previous section, we can see the horizontal red lines extend past the 90% confidence intervals except for the arbitrarily small effect size of d +/- 0.10. Thus, by most standards, we would consider the observed effect to be smaller than the SESOI.

So What Do We Conclude?

For one of the two outcome variables in the RRR, we detected a hostile priming effect in the predicted direction. Further, this detected effect is not significantly smaller than an arbitrarily small effect of d = 0.10 (but then again, our study was not designed to have high power to reject such a small SESOI).

However, when we construct the SESOI in any other way, this detected effect is significantly smaller than the SESOI. It would take several thousands of participants to routinely detect a hostile priming effect of this magnitude, which makes it likely too resource expensive to make this effect part of an ongoing program of research.
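The sample size claim can be checked with the same base-R function by solving for `n` instead. This is a sketch under assumed design choices (two groups, one-sided α = .05, 80% power, sd = 1), not a calculation reported in the RRR.

```r
# Participants per cell needed to detect d = 0.06 with 80% power.
needed <- power.t.test(delta = 0.06,
                       power = 0.80,
                       sig.level = 0.05,
                       sd = 1,
                       type = "two.sample",
                       alternative = "one.sided")

ceiling(needed$n)       # participants per cell
2 * ceiling(needed$n)   # total participants across both cells
```

Under these assumptions the answer is roughly 3,400 participants per cell, far beyond the 1,000 total participants considered above.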

But the question that we really want answered is “what does this effect mean for theory?” Unfortunately (and frustratingly), the theories that predict such priming effects are too vague to determine whether an observed effect of d = 0.06 is corroborating or not, which means that intelligent people will still disagree on how to interpret this effect.


Code for equivalence tests and figure:

# here is the code for conducting the equivalence tests for the Srull & Wyer RRR
# code written by Randy McCarthy
# contact him at randyjmccarthy@gmail.com with any questions

# implied by the original study

TOSTER::TOSTmeta(ES=0.0621, se=0.0283, low_eqbound_d=-0.86, high_eqbound_d=0.86, alpha=0.05)

# from LL of decoster and claypool (2004) 

TOSTER::TOSTmeta(ES=0.0621, se=0.0283, low_eqbound_d=-0.3, high_eqbound_d=0.3, alpha=0.05)

# from my subjective judgment

TOSTER::TOSTmeta(ES=0.0621, se=0.0283, low_eqbound_d=-0.17, high_eqbound_d=0.17, alpha=0.05)

# from amount of resources likely to be invested

TOSTER::TOSTmeta(ES=0.0621, se=0.0283, low_eqbound_d=-0.16, high_eqbound_d=0.16, alpha=0.05)

# from an arbitrarily small effect

TOSTER::TOSTmeta(ES=0.0621, se=0.0283, low_eqbound_d=-0.10, high_eqbound_d=0.10, alpha=0.05)


# plotting the equivalence tests 

library(ggplot2)

equivRanges <- ggplot() +
  xlim(-1.5, 1.5) +
  # blue diamond: meta-analytic estimate of the hostile priming effect
  geom_point(aes(x = 0.06, y = 0.05),
             color = "blue4",
             size = 5,
             shape = "diamond") +
  scale_y_continuous(name = "", limits = c(0, 1), breaks = NULL) +
  # solid black line: the nil effect of zero
  geom_vline(xintercept = 0,
             color = "black",
             size = 1) +
  # dashed blue lines: bounds of the 90% confidence interval
  geom_vline(xintercept = c(0.01, 0.11),
             color = "blue4",
             size = 1,
             linetype = "dashed") +
  # red segments: one equivalence range per SESOI
  geom_segment(aes(y    = c(0.9, 0.7, 0.5, 0.3, 0.1),
                   yend = c(0.9, 0.7, 0.5, 0.3, 0.1),
                   x    = c(-0.86, -0.30, -0.17, -0.16, -0.10),
                   xend = c(0.86, 0.30, 0.17, 0.16, 0.10)),
               color = "red",
               size = 1.5) +
  geom_label(aes(y = c(0.95, 0.75, 0.55, 0.35, 0.15),
                 x = 0,
                 label = c("50% Power of Original Study",
                           "ll of CI From Previous Meta-Analysis",
                           "Randy's Subjective Opinion",
                           "Economic Argument",
                           "Arbitrarily Small Effect")),
             size = 4,
             nudge_x = -0.5) +
  ggtitle("Equivalence Testing") +
  xlab("Standardized Mean Difference of 'Hostile Priming Effect'")

# display the figure
equivRanges

"Open Science" is risky

Open science practices are “risky”. Not in the sense that they are potentially dangerous, but in the sense that they make it easier for you to be wrong. You know, theoretically “risky”.

Theoretical progress is made by examining the logical implication of a theory, deducing a prediction from the theory, making observations, and then comparing the actual observations to the predicted observations. One way to infer theoretical progress is the extent to which our predicted observations get closer and closer to our actual observations.

Actual observations that are consistent with the predicted observations are considered corroborating. In this case, we tentatively and temporarily maintain the theory. Actual observations that are inconsistent with the predicted observations are considered falsifying of the theory. In this case, we should modify or abandon the theory. The new theory can then be submitted to the same process. Much like long division will iteratively home in on the quotient, many iterations of this conjecture-and-refutation process will slowly increase the ability of the theory to make accurate predictions.

One key aspect of the predictive impressiveness of a theory is the class of observations that are considered “falsifiers”. That is, the predictive impressiveness of a theory comes from how many observations are forbidden by the theory. Predictions that have lots of potentially falsifying observations are considered “risky”.

As an intuitive example, suppose I have a theory to predict where a ball will land on a roulette wheel. I could predict the color of the pocket where the ball would land (Rouge ou Noir) or that the ball would land on an even/odd number (Pair ou Impair). In this bet, a successful prediction forbids about 50% of the possible outcomes. Other bets are riskier though. The riskiest bet on a single spin is to predict the ball will land on a single number. In this bet, a successful prediction forbids 37/38 possible outcomes. The payoffs from these different bets reflect the riskiness of the predictions. The riskier bet (e.g., predicting the ball will land on black 15) pays off more than the less risky bet (e.g., predicting the ball will merely land on a black pocket). Correspondingly, a theory that correctly predicts the riskier bets is considered to be more predictively impressive than a theory that correctly predicts the less risky bets.
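The arithmetic behind “riskier” can be laid out in a small table. This sketch assumes an American double-zero wheel with 38 pockets (consistent with the 37/38 figure above) and the standard casino payouts of 1 to 1 for a color bet and 35 to 1 for a single number; the payouts are my assumption and are not stated above.

```r
pockets <- 38  # American wheel: 18 red, 18 black, 2 green

bets <- data.frame(
  bet             = c("color (black)", "single number"),
  winning_pockets = c(18, 1),
  payout_to_one   = c(1, 35)   # assumed standard casino payouts
)

# Share of possible outcomes that would falsify each prediction.
bets$share_forbidden <- (pockets - bets$winning_pockets) / pockets

bets
```

The single-number bet forbids about 97% of outcomes against roughly 53% for the color bet, and the payout scales accordingly.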

Our scientific theories are much the same. A theory that makes a vague prediction (e.g., Group A will have slower reaction times than Group B) will have less predictive impressiveness than a theory that makes more specific predictions (e.g., Group A will respond 350-500 ms slower to stimuli than Group B). So we can increase the predictive impressiveness of our scientific theories by having the predicted observations be more precise.

However, unlike predicting the outcomes of a roulette wheel, the predictive impressiveness of scientific theories is not exclusively evaluated on the precision of the predicted observations. Predictive impressiveness also comes from characteristics of the process. That is, our scientific theories not only predict outcomes, but also attempt to explain why those outcomes occur. Even if the predictive outcomes are the same, an observation can become riskier if we constrain possible reasons for why the outcome occurred.

Suppose a researcher has a theory that drinking coffee increases alertness. A study may randomly assign participants to be in the coffee group or the no coffee group. And the outcome variable may be how quickly participants respond to stimuli on a screen as a proxy for alertness. Even if the predicted outcome is the same (i.e., the coffee group will respond faster than the no coffee group), the prediction can be riskier by ruling out other possible reasons for an observation. That is, all else being equal, a study that demonstrates that the coffee group responds faster than the no coffee group will be more impressive if there are certain characteristics of the methods such as double-blinding participants and experimenters, giving the no coffee group decaf as a placebo, etc. The reason these methodological characteristics increase the predictive impressiveness of the theory is that they rule out other plausible explanations for the observations. For example, if the predicted result only occurred for studies that were not double-blind, then the observations are likely due to demand effects and not due to coffee, which would be damning for your original theory. In short, we add these methodological characteristics in order to constrain alternative plausible explanations for our observations, which increases the class of observations that would be considered falsifying.

The same logic applies to “open science” practices such as pre-registration and the open sharing of data and stimuli. These methodological characteristics cannot turn a bad study into a good study, but these features make it easier for others to find errors in your data, errors in your choice of statistical analyses, weaknesses in your chosen stimuli, etc. In other words, these practices make it easier for you to be wrong because you have provided would-be critics with all of the information they need to root out errors in your claims. Open science practices say to the world “Prove me wrong. And to help you, I am going to try and make it as easy as possible for you to find a mistake that I made.”

All else being equal, studies whose claims are both consistent with a theory and whose methods have been maximally exposed to daylight are stronger than studies whose claims are merely consistent with a theory.


Multi-Site Collaborations Provide Robust Tests of Theories

According to Popper (1959) “We can say of a theory, provided it is falsifiable, that it rules out, or prohibits, not merely one occurrence, but always at least one event” (p. 70). I argue that, all else being equal, multi-site collaborations more robustly test theories than studies done at a single site at a single time by a single researcher because the data from a multi-site collaboration more robustly represent the theoretically falsifying event.

Let’s break down the key concepts of this argument.

What is a multi-site collaboration?

A multi-site collaboration is a study that involves a team of researchers at several locations who each test the same hypothesis. Often these collaborations use the same data collection procedures and the same stimuli. Their individual results are then pooled together, often in a meta-analysis, regardless of the results from any of the individual labs.*

Thus, the features necessary to test the hypotheses are the same across all labs. But there are inevitably some lab-to-lab differences in the specifics of the samples, the physical setting of the lab, the precise time the data are collected, etc.
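
To make the pooling step concrete, here is a minimal sketch of one common way to combine per-lab results: a fixed-effect, inverse-variance weighted average. The function name and the lab numbers below are hypothetical and chosen only for illustration; real multi-site projects may use other models (e.g., random-effects meta-analysis).

```python
import math

def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance weighted average of per-lab effect estimates.

    Every lab contributes, whatever its individual result looks like.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical results from four labs (effect estimates and standard errors).
lab_estimates = [0.10, 0.30, 0.20, 0.20]
lab_ses = [0.20, 0.20, 0.20, 0.20]

pooled, pooled_se = pool_fixed_effect(lab_estimates, lab_ses)
print(round(pooled, 3), round(pooled_se, 3))
```

With four equally precise labs, the pooled estimate is simply the average of the lab estimates, and its standard error is half that of any single lab.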

Good exemplars of multi-site collaborations are the ManyLabs projects (see here or here) or Registered Replication Reports (see here or here).

Occurrences vs. Events

The next key concept is the distinction between occurrences and events. In the first sentence I said that a scientific theory must forbid at least one event. Popper considered a specific instance of a researcher deducing a hypothesis, operationalizing the theoretically-necessary features, and making an observation to be an occurrence. Each occurrence includes the features of a study that are deduced from the theory. And each occurrence takes place in the presence of a unique and idiosyncratic combination of other factors such as the specific time and specific location of a study. An event, on the other hand, represents the class of all possible occurrences that are equally deducible from the theory (an event = {occurrence_1, occurrence_2, occurrence_3, …, occurrence_k}).

Thus, occurrences are confounded with the idiosyncratic combination of other factors at a specific time and specific location, whereas events transcend those factors. Events represent only what can be logically deduced from a theory; occurrences also contain the infinite other factors that are inevitably present when an event is instantiated. Thus, the more robustly we can create events, the more robustly we can test our theories.

An example

Suppose I have a theory that “listening to a song with violent lyrics increases the accessibility of aggressive cognitions”. This is a legitimate scientific theory because it allows you to deduce which events are consistent with the theory and which events are inconsistent with the theory. Namely, those who listen to songs with violent lyrics should have an increase in aggressive thoughts and should not have a similar level or a decrease in aggressive thoughts.

Suppose Researcher A conducts a study. This study will include the necessary features to test a hypothesis that was deduced from a theory. For example, Researcher A may hypothesize that listening to Johnny Cash’s Folsom Prison Blues (a song with violent lyrics) would cause participants to complete more word stems (e.g., KI _ _) with aggressive words (e.g., KILL) than non-aggressive words (e.g., KISS; a measure of the accessibility of aggressive cognitions). The results from this study would be an occurrence. Thus, in addition to the deduced theoretically-necessary features to test a hypothesis, this single occurrence is confounded with an idiosyncratic combination of theoretically-irrelevant factors. For example, the observations in this single study occur in the presence of participants’ interaction with the experimenter, what the 3rd participant ate for breakfast yesterday, the ambient temperature of the room, the position of the stars when the last participant completed the study, etc., etc., etc.

Now suppose Researcher B also conducts a study. This researcher also deduces the features that would be theoretically necessary to test the hypothesis. Suppose this researcher follows Researcher A’s approach and uses Johnny Cash’s Folsom Prison Blues as the song with violent lyrics and also uses the word-fragment completion task as the measure of aggressive thoughts. The results from this study also would be an occurrence. Thus, this study includes the features of the study that were deduced from a theory and occurs in the presence of an idiosyncratic combination of theoretically-irrelevant variables. Further, the idiosyncratic combination of theoretically-irrelevant variables is different for Researcher A and Researcher B. That is, the observations made by Researcher B will likely occur in the presence of different interactions with the experimenter, a different breakfast by the 3rd participant, a different ambient temperature of the room, a different position of the stars when the last participant completed the study, etc., etc., etc.

Because the combination of theoretically-irrelevant factors differs for each occurrence, the occurrence made by Researcher A will not be equivalent to the occurrence made by Researcher B in all possible ways. This non-equivalence is what people often refer to when they say “there is no such thing as an exact replication”: Two studies always differ in some aspects (such people often point to the inarguable presence of differences between occurrences and imply those occurrences do not belong to the same event class). However, and critically, each of the occurrences in this example is equally deducible from the theory. So each of these occurrences belongs to the same event class, which means they are equally useful for potentially falsifying the theory.

In fact, because a single occurrence is confounded by the combination of theoretically-relevant and theoretically-irrelevant factors that are present when a single observation is made, any individual occurrence is ambiguous: Was the observation due to the theoretically-necessary variables? Or was the observation due to a freaky alignment of other factors that will never be recreated?

With a single study at a single site, we can assume that an occurrence was due to the theoretically-necessary variables and we can assume that it was not due to a freaky alignment of other factors. It is up to individuals as to whether or not they want to accept those assumptions. To empirically test whether an event is consistent or inconsistent with a theory, we need observations from several occurrences. That is, we need several observations that maintain the deduced theoretically-necessary features, but differ in the theoretically-irrelevant features that confound each individual observation, in order to disentangle the former from the latter.

Putting it all together

Let’s go back to our example. The observation made by Researcher A is an occurrence. The observation made by Researcher B is an occurrence. Because these occurrences were equally deducible from the theory, these occurrences belong to the same event. It is necessary to observe several occurrences to disentangle the effects due to the theoretically-deduced factors from the theoretically-irrelevant factors.

Multi-site collaborations involve several researchers who each make observations across a range of occurrences. That is, multi-site collaborations involve observations being made across a range of idiosyncratic combinations of theoretically-irrelevant factors. Collectively, these individual occurrences better approximate the class of events that are used to test theories than any individual occurrence. Thus, all else being equal, multi-site collaborations provide more robust tests of our theories than a single study done at a single location at a single time.

I argue that Researcher A and Researcher B should agree on what study is logically deduced from their theory, each collect data following the same agreed-upon protocol (i.e., each make an occurrence within the same event class), combine their data into a common analysis regardless of how their individual data come out, and plan to grab a beer together at the next conference.**

For this, and for many other reasons, I hope that multi-site collaborations become commonplace in psychological science.

Where to begin? 

Do you want to get involved in some multi-site collaborations? Here are some places to begin.

StudySwap: an online platform where you can find other collaborators.
The Psychological Science Accelerator: a network of labs that have committed to devoting some of their research resources to multi-site collaborations.
Registered Replication Reports: multi-site collaborations of replications of previously-published research.

*I believe the lack of inclusion bias in the meta-analyses from multi-site collaborations is probably the greatest methodological strength of these studies. However, this post is focusing on a different benefit of multi-site collaborations.

**This last part is a crucial feature of a successful multi-site collaboration.

The meaningfulness of lab-based aggression research

I consider myself to be an aggression researcher. And I often use lab-based research methods. So I guess I consider myself a lab-based aggression researcher. One of the things I often worry about is whether I am producing meaningful, informative, credible research that helps us understand “real-world” aggression. Or am I merely playing a game where I can follow the right methods and say the right things and I can publish the results as a study on “aggression” regardless of the relevance of that research for “real-world” aggression? Am I adding lines to my CV, but not making an inch of theoretical progress?

Adding to my constant angst about this issue, I recently read a blog post by the statistician Andrew Gelman. The post was not a direct critique of lab-based aggression research per se, but the message was clear: Aspects of many lab-based aggression studies are so ludicrous that, from the outside looking in, some real studies appear as if they are an April Fool’s Day joke. Aggression researchers use Voodoo Dolls, games that involve noise blasts, etc. in our lab-based research. Yet we claim to be studying when and for whom “real” aggression (like punching, intimate partner violence, etc.) is likely to occur. While I disagree with some of the specific criticisms raised in the blog post, I cannot disagree there is a perception that some aspects of lab-based aggression research appear foolish.

My initial reaction to this post was “you just don’t get it.” If only this author knew why we do lab-based research or how hard it is to study aggression in a lab or the arguments we make for the validity of our methods, surely the author would not be so harsh. Merely brushing off these criticisms is lazy. Blaming the audience for not understanding is lazy. The real reason this blog post hit me like a gut punch was that some of these criticisms might just be accurate. Or worse, there is a possibility that our entire enterprise is complete bullshit. What if we have been studying how undergraduates play games in our labs and communicating this to the public as aggression research? What if we have deluded ourselves with arguments that serve to mollify our doubts about our research and continue with the status quo more than actually address shortcomings with our research? That would confirm my worst fears about lab-based aggression research.

I am not saying that I believe lab-based aggression research is bunk; I am saying that many people view lab-based aggression research as bunk.

I want to be proud of my area of research. I want the public to believe that my fellow lab-based aggression researchers are good stewards of the public’s trust and that we are diligently and objectively working towards tackling difficult and socially-important topics. In short, lab-based aggression researchers need to be MORE critical of our own work than outsiders. We shouldn’t have to wait for other people to hold a mirror up to our research to force us to be critical. We shouldn’t merely expect people to accept our arguments because we are “scientists.” We need to be faster and harsher critics of our own research than outsiders are. We must always be probing our methods and theories for weaknesses. We must be more skeptical of the validity of our methods than other researchers or the general public. We must always keep active the possibility that we are collectively deluding ourselves. We need to earn our seat at the table. This doubt should motivate us and inspire us to produce better research.

In that vein, here are 3 areas that I would like lab-based aggression researchers to focus on.

#1 Prioritize replicability
Taking inspiration from the Data Colada post by Joe Simmons, I want lab-based aggression researchers to prioritize replicability. Essentially, above all, if somebody publishes a finding, I would like to know that if I replicate the methods, I have a reasonable chance to observe the same empirical effect. To me, this takes priority over the validity of our methods and even some methodological shortcomings with the particular study.

Prioritizing replicability is extra important for research with undergraduates. If we opt for convenience in our samples over representativeness, then there is no reason not to make our lab-based research replicable.

Let’s use a hypothetical example. Suppose a researcher reports a study that randomly assigns participants to play Tetris or Mortal Kombat. Then the researcher has the participants play a game where they get to send annoying sound blasts back-and-forth to each other. Suppose this hypothetical study finds that participants who played Mortal Kombat sent louder sound blasts than participants who played Tetris, and this was taken as evidence that playing violent video games causes aggression.

There are many criticisms that can be leveled at this hypothetical study: Tetris and Mortal Kombat differ in a lot of ways other than their violent content, participants may intuit the study’s hypotheses and behave in a way that confirms the hypotheses (i.e., demand effects), the outcome variable may not actually be an aggressive behavior, etc. All of these are fine criticisms, but I believe they are all moot unless we can establish the basic empirical effect. Namely, if you and I both follow the same methods, do we both get similar results? If all competent researchers can reliably produce the effect, the other problems seem tractable.

Is this a problem in the extant literature? I would argue “yes.” We have now documented there is a huge amount of data analysis flexibility in our most commonly-used measure of aggression, there is asymmetry in the funnel plots of one of our most commonly-studied hypotheses, and pre-registered research has failed to detect theoretically-consistent effects. These are essentially arguments about our observed empirical effects, which pre-empts discussions about how to properly interpret these effects.

#2 Focus on validity
Within laboratory settings, we cannot allow participants to behave more than mildly aggressively. We cannot allow non-trivial amounts of aggression to occur. And we certainly cannot allow violence to occur. If we want to make claims about “real-world” aggression that is more impactful than what is allowable in the lab, we must infer (a) that the behaviors exhibited within our lab-based aggression paradigms are valid representations of behaviors that would occur in different settings outside of the lab and (b) that the amount of aggression displayed in our lab-based settings also “scales up” to more impactful levels of aggression. That is, the mild aggression inside of the lab “scales up” to non-trivial aggression outside of the lab.

Sadly, in my opinion, the validity evidence for many of our lab-based aggression methods is weak. Yet, validity information is extremely important for demonstrating the meaningfulness of our lab-based research. I am glad to see people like David Chester starting to do the hard-yet-necessary work of trying to validate some of our existing lab-based aggression paradigms. I don’t know how our methods will fare when we expose our existing methods to the scrutiny they deserve. It could turn out to be great, it could turn out to be crap. But I know that process will be super important for demonstrating the relevance of our research in the future.

#3 Be more modest
Until the replicability and validity of our methods are sound, we have to remain humble about our claims. I don’t mean that we merely have to make a nod to replicability in our Discussion sections and only have enough validity research to satisfy the editor and peer reviewers; I mean let’s strive to have replicable evidence that any competent researcher can produce and validity information that satisfies the most skeptical nay-sayer. I don’t know if we will ever achieve these goals, but that is what I believe we should strive for. Then, and only then, should we dip our toes into the waters of making claims about “real-world” phenomena.

I know it is interesting to study extreme aggression or intimate partner violence. I know those are the topics that I want to study. And those topics are probably more fundable than claiming to study “mildly aggressive” behaviors. But let’s be real, our current lab-based methods probably just aren’t there yet. This is my personal opinion, but I have serious reservations about whether, for example, choosing how long somebody has to hold their hand in ice water is a very good proxy for whether somebody would, for example, get into a fight. I am very willing to be proven wrong, but I must be proven wrong with replicable and valid evidence.

I know the arguments in favor of lab-based research: Within the well-controlled confines of a lab, we may have control over factors that we do not have outside of the lab, it may be easier to use some research designs (e.g., experiments) that allow for certain inferences (e.g., causal inferences), we can test our theories of aggression while placing participants at the minimal risk of harm, we can artificially create theoretically-relevant conditions that may not exist in the “real-world,” this research is an invaluable piece to the overall puzzle, etc. But knowing these arguments is not the same as really, really believing these arguments. Doing lab-based aggression research is severely limiting in a lot of ways too. Most prominently, we cannot actually allow more than trivial amounts of aggression to occur in a lab. Let’s not pretend like these very real limitations do not exist.

Currently, I see a lot of lab-based aggression researchers who want it both ways. We want to study the “real-world” aggression that is most interesting and most socially-relevant in a lab and we want to sidestep all of the thorny issues that come from studying an extremely complex phenomenon in an artificial laboratory environment. I don’t want to sidestep these issues though. I would like our field to tackle them head-on. I personally and selfishly want better research to justify my belief in these arguments. I want to be able to say with a straight face that the scientific community has done our due diligence on these methods before claiming we have demonstrated a real-world phenomenon. Until then, I would like us to be much more humble than we currently are.

In summary…
Lab-based aggression research is my community. And I am seriously concerned that if we do not change our ways, we will continue to be seen as foolish by some people. We will be seen as a community that continues to fill our journals with each other’s research, but ultimately does not contribute credible and informative research about the phenomena that we care about. This is not the future I want. Rather than treating these as the criticisms of uninformed outsiders, I want to take those criticisms to heart. If somebody believes our area of research is bullshit, I want to be able to say that we have sincerely considered that possibility. Let’s roll up our sleeves and get to work.

The benefits of "crowdsourced" research

What is crowdsourced research?  
Briefly, “crowdsourced” research involves several individual researchers who coordinate their resources to accomplish goals that would be difficult to achieve individually. Although there are several different ways in which researchers can work collaboratively, the current blog is focusing on projects where several different researchers each collect data that will be pooled together into a common analysis (e.g., the “Many Labs” projects, Ebersole et al., 2016; Klein et al., 2014; Registered Replication Reports [RRR], Cheung et al., 2016; Wagenmakers et al., 2016; “The Pipeline Project,” Schweinsberg et al., 2016).
Below I try to convince you that crowdsourcing is a useful methodological tool for psychological science and ways that you could choose to get involved. 
Eight benefits of crowdsourced research
First, crowdsourced research can help achieve greater statistical power. A major limiting factor for individual researchers is the available sample of participants for a particular study. Commonly, individual researchers do not have access to a large enough pool of participants, or enough resources (e.g., participant compensation) to gain access to a large enough pool of participants, to complete a properly powered study. Or researchers must collect data for a long period of time to obtain their target sample size. Because crowdsourced research projects involve the aggregation of results from many labs, a major benefit is that such projects have resulted in larger sample sizes and more precise effect size estimates than any of the individual labs that contribute to the project. 
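
As a rough illustration of the power argument, the sketch below approximates the power of a simple two-group comparison using a normal approximation. The effect size (d = 0.2), the 50 participants per group, and the 20 labs are hypothetical numbers, and the calculation treats the pooled data as one large sample, ignoring any lab-to-lab variation in the effect.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Uses a normal approximation and ignores the (negligible) chance of
    rejecting in the wrong direction.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    noncentrality = d * (n_per_group / 2) ** 0.5
    return NormalDist().cdf(noncentrality - z_crit)

# Hypothetical scenario: one lab with 50 participants per group,
# versus 20 such labs pooling their data.
single_lab = approx_power(d=0.2, n_per_group=50)
twenty_labs = approx_power(d=0.2, n_per_group=50 * 20)
print(round(single_lab, 2), round(twenty_labs, 2))
```

Under these assumptions a single lab would detect the effect less than one time in five, whereas the pooled project would detect it almost every time.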
Second, crowdsourced research provides information about the robustness of an effect to minor variations in context. Conclusions from any individual instantiation of an effect (e.g., an effect demonstrated in a single study within a single sample at a single point in time) are inevitably overgeneralized when summarized (e.g., Greenwald, Pratkanis, Leippe, & Baumgardner, 1986). That is, any individual study occurs within an idiosyncratic combination of an indefinite number of contextual variables, most of which are theoretically irrelevant to the effect (e.g., time of day, slight moment-to-moment variations in the temperature of the room, the color of socks the researcher is wearing, what the seventh participant ate for breakfast the Saturday prior to their study appointment, etc.). Thus, a summary of an effect “overgeneralizes” to contexts beyond what was actually present in the study that is being summarized. And it is only when an effect is tested across several levels and combinations of these myriad contextual variables that information can be strongly inferred about the theoretically invariant characteristics of the effect (e.g., the effect is observed across a range of researcher sock colors; thus, the observation of an effect is unlikely to depend on any specific color of socks).
A benefit of crowdsourced research is the results inherently provide information about whether the effect is detectable across several slightly-different permutations and combinations of contextual variables. Consequently, crowdsourced research allows for stronger inferences to be made about the effect across a range of contexts. Notably, even if a crowdsourced research project “merely” uses samples of undergraduate students in artificial laboratory settings, the overall results of the project would still test whether the effect can be obtained across contexts that slightly vary from sample-to-sample and laboratory-to-laboratory. Although this hypothetical project may not exhaustively test the effect across a wide range of samples and conditions, the results from the overall crowdsourced research project will test the robustness of the effect more than the results from any individual sample within the project.
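
One conventional way to quantify how much an effect varies from lab to lab is Cochran's Q and the I² statistic derived from it. The sketch below is a minimal illustration with hypothetical lab results; the function name is made up for this example, and a real project would likely rely on a dedicated meta-analysis package.

```python
def heterogeneity(estimates, standard_errors):
    """Return Cochran's Q and I-squared for a set of per-lab estimates."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    # I-squared: share of the variation in estimates beyond what sampling
    # error alone would be expected to produce (floored at zero).
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, i_squared

ses = [0.10, 0.10, 0.10, 0.10]
# Hypothetical labs that agree closely with one another.
q_consistent, i2_consistent = heterogeneity([0.18, 0.22, 0.20, 0.20], ses)
# Hypothetical labs whose results depend heavily on the local context.
q_varied, i2_varied = heterogeneity([-0.30, 0.70, 0.20, 0.20], ses)
print(round(i2_consistent, 2), round(i2_varied, 2))
```

A low I² across labs is consistent with an effect that is robust to the contextual differences between those labs; a high I² suggests that something about the context matters.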
Third, because the goal of most crowdsourced research is the aggregation or synthesis of the results from several different labs that have agreed to combine their results a priori, another benefit is there is unlikely to be inclusion bias (which would be comparable to publication bias among published studies) within the studies that contribute to the project. Consequently, the overall results from crowdsourced research projects are less likely to suffer from inclusion bias than any comparable synthesis of already-completed research such as a meta-analysis. Rather, crowdsourced research projects involve several studies that provide estimates that vary around a population effect and are unlikely to systematically include or exclude studies based on those studies’ results. The lack of inclusion bias is because individual contributors to a crowdsourced research project do not need to achieve a particular type of result to be included in the overall analysis. Rather, because the overall project hinges on several contributors each successfully executing methods that are comparable, individual contributors have a motivation to adhere to the agreed-upon methods as closely as possible.
Fourth, and related to the points made in the previous paragraph, because the crowdsourced research project involves the coordination of several labs, it is unlikely there would be post-hoc concealment of methodological details, switching of the planned analyses, filedrawering “failed” studies, optional stopping of data collection, etc. without several other contributors knowing about it. This distribution of knowledge likely makes the final project more transparent and documented than a comparable set of non-crowdsourced studies. In other words, it would literally take a conspiracy to alter the methods or to systematically exclude results from contributing labs of a crowdsourced research project. Consequently, because crowdsourced research projects inherently involve the distribution of knowledge across several individuals, it is reasonable for readers to assume that such projects have strongly adhered to a priori methods. 
Fifth, comparisons of the results from the contributing labs can provide information (but not all information) about how consistently each lab executed the methods. Although cross-lab consistency of results is not inherently an indicator of methodological fidelity, any individual lab that found atypical results (e.g., surprisingly strong or surprisingly weak effects), for whatever reason, would be easily noticeable when compared to the other labs in the crowdsourced research project and should be examined more closely.
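
A simple way to spot such a lab is to compare each lab's estimate with the pooled estimate from all of the other labs. The sketch below is one hypothetical way to do this; the function name, the threshold of 2.5, and the lab results are assumptions made for illustration, not an established procedure from any particular project.

```python
import math

def flag_atypical_labs(estimates, standard_errors, threshold=2.5):
    """Return indices of labs that differ markedly from the other labs.

    Each lab is compared with the inverse-variance pooled estimate of the
    remaining labs (a leave-one-out comparison).
    """
    flagged = []
    labs = list(zip(estimates, standard_errors))
    for i, (y_i, se_i) in enumerate(labs):
        others = [lab for j, lab in enumerate(labs) if j != i]
        weights = [1.0 / se ** 2 for _, se in others]
        pooled = sum(w * y for w, (y, _) in zip(weights, others)) / sum(weights)
        pooled_var = 1.0 / sum(weights)
        z = (y_i - pooled) / math.sqrt(se_i ** 2 + pooled_var)
        if abs(z) > threshold:
            flagged.append(i)
    return flagged

# Hypothetical results from five labs; the last lab is surprisingly strong.
lab_estimates = [0.20, 0.25, 0.15, 0.20, 0.90]
lab_ses = [0.10, 0.10, 0.10, 0.10, 0.10]
flagged = flag_atypical_labs(lab_estimates, lab_ses)
print(flagged)
```

A flag is only a prompt to look more closely at that lab; as noted above, an atypical result is not by itself evidence of poor methodological fidelity.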
Sixth, the nature of crowdsourced research projects means that the methods are already in a form that is transferable to other labs. For example, there would already be a survey that has been demonstrated to be understood by participants from several different labs, there would already be methods that are not designed to be dependent on an idiosyncratic physical feature of one lab, there would be an experiment that has been demonstrated to work on several different computers or is housed online where the study is accessible for anybody with internet access, etc. The transferability of methods does not inherently make methods more appropriate for testing a hypothesis, but it does make it easier for other researchers who were not contributors to the original crowdsourced study to replicate the methods in a future study.
Seventh, although there have been calls to minimize the barriers to publishing research (and thus reducing the negative impact of file drawers; e.g., Nosek & Bar-Anan, 2012), some have opined that research psychologists should be leery of the resulting information overload and strive to publish fewer-but-better papers instead (e.g., Nelson, Simmons, & Simonsohn, 2012). Crowdsourced research seems to address both the file drawer problem and the concern about information overload. Take an RRR as an example. Imagine if each individual contributor in the RRR conducted a close replication of a previously-published effect and tried to publish their effect independently of one another. Also imagine that another researcher gathers these studies and publishes a meta-analysis that describes the synthesis of each of those studies. I do not believe that each manuscript would necessarily be publishable on its own. And, even in the unlikely event that several manuscripts each describing one close replication were published, there would be a significant degree of overlap between each of those articles (e.g., the Introduction sections presumably would largely cover the same literature, which would be tiresome for readers). Thus, several publications each describing one close replication of an effect are inefficient for journals that would not want to tax editors and reviewers with several articles with a significant amount of overlap, for the researchers who do not need to write manuscripts that are largely redundant with one another (plus, each manuscript is less publishable as a stand-alone description of one replication attempt), and for readers who should not have to slog through several redundant publications. Crowdsourced research projects provide one highly-informative presentation of the results, readers only need to find and read one manuscript, and editors and reviewers only need to evaluate one manuscript.
Also, because the one crowdsourced manuscript would include all of the authors, there is no loss in the number of authors who get a publication. The result is fewer, but better, publications.*
Finally, researchers at institutions with modest resources can contribute their resources to high-quality research. Thus, crowdsourced research can be more democratic than traditional research. There are hundreds of researchers who have access to resources (e.g., time, participants, etc.) that may be insufficient individually, but could be incredibly powerful collectively. Or there may be researchers who mentor students who need to accomplish a project in a rigid period of time (e.g., a semester or an academic year) who need projects where the hypotheses and materials are “ready to go.” Crowdsourced research projects ensure that scientific contributions do not only come from researchers who have enough resources to be self-sufficient.
Three ways to get involved
First, stay up-to-date on upcoming opportunities. Check out StudySwap (https://osf.io/view/studyswap/), which is an online platform to facilitate crowdsourced research. Follow StudySwap on Twitter (@Study_Swap) and like StudySwap on Facebook (https://www.facebook.com/StudySwapResearchExchange/). Also follow the RRR group (https://www.psychologicalscience.org/publications/replication) and Psi Chi’s NICE project (https://osf.io/juupx/) to hear about upcoming projects for you and your students. Crowdsourced research projects only work when there are lots of potential contributors who are aware of opportunities.
Second, Chris Chartier and I are excited to announce an upcoming Nexus (i.e., special issue) in Collabra:Psychology on crowdsourced research. Although the official announcement will be coming in the near future, we are starting to identify individuals who may be interested in leading a project. This Nexus will involve a Registered Reports format of crowdsourced research projects we colloquially call Collections^2 (pronounced merely as “collections,” but visually denoted as a type of crowdsourced research by the capital C and the exponent). Collections^2 are projects that involve collections, or groups of researchers, who each collect data that will be pooled together into common analyses (get it? data collection done by a collection of researchers = a Collection^2) and is the same as the crowdsourced projects that were discussed above.**
Collections^2 that would qualify for inclusion in the Nexus can be used to answer all sorts of research questions. Here is a non-exhaustive list of the types of Collections^2 that are possible:
  1. Concurrent operational replication Collections^2: Several researchers simultaneously conduct operational replications of a previously-published effect or of a novel (i.e., not previously-published) effect. These projects can test one effect (such as some of the previous RRRs) or can test several effects within the data collection process (such as the ManyLabs projects). 
  2. Concurrent conceptual replication Collections^2: Projects where there is a common hypothesis that will be simultaneously tested at several different sites, but there are several different operationalizations of how the effect will be tested. The to-be-tested effect can either be previously-published or not. These projects would test the conceptual replicability of an effect and whether the effect generalizes across different operationalizations of the key variables. 
  3. Construct-themed Collections^2: Projects where researchers are interested in a common construct (e.g., trait aggression) and several researchers collect data on several outcomes associated with the target construct. This option is ideal for collections of researchers with a loosely common interest (e.g., several researchers who each have an interest in trait aggression, but who each have hypotheses that are specific to their individual research).
  4. Population-themed Collections^2: Projects where contributing researchers have a common interest in the population from which participants will be sampled (e.g., vegans, atheists, left-handers, etc.). This sort of collaboration would be ideal for researchers who study hard-to-recruit populations and want to make the most of participants’ time.
  5. And several other projects that broadly fall under the umbrella of crowdsourced research. (There are lots of smart people out there, and we are excited to see what people come up with.)

This Nexus will use a Registered Reports format. If you are interested in leading a Collection^2 or just want to bounce an idea off of somebody, then feel free to contact Chris or me to discuss the project. At some point in the near future, there will be an official call to submit Collections^2 proposals and lead authors can submit their ideas (they do not need to have all of the contributing labs identified at the point of the proposal). We believe the Registered Reports format is especially good for these Collection^2 proposals. Collections^2 involve a lot of resources, so we want to avoid any foreseeable mistakes prior to the investment of these resources. And we believe that having an In-Principle Acceptance is critical for the proposing authors to effectively recruit contributing labs to join a Collection^2.
If you are interested in being the lead author on a Collection^2 for the Collabra:Psychology Nexus you can contact Chris or me. Or keep an eye out for the official call for proposals coming soon. 
Third, if you do not want to lead a project, consider being a contributing lab to a Collection^2 for the Collabra:Psychology Nexus on crowdsourced research. Remember, these Collections^2 will have an In-Principle Acceptance, so studies that are successfully executed will be published. Being a contributor would be ideal for projects that are on a strict timeline (e.g., an honors thesis, first-year graduate student projects, etc.). Keep an eye out for announcements and help pass the word along.

*There is the issue of author order where fewer authors get to be first authors. However, when there are several authors on a manuscript, the emphasis is rightly placed on the effect rather than the individual(s) who produced the effect.

**The general idea of Collections^2 has been referred to as “crowdsourced research projects,” as we did above, or elsewhere as “concurrent replications” (https://rolfzwaan.blogspot.com/2017/05/concurrent-replication.html). We like the term Collections^2 because “crowdsourced research projects” are a more general class of research that does not necessarily require multi-site data collection efforts. We also believe the name “concurrent replications” may imply this is a method that is only used in replication attempts of previously-published effects. Also, the name “concurrent replication” may imply that all researchers use the same variable operationalizations across sites. Although concurrent replications can be several operational replications of a previously-published effect, they are not inherently operational replications of previously-published effects. Thus, we believe that Collections^2 are more specific than “crowdsourced research projects” and more flexible than what may be implied by the name “concurrent replication.”

3 useful habits for your research workflow

I chronically tinker with my research workflow. I try to find better ways to brainstorm, organize my schedule, manage my time, manage my files (e.g., datafiles, R code, manuscripts, etc.), read and synthesize research articles, etc. In some ways, I am always in a state of self-experimentation: I find an idea, make a change, and then reflect on whether that change was helpful. Some of these changes have “stuck” and become part of my research workflow. 
Recently I have been reflecting on which of my research workflow habits have proven useful and stuck with me over the (relatively) long haul. Here are my current top 3.

Habit #1: Making 1 substantive change per day on an active writing project

Researchers are writers and writing takes time. However, academic writing is a marathon, not a sprint, so academic writing takes a lot of time. It is not uncommon for some of my writing projects to be stretched out over the course of months and sometimes years. I don’t know if this makes me a slow writer, but this is the pace at which I can write good academic prose. If I were less diligent, this timeline could be stretched out even further.

One habit that keeps me on track is to have an active writing project like a manuscript or a grant and commit to making one substantive change each day until the project is completed. Just one change. Even if you only have 5 minutes on a given day, that is sufficient time to open up your writing project, start reading, and make one substantive change. This could be making a sentence more concise, finding ways to smooth a transition between two related ideas, or replacing an imprecise adjective with a more appropriate one. Typically, when I make my one change for the day I end up writing for a longer period of time. The whole point of this habit is that “one change is more than none change.”

Committing to one change per day is helpful because it keeps the project moving forward. It is a horrible feeling when you want to get a manuscript out the door and it has sat idle for 2 months. Where did the time go? Then you think about how much collective time you spent on Twitter and you wish you could have all of that time back in one big chunk. Sigh!

Habit #2: Learn to juggle

There is a saying that goes “being a good juggler is to be a good thrower.” As a researcher, I am always handling several projects that are happening in parallel. Each of these projects requires a sequence of actions. Every now and then (like once a week), you need to assess your active projects and think about the current statuses and trajectories of each of these projects. Which balls are suspended in the air? Which balls are falling and require your immediate attention? Which balls can be thrown back up into the air? Are there any balls you can get rid of?

For example, preparing an IRB application requires you to accomplish a few activities (e.g., write the application, gather the stimuli, etc.), but once the IRB application is submitted you are merely waiting for approval; there is nothing that you can actively do with the application after it is submitted. Suppose you are at the beginning stages of a project and you need to do two activities: (a) write an IRB application and (b) program a study. It may make more sense to write the IRB application first and then, while the IRB application is being reviewed, take the time to program the study rather than vice versa. While you are programming the study, the IRB application review is happening in parallel. This is an example of “throwing” the IRB application ball so you can focus on the study programming ball.

This example seems obvious, but the juggling gets more complex as you get more balls in the air. Regularly assess all of your active projects and identify your next throw. Over time you begin to identify which throws are good throws. For me, good throws are either submissions (IRB applications, manuscript submissions, grant submissions, etc.) or getting feedback to co-authors because those projects can move forward at the same time I am focusing on doing other activities. For example, if there is a manuscript that is 95% complete, I focus my energies on the last 5%. Once the manuscript is submitted I can turn my attention to other things while that ball is suspended in air (i.e., the manuscript is being peer-reviewed). The habit that I have developed is to treat finishing nearly-completed manuscripts and providing feedback to co-authors as priorities.

The key to making this habit work is to take the time and strategically choose your next throw. There is a big difference between the rhythm, cadence, and zen of a juggler and the chaos, stress, and frustration of whack-a-mole.

Habit #3: Clear the clutter

At the beginning of this year I wanted to make a small change to reduce the number of emails I receive. I used to get a lot of mass emails from places such as Twitter notifications, TurboTax, the American Legion, Honeywell thermostats (seriously!), etc. I never read these emails. Never! Now, whenever I get an automated email that I know I will never read, I go to the bottom of the email and find the “unsubscribe” link in the fine print. I take the 5 seconds to unsubscribe because I know that the 5 seconds I spend now will be repaid with minutes of my future time. I probably get 50% fewer emails now. Merely unsubscribing from mass emails has given me enough free time to make my one substantive change per day (Habit #1 above).

Here’s how you can immediately incorporate these habits into your research workflow. First, assess your current projects and identify if there are any “good throws” you can make. Is there a manuscript that if you really, really focus on, you could get submitted in the next week? Is there a draft of a manuscript you could get returned to a co-author if you spent the afternoon in focused writing? Commit to executing one good throw. Second, identify a writing project that you will commit to writing on every single day. This can be your “good throw” project from the first step or something else altogether. Try to write on this project every day for a week. What do you have to lose? My prediction is that you will notice the progress and you won’t want to stop making your daily substantive change. Finally, commit to unsubscribing from mass/junk emails as they come into your inbox. Just do it. You will notice a steady decrease in the amount of clutter in your inbox (and fewer distractions) as time goes on.

Good luck and have a productive day.

Academic Craftsmanship

Let me share three short stories.

Story 1: Steve Jobs was obsessed with the design of his products. When designing the first Macintosh, Jobs was adamant about the circuit boards being neat and orderly. The circuit boards! The innards of the computer! My guess is that 99% of users never looked inside the computer, and surely several of the 1% who did look inside never noticed the care and skill that went into making the circuit board look nice. Sure, it may have looked like an orderly circuit board, and it may seem like a waste of resources because making the circuit board orderly does not inherently improve the performance of the computer. But this concern for excellence and quality, carried throughout all of the product, inside and out, and not just the part of the product that most users see, was essential to what made the Mac the Mac.

Story 2: My nephew loves Legos. At a recent family function, I vividly remember him sitting on the floor methodically assembling his Lego model. His focus was intense. He was in a state of flow. He couldn’t care less about whether anybody was watching him work; he was on a mission to create something awesome. He’d look at the schematic, find the next piece, and put the piece in the right spot. Snap! Repeat! After the last step, looking at what he assembled with his own two hands, he felt like Michelangelo unveiling the David. He loves building his Legos because the more he does it, the better he gets.
Story 3: Some graduate student is in a lab somewhere right now tinkering with ggplots on her laptop. She tries out different shapes in her scatter plot. Now different colors. Is the font too big? Too small? Should I use theme_minimal() or theme_bw()? What location of the legend makes it easiest for a reader to intuit the essential information from the figure? After hours of tinkering, honing, polishing, she creates a figure that is just right. When she presents that figure, she glances at the audience’s reaction to her masterpiece.  

What do these three stories have in common? Craftsmanship.

Today I want to give a nod to the often overlooked academic craftsmanship that I see in my colleagues’ work. You know, the little things that researchers do in the process of creating their research products that give them pride. The little things that make a merely publishable manuscript into scientific poetry, an adequate figure into a piece of art, and an ordinary lecture into the academic version of the Beatles’ Sgt. Pepper’s Lonely Hearts Club Band.

Let me first stake a flag in the ground before the rabble gets aroused. When I say academic craftsmanship, I do not mean “flair.” Even the craftiest craftsman who ever crafted a craft is incapable of consistently producing significant results with N = 20. Also, when I say academic craftsmanship, I do not mean having a knack for being able to “tell a good story” to an editor and three anonymous reviewers (although that does seem to be a skill that some people have developed). Craftsmanship cannot compensate for vague hypotheses or poor inferences. When I say academic craftsmanship, I simply mean the details that take care, patience, and skill and that evoke a sense of pride and satisfaction.

Here is one of my favorite examples of academic craftsmanship.

Check out the correlation graph between the original effect size and the replication effect size for the Reproducibility Project: Psychology (http://shinyapps.org/apps/RGraphCompendium/index.php#reproducibility-project-the-correlation-graph). First off, the overall figure is packed with information: there is the scatterplot, the reference line for a replication effect size of zero and a reference line for a slope of 1 (i.e., original effect size = replication effect size), the density plots on the upper and right borders of the scatterplot, rug marks for individual points, point sizes that correspond to replication power, point colors that correspond to the p-values, etc. Yet overall the figure amazingly does not seem cluttered. The essential information is intuitive and easily consumable. There are details such as the color of the points matching the color of the density plots, which in turn matches the color of the rug ticks. Matching colors seems like the obvious choice, yet somebody had to intentionally make these decisions. You can breathe in the overall pattern of results without much effort. Informative, clean-looking, intuitive. This is a hard combination to execute successfully.

After seeing this figure, most people probably think “big deal, how else would you make this figure?” Believe me, I once spent 90 minutes at an SPSP poster session shaking my head at a horrible figure! It was ugly. It was not intuitive. It was my poster.

Now let’s look under the hood. Open up the R-code that accompanies this figure. Notice how there is annotation throughout the code; not too much, but just enough. Notice the subtleties in the code such as the use of white space between lines to avoid looking cluttered. Notice how major sections of the code are marked like this:
# EFFECT SIZE DENSITY PLOTS ————————————————————-
The series of hashes and the use of CAPS is effective in visually marking this major section. Does this level of care make the R-code run better? Not one bit. However, it is extremely helpful to the reader. This clean R-code is akin to the orderly circuit board in the Mac.

This is just one example. But I see craftsmanship all over the place. A clever metaphor, a nicely worded results section, the satisfaction of listening to the cadence of a well-rehearsed lecture, etc. Perhaps I will share more of these examples in the future. For now I only have one request. If this post is discussed on social media, I would like people to share their favorite examples of academic craftsmanship. 

All aggression is instrumental

Aggression is commonly defined as a behavior done with the intent to harm another individual who is motivated to avoid receiving the behavior. Some researchers go further and try to classify aggression as being either “reactive aggression” or “instrumental aggression.” I do not believe this distinction is useful.

Briefly, reactive aggression is supposedly an impulsive aggressive behavior in response to a provocation or instigation and is typically accompanied by feelings of anger or hostility. The supposed goal of reactive aggression is merely to “cause harm” to the recipient of the behavior. Think of snapping at another person in the heat-of-the-moment. Instrumental aggression is supposedly an aggressive behavior that is enacted to achieve a particular goal.  Think of a bank robber who shoots the guard while trying to make a getaway. 
Several researchers have pointed out that this distinction is difficult, if not impossible, to make (e.g., Bushman & Anderson, 2002; Tedeschi & Quigley, 1999). I agree. With a little thought, one can see that “snapping” at another person can be used to achieve several goals such as redressing a perceived slight to one’s reputation or exerting social control. Thus, the above example of reactive aggression also can be construed as instrumental. Similarly, one also can see that shooting the guard probably was in response to some feature of the situation such as the perception that the guard was impeding the goal of successfully executing the robbery. Thus, the above example of instrumental aggression can be construed as being in response to something and, thus, reactive.

Wait! Am I saying that snapping at another person is the same as a bank robber shooting the guard? No. These are very different behaviors, but the distinction is not that one is “reactive” and one is “instrumental.”

The argument that the reactive-instrumental distinction is a false distinction is fairly simple. Aggression is, by definition, a behavior that was done intentionally (i.e., non-accidentally). Intentional behaviors are used to achieve social motives. Thus, aggression is one specific type of intentional behavior that is used to achieve social motives. What are some examples of social motives that can be achieved with aggressive behaviors? Protecting oneself, acquiring resources, restoring one’s reputation, enforcing a violated social norm, etc.

Further, the belief that aggression can be done “to cause harm” is logically incorrect. Because the definition of aggression requires the aggressive behaviors to have been done with intent and with the belief the recipient wants to avoid the behavior, some believe this definition implies that “causing harm” can be the end goal of the behavior rather than merely a means to achieving some other ends. Therefore, “causing harm” can seemingly be the goal behind reactive aggression. Although this is a common belief, it conflates the definitional criteria of aggression with the motive for why an individual would use an aggressive behavior. This is an easy conflation to make because “to cause harm” seems like a reasonable and satisfactory answer to the question “why did this person behave aggressively?” However, it only seems like a satisfactory answer; it is not one. One cannot explain the causes of a phenomenon (aggression) merely by referring to a necessary component of the phenomenon (an intentionally-caused harmful behavior): A person who behaves aggressively did so with the intent to harm the recipient by definition.

I sincerely hope that we can move beyond the reactive-instrumental dichotomy because I do not believe it is a scientifically useful distinction. Aggression is one behavior in our repertoire of behaviors we use to navigate our complex social environments. All aggression is instrumental.

"Lab-based measures of aggression" are to "real aggression" what college students are to all humans

Aggression is a common feature of social interactions.  Therefore, it is important for social scientists to develop a well-rounded understanding of this phenomenon.  One valuable approach to understanding aggression is laboratory-based research, which requires researchers to have usable and valid methods for measuring aggression in laboratory settings.  However, behaviors that are clearly aggressive, such as one person forcefully striking another person with a weapon, are fraught with ethical and safety considerations for both participants and researchers.  Such behaviors are, therefore, not a viable option for displayed aggression within lab-based research.  For these reasons, aggression researchers have developed a repertoire of tasks that purportedly measure aggression, are believed to be safe for participants and researchers, and are ethically-palatable.  I collectively refer to these tasks as “lab-based aggression paradigms.”  The major concern herein is whether the behaviors exhibited within lab-based measures of aggression are representative of “real” aggression. 

A common definition of aggression is “a behavior done with the intent to harm an individual who is motivated to avoid receiving that behavior” (Baron & Richardson, 1994, p. 7). If one adheres to this definition, a behavior is considered aggressive when both (a) a harmful behavior has occurred and (b) the behavior was done (i) with intent to harm the target and (ii) with the belief the target wanted to avoid receiving the behavior. A strength of this definition is the clear demarcation between harmful behaviors that are not aggressive (e.g., a dentist who causes pain in the process of pulling the tooth of a patient; inflicting consensual pain for sexual pleasure, etc.) and harmful behaviors that are aggressive (e.g., punching another person out of anger; yelling at another person and causing a fear response, etc.).
As hinted at above, the degree of “harm” that is permissible within lab-based settings is very mild. In fact, the lower bound of harmfulness at which behaviors become unambiguously aggressive is likely the upper bound of harmfulness that is permissible within laboratory settings.
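
The definitional criteria above amount to a logical conjunction: every criterion is necessary, and a behavior counts as aggressive only when all of them hold. Here is a minimal sketch of that logic in Python. The class name `Behavior`, its field names, and the function `is_aggressive` are hypothetical labels chosen for illustration; they are not part of Baron and Richardson's definition or of any existing library, and the way the two example behaviors are coded is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Behavior:
    """Hypothetical record of one behavior; field names are illustrative only."""
    harmful: bool                 # (a) a harmful behavior occurred
    intended_harm: bool           # (b)(i) the actor intended to harm the target
    believed_target_avoids: bool  # (b)(ii) the actor believed the target wanted to avoid it


def is_aggressive(b: Behavior) -> bool:
    """Each criterion is necessary: the behavior is aggressive only if all hold."""
    return b.harmful and b.intended_harm and b.believed_target_avoids


# Two examples from the text, coded under the assumption that the dentist
# neither intends harm nor believes the patient wants to avoid the procedure.
dentist = Behavior(harmful=True, intended_harm=False, believed_target_avoids=False)
punch = Behavior(harmful=True, intended_harm=True, believed_target_avoids=True)

print(is_aggressive(dentist))  # False
print(is_aggressive(punch))    # True
```

Note that the function returns only yes or no. It says nothing about how aggressive a behavior is, which is a separate question from whether the behavior meets the definition.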

Extending Baron and Richardson’s (1994) definition, Parrott and Giancola (2007) proposed a taxonomy of how such aggressive behaviors may manifest. Within their taxonomy, aggressive behaviors vary along the orthogonal dimensions of direct versus indirect expressions and active versus passive expressions. For example, a physical fight would be considered a direct and active form of physical aggression whereas not correcting knowingly-false gossip would be considered an indirect and passive form of verbal aggression (to the extent the individual believes their inaction will indirectly harm a target individual). Because Parrott and Giancola strongly adhere to the definition of aggression proposed by Baron and Richardson, each of these forms of aggression is still required to meet the criteria described in the previous paragraph. The purported usefulness of this taxonomy is that factors that incite one form of aggression may not incite other forms of aggression. Thus, Parrott and Giancola assert that using their taxonomy to classify the different behavioral manifestations of aggression, and which antecedents cause those different behavioral manifestations, will lead to a nuanced understanding of the causes and forms of aggression.

The first dimension of Parrott and Giancola’s (2007) taxonomy is the direct versus indirect nature of the aggressive behavior. In describing the distinction between direct and indirect aggression, Parrott and Giancola state that direct aggression involves “face-to-face interactions in which the perpetrator is easily identifiable by the victim. In contrast, indirect aggression is delivered more circuitously, and the perpetrator is able to remain unidentified and thereby avoid accusation, direct confrontation, and/or counterattack from the target” (p. 287). However, several lab-based aggression paradigms seemingly have features of both direct and indirect forms of aggression. Many of these paradigms involve contrived interactions where participants communicate with a generic “other participant,” for example, via computer or by evaluating one another’s essays. These contrived interactions are not really face-to-face and they are not really anonymous. So the behaviors within lab-based aggression paradigms are not cleanly classified as being either direct or indirect within Parrott and Giancola’s taxonomy.

Similarly, participants’ behaviors exhibited within lab-based aggression paradigms are often not “directly” transmitted to the recipient of those behaviors. For example, participants do not make physical contact with their interaction partner at any point within these paradigms. The consequences of participants’ behaviors are often transmitted to the recipient via the ostensible features of the study in which they are participating. For example, in one lab-based aggression paradigm, participants’ harmful behavior is selecting how long another participant must submerge their hand in ice water (Pedersen, Vasquez, Bartholow, Grosvenor, & Truong, 2014). Therefore, participants must believe (a) they can harm the recipient by varying how long they tell the experimenter to have the recipient hold their hand in ice water, (b) that a longer period of time causes more harm, and (c) the experimenter will successfully execute the harmful behavior at a later point in time.

Collectively then, it is questionable whether behaviors within lab-based aggression paradigms can be considered “direct.” Nevertheless, it is clear that these behaviors do not include face-to-face aggression or physical aggression that includes direct physical contact. And the timing of many of the behaviors within lab-based aggression paradigms is asynchronous with the (ostensible) delivery of harm to the recipient.

The second dimension of Parrott and Giancola’s (2007) taxonomy is the active versus passive nature of the behavior. Active aggression involves an individual actively engaging in a behavior that harms the recipient. In contrast, passive aggression is characterized by an individual’s lack of action that is believed to cause harm to the recipient. All of the major lab-based aggression paradigms involve behaviors that are considered active.

In summary, within lab-based aggression paradigms, the harmfulness of the behaviors is on the extreme low end of the range of possible harmfulness, participants may believe their behaviors will only cause mild amounts of harm, participants may believe the recipient may only be mildly motivated to avoid the behaviors, and the form of participants’ behaviors may only cover a limited amount of the conceptual space of possible forms of aggression. Collectively, the behaviors exhibited in lab-based aggression paradigms seem to be limited and unrepresentative of the multi-faceted nature of aggression.

Is this potential un-representativeness a problem? On the one hand, the relationship between the behaviors within lab-based measures of aggression and “real” aggressive behaviors is like the relationship between a convenience sample of college students and “all humans.” The former is not a representative sample of the latter; therefore, generalizations from the former to the latter are potentially biased. On the other hand, to the extent that the behaviors exhibited within lab-based aggression paradigms are valid instances of very mild and specific forms of aggression, lab-based research has a valuable place within a robust science of aggressive behaviors.