I was recently involved with a Registered Replication Report (RRR) of Srull & Wyer (1979). In this RRR, several independent labs collected data to test the “hostile priming effect”: An effect where exposing participants to stimuli related to the construct of hostility causes participants to subsequently judge ambiguous information as being more hostile. The results from each lab were combined into a random-effects meta-analysis. The result is a high-quality estimate of the hostile priming effect using a particular operationalization of the procedure.
In both the original study and the RRR, participants completed a sentence descrambling task where they saw groups of 4 words (break, hand, his, nose) and had to identify 3 words that created a grammatically-correct phrase (break his nose). Participants in the RRR were randomly assigned to one of two conditions: A condition where 20% of the descrambled sentences described “hostile” behaviors or a condition where 80% of the descrambled sentences described “hostile” behaviors. All participants then (a) read a vignette about a man named Ronald who acted in an ambiguously-hostile manner, (b) reported their judgments of Ronald’s hostility, and (c) reported their judgments of the hostility of a series of unrelated behaviors.
Thus, the RRR had one between-participants condition (i.e., the 20% hostile sentences vs. the 80% hostile sentences) and two outcome variables (i.e., hostility ratings of Ronald and hostility ratings of the behaviors). We expected to observe more hostile ratings from those who were in the 80% hostile condition than those who were in the 20% hostile condition.
The full report of the RRR can be found here.
I want to discuss the result of the meta-analysis for ratings of Ronald’s hostility. On a 0-10 scale, we observed an overall difference of 0.08 points, a pretty small effect. However, because the RRR had so much statistical power, the 95% confidence interval for this effect was 0.004 to 0.16, which excludes zero and was in the predicted direction. This works out to a standardized mean difference of d = 0.06, 95% CI[0.01, 0.12]. What should we make of this effect? Does this result corroborate the “hostile priming effect”? Or is this effect too small to be meaningful?
Here are my thoughts on this effect and my efforts to determine whether it is meaningful. However, being an author on this manuscript should not grant my opinions any special privilege. I completely expect people to disagree with me.
First, the argument in favor of the detection of the hostile priming effect
Some people will point to a meta-analytic effect of d = 0.06, 95% CI[0.01, 0.12] and argue this ought to be interpreted as a successful demonstration of the hostile priming effect. The logic of this argument is simple: Because participants were randomly assigned to groups, a nil effect (i.e., an effect of 0.00) is a theoretically meaningful point of comparison. And because the 95% confidence interval does not include zero, one could claim the observed effect is “significantly” different from a nil effect. In other words, the observed effect is significantly greater than zero in the predicted direction.
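To make the logic of this argument concrete, here is a minimal sketch of where the interval comes from. It uses the effect size (0.0621) and standard error (0.0283) that appear in the code at the end of this post, and it assumes a simple normal approximation; the actual random-effects model may differ in its details.

```r
# Meta-analytic estimate and standard error (values from the code at the end of the post)
es <- 0.0621
se <- 0.0283

# 95% confidence interval under a normal approximation (an assumption of this sketch)
ci95 <- es + c(-1, 1) * qnorm(0.975) * se

# z statistic and two-sided p-value against a nil effect of 0.00
z <- es / se
p_two_sided <- 2 * pnorm(-abs(z))

round(ci95, 2)         # lower and upper limits of the interval
round(p_two_sided, 3)  # p-value for the test against zero
```

The lower limit sits just above zero, which is why the "significantly different from a nil effect" reading depends so heavily on how the interval is constructed.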
To some, the magnitude of the effect does not matter. It only matters that an effect was detected and was in the predicted direction.
Arguments against the detection of the hostile priming effect
Without arguing about the magnitude of the effect, one can make at least two arguments against the idea that we detected the hostile priming effect. Essentially, these arguments are based on the idea that you can make different decisions about how you construct the confidence intervals, which would affect whether they include zero or not.
First, one could point out that there were two outcome variables and two meta-analyses. To maintain an overall Type 1 error rate of 5%, one ought to adjust for the fact that we conducted two hypothesis tests. In this case, each adjusted confidence interval would be wider than the unadjusted 95% confidence interval. The adjusted confidence interval for the ratings of Ronald’s hostility would then contain zero, which, by the same logic as described in the previous section, would be interpreted as an effect that is “not significantly” different from zero.
Second, you could argue that a 95% confidence interval is too lenient. Because of the resources that were invested in this study, perhaps we ought to adopt a more stringent criterion for detecting an effect, such as a 99% confidence interval. A 99% confidence interval, for example, would be wider and would include zero.
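Both arguments can be checked with a few lines of R. This is only a sketch: it reuses the effect size and standard error from the code at the end of the post, assumes a normal approximation, and uses a Bonferroni correction for the two tests (the specific correction is an assumption of this sketch; other adjustments are possible).

```r
es <- 0.0621  # meta-analytic estimate (from the code at the end of the post)
se <- 0.0283  # its standard error

# Helper (hypothetical name): confidence interval at a given level, normal approximation
ci_at <- function(level) es + c(-1, 1) * qnorm(1 - (1 - level) / 2) * se

ci_95   <- ci_at(0.95)          # the planned, unadjusted interval
ci_bonf <- ci_at(1 - 0.05 / 2)  # Bonferroni: two tests, so 97.5% per interval
ci_99   <- ci_at(0.99)          # a more stringent 99% interval

round(rbind(ci_95, ci_bonf, ci_99), 3)
```

Under these assumptions, the unadjusted interval excludes zero while the two wider intervals cross it, although the Bonferroni-adjusted lower limit falls only barely below zero.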
It is important to keep in mind that decisions on how to construct confidence intervals should be made a priori. In the RRR, we planned to construct 95% confidence intervals separately for each of the outcome variables. Sticking to our a priori data analysis plan, the 95% confidence interval for the ratings of Ronald’s hostility excludes zero. For this reason, I don’t believe these arguments are very persuasive.
Is the observed effect too small to be meaningful?
Let’s assume that we accept that a hostile priming effect was detected. So what? A separate way to evaluate the effect for Ronald’s hostility is to ask: Is the detected effect meaningful? To answer this question we need to establish what we mean by “meaningful”. In other words, we need to establish what is the Smallest Effect Size of Interest (SESOI).
Once a SESOI is established, one can create a range of effects that would be considered smaller than what is “of interest.” Then we can test whether our observed effect is smaller than what would be “of interest” by conducting two one-sided significance tests against the upper- and lower-bounds of the SESOI region using the TOSTER package (see Lakens, Scheel, & Isager, 2018). If the observed result is significantly smaller than the upper-bound of this range and is significantly larger than the lower-bound of this range, then one can conclude the effect is smaller than the SESOI. Equivalently, one can construct a 90% confidence interval and see whether the 90% confidence interval falls completely between the lower and upper bounds of the SESOI.
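The two one-sided tests are simple enough to compute by hand, which can help show what the TOSTER package is doing. The function below is a minimal sketch, not the package's implementation: the name tost_z is hypothetical, it assumes a normally distributed estimate, and the bounds of +/- 0.30 are just one of the ranges discussed later in this post.

```r
# Two one-sided z-tests for a meta-analytic estimate (hypothetical helper)
tost_z <- function(es, se, low, high, alpha = 0.05) {
  z_lower <- (es - low) / se    # tests H0: true effect <= lower bound
  z_upper <- (es - high) / se   # tests H0: true effect >= upper bound
  p_lower <- pnorm(z_lower, lower.tail = FALSE)
  p_upper <- pnorm(z_upper)
  # The (1 - 2 * alpha) interval, i.e. a 90% CI when alpha = .05
  ci <- es + c(-1, 1) * qnorm(1 - alpha) * se
  list(z_lower = z_lower, p_lower = p_lower,
       z_upper = z_upper, p_upper = p_upper,
       ci = ci,
       # Equivalence is claimed only if BOTH one-sided tests are significant
       equivalent = max(p_lower, p_upper) < alpha)
}

res <- tost_z(es = 0.0621, se = 0.0283, low = -0.30, high = 0.30)
round(c(res$z_lower, res$z_upper), 2)
res$equivalent
```

The equivalent flag and the interval check agree by construction: both one-sided tests are significant at alpha exactly when the (1 - 2 * alpha) confidence interval lies entirely inside the bounds.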
Here are 6 ways that I created the SESOI for the ratings of Ronald’s hostility. [Disclaimer: I constructed these SESOI after knowing the results. Ideally, these decisions should be made prior to knowing the results. This would be a good time to think about what SESOI you would specify before reading what comes next].
1) What is the SESOI based on theory? The first way to determine the SESOI is to look to theory for a guide. As far as I can tell, the theories that are used to predict priming effects merely make directional predictions (e.g., Participants in Group A will have a higher/lower value on the outcome variable than participants in Group B). I cannot see anything in these theories that would allow one to say, for example, effects smaller than a d of X would be inconsistent with the theory. Please let me know if anybody has a theoretically-based SESOI for priming effects.
2) What is the SESOI implied by the original study? A second way to determine the SESOI is to look at what effect was detectable in the original study. Srull and Wyer (1979) included 8 participants per cell in their study. Notably, the original study included several other factors and seemed to be primarily interested in the interactions among these factors, whereas the RRR was interested in the difference between two cells. Fair enough. Nevertheless, we could infer the SESOI based on what effect would have produced a significant result given the sample that was included in the original study.
To determine what effects would not have been significant in the original study, we can estimate what effect would correspond to 50% power. An effect smaller than this would not have been significant, an effect of exactly this magnitude would have produced p = .05, and an effect larger than this would have been significant in the original study. With n = 8 participants/cell, a one-tailed α = .05, and 1 – β = .50, the original authors would have needed an effect of d = +/- 0.86 to find a p-value < α. The effect from the RRR is significantly greater than d = -0.86 (z = 32.58, p < .001) and is significantly less than d = +0.86 (z = -28.19, p < .001).
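The 50%-power effect size can be approximated with base R's power.t.test(). This is a sketch under the assumption that the relevant comparison is an independent two-sample t-test between two cells of n = 8; because sd is left at its default of 1, the returned delta is on the scale of d.

```r
# Solve for the effect size that yields 50% power with 8 participants per cell
pw_50 <- power.t.test(n = 8,
                      power = 0.50,
                      sig.level = 0.05,
                      type = "two.sample",
                      alternative = "one.sided")

d_50 <- pw_50$delta  # sd defaults to 1, so delta is a standardized difference
round(d_50, 2)       # should land close to the 0.86 used above
```

Small differences in the second decimal place are possible depending on the software and approximation used.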
3) What is the SESOI based on prior research? A third way to determine the SESOI is to look at the previous literature. In 2004, DeCoster and Claypool conducted a meta-analysis on priming effects with an impression formation outcome variable (an interesting side note: the effect size computed for Srull & Wyer was d > 5 and was deemed a statistical outlier in this meta-analysis). The meta-analysis concluded there is a hostile priming effect of about a third of a standard deviation, d = 0.35, 95% CI[0.30, 0.41] (more interesting side notes: This meta-analysis did not account for publication bias and also includes several studies that were authored by Stapel and were later retracted for fraud. Due to these two factors, it seems likely that this effect size is upwardly biased). Nevertheless, we can at least point to a number to create an SESOI and know where it came from. Perugini, Gallucci, and Costantini (2014) suggest using the lower limit of a previous meta-analysis to be conservative.
The lower limit of the 95% CI for the DeCoster and Claypool meta-analysis is d = 0.30. The effect from the RRR is significantly greater than d = -0.30 (z = 12.80, p < .001) and is significantly less than d = +0.30 (z = -8.41, p < .001).
4) What is the SESOI based on my subjective opinion? A fourth way to determine the SESOI is to merely ask yourself “what is the smallest effect size that I think would be meaningful?” To me, in the context of an impression formation task using a 0-10 scale, I would put my estimate to be somewhere around one-quarter of one point on the rating scale. In other words, I would consider a mean difference of 0.25 points to be the minimally-interesting difference. Of course, people can disagree with me on this.
The standard deviation for ratings of Ronald’s hostility was 1.44 units, which means that 0.25 units is an effect of d = 0.25/1.44 = 0.17. The effect from the RRR is significantly greater than d = -0.17 (z = 8.20, p < .001) and is significantly less than d = +0.17 (z = -3.81, p < .001).
5) What is the SESOI that represents the amount of resources that others are likely to invest in their future studies? A fifth way to determine the SESOI is to ask “how large of an effect would be needed to be routinely detectable by future researchers?” The answer to this question comes from determining the resources that future researchers would be likely to invest in detecting this effect. For me, I think that researchers would be willing to invest 1,000 participants in a study to try to detect the hostile priming effect. Effects that require more than 1,000 participants would likely be deemed too expensive to routinely study. That is based on my gut and people are free to disagree.
If researchers were willing to collect n = 500 participants/cell, then they would be able to detect an effect as small as d = 0.16 with the minimum recommended level of statistical power (α = .05, one-tailed because of the directional prediction, and 1 – β = .80). The effect from the RRR is significantly greater than d = -0.16 (z = 7.85, p < .001) and is significantly smaller than d = +0.16 (z = -3.46, p < .001).
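As a rough check on this number, the sketch below solves for the detectable effect in two ways: with base R's power.t.test() and with the familiar normal-approximation formula d = (z_alpha + z_beta) * sqrt(2/n). Both assume an independent two-sample design with equal cell sizes, which is an assumption of this sketch.

```r
n_per_cell <- 500

# Exact (noncentral t) solution from base R; sd defaults to 1, so delta is d
d_t <- power.t.test(n = n_per_cell,
                    power = 0.80,
                    sig.level = 0.05,
                    type = "two.sample",
                    alternative = "one.sided")$delta

# Normal approximation: d = (z_alpha + z_beta) * sqrt(2 / n)
d_z <- (qnorm(0.95) + qnorm(0.80)) * sqrt(2 / n_per_cell)

round(c(t_based = d_t, normal_approx = d_z), 3)
```

With cells this large the two approaches agree closely, so the choice between them matters little here.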
6) What is the SESOI based on an arbitrarily small effect size? Finally, we can determine the SESOI by using an arbitrary convention like Cohen’s suggestion that a d = 0.20 represents a small effect. Or, more stringently, we could follow the suggestion of Maxwell, Lau, and Howard (2015) to consider a d = +/- 0.10 to be trivially small.
The effect from the RRR is significantly greater than d = -0.10 (z = 5.73, p < .001) and is NOT significantly less than d = +0.10 (z = -1.34, p = .09).
A Summary of the SESOI analyses
Let’s put it all together into one visualization. Look at the figure below. The blue diamond on the bottom represents the meta-analytic effect of d = 0.06 for the hostile priming effect. The vertical blue dashed lines represent the 90% confidence interval for the hostile priming effect. Notice that the 90% confidence interval just excludes zero.
The horizontal red lines represent the “ranges of equivalence” that I specified above. Each of the horizontal red lines is centered around zero. If a red line extends beyond both vertical dashed blue lines, then we would conclude that the observed effect is smaller than that SESOI.
Consistent with the analyses in the previous section, we can see the horizontal red lines extend past the 90% confidence interval except for the arbitrarily small effect size of d = +/- 0.10. Thus, by most standards, we would consider the observed effect to be smaller than the SESOI.
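The visual check in the figure can also be done numerically. The sketch below rebuilds the 90% confidence interval from the effect size and standard error in the code at the end of the post (assuming a normal approximation) and asks whether each equivalence range fully contains it. The short range names are hypothetical labels chosen for this sketch.

```r
es <- 0.0621
se <- 0.0283

# 90% confidence interval, matching two one-sided tests at alpha = .05
ci90 <- es + c(-1, 1) * qnorm(0.95) * se

# Half-widths of the five equivalence ranges (each centered on zero)
sesoi <- c(original_50_power = 0.86,
           meta_lower_limit  = 0.30,
           subjective        = 0.17,
           resources         = 0.16,
           arbitrary_small   = 0.10)

# TRUE when the whole 90% CI falls inside the range
inside <- (-sesoi < ci90[1]) & (ci90[2] < sesoi)

round(ci90, 3)
inside
```

Only the narrowest range, d = +/- 0.10, fails to contain the interval, because the upper limit of the 90% CI falls slightly above 0.10.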
So What Do We Conclude?
For one of the two outcome variables in the RRR, we detected a hostile priming effect in the predicted direction. Further, this detected effect is not significantly smaller than an arbitrarily small effect of d = 0.10 (but then again, our study was not designed to have high power to reject such a small SESOI).
However, when we construct the SESOI in any other way, this detected effect is significantly smaller than the SESOI. It would take several thousand participants to routinely detect a hostile priming effect of this magnitude, which likely makes this effect too resource-intensive to be part of an ongoing program of research.
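To put a rough number on the required sample size, the sketch below solves for the sample size needed to detect d = 0.06. It assumes an independent two-sample t-test, a one-tailed α = .05, and 80% power, mirroring the settings used earlier in the post; different assumptions would give different numbers.

```r
# Sample size per cell needed to detect d = 0.06 with 80% power
pw <- power.t.test(delta = 0.06,
                   sd = 1,
                   power = 0.80,
                   sig.level = 0.05,
                   type = "two.sample",
                   alternative = "one.sided")

n_per_cell <- ceiling(pw$n)  # round up to whole participants
n_total <- 2 * n_per_cell

c(per_cell = n_per_cell, total = n_total)
```

Under these assumptions the requirement is in the thousands per cell, far beyond the 1,000 total participants suggested above as a realistic investment.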
But the question that we really want answered is “what does this effect mean for theory?” Unfortunately (and frustratingly), the theories that predict such priming effects are too vague to determine whether an observed effect of d = 0.06 is corroborating or not, which means that intelligent people will still disagree on how to interpret this effect.
Code for equivalence tests and figure:
# here is the code for conducting the equivalence tests for the Srull & Wyer RRR
# code written by Randy McCarthy
# contact him at email@example.com with any question
# implied by the original study
TOSTER::TOSTmeta(ES=0.0621, se=0.0283, low_eqbound_d=-0.86, high_eqbound_d=0.86, alpha=0.05)
# from LL of decoster and claypool (2004)
TOSTER::TOSTmeta(ES=0.0621, se=0.0283, low_eqbound_d=-0.3, high_eqbound_d=0.3, alpha=0.05)
# from my subjective judgment
TOSTER::TOSTmeta(ES=0.0621, se=0.0283, low_eqbound_d=-0.17, high_eqbound_d=0.17, alpha=0.05)
# from amount of resources likely to be invested
TOSTER::TOSTmeta(ES=0.0621, se=0.0283, low_eqbound_d=-0.16, high_eqbound_d=0.16, alpha=0.05)
# from an arbitrarily small effect
TOSTER::TOSTmeta(ES=0.0621, se=0.0283, low_eqbound_d=-0.10, high_eqbound_d=0.10, alpha=0.05)
# plotting the equivalence tests
library(ggplot2)

equivRanges <- ggplot() +
  xlim(-1.5, 1.5) +
  # meta-analytic point estimate
  geom_point(aes(x = 0.06, y = 0.05),
             color = "blue4",
             size = 5,
             shape = "diamond") +
  scale_y_continuous(name = "", limits = c(0, 1), breaks = NULL) +
  # reference line at a nil effect
  geom_vline(aes(xintercept = 0),
             color = "black",
             size = 1) +
  # 90% confidence interval of the meta-analytic effect
  geom_vline(aes(xintercept = c(0.01, 0.11)),
             color = "blue4",
             size = 1,
             linetype = "dashed") +
  # one horizontal segment per equivalence range
  geom_segment(aes(y = c(0.9, 0.7, 0.5, 0.3, 0.1),
                   yend = c(0.9, 0.7, 0.5, 0.3, 0.1),
                   x = c(-0.86, -0.30, -0.17, -0.16, -0.10),
                   xend = c(0.86, 0.30, 0.17, 0.16, 0.10)),
               color = "red",
               size = 1.5) +
  # one label per range, in the same order as the segments
  geom_label(aes(y = c(0.95, 0.75, 0.55, 0.35, 0.15),
                 x = 0,
                 label = c("50% Power of Original Study",
                           "ll of CI From Previous Meta-Analysis",
                           "Randy's Subjective Opinion",
                           "Resources Likely to Be Invested",
                           "Arbitrarily Small Effect")),
             size = 4,
             nudge_x = -0.5) +
  ggtitle("Equivalence Testing") +
  xlab("Standardized Mean Difference of 'Hostile Priming Effect'")

# display the figure
equivRanges