***If you find an error, please find me at the next conference and say “hi.” You are entitled to one free beer/coffee from me (either one at any time of the day/night, I don’t judge).
A few weeks ago, Daniel Lakens posted an excellent blog post about equivalence hypothesis testing (http://daniellakens.blogspot.com/2016/05/absence-of-evidence-is-not-evidence-of.html). Equivalence hypothesis testing is a method that uses frequentist statistical analyses (specifically, p-values) to provide support for a null hypothesis. Briefly, Lakens describes a form of equivalence hypothesis testing wherein the “null hypothesis” is a range of effects that are considered to be smaller than the smallest effect of interest. To provide evidence for a null effect, a researcher performs two one-sided tests: one to determine if the effect is smaller than the upper boundary of the equivalence range and one to determine if the effect is larger than the lower boundary of the equivalence range. If the effect is both significantly less than the upper boundary and significantly greater than the lower boundary of the equivalence range, the researcher can classify the effect as too small to be of interest. Of course, because these tests all use p-values, they are subject to known long-run error rates (the familiar Type 1 and Type 2 errors). (If this description was too brief, I recommend taking the time to read Lakens’ original blog post.)
Although I could follow the logic of this equivalence testing procedure, I didn’t have an “intuitive feel” for what it means to use p-values to generate support for a null effect. This is probably due to years of learning the traditional NHST approach to hypothesis testing wherein you can only “reject” or “fail to reject” the null hypothesis. Here is the process I went through to further my understanding of equivalence testing.
First, to get a feel for how p-values can be used to generate support for a null effect it is useful to get a feel for how p-values behave when there is an effect. Let’s take a simple 2-group design where there is an effect of d = 0.3. Below is the distribution of p-values from 10,000 simulated studies with a population effect of d = 0.3 and with 50 individuals per group. You can see that 31.8% of these p-values are below .05. This figure is merely a visual representation of the statistical power of this design: Given a certain effect (e.g., d = 0.3), a certain sample size (e.g., N = 100), and an alpha level (e.g., .05), you will observe a p-value less than alpha at a known long-run rate. In this case, statistical power is 0.32.
Let’s stick with the example where there is an effect of d = 0.3. Now suppose the sample size is doubled. In 10,000 simulated studies, increasing the sample size from 50 per group to 100 per group results in more low p-values. As can be seen below, 55.5% of the p-values are below .05. In other words, increasing the sample size increases statistical power. When there is a to-be-detected effect, increasing the sample size increases your chances of correctly detecting that effect by obtaining a p-value below .05. In this case, statistical power is 0.55.
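The two power figures above can be sanity-checked without simulating anything. Here is a minimal Python sketch that assumes a normal approximation to the two-group t-test. It is not the R code linked at the end of this post, and the function name approx_power is only illustrative.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group test of d = 0.

    Assumption: normal approximation, where the standardized mean
    difference has a standard error of sqrt(2 / n_per_group).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # true effect in standard-error units
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


for n in (50, 100):
    print(n, round(approx_power(0.3, n), 3))
```

Under this approximation the two designs come out at about 0.32 and 0.56, in line with the simulated 31.8% and 55.5%.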
In comparison, let’s look at a scenario where there is no effect in the population, d = 0 (which is the scenario that is most relevant to equivalence testing). With no effect in the population you can only make Type 1 errors. In 10,000 simulated studies where there is a population effect of zero (i.e., d = 0) and a total N of 100, 5.17% of the p-values were below .05. (If you ran this simulation again you might observe slightly more or slightly fewer p-values below .05. In the long run 5% of the studies will result in p-values below .05). The 5% of studies with low p-values are all Type 1 errors because there is no effect in the population.
Let’s stick with the scenario where we have a population effect of zero (i.e., d = 0) and double the sample size from a total N of 100 to a total N of 200. The distribution of p-values from 10,000 simulated studies shows that 4.87% of the p-values were less than .05 (again, in the long run this will be exactly 5%). When there is no effect in the population the distribution of p-values does not change when the sample size changes.
To recap: When there is a to-be-detected effect you can only make Type 2 errors. With all else being equal, increasing the sample size increases statistical power which, by definition, decreases the likelihood of making a Type 2 error. When there is no effect you can only make Type 1 errors. With all else being equal, increasing the sample size does not affect the distribution of observed p-values. In other words, when a null effect is true, your rate of significant results will simply be your alpha level regardless of sample size.
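The recap can be reproduced with a small simulation. The sketch below is in Python and makes a simplifying assumption: both groups are drawn from normal populations with a known SD of 1, so it uses a z-test rather than a t-test. It also defaults to 2,000 studies per cell to keep the run short. The function name rejection_rate is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def rejection_rate(d, n_per_group, n_sims=2000, alpha=0.05, seed=1):
    """Share of simulated studies with a two-sided p-value below alpha.

    Assumption: both groups are normal with a known SD of 1, so a
    z-test stands in for the usual t-test.
    """
    rng = random.Random(seed)
    z = NormalDist()
    se = sqrt(2 / n_per_group)  # standard error of the mean difference
    hits = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(d, 1.0) for _ in range(n_per_group)]
        z_obs = (fmean(treated) - fmean(control)) / se
        p_value = 2 * (1 - z.cdf(abs(z_obs)))
        hits += p_value < alpha
    return hits / n_sims


for n in (50, 100):
    print(n, rejection_rate(0.0, n), rejection_rate(0.3, n))
```

With d = 0 the rate should hover near .05 at both sample sizes, while with d = 0.3 it should climb as n grows. The exact values depend on the seed and the number of simulated studies.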
What does this have to do with equivalence testing? A lot actually. If you followed the information above, then understanding equivalence testing is just re-arranging and re-framing this already-familiar information.
For the following simulations we are assuming there is no effect in the population (i.e., d = 0). We are also assuming that you have determined that an absolute effect less than d = 0.4 is either too small for you to consider meaningful or too resource-intensive for you to study. (This effect was chosen only for illustrative purposes; you can use whatever effect you want.)
To provide support for a null effect it is insufficient to merely fail to reject the null hypothesis (i.e., observe a p-value greater than your alpha level) because a non-significant effect can either indicate a null effect or a weakly powered test of a true effect that results in a Type 2 error. And, as shown above, increasing your sample size does not increase your chances of detecting a true null effect with traditional NHST. However, increasing your sample size can increase the statistical power to detect a null effect with equivalence testing.
Let’s run some simulations. We have already seen that if a null effect is true (d = 0) and your total sample size is 100, traditional NHST will result in 5% of p-values less than .05 in the long run. I then took these 10,000 simulated studies and tested whether the effects were significantly smaller than d = 0.4 and whether the effects were significantly larger than d = -0.4. As can be seen below, in these 10,000 simulations, when d = 0 and N = 100, 63.9% of the samples resulted in an effect that was significantly smaller than d = 0.4 and 63.1% of samples resulted in an effect that was significantly larger than d = -0.4 (these percentages are not identical because of randomness in the simulation procedure; in the long run they will be equal).
Some of the samples with effects that are significantly smaller than d = 0.4 actually have effects that are much smaller than d = 0.4. These samples have effects that are significantly smaller than d = 0.4 but are not significantly larger than d = -0.4. Likewise, some of these samples with effects that are significantly larger than d = -0.4 have effects that are much larger than d = -0.4. These samples have effects that are significantly larger than d = -0.4 but are not significantly smaller than d = 0.4.
In equivalence testing, classifying an effect as “null” requires the effect to be both significantly less than the upper bound of the equivalence range and significantly greater than the lower bound of the equivalence range. In these 10,000 samples, 27.04% of the samples would be considered “null” (i.e., the observed effect was significantly inside the range -0.4 < d < 0.4).
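For a single study, the decision rule just described might look like the following Python sketch. It assumes a large-sample normal approximation in place of the t-distribution and symmetric bounds of plus and minus 0.4. The function name and the toy data are hypothetical, not taken from the simulations in this post.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def tost_equivalence(group_a, group_b, bound_d=0.4, alpha=0.05):
    """Two one-sided tests on a standardized mean difference.

    Assumption: a large-sample normal approximation stands in for the
    t-distribution, and the bounds are -bound_d and +bound_d.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    d_obs = (fmean(group_a) - fmean(group_b)) / sqrt(pooled_var)
    se_d = sqrt(1 / n_a + 1 / n_b)  # approximate standard error of d
    z = NormalDist()
    # H0: d <= -bound_d. A small p says the effect is above the lower bound.
    p_lower = 1 - z.cdf((d_obs + bound_d) / se_d)
    # H0: d >= +bound_d. A small p says the effect is below the upper bound.
    p_upper = z.cdf((d_obs - bound_d) / se_d)
    return d_obs, p_lower, p_upper, (p_lower < alpha and p_upper < alpha)


# Toy data: two identical groups of 200 made-up ratings from 0 to 4.
ratings = [i % 5 for i in range(200)]
print(tost_equivalence(ratings, list(ratings)))
```

An effect is only called “null” when both one-sided p-values fall below alpha, which is why the final element of the result combines them with "and".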
Now comes the real utility of equivalence testing. If we double the total sample size from N = 100 to N = 200 we can increase the statistical power of claiming evidence for the null hypothesis. As can be seen below, within the 10,000 simulated studies where d = 0 and N = 200, 88.3% of the studies had effects that were significantly smaller than d = 0.4 and 88% had effects that were significantly greater than d = -0.4 (again, differences in these percentages are due to randomness in the data generation process and are not meaningful), and 76.4% of these studies had effects that were both significantly smaller than d = 0.4 and significantly greater than d = -0.4. Thus, increasing the total sample size from 100 to 200 increased the percentage of studies that would be classified as “null” (i.e., the observed effect was significantly inside the range -0.4 < d < 0.4) from 27.04% to 76.4%.
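These percentages can also be approximated analytically. The Python sketch below assumes a true effect of d = 0 and a normal approximation to the test statistics, so it will not match the simulated values exactly. The function name approx_tost_power is only illustrative.

```python
from math import sqrt
from statistics import NormalDist


def approx_tost_power(n_per_group, bound_d=0.4, alpha=0.05):
    """Approximate rates of significant results when the true d is 0.

    Returns (one_sided, both): the chance that a single one-sided test
    is significant, and the chance that both are.
    Assumption: normal approximation with symmetric bounds.
    """
    z = NormalDist()
    # Bound expressed in standard-error units, minus the one-sided cutoff.
    margin = bound_d * sqrt(n_per_group / 2) - z.inv_cdf(1 - alpha)
    one_sided = z.cdf(margin)
    both = max(0.0, 2 * one_sided - 1)
    return one_sided, both


for n in (50, 100):
    one_sided, both = approx_tost_power(n)
    print(n, round(one_sided, 3), round(both, 3))
```

The approximation gives roughly 64% and 28% at 50 per group and roughly 88% and 76% at 100 per group, close to the simulated values above.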
Here are the major take-home messages. First, equivalence testing is nice because it allows you to provide evidence for a null effect by using the tools that most researchers are already familiar with (i.e., p-values). Second, unlike traditional NHST, increasing N can increase the statistical power of detecting a null effect (defined by the equivalence range) when using equivalence testing.
These simulations are how I went about building my understanding of equivalence testing. I hope this helps others build their understanding too. The R-code for this post can be accessed here (https://osf.io/ey5wq/). Feel free to use this code for whatever purposes you want and please point out any errors you find.
I have a confession: While I enjoy learning other people’s ideas about major issues in the field of psychology, I also am interested in the mundane details about psychological research. You know, the minutiae that most people overlook or take for granted when they do psychological research. In this way, I am sort of like a less cool, less funny Jerry Seinfeld (if he happened to be into research). The nice part is that when I learn a tip or a trick that may be useful I often can incorporate these ideas immediately into my work. The research process is slow, so it is not often that we get an immediate payoff.
With that said, I believe that every researcher learns and collects little tips and tricks along the way, but these are not often the focus of conversations. However, I believe these small details are important, helpful, and interesting. So, I put together a list of 5 mundane details that I care about. I call these my research minutiae. These are from my personal experience, so take them with a grain of salt (and a cup of coffee, research should always involve coffee). But, perhaps some of these will be helpful to you.
1) I cannot tell you how often I review an article and the authors report an interaction with the letter ‘x’ and not the multiplication symbol ‘×’. Please keep in mind that x != ×.
Incorrect: We observed a Condition x Gender interaction.
Correct: We observed a Condition × Gender interaction.
I don’t believe this detail is picky. First, one is correct and one is incorrect. Further, keep in mind that our audience is other humans whom we must convince that our research is meaningful. These small details help signal to the reader that you (the author) attend to details and that you have taken your time with preparing your manuscript. Conversely, if your results are filled with errors, the reader may wonder how careful you were with other aspects of the research design or statistical analyses.
2) When you develop a line of research you often end up measuring the same things in different studies. Label your variables consistently. This allows you to more easily re-use syntax/script that you have already created. This also makes it easy to find common variables across your different datasets. This has the added benefit of nudging you to start thinking of your research less as a collection of individual studies and more as a body of research. In addition to using the same variable names when referring to the same thing, here is a list of common ways to label variables.
Labels with spaces (example: feedback condition): This may work fine for labeling columns in Excel, but most stats programs won’t be able to read variable names with spaces. Don’t do it.
ant.format (example: feedback.condition): This is recommended by the Google style guide for R. It is pretty readable and, from what I can tell, is fairly common.
gooseneck_format (example: feedback_condition): Also very common. In my opinion, this is not visually appealing. For example, mean_of_trait_aggression looks ugly. But that is just my opinion.
camelBack (example: feedbackCondition): This is what I tend to use. It saves a keystroke and a space, which is nice for making concise script. However, if your variable names are long, this format can be tough to read. Also, some people capitalize the first letter of the first word and some don’t. This becomes important when using programs like R, which are case sensitive.
First, be consistent within your personal datasets. Second, if you run a lab or have close collaborators, encourage everybody to use the same variable names and naming style. It is a small thing that helps with communication.
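If older datasets mix these styles, relabeling the columns by hand is tedious. Here is a small Python sketch of a helper that converts the other three styles to camelBack. The function name to_camel_back is hypothetical, and the choice of a lowercase first letter is just one convention.

```python
import re


def to_camel_back(name):
    """Convert a space-, period-, or underscore-separated label to camelBack."""
    parts = [p for p in re.split(r"[ ._]+", name.strip()) if p]
    if not parts:
        return ""
    # Lowercase only the first letter so existing capitals are kept.
    first = parts[0][:1].lower() + parts[0][1:]
    rest = [p[:1].upper() + p[1:] for p in parts[1:]]
    return first + "".join(rest)


for label in ("feedback condition", "feedback.condition", "feedback_condition"):
    print(label, "->", to_camel_back(label))
```

All three example labels map to feedbackCondition, so scripts written for one dataset can be pointed at another.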
3) Come up with a good set of demographic questions and try to use it in all of your research. I always collect demographic information, but I used to come up with a set of questions for each of my studies. Recently I had a reason to examine data from across my different studies (I was comparing demographics of my in-person participants and my online participants). I quickly found my inconsistencies to be frustrating. For example, I sometimes gave participants the option to self-identify as Asian-American or as of Middle Eastern descent and other times there was not the option to self-identify as Middle Eastern. Sometimes I forced people to identify as male or female, other times I gave an NA/other option. I also used different response formats for education level, relationship status, and employment status. This made my cross-study comparisons less than optimal and took much longer than it needed to. Just to be clear, I am not only talking about collecting the same demographic information, but I also am talking about using the exact same wording and the exact same response formats. This makes apples-to-apples comparisons across all of your studies a cinch.
If you run a lab, take the time to come up with a good set of demographic items (this is a good task for an undergraduate to learn about question wording and different response formats). At this point, try to be inclusive; answering these questions only adds seconds to a participant’s time. Get the information from them while you can. Try to use those questions in all of your studies. Not only will it allow you to re-use script/syntax (see previous point), but it will facilitate your cross-study comparisons. Also keep in mind that you are not beholden to these questions. If there are problems with some questions, or you need to add/drop questions for a specific study, then do what you need to do.
There are many reasons that the entire field will never adopt a common protocol for the collection of demographic information, but it seems feasible for a small group of researchers in a specific area to do so. I am not aware of that happening, but one could imagine. In my mind, I imagine that would be awesome. Future meta-analysts would thank you.
4) Save your raw data in a separate file and try to never open it again. In fact, if you ever looked at the folders where I keep my files, you would find one datafile that is labeled with “RAW” and another datafile that I actually use for analyses. For example, when I get survey data entered into a .csv file, I save that file as Study 1 RAW.csv. Then I save the file again as Study 1.csv. I only do my analyses on the Study 1.csv file and NEVER on the Study 1 RAW.csv file. When computing variables, reverse coding variables, etc. it is too easy to make mistakes. It is always nice to be able to go back to the RAW datafile and start from scratch if you need to.
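This habit is easy to automate. The Python sketch below assumes the naming pattern from the example above (a file name ending in “ RAW” before the extension). The function name make_working_copy is hypothetical, and marking the raw file read-only is an extra safeguard that I am adding here, not something the workflow requires.

```python
import shutil
import stat
from pathlib import Path


def make_working_copy(raw_path):
    """Copy 'Study 1 RAW.csv' to 'Study 1.csv' and lock the raw file.

    Assumption: raw files end in ' RAW' before the extension.
    An existing working copy is never overwritten.
    """
    raw = Path(raw_path)
    if not raw.stem.endswith(" RAW"):
        raise ValueError("expected a file name ending in ' RAW'")
    working = raw.with_name(raw.stem[: -len(" RAW")] + raw.suffix)
    if not working.exists():
        # copyfile copies contents only, so the copy stays writable.
        shutil.copyfile(raw, working)
    # Make the raw file read-only so it cannot be edited by accident.
    raw.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return working
```

Calling make_working_copy("Study 1 RAW.csv") returns the path to Study 1.csv, which is the only file the analysis script should ever read or modify.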
5) Annotate your script/syntax. People will tell you that this is important because in the age of open science other people need to be able to open your files and understand what you did. Another researcher should be able to exactly reproduce your analyses merely by looking at your files.
That is all well and good. I love open science and I think those are valid points. I will make a more pragmatic argument. Annotate your script/syntax so that you will know what you did 3 months or 3 years from now. It may seem like a time waster. After all, while you are doing your analyses you know what your variable names mean and you know why you are doing your analyses. It is hard to imagine not knowing what your variables mean or why you did a specific analysis. But trust me, your memory for these things decays quickly. If I look back at my old files, my memory for what I was thinking is surprisingly bad. I am thankful that I am a good note taker. Make detailed notes about what you are doing and why you are doing it while you are doing your analyses. Your future self will thank you.
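As an illustration of the kind of note I mean, here is a short annotated Python snippet. The scale, the item, and the variable names are all made up for the example. The point is that the comments record why the step exists, not just what the code does.

```python
def reverse_code(score, low=1, high=5):
    """Reverse-code a response on a low-to-high rating scale."""
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside {low}-{high}")
    return low + high - score


# WHY: Item 3 of the (hypothetical) aggression scale is worded in the
# opposite direction from the other items, so it must be reversed
# before the items are averaged. On a 1-5 scale a 2 becomes a 4.
# The original column is kept so the step can be checked later.
aggressionItem3 = [2, 5, 1, 3]
aggressionItem3R = [reverse_code(x) for x in aggressionItem3]
print(aggressionItem3R)
```

Three years from now, the comment is what will tell you that the R suffix means “reversed” and why the reversal was needed.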
So, these are a few of the research minutiae that I have picked up over the few years that I have been doing research. My hope is that one of these things may help at least one other person. Most of these are things that I now do without much awareness. So, if I notice myself doing any other things, I may add to this list in future posts. I also enjoy hearing about other people’s research minutiae. If you have any that you would like to share, please leave a comment or contact me.