How many times has someone asked you, "Is that result statistically significant?" As a researcher I've probably heard it 6,321 times +/- 30 with a *p*-value of .02.

I’m arguing here for some humility and some qualitative thinking on testing. Here is the case for humility, aimed perhaps and especially at those who consider themselves data analysts or quant jocks.

(*This is courtesy of a behavioral scientist acquaintance, Kristian Sorensen*)

Suppose you have a test idea and compare the mean response rate for your control and experimental groups. Let’s say it’s 20 subjects in each sample. You use an independent-means *t*-test and your result is significant (*t* = 2.7, d.f. = 38, *p* = 0.01). *[Note: Yes, you can have small samples and statistically significant results – this isn’t a debate point, it’s a matter of fact.]*
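For the curious, here is a minimal sketch of what that *t*-test computes, using only the Python standard library. The response-rate data below are made up for illustration, not figures from any real campaign. The permutation check at the end makes the *p*-value’s actual meaning concrete: the probability of seeing a difference at least this large *if the null hypothesis were true*.

```python
# Illustrative sketch: independent-means t statistic plus a permutation
# p-value. The data are invented; only the mechanics are the point.
import math
import random
import statistics as stats

def t_statistic(a, b):
    """Independent-means t statistic with pooled sample variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * stats.variance(a) + (nb - 1) * stats.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    return (stats.mean(a) - stats.mean(b)) / se

# Hypothetical response rates (percent) for control and test panels.
control = [18, 22, 19, 21, 20, 17, 23, 20, 19, 21]
test    = [24, 26, 22, 25, 23, 27, 24, 25, 26, 23]

t = t_statistic(test, control)

# Permutation check: if the null were true, group labels are arbitrary,
# so shuffle the labels many times and ask how often a difference at
# least this extreme appears by chance alone. That fraction is the
# p-value -- P(data at least this extreme | null is true), and nothing more.
random.seed(0)
observed = abs(t)
combined = control + test
n_perm = 10_000
more_extreme = sum(
    1 for _ in range(n_perm)
    if (random.shuffle(combined) or abs(t_statistic(combined[:10], combined[10:])) >= observed)
)
p = more_extreme / n_perm
```

Note what the code does *not* compute: the probability that the null hypothesis is true, or that a replication would succeed — exactly the readings the quiz below is designed to catch.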

Mark each of the statements below as “true” or “false.”

- You have absolutely disproved the null hypothesis (that is, there is no difference between the population means). [ ] true [ ] false
- You have found the probability of the null hypothesis being true. [ ] true [ ] false
- You have absolutely proved your experimental hypothesis (that there is a difference between the population means). [ ] true [ ] false
- You can deduce the probability of the experimental hypothesis being true. [ ] true [ ] false
- You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision. [ ] true [ ] false
- You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions. [ ] true [ ] false

This quiz was given to 44 psychology students, 39 professors and lecturers of psychology, and 30 statistics teachers. Every professor taught null-hypothesis testing, and every student had passed at least one statistics course in which it was taught.

80% of the professors and teachers *teaching statistics* got at least one of the statements wrong.

Here is an accurate way to think about significance testing, one that benefits from being non-technical and humble, and that fosters better conclusions and next steps with donor dollars.

Saying something is *statistically significant* is akin to saying there is some reason to believe the test idea works. The operational meaning is that we should repeat the test.

This does happen, or at least it did, under the test, re-test, and rollout discipline of the old direct-mail days among the larger-volume players.

One reason all this matters? Too many wannabe behavioral scientists mistake a single experiment producing “statistically significant” results for proof of a universal law and established truth. Nonsense. People are messy and complicated; misinterpreting statistical significance and declaring victory suggests they aren’t.

Kevin