

It is well documented that post hoc power calculations are not useful (Althouse, 2020; Goodman & Berlin, 1994; Hoenig & Heisey, 2001). Also known as observed power or retrospective power, post hoc power purports to estimate the power of a test given an observed effect size. The idea is to show that a “non-significant” hypothesis test failed to achieve significance because it wasn’t powerful enough. This allows researchers to entertain the notion that their hypothesized effect may actually exist; they just needed a bigger sample size. The problem with this idea is that post hoc power calculations are completely determined by the p-value. High p-values (i.e., non-significance) will always have low power. Low p-values will always have high power. Nothing is learned from post hoc power calculations. In this article, we demonstrate this idea using simulations in R.

To begin, let’s simulate some data and perform a t-test. We first draw 10 samples from each of two normal distributions. One has a mean of 10, and the other has a mean of 10.1. The set.seed(11) function allows us to always generate the same “random” data for this example, so you can follow along and replicate the results. It may help to pretend that we ran an experiment in which the x1 group is the control and the x2 group received some sort of treatment.
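The simulation and test can be set up along the following lines (a minimal sketch; it assumes a standard two-sample t.test() with default settings):

```r
# Simulate two groups of 10 observations; the true difference in means is 0.1
set.seed(11)
x1 <- rnorm(10, mean = 10, sd = 1)    # "control" group
x2 <- rnorm(10, mean = 10.1, sd = 1)  # "treatment" group

# Two-sample t-test of no difference in means
ttest <- t.test(x1, x2)
ttest$p.value
```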

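A post hoc power calculation then feeds the observed effect back into a power function. One way to do this with power.t.test(), using the observed difference in means and a pooled standard deviation (the exact arguments here are an assumption, not necessarily the original code):

```r
# "Observed power": treat the observed effect as if it were the true effect
pwr <- power.t.test(n = 10,
                    delta = abs(mean(x1) - mean(x2)),     # observed difference
                    sd = sqrt((var(x1) + var(x2)) / 2))   # pooled SD estimate
pwr$power
```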
The post hoc power is calculated to be about 0.053, which is very low. Proponents of post hoc power might conclude that the experiment was under-powered, but as we stated in the opening, post hoc power is completely determined by the p-value. The experiment may very well be under-powered, but a post hoc power calculation doesn’t prove that. You will always get low post hoc power on a hypothesis test with a large p-value. Hoenig and Heisey (2001) demonstrate this mathematically.

To see this relationship, below we “replicate” the code 2000 times. The replicate() function makes quick work of this. Simply pass our previous code (minus the set.seed() function) to replicate() as an expression surrounded by curly braces, and specify the number of times to replicate the code.
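A sketch of that replication, combining the simulation, the test, and the observed-power calculation from above:

```r
# Repeat the simulate/test/post-hoc-power cycle 2000 times
sim_out <- replicate(n = 2000, expr = {
  x1 <- rnorm(10, mean = 10, sd = 1)
  x2 <- rnorm(10, mean = 10.1, sd = 1)
  ttest <- t.test(x1, x2)
  pwr <- power.t.test(n = 10,
                      delta = abs(mean(x1) - mean(x2)),
                      sd = sqrt((var(x1) + var(x2)) / 2))
  # return the p-value and the "observed power" for this run
  c(pvalue = ttest$p.value, obs_power = pwr$power)
})
```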

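Plotting one against the other makes the relationship plain; one way to draw it:

```r
# Each point is one simulated experiment: observed power is a
# deterministic, decreasing function of the p-value
plot(sim_out["pvalue", ], sim_out["obs_power", ],
     xlab = "p-value", ylab = "observed power")
```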
The p-value of the test and post hoc power (or “observed power”) are returned in a vector for each run, and the results are stored in an object called sim_out. The plot shows that high p-values always come with low observed power, and low p-values with high observed power.

The same thing happens with other tests. We also ran a chi-square test on simulated data, replicated it the same way, and plotted observed power against the p-value. (The gaps in the scatterplot line are due to the discrete counts in the table.) Again, nothing is gained by performing post hoc power calculations. An insignificant result will always have low “power.”
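A rough sketch of what that chi-square simulation might look like; the group size, the true proportions, and the use of power.prop.test() here are assumptions, not values from the original analysis:

```r
# Replicate a chi-square test on simulated two-group data, along with
# a post hoc power calculation for each run (illustrative values only)
sim_out2 <- replicate(n = 2000, expr = {
  grp <- rep(c("control", "treatment"), each = 20)
  out <- c(rbinom(20, size = 1, prob = 0.5),    # control "success" rate
           rbinom(20, size = 1, prob = 0.55))   # treatment "success" rate
  chisqtest <- chisq.test(table(grp, out))
  p_hat <- tapply(out, grp, mean)               # observed proportions
  pwr <- power.prop.test(n = length(grp) / 2,
                         p1 = p_hat[[1]], p2 = p_hat[[2]])
  c(pvalue = chisqtest$p.value, obs_power = pwr$power)
})

plot(sim_out2["pvalue", ], sim_out2["obs_power", ],
     xlab = "p-value", ylab = "observed power")
```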
Power is something to think about before running an experiment, not after. Power is the probability of correctly rejecting the null hypothesis when a hypothesized effect really exists. We pretend some meaningful effect is actually real and then determine the sample size we need to detect it with a high level of power, such as 0.90. In our first example, the real effect was 0.1 (with a standard deviation of 1). How large a sample size do we need to have a 90% chance of correctly rejecting the null hypothesis of no difference in the means at a significance level of 0.05? We can use the power.t.test() function to answer this.
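With those inputs, the call is straightforward (leaving n empty tells power.t.test() to solve for it):

```r
# Sample size per group needed to detect a true difference of 0.1
# (SD = 1) with 90% power at the 0.05 significance level
power.t.test(delta = 0.1, sd = 1, sig.level = 0.05, power = 0.90)
```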
It appears we need over 2100 subjects per group. Assuming our effect size estimates are realistic and meaningful, this provides us guidance on how many subjects we need to recruit. High power is a probability, not a certainty, though; it doesn’t guarantee a significant result.

In addition to considering power before an experiment, we should focus less on p-values after the analysis and more on confidence intervals on the effect. There is a long tradition of making binary decisions of “some effect” or “no effect” based on a p-value falling below an arbitrary threshold. For example, get a p-value of 0.08 and declare there is “no effect” because the p-value is not less than 0.05. Or get a p-value of 0.000023 and declare a “highly significant effect” because the p-value is very small. In both cases, we’re making decisions about the effect without actually investigating the effect. We previously ran a chi-square test on simulated data; a confidence interval on that effect would tell us far more than its p-value did.
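For the t-test example, the confidence interval comes along for free in the object returned by t.test():

```r
# The 95% confidence interval on the difference in means is reported
# alongside the p-value and says far more about the size of the effect
t.test(x1, x2)$conf.int
```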