What is an effect size?

In reporting on the results of a research study, you might be asked “What was your effect size?” In some disciplines, there is a renewed emphasis on the reporting of effect sizes with the hope that this will encourage researchers to rely less on statistical significance and P-values in interpreting their findings, and to focus more on the practical or clinical meaning of their results.

The idea of reporting an effect size is nothing new. Consider, for example, a randomised trial of overweight pregnant women comparing the effects of regular weight measurement with a control (no regular measurement) on weight gain between weeks 16 and 34 of the pregnancy. The mean difference (control – regular measurement) was 0.12 kg/week, with a 95% confidence interval of 0.03 to 0.22 kg/week. The mean difference, indicating higher average weight gain in the control group, is a measure of effect size. The confidence interval provides information about the precision of the estimated effect. When the outcome of interest is numerical, differences of means can provide useful measures of effect size. A difference of means quantifies the effect in terms of the original measurement scale, which is a natural starting place for considering the practical implications of the results.
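As a minimal sketch of how a mean difference and its confidence interval might be computed, the Python below uses a large-sample normal approximation with unpooled variances. The function name mean_difference_ci and the data are hypothetical and invented for illustration; they are not the trial's data, and a real analysis of small samples would normally use a t-based interval instead.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def mean_difference_ci(group_a, group_b, level=0.95):
    """Return (difference, lower, upper) for mean(group_a) - mean(group_b).

    Assumption: a large-sample normal approximation with unpooled variances.
    """
    diff = mean(group_a) - mean(group_b)
    # Standard error of the difference between two independent sample means.
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical weekly weight gains in kg/week (illustrative values only).
control = [0.62, 0.48, 0.55, 0.71, 0.50, 0.66, 0.58, 0.44]
regular = [0.45, 0.52, 0.39, 0.47, 0.41, 0.56, 0.36, 0.48]

diff, lower, upper = mean_difference_ci(control, regular)
print(f"Mean difference: {diff:.3f} kg/week, 95% CI {lower:.3f} to {upper:.3f}")
```

The estimate and both interval limits are in kg/week, so they can be read directly against what would count as a practically important difference in weight gain.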

Consider another randomised trial for pregnant women with depression who do not wish to take anti-depressants; the treatments compared are transcranial magnetic stimulation (TMS) and a “sham” stimulation (control). One outcome measured is whether the woman is in remission by the end of pregnancy. Here, where the outcome is binary, the effect size can be quantified using the odds ratio: the odds of remission in the TMS group divided by the odds of remission in the control group. This is a standard and appropriate choice. Again, a confidence interval can be reported with this estimate of the effect size.
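The odds ratio and an approximate confidence interval can be sketched as follows. This is one common approach (a Wald interval calculated on the log odds scale), not necessarily the method used in the trial. The function name odds_ratio_ci and the counts are hypothetical, and the sketch assumes no cell of the 2x2 table is zero.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_ci(events_trt, n_trt, events_ctl, n_ctl, level=0.95):
    """Return (odds ratio, lower, upper) for treatment versus control.

    Assumption: Wald interval on the log scale; every cell must be non-zero.
    """
    a, b = events_trt, n_trt - events_trt  # treatment: remission, no remission
    c, d = events_ctl, n_ctl - events_ctl  # control: remission, no remission
    odds_ratio = (a / b) / (c / d)
    # Standard error of the log odds ratio.
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = exp(log(odds_ratio) - z * se_log)
    upper = exp(log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts: 12 of 20 in remission with TMS, 6 of 20 with sham.
odds_ratio, lower, upper = odds_ratio_ci(12, 20, 6, 20)
print(f"Odds ratio: {odds_ratio:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

An odds ratio of 1 corresponds to no difference between the groups, so whether the interval includes 1 is usually the first thing to check.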

In the case of fitting statistical models for numerical outcomes with numerical explanatory variables, regression coefficients provide a measure of effect size.

Note that these kinds of effect sizes are those recommended by the American Psychological Association:

Always present effect sizes for primary outcomes … If the units of measurement are meaningful on a practical level (e.g., number of cigarettes smoked per day), then we usually prefer an unstandardized measure (regression coefficient or mean difference) to a standardized measure …

Wilkinson & APA Task Force on Statistical Inference (1999, p.599)

This type of measure is sometimes called a “direct” measure of effect.

“Scale free” and standardised effect sizes

There is a range of other measures of effect size that can be calculated for the results of a study; the choice can depend on the nature of the study design and the type of outcomes measured.

Scale free effect sizes do not depend on the units of the outcomes measured; Pearson’s correlation coefficient is a familiar example. No matter what units are used to measure each of the two numerical variables of interest, the correlation will take the same value, a number between -1 and +1.
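A short sketch can illustrate this invariance. The heights and weights below are hypothetical, and pearson_r is a hand-written helper rather than a library function; the correlation is calculated once in metric units and once after converting to inches and pounds.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical measurements (illustrative values only).
height_cm = [152.0, 160.5, 165.0, 171.2, 178.4, 183.0]
weight_kg = [51.0, 58.5, 63.2, 62.0, 74.8, 80.1]

# The same measurements expressed in different units.
height_in = [h / 2.54 for h in height_cm]
weight_lb = [w * 2.20462 for w in weight_kg]

r_metric = pearson_r(height_cm, weight_kg)
r_imperial = pearson_r(height_in, weight_lb)
print(f"r (cm, kg) = {r_metric:.6f}; r (in, lb) = {r_imperial:.6f}")
```

Multiplying either variable by a positive constant rescales the numerator and denominator by the same factor, so the two values agree apart from floating-point rounding.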

“Standardised effect sizes” are also scale free. Cohen’s d, for example, used in the context of a comparison of the means of two independent samples, standardises the mean difference by dividing by the pooled standard deviation. There are many other such standardised effect sizes designed for different types of studies including, for example, overall measures of effect when there are more than two treatments to be compared. One motivation for the development of these types of measures was the need to meta-analyse results from studies that used different measures of the same construct.
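The calculation of Cohen’s d described above can be sketched as below, with hypothetical data and a hypothetical function name cohens_d. The same outcome is expressed in kilograms and in grams to show that the mean difference changes with the units while d does not.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical outcome values in kilograms (illustrative values only).
control_kg = [0.62, 0.48, 0.55, 0.71, 0.50, 0.66, 0.58, 0.44]
regular_kg = [0.45, 0.52, 0.39, 0.47, 0.41, 0.56, 0.36, 0.48]

# The same values expressed in grams.
control_g = [1000 * x for x in control_kg]
regular_g = [1000 * x for x in regular_kg]

diff_kg = mean(control_kg) - mean(regular_kg)
diff_g = mean(control_g) - mean(regular_g)
d_kg = cohens_d(control_kg, regular_kg)
d_g = cohens_d(control_g, regular_g)

print(f"Mean difference: {diff_kg:.4f} kg or {diff_g:.1f} g")
print(f"Cohen's d: {d_kg:.4f} (kg data), {d_g:.4f} (g data)")
```

Because d has no units, it says nothing by itself about whether the difference matters in practice; that judgement still depends on the mean difference on the original scale.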

The use of standardised or scale free effect measures should not remove the onus on the researcher to make a meaningful applied interpretation of their findings. If standardised effect sizes are reported, this should be in addition to direct measures of effect. The use of standardised effect measures has given rise to classification schemes where an effect of a given magnitude is given a “label”. For example, Pearson’s correlations less than 0.10 might be described as indicating a “weak or small effect”. This takes the interpretation of the effect size out of context: what is a “small” effect in one context or discipline might be substantively meaningful or very useful in another. Again, a focus on the direct measures of effect supports better interpretation.

Population effect sizes

Sample effect sizes are used to estimate effect sizes in populations. In planning and designing a study, you may need to consider the likely population effect size in order to estimate the required sample size. Population effect sizes are defined according to design and outcome measurement, in analogous ways to sample effect sizes.
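As a rough sketch of how an assumed population effect size feeds into a sample size calculation, the function below applies one common normal-approximation formula for comparing two independent means with equal group sizes. The function name n_per_group and the default values for alpha and power are assumptions made for illustration; dedicated software based on the t distribution typically gives slightly larger answers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample comparison.

    d is the assumed population standardised mean difference (Cohen's d).
    Assumption: normal approximation, n = 2 * (z_alpha + z_power)^2 / d^2.
    """
    if d == 0:
        raise ValueError("The assumed effect size must be non-zero.")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# Smaller assumed effects require larger samples.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

The required sample size is inversely proportional to the square of the assumed effect size, so halving the effect roughly quadruples the number of participants needed.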

Wilkinson, L. and the Task Force on Statistical Inference, American Psychological Association, Science Directorate. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54(8), 594-604.