
The authors have declared that no competing interests exist.

Conceived and designed the experiments: AK AF TS. Performed the experiments: AF TS AK. Analyzed the data: AF TS AK. Contributed reagents/materials/analysis tools: AK AF TS. Contributed to the writing of the manuscript: AK AF TS.


We investigate whether effect size is independent of sample size in psychological research. We randomly sampled 1,000 psychological articles from all areas of psychological research and extracted the effect size and the sample size of each article's main finding.

We found a negative correlation of r = −.45 [95% CI: −.53; −.35] between effect size and sample size. In addition, we found an inordinately high number of p values just passing the boundary of significance.

The negative correlation between effect size and sample size, and the biased distribution of p values, indicate pervasive publication bias in psychological research.

Theories are evaluated against data. However, since whole populations cannot be examined, statistics based on samples are used to draw conclusions about populations. The effect size (ES) is one such statistic.

An ES is a measure of the strength of a phenomenon that estimates the magnitude of a relationship. ESs thus offer information beyond p values. Importantly, ES and sample size (SS) ought to be unrelated. Here we provide evidence that there is a considerable correlation between ES and SS across the entire discipline of psychology: small-sample studies often produce larger ESs than studies using large samples. Publication bias can be one reason for a correlation between ES and SS (as we will argue here), but there are other candidate reasons: power analysis, the use of multiple items, and adaptive sampling.

Power analysis. Statistical power is the probability of detecting an effect in a sample, given that the effect exists in reality. Power analysis consists of choosing the SS so as to ensure a high chance of detecting a phenomenon of the anticipated size. Thus, if we expect a large ES, power analysis will show that a small SS suffices to detect it; if we expect a small effect, power analysis calls for a large SS. This procedure of SS determination leads to a negative relationship between ES and SS.
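The logic of a priori power analysis can be sketched numerically. The following snippet is an illustration only (it is not the procedure used in any of the reviewed articles); it applies the standard normal-approximation formula for a two-sample t-test, n per group ≈ 2((z_{α/2} + z_{β})/d)², to show that a larger anticipated ES implies a smaller required SS:

```python
# Sketch: a priori power analysis via the normal approximation for a
# two-sample t-test. Larger anticipated effect sizes require smaller samples.
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sample t-test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value, two-tailed
    z_beta = norm.ppf(power)            # quantile for the desired power
    return 2 * ((z_alpha + z_beta) / d) ** 2

for d in (0.2, 0.5, 0.8):               # Cohen's small, medium, large
    print(f"d = {d}: n per group ≈ {n_per_group(d):.0f}")
```

Running this shows the required n shrinking steeply as the anticipated d grows, which is exactly the mechanism by which routine power analysis could induce a negative ES-SS correlation.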

Repeated trials, or multiple items. Overestimations of ES can result from aggregating data over repeated trials or multiple items: aggregation reduces error variance and thereby inflates standardized effect sizes.
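A small simulation (hypothetical data, not taken from the reviewed articles) illustrates the mechanism: averaging over more items shrinks error variance, which inflates the standardized effect size even though the raw mean difference is unchanged:

```python
# Simulation sketch: aggregating noisy items inflates standardized ES.
import numpy as np

rng = np.random.default_rng(0)
n, true_diff = 10_000, 0.5               # participants per group, raw effect

def cohens_d(items):
    """Cohen's d when each person's score is the mean of `items` noisy items."""
    noise = lambda: rng.normal(0, 1, (n, items)).mean(axis=1)
    treat, ctrl = true_diff + noise(), noise()
    pooled_sd = np.sqrt((treat.var(ddof=1) + ctrl.var(ddof=1)) / 2)
    return (treat.mean() - ctrl.mean()) / pooled_sd

for items in (1, 10):
    print(f"{items} item(s): d = {cohens_d(items):.2f}")
```

With a single item the standardized effect equals the raw difference; with ten items the error variance drops by a factor of ten and d is inflated accordingly.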

Adaptive sampling. Lecoutre et al. describe adaptive sampling schemes in which data collection continues until significance is reached. Because large observed effects reach significance with fewer participants, such optional stopping also induces a negative correlation between ES and SS.

Besides these three main arguments for a correlation between ES and SS, there are additional possibilities that can produce such a correlation. First, it might be that smaller experiments are conducted in more controlled (lab) settings, or that smaller experiments use more homogeneous samples (e.g., psychology undergraduates) than larger experiments (e.g., those run via Amazon's Mechanical Turk). Yet empirical evidence does not fully support this view, as there are strikingly similar results from MTurk and from lab experiments.

All these explanations may well be true to some extent. However, in most cases where funnel plot asymmetry is found, it is attributed to publication bias.

Publication bias leads to a negative ES-SS correlation, which has been reported in some research areas.
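The mechanism can be demonstrated in a few lines of simulation (assumed parameters, for illustration only): even when the true effect is identical in every study, filtering on significance leaves a published record in which ES and SS are negatively correlated:

```python
# Simulation sketch: a significance filter alone produces a negative
# ES-SS correlation among 'published' studies.
import numpy as np

rng = np.random.default_rng(1)
true_d, n_studies = 0.2, 20_000
ns = rng.integers(10, 200, n_studies)    # per-group sample sizes
se = np.sqrt(2 / ns)                     # approximate standard error of d
d_obs = rng.normal(true_d, se)           # observed effect sizes
published = d_obs / se > 1.96            # keep only 'significant' studies
r = np.corrcoef(d_obs[published], ns[published])[0, 1]
print(f"ES-SS correlation among published studies: r = {r:.2f}")
```

Small studies can only pass the significance filter with a large observed effect, while large studies pass with effects near the true value, hence the negative correlation.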

A random sample of 1,000 English-language, peer-reviewed articles published in 2007 was drawn from the PsycINFO database, using the search filters 'English', 'peer reviewed', 'journal article', and 'year 2007'. Three articles could not be acquired, and another was a duplicate. The remaining 996 articles were classified using a hierarchical classification tree.

Note: ^a Confirmatory factor analysis; path analysis; structural equation modeling; hierarchical linear modeling; survival analysis; growth curves; analyses testing the reliability and validity of scales.

After excluding 9 quantitative reviews and 51 articles that used a statistical method other than categories a-d, the analysis comprised the 341 articles summarized in the table below.

Statistical analysis category | k | r_(ES×N) | 95% CI
a) tests for categorical data | 41 | −.28 | [−.54; .03]
b) tests of mean difference and analyses of variance | 184 | −.52 | [−.62; −.40]
c) correlations/linear regressions | 90 | −.36 | [−.53; −.17]
d) rank order tests | 26 | −.53 | [−.76; −.18]
Total | 341 | −.45 | [−.53; −.36]

Since reliance on statistical power can lead to an ES-SS relationship, we contacted the corresponding authors of all 1,000 articles by email and asked them to participate in an online survey. In the survey we explained that we had randomly selected 1,000 articles from the PsycINFO database to extract SS and ES, and that one of their papers happened to be selected. We then asked the authors to estimate the direction and size of the correlation between ES and SS in these papers. Of the 1,000 email addresses, 146 produced error replies (false address, retirement, university change, etc.). Of the rest, 282 corresponding authors (33%) clicked the link and 214 (25%) answered the question about the correlation.

The distributions of the corrected ES and of SS are shown in the figures below.

Note: The bin from 450 to 500 includes all studies with a sample size greater than 450 but less than 1,000.

Note: k indicates the number of articles in each category. Vertical lines indicate the 95% confidence interval for the mean effect sizes.

Note: Dashed lines indicate the 95% confidence intervals for the regression line.

Finally, we tested whether explicit power considerations were the reason for this correlation. We measured the ES-SS correlation in studies that reported power, or at least mentioned it in the method or discussion section. Only 19 of the 341 studies (5%) computed power, and another 27 (8%) mentioned power. There was no difference in the correlation between studies mentioning power and studies failing to do so: all correlations were significant and negative (all ps < .05).

The estimated correlations covered the whole range of possible values between −1 and +1. Specifically, 21% of the respondents expected a negative correlation, 42% a positive one, and the remaining 37% a correlation of exactly zero (mean r = 0.08, SD = 0.02, median r = 0.00). Thus, on average, authors estimated that there would be no ES-SS correlation, rendering power considerations an unlikely source of an ES-SS relationship.

Note: The dashed line marks the critical z statistic (1.96) associated with the p = .05 significance level for two-tailed tests. The interval width (0.245, i.e., 12.5% of 1.96) corresponds to a 12.5% caliper.

Caliper | z-interval | over caliper | under caliper | p
10% caliper | [1.76; 2.16] | 15 | 39 | <.001
15% caliper | [1.67; 2.25] | 22 | 58 | <.001
20% caliper | [1.57; 2.35] | 24 | 77 | <.001
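The caliper comparison reduces to a simple binomial test: under no bias, a z value that falls inside the caliper window should be about equally likely to land on either side of 1.96. A minimal sketch using the 10% caliper counts reported in the table above:

```python
# Sketch of the caliper test as a binomial test on the two counts
# flanking z = 1.96 (counts taken from the 10% caliper row above).
from scipy.stats import binomtest

over, under = 15, 39                     # z values just over / just under 1.96
result = binomtest(over, n=over + under, p=0.5)
print(f"two-sided p = {result.pvalue:.4f}")
```

A p value this small means the two sides of the threshold are populated far too unevenly to be chance, which is the caliper test's diagnostic for bias.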

We investigated the relationship between ES and SS in a random sample of papers drawn from the full spectrum of psychological research and found a strong negative correlation of r = −.45.

Publication bias, in its most general definition, is the phenomenon that significant results have a better chance of being published, are published earlier, and are published in journals with higher impact factors.

Publication practice needs improvement; otherwise, misestimation of empirical effects will continue and will threaten the credibility of the entire field of psychology.

One proposal is to apply stringent standards of

Another proposal requires that studies are

Another proposal is the

Still another proposal is to install

We favor still another proposal: to report and evaluate the precision of estimates.

For SS there is no accepted prescription; any SS is acceptable. In consequence, the variance in SS is enormous: in our data set, the smallest sample was N = 1 and the largest was N = 60,599. SS is an index of precision: the bigger the sample, the more precisely population parameters can be estimated. We think that a valid way to improve publication practice is to focus on the precision of research. More specifically, ESs should be supplemented with confidence intervals. The reader can tell from the width of the interval how accurate, and therefore trustworthy, the estimate is.

Confidence intervals as a remedy are readily available because the relevant techniques and computer programs already exist (e.g., Cumming's ESCI).

Note that the information about precision may increase chances of publication primarily for non-significant studies. The size of the confidence interval of the ES indicates how precisely the study managed to measure the underlying effect. A precise measurement may be worth being published, irrespective of whether or not the effect is significant.

In a sample of 1,000 randomly selected papers that appeared in indexed psychological journals in 2007, ES was negatively correlated with SS. This indicates that it is primarily the significance of findings that determines whether or not a study is published. Our results hold for the entire discipline of psychology, bearing in mind the following main limitations:

First, we sampled from only a single year of psychological research. Things could be changing, for better or for worse, and sampling from different years could help determine how reliable our findings are and how dynamic the presumed processes are.

Second, our analysis includes less than half of all papers, namely those that were quantitative and for which we were able to extract data on SS and ES.

Third, we chose to focus on the main finding of each article to avoid violating independence. Since authors tend to begin papers with their best data (or to develop their argument along the strongest finding), our analysis is not representative of all results within an article. Note, however, that it is the main finding that receives the most attention.

Fourth, we generalized over all areas of psychological research. In some areas, independence between ES and SS might hold. Our analysis is coarse-grained, and the general picture may not apply to all areas.

These limitations should be kept in mind when evaluating how challenging our findings are for psychological research. An extreme interpretation of our findings is that nearly every result obtained in a small-sample study overestimates the true effect. This pessimistic view rests on the fact that, due to the high ES-SS correlation in conjunction with publication bias, mainly overestimated findings tend to make it into publication. However, we opt for a more tempered view: first, probably not all of the small studies are affected. In addition, there is no problem with large studies: these measure the underlying effect precisely and tend to find significant effects by virtue of high power. Most importantly,
