I have read the journal’s policy and have the following conflicts. Dr. Allison has, anticipates, or has had financial interests with the Frontiers Foundation; Vivus, Inc; Kraft Foods; University of Wisconsin; University of Arizona; Paul, Weiss, Wharton & Garrison LLP; and Sage Publications. This study was funded by Coca-Cola. There are no declarations of employment, consultancy, patents, products in development or marketed products to declare. This does not alter our adherence to all the PLoS ONE policies on sharing data and materials, as detailed online in the guide for authors.
Conceived and designed the experiments: DBA. Performed the experiments: DBA GLG. Analyzed the data: DBA GLG. Wrote the paper: DBA GLG.
Much has been written regarding p-values below certain thresholds (most notably 0.05) denoting statistical significance and the tendency of such p-values to be more readily publishable in peer-reviewed journals. Intuition suggests that there may be a tendency to manipulate statistical analyses to push a “near significant p-value” to a level that is considered significant. This article presents a method for detecting the presence of such manipulation (herein called “fiddling”) in a distribution of p-values from independent studies. Simulations are used to illustrate the properties of the method. The results suggest that the method has low type I error and that power approaches acceptable levels as the number of p-values being studied approaches 1000.
The interpretation of “statistical significance” as a p-value equal to or less than 0.05 has been attributed to the above well-known quotation from R.A. Fisher and other statements by him. In 1982, however, Cowles and Davis
The tendency for researchers to favor submitting significant findings for publication over nonsignificant ones, and the tendency for journals to favor publishing them as well, underlies the well-known issue of publication bias in meta-analysis (see Rothstein et al.
In 2007 Ridley et al.
Herein we consider an issue somewhat analogous to the one considered by Ridley et al.
Pseudocode                                      Comments
Compute n                                       n = number of p-values to simulate
Compute μ = .8
Compute σ = .4
For i = 1 to n
    Compute B = Bernoulli(π)                    π = probability that the null hypothesis is true
    Compute Z = Normal(0,1)
    Compute λ = Normal(μ,σ)
    Compute T = B*Z + (1−B)*(Z+λ)
    Compute p_i = 2*(1 − CDF_Normal(0,1,T))     This is for two-tailed testing.
Next i
The code above was implemented in R.
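As a rough illustration, the pseudocode above can be sketched in Python (the authors implemented it in R; the mixture proportion `pi0`, the sample size, and the seed below are our illustrative assumptions, and |T| is used in the CDF so that the two-tailed p-values stay in [0, 1]):

```python
import math
import random

def norm_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def simulate_pvalues(n, pi0=0.5, mu=0.8, sigma=0.4, seed=1):
    """Simulate n two-tailed p-values from a mixture of tests with a true
    null (B = 1) and tests with effect lambda (B = 0), per the pseudocode.
    pi0, the proportion of true nulls, is an assumed illustrative value."""
    rng = random.Random(seed)
    pvals = []
    for _ in range(n):
        B = 1 if rng.random() < pi0 else 0
        Z = rng.gauss(0.0, 1.0)
        lam = rng.gauss(mu, sigma)
        T = B * Z + (1 - B) * (Z + lam)
        # two-tailed p-value; |T| keeps the result in [0, 1]
        pvals.append(2.0 * (1.0 - norm_cdf(abs(T))))
    return pvals

pvals = simulate_pvalues(1000)
```

Under the null case of no fiddling, the resulting distribution mixes uniform p-values (true nulls) with p-values concentrated near zero (non-null tests).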
Our proposed method involves examining the distributions of p-values reported in collections of literature to detect the inappropriate manipulation of statistical analyses to produce p-values that appear significant when the initial analyses produced results that were nearly, but not quite, statistically significant. The focus of the proposed tests is to determine whether there is a noticeable pattern in the number of p-values between 0.05 and 0.075 versus those between 0.075 and 0.1 (any intervals of equal length could be used). P-values are assumed to have been collected from
In our method, simulations are used to generate distributions of p-values in which fiddling both did and did not occur. First, p-values are simulated under a null hypothesis of no fiddling. Then, p-values are simulated under an alternative hypothesis that fiddling has occurred. The simulation procedure for generating p-values is described in the next section. We then assess the quality of the procedure for simulating p-values; this involves evaluating the sampling variability in the null distribution of p-values and determining whether the alternative distribution (i.e., a distribution for which fiddling has occurred) is detectably different from the null case. In a fourth section we propose a simple test, usable by an investigator studying a body of literature, that takes a single vector of p-values and evaluates whether there is evidence that fiddling has occurred among the original studies contributing p-values to that vector. Next, we describe a test using a mixture model approach that uses known theoretical properties of p-values as random variables. A final section discusses the proposed method within the broader context of publication bias, the effect of fiddling on effect sizes, the number of studies required to carry out the proposed method, and the issues to consider when designing a study for the purpose of investigating fiddling.
Two terms that are used throughout are
Pseudocode                                              Comments
Compute n                                               n = number of p-values to simulate
Compute μ = .8
Compute σ = .4
Compute ρ = .95
Compute α = .05
Compute ι = .025
Compute R                                               R = number of additional tests (see text)
For i = 1 to n
    Compute B = Bernoulli(π)                            π = probability that the null hypothesis is true
    Compute Z = Normal(0,1)
    Compute λ = Normal(μ,σ)
    For r = 1 to R
        Compute Z_r = ρ^{1/2}*Z + (1−ρ)^{1/2}*Normal(0,1)   This formula presupposes that ρ is positive.
        Compute T_r = B*Z_r + (1−B)*(Z_r+λ)
    Next r
    Compute Tmax = max(T_1, …, T_R)
    Compute p_i = 2*(1 − CDF_Normal(0,1,T_1))           This is for two-tailed testing.
    Compute p_min = 2*(1 − CDF_Normal(0,1,Tmax))
    If (p_i > α AND p_i ≤ α+ι) then p_i = p_min
Next i
In addition to the items specified above, we now need to specify the number of additional tests required when one obtains a p-value in the interval (α,α+ι], where ι is some small positive constant (e.g., 0.025) for a level 1 test. Let us denote this number
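A Python sketch of this fiddling procedure follows, under the same caveats as before (`pi0` and `n_retests`, the number of additional correlated tests, are assumed illustrative values not fixed in the text; |T| is used so that two-tailed p-values stay in [0, 1]):

```python
import math
import random

def norm_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def simulate_fiddled_pvalues(n, pi0=0.5, mu=0.8, sigma=0.4,
                             rho=0.95, alpha=0.05, iota=0.025,
                             n_retests=3, seed=1):
    """Sketch of the fiddling pseudocode: when the initial p-value lands in
    (alpha, alpha + iota], re-test with correlated statistics and keep the
    smallest p-value over the re-tests."""
    rng = random.Random(seed)
    pvals = []
    for _ in range(n):
        B = 1 if rng.random() < pi0 else 0   # B = 1 -> null hypothesis true
        Z = rng.gauss(0.0, 1.0)
        lam = rng.gauss(mu, sigma)
        Ts = []
        for _r in range(n_retests):
            # correlated replicate statistics (requires rho > 0)
            Zr = math.sqrt(rho) * Z + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            Ts.append(B * Zr + (1 - B) * (Zr + lam))
        p_initial = 2.0 * (1.0 - norm_cdf(abs(Ts[0])))
        # smallest two-tailed p-value over the correlated re-tests
        p_min = 2.0 * (1.0 - norm_cdf(max(abs(t) for t in Ts)))
        if alpha < p_initial <= alpha + iota:
            p_initial = p_min            # the "fiddled" replacement
        pvals.append(p_initial)
    return pvals

fiddled = simulate_fiddled_pvalues(1000)
```

The effect is a depletion of p-values just above α, the aberration the proposed tests are designed to detect.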
This section describes a scenario in which sampling variability is first assessed by simulating two level 2 null distributions and evaluating the performance of a test by its type I error. When the test is conducted on two null distributions, the test for fiddling should have a type I error close to or below the nominal level (0.05 is used here). Next, simulated p-values are obtained from a distribution in which fiddling has occurred, and this distribution is tested against one of the null distributions to determine the power of the test. The tests used are simple two-way tests of contingency tables, as described in the following testing scenario:
Two level 2 null data sets were simulated with
A two-way contingency table was generated as shown in
A chi-square test was conducted and a p-value obtained from a test of whether the proportion of p-values between 0.05 and 0.075 was equal in both null distributions. Note that only p-values between 0.05 and 0.1 were used. The p-value from this level 2 test is equivalent to a test in which the row and column categories in
A level 2 alternative hypothesis data set (fiddling) was simulated, and a two-way contingency table was generated as shown in
The same tests as in step 3 were computed for
The above steps were repeated 5000 times.
Power and type I errors were reported for each test.
The above procedures were repeated for
          # p-values ∈ (0.05, 0.075]    # p-values ∈ (0.075, 0.1]
Null 1    N11                           N12
Null 2    N21                           N22
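For illustration, the chi-square statistic for a 2×2 table of this form can be computed directly (a sketch; the function name is ours, and the authors' analyses were done in R):

```python
def chi_square_2x2(n11, n12, n21, n22):
    """Pearson chi-square statistic (1 df) for a 2x2 contingency table,
    e.g. counts of p-values in (0.05, 0.075] vs (0.075, 0.1] for two
    collections of studies."""
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    total = row1 + row2
    stat = 0.0
    for obs, r, c in ((n11, row1, col1), (n12, row1, col2),
                      (n21, row2, col1), (n22, row2, col2)):
        expected = r * c / total          # expected count under independence
        stat += (obs - expected) ** 2 / expected
    return stat

# Equality of the two splits is rejected at the 0.05 level when the
# statistic exceeds the chi-square(1 df) critical value, 3.841.
stat = chi_square_2x2(10, 20, 20, 10)
```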
The results of the testing scenario are shown in
Data were simulated as described above. The testing described in the prior section used two level 2 null distributions for type I error, and a level 2 alternative distribution together with a null distribution for power evaluation. The tests described here differ in that a single distribution of p-values is used. The test compares the proportion of p-values in the interval 0.05–0.075 with the proportion in the interval 0.075–0.1. This test for fiddling is appropriate for any situation in which a sufficient number of p-values are obtained from
Define the following:
The hypothesis test is
A proportions test using normal approximation with continuity correction was done at a type I error threshold of 0.05 (i.e., the size of the rejection region for this test is 0.05). Note that the two statistics are technically not independent because they both depend on the same
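As a sketch, the same comparison can also be carried out with an exact binomial test instead of the normal approximation described above (this exact variant is ours, not the authors'): conditional on a p-value landing in (0.05, 0.1], a non-increasing p-value density implies it falls in (0.05, 0.075] with probability at least 0.5, and fiddling depletes that lower interval.

```python
from math import comb

def fiddling_binomial_test(n_low, n_high):
    """Lower-tail exact binomial test sketch. n_low = # p-values in
    (0.05, 0.075], n_high = # in (0.075, 0.1]. With no fiddling, a p-value
    that lands in (0.05, 0.1] falls in the lower interval with probability
    at least 0.5, so a small tail probability P(X <= n_low) under
    X ~ Binomial(n_low + n_high, 0.5) is evidence of fiddling."""
    n = n_low + n_high
    return sum(comb(n, k) for k in range(n_low + 1)) / 2.0 ** n

p = fiddling_binomial_test(2, 8)
```

Because the null density is non-increasing rather than flat, using 0.5 as the reference probability makes the test conservative.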
               # p-values ∈ (0.05, 0.075]    # p-values ∈ (0.075, 0.1]
Null 1         N11                           N12
Alternative    N21                           N22

# p-values    Type I Error (Chi-Sq Test)    Type I Error (Fisher’s Test)    Power (Chi-Sq Test)    Power (Fisher’s Test)
400           0.0250                        0.0326                          0.1896                 0.2308
600           0.0300                        0.0384                          0.3202                 0.3632
800           0.0370                        0.0456                          0.4494                 0.4882
1000          0.0356                        0.0418                          0.5354                 0.5780
2000          0.0366                        0.0402                          0.8838                 0.8952
Two tests were conducted to evaluate type I error and power for various sample sizes.
Define the following conditional probability,
The estimator for this probability is
Type I error and power for the two tests (Test 1 and Test 2) are shown in

# p-values    Type I Error (Test 1)    Type I Error (Test 2)    Power (Test 1)    Power (Test 2)
400           0.0216                   0.0216                   0.4350            0.4350
600           0.0224                   0.0220                   0.6132            0.6132
800           0.0208                   0.0198                   0.7424            0.7416
1000          0.0200                   0.0198                   0.8222            0.8218
2000          0.0156                   0.0152                   0.9846            0.9842
Test 1 considers the total of
In this approach, p-values resulting from
The primary hypothesis considered here is a level 2 null hypothesis that no fiddling has occurred versus an alternative that fiddling has occurred. If each
Shown in the top panel of
Boxplots compare the distributions (from 1000 simulations) of comparison statistics from mixture models fitted to a distribution of p-values for which no fiddling has occurred (i.e., a level 2 null distribution) and to a distribution of p-values for which fiddling did occur. OBJ is the objective function calculated at the maximum likelihood estimates of the model parameters, and DIFF is the difference in the fitted model evaluated at 0.05 versus at 0.1.
Our proposed mixture model approach is as follows:
Obtain a collection of p-values from
Fit the above mixture model to this distribution of p-values using maximum likelihood estimation. We used the R function “optim.” Numerical optimization procedures do not always converge, and this is somewhat difficult to control in simulation settings in which thousands of models are fit to simulated data; mixture models can be particularly challenging. This was studied in detail in Xiang et al.
From the fitted model, obtain the expected numbers of p-values in the intervals (0.05, 0.075] and (0.075, 0.1]. Call these two numbers E1 and E2, respectively. They are computed by calculating (using an approximation) the probability under the fitted density in each of the two intervals, multiplied by
Obtain the two numbers,
Construct the two-way contingency table as shown in
Test for fiddling by using a chi-square test or Fisher’s exact test, and report the level 2 p-value from the two tests.
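Step 3 of the procedure can be sketched as follows, assuming the mixture of a uniform density and a Beta(a, b) density with mixing weight w (the parameter names and the trapezoid approximation are ours, not the authors'):

```python
import math

def beta_pdf(p, a, b):
    # Beta(a, b) density, computed via log-gamma for numerical stability
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1.0) * math.log(p) + (b - 1.0) * math.log(1.0 - p) - log_norm)

def mixture_pdf(p, w, a, b):
    """Uniform + beta mixture density: f(p) = w * 1 + (1 - w) * Beta(p; a, b)."""
    return w + (1.0 - w) * beta_pdf(p, a, b)

def expected_counts(w, a, b, n, lo=0.05, mid=0.075, hi=0.1, steps=1000):
    """E1, E2: expected numbers of p-values in (lo, mid] and (mid, hi] under
    the fitted mixture, via trapezoid integration (an approximation)."""
    def integral(l, h):
        xs = [l + (h - l) * i / steps for i in range(steps + 1)]
        ys = [mixture_pdf(x, w, a, b) for x in xs]
        return (h - l) * (sum(ys) - 0.5 * (ys[0] + ys[-1])) / steps
    return n * integral(lo, mid), n * integral(mid, hi)

# With w = 1 the mixture is pure uniform, so each 0.025-wide interval
# is expected to hold 2.5% of the n p-values.
E1, E2 = expected_counts(w=1.0, a=2.0, b=2.0, n=1000)
```

A decreasing fitted density (as expected when real effects are present) yields E1 > E2, which is why the contingency-table comparison against the actual counts can isolate the fiddling aberration.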
            # p-values ∈ (0.05, 0.075]    # p-values ∈ (0.075, 0.1]
Expected    E1                            E2
Actual
The steps of the simulation procedure are listed below. Two different scenarios were considered to determine whether the fitted mixture model was affected by fiddling. If it is not affected by fiddling, then it serves as a useful model for computing expected numbers of pvalues under the level 2 null case of no fiddling.
Data for
The mixture model is fit to the P.null p-values.
Obtain the expected number of pvalues as described in Step 3 above.
Obtain the actual numbers of p-values,
Construct two contingency tables as described in Step 5 above and shown in
Conduct the tests in Step 6 above.
Repeat 1000 times and compute type I error and power at a level 2 rejection region of size
Repeat for number of studies (i.e., pvalues) equal to
Redo the above Steps A–H, but modify Step B by fitting the mixture model to the collection of pvalues from P.alt.
The steps are repeated 1000 times for the various sample sizes given in step H of section 5.3. For the final step I of section 5.3, the dashed line above for step B is redirected to instead connect the P.alt p-values to the fitting of the mixture model.
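The repeat-and-count logic of Steps A–H can be sketched generically: given a simulator that produces a collection of p-values and a level 2 test that returns a p-value, the Monte Carlo rejection rate estimates type I error under the null scenario and power under the fiddling scenario (the function names here are ours):

```python
import random

def estimate_rejection_rate(simulate, level2_test, n_reps=1000, level=0.05, seed=1):
    """Monte Carlo estimate of a level 2 rejection rate: repeatedly simulate
    a collection of p-values and count how often the test rejects at the
    given level."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        pvals = simulate(rng)          # one simulated collection of p-values
        if level2_test(pvals) <= level:
            rejections += 1
    return rejections / n_reps

# Example: a degenerate test that always returns 0.01 rejects every time.
rate = estimate_rejection_rate(lambda rng: [], lambda pv: 0.01, n_reps=100)
```

Plugging in a null simulator gives a type I error estimate; plugging in a fiddling simulator gives a power estimate.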

# p-values    Type I Error (Chi-Sq Test)    Type I Error (Fisher’s Test)    Power (Chi-Sq Test)    Power (Fisher’s Test)
400           0.021                         0.032                           0.339                  0.417
600           0.017                         0.034                           0.564                  0.608
800           0.019                         0.027                           0.711                  0.756
1000          0.022                         0.029                           0.791                  0.817
2000          0.021                         0.028                           0.976                  0.982
Type I error and power are reported for two different tests for contingency table data.

# p-values    Type I Error (Chi-Sq Test)    Type I Error (Fisher’s Test)    Power (Chi-Sq Test)    Power (Fisher’s Test)
400           0.017                         0.024                           0.343                  0.416
600           0.019                         0.030                           0.578                  0.635
800           0.021                         0.031                           0.716                  0.758
1000          0.024                         0.031                           0.817                  0.842
2000          0.021                         0.028                           0.980                  0.982
Type I error and power are reported for various sample sizes for two different tests using contingency table data.
Several remarks can be made regarding the simulation results. First, the type I error rate was below the nominal level of 0.05, which suggests that the fitted mixture model accurately predicted the number of p-values in the two intervals (0.05, 0.075] and (0.075, 0.1]. Moreover, for the purposes of type I error, it makes no difference whether the model is fit to p-values simulated with fiddling (P.alt) or without fiddling (P.null). Similarly, whether the mixture model is fit to P.null or to P.alt, the power of the test to detect fiddling when it has actually occurred is nearly the same. This suggests that, for a user who in a practical application will not know whether fiddling has occurred, the fitted mixture model is robust to this lack of knowledge. Power rises to values generally considered acceptable for level 2 testing when the number of p-values exceeds 800. Power is somewhat higher for Fisher’s exact test than for the chi-square test, but the type I error rate is also slightly larger.
In conclusion, we have shown that tests of fiddling can be constructed and successfully applied to collections of p-values, which one can obtain from published literature. Several tests have been illustrated, all of which have acceptable level 2 type I error rates and good power under realistic scenarios when the number of p-values available for analysis approaches 1000. These tests can be used going forward to study the extent to which fiddling seems to be occurring in the research literature and the factors associated with greater or lesser fiddling.
To summarize, we have shown that our proposed tests for fiddling can be successfully applied to vectors of p-values, which can be extracted from published papers. Several different tests were evaluated; all had type I error rates at or below the nominal level and good power in realistic circumstances, provided the number of p-values in the analyzed vector was near 1000. Our tests will be useful in the next stage of our research, wherein we will investigate the extent to which fiddling seems to be occurring in the research literature of nutrition and obesity and which factors are associated with the extent of fiddling. There are several points of discussion related to this particular endeavor or ones similar to it.
On the surface, the topic of publication bias mentioned in the introduction may seem tightly related to the topic of fiddling discussed here. We generally think of publication bias as the tendency for statistically significant results to appear in the literature more frequently than nonsignificant results. The effect of such bias on a collected set of p-values would be to steepen the descent of a curve fitted to the distribution of p-values as they move from 0 toward larger values; that is, publication bias favoring significant findings only steepens the monotonically decreasing expected sampling distribution of observed p-values. The proposed mixture model accommodates this: the mixture of a uniform and a two-parameter beta distribution is quite flexible in capturing varying shapes on the interval 0 to 1. Fiddling, as we describe it, produces a more distinct aberration in the distribution of p-values near 0.05 (or whatever alpha level is chosen). So even in the presence of publication bias, the mixture model should work in the same way reported here. If this particular aberration is detected by our method, intuition suggests the detection is specific to fiddling; however, one could not be certain that the detection was not the result of some peculiar form of publication bias.
The relationship of fiddling to a ‘delayed analysis’ (i.e., waiting for a few more events to come in) is also worth some discussion. Consider an investigator who had a fixed number of cases preplanned for a study, repeatedly conducted significance tests as the data accumulated, terminated the study either when the result was statistically significant or when the final sample size had been reached (whichever came first), and made no corrections for such repeated testing. This procedure would increase the type I error rate under the null hypothesis and increase power under the alternative hypothesis, but it would not meet our definition of fiddling, which entails deciding to conduct additional testing only when p-values are just above the threshold for significance. Moreover, the repeated interim testing described above would not produce the characteristic dip in the distribution of p-values we have described.
In contrast, if an investigator collects an initial set of observations and then decides to collect additional cases if and only if the initial result is just above the threshold of significance, then this is a form of fiddling as we have defined it, and it would produce a ‘depletion’ of published p-values just above the significance threshold. As discussed above in the comparison of publication bias with fiddling, such a ‘fiddled’ result may then enjoy a greater chance of publication as a result of publication bias.
If an effect is present but small, then collecting additional cases may lead to a statistically significant result (i.e., a p-value < 0.05) that is not of practical significance. Interval estimates of effect sizes can be as valuable as (in some cases more valuable than) the p-values to which they correspond. As such, one might wonder how fiddling could be detected when effect sizes are reported in the literature in lieu of, or in addition to, p-values. Fiddling should also produce a pattern in estimates of effect size if a sufficient number of interval estimates could be gathered. If, say, the estimates are of contrasts and contrasts not equal to zero are ‘significant,’ then fiddling would seem to produce an unusually large number of interval estimates that narrowly miss covering zero. How to detect this aberration in interval estimates is not yet clear to us. Still, if fiddling is detected in a collection of p-values, a useful follow-up investigation would be to consider the estimated effect sizes (if available) that correspond to p-values just below 0.05 and ask what proportion of them correspond to effects considered practically significant for the particular application.
Another point of discussion is the seemingly large number of p-values needed for our proposed method to work effectively. Our method to detect fiddling requires a sufficient number of p-values in a narrow interval near some threshold (we used 0.05, the most common threshold for ‘statistical significance’). Given that p-values can fall anywhere in the interval 0 to 1, the number of studies must be large enough to produce a sufficient number of p-values in the subinterval 0.05 to 0.1 so that a test for fiddling can be carried out with reasonable power. Randomly selected p-values from, say, 20 studies may yield as few as 1 or 2 p-values in the interval 0.05 to 0.1, and the test for fiddling as we propose it would have very little power. However, one could still conduct the binomial test for fiddling described in section 4 (test 2). If one could prescreen the literature to randomly select p-values near 0.05, then that binomial test could be carried out with a smaller sample of p-values while retaining an adequate level of power. Having adequate power to detect fiddling in a narrow subinterval near 0.05 is thus one reason why a sufficiently large number of p-values is needed. When using the mixture model approach from section 5, there is a second reason for requiring a large number of p-values: convergence of the optimization algorithm that produces maximum likelihood estimates of the mixture model parameters. In our experience, for a particular analysis, 100–200 p-values in the interval 0 to 1 are sufficient to obtain convergence. For a single analysis the user has the luxury of tweaking the algorithm’s starting values to improve the chance of convergence; this is not possible in simulations. The minimum we used in the simulation study to assure convergence in the 1000 simulations was 400 p-values.
We acknowledge that collecting such a large number of p-values entails considerable work, and we are experiencing that directly as we begin the next stage of our work described in the first paragraph of this section. As also noted earlier, Ridley et al.
Given the labor-intensive effort involved in conducting an investigation of fiddling, a clear plan for the investigation up front becomes vitally important. Some key questions to consider in doing so are: 1) How should one best extract p-values from the literature? 2) Which areas of study and/or which publications and years should be considered? (For example, in some areas of genomics, alpha levels are typically set between 10^{−4} and 10^{−8}, rather than 0.05.) 3) How should rounded p-values be used, if used at all? Ridley et al.
In addition to the above questions dealing with the technical logistics of a general investigation of fiddling, more specific scientific questions of interest can also shape the plan and conduct of the investigation. Such questions include: 1) Are ‘fiddling’ rates higher in high-profile journals, which are harder to get published in, than in lower-profile journals? 2) Has the propensity to ‘fiddle’ changed over the years, and from what timeframe should suitable papers be selected? 3) Does ‘fiddling’ vary by subject area? In answering question 1, an analysis stratified by publication may need to be considered, or separate analyses of fiddling could be conducted for each journal. Since the proposed tests for fiddling use data in two-by-two contingency tables, each test corresponds to a test of a difference in two proportions. As such, an interval estimate of this difference can be obtained, and the effect sizes of fiddling in one journal (or time period) can be compared with those in another journal (or time period). These and other interesting applied questions can be addressed going forward using the method we have offered herein. By doing so, we hope that areas in which research practices can be improved can be identified and constructive feedback provided to the field.
We thank Olivia Affuso, Nir Menachemi, and Bisakha Sen for reviewing an earlier draft of this manuscript.