The authors have declared that no competing interests exist.

Analyzed the data: KE. Wrote the paper: KE OH.

In this paper we review, and elaborate on, the literature on a regression artifact related to Lord’s paradox in a continuous setting. Specifically, the question is whether a continuous property of individuals predicts improvement from training between a pretest and a posttest. If the pretest score is included as a covariate, regression to the mean will lead to biased results if two critical conditions are satisfied: (1) the property is correlated with pretest scores and (2) pretest scores include random errors. We discuss how these conditions apply to the analysis in a published experimental study, the authors of which concluded that linearity of children’s estimations of numerical magnitudes predicts arithmetic learning from a training program. However, the two critical conditions were clearly met in that study. In a reanalysis we find that the bias in the method can fully account for the effect found in the original study. In other words, data are consistent with the null hypothesis that numerical magnitude estimations are unrelated to arithmetic learning.

Suppose that a researcher wants to study individual differences in how children respond to a training program. The predictor is a continuous measure of some individual property P. The dependent variable is improvement in ability, measured by a pre-training test and a post-training test. However, test scores are not perfect measures of ability. The same improvement in test scores from different pretest scores may therefore reflect different changes in ability. To control for this possibility the researcher may include the pretest score as a covariate in a regression analysis, thereby investigating whether property P predicts test score improvement.

In 1999, Campbell and Kenny devoted an entire book, A Primer on Regression Artifacts, to the many ways in which regression to the mean can distort analyses of change.

Let us formalize the abovementioned setup: A researcher wants to study individual differences in how children respond to a training program. The predictor is a continuous measure of some individual property P. The dependent variable is improvement in ability, (imperfectly) measured by a pre-training test score S_{pre} and a post-training test score S_{post}. Including the pretest score as a covariate, the researcher fits the linear regression model

S_{post} − S_{pre} = b_{0} + b_{P}P + b_{pre}S_{pre} + ε. (1)

Here, the estimated coefficient b_{P} is taken to measure the influence of property P on improvement, controlling for the pretest score.

As discussed by other authors, this way of controlling for baseline differences is common in studies of individual differences in change.

Although inclusion of the pretest score as a covariate may seem both innocuous and sensible, it will lead to biased results when two critical conditions hold. The first condition is that P is correlated with pretest ability. The second condition is that test scores are not fully reliable measures of ability but subject to random within-individual variation, commonly represented by a “true score” model in which the test score is the sum of the child’s latent ability (true score) and a random error term of positive variance:

S = a + e, (2)

where S denotes the test score, a the latent ability, and e the random error.

Improvement of test results will then reflect not only actual arithmetic learning (i.e., change in latent ability) but also the difference between the random errors on the two tests.

The only novelty of our setup is that property P is continuous. A classic setup is recovered in the special case of P taking only values 0 or 1, indicating membership in one of two groups. Our first critical condition then reduces to the presence of a group difference in pretest scores. The risk of a regression artifact in that case was pointed out more than forty years ago by Campbell and Erlebacher.

Children who are higher on property P will tend to have higher pretest scores than children who are lower on property P. By selecting to compare children with equal pretest scores, the researcher will inadvertently make a biased selection of the random errors. Specifically, consider a higher-P child and a lower-P child who happened to have the same pretest score. Equal test scores will arise by chance when a child with higher ability has less luck on the test than a child with lower ability. Because of regression to the mean, the child with worse luck on the first test–i.e., the one with higher ability–will tend to do better than the other child on the second test. Because children with higher P tend to have higher ability, equality in test scores will most often reflect a situation where the higher-P child has had worse luck and will therefore tend to be luckier on the next test. The observation that regression to the mean has this consequence of divergence between members of different groups is sometimes referred to as Kelley’s paradox
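This selection bias is easy to demonstrate numerically. The sketch below is a minimal simulation under hypothetical parameter values (ability equals P plus noise, and each test adds its own independent error): it selects children whose pretest scores happened to come out nearly equal and checks whether P still predicts the posttest score.

```python
import random

# Minimal sketch, assuming hypothetical parameters: ability = P + noise,
# and each test score = ability + an independent error term.
random.seed(0)
n = 200_000
kids = []
for _ in range(n):
    p = random.gauss(0, 1)                # continuous property P
    ability = p + random.gauss(0, 1)      # ability correlated with P
    pre = ability + random.gauss(0, 1)    # pretest = ability + error
    post = ability + random.gauss(0, 1)   # posttest: same ability, new error
    kids.append((p, pre, post))

# Select children whose pretest scores happened to be (nearly) equal.
band = [(p, post) for p, pre, post in kids if abs(pre) < 0.1]

def corr(xs, ys):
    """Pearson correlation coefficient."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

ps, posts = zip(*band)
print(len(band), round(corr(ps, posts), 2))
```

Even though no child's ability changes between tests, P correlates clearly positively with posttest scores among children matched on pretest score, which is exactly the biased selection of random errors described above.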

Note how the bias was caused by the combination of initial differences and regression to the mean. The confounding effect on the results of the regression analysis is called a regression artifact.

The paper consists of three studies. The first study is a mathematical analysis of the emergence of the regression artifact. We derive an unbiased estimator of the regression artifact under certain simplifying assumptions. The second study is a computer simulation to illustrate the regression artifact, leading up to a discussion of Lord’s paradox in a continuous setting. The third study applies our theoretical framework to a reanalysis of the main finding of the aforementioned study in numerical cognition.

To demonstrate how a regression artifact may arise in comparisons of groups, Campbell and Erlebacher considered the evaluation of compensatory education programs, in which the treated group and the comparison group differ in ability already before treatment.

Here we consider the case where children vary on a continuous property P rather than belong to one of two groups. We will demonstrate how a regression artifact arises when the test score difference is regressed on property P if the pretest score is included as a covariate. Adapting the model of Campbell and Erlebacher, we assume the latent ability a_{i} of child i to be given by

a_{i} = c_{0} + c_{P}P_{i} + d_{i}. (3)

The first two terms specify a linear relationship between ability and property P, while the last term denotes unexplained between-individual variation in ability. We assume these random errors to be independently drawn from a normal distribution with mean 0 and standard deviation σ_{d}.

Following equations (2) and (3), the pretest and posttest scores of child i are

S_{pre,i} = c_{0} + c_{P}P_{i} + d_{i} + e_{pre,i} and S_{post,i} = c_{0} + c_{P}P_{i} + d_{i} + e_{post,i}, (4)

where the within-individual errors e_{pre,i} and e_{post,i} are independently drawn from a normal distribution with mean 0 and standard deviation σ_{e}. Note that this model represents the null hypothesis: ability does not change between tests.

It is then possible to mathematically derive the expected size of the regression artifact. Under the model assumptions (2–4), the linear regression model (1) translates into

e_{post} − e_{pre} = b_{0} + b_{P}P + b_{pre}(c_{0} + c_{P}P + d + e_{pre}) + ε.

To analyze the results obtained from least-square estimation of this regression model we use the standard approach of letting the sample size tend to infinity, such that stochastic effects can be ignored. We can then identify coefficients between the left-hand and right-hand expressions to obtain

b_{pre} = −σ_{e}^{2}/(σ_{d}^{2} + σ_{e}^{2}) and b_{P} = −b_{pre}c_{P} = c_{P}σ_{e}^{2}/(σ_{d}^{2} + σ_{e}^{2}),

where σ_{d} and σ_{e} denote the standard deviations of the between-individual and within-individual error terms, respectively.
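These limiting values can be checked numerically. The sketch below assumes hypothetical parameter values c_{P} = 1 and σ_{d} = σ_{e} = 1, for which the large-sample coefficients are b_{pre} = −0.5 and b_{P} = 0.5; it simulates children whose abilities do not change between tests and fits the regression by ordinary least squares.

```python
import random

# Numerical check under hypothetical parameter values c_P = 1 and
# sigma_d = sigma_e = 1 (limits: b_pre = -0.5, b_P = 0.5).
# Abilities do not change between tests.
random.seed(1)
n = 100_000
c0, cP, sd_d, sd_e = 0.0, 1.0, 1.0, 1.0

P = [random.gauss(0, 1) for _ in range(n)]
ability = [c0 + cP * p + random.gauss(0, sd_d) for p in P]
pre = [a + random.gauss(0, sd_e) for a in ability]
post = [a + random.gauss(0, sd_e) for a in ability]
change = [po - pr for po, pr in zip(post, pre)]

def ols2(y, x1, x2):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the normal equations."""
    m = len(y)
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    ones = [1.0] * m
    A = [[m, dot(ones, x1), dot(ones, x2)],
         [dot(ones, x1), dot(x1, x1), dot(x1, x2)],
         [dot(ones, x2), dot(x1, x2), dot(x2, x2)]]
    rhs = [dot(ones, y), dot(x1, y), dot(x2, y)]
    def det3(M):
        return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
                - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
                + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))
    D = det3(A)
    coefs = []
    for j in range(3):               # Cramer's rule, one column at a time
        Aj = [row[:] for row in A]
        for i in range(3):
            Aj[i][j] = rhs[i]
        coefs.append(det3(Aj) / D)
    return coefs                     # [b0, b1, b2]

b0, bP, bpre = ols2(change, P, pre)
print(round(bP, 2), round(bpre, 2))
```

With 100,000 simulated children the estimates land close to the limiting values even though no child's ability changed, illustrating that the nonzero coefficient of P is pure artifact.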

For infinite samples, the estimated coefficient of P thus deviates from zero whenever c_{P} ≠ 0 and σ_{e} > 0, that is, precisely when the two critical conditions hold. This deviation is the regression artifact.

Following Campbell and Erlebacher, we use computer simulations to illustrate how the regression artifact emerges.

In order to simulate data we need to choose values for the model parameters: the intercept and slope of the linear relationship between ability and property P, and the standard deviations of the between-individual and within-individual error terms.

Our interest lies in the estimated coefficient of property P in the regression, which in an unbiased analysis should be zero because P has no influence on change in ability in the simulated data.

Let us emphasize what these simulations tell us. They assume a situation where no child’s ability changes between tests, so whatever effect of P the regression analysis finds is entirely an artifact.

Note that the artifact tended to be half as large as the true relationship between ability and property P, in line with our mathematical analysis for the case where the between-individual and within-individual standard deviations are equal.

We shall close the theoretical part of this paper by a discussion of Lord’s paradox. In its original form, Lord’s paradox is about comparison of groups. Let us therefore consider hypothetical children whose P values are either 0 or 1, such that groups can be based on P values. Using our simulation model we generated abilities and test scores for a hypothetical set of 10^{5} children, equally distributed over the two P values. The results are presented in a scatter plot of pretest score against test score change (i.e., posttest minus pretest).

Latent abilities depended on P and did not change between tests.

First consider the solid line, indicating no change in test score. Because abilities did not change between tests in our model, the solid line is where all datapoints would have been if test scores had been perfect measures of ability. Because test scores included random errors in our model, the data points are instead distributed above and below the solid line to the same extent. Based on this observation an empirical researcher could draw the conclusion that ability did not systematically change in either group, and hence that property P had no influence on learning.

Now consider the dashed lines, showing the results of regressing test score change on pretest score in each group. These lines demonstrate another observation about this dataset: For children with the same pretest score, test score change tends to differ substantially between the high P group and the low P group. Based on this observation, an empirical researcher could instead draw the conclusion that, at any given level of pretest performance, children with high P improved more than children with low P, and hence that property P had a positive influence on learning.

This phenomenon, that the same data on pretests and posttests in two groups can yield conflicting conclusions depending on what aspect of the data is observed, was first pointed out in a classic paper by Lord.

So, which conclusion is correct? Given our knowledge about the model that generated this dataset, the answer is unambiguous: The first conclusion is correct and the second conclusion is incorrect. The relation between property P and change in test scores is just an artifact of regression to the mean and reflects no causal influence.
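The two conflicting readings are easy to reproduce in a simulation. The sketch below assumes a hypothetical two-group model in which group 1 abilities run one unit higher, abilities do not change between tests, and each test score adds independent error:

```python
import random

# Hypothetical two-group model: group 1 abilities average one unit higher;
# abilities do not change between tests, but each test adds random error.
random.seed(3)
n = 50_000
data = []
for group in (0, 1):
    for _ in range(n):
        a = group + random.gauss(0, 1)           # latent ability
        pre = a + random.gauss(0, 1)             # pretest score
        post = a + random.gauss(0, 1)            # posttest score (same ability)
        data.append((group, pre, post))

# Reading 1: mean test-score change is about zero in both groups.
mean_change = {}
for g in (0, 1):
    ch = [post - pre for gg, pre, post in data if gg == g]
    mean_change[g] = sum(ch) / len(ch)
print({g: round(v, 3) for g, v in mean_change.items()})

# Reading 2: among children with (nearly) equal pretest scores,
# the high-ability group gains while the low-ability group loses.
band_mean = {}
for g in (0, 1):
    band = [post - pre for gg, pre, post in data
            if gg == g and abs(pre - 0.5) < 0.25]
    band_mean[g] = sum(band) / len(band)
print(round(band_mean[1] - band_mean[0], 2))
```

Mean change is essentially zero in both groups, yet among children matched on pretest score the high-ability group changes substantially more than the low-ability group, purely through regression to the mean.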

The crux of the matter is that the empirical researcher would not know which model generated the data. Specifically, our mathematical analysis implies that equivalent data are generated by the following model:

S_{pre} = a_{pre}, S_{post} = a_{post}, and a_{post} = a_{pre} + b_{0} + b_{P}P + b_{pre}a_{pre} + ε, with b_{P} > 0 and b_{pre} < 0.

In words, this says that equivalent data would be observed if test scores were perfect measures of ability and if change in abilities were to some degree random but positively influenced by property P and negatively influenced by pretest ability. In such a world, the influence of P found by including pretest score as a covariate would be genuine and not a regression artifact. Instead it would be the analysis omitting the pretest covariate that gave a misleading picture.

The empirical researcher who wants to draw a conclusion about how much influence, if any, property P has on change in ability therefore needs additional knowledge about the underlying processes. Such knowledge may well exist. For instance, one may have an understanding about the mechanisms whereby ability changes. According to such understanding it might be implausible that ability would systematically decrease between tests among highly able children. This would support the first conclusion that the observed decrease in test scores among high pretest scorers is due to regression to the mean. It is also likely that a researcher has some knowledge about the extent of within-individual variation in test scores. Such knowledge can come from the nature of the test itself as well as from analysis of repeated tests with no treatment in-between. We shall later appeal to this kind of knowledge in our reanalysis of an empirical study.

Lord’s own version of the paradox presented data on weight change among male and female college students over an academic year. One statistician, examining mean weight changes, concluded that the dining hall diet had no differential effect on the sexes; another, controlling for initial weight, concluded that it did.

For a recent review and analysis of Lord’s paradox in treatment vs. control designs we refer to van Breukelen.

To be able to test for the presence of bias in results, van Breukelen suggested applying the same analysis to a pair of tests between which no treatment takes place; any effect found in such an analysis must then be an artifact.

So far we have theoretically discussed why and when a certain statistical analysis method will produce a regression artifact. We now turn to an empirical study where this method of analysis was used. The background is an interesting and well-established research finding that children’s proficiency in solving arithmetic problems correlates with the linearity of their estimations of numerical magnitudes. Specifically, arithmetic performance tends to be better the more the child estimates numerical magnitudes in a linear rather than logarithmic way. This fact has been demonstrated in many studies, as reviewed by Booth and Siegler.

Booth and Siegler took this line of research one step further by hypothesizing that linearity of numerical magnitude estimations also predicts how much children learn from arithmetic training. To test this hypothesis they conducted a training study with children, collecting data in a series of sessions.

The first session included the task of estimating the positions of 26 different numbers (ranging between 2 and 98) on a number line between 0 and 100. The researchers then measured the linearity of a child’s estimations by calculating the proportion of variance in estimations explained by a best-fitting linear expression of the numbers to be estimated. This measure will be referred to as R^{2}_{Lin}.
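For concreteness, here is a minimal sketch of this linearity measure. The target numbers and estimates below are hypothetical (and fewer than the 26 used in the study); for a least-squares line, the explained proportion of variance equals the squared Pearson correlation.

```python
# Sketch of the linearity measure: proportion of variance in a child's
# estimates explained by the best-fitting linear function of the numbers.
# Targets and estimates are hypothetical, for illustration only.
numbers = [2, 18, 34, 50, 66, 82, 98]
estimates = [5, 15, 30, 52, 64, 80, 96]

def r_squared_lin(xs, ys):
    """Squared Pearson correlation = R^2 of the least-squares line."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

r2 = r_squared_lin(numbers, estimates)
print(round(r2, 3))
```

A child whose estimates fall almost on a straight line gets a value near 1, while a strongly logarithmic estimation pattern yields a lower value.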

A set of four arithmetic problems (9+18, 26+27, 17+29, and 49+43) was used to test arithmetic performance. Children were asked to solve these problems in the first session. A child’s performance on the problem set was measured as the average absolute error in answers divided by 100, referred to as “percent absolute error” (PAE). For instance, a child giving answers 28, 50, 50, and 80 to these problems would have made absolute errors 1, 3, 4, and 12, yielding an average absolute error of 5 and a PAE of 0.05.
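The computation in this worked example can be spelled out as follows (a direct transcription of the definition, not the original authors' code):

```python
# The worked PAE example: four problems, a child's answers, and the
# resulting percent absolute error.
problems = [(9, 18), (26, 27), (17, 29), (49, 43)]
answers = [28, 50, 50, 80]

errors = [abs(ans - (a + b)) for (a, b), ans in zip(problems, answers)]
pae = sum(errors) / len(errors) / 100
print(errors, pae)  # [1, 3, 4, 12] 0.05
```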

In two subsequent sessions, children received training on the same problems. (Training occurred in four between-subject conditions using different instructional procedures. However, all conditions were pooled in the analysis of the main hypothesis. Because this is the analysis we are concerned with in the present paper, the fact that there were different conditions will not be relevant to our account.) At the end of training, children were again given the same problems to solve. In a follow-up session two weeks after the end of training, children solved the same set of problems for a third time. Thus, three performance measures were collected: pre-training (PAE_{pre}), at end-of-training (PAE_{end}), and at follow-up (PAE_{followup}).

Booth and Siegler tested their hypothesis by regressing improvement in test scores on R^{2}_{Lin}, including the pretest score (PAE_{pre}) as a covariate. An impressive 39 percent of the variance in the test score difference PAE_{end}−PAE_{pre} was explained by a multiple regression on R^{2}_{Lin} and PAE_{pre}, with both factors coming out as highly significant predictors. A similar result was obtained for the difference in performance between the first session and the follow-up session: R^{2}_{Lin} and PAE_{pre} together explained 29 percent of the variance in PAE_{followup}−PAE_{pre}, again with both factors coming out as highly significant predictors. The researchers concluded that arithmetic learning is influenced by the linearity of children’s numerical magnitude estimations.

Note that this study fits perfectly with our previous theoretical discussion. The researchers’ aim was to study how arithmetic learning is influenced by a certain continuous property, linearity of numerical magnitude estimations, operationalized by the quantity R^{2}_{Lin}. This property is known to be related to arithmetic ability. The first condition for a regression artifact was therefore likely to be satisfied. Further, learning was measured as the change in test scores. These test scores measure how far off children were from the correct answers to difficult arithmetical problems. As a measure of arithmetic ability, this must be expected to suffer from substantial random errors. Children are likely to use guessing when they do not know the right answer. They will then, by chance, sometimes come close to the right answer and sometimes not. Thus, the second critical condition for a regression artifact was also likely to be satisfied. Because the researchers used a method of statistical analysis that suffers from a regression artifact under the combination of these two critical conditions, we must expect their results to be biased. It might be that their finding was entirely due to the regression artifact. This calls for a reanalysis of their data.

Our aim is to estimate to what extent Booth and Siegler’s results suffer from the regression artifact and to assess whether or not their research conclusion still holds when the regression artifact is accounted for. To estimate the size of the regression artifact, we have conducted some additional analyses. We thank Julie Booth for sharing the raw data for this reanalysis. The data are presented in three scatter plots.

The first scatter plot shows children’s pretest scores (PAE_{pre}) against their linearity of numerical magnitude estimations (R^{2}_{Lin}). Recall that the test score measures percent absolute error in responses, so lower scores are better: children with higher R^{2}_{Lin} performed better on the pretest. This is the first of the two conditions that give rise to the regression artifact. A simple linear regression of PAE_{pre} on R^{2}_{Lin} yields an estimated value of −0.36 of the unstandardized coefficient. This corresponds to the parameter relating ability to property P in our model (with a negative sign because PAE test scores measure negative ability).

The second scatter plot shows improvement at end of training (PAE_{end}−PAE_{pre}) against R^{2}_{Lin}. Similarly, the third scatter plot shows improvement at follow-up (PAE_{followup}−PAE_{pre}) against R^{2}_{Lin}. No correlations are evident in these plots. Statistical tests confirm that there was no significant correlation between linearity of estimations (R^{2}_{Lin}) and improvement in test scores from training, neither when measured at end of training (PAE_{end}−PAE_{pre}) nor at follow-up (PAE_{followup}−PAE_{pre}). When we instead include the pretest score (PAE_{pre}) as a covariate in a linear regression of the test score differences on R^{2}_{Lin}, we find estimated unstandardized coefficients of −0.27 at end of training and −0.26 at the follow-up session. These values correspond to the outcome variable in our simulations in Study 2 (although with the opposite sign because PAE test scores measure negative ability). However, note that the real data do not support that relations between variables are linear, which they were assumed to be in our simulations.

Taken together, the above analyses show that the dataset exhibits our continuous version of Lord’s paradox. On the one hand, simple correlations indicate that linearity of numerical magnitude estimations does not have any positive influence on test score improvement. On the other hand, when the pretest score was included as covariate the results clearly indicate a positive influence on test score improvement. Which of these results best reflects the answer to the real research question – whether linearity of numerical magnitude estimations has any influence on arithmetic learning?

Did arithmetic learning take place at all? In a further plot we examine this question. To visualize the role of R^{2}_{Lin} in the same plot we have conducted a median split of children into “more linear” and “less linear” (i.e., above and below the median on R^{2}_{Lin}, respectively).

Black and white dots signify “more linear” and “less linear” children according to a median split based on R^{2}_{Lin}.

Our final analysis capitalizes on Booth and Siegler’s inclusion of a follow-up test. Between the tests at end-of-training and follow-up no child received training. This is as close as an empirical study can get to ascertaining that no actual learning affects the results between two tests. As suggested by van Breukelen, we therefore regressed PAE_{followup}−PAE_{end} on R^{2}_{Lin}, including PAE_{end} as a covariate. The result is an estimated unstandardized coefficient of −0.25. This result is essentially identical to the results of our replication of Booth and Siegler’s analyses, which yielded unstandardized coefficients of −0.27 and −0.26. We conclude that there is no evidence of any influence of R^{2}_{Lin} on test score changes beyond the influence that stems from random variation in test scores.

In this paper we have discussed a pitfall in regression analysis of individual differences in change of test scores between a pretest and a posttest. Inclusion of the pretest score as a covariate may produce a regression artifact of a kind that has long been discussed in the statistical literature.

In an empirical study, Booth and Siegler concluded that linearity of children’s numerical magnitude estimations predicts arithmetic learning. Our reanalysis showed that the regression artifact can fully account for their findings; the data are thus consistent with the null hypothesis that numerical magnitude estimations are unrelated to arithmetic learning.