IQ and Measurement “Error”

There is a persistent idea in the IQ literature that the fact that an individual’s IQ score at one point in time differs from their IQ score at another point in time prima facie indicates measurement error. This idea is actually pervasive throughout most of psychology, but we’ll focus on IQ here.

Classical Test Theory, Correlation and Measurement Error

In essence, this IQ literature posits that there is some underlying “true IQ” score that we can never measure perfectly, but only with some amount of error. This is known as “classical test theory” [1], in which \hat{y}=y+\epsilon, where \hat{y} is the observed score on a test, y is the true score on a test [2], and \epsilon is the measurement ‘error’ that biases the observed score away from the true score [3].
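
To make the model concrete, here is a minimal simulation sketch of the classical test theory setup (the variable names and parameter values are my own illustrative choices, not anything prescribed by CTT): each observed score is a fixed true score plus independent noise, and the “true score” is recovered as the expectation of repeated observed scores, per footnote [3].

import numpy as np

rng = np.random.default_rng(0)

true_score = 100.0          # one person's fixed "true IQ" under CTT
error_sd = 5.0              # assumed standard deviation of the measurement error
n_administrations = 10_000  # hypothetical repeated testings of the same person

# Classical test theory: observed = true + error, with E(error) = 0
errors = rng.normal(loc=0.0, scale=error_sd, size=n_administrations)
observed = true_score + errors

# Per footnote [3], the "true score" is defined as the expectation of the
# observed scores, so the mean of many observed scores should approach 100.
print(observed.mean())  # ~100
print(observed.std())   # ~5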

They then posit that the correlation between the observed scores from two administrations, \hat{y}_{1} and \hat{y}_{2}, will necessarily be less than 1 because of measurement error. Assuming (for now) that the true score does not change between administrations (y_1=y_2) and that the errors are independent of each other and of the true scores, we have

\begin{array}{lr}  \sigma_{\hat{y}_1\hat{y}_2} & = \\  \text{cov}(\hat{y}_{1},\hat{y}_{2}) & = \\  \text{cov}(y_1+\epsilon_1, y_2+\epsilon_2) & = \\  \mathbb{E}(y_1y_2+y_1\epsilon_2+y_2\epsilon_1+\epsilon_1\epsilon_2)-\mathbb{E}(y_1+\epsilon_1)\mathbb{E}(y_2+\epsilon_2) & = \\  \mathbb{E}(y_1^2+y_1\epsilon_2+y_1\epsilon_1+\epsilon_1\epsilon_2)-\mathbb{E}(y_1+\epsilon_1)\mathbb{E}(y_1+\epsilon_2)& = \\  \mathbb{E}(y_1^2)+\mathbb{E}(y_1\epsilon_2)+\mathbb{E}(y_1\epsilon_1)+\mathbb{E}(\epsilon_1\epsilon_2)-\mathbb{E}(y_1+\epsilon_1)\mathbb{E}(y_1+\epsilon_2) & = \\  \mathbb{E}(y_1^2)+\mathbb{E}(y_1)\mathbb{E}(\epsilon_2)+\mathbb{E}(y_1)\mathbb{E}(\epsilon_1)+\mathbb{E}(\epsilon_1)\mathbb{E}(\epsilon_2)-(\mathbb{E}(y_1)+\mathbb{E}(\epsilon_1))(\mathbb{E}(y_1)+\mathbb{E}(\epsilon_2)) & = \\  \mathbb{E}(y_1^2)-(\mathbb{E}(y_1))^2 & = \\  \text{var}(y_1) \\  \end{array}

Similarly,

\begin{array}{lr}  \sigma_{\hat{y}_1}^2 & = \\  \text{var}(\hat{y}_1) & = \\  \text{var}(y_1+\epsilon_1) & = \\  \text{var}(y_1)+\text{var}(\epsilon_1)+2\text{cov}(y_1,\epsilon_1) & = \\  \text{var}(y_1)+\text{var}(\epsilon_1) \\  \end{array}

And,

\begin{array}{lr}  \sigma_{\hat{y}_2}^2 & = \\  \text{var}(\hat{y}_2) & = \\  \text{var}(y_2+\epsilon_2) & = \\  \text{var}(y_2)+\text{var}(\epsilon_2)+2\text{cov}(y_2,\epsilon_2) & = \\  \text{var}(y_2)+\text{var}(\epsilon_2)\\  \end{array}

Then we have that

\begin{array}{lr}  \text{cor}(\hat{y}_1,\hat{y}_2) & = \\  \frac{\sigma_{\hat{y}_1\hat{y}_2}}{\sigma_{\hat{y}_1}\sigma_{\hat{y}_2}} & = \\  \frac{\text{cov}(\hat{y}_1,\hat{y}_2)}{\sqrt{\text{var}(\hat{y}_1)\text{var}(\hat{y}_2)}} & =\\  \frac{\text{cov}(y_1,y_2)}{\sqrt{(\text{var}(y_1)+\text{var}(\epsilon_1))(\text{var}(y_2)+\text{var}(\epsilon_2))}}\\  \end{array}

Very similarly,

\begin{array}{lr}  \text{cor}(y_1, y_2) & = \\  \frac{\sigma_{y_1y_2}}{\sigma_{y_1}\sigma_{y_2}} & = \\  \frac{\text{cov}(y_1,y_2)}{\sqrt{\text{var}(y_1)\text{var}(y_2)}}\\  \end{array}

Finally, we can compare. Because

\begin{array}{lr}  (\text{var}(y_1)+\text{var}(\epsilon_1))(\text{var}(y_2)+\text{var}(\epsilon_2)) & = \\  \text{var}(y_1)\text{var}(y_2)+\text{var}(y_1)\text{var}(\epsilon_2)+\text{var}(y_2)\text{var}(\epsilon_1)+\text{var}(\epsilon_1)\text{var}(\epsilon_2) & \geq \\  \text{var}(y_1)\text{var}(y_2)\\  \end{array}

[4]

and both sides are positive (assuming the true scores vary at all), taking square roots and then reciprocals gives

\begin{array}{lr}  \frac{1}{\sqrt{(\text{var}(y_1)+\text{var}(\epsilon_1))(\text{var}(y_2)+\text{var}(\epsilon_2))}}\leq\frac{1}{\sqrt{\text{var}(y_1)\text{var}(y_2)}} \\  \end{array}

Finally, multiplying both sides by \text{cov}(y_1,y_2) (which is nonnegative here, since under our simplifying assumption it equals \text{var}(y_1)), we get

\begin{array}{lr}  \frac{\text{cov}(y_1,y_2)}{\sqrt{(\text{var}(y_1)+\text{var}(\epsilon_1))(\text{var}(y_2)+\text{var}(\epsilon_2))}}\leq\frac{\text{cov}(y_1,y_2)}{\sqrt{\text{var}(y_1)\text{var}(y_2)}}  \end{array}

That is:

\begin{array}{lr}\text{cor}(\hat{y}_1,\hat{y}_2)\leq\text{cor}(y_1,y_2)\\ \end{array}

So it is true that, in the presence of measurement “error”, correlations will be attenuated, at least under the usual error assumptions: \text{cov}(\epsilon_1,\epsilon_2)=0, \text{cov}(y_1,\epsilon_1)=0, and so on.
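
To make the attenuation concrete, here is a minimal simulation sketch (the variances, correlation, and error standard deviations are arbitrary illustrative choices, not estimates from any IQ data): correlated true scores are observed with independent noise, and the observed correlation lands below the true-score correlation by exactly the factor the algebra above predicts.

import numpy as np

rng = np.random.default_rng(1)
n = 200_000  # simulated test-takers

# True scores on two occasions, correlated at 0.9 with unit variances
cov_true = [[1.0, 0.9], [0.9, 1.0]]
y1, y2 = rng.multivariate_normal([0.0, 0.0], cov_true, size=n).T

# Independent measurement errors, uncorrelated with the true scores
e1 = rng.normal(0.0, 0.5, size=n)
e2 = rng.normal(0.0, 0.5, size=n)

yhat1, yhat2 = y1 + e1, y2 + e2

print(np.corrcoef(y1, y2)[0, 1])        # ~0.90, the true-score correlation
print(np.corrcoef(yhat1, yhat2)[0, 1])  # ~0.72 = 0.9 / sqrt(1.25 * 1.25), attenuated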

But how do we know that there is measurement error in IQ scores? And if there is measurement error, how do we know the quantity of the measurement error in these IQ scores? That’s where the disingenuous nature of IQism comes in.

Properties of IQ Tests

The way that IQists have “identified” measurement “error” in IQ scores is to observe the correlation between the same IQ test at slightly different measurement periods (test-retest reliability; Tuma & Appelbaum 1980) or between subsamples of the same measurement period [5] (a sort of internal reliability; Burke 1972). If these correlations are lower than 1, then, supposedly, we have identified measurement error due to our instruments being unreliable. The issue is that this inference is invalid: we have merely assumed that the observed imperfect correlation is the result of measurement error, rather than actually demonstrating it. There are numerous (possibly infinite) sources of attenuation of the correlation between administrations of IQ tests (including subsampling), meaning that using test-retest reliabilities or subsampling correlations to “estimate” the magnitude of measurement “error” is simply to beg the question, as the sketch below illustrates.
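
Here is a minimal sketch of that identification problem (all of the numbers are arbitrary illustrative choices, not estimates of anything real): one process has a fixed true score plus measurement error, the other has error-free measurement of a true score that changes between occasions, and both yield essentially the same test-retest correlation, so the correlation alone cannot tell us how much of the attenuation is “error”.

import numpy as np

rng = np.random.default_rng(2)
n = 200_000  # simulated test-takers

# Scenario A: static true score, noisy measurement on both occasions
true_a = rng.normal(0.0, 1.0, size=n)
t1_a = true_a + rng.normal(0.0, 0.5, size=n)
t2_a = true_a + rng.normal(0.0, 0.5, size=n)

# Scenario B: error-free measurement, but the true score itself drifts
true_b1 = rng.normal(0.0, 1.0, size=n)
true_b2 = 0.8 * true_b1 + rng.normal(0.0, 0.6, size=n)  # genuine change between occasions

print(np.corrcoef(t1_a, t2_a)[0, 1])        # ~0.80
print(np.corrcoef(true_b1, true_b2)[0, 1])  # ~0.80 as well

Any attempt to read the 0.80 as a reliability, and hence the remaining 0.20 as “error”, simply assumes Scenario A.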

When we consider the question of the actual ontology of an IQ score, we should consider what IQists actually say about their scores. IQists typically posit that “intelligence is what the IQ test measures” (Boring 1923; van der Maas et al. 2014), simply defining the concept of intelligence as represented by IQ scores. If this is true, then there can be no measurement error. If we’ve defined intelligence as whatever an IQ test outputs, then there is no “latent true score”, there is simply a measured score. Or rather, the true score is the observed score, because \text{var}(\epsilon)\stackrel{\mathrm{def}}{=}0.

When we consider that “true scores” are inherently unobservable, yet are defined as the expectation of the distribution of observed scores [6], the consequences for “measurement error” become clear. The first is that because true scores are merely the expectations of the observed scores, the ontological status of “true scores” shifts towards less of a “true” score, and more of a non-existent (and thus obviously unobservable) value assumed for the purposes of statistical estimation. The second is that the “correction for attenuation” [7] is essentially based on the claim that the correlation between the non-existent, unobservable “true scores” is higher than the observed correlation, attributing correlational potency to a non-existent figure. True scores are never actually realized, but the correlational (or in some cases causal) potency somehow lies in the unrealized and unrealizable score. The issue of the temporal disconnect between each realized score and their aggregation into the “true score” is never mentioned, for it would expose the incoherence of correlating an aggregated, unobservable, unrealizable figure onto a single temporally fixed value of another.
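
For reference, the standard “correction for attenuation” (Spearman’s disattenuation formula) that footnote [7] refers to works as follows: writing each test’s reliability as the share of observed variance due to true-score variance, \rho_{\hat{y}\hat{y}}=\frac{\text{var}(y)}{\text{var}(y)+\text{var}(\epsilon)}, the “true-score” correlation is estimated by dividing the observed correlation by the square root of the product of the two reliabilities:

\begin{array}{lr}  \text{cor}(y_1,y_2)=\frac{\text{cor}(\hat{y}_1,\hat{y}_2)}{\sqrt{\rho_{\hat{y}_1\hat{y}_1}\,\rho_{\hat{y}_2\hat{y}_2}}} \\  \end{array}

In practice, the reliabilities in the denominator are exactly the test-retest or subsampling correlations discussed above, so the “corrected” figure inherits every assumption built into treating those correlations as measures of “error”.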

Intelligence As Static?

The most problematic assumption when it comes to identifying test-retest correlations as reliabilities indicative of measurement error is the assumption that “true IQ” does not change from measurement period one to measurement period two. On that assumption, \text{cor}(\hat{y}_1,\hat{y}_2)<1 does represent measurement error, but if we relax the simplifying assumption that y_1=y_2 in our calculations above, then how much measurement error is present in any particular measurement depends on how much the true scores change. Since “true scores” are, on CTT, definitionally unobservable, without the auxiliary assumption that y_1\stackrel{\mathrm{def}}{=}y_2, measurement error is itself unmeasurable.
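
One way to see this, as a sketch under the same assumptions used in the derivation above (errors uncorrelated with each other and with the true scores), is to drop y_1=y_2 and factor the observed correlation into the true-score correlation shrunk by the two reliability coefficients:

\begin{array}{lr}  \text{cor}(\hat{y}_1,\hat{y}_2)=\text{cor}(y_1,y_2)\sqrt{\rho_{\hat{y}_1\hat{y}_1}\,\rho_{\hat{y}_2\hat{y}_2}} \\  \end{array}

The left-hand side is a single observed number, while the right-hand side confounds true-score change (\text{cor}(y_1,y_2)<1) with unreliability (\rho<1); nothing in the observed correlation separates the two without further assumptions.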

Despite the assumption that “true scores” do not change, we have good reason to believe that they in fact do change. For instance, Schwartz & Elonen (1975) document massive changes in IQ scores that are related to specific developmental processes and life history. We have long documented that no traits are “static”; instead they exhibit developmental plasticity (Bateson & Gluckman 2011; West-Eberhard 2003). It is only by dogma that IQists have concluded that “true IQ scores” remain static. Or rather, classical test theory itself is conceptualized as applying only to constant quantities, and it is assumed by fiat that certain traits (IQ scores, personality traits) are constant.

What Are Scores?

I think the best solution for psychology is to treat each realized score as the true score. More specifically, to say that there is no fact of the matter about the “intelligence” or “personality” of an individual, only temporally specific measurements of it that represent what our instruments have estimated at that point in time as a facsimile of that person’s actual psychological qualities. These measurements are not “real” in some metaphysical sense, but are instead figments of our instruments. “True scores” are a fiction, but the consequence of this is, paradoxically, that realized scores are true scores. There is no measurement error because psychological constructs come into existence through the employment or application of a test; they do not exist prior to it.

Footnotes

[1] Modern ‘validity generalization’ theory, item response theory, and other models will be discussed in later posts.

[2] See Borsboom & Mellenbergh (2002); Klein & Cleary (1967); Muir (1977).

[3] In actuality, the true score y is defined as y\stackrel{\mathrm{def}}{=}\mathbb{E}(\hat{y}) (the expectation of the observed scores). See Wu et al. (2017).

[4] The equality case occurs when \text{var}(\epsilon_1)=\text{var}(\epsilon_2)=0.

[5] See Guttman (1945); Kelley (1942); McCrae (2014).

[6] See [3].

[7] For a discussion of the correction for attenuation/correction for unreliability/correction for measurement error, see here.