Twins Reared Apart: IQ

There has been a pervasive claim among internet hereditarians that studies of twins reared apart provide prima facie evidence that background environments don’t matter, particularly with respect to intelligence.

Much has been made of these claims, so I’m going to review the few studies that actually examine datasets of twins reared apart. These are, respectively, Newman, Freeman and Holzinger (1937), Shields (1962), Juel-Nielsen (1965), the Minnesota Study of Twins Reared Apart (MISTRA) and the Swedish Adoption/Twin Study of Aging (SATSA). (I don’t mention Burt 1955 for the obvious reason of its falsification, and I set aside MISTRA and SATSA because they have been discussed recently in the literature.) Let’s go in order:

Newman, Freeman and Holzinger (1937)

Newman, Freeman and Holzinger’s 1937 book was the first study of twins reared apart to examine the “nurture”-“nature” question. They thought that studying twins reared apart would allow them to finally ‘settle’ the question.

The first issue is typical of studies of twins reared apart: sampling bias. Because twins reared apart are so rare, it is obvious that the way in which twins are ascertained is going to be an issue. In Newman et al.’s study, they noted:

In the faint hope of possibly securing information about as many as four or five cases of such twins, one of us (Newman) wrote a short article about our studies of twins in which an urgent appeal was made for information about any cases of identical twins reared apart

(p. 132)

They soon noted that:

Incidentally, it might be intimated that this matter of opening up the channels for public correspondence is not without its drawbacks. Letters were received from scores of proud mothers of twins and from twins themselves telling how similar they were, how utterly different, or how remarkable in various ways.

(p. 132)

This indicates that their plea had drawn much attention in the public sphere, already biasing the results. Indeed, they even admit that many of the twins found each other because of their similarity:

In a good many of our pairs reared apart the real cause of their discovering each other was their very striking similarity. In these cases, where the twins were either unaware that they were twins or else had lost track of each other, they had discovered or rediscovered each other because a close acquaintance of one had encountered the other and had mistaken the latter for the former.

(p. 136)

As such, it is obvious that the sample of twins reared apart who were able to respond to a letter or call is going to be more similar than the entire population of twins reared apart, introducing bias.

Other biases in their ascertainment include rejecting possible monozygotic twins who differed too much:

They “constitute a selected group, from which any doubtful cases have been excluded. . . . One case was excluded because the twins wrote: ‘A good many people think we are identical twins, but we ourselves do not think we are so very much alike.’ Another case failed to meet our requirements because one of the twins wrote that, while they look very much alike so that they were sometimes mistaken for each other, they were ‘as different as can be in disposition. . . .’” These twins, and other rejections, might well have been monozygotes. The exclusion of even a very few dissimilar separated MZs from such a small sample could radically affect the correlation

(Kamin, p. 53)

During their research they noted:

in some of our cases of identical twins reared apart there are evidences that marked environmental differences did not come into play until adult age had been reached, yet such differences had a definite effect.

(p. 142)

Moreover, when they computed the IQ correlations for their samples of twins reared together (TRT) and twins reared apart (TRA), they found contrasting figures of .91 and .67, respectively. Taken at face value, the difference would indicate a shared environmental component of 24%, much greater than what Bouchard et al. have been pushing in the recent past.
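The arithmetic behind that 24% figure can be written out as a rough sketch. It assumes the simplest additive decomposition, which is an assumption made here for illustration and not something Newman et al. state: MZ twins reared together share both genes and rearing environment, MZ twins reared apart share only genes, and selective placement, contact between twins and age confounds are all ignored. The symbols $r_{MZT}$, $r_{MZA}$ and $c^2$ are introduced only for this illustration.

$$
r_{MZT} = h^2 + c^2, \qquad r_{MZA} = h^2 \quad\Longrightarrow\quad c^2 = r_{MZT} - r_{MZA} = .91 - .67 = .24
$$

Since the rest of this section argues that those simplifying assumptions fail for this sample, the number is best read as indicative rather than as an estimate.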

Kamin discusses the issues with age standardization in the actual IQ tests they employed:

There is still another factor in the NFH study which serves to inflate the apparent resemblance between at least some of their twins. The 1916 Stanford-Binet had been constructed primarily for children, and the NFH scoring procedure inappropriately penalized bright or well-educated adults. The original scoring procedure suggested by Terman was to assign all adults a chronological age of 16 when calculating the I.Q. The highest mental age possible was 19.5, achieved by answering all questions on the scales. Thus the highest possible I.Q. for an individual 16 years or older was 122. That I.Q. is 17 points lower than the I.Q. which would have been assigned to the same individual at age 14, had he answered the same questions in exactly the same way. To eliminate this absurd artifact, Terman had suggested a revised scoring scheme for adults, crediting extra months of “mental age” according to the number of test items successfully answered. That, of course, increases the scored I.Q. The failure to use the revised scoring procedure has an interesting effect on a twin study. When one member of an adult pair answers nearly all questions, and her twin does not, an artificial upper limit is placed on the brighter twin’s I.Q. That reduces the discrepancy between the twins’ I.Q.’s which would have appeared with a more appropriate scoring procedure.

Kamin, p. 55

Moreover, he provided ample evidence that the age-IQ correlation resulting from an improper age-standardization procedure in old Stanford-Binet tests produces a spurious inflation of the IQ correlations between twins reared apart:

The fact is that the 1916 Stanford-Binet, used by NFH, was repeatedly shown to be negatively correlated with age during childhood. The documentation for this observation, well known among testers, is provided in Chapter 5. The infrequent use of the test with adults, however, means that there are no comparable data relating adult I.Q. on this test to age. There are data within the NFH monograph amply confirming the confound of age and I.Q. They presented, it will be recalled, some data for nonseparated twins of school age —50 pairs of MZs and 50 pairs of DZs. For the MZs, the correlation of I.Q. with age was reported as a robust -.49, a highly significant effect. That is, the older the child, the stupider the I.Q. test declared him to be. For the DZs, the correlation was only -.16, significantly different than for the MZs, though pointing in the same direction. The difference might be attributable to very slight differences in the age distributions of the school children making up the two samples. The basic negative correlation for children in that age range is in any event clear. There was no age-I.Q. correlation calculated by NFH for their separated MZs. The bulk of that sample consisted of adults.

Kamin, p. 56

Kamin then uses a “pseudopairing procedure” to provide a rough estimate of what the IQ correlations would look like if they resulted from age similarity alone:

We can now apply the pseudopairing technique to the NFH data. We begin with the 7 male pairs. The observed I.Q. correlation among them is .58. The pseudopairing procedure produces a correlation of .67. For the males, at least, we can predict an individual’s I.Q. just as well from knowing his age as from knowing his twin’s I.Q.! The genetic identity of the twins, in this example, literally contributes nothing to prediction. The correlation between twins’ I.Q.’s may be entirely due to their identical ages, rather than to their identical genes.

He doesn’t put much confidence in the method due to the low sample size and fact that it’s not a tested method, but it is indicative.
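To make the idea concrete, here is a minimal Python sketch of one possible reading of pseudopairing. The quoted passage does not spell out Kamin's exact pairing rule, so the rule used here (sort the pairs by age, then match twin A of each pair with twin B of the next-closest pair in age) is an assumption, as are the function names. The data are invented purely for illustration and are not the NFH scores. The correlation is a double-entry intraclass correlation, a standard way of correlating unordered pairs.

```python
from statistics import mean


def double_entry_icc(pairs):
    """Intraclass correlation for unordered pairs via double entry.

    Each pair (a, b) is entered twice, as (a, b) and (b, a), and an
    ordinary Pearson correlation is computed on the doubled data.
    """
    xs = [a for a, b in pairs] + [b for a, b in pairs]
    ys = [b for a, b in pairs] + [a for a, b in pairs]
    m = mean(xs)  # after double entry, xs and ys share mean and variance
    cov = sum((x - m) * (y - m) for x, y in zip(xs, ys))
    var = sum((x - m) ** 2 for x in xs)
    return cov / var


def pseudopairs(twin_pairs):
    """Build age-matched pairs of genetically unrelated individuals.

    twin_pairs: list of (age, iq_twin_a, iq_twin_b).
    ASSUMED rule: sort by age, then pair twin A of each pair with
    twin B of the next pair in age order (n pairs give n - 1 pseudopairs).
    """
    ordered = sorted(twin_pairs, key=lambda t: t[0])
    return [(ordered[i][1], ordered[i + 1][2]) for i in range(len(ordered) - 1)]


# HYPOTHETICAL data (age, IQ of twin A, IQ of twin B); not from any study.
HYPOTHETICAL_PAIRS = [
    (20, 110, 108),
    (30, 104, 101),
    (40, 99, 100),
    (50, 93, 95),
    (60, 90, 88),
]

real = [(a, b) for _, a, b in HYPOTHETICAL_PAIRS]
print("twin correlation:      ", round(double_entry_icc(real), 2))
print("pseudopair correlation:", round(double_entry_icc(pseudopairs(HYPOTHETICAL_PAIRS)), 2))
```

If test scores drift with age, the pseudopair correlation will be high even though the paired individuals share no genes, which is the point Kamin is making.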

Another confound in the study is the fact that, while better than in other studies, the environments that twins reared apart experience are far from “apart”. Many of the individuals in their sample had regular contact with each other, while others lived in very similar environments. The correlation of environments produces a confound in the attribution of similarity correlations to genetic similarity, rather than environmental similarity.

One indicative result is that the correlation in the years of education the twins received in the Newman et al. sample is quite high [1]:

Although the reliability of ratings was “highly satisfactory” (.90 or above in each of the three areas), the majority of the differences were very low. Intraclass r’s cannot be computed for these ratings, since they were published only in difference form. The individual case reports, however, provided information on the number of years of schooling for each twin. The intraclass r for this variable was .55. Taken as a whole, these data indicate a definite tendency for the members of a twin pair to be placed in somewhat similar social, educational, and health settings.

(Bronfenbrenner 1999, p. 156)

One can also compute the correlations of twin dyads raised in more or less similar social and educational milieus:

Some indication of the magnitude of the relationship is provided, however, by correlations reported in the Chicago study between differences in IQ for each pair of identical twins and differences in their social and educational milieus. The two coefficients were .51 and .79 respectively

The effect of differences in schooling on differences in IQ can be seen to be quite profound:

The psychological significance of such correlations becomes more readily apparent in concrete form. For example, of the 19 pairs of separated twins in this study, there were 8 pairs who had the same number of years of school; the average IQ difference for this group was 1.45. The remaining 11 pairs differed in amount of schooling for an average of 5 years; the corresponding average difference in IQ points was 10.4. The greatest single pair-difference in schooling was 14 years with an IQ difference of 24 points.

Bronfenbrenner (1999) also reports:

In the Newman, Holzinger, and Freeman study, ratings are reported of the degree of similarity between the environments into which the twins were separated. When these ratings were divided at the median, the twins reared in the more similar environments showed a correlation of .91 between their IQs; for those brought up in less similar environments, the coefficient was .42.

Newman et al.’s own estimates indicate that 81% of the variance in IQ can be explained by various environmental differences (p. 342; 36% when the most extreme cases are dropped).

Shields (1962)

Shields’s English study of twins reared apart was the second largest study of TRAs until MISTRA and SATSA. Not only did Shields find monozygotic twins reared apart, but also a number of dizygotic twins reared apart. The correlation between DZ TRAs was only 0.05, which, using the traditional $r_{DZA}=\frac{1}{2}h^2$, yields an $h^2$ of $.10$. What a large estimate! We should note, however, that the sample size for this computation is meagre, to say the least.

There are, again, several issues here. The first is again the problem of age. Not only is it impossible to figure out whether there are age confounds in this dataset, because the data were haphazardly reported, but there are no standardization data for a full $\frac{2}{3}$ of the sample (the female portion):

The “intelligence tests” used by Shields pose some problems. There were two separate tests. The Dominoes, a 20-minute nonverbal test, had been employed in the British Army. There are no standardization data for female civilians, who make up two-thirds of Shields’ sample. The median age of the separated MZs was 40, with a range from 8 to 59, and there are no data concerning age effects on the Dominoes test

Kamin, p. 47

There is no way of knowing the magnitude of the age effect on the test portion used by Shields, but his raw data do not indicate any obvious effect either of age or sex. The separated twins, however, have a significantly lower vocabulary score, and a significantly greater variance, than the nonseparated twins. For both tests used by Shields, the mean score of separated twins —but not of twins reared together —was lower than the “population” mean. The relatively restricted vocabulary score variance among twins reared together can only serve to attenuate the correlation among them. There is no way to know whether their variance is lower than the population variance, or whether the separated twins’ variance is higher than the population’s, or whether both are lower. The standard deviation of twins reared together was 4.0, and that of twins reared apart was 5.7. The latter figure has been used by Jensen to compute “IQ’s” for the separated twins; but Jinks and Fulker, in still another analysis, have estimated the population standard deviation to be 7.89! The choice of standard deviation will radically affect any transformation of raw scores into “IQ’s.” The raw scores are in addition very negatively skewed. There seems to be no satisfactory way of transforming them into I.Q.’s, and Shields quite properly did not attempt to do so.

Kamin, p. 48
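To see why the choice of standard deviation matters so much, here is a small Python sketch. It assumes the usual deviation-IQ convention (mean 100, SD 15), which is my assumption and not a scoring rule taken from Shields or Kamin. The three standard deviations are the ones quoted above; the raw-score mean and the example raw score are hypothetical values picked only for illustration.

```python
def raw_to_iq(raw, mean_raw, sd_raw):
    """Rescale a raw score to an IQ-style metric.

    ASSUMED convention: mean 100, standard deviation 15.
    """
    return 100.0 + 15.0 * (raw - mean_raw) / sd_raw


# Raw-score standard deviations quoted in the passage above.
SD_REARED_TOGETHER = 4.0
SD_REARED_APART = 5.7    # the figure Jensen used
SD_JINKS_FULKER = 7.89   # Jinks and Fulker's population estimate

# HYPOTHETICAL values: a raw mean of 19 and a raw score 5 points above it.
HYPOTHETICAL_MEAN = 19.0
HYPOTHETICAL_RAW = 24.0

for label, sd in [
    ("reared-together SD", SD_REARED_TOGETHER),
    ("reared-apart SD", SD_REARED_APART),
    ("Jinks-Fulker SD", SD_JINKS_FULKER),
]:
    iq = raw_to_iq(HYPOTHETICAL_RAW, HYPOTHETICAL_MEAN, sd)
    print(f"{label}: IQ = {iq:.1f}")
```

The same raw score lands at noticeably different "IQs" depending on which standard deviation is plugged in, and so does any within-pair IQ difference computed from such scores.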

There are other little indications that the correlations can be attributed to non-genetic factors, like the differential correlations between twin pairs when tested by different researchers:

For the 5 pairs tested by different psychologists, the mean difference in total intelligence score between twins was 22.4. For the 35 pairs tested by Shields, the mean score difference was only 8.5. These two figures, despite the very small size of one sample, differ to a statistically significant degree. The intelligence correlation for the small sample was a mere .11, compared to .84 for the sample tested exclusively by Shields. These two intraclass correlation coefficients differ significantly. Though this effect is suggested by the vocabulary data, it occurs primarily in the Dominoes test. For that test the correlation among twins tested independently was -.27; for twins tested by Shields, it was .82. The probability of so large a discrepancy arising by chance, even with such small samples, is less than .025. There is clearly a strong suggestion that unconscious experimenter expectation may have influenced these results

There is also considerable evidence that the environments of twins reared apart in these data are substantially correlated:

Large differences in social class do not often occur in the present material. While the social and cultural level of the branches of a family tend to be similar, the same is also generally true of adoptive parents who obtain children from the same Children’s Home (Shields, p. 48).

Moreover, Kamin computed the correlations for twins reared in branches of the same family vis-à-vis twins who were reared in completely different and unrelated families:

The appendix reveals that 27 pairs of separated twins were reared in related branches of the parents’ families. The most common pattern was for one twin to be reared by the mother, and the other twin by the maternal grandmother or aunt. There were 13 pairs of twins who were raised in unrelated families. For twins reared in related families, the score correlation was .83. For those reared in unrelated families, it was .51. These two correlations differ significantly. The genes within each pair are of course identical, so this is very powerful evidence for the role of environment in determining intelligence test scores.

Kamin, p. 49-50
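For readers who want to check comparisons like this themselves, here is a minimal Python sketch of one common way to compare two independent correlations, the Fisher z test. Kamin does not say in the quoted passage which test he used, so this is a swapped-in textbook method, not a reconstruction of his. The standard error used here, $\sqrt{1/(n_1-3) + 1/(n_2-3)}$, is the usual Pearson approximation; it is only approximate for intraclass correlations and for samples this small, so the output need not agree with his reported significance. The function name is mine.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """Compare two independent correlations with the Fisher z test.

    Returns (z, two_sided_p). ASSUMES the Pearson approximation
    var(atanh(r)) = 1 / (n - 3), with n the number of pairs.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided p-value from the standard normal distribution.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided


# Figures quoted above: .83 for 27 pairs reared in related families,
# .51 for 13 pairs reared in unrelated families.
z, p = fisher_z_difference(0.83, 27, 0.51, 13)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```

Whether the difference clears a given significance threshold depends on the test chosen and on whether it is one- or two-sided, which is one more reason to treat any single figure from samples of this size with caution.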

Even among twins reared apart in unrelated families there is profound evidence of environmental similarity:

They have been included, in the previous analysis, among twins reared in unrelated families. They are the only case in which Shields presents some intelligence data for the foster mothers of two separated twins. The Mill Hill raw scores of the two foster mothers were 12 and 11, each at about the 20th percentile. The twins’ scores were 11 and 12.

Moreover, when one computes the correlation for the twin dyads with the most similar environments and compares it to the correlation for the twin dyads with less similar environments, there is a striking result:

For the 7 cases just described, the intelligence correlation is .99. That seems incredible. Tests are never that reliable, nor are nonseparated twins so highly correlated. The 33 remaining cases correlated .66. The difference between the two correlations is very highly significant. The most reasonable interpretation is that an unconscious experimenter bias has inflated both correlations equally, preserving a difference due to different degrees of environmental similarity. The F ratio between the within-twin pair variances of these two samples is significant beyond the .001 level.

We can also compute differential twin correlations by other metrics of environmental similarity:

The appendix suggests that in only 10 of the 40 cases had members of a pair never attended the same school, nor been reared by related families. For those 10 pairs the correlation was .47, not a very powerful testimonial to identical genes. Within those 10 pairs 6 individuals were given to family friends to be reared, 5 saw their twins during childhood, and in at least 3 cases, the twins lived in the same town, meeting one another in their homes; but Valerie had to take “a four-penny bus ride” to meet her twin.

Juel-Nielsen (1965)

The last of the three monograph-form studies of twins reared apart is the Danish study by Juel-Nielsen. It contained 12 pairs, disproportionately ($\frac{3}{4}$) female. It was one of the few studies that published detailed case histories of the sample. The typical concerns of ascertainment bias also apply, as “only eight of Juel-Nielsen’s 12 pairs (67 percent) were discovered through a register” (Joseph, p. 66). Moreover, Juel-Nielsen used a far from perfect IQ test:

The I.Q. test employed by Juel-Nielsen was a Danish adaptation of the Wechsler Adult Intelligence Scale. The translated test, as Juel-Nielsen scrupulously indicated, had never been standardized on a Danish population. The WAIS test in principle expects a decline in raw scores as adults advance in age. The test scoring procedure had been supposedly standardized on an American white adult population so as to produce a mean I.Q. of 100 at every age. There is no particular reason to suppose that the standardization would survive intact when a translated version of the test is administered to adult Danes.

(Kamin p. 61)

Moreover, there is no quantitative information to test any theories on correlated environments, so the bias of this assumption violation cannot be estimated. However, there is a great deal of evidence supporting the idea that such an assumption violation occurred:

For example, Ingegard and Monika were cared for by relatives until age 7, but then lived with their mother until age 14. “They were usually dressed alike and very often confused by strangers, at school, and sometimes also by their step-father. . . . The twins always kept together when children, they played only with each other and were treated as a unit by their environment, but their attitude toward their surroundings and particularly toward their mother were different from an early age”. The reader is asked to reflect on the fact that the I.Q. correlation between such “separated” twins is said to estimate the heritability of I.Q. in the Caucasian population.

Moreover, there was definitely contact between the twins:

age at separation ranged from 1 day to almost 6 years, and 5 of the 12 pairs spent at least the first year of life together. In addition, Pair IV (“Ingegerd & Monika”) was reared together with their mother between the ages of 7 and 14. Several pairs had a close relationship and years of mutual contact. Each of the 12 case histories Juel-Nielsen presented contained a section called “The Twin Relationship,” which should not be found in a study of “reared-apart” twins where the common perception is that twins were separated at birth and had never met, and therefore had no relationship with each other.

There are other strange properties of the IQs of the sample:

The mean I.Q. of the 24 individuals in the sample was a rather high 105.5, considerably higher than that observed for other separated MZ twins. The standard deviation was a very low 9.6. The mean I.Q. of the males was 113.2, compared to 103 for the females, a statistically significant difference. The 9 female pairs averaged 52.6 years of age, ranging from 35 to 72. The 3 male pairs averaged 48, ranging from 22 to 77. There is again, with a different test in a different country, a very clear correlation between age and I.Q. For the 18 females, the correlation was a significant .61. For the 6 males, the correlation was also significant, –.82, with a reversed sign!

Kamin also reports the result of the pseudopairing procedure applied above to the Newman et al. study:

For those 9 pairs, the intraclass I.Q. correlation was .59. The pseudopairing procedure produced an identical correlation of .59. We observe once again that an I.Q. correlation attributed to identical genes may logically be attributed to the identical ages of subjects administered, in this instance, a wholly unstandardized test. There seems no question but that, at least for some samples of separated MZ twins, the correlation between I.Q. score and age contributes mightily to the reported I.Q. resemblance of twins.

Joseph charges Juel-Nielsen (and other TRA researchers) with neglecting the full implications of cohort effects and environmental similarity:

At the same time, Juel-Nielsen was well aware of the fact that, among MZAs, “a series of general cultural, social and psychological factors may well be common in their environments” (Juel-Nielsen, 1965/1980, Part I, p. 16). He also recognized that “all the twins were brought up in Denmark, the geographical distance between the childhood homes was on the whole small, and owing to the relatively uniform social, educational and cultural structure of the country, great diversities of environment are not to be expected” (Juel-Nielsen, 1965/1980, Part I, p. 97). However, he discounted the potential impact of these environmental similarities and instead focused on the “presumed . . . more relevant” role of “interpersonal relationships” and “attachments to objects” (Juel-Nielsen, 1965/1980, Part I, p. 16).


If there is anything to be taken away from the studies of twins “reared apart”, it is not that families (the ‘shared environment’) do not influence cognitive ability. There is fairly robust evidence from the Newman et al. (1937) book that they do. However, no scientific conclusion can be drawn about the magnitude of the ‘true correlation’ between twins reared apart. This remains true when considering the MISTRA and SATSA estimates as well (Kamin & Goldberger 2002). It is time for hereditarians to give up the game of citing these estimates as legitimate numbers for hypothesis testing.

[1] Bouchard (1983) has criticized these reanalyses for yielding unreliable estimates that do not replicate; however, I do not find his re-reanalysis very convincing. For one, his own estimates, despite his claims, do in fact replicate many of the patterns of correlations that Taylor (albeit not the person I’m citing) reported; see Table 2 and Table 4 for the Otis Newman et al. estimate (ignore Bouchard’s mistaken inclusion of the Juel-Nielsen data, which is much too small to stratify like this). I think the proper conclusion, if one is to accept Bouchard’s paper, is that there is no scientific content that can be drawn from the studies of twins allegedly “reared apart”. I am inclined to trust the results of the Newman et al. (1937) analysis more than Shields (1962) or Juel-Nielsen (1965). Note that I have begun my own reanalysis of the Newman et al. (1937) data, but have failed to find the alleged “Table 87” that Bouchard references. I hope that I will be able to find it in a physical copy of the book, however.