# methods

Method 1

Consider N currently identified alleles associated with the trait. Assume there are P alleles associated with the trait across the entire genome.

If the PGS difference computed from those N alleles is $PGS_N$, then the extrapolated difference is $PGS_N \cdot \frac{P}{N}$
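As a minimal sketch of the scaling above, assuming hypothetical values for $PGS_N$, N and P (the function name and the numbers are illustrative, not taken from any dataset):

```python
def extrapolate_by_count(pgs_n: float, n_identified: int, p_total: int) -> float:
    """Scale the PGS difference from N identified alleles by P / N."""
    if n_identified <= 0:
        raise ValueError("n_identified must be positive")
    if p_total < n_identified:
        # Assumption: the N identified alleles are a subset of the P total.
        raise ValueError("p_total must be at least n_identified")
    return pgs_n * p_total / n_identified


# Hypothetical inputs: PGS_N = 0.25, N = 1,000 identified, P = 10,000 total.
print(extrapolate_by_count(0.25, 1_000, 10_000))  # 2.5
```

The result scales linearly with P, so any error in the assumed total allele count carries through to the extrapolated difference in the same proportion.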

Benefits:

• Simple to do
• Estimates of the number of associated alleles across the entire genome are not difficult to find

Downsides:

• Highly dependent on your choice of original alleles.
• Depends significantly on the distribution of effect sizes in your original set of N alleles and in the remaining undiscovered alleles, $P \setminus N$ (e.g. it assumes that the mean effect size for the alleles in $P \setminus N$ is the same as the mean effect size for the alleles in N)

Method 2

Consider N currently identified alleles associated with the trait. Assume these N alleles explain X% of the variance, but it has been estimated that Y% of the variance in the trait is explained by alleles across the entire genome.

If the PGS difference computed from these N alleles is $PGS_N$, then we should extrapolate the total PGS difference from alleles across the entire genome to be $PGS_N \cdot \frac{Y}{X}$.
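The same idea can be sketched in a few lines, again with hypothetical inputs (the function name and the values of X and Y below are assumptions for illustration only):

```python
def extrapolate_by_variance(pgs_n: float, x_explained: float, y_total: float) -> float:
    """Scale the PGS difference by Y / X, the ratio of total to explained variance."""
    if x_explained <= 0:
        raise ValueError("x_explained must be positive")
    if y_total < x_explained:
        # Assumption: the identified alleles cannot explain more than the genome-wide total.
        raise ValueError("y_total must be at least x_explained")
    return pgs_n * y_total / x_explained


# Hypothetical inputs: PGS_N = 0.25, X = 10% explained so far, Y = 40% total.
print(extrapolate_by_variance(0.25, 10, 40))  # 1.0
```

X and Y only need to be in the same units (both percentages or both proportions), since only their ratio enters the result.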

Benefits:

• Doesn’t require mean effect sizes to be equal between discovered and undiscovered alleles

Downsides:

• Requires accurate estimation of total heritability and current explained heritability.

Method 3

Consider K currently identified alleles with effect sizes following $\beta_K \sim \mathcal{N}(\mu, \sigma^2)$. Let $\digamma_i$ denote the frequency difference between two populations at allele $i$, ordered consistently (e.g. always African frequency minus European frequency). Then the contribution of each allele $i$ to the mean genotypic difference is $\digamma_i\beta_i$.

In R, fit a GAM for $\digamma_i\beta_i \sim \beta_i$. This should give you a good nonlinear fit of the expected contribution to the genotypic gap for an allele of a given effect size $\beta$. Call this predicted contribution $G(\beta)$.

Then, compare your current distribution of effect sizes, $\beta_K$, to the distribution of effect sizes you expect once the whole genome is sequenced, $\beta_M$. Sum the predicted contributions over the whole-genome distribution: $\sum_{\text{genome}} G(\beta)$
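The fit-then-sum procedure can be illustrated with a rough sketch. Note the substitution: a crude binned-mean smoother stands in here for the GAM that the text fits in R, so this shows the shape of the calculation rather than the actual model. The function names, bin edges and data are all hypothetical.

```python
from bisect import bisect_right
from statistics import mean


def fit_binned_contribution(betas, freq_diffs, edges):
    """Return G(beta): the mean of freq_diff * beta within each effect-size bin.

    A binned mean is a stand-in for the GAM; it is not a spline fit.
    """
    n_bins = len(edges) - 1
    per_bin = [[] for _ in range(n_bins)]

    def bin_index(beta):
        # Values outside the edges are clamped to the nearest outer bin.
        return min(max(bisect_right(edges, beta) - 1, 0), n_bins - 1)

    for beta, diff in zip(betas, freq_diffs):
        per_bin[bin_index(beta)].append(diff * beta)

    # Assumption: a bin with no observed alleles predicts zero contribution.
    bin_means = [mean(values) if values else 0.0 for values in per_bin]

    def G(beta):
        return bin_means[bin_index(beta)]

    return G


def extrapolate_gap(G, expected_betas):
    """Sum G(beta) over the expected whole-genome effect sizes (beta_M)."""
    return sum(G(beta) for beta in expected_betas)


# Hypothetical identified alleles: effect sizes and frequency differences.
betas = [0.01, 0.02, 0.05, 0.06]
freq_diffs = [0.5, -0.5, 0.25, 0.25]
G = fit_binned_contribution(betas, freq_diffs, edges=[0.0, 0.04, 0.08])

# Hypothetical expected whole-genome effect sizes.
print(extrapolate_gap(G, [0.01, 0.03, 0.05]))
```

Clamping out-of-range effect sizes to the outer bins is one arbitrary choice among several; any effect size outside the observed range is an out-of-sample prediction whichever smoother is used.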

Benefits:

• Doesn’t require assumptions about the distribution of effect sizes in current GWAS or in the actual set of associated alleles

Downsides:

• If your expected actual distribution contains a large number of large-effect alleles, the out-of-sample prediction can produce unreliable estimates, especially because the frequencies are going to be far off.
• If the identified alleles are a non-random subset at any given effect size, this can bias the estimates for the rest of the genome.