Summary

This document shows how to analyze genetic data with the Digital Twin Test (DTT) using the digitaltwins package.

library(digitaltwins)

Create a synthetic data set

First, we create a synthetic data set which has two parts:

  1. A an external data set of unrelated individuals.
  2. A data set of parent-offspring trios.
#simulation parmaters
n_ext <- 1000 #number of external observations
n_trio <- 250 #number of trios
p <- 500 #number of observed genetic variants
chrome_width <- 50 #length of chromosome
d <- rep(p / chrome_width, p) #genetic distance between sites

k <- 5 #number of causal variants

We will consider a single chromosome of length 50 Mb (roughly the length of chromosome 22), and take 500 genetic variants equally spaced across the chromosome. For simplicitly, we assume recombination occurs uniformly at random.

We begin with a matrix set of simulated haplotypes included in the digitaltwins package.

 data("haps_matrix")

External data

We take the first 2000 rows of haps_matrix as the haplotypes of the 1000 unrelated individuals (each individual has two corresponding haplotypes).

external_haps <- haps_matrix[1:(2*n_ext), ]
dim(external_haps)
#> [1] 2000  500

Trio data

Next, we use the remaining 1000 rows as the haplotypes of the parents for the 250 trios.

parent_haps <- haps_matrix[(2*n_ext + 1):(2*n_ext + 4 * n_trio), ]
dim(parent_haps)
#> [1] 1000  500

We create the table of ancestries for the offspring haplotypes, and simulate the offspring haplotypes with the digitaltwins::generate_offspring function.

anc <- matrix(1:(4*n_trio), nrow = 2*n_trio, byrow = TRUE) #index of ancestors of haplotypes
print(dim(anc))
#> [1] 500   2
head(anc)
#>      [,1] [,2]
#> [1,]    1    2
#> [2,]    3    4
#> [3,]    5    6
#> [4,]    7    8
#> [5,]    9   10
#> [6,]   11   12
set.seed(300)
# use the "generate_offspring" function for each row of the ancestry table
offspring_haps <- mapply(
  function(i, j) {digitaltwins::generate_offspring(parent_haps[i, ], 
                                                 parent_haps[i, ], 
                                                 d = d)},
  anc[, 1], anc[, 2])
dim(offspring_haps)
#> [1] 500 500

Response variable

Lastly, we will simulate a response variable from a sparse linear regression model with 5 causal variants. The digitaltwins::haps_to_gen function converts the matrix of haplotypes to a matrix of genotypes by adding rows 1 and 2, rows 3 and 4, etc.

#create the regression coefficients
beta <- rep(0, p)
causal_variants <- sample(1:p, k)
beta[causal_variants] <- 1

#sample the response variable from a sparse linear model
Y_ext <- digitaltwins::haps_to_gen(external_haps) %*% beta + rnorm(n_ext)
Y_offspring <- digitaltwins::haps_to_gen(offspring_haps) %*% beta + rnorm(n_trio)

We now have a population of unrelated haplotypes and parent-offspring trios, and so we will next turn to the Digital Twin Test.

DTT Step 1: Modeling using the External Data

The first step of the Digital Twin Test is model fitting. The model fitting is done on the external data set of size 1000. This can be done using any fitting software that yields a linear predictor. We will use the glmnet package for sparse linear regression, tuning the model with cross-validation.

library(glmnet)
lasso_fit <- cv.glmnet(digitaltwins::haps_to_gen(external_haps), Y_ext)
beta_hat <- coef(lasso_fit)
length(beta_hat) #Fitted linear predictor. First entry is an intercept.
#> [1] 501

DTT Step 2: Inference using the Trio Data

Next, we perform inference using the trio data. While our test uses a linear predictor, we emphasize that the validity of the test does not rely on the correctness of our model whatsoever. A better model fit will lead to higher power, but inference remains valid no matter the quality of the model fit.

We will take a group that contains a causal variant.

test_region <- 1:100
sum(causal_variants %in% test_region) #number of causal variants in the test region
#> [1] 1
p_value <- digitaltwins::linear_crt(offspring_haps, parent_haps, anc, Y_offspring, 
                      matrix(beta_hat, ncol = 1), 
                      group = test_region, d = d, family = "gaussian")
#> Pre-computing forward-backward results... done.
#> Number of possible recombination events detected: 500.
#> Computing stats... done.
#> Computing CRT p-values over 100 repetitions (using single core):
#> |----------------------------------------------------------------------------------------------------|
#> |
#null p-value
p_value 
#> [1] 0.00990099

We find that the p-value for this region is significant.

If we instead test a region that does not contain a causal variant, we will find a p-value that is not siginificant.

test_region <- 400:500
sum(causal_variants %in% test_region) #number of causal variants in the test region
#> [1] 1
p_value <- digitaltwins::linear_crt(offspring_haps, parent_haps, anc, Y_offspring, 
                      matrix(beta_hat, ncol = 1), 
                      group = test_region, d = d, family = "gaussian")
#> Pre-computing forward-backward results... done.
#> Number of possible recombination events detected: 500.
#> Computing stats... done.
#> Computing CRT p-values over 100 repetitions (using single core):
#> |----------------------------------------------------------------------------------------------------|
#> |

#non-null p-value
p_value 
#> [1] 0.00990099