This document shows how to analyze genetic data with the Digital Twin Test (DTT) using the digitaltwins
package.
library(digitaltwins)
First, we create a synthetic data set which has two parts:
#simulation parmaters
n_ext <- 1000 #number of external observations
n_trio <- 250 #number of trios
p <- 500 #number of observed genetic variants
chrome_width <- 50 #length of chromosome
d <- rep(p / chrome_width, p) #genetic distance between sites
k <- 5 #number of causal variants
We will consider a single chromosome of length 50 Mb (roughly the length of chromosome 22), and take 500 genetic variants equally spaced across the chromosome. For simplicitly, we assume recombination occurs uniformly at random.
We begin with a matrix set of simulated haplotypes included in the digitaltwins
package.
data("haps_matrix")
We take the first 2000 rows of haps_matrix
as the haplotypes of the 1000 unrelated individuals (each individual has two corresponding haplotypes).
external_haps <- haps_matrix[1:(2*n_ext), ]
dim(external_haps)
#> [1] 2000 500
Next, we use the remaining 1000 rows as the haplotypes of the parents for the 250 trios.
parent_haps <- haps_matrix[(2*n_ext + 1):(2*n_ext + 4 * n_trio), ]
dim(parent_haps)
#> [1] 1000 500
We create the table of ancestries for the offspring haplotypes, and simulate the offspring haplotypes with the digitaltwins::generate_offspring
function.
anc <- matrix(1:(4*n_trio), nrow = 2*n_trio, byrow = TRUE) #index of ancestors of haplotypes
print(dim(anc))
#> [1] 500 2
head(anc)
#> [,1] [,2]
#> [1,] 1 2
#> [2,] 3 4
#> [3,] 5 6
#> [4,] 7 8
#> [5,] 9 10
#> [6,] 11 12
set.seed(300)
# use the "generate_offspring" function for each row of the ancestry table
offspring_haps <- mapply(
function(i, j) {digitaltwins::generate_offspring(parent_haps[i, ],
parent_haps[i, ],
d = d)},
anc[, 1], anc[, 2])
dim(offspring_haps)
#> [1] 500 500
Lastly, we will simulate a response variable from a sparse linear regression model with 5 causal variants. The digitaltwins::haps_to_gen
function converts the matrix of haplotypes to a matrix of genotypes by adding rows 1 and 2, rows 3 and 4, etc.
#create the regression coefficients
beta <- rep(0, p)
causal_variants <- sample(1:p, k)
beta[causal_variants] <- 1
#sample the response variable from a sparse linear model
Y_ext <- digitaltwins::haps_to_gen(external_haps) %*% beta + rnorm(n_ext)
Y_offspring <- digitaltwins::haps_to_gen(offspring_haps) %*% beta + rnorm(n_trio)
We now have a population of unrelated haplotypes and parent-offspring trios, and so we will next turn to the Digital Twin Test.
The first step of the Digital Twin Test is model fitting. The model fitting is done on the external data set of size 1000. This can be done using any fitting software that yields a linear predictor. We will use the glmnet
package for sparse linear regression, tuning the model with cross-validation.
library(glmnet)
lasso_fit <- cv.glmnet(digitaltwins::haps_to_gen(external_haps), Y_ext)
beta_hat <- coef(lasso_fit)
length(beta_hat) #Fitted linear predictor. First entry is an intercept.
#> [1] 501
Next, we perform inference using the trio data. While our test uses a linear predictor, we emphasize that the validity of the test does not rely on the correctness of our model whatsoever. A better model fit will lead to higher power, but inference remains valid no matter the quality of the model fit.
We will take a group that contains a causal variant.
test_region <- 1:100
sum(causal_variants %in% test_region) #number of causal variants in the test region
#> [1] 1
p_value <- digitaltwins::linear_crt(offspring_haps, parent_haps, anc, Y_offspring,
matrix(beta_hat, ncol = 1),
group = test_region, d = d, family = "gaussian")
#> Pre-computing forward-backward results... done.
#> Number of possible recombination events detected: 500.
#> Computing stats... done.
#> Computing CRT p-values over 100 repetitions (using single core):
#> |----------------------------------------------------------------------------------------------------|
#> |
#null p-value
p_value
#> [1] 0.00990099
We find that the p-value for this region is significant.
If we instead test a region that does not contain a causal variant, we will find a p-value that is not siginificant.
test_region <- 400:500
sum(causal_variants %in% test_region) #number of causal variants in the test region
#> [1] 1
p_value <- digitaltwins::linear_crt(offspring_haps, parent_haps, anc, Y_offspring,
matrix(beta_hat, ncol = 1),
group = test_region, d = d, family = "gaussian")
#> Pre-computing forward-backward results... done.
#> Number of possible recombination events detected: 500.
#> Computing stats... done.
#> Computing CRT p-values over 100 repetitions (using single core):
#> |----------------------------------------------------------------------------------------------------|
#> |
#non-null p-value
p_value
#> [1] 0.00990099