## Loading required package: maps
## Loading required package: magrittr
##
## Attaching package: 'nlme'
## The following object is masked from 'package:phylofactor':
##
## getGroups
Phylogenetic Comparative Methods (PCMs) aim to control for dependence of observations from related species when performing tests or regression models among traits and between traits and environmental meta data.
PCMs have been developed often with explicit evolutionary hypotheses in mind. Usually, these come in the form of diffusion processes along the phylogeny, such as a Brownian Motion model that can be simulated using fastBM from the package phytools. The tree is primed with an initial, ancestral trait value (default 0), and then each edge undergoes a Brownian motion. At nodes in the tree, the brownian motion is split in two. The brownian motion will lead to covariances between two species’ traits equal to the volatility, sig2, times the amount of shared ancestry of two species.
x <- fastBM(tree,sig2 = 0.01)
par(mfrow=c(1,2))
plot.phylo(tree,main='Phylogeny')
phenogram(tree,x,main='Phenogram of BM on tree')
One can also simulate categorical variables using sim.history and a transition matrix, Q (note: Q is not a stochastic matrix, but a matrix whose rows and columns sum to 0).
########### Simulate categorical variable
Q <- matrix(c(-1,.5,0,0,.5,
.5,-1,.5,0,0,
0,.5,-1,.5,0,
0,0,.5,-1,.5,
.5,0,0,.5,-1),ncol=5)
M <- sim.history(tree,Q,anc='3')
## Done simulation(s).
par(mfrow=c(1,1))
plotSimmap(M)
## no colors provided. using the following legend:
## 2 3 4 5
## "black" "red" "green3" "blue"
One can also use this machinery to simulate integer-valued traits, such as RNA copy number.
Q <- diag(9) %>% cbind(rep(0,9),.) %>% rbind(.,rep(0,10))
Q <- Q+t(Q)-diag(10)
RNAcopyNumber <- sim.history(tree,Q,anc = '3')
## Some columns (or rows) of Q don't sum to 0.0. Fixing.
## Done simulation(s).
RNAcopyNumber$states
## t10 t6 t9 t1 t2 t7 t3 t8 t4 t5
## "2" "2" "2" "7" "7" "6" "5" "2" "3" "1"
plotSimmap(RNAcopyNumber)
## no colors provided. using the following legend:
## 1 2 3 4 5 6 7
## "black" "red" "green3" "blue" "cyan" "magenta" "yellow"
Phylogenetic comparative methods were developed to control for confounding variation caused by common ancestry of organisms when looking for consistent differences between traits within species. For instance, a phylogenetic paired t-test would be used to test whether males of consistently higher metabolic rates than females across primates, or whether there are consistently more genome space occupied by genes for carbohydrate consumption than protein consumption within a clade. Trait observations between close relatives are likely to be similar to one-another under random evolution along the tree because the bulk of evolution has happened prior to the divergence of the close ratives.
Intuitively, phylogenetic dependence requires a reduced sample size - if we observe a feature of body size (trait 1) and metabolic rate (trait 2) in 3 dove species and 2 E. coli strains, our sample size is not 5 but instead close to 2 as the bulk of variation occured prior to divergence of doves from one-another and E. coli strains from one-another. Lower sample sizes have higher standard-errors of the mean, and failing to correct for the lower sample size in a t-test, for instance, can lead to higher test-statistics, lower P-values, and consequently a higher false-positive rate.
Correcting for the phylogenetic dependence for traits with phylogenetic signal can control the error rates of statistical tests. To illustrate this, we’ll simulate 300 pairs of independent, continuous-valued traits diffusing along the phylogeny using fastBM and look at the resulting distribution of P-values obtained from regular t-tests and phylogenetic t-tests. The null hypothesis is true, and so the distribution of P-values should be uniform (its cumulative distribution function should be a line from [0,0] to [1,1]).
reps=1000
Pvals.orig <- numeric(reps)
Pvals.phyl <- numeric(reps)
for (nn in 1:reps){
x <- fastBM(tree,sig2 = 20)
y <- fastBM(tree,sig2 = 20)
Pvals.orig[nn] <- t.test(x,y)$p.value
Pvals.phyl[nn] <- phyl.pairedttest(tree,x,y,fixed=T)$P.dbar
}
par(mfrow=c(1,2))
plot(ecdf(Pvals.orig),main='t-test',xlab='P',ylab='Fn(P)')
lines(c(0,1),c(0,1),lwd=2,col='blue')
plot(ecdf(Pvals.phyl),main='PCM t-test',xlab='P',ylab='Fn(P)')
lines(c(0,1),c(0,1),lwd=2,col='blue')
If, however, traits little phylogenetic signal, as one might expect with extensive horizontal gene transfer, correcting for phylogenetic dependence may lead to erroneous results, depending on the rate of horizontal gene transfer (HGT).
We produced a function to simulate HGT - simHGT in the R package phylofactor. The function simHGT is similar to fastBM in producing a brownian motion along the phylogeny, but each extant lineage has a fixed probability per unit time of being a recipient of a horizontally transmitted gene.
To illustrate our simulation of HGT, we create a random tree and simulate & visualize HGT. Black lines are trait values of lineages over time, red lines indicate HGT events in which one lineage jumps in its trait value to arrive at a horizontally transmitted value (black circle).
set.seed(5)
tree2 <- rtree(5)
sim <- simHGT(tree2,rate=0.3)
par(mfrow=c(2,1))
plot.phylo(tree2)
HGTplot(sim,lwd=2,main='HGT Dynamics')
To show the sensitivity of paired t-tests and phylogenteic paired t-tests to HGT, we simulate 300 pairs of HGT brownian motions as described above and perform t-tests on the each of the resulting pairs of trait values. Estimation of Pagel’s lambda is often used to modulate phylogenetic comparative methods based on the intensity of the signal. Consequently, to test the robustness of phylogenetic paired t-tests to HGT, we set fixed=FALSE in phyl.pairedttest to allow estimation of Pagel’s lambda and subsequent modification of the residual covariance matrix.
Pvals.low.HGT.orig <- numeric(reps)
Pvals.low.HGT.phyl <- numeric(reps)
Pvals.high.HGT.orig <- numeric(reps)
Pvals.high.HGT.phyl <- numeric(reps)
total.edge.length <- sum(tree$edge.length)
lowrate=0.5/total.edge.length #on average 0.5 HGT event in tree
highrate=5/total.edge.length #on average 5 HGT events in tree
for (nn in 1:reps){
### low HGT
xn <- simHGT(tree,lowrate)$states
yn <- simHGT(tree,lowrate)$states
Pvals.low.HGT.orig[nn] <- t.test(xn,yn,paired=T)$p.value
Pvals.low.HGT.phyl[nn] <- phyl.pairedttest(tree,xn,yn,fixed=F,lambda=0.5)$P.dbar
### high HGT
xn <- simHGT(tree,highrate)$states
yn <- simHGT(tree,highrate)$states
Pvals.high.HGT.orig[nn] <- t.test(xn,yn,paired=T)$p.value
Pvals.high.HGT.phyl[nn] <- phyl.pairedttest(tree,xn,yn,fixed=F,lambda=0.5)$P.dbar
}
par(mfrow=c(2,2))
plot(ecdf(Pvals.low.HGT.orig),main='low HGT + t-test',xlab='P',ylab='Fn(P)')
lines(c(0,1),c(0,1),lwd=2,col='blue')
plot(ecdf(Pvals.low.HGT.phyl),main='low HGT + PCM t-test',xlab='P',ylab='Fn(P)')
lines(c(0,1),c(0,1),lwd=2,col='blue')
plot(ecdf(Pvals.high.HGT.orig),main='high HGT + t-test',xlab='P',ylab='Fn(P)')
lines(c(0,1),c(0,1),lwd=2,col='blue')
plot(ecdf(Pvals.high.HGT.phyl),main='high HGT + PCM t-test',xlab='P',ylab='Fn(P)')
lines(c(0,1),c(0,1),lwd=2,col='blue')
Increasing HGT yields a higher false-positive rate for both a standard paired t-test and a phylogenetic paired t-test, although the phylogenetic paired t-test still outperforms a standard paired t-test.
Often, researchers are interested in simple to more complex associations between traits, such as whether body size and habitat preverence are related, or whether two traits are associated with a third. PGLS was developed to correct for phylogenetic dependence in least-squares fitting and can be done by inputting a particular correlation matrix to the generalized least squares function gls.
x1 <- fastBM(tree)
x2 <- fastBM(tree)
y <- fastBM(tree)
Data <- data.frame(x1,x2,y)
pgls1 <- gls(y~x1+x2,correlation = corBrownian(phy=tree),data=Data)
#Assumes brownian motion
pgls2 <- gls(y~x1+x2,correlation = corPagel(value=1,phy=tree,fixed=T),data=Data)
#Allows tuning for phylogenetic signature, 'value' AKA Pagel's lambda, and setting fixed=F to estimate phylogenetic signature much like the phylogenetic paired t-test.
The robustness of PGLS to HGT follows similar principles outlined above. We can repeat the simulations above, with no HGT and high HGT, and obtain the distribution of P-values for regression coefficients in PGLS.
Pvals.gls <- matrix(numeric(2*reps),ncol=2)
colnames(Pvals.gls) <- c('x1','x2')
Pvals.pgls <- Pvals.gls
Pvals.gls.HGT<- Pvals.gls
Pvals.pgls.HGT <- Pvals.gls
highrate=80/total.edge.length #on average 10 HGT events in tree
for (nn in 1:reps){
### no HGT
x1 <- simHGT(tree,rate=0)$states
x2 <- simHGT(tree,rate=0)$states
y <- simHGT(tree,rate=0)$states
Data <- data.frame(x1,x2,y)
s <- gls(y~x1+x2,data=Data) %>% summary
Pvals.gls[nn,] <- s$tTable[2:3,'p-value']
lambda <- min(mean(phylosig(tree,x1),phylosig(tree,x2)),1)
s <- gls(y~x1+x2,correlation=corPagel(value=lambda,fixed=T,phy=tree),data=Data) %>%
summary
Pvals.pgls[nn,] <- s$tTable[2:3,'p-value']
### high HGT
x1 <- simHGT(tree,rate=highrate)$states
x2 <- simHGT(tree,rate=highrate)$states
y <- simHGT(tree,rate=highrate)$states
Data <- data.frame(x1,x2,y)
s <- gls(y~x1+x2,data=Data) %>% summary
Pvals.gls.HGT[nn,] <- s$tTable[2:3,'p-value']
lambda <- min(mean(phylosig(tree,x1),phylosig(tree,x2)),1)
if (lambda<0){lambda=0}
s <- gls(y~x1+x2,correlation=corPagel(value=lambda,fixed=T,phy=tree),data=Data) %>%
summary
Pvals.pgls.HGT[nn,] <- s$tTable[2:3,'p-value']
}
par(mfrow=c(2,2))
plot(ecdf(Pvals.gls[,1]),main='no HGT, gls',xlab='P',ylab='Fn(P)')
lines(ecdf(Pvals.gls[,2]),lwd=2,col='red')
lines(c(0,1),c(0,1),lwd=2,col='blue')
legend(0.6,0.6,legend=c('x1','x2','uniform'),col=c('black','red','blue'),lwd=c(2,2,1),cex=0.5)
plot(ecdf(Pvals.pgls[,1]),main='no HGT, PGLS',xlab='P',ylab='Fn(P)')
lines(ecdf(Pvals.pgls[,2]),lwd=2,col='red')
lines(c(0,1),c(0,1),lwd=2,col='blue')
plot(ecdf(Pvals.gls.HGT[,1]),main='high HGT, gls',xlab='P',ylab='Fn(P)')
lines(ecdf(Pvals.gls.HGT[,2]),lwd=2,col='red')
lines(c(0,1),c(0,1),lwd=2,col='blue')
plot(ecdf(Pvals.pgls.HGT[,1]),main='high HGT, PGLS',xlab='P',ylab='Fn(P)')
lines(ecdf(Pvals.pgls.HGT[,2]),lwd=2,col='red')
lines(c(0,1),c(0,1),lwd=2,col='blue')
For phylogenetic generalized least squares, we see a similar result to that observed for the phylogenetic t-test - increasing HGT increases the false-positive rate, even when incorporating Pagel’s lambda. For the original gls, however, the story is different - increasing HGT does not radically change the false-positive rate.
Phylogenetic comparative methods control for dependence among random variables connected by a phylogeny. The lasting methods from PCM have been developed with clear models of trait evolution in mind. Most commonly, traits arise as a diffusion process taking the form of a Brownian Motion for unconstrained continuous variables, an Ornstein-Uhlenbeck diffusion for constrained continuous variables, or poisson processes with or without constant rates for discrete variables. The prevalence of evolutionary and ecological models in the development of phylogenetic comparative methods allows users to readily assess the assumptions of the method and whether or not those assumptions hold in a given PSDA effort.
Phylogenetic comparative methods, such as the phylogenetic paired t-test and phylogenetic generalized least squares, can control for dependence among random variables connected by a phylogeny, but their use in light of horizontal gene transfer, or with a phylogeny that disagrees with the gene tree, leads to erroneous results. Horizontal gene transfer leads to a high false-positive rate in the PCMs illustrated here, even when attempting to estimate Pagel’s lambda. Given the uncertainty of which traits are driving abundance patterns in microbial communities and whether or not they are horizontally transmitted, there is need for robust phylogenetic comparative methods in light of uncertain phylogenies, horizontal gene transfer, and variable phylogenetic dependence among traits and habitat associations.