Phylogenetic dependency networks: Inferring patterns of adaptation in hiv

bet	2/12
Sana	23.11.2017
Hajmi	4,8 Kb.
	#20686

1 2 3 4 5 6 7 8 9 ... 12

a
b
Figure 2.1: Examples illustrating the (a) overcounting and (b) undercounting of evi-
dence for an association between X and Y .
easily confound the statistical search for associations within such data.
An important point that has not been emphasized in previous work is that diﬀerent
applications may involve diﬀerent evolutionary processes leading to diﬀerent kinds of
confounding and requiring diﬀerent solutions. For example, in one process, X and
Y coevolve according to the phylogenetic tree—any change in Y during evolution
inﬂuences the evolution of X, and vice versa. Here, the phylogenetic tree serves as a
confounder of X and Y in the traditional sense—the tree is a hidden common cause
of both X and Y , leading to spurious correlations between X and Y when the tree is
ignored. In another process, only Y evolves according to the tree and the inﬂuence
of X on Y occurs only at the tips of the tree. X need not evolve according to the
tree or follow the tree, but instead can have any distribution, including one in which
the observations of X are IID. We refer to these two processes as coevolution and
conditional adaptation, respectively.

9
2.3
Related work on the comparative method
In this section, we brieﬂy review existing methods for the comparative method that
account for phylogeny. The interested reader is referred to the textbooks of Harvey
and Pagel [
92
] and Felsenstein [
62
], as well as the review by Martins [
149
], for a more
detailed review of the ﬁeld.
2.3.1
Correlated evolution of continuous variables
One of the clearest arguments for identifying correlated traits in the context of phy-
logenies come from Felsenstein [
61
], who laid out a similar argument to that given
in
section 2.2
proposed the method of independent contrasts as a remedy. In this
method, the (continuous) traits are assumed to be derived from a Brownian motion
(or other Gaussian) model of evolution and the n samples are converted to n − 1
diﬀerences between adjacent nodes in the tree, with the variance of the diﬀerences
computed according to the branch lengths of the tree. The resulting diﬀerences are
independent, allowing for regression tests between two traits.
Of course, other models of neutral evolution may be more appropriate, as the
random walk of Brownian motion is rather implausible for most systems in which at
least some purifying selection provides a continuous draw toward some steady state. A
variation of Felsenstein’s method using the Ornstein-Uhlenbeck (OU) process, which
can be thought of as a random walk tethered by an elastic band to some center,
provides a more realistic framework [
89
]. Butler and King [
29
], building on Hansen’s
model [
89
], provide a natural framework in which diﬀerent selective regimes can be
tested. In essence, a speciﬁc source of selection pressure, hypothesized to act on a
speciﬁc set of branches, is allowed to change the center to which the OU process
gravitates.
The univariate conditional model (
section 4.1
and
chapter 5
) can be
thought of as a discrete-variable version of this OU approach. These methods are
eﬀective and well used, but are appropriate only for continuous data. Felsenstein

10
provided a method for modeling a discrete variable using an underlying continuous
variable and a thresholding procedure, but likelihood estimation requires Markov
chain Monte Carlo and is quite slow in practice [
64
].
2.3.2
Correlated evolution of discrete variables
For discrete data, early work focused on reconstructing ancestral states using parsi-
mony and looking for correlations between transitions using standard methods such
as the χ
2
test of independence [
141
,
190
]. The problem with these approaches is
that they ignore the uncertainty in the assignment of ancestral states. In some cases,
choosing an ancestral state when there is roughly equal evidence between two states
can alter the conclusions. To account for this uncertainty, Pagel [
171
] presented a
maximum likelihood approach that averages over all possible conﬁgurations of the
internal nodes. In eﬀect, Pagel performs a likelihood ratio test where the null model
is two binary traits evolving independently using classical maximum likelihood over
phylogenies [
60
], and the alternative model has 2 × 2 = 4 states that deﬁne correlated
evolution. At about the same time, Muse independently proposed the four-state ver-
sion of Pagel’s method for the purpose of detecting compensatory mutations in RNA
secondary structure [
158
].
2.3.3
Correlated evolution of amino acids
Coevolution of discrete traits has been propelled largely by the amino acid coevolu-
tion community. Here the goal is to identify coevolving pairs of codons in the same
or interacting proteins (see [
39
] for a recent review). Early methods ignored the prob-
lem of phylogeny and used simple methods of correlation such as mutual information
[
11
,
122
] or the correlation coeﬃcient [
80
,
173
,
215
], but the phylogeny was soon
shown to play a confounding role [
168
,
177
,
228
]. A number of methods emerged
that attempted to calibrate p-values based on the phylogeny. The idea behind these
approaches is that the primary concern of phylogeny-based confounding is that the

11
resulting statistics do not follow the expected distribution, making any p-values that
are computed on the assumption of a speciﬁc distribution invalid. To recalibrate, the
null distribution is estimated from the data using some method that accounts for the
phylogeny. Brieﬂy, methods have been proposed based on global or local measures of
average similarity [
26
,
127
,
138
,
159
,
213
] or ﬂat clusters based on early bifurcations
of the phylogenetic tree [
11
,
27
]. One approach that explicitly incorporates the phy-
logeny is the parametric bootstrap, in which standard IID models are recalibrated by
generating independently evolved amino acids according to an independence model
of evolution [
27
,
228
] (in
chapter 5
, we consider this method in more detail). One
broad criticism of these approaches is that they leave the underlying ordering of the
statistics unchanged. That is, it is implicitly assumed that the phylogeny does not
aﬀect the strength of the association, only the interpretation of the relative strength.
As we have argued in
section 2.2
, however, the phylogeny may profoundly alter the
relative strength of the statistic, even changing the ordering of associations. Thus,
directly incorporating the phylogeny into the derivation of the statistic may dramat-
ically increase power.
A more direct way to incorporate the phylogeny is to adapt one of the discrete
methods from the evolutionary biology community. Several methods have derived
from Ridley’s original proposal of reconstructing ancestral states [
17
,
74
,
102
,
182
].
Similarly, Poon et al. [
181
] and Pollock et al. [
178
], which we study in some detail
in
chapter 5
, develop special cases of Pagel’s method [
171
] that force the model to
be reversible, and Yeang and Haussler [
233
] develop a version of Pagel’s method that
does not require the collapse of amino acids into binary space, but rather achieves
the necessary state-space reduction by setting the instantaneous rate of simultaneous
mutation in both both codons at a nonzero constant that is the same for all amino
acids.

12
2.3.4
Identifying adaptation from DNA sequence alone
An alternative approach to discrete data is to simply identify codons that are under
diversifying (positive) or purifying (negative) selection pressure, an approach that
is particularly suited in the absence of testable hypotheses regarding the source of
selection pressure. Several methods have been developed that compare the rate of
synonymous mutations d
s
to that of nonsynonymous mutations d
n
to identify genes,
regions, or even individual codons that experience more amino acid substitutions than
expected based on the underlying neutral substitution rate [
160
]. One of the more
common methods is PAML, which uses a codon substitution model over a phylogeny
to compute d
n
/d
s
[
165
,
231
]. Stewart et al. [
207
] extended this concept in the program
QUASI to identify speciﬁc residues that are positively selected, although this model
assumes a star phylogeny in that only deviations from consensus are considered. Pe-
ters et al. [
174
] identiﬁed HLA-associated polymorphisms by identifying positively
selected residues in HIV using QUASI, then correlating those residues to HLA alleles
using standard tests that assume independence (see
chapter 3
for an introduction to
the importance of HLA-derived selection pressure on HIV). Delport et al. [
48
] devel-
oped a codon model that speciﬁcally identiﬁes amino acids that toggle back an forth,
on the assumption that these represent speciﬁc adaptations to certain environments
(assumed to be HLA in their case study) that revert in the absence of the environment.
Chen and Lee [
35
] developed a more direct approach to identifying sources of selection
pressure by deﬁning a d
n
/d
s
ratio that was conditional on speciﬁc environmental vari-
ables (in their case, HIV antiretroviral drugs), but also did so using a na¨ıve deﬁnition
of d
n
/d
s
based on deviation from consensus. Such deﬁnitions implicitly assume a star
phylogeny and ignore branch lengths (i.e., they assume the traits are IID), and are
thus likely to lead to statistical bias in cases where there is some structure in the tree,
as in HIV [
17
,
31
]. Even when the phylogeny and branch lengths are considered, as
in PAML, d
n
/d
s
ratios can be viewed as a means to calibrating statistics. That is,

13
the synonymous substitution rate serves to normalize observations regarding nonsyn-
onymous substitutions. Thus, models that explicitly model evolutionary interactions
between two variables (e.g. [
31
,
102
,
141
,
158
,
171
,
178
,
181
,
190
]) are expected to
have greater statistical power than d
n
/d
s
ratios, due to the fact that they can up-
weight surprising deviations between evolutionarily similar species whereas d
n
and d
s
represent summary statistics across all species and cannot leverage such information.
2.3.5
Similarities to population genetics
Population structure in biological data has also been addressed in the area of genome
wide association studies (GWAS). Although standard population genetic models as-
sume populations mate randomly, violations of this assumption result in latent pop-
ulation structure that inﬂates false positive rates, an eﬀect that will only increase
as study sizes increase [
144
,
145
]. There are two rather diﬀerent approaches in this
community that have been used to compensate for population structure. The more
commonly used approach attempts to recalibrate standard statistics by normalizing
results according to the distribution of the statistic across the entire genome [
50
,
51
].
As we shall see, solutions to calibration are insuﬃcient, as population structure also
aﬀects discriminatory power. The other approach assumes population structure is
ﬂat and can be captured by a small number of (perhaps overlapping) clusters or con-
tinuous hidden variables [
185
,
186
,
198
,
204
,
217
]. Although these methods increase
discriminatory power relative to simple IID models, there is mounting evidence that
populations are hierarchically structured. For example, in addition to high level ge-
ographical/social constraints that impose population structure [
195
], structure exists
within a number of subpopulations that have been studied [
13
,
30
,
96
,
223
]. If hi-
erarchical models describe the data better than ﬂat cluster models, then it stands
to reason that such models will have higher discriminatory power. Thus, several au-
thors have suggested that a more accurate approach would be to model population
structure hierarchically [
8
,
114
,
240
]. Aranzana et al. [
8
] described one such model in

14
their Arabidopsis study, which we re-examine in
section 5.5
. A more general approach
that can capture pedigree structure is the linear mixed-eﬀects model [
97
,
114
], most
recently extended by Yu et al. [
241
]. This linear model includes a pairwise correla-
tion term that models genetic relatedness, a white noise term, and an environmental
impact term that is used to identify sources of selection pressure. This approach
can be thought of as unifying population genetics and phylogenies, in that Brownian
motion and OU processes over phylogenies can be captured in the genetic relatedness
term [
101
,
140
]. Unfortunately, modeling discrete data with these approaches is com-
putationally slow and diﬃcult to optimize, at it requires variational approximations
(H. Kang and D. Heckerman, unpublished data).
2.4
Limitations of existing methods
Although there is a long and rich history of the development of methods for the
comparative method, we note two major gaps in the literature, each of which will be
addressed by this dissertation.
2.4.1
Chains of interaction
First, we note that the traditional application of the comparative method is to look
for correlations among pairs of variables. The problem with this approach can be
seen by a simple example. We have been considering the case where trait X exerts
selection pressure on trait Y , which can be graphically depicted as
X → Y.
Suppose, however, that Y in turn exerts selection pressure on another trait Z. A
common example is if Y and Z are both codons in the same protein. In this scenario,
a mutation in Y may serve as an adaptation in response to X, but the change in Y
destabilizes the protein unless a compensatory change in Z occurs. This causal model

15
can be graphically depicted as
X → Y → Z.
The problem is, when we apply the comparative method to all pairs of variables, we
will ﬁnd that all three pairs (X, Y ), (Y, Z), and (X, Z) are correlated, even though
Z is conditionally independent of X. We refer to such causal models as chains of
interactions. If chains are common, then indirect associations ((X, Z) in the present
example) will be common, leading to a large number of false positive results.
To our knowledge, only Poon et al. [
182
] have addressed the problem of chains
of interactions when considering discrete traits in the context of phylogeny. (Several
authors have employed Bayesian networks, which solve this problem, but have done
so assuming the traits are IID [
45
,
181
].) Poon’s approach can be described in three
steps: (1) ﬁrst, a phylogeny is inferred from all traits (they focus on amino acid
covariation); (2) next, they infer the most likely state of each trait for each hidden
node in the phylogeny in a method similar to [
190
]; (3) ﬁnally, they feed both the
observed and inferred traits into a standard directed acyclic graphical (DAG) model
inference algorithm [
93
], treating the inferred states as observed data. This paper
represents a major advance in the ﬁeld in that it was the ﬁrst to both identify and
address the problem of chains of interactions.
We must, however, note several weakness to this approach. First, one must be
cautious whenever hidden nodes are treated as observed data. In cases where there
is strong evidence that the hidden node takes on a speciﬁc value, this approximation
may yield good results in practice. For large phylogenies, however, it is often the
case that the evidence for one state is only slightly greater than the evidence for
another state, especially for nodes near the root (and thus farther removed from
the observed data). Although Poon et al. showed that their method was reasonably
robust to diﬀerent instantiations of the hidden nodes on the data sets they considered,
in general, power is typically gained when all the hidden nodes are integrated out.
Second, the DAG model has some inherent constraints. For example, the acyclicity

16
constraint, which is required for computational tractability, may not be a reasonable
assumption. Certainly in the case of amino acid covariation one can expect cycles
to exist in the true causal model. The result can be diﬃculty in interpreting the
independencies implied by the resulting structure [
94
]. Additionally, the goal of the
DAG model is to maximize the joint likelihood over all the attributes. This requires an
exponential number of parameters relative to the number of traits considered, leading
to a substantial computational burden and a requirement for a massive amount of
data to infer all but the simplest interaction networks.
2.4.2
Coevolution versus conditional adaptation
As discussed in
section 2.2
, there are at least two evolutionary processes that can
lead to phylogenetic confounding: coevolution and conditional adaptation. To our
knowledge, all of the existing approaches assume a coevolutionary process, by explic-
itly incorporating the assumption in a generative model (e.g., [
158
,
171
,
178
]), or by
mapping both traits to the same tree (e.g., [
17
,
74
,
102
,
141
,
182
,
190
]). In principle,
most of the p-value calibration techniques could be adapted to estimate null data by
randomizing only one variable with respect to the tree, though we are not aware of
any explicit discussion in the literature to this eﬀect. The problem with this assump-
tion is clear. When one of the traits (typically the environmental trait) does not map
well to the phylogeny of the other trait, forcing it to do so will hurt the modeling
procedure. In practice, there are two solutions to this problem. (1) Many approaches
can model an IID variable on a tree as a boundary condition of the model parameters.
For example, the generative model of Pollock et al. [
178
] can, in principle, handle ei-
ther or both traits being IID by setting the corresponding mutation rate parameter
to inﬁnity. In practice, however, we ﬁnd that this model does not perform well on
conditional adaptation data (see
chapter 5
).
The diﬀerence between conditional adaptation and coevolution is, however, deeper
than the fact that conditional adaptation is more likely to accommodate the two traits

17
following two diﬀerent distributions. The speciﬁc interactions assumed by the causal
models are diﬀerent. In the coevolution case, the assumption is typically that the
two traits are either positively or negative correlated in the traditional sense. By this
we mean that, if the traits are positively correlated, the mutation of either trait to
1 will apply pressure for the other trait to mutate to 1. Conversely, the mutation of
either trait to 0 will apply pressure for the other trait to mutate to 0. (The opposite
is of course true for negative correlation.) The conditional adaptation assumption
can posit two diﬀerent deﬁnitions. (1) It can be a directed association. Speciﬁcally,
changing trait X (the environmental variable in our example) may inﬂuence trait
Y , but changing trait Y will have no inﬂuence on X. Second, the interaction may
be only partially positive (or negative). For example, it maybe that X = 1 induces
positive selection pressure for Y to transition to 1, but when X = 0, Y is under
neutral selection.
The point is not that the coevolution model is fundamentally unsound. Indeed,
there are many examples where this model is likely closer to the true causal model
than is conditional adaptation, with a prime example being amino acid covariation.
Rather, each model appears to better describe diﬀerent evolutionary processes. In
this dissertation, we propose several speciﬁc conditional adaptation models, which
are inspired by the process of adaptation described in
section 2.1
. In
chapter 5
, we
explicitly compare one of these models to the coevolution model of Pollock et al [
178
].
As we shall see, both the coevolution and conditional adaptation models are distinct,
though we also ﬁnd that our conditional adaptation model is able to approximate the
coevolution model reasonably well.

18
Chapter 3
HIV IMMUNE ESCAPE:
INTRODUCTION AND REVIEW
The extraordinary mutational capacity of HIV-1 represents a major challenge to
vaccine development [
76
,
143
]. On average, the error-prone HIV-1 Reverse Transcrip-
tase introduces one mutational “error” per replication cycle, while template-switching
and recombination represent additional mechanisms for generating alternative viral
species [
143
]. Within an infected individual, a progressive expansion of viral diversity
occurs over the disease course [
205
], with multiple variants co-existing as a heteroge-
neous swarm or quasispecies that is unique to each patient. On a global scale, HIV-1
has undergone dramatic diversiﬁcation since its introduction into humans less than
100 years ago [
121
,
229
]: nucleotide sequences from the multiple viral subtypes and
circulating recombinant forms comprising HIV-1 Group “M” strains (which account
for the majority of infections worldwide) may diﬀer by up to 35–40% [
76
]. Achieving
a broader understanding of the factors driving viral evolution on both an individual
and a global level is thus of paramount importance to vaccine design.
Over the natural course of infection, the host immune response acts as a major
selective force driving HIV-1 evolution in a continuous dynamic process known as im-
mune escape [
85
]. Directed against three-dimensional epitopes on the virion surface,
the role of antibodies is to neutralize free-ﬂoating virus or to tag them for destruction
by eﬀector cells or complement. Escape from the HIV-1-speciﬁc antibody response
thus takes the form of amino acid substitutions within the viral Envelope protein
and represents a main driver of both intra-individual and global HIV Envelope di-
versity [
226
]. In contrast, the role of cytotoxic T-lymphocytes (CTL) is to eliminate

19
virus-infected cells through recognition of short, linear peptides processed intracellu-
larly and presented on the cell surface by Human Leukocyte Antigen (HLA) class I
molecules. Since peptides from all viral proteins have the capacity to bind and be
presented by class I molecules, HLA-restricted CTL select for escape mutations on
a proteome-wide basis. This fact, combined with the observation that CTL likely
contribute more to immune control of HIV-1 infection than antibodies [
133
], high-
lights CTL as a potentially major in vivo selective force driving genome-wide viral
evolution.
It is generally agreed that a successful HIV-1 vaccine will require stimulation of
an eﬀective CTL-based immune response in addition to an antibody response [
133
].
The recent suspension of a major CTL-based HIV-1 vaccine trial [
161
] underscores the
need to improve our understanding of host antiviral immunity, including the impact of
immune-driven viral adaptation on HIV-1 sequence diversity and its potential conse-
quences for future vaccine strategies. In what follows, we summarize recent advances
in our knowledge of CTL-driven HIV evolution at a population level and discuss the
implications for vaccine design.
3.1
The HLA-restricted CTL response is a major selective force driving
HIV-1 evolution within an infected host
HLA-restricted CTL are major mediators of host antiviral control during HIV-1 infec-
tion [
85
,
86
]. A temporal correlation exists between the appearance of HIV-1-speciﬁc
CTL in vivo and the decline of acute-phase viremia [
123
] (antibodies appear only
later [
85
,
133
]), and experimental depletion of CD8
+
cells in rhesus macaques prior
to Simian Immunodeﬁciency Virus (SIV) infection results in inability to control virus
levels [
199
]. In addition, a strong epidemiological link exists between speciﬁc HLA
class I alleles and diﬀerential rates of HIV-1 disease progression [
14
,
34
,
44
], suggesting
that the quality of the CTL response and/or the characteristics of targeted epitopes
strongly inﬂuences the eﬀectiveness of antiviral control [
85
]. However, the strongest

20
evidence supporting CTL as a major determinant of HIV-1 control may be muta-
tional immune escape. First described in 1997, selection of viral escape mutations
within key CTL epitopes during primary [
16
,
19
] and chronic [
84
,
85
] HIV-1 infection
identiﬁed immune-driven evolution as a continuous process occurring throughout the
disease course. Although escape mutations are often selected within CTL epitopes
(thus disrupting peptide-HLA binding or recognition of the peptide/HLA complex by
the T-cell receptor), escape is by no means conﬁned to epitope boundaries. Muta-
tions in epitope ﬂanking regions, which impair intracellular peptide processing and
presentation, have also been described [
3
,
54
,
239
], as have secondary or compensatory
sequence changes that can stabilize escape mutations selected elsewhere [
41
,
85
,
201
].
3.2
Escape follows generally predictable patterns in response to speciﬁc
immune pressures
In the past decade, observational studies have identiﬁed a large number of CTL escape
mutations that are reproducibly selected in the context of speciﬁc HLA restrictions
[
19
,
84
,
85
,
113
,
132
,
201
]. This has led to a monumentally important observation:
HIV-1 evolution follows generally predictable patterns when speciﬁc immune pressures
are applied. This phenomenon was most strikingly demonstrated in a unique case-
report of monozygotic twins infected with the same virus on the same day through
needle sharing: over a three-year follow-up period, the kinetics and patterns of CTL
and antibody escape mutations were nearly identical in both twins [
53
]. Even among
unrelated individuals, kinetics and patterns of HIV-1 evolution are broadly predictable
based on HLA restriction. The majority of persons expressing the “protective” HLA-
B*57 allele [
34
], for example, will select for a T to N mutation at position three of the
TW10 epitope in the p24 Gag protein within the ﬁrst weeks after infection [
85
,
132
].
In B*27-expressing individuals, the ﬁrst mutation that arises in the immunodominant
Gag epitope KK10 is an L to M change at position six, followed years later by an R
to K change at position two [
85
,
113
]. The fact that sites and pathways of escape are

21
broadly predictable indicates that despite the extensive worldwide sequence diversity
of HIV-1, substantial constraints govern the evolution of this virus [
2
,
53
]. This
raises the possibility that immunogens incorporating knowledge of common escape
pathways may be designed. Until relatively recently, however, identiﬁcation of escape
mutations have generally been limited to smaller observational studies, and largely
biased towards “protective” HLA alleles associated with long term viremic control
[
19
,
84
,
85
,
113
,
132
,
201
].
3.3
Immune selection pressures drive HIV evolution at the population
level: but to what extent?
If within-host CTL escape patterns are broadly predictable based on HLA proﬁle, the
frequency and distribution of HLA alleles in humans likely shape viral evolution at
the population level in a similarly predictable manner. Indeed, the HLA footprinting
hypothesis states that the circulating HIV-1 consensus sequence reﬂects viral adap-
tation to the most commonly-expressed HLA alleles in a population [
131
,
157
]. One
potential mechanism underlying this hypothetical footprinting eﬀect is the repeated
selection of ﬁtness-neutral escape mutations in the context of frequently-observed
HLA alleles, eventually leading to ﬁxation of “inactive” forms of CTL epitopes in the
circulating pool of viral strains and potentially rendering HIV less immunogenic as the
epidemic progresses [
131
,
157
]. Alternatively, a bottleneck eﬀect may have occurred
early in the course of the pandemic. Under this hypothesis, escape mutations arising
in the earliest patients persist in the circulating strain. In either scenario, persisting
escape mutations must have minimal ﬁtness cost in the absence of CTL to prevent
reversion back to the susceptible form following transmission to patients who lack the
restricting HLA allele.
The role of CTL in shaping population HIV-1 sequence diversity is inﬂuenced by
a complex interaction among multiple conﬂicting selective forces, one example being
the delicate balance between the beneﬁts of escape versus the associated costs to

22
viral ﬁtness [
137
]. While escape mutations, by deﬁnition, confer a selective advantage
under active CTL pressure, these mutants may not represent the most eﬃciently
replicating species in the absence of immune pressure. For example, the dominant
T242N escape mutation at position three of the B*57-restricted TW10 epitope in p24
Gag abrogates B*57-epitope binding [
132
], but also confers a substantial replicative
cost in the absence of CTL pressure as demonstrated by rapid reversion to wild
type following transmission to a B*57-negative individual [
132
] and in vitro assays
measuring viral replicative capacity [
21
,
147
]. Reversion of escape mutations following
transmission to an individual lacking the HLA allele, however, does not necessarily
occur in all cases: the ﬁtness cost of the substitution, the presence of compensatory
mutations, and a complex array of other host and viral selective forces inﬂuence which
immune-selected mutations may reach appreciable levels in the population [
132
].
Estimating the extent of immune imprinting on HIV-1 is additionally challenging
due to the lack of a comprehensive map of HLA-associated escape sites across the
viral genome. Until recently, studies of CTL escape have focused on select alleles
and/or HLA-restricted epitopes in small observational studies; however, recent ad-
vances in DNA sequencing technologies have facilitated the collection of HLA and
HIV-1 sequence data in large cohorts of HIV-infected individuals, thus allowing the
ﬁrst population-based assessments of HLA-driven imprinting on the viral genome.
3.4
Assessing the extent of HLA-driven HIV-1 evolution at the popu-
lation level: challenges and controversies
The ﬁrst study investigating HLA-mediated imprinting on HIV-1 at the population
level was published by Moore et al. in 2002 [
157
]. Speciﬁc HLA alleles associated
with the presence or absence of the consensus amino acid over codons 20-227 of the
Reverse Transcriptase protein were identiﬁed in a cross-sectional analysis of over 400
clinically-derived HIV-1 sequences from Western Australia, using Fisher’s exact test
and logistic regression. A total of 64 “positive” and 25 “negative” correlations (where

23
the HLA was respectively associated with divergence from, or preservation of, the
consensus residue) were identiﬁed. Positive correlations conﬁrmed the predictable
selection of HLA-restricted escape mutations and highlighted the substantial portion
of viral codons whose evolution is driven by escape. Negative associations, on the
other hand, were interpreted as evidence to support the HLA footprinting eﬀect;
in other words, conﬁrmation that the predominant circulating HIV-1 sequence had
arisen in response to selection pressures imposed by the most commonly-expressed
HLA alleles in the population [
157
].
There is one major concern, however, with this type of statistical association
study. Evolutionary biologists have pointed out that standard tests of association
that assume independence (such as Fisher’s exact test or logistic regression) are inap-
propriate for the analysis of inter-species (or in this case, inter-strain) data, because
sequences with a shared phylogenetic history do not represent statistically indepen-
dent observations [
61
] (see
chapter 2
for review). This problem is most apparent in
the analysis of heterogeneous cohorts, where standard tests will identify correlations
between subtype-speciﬁc viral polymorphisms and HLA alleles prevalent among indi-
viduals infected with those subtypes. In this case, results do not necessarily indicate
that these polymorphisms are selected under contemporary immune pressures; rather,
they more likely reﬂect a correlation between HLA alleles enriched in particular hu-
man populations and a subtype (or lineage)-speciﬁc viral polymorphism shared by
all sequences in this branch of the tree (a so-called founder eﬀect) [
17
]. Even in rela-
tively homogeneous cohorts, failure to account for the evolutionary relatedness among
HIV-1 sequences leads to increased variance in traditional statistical tests and thus
uncertainty in interpreting the results (see ref. [
31
] and
chapter 2
).
To address this issue, Bhattacharya et al. [
17
] proposed two methods to account
for the underlying phylogenetic structure of HIV-1 sequences when identifying sites
of CTL-mediated selection (the second method is the univariate model described and
explored in Chapters
4
and
5
). These approaches attempted to identify associations

24
that were unusual in the context of a shared lineage while statistically downplaying
associations that could have been explained by neutral evolution. Their methods were
used to identify HLA-associated polymorphisms in the Gag and Protease proteins in
a mixed-clade data set of 96 sequences from the same cohort investigated by Moore et
al. [
17
]. By comparing associations identiﬁed using standard versus lineage-corrected
analyses, Bhattacharya et al. [
17
] demonstrated that a number of associations iden-
tiﬁed by the original logistic regression method actually represented spurious asso-
ciations that were better explained by founder eﬀects. The authors noted however
that phylogenetic correction achieves more than simply the elimination of spurious
associations due to lineage eﬀects. Consideration of tree structure also identiﬁed
novel associations that were overlooked by the uncorrected analysis, thus allowing
the potential for increased power compared to standard methods (though a formal
power analysis was not undertaken) [
17
]. Overall, however, in the 96 sequences ana-
lyzed by Bhattacharya et al., only 14 strong HLA allele/HIV-1 sequence associations
were identiﬁed [
17
]. The relative paucity of associations identiﬁed by Bhattacharya
et al. [
17
] compared to Moore et al. [
157
] led some to question the contribution of
HLA-mediated selection pressures to viral evolution [
118
].
What followed was a scientiﬁc debate regarding the extent of HLA-mediated vi-
ral imprinting at the population level. Was the data we previously accepted to be
strong evidence of immune-driven viral evolution simply explained by founder eﬀects?
Not necessarily. Not only did the Moore [
157
] and Bhattacharya [
17
] analyses fea-
ture diﬀerent HIV-1 gene regions sequenced from diﬀerent patient groups, but the
latter was based on a data set less than one-quarter the size of the former, and thus
had substantially reduced power to detect associations. Indeed, mathematical mod-
eling experiments suggest that the false-negative rate (i.e. the % of true associations
missed) may be over 80% in a data set of N=100 (
chapter 6
), and Bhattacharya et
al. themselves emphasized that their results should not be interpreted as evidence
that immune pressure is a weak force in HIV-1 evolution [
17
]. Rather, the results un-

25
derscore the importance of disentangling the eﬀects of immune selection from founder
eﬀects and emphasize the need for larger data sets to determine the extent to which
immune imprinting shapes HIV-1 diversity at the population level.
3.5
HLA-associated immune pressures inﬂuence population HIV diver-
sity at up to 40% of positions in some proteins
Soon after the Bhattacharya et al. publication [
17
], we applied the phylogenetically-
corrected methods to the ﬁrst large-scale, population-based analysis of HLA-mediated
imprinting on multiple HIV-1 genes [
24
]. Nearly 500 HLA-associated polymorphisms
in the HIV-1 Protease, Reverse Transcriptase, Vpr and Nef proteins were identiﬁed in
a cohort of ≈ 700 chronically-infected, antiretroviral na¨ıve individuals [
24
]. Polymor-
phisms were dichotomized based on the direction of selection pressure: escape and
reversion associations represented amino acids enriched in the presence or absence
of a speciﬁc HLA allele, respectively. As expected, escape and reversion associations
generally represented non-consensus and consensus residues, respectively, although
some exceptions were noted. In addition, HIV-1 codons under diametrically opposed
HLA selection pressures (where the escaped amino for one allele represented the re-
verted (immunologically susceptible) form for another, and vice versa) were identiﬁed,
highlighting a dynamic tug-of-war of immune selective pressures inﬂuencing HIV-1 di-
versity at the population level [
24
]. Of note, the fact that the majority of identiﬁed
associations fell outside the boundaries of known epitopes lead us to conlcude that
escape is not conﬁned to single point mutations selected within regions directly tar-
geted by CTL; rather, immune pressures select for a broad range of polymorphisms
on a protein- (and genome-) wide level. (As will be discussed in
section 3.8
and
chapter 6
, however, at least some of these association may be due to other sources of
confounding.)
In addition, substantial diﬀerences in the number of immune selection events were
observed among HIV proteins [
24
]. Nef exhibited the greatest evidence for immune

26
adaptation, with ≈ 40% of its codons harboring at least one HLA association. In
contrast, 10–15% of codons in Protease, Reverse Transcriptase, and Vpr exhibited
evidence for HLA-mediated selection. More recent work suggests that the portion of
Gag codons exhibiting HLA-associated substitutions is slightly higher than Pol [
25
].
Taken together, data indicate that amino acid variation at a minimum of 10–40% of
HIV-1 subtype B codons is driven by HLA class I-associated immune pressures, even
after correction for lineage eﬀects [
24
]. In addition, we later applied the same methods
to the entire clade C proteome on a cohort of 261 South Africans and found evidence
for 310 distinct HLA-mediated escape events spanning the entire proteome [
196
], thus
reconﬁrming that CTL immune selection substantially shapes HIV-1 diversity at the
population level.
3.6
Clinical consequences of immune-mediated evolution
Immune escape is believed to be a major factor limiting the immune system’s ability
to control HIV-1 in the long term [
85
]. However, with the exception of documented
loss of viremia control following escape within the B*27-restricted KK10 epitope in
Gag [
84
], a clear relationship between CTL escape and HIV-1 disease progression has
been diﬃcult to demonstrate. A multitude of factors inﬂuence the frequency and
kinetics of viral escape, making the clinical consequences of immune-driven HIV-1
evolution challenging to quantify. Even among individuals expressing the same HLA
allele, substantial diﬀerences in the magnitude and frequency of epitope targeting are
often observed, rendering it problematic to study the clinical consequences of escape
on a population basis. A second complication is the issue of immunodominance, an
incompletely understood phenomenon describing the fact that, despite the expression
of up to six HLA class I alleles (and the potential to target multiple epitopes per allele),
an individual’s CTL response is often initially directed against a single or few epitopes
with a single HLA restriction [
134
,
236
,
237
]. Moreover, the breadth and speciﬁcity
of targeted epitopes changes throughout the disease course. While the acute-phase

27
CTL response is generally narrowly directed, a progressive broadening of the epitopic
repertoire occurs over time, likely as a consequence of viral escape within the early
immunodominant epitopes [
83
]. In addition, the balance between increased ﬁtness due
to immune escape versus decreased ﬁtness from the resulting substitution [
137
], the
fact that escape is not always absolute (such that, in many cases the selected variant
retains partial HLA binding and/or CTL recognition capacity [
219
]), and the ability
of the immune response to adapt to a continuously evolving target (as demonstrated
by the emergence of de novo CTL responses against newly-selected escape variants
[
4
]), all contribute to the complexity of this issue. Finally, the fact that escape does
not occur in isolation must also be considered, because mutations are likely to be
selected in the context of a variety of compensatory changes and at the same time that
reversion of transmitted mutations is occurring. Taken together, it seems unlikely to
expect that the selection of any single mutation would be accompanied by a clinically
detectable impact on HIV-1 disease progression. Nevertheless, the proteome-wide
association study of Rousseau et al. [
196
] identiﬁed seven polymorphisms that each
signiﬁcantly predicted viral load, highlighting the role these mutations play in viremic
control.
In addition, the identiﬁcation of HLA-associated polymorphisms in large clinically-
derived data sets allows, for the ﬁrst time, the characterization of the relationship
between escape and markers of HIV-1 disease progression on a broader level. In their
2002 study, Moore et al. reported that the presence of HLA-associated polymorphisms
in RT predicted pre-treatment plasma virus load (pVL) on a population basis [
157
];
however, this observation was not conﬁrmed in the larger Brumme et al. [
24
] study
that included Protease and Reverse Transcriptase. More recent work has, however,
begun to identify classes of associations that contribute strongly to viremic control,
with the importance of B alleles, the Gag protein and reversion associations partic-
ularly highlighted (see
section 5.6
for details). Further research is needed to assess
whether there may be gene-speciﬁc diﬀerences in the contribution of CTL escape mu-

28
tations to markers of HIV disease, the answer to which may greatly inform vaccine
design.
3.7
Strategies to cope with viral diversity in HIV-1 vaccine design
Numerous strategies have been proposed to address the challenge of global sequence
diversity in HIV-1 vaccine design. Immunogens based on consensus or phylogenetic
reconstructions of ancestral and/or “center-of-tree” sequences attempt to minimize
genetic distance between the vaccine and circulating HIV-1 strains [
76
,
163
]. Polyva-
lent vaccine immunogens featuring maximal coverage of viral diversity and potential
epitopes in compact sequence space have also been proposed [
65
,
107
,
164
]. To ad-
dress the substantial mutational capacity of HIV-1, one proposed approach is to limit
vaccine design to immunogenic yet highly conserved regions, such that escape could
only occur at a substantial ﬁtness cost [
7
]. A complimentary strategy would be to de-
sign immunogens that incorporate both “wild-type” and “escaped” variants (as long
as the variant retains its ability to bind HLA), thereby blocking preferred routes of
escape in infected individuals [
20
].
The development of phylogenetically-informed methods to accurately identify vi-
ral sites under active immune selection pressure using large clinically-derived HIV-1
sequence data sets represents a major advancement to the study of how viral genomes
are shaped by human immunogenetic selection pressures. Achieving a complete pic-
ture of the sites, pathways and kinetics of immune escape will not only help us gain
an understanding of the extent to which host immunity shapes HIV-1 evolution, but
will also inform the rational design of future vaccine immunogens.
3.8
Remaining challenges
Despite recent advances in the ﬁeld, a genome-wide map of HLA-associated polymor-
phisms in HIV-1 remains far from complete. Although the importance of addressing
the confounding eﬀects of HIV-1 population structure are now recognized [
17
,
24
],

29
three important issues need to be addressed: statistical power, linkage disequilibrium
(LD) among HLA alleles and coevolution of amino acids in viral sequences. As we have
seen, the size of the study can dramatically eﬀect the conclusions that are drawn. To
date, the largest study analyzes a fairly homogenous cohort of approximately 700 in-
dividuals, which resulted in dramatically diﬀerent conclusions than studies on smaller
cohorts. We will discuss this issue in the context of our proposed model in
subsec-
tion 6.2.5
, but the results discussed thus far underscore the importance of exploring
these types of cross-sectional studies to even larger cohorts, as well as to assess other
HIV-1 subtypes, where levels and patterns of HLA-mediated imprinting may diﬀer.
Indeed, one advantage of lineage-corrected methods may be the ability to combine
heterogeneous cohorts to increase power, though the extent of shared epitopes among
clades remains incompletely characterized.
Of major concern is the potential confounding eﬀects of linkage disequilibrium
(LD) among HLA alleles. The concern over LD arises due to the fact that HLA
class I alleles are situated in close proximity on the human genome and are not
inherited independently [
28
]; therefore, an escape mutation driven by HLA-B*57, for
example, may also be detected as being associated with the tightly linked Cw*06
allele. Whereas Moore et al. [
157
] addressed the issue of LD using logistic regression
and a backwards elimination technique to identify the allele(s) that best explained the
escape polymorphism [
157
], current lineage-corrected methods [
17
,
24
] do not directly
address this issue.
Similarly, none of the current approaches account for the eﬀects of amino acid
coevolution, such as when an amino acid substitution at one site preferentially occurs
in the context of a secondary (or compensatory) mutation at another site [
21
,
201
].
Failure to account for this issue may result in both the primary and compensatory
mutations being identiﬁed as correlated with the restricting HLA allele, when in fact
only the former is directly selected. Although Moore et al. attempted to include
co-varying amino acids as regressors [
157
], the confounding eﬀects of phylogeny are

30
even greater when two amino acids share the same evolutionary history; indeed, a
substantial body of literature exists to address the problem of covarying amino acids
[
39
], though these methods have not previously been extended to incorporate HLA
allele information or even correlations among multiple codons (see Chapters
2
and
5
).
Similarly, although Bhattacharya et al. [
17
] found evidence of compensatory
mutations using a modiﬁed version of the evolutionary history reconstruction method
proposed by Ridley in 1983 [
190
], the results are diﬃcult to interpret due to the
confounding of long and short range eﬀects that occur when only pairs of variables
are considered. Clearly more comprehensive approaches are needed to disentangle the
complex interactions that govern HIV-1 escape pathways.

31
Chapter 4
PHYLOGENETIC DEPENDENCY NETWORKS
We have argued in Chapters
2
and
3
that the study of HIV adaptation (and, by
analogy, any study of adaptation) requires a framework that can model the evolu-
tionary history of an amino acid in conjunction with selection pressure from one or
more factors (HLA alleles, other amino acids, etc.). In this chapter, we describe the
phylogenetic dependency network (PDN), an extension of the dependency network
[
94
] that is well-suited for modeling adaptation. We will describe the PDN generally,
using the area of HLA-mediated HIV adaptation to illustrate the concepts.
A dependency network represents the probabilistic dependencies among a set of
predictor and target attributes [
94
]. In our domain, target traits, denoted Y, corre-
spond to the presence or absence of amino acids at all codons in an HIV protein. For
a given Y in Y, the predictor traits, denoted X, correspond to the presence or absence
of amino acids at all codons other than that for Y and the presence or absence of all
HLA alleles. We constrain all traits to be binary, though generalizations are possible.
We have found that this choice yields more statistical power in practice.
A dependency network (phylogenetically corrected or otherwise) has two compo-
nents. The ﬁrst component, sometimes referred to as the structure of a dependency
network, is a directed graph linking nodes, where each node corresponds to one of
the traits in the domain. (We use the same name—e.g., Y —for the trait and its
corresponding node in the graph.) An arc from X to Y in the graph is a statement
that the probability distribution for Y depends on X. Thus, in our domain, a depen-
dency network graphically depicts which HLA and codon traits predict each codon.
The second component is a collection of conditional or local probability distributions,

32
one for every target trait of interest. The local probability distribution for target
trait Y is P (Y | ˆ
X), where ˆ
X ⊆ X are the parents of Y in the graph. Therefore, in
our domain, a dependency network contains a probability distribution for each codon
trait conditioned on various HLA and codon traits. When constructing a dependency
network, each local probability distribution is learned independently. This approach
is computationally eﬃcient, although it can lead to a decrease in statistical eﬃciency
(see Discussion).
A phylogenetic dependency network (PDN) for our HIV application is a depen-
dency network in which each local probability distribution is corrected for the phylo-
genetic structure of the HIV sequences. That is, the probability that a codon in an
individual is a given amino acid depends on not only the traits ˆ
X, but also on where
that individual’s HIV sequence sits in the phylogeny (
Figure 4.1
). Speciﬁcally, a PDN
is a collection of the distributions P
Ψ
(Y | ˆ
X), one for each Y in Y, where P
Ψ
refers to
a distribution corrected for phylogeny.
We use a model-selection approach to identify ˆ
X, the set of parents for Y . Specif-
ically, we use signiﬁcance tests—False Discovery Rate (FDR) thresholds based on
likelihood-ratio tests (LRTs)—to determine ˆ
X. To avoid the inappropriate use of an
LRT, we exclude traits as possible predictors when the corresponding predictor-target
pair has a 2 × 2 contingency table that includes at least one bin where both the ob-
served and expected value is at most three. This parameter was chosen based on
performance with independent data (not shown).
4.1
Phylogenetically corrected distributions for one predictor trait
A simple approach for identifying a set of traits that predict a given codon (i.e., for
identifying the parents of a target trait in a PDN) is to test for pairwise correlations
between a target codon and each predictor trait. The details of a statistical model
that follows this approach, hereafter referred to as the univariate model, are described
in
section 4.4
and evaluated in
chapter 5
. The model was ﬁrst presented in [
17
], with

33
Figure 4.1: Phylogenetic dependency network (PDN). A PDN is a graphical model
consisting of target traits whose outcome is a probabilistic function of predictor traits.
Each of these probabilistic functions takes the phylogeny of the sequences into ac-
count. Here, the target traits (green nodes) are binary and represent the presence or
absence of amino acids at codons. These target traits may have dependencies on other
codons (codon covariation) and/or on HLA alleles (HLA-mediated escape), which are
denoted by blue nodes. Arcs represent the learned dependencies between target and
predictor traits. All target traits are assumed to be inﬂuenced by the phylogeny (red
arcs). The probability components of a PDN are the local conditional probabilities,
each of which relates a single target trait to the phylogeny and a subset of the pre-
dictor traits. These local conditional probabilities are learned independently for each
target trait. In the hypothetical example depicted here, B*57 and B*58 predict M1
and A*02 predicts A5. A5 predicts A3, and there is a cyclical dependency among
M1, G2, A3 and R4, in which most of the arcs are bidirectional.

34
the details described in [
31
]. We will provide a high level overview of the model here.
To determine whether there is a signiﬁcant pairwise correlation between predictor
trait X and target amino acid Y , we compare the likelihood of a null model (sometimes
referred to as the single variable model) that reﬂects the assertion that Y is under no
selection pressure to an alternative model that reﬂects the assertion that Y is under
selection pressure induced by a single predictor trait X. The null model assumes the
target codon Y can be described completely by a model of independent evolution along
a phylogenetic tree (
Figure 4.2
A). The leaves of the tree correspond to individuals in
the study and are typically observed. The interior nodes of the tree correspond to
unseen individuals infected by an HIV sequence that is a point of divergence. These
nodes are hidden—that is, never observed. We use Y
i
to denote trait Y for the ith
individual in the study (i = 1, . . . , N ). (Note that Y
i
is a variable in the ordinary
statistical sense.) Because each target trait is binary, a natural null model is the two-
state version of the continuous time Markov process, commonly used in phylogenetics
[
60
]. This model assumes evolution is independent between diﬀerent branches of the
phylogeny and that the only informative predictor of a node in the evolutionary tree
is its parent node in the tree.
The alternative model adds a component of selection pressure derived from the
predictor trait X (
Figure 4.2
B). We use variable X
i
to denote the trait X for the ith
individual in the study (i = 1, . . . , N ), and do not explicitly name X for the unseen
individuals represented in the interior of the phylogenetic tree. Because X may not
share the same evolutionary history as Y , we assume X inﬂuences Y only at the
leaves of the tree. In particular, we assume that, among the variables corresponding
to trait X, only X
i
inﬂuences Y
i
for each i. This assumption was evaluated more
fully by Carlson et al. [
31
] and found to be a reasonable approximation, even when
X and Y share the same evolutionary history. To model selection pressure at the
leaves, we extend the null model by adding a hidden trait H (with corresponding
variables H
i
, i = 1, . . . , H) that represents what Y would have been had there been

35
Figure 4.2: The univariate model. (A) The null model, in which an amino acid evolves
independently down the tree until it reaches a leaf. (B) The alternate model, in which
an amino acid evolves independently down the tree until is reaches an individual,
where it is inﬂuenced by selection pressure from the predictor. The variable H
i
for
the ith individual represents the variable Y
i
had there been no inﬂuence from X
i
. Only
the Y
i
and X
i
are observed. Conditional probability distributions are not shown.
no selection pressure. The probability distribution for Y
i
then depends on H
i
and X
i
.
When the values of H
i
and Y
i
are diﬀerent, we say that a transition conditioned on
X
i
has taken place. The precise rules governing the transitions conditioned on X
i
are
given by the univariate leaf distribution P
Ψ
(Y |X) = P (Y
i
|H
i
, X
i
). We assume that
this leaf distribution is not a function of i—that is, this distribution is the same for
each individual i = 1, . . . , N . Also note that the subscript Ψ is a reminder that Y
i
depends not only on X
i
, but also on the phylogeny through variable H
i
.
In the univariate case, we deﬁne four possible leaf distributions. Escape means an
individual may transition to Y
i
= 0 only when X
i
= 1. Reversion means an individual

36
may transition to Y
i
= 1 only when X
i
= 0. Attraction means an individual may
transition to Y
i
= 1 only when X
i
= 1. Repulsion means an individual may transition
to Y
i
= 0 only when X
i
= 0. Given a univariate leaf distribution, a single parameter
s speciﬁes the probability that the transition occurs given the appropriate state of
X
i
. Note that attraction/repulsion correspond to a positive correlation between X
i
and Y
i
, whereas escape/reversion correspond to a negative correlation.
The names of these leaf distributions correspond to various processes for selection
pressure [
31
]. For example, the B*57-restricted CTL response selects for escape from
the susceptible threonine at position 242 of the HIV Gag protein [
132
]. So, from the
perspective of hidden and target traits that correspond to the presence and absence
of threonine, the amino acid can transition from threonine to not threonine (H = 1,
Y = 0) with a non-zero probability only when the individual has the B*57 allele
(X = 1), which corresponds to the escape distribution just described. In addition,
escape from threonine bears a ﬁtness cost that leads to reversion in B*57-negative
individuals [
132
]. Consequently, the amino acid can transition from not threonine to
threonine (H = 0, Y = 1) with non-zero probability only when the individual lacks
B*57 (X = 0), corresponding to the reversion distribution. The codon for threonine
usually escapes to the resistant amino acid asparagine [
132
]. Continuing the exam-
ple from the perspective of hidden and target traits that correspond to the presence
and absence of asparagine, the amino acid can transition from not asparagine to as-
paragine (H = 0, Y = 1) only when the individual has the B*57 allele (X = 1), which
corresponds to the attraction distribution. Finally, the amino acid can transition from
asparagine to not asparagine (H = 1, Y = 0) only when the individual lacks the B*57
allele (X = 0), which corresponds to the repulsion distribution. Although there is a
natural pairing between escape/reversion and attraction/repulsion, in that the former
indicates a negative correlation and the latter a positive correlation, the processes are
each distinct and may provide information as to the underlying mechanism (see the
section on distinguishing leaf distributions in Results). Furthermore, whereas the vast

37
majority of clinically-derived HIV sequences have either threonine or asparagine at
codon 242, most codons are more variable, with more than one amino acid suscepti-
ble to, or resistant from, CTL pressure mediated by the HLA allele. Consequently,
escape/attraction and reversion/repulsion for alternate amino acids often provide ad-
ditional information. Note that by restricting the univariate leaf distribution to one
of these four forms, we have assumed that only one process (escape, reversion, at-
traction, or repulsion) is occurring for a given predictor-target pair. Although in
reality both escape and reversion (or attraction and repulsion) may occur with the
same HLA-epitope combination, relaxing our assumption leads to substantial loss of
power. Thus, we apply each of the four leaf distributions to the predictor-target pair
and include only the most signiﬁcant correlation in the model.
4.2
Phylogenetically corrected distributions for more than one predic-
tor trait
The univariate model works well when there are no correlations among predictor traits
or among target traits [
31
]. As discussed, however, use of the model in the presence of
linkage disequilibrium among HLA alleles and HIV codon covariation will likely lead
to spurious associations. To avoid this problem, we use a multivariate model [
33
], in
which more than one trait can be used to predict a particular target trait. In this
model, for a given target trait Y , shown in
Figure 4.3
, the target trait is allowed to
evolve independently down the tree until it reaches a leaf in the tree corresponding
to an individual in the study. At this point, selection pressure within the individual
is governed by a multivariate leaf distribution, denoted P
Ψ
(Y | ˆ
X), which depends on
multiple predictor traits ˆ
X. As in the univariate case, this leaf distribution is the
same for each individual i = 1, . . . , N .
The set of signiﬁcant predictor traits can be identiﬁed by a number of meth-
ods including forward, backward, and forward/backward selection. In this work, we
concentrate on forward selection, wherein ˆ
X is iteratively augmented with the most

38

X

Hidden attribute

H

Observed target attribute

Y

Predictor attribute

X

Figure 4.3: The multivariate model. Here, an amino acid evolves independently down
the tree until is reaches an individual, where it is inﬂuenced by one or more predictor
traits.
signiﬁcantly associated trait at each iteration. For each added trait, we record only
the most signiﬁcant leaf distribution (escape, reversion, attraction, or repulsion). The
signiﬁcance of a predictor X with respect to target trait Y is computed using false
discovery rates based on an LRT in which both the null and alternative models are
conditioned on all signiﬁcant predictors that were identiﬁed in previous iterations of
forward selection. For practical purposes, we terminate forward selection when the
most signiﬁcant association has a p-value greater than or equal to some threshold to
be described.
There any many possibilities for the form of the multivariate leaf distribution
P
Ψ
(Y | ˆ
X). In this paper, we consider two distributions: Decision Tree and Noisy
Add.

39
Decision Tree
A straightforward way to represent the multivariate leaf distribution P
Ψ
(Y | ˆ
X) is to
list the probability distribution for Y given every possible instance of the traits H
and ˆ
X. Unfortunately, the length of this list grows exponentially with the number
of predictor traits. An alternative is to use a Decision Tree, which is a compact
representation of such a list. The use of the Decision Tree as a multivariate leaf
distribution was recently employed by Matthews et al. [
153
] to account for HLA LD.
Here, we describe the approach in some detail.
A graphical depiction of the Decision Tree leaf distribution is shown in
Figure 4.4
.
Note that this tree should not be confused with the phylogenetic tree. To help avoid
this confusion, we use the term tip to refer to the bottom points on the Decision Tree.
Each path in the tree from root to tip deﬁnes a particular instance of a subset of the
traits ˆ
X, which in turn deﬁnes a conditioning event for the distribution of the target
trait. For example, in
Figure 4.4
, we consider the set of predictor traits ˆ
X = (B57,
C06, M28), with each branch labeled 0 or 1. The path that follows the value 0 for the
trait B57, the value 0 for the trait C06, and the value 1 for the trait M28 corresponds
to the instance (B57 = 0, C06 = 0, M28 = 1)—that is, the individual has M28 but
not B57 or C06. At the tip of this path sits the corresponding conditional probability
distribution P
Ψ
(Y
i
|B57 = 0, C06 = 0, M28 = 1). In general, each tip k in the Decision
Tree is associated with the conditional distribution P
Ψ
(Y
i
| ˆ
X = path
k
), where path
k
is the conditioning event corresponding to the kth path. The collection of these
conditional distributions over all tips constitutes the multivariate leaf distribution.
A Decision Tree leaf distribution can be constructed in many ways. As mentioned,
we use forward greedy search. First, we initialize the tree to a single root node, which
is simply the univariate leaf distribution for the most signiﬁcant trait. We then grow
the tree iteratively. At each iteration, we consider extending (or splitting) a tip node
k on some trait not already in the path to the tip. When splitting tip node k on

40

B57
C06
M28
Predictor variable X
Local probability distribution
1
0
1
0
1
0
Figure 4.4: Decision Tree leaf distribution. Each path from root to leaf yields a
distinct local probability distribution.
an trait X, the node is replaced with two branches and two corresponding tip nodes.
The left and right branches correspond to adding X = 1 and X = 0, respectively,
to the conditioning event associated with the original tip node. The split is made
if the resulting local distribution is a signiﬁcantly better estimate than that prior to
the split, as measured by an LRT. The LRT is computed using the univariate model
applied to those individuals whose trait values match those described by path
k
. To
make the process more eﬃcient in our HIV application, we consider splitting the tip
node only under the path X = 0 for all X in ˆ
X. That is, we repeatedly apply the
univariate model to all individuals for whom X = 0 for all the previously identiﬁed
signiﬁcant predictor traits. We iterate this process until no signiﬁcant predictors are
found, using a threshold of p < 0.05.
Noisy Add
One drawback of the Decision Tree approach is that, as the tree grows, the number of
samples that we use to test for the next split decreases. Rather than consider smaller
and smaller subsets of the data, the Noisy Add leaf distribution models selection pres-
sure as an additive process among the predictor traits. That is, the Noisy Add leaf

41
distribution is based on the assumption that each predictor trait independently con-
tributes a positive or negative selection pressure on the target trait. These pressures
then sum to determine the value of the target trait.
In the univariate case, each leaf distribution can be seen as representing three mu-
tually exclusive and exhaustive events (for each individual): (1) the selection pressure
is absent, either because the state of the predictor trait excludes selection pressure or,
with probability 1 − s, no transition occurred despite the potential for selection pres-
sure; (2) selection pressure leads to Y
i
= 1 (attraction or reversion); or (3) selection
pressure leads to Y
i
= 0 (escape or repulsion). We can represent these three possible
events by a hidden trait I that takes on the values 0, 1, and −1, respectively. Given
a set of M predictor traits, we can associate a hidden variable I
j
i
for the jth trait
in the ith individual. Then, assuming that selection pressure across the predictor
traits contributes independently and equally to the outcome of Y
i
, we can determine
the outcome of Y
i
by summing the values of the I
j
i
variables: Σ
i
= I
1
i
+ · · · I
M
i
. If
Σ
i
is 0, then it is as if no selection occurred. If Σ
i
< 0, then negative selection (es-
cape/repulsion) has occurred, and the target variable Y
i
will be zero. If Σ
i
> 0, then
positive selection (attraction/reversion) has occurred, and the target variable will be
one. Of course, we don’t know the actual values of I
j
for each predictor variable,
so we must sum over the possibilities, resulting in a probability distribution over Σ
i
.
The strength or frequency of selection pressure contributed by each predictor trait
j is captured by the parameter s
j
. Like the corresponding parameter s in the uni-
variate model, s
j
is the probability that the predictor trait exerts selection pressure
(I
j
i
= 0), given the appropriate state for the predictor trait. A more precise deﬁnition
of Noisy Add, including the generalization from the univariate model, speciﬁcs of
learning the parameters s
j
, and methods for reducing computation time can be found
in the section on model details.
The contribution of a given predictor trait X
j
∈ ˆ
X as a predictor of target Y
is quantiﬁed using an LRT against the null model consisting of ˆ
X − X
j
. The most

42
signiﬁcant predictor trait is added to the Noisy Add model on each iteration, stopping
when the most signiﬁcant predictor fails to achieve p < 0.005. (We use a more aggres-
sive threshold than that for Decision Tree because Noisy Add is more computationally
intensive.)
4.3
q-values
We identify signiﬁcance using q-values [
211
], which conservatively estimate the false
discovery rate (FDR) [
15
] for each p-value. The FDR is deﬁned to be the expected
proportion of false positives among results called signiﬁcant at a given threshold t.
The q-value of t is the minimum FDR observed for all t ≥ t [
211
]. Following Storey
and Tibshirani [
211
], we use the approximation
FDR(t) = E
F (t)
S(t)
≈
E [F (t)]
E [S(t)]
,
(4.1)
where S(t) is the number of associations called signiﬁcant at t and F (t) is the number
of true nulls (false positives) at t. To estimate the numerator, we order the p-values
of the association tests in increasing order p
1
, . . . , p
m
and use the approximation
E [S(p
i
)] ≈ S(p
i
) = i. To compute E [F (t)], Storey and Tibshirani point out that
uniformity of p-values allows the approximation
E [F (p
i
)] ≈ ˆ
π
0
p
i
m
(4.2)
where ˆ
π
0
is a (conservative) estimate of the proportion of all hypotheses that are truly
null. In our case, we assume a priori that the vast majority of the many hypotheses
tested will be null (i.e., most codons and HLA alleles have no direct eﬀect on a given
target trait), and so conservatively set ˆ
π
0
= 1.
It should be pointed out that
E [F (p
i
)] = ˆ
π
0
p
i
m
(4.3)
is only guaranteed if the data are continuous and the p-values are uniformly dis-
tributed under the null hypothesis. The continuous assumption is certainly not met

43
for genetic data, which is discrete, and the p-values are quite often not uniformly
distributed. Thus, E [F (p)] often has to be estimated directly from the data. We will
discuss this in more detail in sections
5.1.2
and
6.1.1
.
4.4
Model Details
In this section, we provide details regarding the univariate and Noisy Add models, in
addition to a brief discussion on computational requirements for the models.
4.4.1
Details of univariate model
First, let us consider the null model. Consider target trait Y that denotes the presence
(Y = 1) or absence (Y = 0) of a particular amino acid at a particular codon. We
use variable Y
i
, i = 1, . . . , N to denote the trait Y for the ith individual in the study.
(We use corresponding notation for predictor traits and variables.) It is quite common
to assume that the variables Y
1
, . . . , Y
N
are independent and identically distributed
(IID). In our application, however, the variables are related through a phylogenetic
tree. We can model these relationships using a probabilistic phylogenetic model as
shown in
Figure 4.2
A. Nodes at the leaves of the tree, labeled Y
1
, . . . , Y
N
correspond
to the variables with the same name. (In general, we will use the same designation for
both a variable and its node.) Unlabeled nodes in the interior of the tree correspond
to events of divergence. We use Ψ to denote the structure (branchings and branch
lengths) of the tree.
Associated with each variable (or node) B in this phylogenetic tree is a condi-
tional probability distribution P (B|A), where A is the parent node of B. As in the
probabilistic model of Felsenstein [
60
] for a phylogenetic tree, we assume that the con-
ditional probability table is described by a continuous time Markov process (CTMP)
and parameterized by θ = (π, λ), where π is the stationary distribution of Y = 1 and
λ is the rate of mutation. The conditional probability table of the CTMP from parent

44
node A to child node B along a branch of length d is given by
P (B = b|A = a, d) =



e
−λd
+ π
b
· (1 − e
−λd
) if a = b
π
b
· (1 − e
−λd
)
if a = b.
(4.4)
where π
b
= π when b = 1, and π
b
= 1 − π when b = 0. This evolution model is
reversible, making the choice of root in the tree arbitrary [
60
].
Given a set of observations for (typically, all of) Y
1
, . . . , Y
N
, there are several
criteria that can be used to identify good values for the parameters π and λ and
the structure Ψ of this model (or, in the Bayesian case, a distribution over these
quantities). For this and all models discussed in this paper, we choose parameters and
structure using the maximum likelihood criterion, as is done in (e.g.) [
60
]. There are
a number of methods for identifying the maximum-likelihood parameters, including
gradient decent and the Expectation-Maximization (EM) algorithm. In this work, we
use the EM algorithm [
49
] to learn θ. To learn the structure Ψ, we apply PhyML to
the nucleotide sequences using the general time reversible GTR model with all other
parameters estimated from the data [
87
].
We denote this null model P
Ψ
(Y |θ), as it represents a phylogenetically corrected
distribution for Y . Note that this model includes the situation where the observations
of Y
1
, . . . , Y
N
are IID as a special case (i.e., the limit as λ tends to inﬁnity.)
Now let us consider the alternative model, which reﬂects the assumption that a
codon is under selection pressure induced by a single predictor trait X. To construct
this model, shown in
Figure 4.2
B, we begin with the null model and ﬁrst change each Y
i
to H
i
, which represents what Y
i
would have been had there been no inﬂuence from X
i
.
Then, we assume that, for each individual i, the probability distribution for Y
i
depends
on X
i
and H
i
. Further, we assume that these conditional distributions P (Y
i
|H
i
, X
i
)
are the same for each individual i, and collectively denote them by P
ψ
(Y |X). In
general, this univariate leaf distribution can have four parameters corresponding to
the four states of the conditional variables H
i
and X
i
. In our experience, however, use
of such a distribution leads to loss of power. Consequently, we consider four separate

45
distributions (as was previously deﬁned [
31
]) and, for any given association, choose
the one that best ﬁts the data:
Escape P (Y
i
= 0|H
i
= 1, X
i
= 1) = s > 0; P (Y
i
= 1|H
i
= 0, X
i
= 1) = 0;
P (Y
i
= a|H
i
= a, X
i
= 0) = 1. That is, H
i
and Y
i
can be in diﬀerent states
only when H
i
= 1 and X
i
= 1.
Reversion P (Y
i
= 1|H
i
= 0, X
i
= 0) = s > 0; P (Y
i
= 0|H
i
= 1, X
i
= 0) = 0;
P (Y
i
= a|H
i
= a, X
i
= 1) = 1. That is, H
i
and Y
i
can be in diﬀerent states
only when H
i
= 0 and X
i
= 0.
Attraction P (Y
i
= 1|H
i
= 0, X
i
= 1) = s > 0; P (Y
i
= 0|H
i
= 1, X
i
= 1) = 0;
P (Y
i
= a|H
i
= a, X
i
= 0) = 1. That is, H
i
and Y
i
can be in diﬀerent states
only when H
i
= 0 and X
i
= 1.
Repulsion P (Y
i
= 0|H
i
= 1, X
i
= 0) = s > 0; P (Y
i
= 1|H
i
= 0, X
i
= 0) = 0;
P (Y
i
= a|H
i
= a, X
i
= 1) = 1. That is, H
i
and Y
i
can be in diﬀerent states
only when H
i
= 1 and X
i
= 0.
This model is reversible in the sense that the choice of root node among non-leaf
nodes does not aﬀect the likelihood of the data. We also note that, in principle,
all parameters θ=(π, λ, s) and the structure Ψ can be optimized simultaneously.
In practice, however, we ﬁnd that using the structure Ψ learned in the absence of
information about X works well, and is computationally more eﬃcient. In addition,
it may seem counter-intuitive that the HLA alleles of the individuals corresponding to
the interior nodes of the phylogeny are not being taken into account. A path from one
node to the next in the phylogeny, however, presumably reﬂects a series of infections
over many individuals, some who will have the allele and some who will not. Thus,
there will be some net evolution, which we account for by optimizing the parameters
π and λ for each codon individually. Finally, we note that this model can be thought

46
of as a (discrete) mixed-eﬀects model, wherein the predictor variables X
i
correspond
to the ﬁxed eﬀects and the hidden variables H
i
correspond to the random eﬀects [
40
].
Rather than being related by (e.g.) a Gaussian covariance matrix, the random eﬀects
are related by a phylogenetic tree.
Both the null and alternative models are instances of what is known as a generative
or directed acyclic graphical (DAG) model. In general, a generative model consists
of a structure, a directed acyclic graph, in which nodes correspond to variables and
missing arcs specify conditional independencies among the variables, and a set of
conditional probability distributions, one distribution for each node. The conditional
probability distribution for a given node is the distribution of the node given its
parents. The conditional independencies speciﬁed by the structure of the graph allow
the joint distribution of the data to be written as the product over the nodes of their
conditional distributions. The independencies represented by the model facilitate
computationally eﬃcient inference, parameter estimation, and structure learning [
93
].
Importantly, given a set of parameters learned from real data, synthetic data can be
easily generated from the model.
When constructing PDNs, we separately learn
a DAG model to encode each local probability distribution. As mentioned in the
Discussion, however, one can restrict the arcs in a PDN to be acyclic, thus resulting
in a single (phylogenetic) DAG model for all the traits in the data set.
In the following section, we consider the multiple-predictor case and again use
graphical models to represent phylogenetically corrected distributions. As we shall
see, the computational eﬃciencies aﬀorded by graphical models will play an even more
important role.
4.4.2
Details of Noisy Add model
To understand the Noisy Add leaf distribution, let us recast the univariate distribu-
tion as the generative process shown in
Figure 4.5
A. (Recall that this distribution is
independent of i. In the text that follows, we describe this and the generalized process

47
for an arbitrary individual i. In the corresponding ﬁgures, we drop the subscript i to
simplify the notation.) If X
i
= 1 (for escape or attraction; X
i
= 0 for reversion or
repulsion), a coin weighted with probability s for heads is ﬂipped. If the coin lands
heads, then the intermediate variable I
i
gets the value 1 (for attraction or repulsion;
-1 for escape and reversion). Otherwise, I
i
gets the value 0, corresponding to no selec-
tion pressure. The value of I
i
is then copied to the value of another variable Σ
i
. (The
copy is not necessary here, but will help us generalize.) Finally, the target variable Y
i
is assigned a value based on the deterministic function shown in
Figure 4.5
B. With a
little checking, it can be seen that this process produces precisely the univariate leaf
distributions for escape, reversion, attraction, and repulsion.
The generalization of this process to multiple predictor variables ˆ
X
i
= X
1
i
, . . . , X
M
i
is shown in
Figure 4.5
C. Here, there is an X
j
i
and I
j
i
node for each predictor variable
X
j
i
. The weight on the coin is possibly diﬀerent for each predictor variable. We use
s
j
to denote the weight for predictor variable X
j
i
, and s to represent the collection of
parameters (s
1
, . . . , s
M
). The variable Σ
i
is now a sum of the intermediate variables
I
1
i
, . . . , I
M
i
. Finally, as in the univariate case, Y
i
is a deterministic function of Σ
i
and
H
i
as given in
Figure 4.5
B.
Applying this generative process to individuals i = 1, . . . , N , we obtain the con-
ditional distribution P (Y
1
, . . . , Y
N
, H
1
, . . . , H
N
, I
1
1
, . . . , I
M
N
|X
1
1
, . . . , X
M
N
, θ), where θ =
(s, π, λ) are the parameters of the model. Maximum likelihood values for these param-
eters can be inferred eﬃciently. The summation Σ
i
= I
1
i
+ . . . + I
M
i
can be grouped as
Σ
i
= (((I
1
i
+ I
2
i
) + I
3
i
) + . . . + I
M
i
), yielding the graphical model shown in
Figure 4.5
D.
This grouping makes it possible to compute the distribution for Y
i
for any instance
of the variables ˆ
X
i
and H
i
in time that is quadratic in M . Furthermore, given any
instance of the predictor variables ˆ
X
i
, H
i
, and Y
i
, the probability distributions for
I
1
i
, . . . , I
M
i
can be computed in time that is quadratic in M . Consequently, we can
use the EM algorithm to estimate the parameters s eﬃciently. To estimate the full
set of Noisy Add parameters θ, we embed this estimation procedure within an outer

48

Σ
H
Y
0
0
0
0
1
1
>0
0
1
>0
1
1
<0
1
0
<0
0
0

Download 4,8 Kb.

Do'stlaringiz bilan baham:

1 2 3 4 5 6 7 8 9 ... 12