Phylogenetic dependency networks: Inferring patterns of adaptation in HIV
Figure 7.2: Number of associations in optimal epitopes as a function of q-value rank.

additionally had direct associations. It is therefore not surprising that M_PL, which fails to account for codon-codon covariation, identified escape mutations within almost as many optimal epitopes as the full model M_PLC (Figure 7.2). We further compared the HLA-codon associations of the other three models to the optimal epitopes. Only M_L and M_FET, which fail to account for both phylogeny and codon covariation and are thus quite prone to founder effects [17], performed significantly worse than the other models (p < 0.0001). The models that roughly account for clade differences, either through codon covariation (M_LC) or phylogeny (M_P), performed slightly worse than the full model, though these differences were not significant.

7.3 Discussion

We have presented the first approach to simultaneously account for viral phylogeny, codon covariation, and HLA linkage disequilibrium in population-based association studies; it is also the first large-scale multiclade analysis of HLA-mediated escape in HIV-1. The large number of direct HLA-codon associations confirms a substantial role of the HLA-restricted CTL response in driving HIV evolution, and supports the observation that patterns of HIV evolution are broadly predictable based on host immunogenetic profiles [2, 17, 24, 25, 157, 196]. Moreover, our results demonstrate that escape and reversion mutations often arise in the context of a complex set of correlated substitutions that reflect both compensatory mutations and dependencies among epitopes. On the whole, the phylogenetic dependency network predicts that a major proportion of p17 (41%) and p24 (20%) codons are under selective pressure from at least one HLA allele, a result that confirms a dominant role of T-cell responses in driving viral evolution [2, 32, 85].

This study also represents a significant step forward by providing a statistical approach that can help differentiate direct (H → A) HLA escape polymorphisms from indirect or, more specifically, one-hop (H → B) escape polymorphisms in situations where the true interaction is the chain H → A → B. Although the direct-indirect distinction can arise under several mechanisms, the explicit statistical interpretation is as follows: a direct HLA-polymorphism association H → A means the HLA allele H is a strong predictor of the polymorphism A, whereas an indirect HLA-polymorphism association H → A → B means the polymorphism B is better predicted by the polymorphism A than by the HLA allele H. Although B is in a sense HLA-associated, the distinction between direct and indirect associations may have important biological implications. For example, many of the indirect associations identified by the dependency network for the B*57-restricted TW10 and B*27-restricted KK10 epitopes are consistent with known compensatory mutations associated with escape in these epitopes [21, 201]. In addition to these described pathways, the dependency network reports a number of covarying amino acids.
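To make the direct-versus-one-hop interpretation above concrete, the following minimal sketch simulates a chain H → A → B and asks which variable better predicts B. It is illustrative only: it ignores the phylogeny and HLA linkage disequilibrium, uses made-up rates, and is not the dissertation's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulate a chain H -> A -> B: escape A is driven by the HLA allele H,
# and the downstream polymorphism B is driven by A, not directly by H.
H = rng.binomial(1, 0.3, n)                       # host carries the HLA allele
A = rng.binomial(1, np.where(H == 1, 0.7, 0.1))   # escape polymorphism
B = rng.binomial(1, np.where(A == 1, 0.8, 0.05))  # covarying polymorphism

def bernoulli_loglik(target, predictor):
    """Log-likelihood of `target` with a separate maximum-likelihood
    Bernoulli rate for each value of the binary `predictor`."""
    ll = 0.0
    for v in (0, 1):
        y = target[predictor == v]
        p = np.clip(y.mean(), 1e-6, 1 - 1e-6)
        ll += np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return ll

# The signature of a one-hop association: B is better predicted by A than by H.
print("log L(B | H) =", round(bernoulli_loglik(B, H), 1))
print("log L(B | A) =", round(bernoulli_loglik(B, A), 1))
```

In this setting the likelihood of B given A is substantially higher than that given H, which is roughly the pattern described above for labeling the H → B association as one-hop rather than direct.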
Many of these covarying amino acids are in close physical contact, and thus likely candidates for compensatory pathways that can be tested via in vitro replication capacity assays, although distal covarying codons may also exhibit compensatory relationships [38, 142, 179, 232, 242]. Understanding the specifics of compensatory-based covariation has important implications for T-cell-based vaccine design, as escapes that require multiple compensatory mutations may take longer to arise due to chance, and the compensatory mutations may not completely recover lost fitness [21, 85, 200, 201].

Compensation is not the only potential causal mechanism of codon covariation. Other mechanisms include those associated with CTL-mediated covariation. Indeed, the PDN indicates that up to 50% of the observed codon-codon covariation occurs between epitopes restricted by the same HLA allele, suggesting much of the observed codon covariation in HIV is CTL-mediated. Two possible mechanisms of CTL-mediated covariation are inter-patient variability in the immune system's ability to target epitopes and consistent patterns of epitope targeting due to immunodominance. Distinguishing between these two mechanisms may have direct relevance to vaccine design, but will require comparing the results of the PDN to clinical response data that can measure epitope targeting and to longitudinal samples that can identify the order of escape. Although it is well known that the order in which the epitopes of some HLA alleles are targeted is broadly consistent [7, 18, 23], identifying new patterns may yield new vaccine candidates. Specifically, it is possible that HLA alleles that are currently considered non-protective target ineffective dominant epitopes during acute infection. Redefining the immunodominance hierarchy via immunogen exposure may thus increase the effectiveness of these alleles upon subsequent HIV challenge [73].

A major challenge to vaccine design is global HIV diversity [76, 143]. Although accumulating evidence suggests that patterns of escape are broadly predictable [24, 25, 32, 157, 174, 196], these studies have been limited to relatively small sample sizes or cohorts consisting predominantly of a single clade. Although a comparison of the Durban and British Columbia results showed instances of both consistency and divergence of associated escape in the two clades [196], these studies were run separately, did not account for codon covariation, and used different methods for determining associations. Thus, the extent to which escape pathways are shared across clades was largely unknown. Our results, which reflect data equally distributed between clade B and clade C sequences and are evaluated by taking HLA LD, viral lineage, and codon covariation into account, confirm the existence of common escape pathways. This similarity suggests that a broadly reactive vaccine may be possible, though more work to further characterize inter-clade similarities and differences will be necessary.

Despite the broad similarities seen between clades B and C, we noted several intriguing examples where the resistant form of an epitope matched the consensus sequence for one of the clades.
Such examples support the HLA footprinting hypothesis [131, 157], which proposes that the consensus sequences of circulating strains in a population are a result of consistent escape (and lack of reversion) from the most common HLA alleles in that population, a hypothesis that is especially well founded in cases where the consensus polymorphisms differ between populations. For example, 53% of individuals in the British Columbia cohort [25] have A*01, whereas only 24% of the Durban cohort [116, 196] have A*01, and F79 (clade B consensus) is the resistant form of the association. Furthermore, alleles A*29 and A*68 have higher frequencies in the South African cohort, and Y79 (clade C consensus) is the resistant form of their associations. Thus, at codon 79, there appears to be broad selection pressure for evolutionary fixation of F in the British Columbian cohort and fixation of Y in the South African cohort. Our analyses identified a total of 21 codons (four with independent experimental support [70]) where the predicted escape matched clade B or C consensus, adding support to the hypothesis that CTL pressure serves a broad, population-level role in shaping HIV evolution, and may even serve a key role in clade differentiation [157].

We have focused this study on the highly immunogenic Gag p17 and p24 proteins, which are believed to serve a key role in effective control of HIV [56, 77, 116, 243]. Moving forward, it will be important to extend such studies to full-length genomes, where patterns of covariation may reflect sites of protein-protein interaction [39, 88, 224] and may further reveal broad patterns of immunodominance. Furthermore, as the number and diversity of large cohorts of HIV-infected, HLA-typed individuals continue to grow [25, 116, 157, 174, 196], it will be important to combine data sets in order to increase statistical power and further detail the similarities and differences among clades that may inform broad-coverage immunogen design.

One limitation of our two-clade study is that, because the HLA data in the HOMER cohort had only 2-digit resolution, we truncated the HLA data in the Durban cohort to 2-digit types as well. Although closely related HLA alleles often target the same or similar epitopes [70], making 2-digit resolution an appropriate choice for some allele-epitope pairs, important differences do exist. An example is the distinction between B*5801, which is associated with effective viral control, and B*5802, which is associated with poor viral control [115, 162]. In cases where the prevalence of 4-digit resolution types differs substantially between cohorts (as is the case with B*58) and the 4-digit types target different epitopes, truncation to 2-digit types before combining cohorts will lead to confounding in which the 2-digit types from one cohort will tend to lead to escape whereas the 2-digit types from another cohort do not. In principle, a better approach would be to include both 2-digit and 4-digit HLA alleles (or any other grouping of alleles) as predictor attributes in our model. For example, if all B*57 alleles select for the same escape mutation, then B*57 would be chosen by the model as a stronger predictor than B*5701, whereas escape mutations selected only by B*5701 would lead to the 4-digit allele being chosen. Of course, these facts should encourage researchers to perform high-resolution typing on individuals in their cohorts. In addition, Listgarten et al. [135] have developed a statistical approach for inferring high-resolution HLA alleles from low-resolution haplotypes. Although incorporating the uncertainty of those predictions into the PDN is beyond the scope of this work, the ability to infer high-resolution HLA data will allow for more effective evaluation of large, multi-cohort studies.

The comparative method has long been used to generate hypotheses regarding traits and the environment [92, 148, 149]. Because (quasi-)species share a common history, the inherent population structure (in this case, the phylogeny) must be accounted for [61], and numerous methods that do so have been proposed (e.g., [64, 92, 148] and references therein). Our study on HLA immune escape mutations suggests, however, that simply accounting for population structure is not enough, as HLA linkage disequilibrium (structure among environmental predictors) and codon covariation (structure among target traits) are at least as important as phylogeny in both increasing statistical power and avoiding false positives.

This issue is relevant to applications beyond those studied here. Specifically, whenever chains of interactions are common, pairwise methods will tend to identify direct as well as indirect correlations. This effect was most dramatically seen in the synthetic codon-covariation tests, in which a logistic regression-like approach (which accounts for chains) dramatically outperformed the phylogenetic pairwise approach. Although many phylogeny-aware comparative methods have been developed for codon covariation [39], the problem of chains of interactions has only recently been addressed [45, 182]. The PDN provides an efficient framework in which chains of interactions can be identified in the context of both the phylogeny and confounding from external sources of selection pressure (here, the HLA-mediated CTL response).

The first approach to identifying chains of interactions in a phylogenetic context was recently provided by Poon et al. [182]. They employed a directed acyclic graphical (DAG) model rather than a dependency network. In a DAG model, arcs from predictor to target attributes form a directed acyclic graph and local distributions take the same form as in a PDN. (A DAG model is often referred to as a Bayesian network, although the latter name is misleading, as non-Bayesian procedures can be used to construct DAG models.) When learning the distributions in the DAG model, Poon et al. took phylogeny into account, although in a way different from our approach. In particular, when learning the distribution of an attribute given its parents in the DAG model, they imputed for each individual the value of the attribute corresponding to the ancestor of that individual in the phylogeny. These imputed values were then treated as observed data and fed to a standard DAG model structure learning algorithm.

The PDN provides an alternative approach that leverages the strengths of dependency networks. The most apparent difference is that dependency networks allow cycles, resulting in a network that is easier for the non-expert to interpret than is the DAG model [94]. In addition, Poon et al. used unrestricted local distributions, in contrast to our use of Noisy Add. The use of Noisy Add, where the number of parameters is linear in the number of parents rather than exponential, results in a substantial increase in power.
Finally, because the PDN is concerned only with local probabilities, only the target variable is conditioned on the phylogeny, allowing the PDN to efficiently model associations with attributes, such as HLA alleles, that are not expected to follow the phylogeny, as well as attributes, such as other codons, that are expected to follow the same phylogeny [31]. The result is an efficient method that can simultaneously incorporate a diverse range of selection pressure attributes.

One drawback of a dependency network relative to a DAG model is that the local distributions among the target attributes overlap and yet are learned independently. (For example, the local distribution for A given B and the local distribution for B given A are closely related, yet are learned independently.) This independent learning leads to a decrease in statistical efficiency. In practice, however, this decrease is typically minimal [94]. Another drawback of a dependency network is that the presence of cycles makes inference of the joint distribution cumbersome, requiring an inefficient modified Gibbs sampling procedure to estimate the joint likelihood [94]. One possible solution is to modify the method for constructing a PDN so that it yields a DAG model. In particular, we can choose a random ordering for the attributes, and then build a PDN wherein the allowed predictors of a target attribute are only those that precede the target attribute in the ordering. The resulting collection of local probability distributions defines a DAG model (where acyclicity is guaranteed by the ordering constraint). The resulting model can be improved substantially by applying the above procedure to a dozen or so random orderings, and then choosing the best model according to some criterion (e.g., a Bayesian criterion or cross validation) [104]. The resulting DAG is a generative model that can be used to perform inference on the joint distribution.

Chapter 8

SUMMARY

This dissertation has introduced the Phylogenetic Dependency Network, a framework for the identification of adaptive traits and the specific sources of selection pressure that drive adaptation. The idea of the PDN is to fit each potentially adaptive target trait to a probabilistic model of evolution that conditions the target on both the phylogeny and other predictive traits that exert selection pressure on the target. The structure of the PDN then consists of arcs connecting targets to the sources of selection pressure that putatively drive the adaptation of each target.

We have defined and explored three specific probabilistic models of conditional adaptation, which can be used as the probability components of the PDN. These models are similar in that they each make the simplifying assumption that each target trait has evolved independently of the predictor traits throughout its evolutionary history. It is only when the target trait reaches the environment in which we are able to observe it that we model the interaction between predictor and target. Although this assumption certainly does not describe the true biological process, we have provided empirical evidence that it is a useful and reasonable approximation. Specifically, we considered two extreme examples. In the first, the predictor trait(s) is IID and uniformly distributed.
In this case, models based on this assumption are indistinguishable from models that integrate out the interaction of the predictor(s) in each hidden internal node, as this integration is indistinguishable from increasing the rate of evolution in the independence model. In the opposite extreme, the predictor and target are coevolving. In our simulation studies, we found that the conditional adaptation model is nearly as good as the generative coevolution model, suggesting that conditional adaptation is a reasonable approximation even when the true causal model is coevolution.

Of the three conditional adaptation models we proposed, we found that Noisy Add is the most expressive. In this model, multiple predictor traits are combined to drive the adaptation of the target. The parameters of the model specify whether each predictor exerts positive or negative selection pressure on the trait, as well as the probability that any non-zero selection pressure is exerted. Because there is only a single parameter for each predictor trait, the number of parameters grows linearly with the number of traits in the model. This represents a major advance over previous approaches, in all of which the number of parameters grows exponentially with the number of traits in the model. This property allows us to fit a large number of traits to the model, even with a modest amount of data, and to do so in a reasonable amount of computational time.

Because the conditional adaptation assumption can reasonably model interactions of target traits with predictors that exhibit a wide range of distributions relative to the target trait's, the Noisy Add model is able to simultaneously incorporate a diverse range of predictor traits into the model. Traditionally, approaches to the comparative method have assumed that the predictor traits follow a specific, prespecified distribution. Most often, the assumption is that the predictor has coevolved with the target along the same phylogeny. Although this model is reasonable for many domains, such as the study of codon covariation in proteins, we have argued that it is not appropriate for many applications in which the predictor trait is derived from the environment. Specifically, we have considered the case where we model the adaptation of HIV to the specific HLA alleles of the human host. In this case, we have shown that a coevolution assumption does not describe the data well and is significantly outperformed by the conditional adaptation model. Furthermore, because the conditional adaptation model can describe coevolution data reasonably well, the Noisy Add model is able to simultaneously incorporate a diverse range of predictors, including those that have coevolved with the target traits and those whose distributions are quite different from that of the target. In addition, our forward selection approach to determining the predictors for a given target can effectively account for correlations among predictors, which we have shown is important for increasing accuracy.

Importantly, the ability to simultaneously model the influence of a diverse range of predictor traits has enabled us to explore HIV adaptation from a rich new perspective. Focusing on HIV adaptation to the HLA-mediated adaptive immune response, we have argued that there are three major sources of statistical confounding: (1) population structure due to the HIV phylogeny, (2) linkage disequilibrium among HLA alleles, and (3) HIV codon covariation.
Although these sources of confounding are often acknowledged, limitations of previous approaches have prevented an in-depth study of HIV adaptation that can correctly account for these problems. In contrast, we have demonstrated that a PDN that uses the Noisy Add model is able to simultaneously account for all three sources of confounding. The result is that we are able to more accurately predict which specific HLA alleles are directly correlated with specific HIV codon substitutions.

When we applied our model to a large and diverse cohort, we found a rich network of interactions. It was not uncommon for a single HIV target codon to be directly correlated with a handful of HLA alleles and a large number of other HIV codons. The density of the dependency network underscores the importance of using a model that can simultaneously account for multiple interactions. As we demonstrated with both synthetic data and arguments based on observations of the real data, failure to do so results in a large number of spurious associations, especially as the size of the data set (and the corresponding power to detect deviations from the null model) increases. Even when the goal of a study is simply to identify coevolving codons, the limitation of previous coevolution approaches to comparing pairs of traits leads to a large number of spurious associations that derive from indirect interactions.

Our application of the PDN to the largest cohort of HLA-typed, chronically infected, antiretroviral-naïve patients studied to date reveals the complexity, and yet promising consistency, of HIV adaptation to the cellular immune response. In addition to identifying several of the handful of well-studied escape pathways, we identified a plethora of putative escape adaptations, many of which lie within or adjacent to known epitopes, but many of which also lie in putative epitopes. Remarkably, many of these adaptations correlate strongly with coevolutionary adaptations elsewhere in the protein. Although we cannot infer a causal mechanism for the coevolutionary changes, the combination of recently characterized compensatory mutations, coupled with the observation that coevolving pairs of codons in our data tend to be proximal in the folded protein, suggests that many of the coevolving pairs may represent compensatory mutations.

Moving forward, a major goal of T-cell vaccine design is to identify the specific epitopes that should be included in a vaccine. Given the patterns of escape and compensation that have been characterized in some epitopes that are targeted by protective alleles, the dense map of escape and coevolution that we have identified may provide key new insights into which regions of the virus are more vulnerable to an effective immune attack and should therefore be included in a vaccine.

In conclusion, the main contributions of this dissertation can be succinctly summarized as follows:

1. The introduction of the phylogenetic dependency network framework, which uses the assumption of conditional adaptation to map sources of selection pressure to adaptive traits.

2. The definition and exploration of three probabilistic models of conditional adaptation. In particular, the Noisy Add distribution, which efficiently and parsimoniously models the effect of selection pressure derived from multiple predictor traits on a single target trait (illustrated in the sketch below).

3. The application of Noisy Add and the PDN to the study of HLA-mediated adaptation in HIV.
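To give a rough feel for how a Noisy Add-style local distribution combines per-predictor selection parameters, here is a toy sketch. It is not the Noisy Add distribution as defined in this dissertation (which is specified over the leaf distributions of chapter 4 and fit by EM); the function name, the combination rule, and the parameter values are illustrative assumptions, intended only to show why the parameterization grows linearly with the number of predictors.

```python
import numpy as np

def sample_noisy_add_style_pressure(predictors, selection, rng):
    """Toy combination of per-predictor selection pressures.

    Each predictor i that is present contributes +1 (pressure toward escape)
    with probability selection[i] when selection[i] > 0, or -1 (pressure
    toward reversion) with probability |selection[i]| when selection[i] < 0,
    and otherwise contributes 0. The contributions are summed and the sign
    of the sum gives the net direction of selection on the target trait.
    One parameter per predictor, so the parameterization grows linearly
    with the number of predictor traits.
    """
    total = 0
    for present, s in zip(predictors, selection):
        if present and rng.random() < abs(s):
            total += 1 if s > 0 else -1
    if total > 0:
        return "escape"
    if total < 0:
        return "revert"
    return "no pressure"

rng = np.random.default_rng(0)
# A target codon with one HLA predictor exerting strong escape pressure and
# one covarying codon exerting weaker, opposing pressure:
draws = [sample_noisy_add_style_pressure([1, 1], [0.9, -0.4], rng) for _ in range(10_000)]
print({k: draws.count(k) for k in ("escape", "revert", "no pressure")})
```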
The source code and executables for the PDN, the distributions discussed, and the PhyloDv viewer that we use to display the results are freely available at http://www.codeplex.com/MSCompBio.

8.1 Limitations and future directions

We now discuss some limitations of the PDN framework and the conditional adaptation models, as well as some future directions that could address many of these problems.

8.1.1 Numerical and computational issues

Perhaps the most obvious limitation is the amount of computing power required to run these models. Indeed, even the largest data set considered in this dissertation can be analyzed using Fisher's exact test with only an hour or two of CPU time, whereas running Noisy Add took the better part of a year's worth of computing time. Nevertheless, as we have shown, the vast increase in computing time brings with it a substantial increase in statistical power and confidence in the calibration of the resulting significance scores. Indeed, in our largest synthetic studies, the application of FET yielded all but unusable predictions. In addition, given the advent of cheap, high-performance computing, we feel that the computational cost is a small price to pay for the increase in accuracy. This tradeoff is especially worthwhile if the algorithm can be run on a shared resource to defray the marginal cost. As such, we are developing a web-based, cluster-backed server as a service to the biological community so that computing costs are not an issue for researchers.

On the technical side, there is always a concern that EM optimization will settle on a local optimum. In principle, this problem can be eliminated with enough random restarts, though restarts are computationally costly, and we have found that the fitness landscape tends to be such that, when a local optimum is a problem, it takes a large number of restarts to find the global optimum. In practice, we have found that the best solution is to initialize all parameters with "reasonable" starting values, which can be estimated from the data by, for example, assuming the data are IID. When local optima are a problem, there are three possibilities: (1) only the calculation of the alternative likelihood falls into a local optimum, (2) only the calculation of the null likelihood falls into a local optimum, or (3) both calculations fall into local optima. Typically the most damaging scenario is case (2), as the result can be a wildly inflated significance measure. Fortunately, the fact that the null model is nested in the alternative model implies that if case (2) occurs when there is no true correlation, then there is probably a configuration of the null model parameters that will yield a likelihood close to that of the alternative model. This fact suggests the following heuristic: for every significant association, refit the null model using the applicable parameters from the alternative model as a starting point for EM. While this heuristic has no guarantee of helping, we have found that in practice it greatly reduces the number of spurious associations arising from case (2).

8.1.2 Problems with trees

A natural criticism of our approach is that we condition on a phylogeny, which of course cannot be known with complete certainty. There are, in fact, two major criticisms that can be leveled with regard to our use of phylogenies: (1) we ignore any uncertainty in the phylogeny, and (2) we infer the phylogeny assuming there is no selection pressure, a rather circular assumption given that we then set out to identify selection pressure acting on individual traits from which the phylogeny was inferred. Let us consider each of these criticisms in turn.

Uncertainty in phylogeny inference

The myriad phylogeny inference algorithms and their parameter settings make it clear that inferring phylogenies is an imprecise art. Indeed, when we run a number of available programs on the data sets considered in this dissertation, each yields a slightly different phylogeny. Can we really just pick one and assume it is the correct one? The good news is that we have found that the precise phylogeny used in our method does not have a huge impact on the results, and what impact it does have appears to be confined to reducing statistical power; it does not appear to inflate our estimates of statistical significance (see subsection 5.2.1). In other unpublished work, we have found this observation to hold up on real data as well. Specifically, in one example, we ran the model on five "similar" trees (meaning the major topological features are preserved) and found that, of those associations identified using any of the five trees at q < 0.05, 80% were found using all five trees.

Although we can be reasonably optimistic that the model is robust to imperfect trees, a solution that incorporated the phylogenetic uncertainty might increase statistical power and would at the very least be more intellectually satisfying. One possibility is to take a Bayesian perspective and compute the posterior distribution of phylogenies (using, for example, MrBayes [103], which will provide such a distribution if given enough time). We could then sample trees from this posterior distribution and create a PDN for each tree. Because the PDN consists of local probability distributions, the question then focuses on how to integrate over the sampled trees in the local conditional probabilities. One simple possibility would be to compute the alternative and null likelihoods as the average of the alternative and null distributions over the sampled trees. The resulting likelihood ratio would then reflect the uncertainty in the tree structure.

How does selection pressure influence inference of the phylogeny?

Even if we ignore the uncertainty in inferring the phylogeny, one must still wonder about the common assumption in the inference process that each site in the sequence is evolving neutrally. Not only is this assumption violated in general, but the whole point of the PDN is to identify where and how it is violated. Indeed, some of our collaborators have recently pointed out that the strongest associations can strongly bias the inference of the phylogeny [152]. In their case, simply removing the sites known to escape in response to B*57 drastically altered the inferred phylogeny. This observation suggests a simple, iterative solution (sketched in code below):

1. Infer the phylogeny using all sites.

2. Build the PDN using the inferred phylogeny.

3. Identify the target variables involved in the strongest associations and remove the underlying codons from the DNA sequences.

4. Infer a new phylogeny using the DNA sequences modified in step three.

5. If the phylogeny is very similar to that from the last iteration, stop. Otherwise, continue with step two.
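A schematic of this loop follows. The helper callables (phylogeny inference, PDN construction, identification of the strongest associations, masking of sites, and tree comparison) are hypothetical placeholders that a user would supply, for example as wrappers around an external phylogeny program and the PDN software; they are not functions provided by the released tools.

```python
def iterative_phylogeny_pdn(sequences, infer_phylogeny, build_pdn,
                            strongest_sites, mask_sites, tree_similarity,
                            threshold=0.95, max_iters=10):
    """Sketch of the iterative procedure listed above.

    `sequences` maps individual -> aligned DNA sequence. All other
    arguments are user-supplied callables (illustrative placeholders,
    not part of the released PDN software).
    """
    working = dict(sequences)
    tree = infer_phylogeny(working)                 # step 1: use all sites
    pdn = None
    for _ in range(max_iters):
        pdn = build_pdn(tree, working)              # step 2: build the PDN
        hot_sites = strongest_sites(pdn)            # step 3: strongest targets...
        working = mask_sites(sequences, hot_sites)  # ...and drop those codons
        new_tree = infer_phylogeny(working)         # step 4: re-infer the phylogeny
        if tree_similarity(tree, new_tree) >= threshold:
            break                                   # step 5: converged
        tree = new_tree
    return tree, pdn
```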
In the end, we will be left with a phylogeny built from sites that are not under strong selection pressure (or at least, those that are under strong selection pressure contribute relatively little to the phylogeny inference procedure). The PDN from this last step can thus be considered the most reliable. One concern might be that, because the first iterations of this algorithm involve a tree that is apparently incorrect, the associations we identify from those trees are likely incorrect and thus the wrong sites are removed, leading us down a wrong path. Two responses can be given to this. First, the fact that we are reasonably robust to wrong trees should guard against this problem. Second, and perhaps more important, any association that is strong enough to alter the phylogeny inference is likely strong enough to be identified given most any phylogeny. Indeed, the associations identified in Matthews et al. were those that have long been known precisely because they can be identified using just about any method (including those that ignore the phylogeny altogether).

Nevertheless, a more principled approach can be taken to this problem. Specifically, one could infer the structure of the phylogeny and the PDN simultaneously. One approach could be to use an EM-like procedure. That is, as we have demonstrated, given a phylogeny we can easily identify associations. The reverse is also likely to be true. Given a PDN, in which we model the selection pressure acting on each target variable, we should be able to infer a phylogeny that incorporates that selection pressure. By iterating back and forth between these two steps, we may expect to improve both our phylogenetic inference and the structure of the PDN.

8.1.3 Deterministic cycles

A more fundamental limitation of the PDN is an inherent limitation of dependency networks in general. Because a dependency network consists of local conditional probability distributions, in certain circumstances true causal interactions can be impossible to detect. Consider the following causal model: X → Y → Z. Further, suppose the conditional probability table for Y → Z is the deterministic relation

Pr{Z = 1 | Y = 1} = 1
Pr{Z = 0 | Y = 0} = 1

Now when we fit the local conditional probability distributions for the target variables Y and Z, we will find that Y is fully sufficient to predict Z and Z is fully sufficient to predict Y. Thus, if the relation between X and Y is non-deterministic, our forward selection procedure will never find X → Y, because there can be no gain in likelihood over the relation Z → Y (even though in the true causal model, the direction of causation is in reverse).

An example of where we have seen this problem is the case of B27-mediated escape in the epitope KK10. In the combined HOMER-Durban cohort, we were able to recover the known escape pathway B27 → R264K → L268M, which is one of two possible escape pathways originating at position 264. When we look only at the HOMER cohort, however, we do not see this pathway. What we do see is R264K → L268M and L268M → R264K. Upon closer inspection, we find that the selection parameter in both cases is 1, meaning that if escape occurs in one position, escape must occur in the other position. Indeed, in the HOMER cohort, every patient with one of the polymorphisms has the other polymorphism. Consequently, we fail to find the association of these escape mutations with B27, which is the known cause of the escapes.
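A small simulation of this failure mode (illustrative only; plain Bernoulli likelihoods with no phylogeny, not the PDN's actual likelihood computation) shows that once Z predicts Y perfectly, adding the true cause X yields no likelihood gain, so greedy forward selection stops at Z → Y:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# True causal chain X -> Y -> Z, with the Y -> Z relation deterministic (Z = Y).
X = rng.binomial(1, 0.4, n)
Y = rng.binomial(1, np.where(X == 1, 0.8, 0.2))   # X -> Y is noisy
Z = Y.copy()                                      # Y -> Z is deterministic

def loglik(target, predictors):
    """Log-likelihood of `target` under a saturated conditional model:
    one maximum-likelihood Bernoulli rate per configuration of the predictors."""
    stacked = np.column_stack(predictors)
    ll = 0.0
    for config in {tuple(row) for row in stacked}:
        mask = np.all(stacked == config, axis=1)
        p = np.clip(target[mask].mean(), 1e-9, 1 - 1e-9)
        ll += np.sum(target[mask] * np.log(p) + (1 - target[mask]) * np.log(1 - p))
    return ll

# X alone is informative but imperfect; Z alone is already a perfect predictor,
# so adding X after Z gives no gain and the arc X -> Y is never added.
print("log L(Y | X)    =", round(loglik(Y, [X]), 2))
print("log L(Y | Z)    =", round(loglik(Y, [Z]), 2))
print("log L(Y | Z, X) =", round(loglik(Y, [Z, X]), 2))
```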
One solution to this problem is to use a directed acyclic graphical (DAG) model and compute the joint likelihood. This is the approach of Poon et al. [182]. We could still achieve the benefits of the Noisy Add model (linearity in the number of parameters and the conditional adaptation assumption) by using the procedure outlined in section 7.3. Namely, create a random ordering over all the traits, then run the PDN as described, but with the restriction that the only allowable predictors for a given target trait are those that precede the target in the random ordering. The result is a DAG, for which the joint likelihood can be easily computed. The resulting model can be improved substantially by applying the above procedure to a dozen or so random orderings, and then choosing the best model according to some criterion (e.g., a Bayesian criterion or cross validation) [104]. Because the resulting model is acyclic, we cannot learn the (incorrect) circular model Y ↔ Z, and thus are likely to learn the (correct) model X → Y → Z.

8.1.4 Binarization of data and results

On finite data sets, binarization of multistate discrete data is a simple means of increasing the power to detect associations. In our case, a model that can incorporate all 20 possible amino acid states would represent an enormous increase in the number of parameters (up to 380 for the null model, if all possible transitions were to be considered; many more to capture selection pressure). Nevertheless, binarization has its drawbacks. The most obvious drawback comes in interpreting the results. Typically, the biologist wants to know what specific mutations are likely to arise in response to a given selection pressure. In the binarization process, we create multiple traits for each codon (one for each observed amino acid at that codon). In the best case, it can be difficult to reconstruct the escape process from the set of traits at that position. In the worst case, noise in the data can lead to inconsistencies or an incomplete picture.

Furthermore, any evolutionary model that does not model the mutation of codons directly (i.e., one that either binarizes the data or uses some compression of the codon space, such as the 20 amino acids) is not a first-order continuous-time Markov process (CTMP), as our models assume. That is, the first-order CTMP assumption states that the rate of transition between two states is constant. Because of redundancy in the mapping from trinucleotides to amino acids, however, this assumption does not hold. For example, the rate of transition from Leucine (Leu) to Phenylalanine (Phe) depends on the specific trinucleotide that underlies the Leu. If the trinucleotide is CUU, then a single mutation is required to yield UUU, a Phe trinucleotide. If, however, the Leu trinucleotide is CUG, then two mutations are required to yield either UUU or UUC (the two possible Phe trinucleotides). Thus, the probability of transitioning is not constant, but rather depends on the underlying trinucleotide, which in turn is a function of the previous amino acid state that was visited. For example, if the trait had just transitioned from Phe to Leu, then it is more likely to transition back to Phe than if it had previously been a Valine, whose GUG trinucleotide is one mutation away from Leu's CUG.
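A small, self-contained check of the codon arithmetic behind this example (standard genetic code; the helper function is only for illustration):

```python
# Standard-genetic-code check of the Leu -> Phe example above.
LEU_CODONS = {"CUU", "CUC", "CUA", "CUG", "UUA", "UUG"}
PHE_CODONS = {"UUU", "UUC"}

def min_mutations(codon, targets):
    """Minimum number of single-nucleotide changes from `codon` to any codon in `targets`."""
    return min(sum(a != b for a, b in zip(codon, t)) for t in targets)

for leu_codon in ("CUU", "CUG"):
    assert leu_codon in LEU_CODONS
    print(leu_codon, "-> Phe requires", min_mutations(leu_codon, PHE_CODONS), "mutation(s)")
# CUU -> Phe requires 1 mutation, while CUG -> Phe requires 2, so the
# amino-acid-level transition rate depends on the hidden underlying codon.
```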
The violation of the Markov assumption becomes even more pronounced when the data are binarized, as the transition from not-Leu to Leu can require anywhere from one to three DNA mutations, and thus the probability of the transition is a function of how long the trait has not been a Leu (the longer it has not been a Leu, the more mutations a transition back to Leu is likely to require).

Nevertheless, we must point out that the empirical results we have presented strongly suggest that the PDN is robust to violations of the CTMP assumption. Furthermore, for many traits, unmodeled purifying selection will, in practice, constrain the range of possible amino acid states, effectively minimizing the likelihood that the rate of transition from not-Leu to Leu (for example) deviates from a constant. Indeed, for many of the codons we saw in the real data sets, only 2-4 amino acids were actually observed at that position.

Although it is not feasible to learn the structure of the PDN using multistate distributions (especially one that fully satisfies the CTMP assumption), one could use a hybrid approach. Specifically, one could first fit the structure of the PDN using binarized data. The resulting structure could then be fixed, and the binarized traits could be collapsed back into multistate traits. Finally, the parameters of the multistate model could be fit to the data, assuming the PDN structure from the binarized data. There are, of course, several open questions that have to be addressed, including how to deal with inconsistencies in the structure when binary traits are collapsed back into their original multistate traits, and the precise definition of selection pressure over multiple states (the simple definition of positive and negative selection would no longer apply). Nevertheless, the result would be the statistical confidence in the structure provided by binarizing the data, with an easier task of interpreting the parameters of the model and the specific adaptations that are selected for in specific circumstances.

In addition, this process, coupled with converting the PDN to a DAG as described in the previous section, would greatly facilitate the calculation and interpretation of joint inference. Such inference could, for example, lead to the inference of the most likely viral sequence to arise in response to the selection pressure of a particular HLA repertoire, a functionality that could prove extremely useful for vaccine design, especially if we incorporate a model of which epitopes are actually targeted in an individual (as governed by the HLA repertoire in addition to immunodominance and/or vaccination). As the model currently stands, it is not possible to determine the rate of targeting of a specific epitope. But as we shall see in the next section, such an adaptation of the model may be possible with the advent of next-generation sequencing data.

8.1.5 Extension to single genome sequencing data

The input data we used for this dissertation consisted of "bulk" sequences, which represent the consensus of the HIV population infecting an individual. The primary reason for this is that bulk sequences are cheap and easy to obtain. They are also intuitively pleasing, as they presumably represent an estimate of the "optimal" HIV sequence given the current immune system conditions, insofar as the most common virus is the most fit virus. The bulk sequence is, however, an imperfect reduction of the underlying data.
A more accurate sequencing technology is generically referred to as single genome sequencing (SGS), which involves the sequencing of individual virions. Using traditional sequencing methods, the isolation and sequencing of individual virions is quite tedious. Thus, these data sets are currently rare and tend to be much smaller than the data sets considered here. However, next-generation sequencing technology promises to provide a high-throughput means of sequencing thousands of individual virions in each patient.

Is the PDN applicable to such data sets? The answer depends on the specific question. For example, if codon coevolution is the sole interest of the study, then the PDN can be applied to the SGS sequences without modification. As we have discussed, however, the ideal scenario is to account for environmental variables in addition to covariation. Indeed, in the studies we have considered, the interaction of HIV with the host environment is of primary interest. In this case, the Noisy Add model as it currently exists is inadequate.

Consider the case where we want to compare a single HLA trait X to a single HIV trait Y using the escape leaf distribution. The parameterization of this model includes a single parameter s for selection pressure. In the notation of this dissertation,

s = Pr{Y_i = 0 | X_i = 1, H = 1};

that is, the probability that Y transitions to state 0 given that it started in state 1 and individual i has HLA X. If we consider the underlying biology, however, we will notice that s should really be decomposed further. The probability of escape is not simply a function of whether the individual has HLA X; it is also a function of whether the individual's immune system is actively targeting (using HLA X) the epitope containing Y. That is,

s = Pr{Y_i = 0 | L_i = 1, H = 1} · Pr{L_i = 1 | X_i = 1},    (8.1)

where L_i = 1 means individual i's immune system is actively targeting the epitope containing Y. The new (hidden) trait L is conditioned on X because the epitope cannot be targeted without the HLA, but Pr{L | X} is non-deterministic because immunodominance implies that individuals who have the capacity to target the epitope may not (chapter 3). (Also, note that L is independent of H, because H corresponds to the state of Y in the absence of selection pressure.)

Figure 8.1: Univariate model with linked predictors. This phylogeny represents single genome sequences taken from two individuals (three sequences for the bottom individual, two for the top individual).

When we have only one sequence per patient, the two components of Equation (8.1) are unidentifiable. With multiple sequences per patient, however, the two components are identifiable, and failure to decompose s may result in a misleading model. Indeed, continuing our univariate example, a better model is the one given in Figure 8.1. In this model, the (observed) predictor HLA X points to an (unobserved) linker variable L, which in turn points to the observed target variable Y. The selection pressure s is now decomposed into two parts, s = (s_1, s_2), such that

s_1 = Pr{L = 1 | X = 1}  and  s_2 = Pr{Y = 0 | L = 1, H = 1}.

Semantically, s_1 is the probability that the immune system targets the relevant epitope given that the patient has HLA X, and s_2 is the probability that escape occurs given that the epitope is targeted.
Note that s_2 can be defined for each of the leaf distributions as in chapter 4, but s_1 remains the same for all leaf distributions (i.e., Pr{L = 1 | X = 0} = 0 for all leaf distributions). More generally, this linked conditional adaptation model is suitable for any case in which the presence of selection pressure from a predictive source is stochastic and applies equally to groups of individuals (that is, if the selection pressure is present, or absent, for one individual in the group, then it will be present, or absent, for all individuals in the group).

Following this line of thinking, a linked version of both the univariate and Noisy Add models can be constructed (see Fig. 8.2 for a mixed linked/unlinked example of Noisy Add). The resulting model should be a more faithful representation of the underlying biological interaction between HLA alleles and SGS sequences. Furthermore, the parameters s_1 and s_2 are identifiable and may provide valuable insight into the underlying biological mechanism. For example, does an apparently moderate-to-low rate of escape s derive from a low rate s_1 of epitope targeting, or from a low rate s_2 of escape in actively targeted epitopes? Note that, in the context of Noisy Add, it is straightforward to include unlinked traits as well. In this case, other codons would be represented as unlinked traits. The interested reader is referred to Appendix A for further technical details of this model, though thorough experimentation and application are left for future work.
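As a toy illustration of why multiple sequences per patient make the two components of Equation (8.1) separately identifiable, the following simulation (an assumption-laden sketch, not the linked Noisy Add model itself; it ignores phylogeny and assumes no escape at all in untargeted patients) draws several single-genome sequences per patient, with targeting shared within a patient and escape drawn per sequence.

```python
import numpy as np

rng = np.random.default_rng(2)
s1, s2 = 0.6, 0.8              # illustrative targeting and escape-given-targeting rates
patients, seqs_per_patient = 500, 5

escapes = []
for _ in range(patients):
    targeted = rng.random() < s1                       # L: shared across a patient's virions
    # Y: escape observed independently in each single-genome sequence
    escapes.append(rng.binomial(1, s2 if targeted else 0.0, seqs_per_patient))
escapes = np.array(escapes)

# With one (bulk-like) sequence per patient, only the product s1*s2 is visible:
print("overall escape rate   ~ s1*s2 =", escapes[:, 0].mean())

# With several sequences per patient, patients separate into an untargeted group
# (no escape in any sequence) and a targeted group, so s1 and s2 disentangle:
targeted_like = escapes.any(axis=1)                    # crude proxy for L = 1
print("fraction 'targeted'   ~ s1    =", targeted_like.mean())
print("escape given targeted ~ s2    =", escapes[targeted_like].mean())
```

In the linked model these quantities would of course be estimated jointly with the phylogeny and the leaf distributions rather than by this kind of direct counting; the sketch only conveys the identifiability intuition.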