Phylogenetic dependency networks: Inferring patterns of adaptation in HIV

Date: 23.11.2017


Figure 5.1: PR and calibration curves on synthetic data. Panels: (A) PR curves (precision vs. recall) for synthetic coevolution data; (B) calibration (1 − precision vs. q) on synthetic coevolution data; (C) PR curves for synthetic conditional data; (D) calibration on synthetic conditional data. Methods shown: Conditional, Undirected Joint, and FET; panel B also shows FET Parametric Bootstrap.

5.2.1 Sensitivity to tree structure
Our approach raises the question of how sensitive the results are to the structure of the tree used by the models. To address this question, we ran the conditional model on the synthetic conditional evolution data using four different trees: the tree used to generate the data (T_gen), a tree with the same structure as T_gen but with the leaf-to-patient assignments randomized (T_rand), and two trees reconstructed from the synthetic amino acids using either a generalized maximum likelihood method (T_ML, the method we use throughout this work) or a naïve parsimony method (T_pars).

As expected, the conditional model performed best using T_gen, though the discrimination curves were not significantly different from those of T_ML and T_pars, indicating that the conditional model is robust to variations of the tree on this data set (Figure 5.2A). Importantly, although the discrimination curve was significantly worse using T_rand rather than T_gen (p = 0.016), the conditional model was calibrated on all four trees, indicating that, on this data set, poor trees resulted in a loss of power but not in an inflation of false discovery rates (Figure 5.2B). This point is reinforced by the number of associations identified at q < 0.20 for the different trees: T_gen yielded eighty-nine predictions, whereas T_rand yielded only sixty-five, T_ML yielded seventy-eight, and T_pars yielded eighty-two. The positive predictive value for these predictions ranged from 0.80 to 0.85.

Although it may seem counter-intuitive that the conditional model could find any associations using the randomized tree, we note that the problem with the conditional model using T_rand is analogous to that of an IID model. Namely, whereas an IID model will over- or under-count observations by not accounting for hierarchical structure that exists in the data, a randomized tree will over- or under-count observations by assuming false hierarchical structure. In addition, the conditional model can compensate for a tree that fits the target data poorly by setting the mutation rate to infinity and thereby assuming the data to be IID. Indeed, the median λ value under T_rand was an order of magnitude higher than that under T_gen.

Figure 5.2: PR and calibration curves on conditional evolution data when run using different trees. Panels: (A) PR curves (precision vs. recall); (B) calibration curves (1 − precision vs. q). Trees shown: generating tree, PhyML tree, parsimony tree, and randomized generating tree.
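The calibration check used in this section (comparing the observed value of 1 − precision among associations called at a given q threshold with that threshold) can be illustrated with a minimal sketch. This is an illustration only, not the code used to produce the figures; the function name `calibration_points` and the toy values are hypothetical.

```python
def calibration_points(calls, thresholds):
    """For each q threshold, return (threshold, n_called, observed 1 - precision).

    `calls` is a list of (q_value, is_true) pairs, where `is_true` comes from
    a ground truth such as the planted associations in synthetic data.
    """
    points = []
    for t in thresholds:
        called = [is_true for q, is_true in calls if q <= t]
        if not called:
            points.append((t, 0, None))  # nothing called at this threshold
            continue
        false_fraction = called.count(False) / len(called)
        points.append((t, len(called), false_fraction))
    return points


# Toy input (made-up values). A method is calibrated when the observed
# false fraction stays at or below the nominal q threshold.
toy_calls = [(0.01, True), (0.05, True), (0.10, True), (0.15, True),
             (0.18, False), (0.40, False), (0.60, True), (0.90, False)]
for t, n, observed in calibration_points(toy_calls, [0.2, 0.5, 1.0]):
    print(f"q <= {t}: {n} calls, observed 1 - precision = {observed}")
```

Plotting the third element of each tuple against the threshold gives a curve of the kind shown in the calibration panels; points on or below the diagonal indicate a conservative method.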
5.3 Application 1: Effect of immune pressure on HIV evolution
To investigate the effects of immune pressure on HIV evolution, Moore et al. [157] obtained HIV sequences from 234 individuals along with the HLA-A and HLA-B alleles of the infected individuals. Performing several analyses, all of which assumed the data to be IID, they found strong correlations between the presence or absence of amino acids at particular positions and the presence or absence of particular HLA alleles in the infected patients, presumably reflecting the “escape” of amino acids under immune pressure. In Bhattacharya et al. [17], we analyzed a similar data set (N = 96, HLA-I and HLA-II alleles) and showed that use of the conditional model substantially improved the accuracy of identification of such HLA-codon associations. Here, we analyzed a superset (N = 205) of the data used in [17] (HLA-I alleles only) using both the conditional and undirected joint models.
First, we constructed a phylogenetic tree from the full set of sequences (subsection 5.1.4). We then used the single-variable model to determine whether any HLA alleles followed the tree. We found that two pairs of HLA-I alleles (B*4201 and Cw*1701; A*0207 and B*4601), where each pair is in tight linkage disequilibrium, followed the tree. We therefore separated the HLA data into two sets, (1) “C1701”, consisting of these four alleles, and (2) “notC1701”, consisting of the remaining alleles, and analyzed these two sets separately.
Our results using BIC show that the conditional model better explains the notC1701 data (p = 0.0001, N = 256296), whereas the undirected joint model better explains the C1701 data (p = 1.9 × 10^−24, N = 5664). In the case of the C1701 data, it seems that the phylogenetic tree is more a confounder of the data in the traditional sense, wherein the tree is associated with both the HLA and the sequences and induces false correlations between HLA and sequence.
In this application, we were fortunate that additional information was available to help confirm the HLA-sequence associations that we found. In particular, a known epitope in the vicinity of a found association supports the validity of that association, as immune pressure is focused on epitopes and the immediate surrounding regions that participate in the presentation of those epitopes on the HLA molecules at the cell surface [119]. Thus, we constructed discrimination curves where an HLA-sequence association was considered “true” if it was within three amino acids of a known epitope with the corresponding HLA and “false” otherwise. This bronze standard does not take into account undiscovered epitopes or linkage disequilibrium, but should nonetheless be unbiased with respect to a comparison of the alternative methods for identifying associations. The discrimination curve in Figure 5.3 for the notC1701 data is consistent with the BIC and synthetic results, indicating that the conditional model best fits this data. We could not construct a discrimination curve for the C1701 data, as there are no known A*0207, B*4201, B*4601, or Cw*1701 epitopes in Gag.

The associations found by the conditional model with q < 0.2 on the real data are shown in Table 5.1.
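The bronze standard described above (an association counts as “true” when it lies within three amino acids of a known epitope restricted by the same HLA) and the resulting precision-recall points can be sketched as follows. This is a minimal illustration under stated assumptions: “within three amino acids” is read as inside the epitope or at most three residues beyond either end, and the allele names, epitope coordinates, and the total number of true pairs are invented placeholders rather than data from this study.

```python
def near_known_epitope(position, hla, epitopes, window=3):
    """True if `position` is inside, or within `window` residues of, a known
    epitope restricted by the same HLA allele.

    `epitopes` maps an HLA name to a list of (start, end) residue ranges.
    """
    for start, end in epitopes.get(hla, []):
        if start - window <= position <= end + window:
            return True
    return False


def precision_recall(ranked_labels, n_true_total):
    """(precision, recall) after each call in a list ranked by significance."""
    curve, hits = [], 0
    for k, label in enumerate(ranked_labels, start=1):
        hits += bool(label)
        curve.append((hits / k, hits / n_true_total))
    return curve


# Hypothetical epitope map and ranked predictions (placeholders only).
toy_epitopes = {"X01": [(10, 18)]}
ranked_predictions = [(12, "X01"), (21, "X01"), (22, "X01"), (12, "Y02")]
labels = [near_known_epitope(pos, hla, toy_epitopes)
          for pos, hla in ranked_predictions]
for precision, recall in precision_recall(labels, n_true_total=4):
    print(round(precision, 3), round(recall, 3))
```

Because undiscovered epitopes are labeled “false” under this rule, the precision values it produces are best read as a lower bound that is comparable across methods rather than as an absolute accuracy.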

Figure 5.3: PR curves (precision vs. recall) for the full real HLA-amino-acid data, comparing the Conditional, Undirected Joint, and FET methods. Ground truth was estimated by identifying known epitopes within three residues of the predicted association. This definition is known to miss a large number of real epitopes.
5.4 Application 2: Pairwise correlations between amino acids in HIV
Identification of pairwise correlations between amino acids is important to many areas of biology, as correlations can indicate structural or functional interaction [27, 67]. Many methods, including the undirected joint model [178], have been developed to identify correlated residues.
Continuing our focus on HIV, we applied both the undirected joint and the conditional models to the sequence data from the Western Australia cohort [157]. We concentrated on the HIV-1 p6 protein, which is cleaved from the Gag55 polyprotein. This fifty-two amino-acid protein was chosen because it is the shortest HIV protein, making pairwise amino acid tests feasible for all models. We fit the conditional model in both directions (making both X and Y target variables), and selected the best model according to BIC.
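Selecting a direction by BIC can be illustrated with a short sketch. It assumes the higher-is-better form of the score, log L − (k/2) ln N, which matches how BIC scores are compared in this chapter; the function names, log-likelihoods, and parameter counts below are hypothetical placeholders, not values from our fits.

```python
import math


def bic_score(log_likelihood, n_params, n_obs):
    """BIC in the 'higher is better' form: log L - (k / 2) * ln(N)."""
    return log_likelihood - 0.5 * n_params * math.log(n_obs)


def best_direction(fits, n_obs):
    """Return the label with the highest BIC score, plus all scores.

    `fits` maps a label such as "X|Y" (X is the target, Y the predictor)
    to a (log_likelihood, n_params) pair.
    """
    scores = {label: bic_score(ll, k, n_obs) for label, (ll, k) in fits.items()}
    return max(scores, key=scores.get), scores


# Made-up fits for one amino-acid pair, run in both directions.
toy_fits = {"X|Y": (-120.0, 3), "Y|X": (-118.5, 3)}
label, scores = best_direction(toy_fits, n_obs=205)
print(label, {k: round(v, 2) for k, v in scores.items()})
```

When both directions have the same number of parameters, the penalty terms cancel and the choice reduces to comparing log-likelihoods; the penalty matters only when comparing models of different size, such as the conditional and undirected joint models.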
Remarkably, the BIC scores of the conditional model are significantly higher than those of the joint model (p < 10^−100, N = 52767). We suspect that the conditional model may be better because many mutations could be compensating for other mutations driven by HLA immune pressure. The conditional model finds that 893 of the 52767 (1.7%) amino acid pairs, and 310 of 1300 (24%) position pairs, are correlated at q < 0.2. This dense network of interactions is consistent with the idea that many of the mutations are compensatory in nature. For example, the conditional model identifies two HLA-mediated escape mutations in p6 (Table 5.1). Mutations at these two positions account for forty-two (13.5%) of the position-pair correlations.

Table 5.1: Predicted HLA-amino acid associations in Gag.
Pos   HLA     p         q
242   B5701   4.3E-08   0
28    A0301   1.5E-07   0
242   B5801   3.2E-06   0.03
147   C0602   5.0E-06   0.03
26    C0303   6.9E-06   0.05
482   B4001   2.8E-05   0.1
397   A3101   3.8E-05   0.13
495   B4701   6.9E-05   0.17
We have developed a tool for visualizing the network of dependencies (Figure 5.4). The visualization highlights at least one potentially interesting set of interactions. In particular, R16 is strongly correlated with residues at positions 21 through 36, many of which are correlated with each other as well as with residues throughout the protein. This complex network of interactions connects the two α-helix domains of p6 [68] and may be of structural or functional significance.

Figure 5.4: Correlated amino-acid pairs in HIV-1 p6. The fifty-two consensus amino acids of p6 are drawn as a circle, with the N-terminal end shown at the far right and the protein extending counter-clockwise. Each arc represents an association that is predicted by the conditional model and is significant at q < 0.2. Arc color reflects the q-value of the association. Dark gray consensus residues denote positions where there are fewer than three sequences with a non-consensus residue.
5.5 Application 3: Genomic search for genotype-phenotype associations in Arabidopsis thaliana
Aranzana et al. recently demonstrated the potential utility of genome-wide association studies (GWAS), as well as the importance of accounting for hierarchical population structure [8]. In this study, the authors genotyped 848 loci in ninety-six Arabidopsis thaliana strains and looked for haplotypes that were correlated with hypersensitive response to P. syringae strains expressing one of three avirulence (avr) genes (avrRpm1, avrRpt2, or avrPph3). In plants, each avr bacterial protein is recognized by a corresponding resistance (R) gene. If both plant and pathogen have active copies of the respective avr-R genes, a biochemical cascade is triggered at the point of infection, leading to massive programmed cell death and containment of the infection (for review, see [43]). Using both an IID-based model and a method that used the hierarchical population structure that was constructed from the sequenced loci, the authors showed that loci adjacent to the known R genes are highly correlated with the corresponding avr phenotypes. Unfortunately, the authors noted that their statistics were poorly calibrated, precluding confident predictions of the other pathogen-response proteins that are involved in the hypersensitive response cascade. Here, we apply our well-calibrated methods to the same data, using a genetic-similarity tree constructed from the sequence data.
Although Arabidopsis is a sexually reproducing species, it is highly selfing, meaning that organisms primarily mate with themselves. As a result, the population structure induced is hierarchical and bears striking resemblance to a phylogenetic tree. Aranzana et al. found that a tree built from pairwise similarity matrices on shared alleles provided a good qualitative description of both the geographic distribution of the organisms and the distribution of avr and flowering time alleles [8]. Quantitatively, we found that sixty-one percent of the haplotypes and two of the three phenotypes followed the “phylogenetic” tree constructed from the sequence data.

When applying our conditional model to this application, it is not clear whether the target variables should be haplotypes or phenotypes. In general, genetic variations directly influence phenotypes, but phenotypes also indirectly influence haplotypes through selection pressure. As roughly two thirds of both variables followed the tree, we ran the conditional model in both directions, once using the phenotypes as the target and once using them as the predictor, using BIC to determine which direction was best for any given haplotype-phenotype pair.
We found that the BIC scores for the conditional and undirected joint models were not significantly different (p = 0.70, N = 14043). Consequently, we arbitrarily chose to examine the results of the conditional model in detail. Figure 5.5 shows the genome-wide distribution of conditional evolution q-values for each of the three phenotypes. For each phenotype, the most significant association is a locus near the corresponding R gene. We constructed this figure to be similar to the one in Aranzana et al. to facilitate comparison.
Our synthetic tests indicate that the conditional method is well calibrated, implying that roughly 80% of the associations we find with a q < 0.2 cutoff should be legitimate. To explore this implication, we took the fifty-one genotypic associations (comprising forty unique loci) that correlate with these three hypersensitive phenotypes at this cutoff, and noted which of the associated loci were near known or putative bacterial response proteins according to http://www.arabidopsis.org. Our standard of “true positive” was defined to be proximity to a protein whose description included the phrase “disease response” (see subsection 5.1.7). We found that twenty-three (45%) of the predictions were within fifty kilobases of such proteins. This bronze standard undoubtedly contains false positives and false negatives, and therefore cannot be used to confirm that our methods are calibrated. Nonetheless, we can easily reject the null hypothesis that these twenty-three associations are found near disease-response proteins by chance (p < 0.0001).

Figure 5.5: Genomic distribution of genotype-phenotype association scores for Arabidopsis bacterial response. 4681 haplotypes were compared against each of the three bacterial response phenotypes, Rpm1 (top), Rpt2 (middle) and Pph3 (bottom). For each haplotype, the four conditional models were run and the negative log10 of the most significant q-value is plotted. For each phenotype, the most significant association is a locus within 10 Kb of the corresponding R gene. The dotted line shows the q = 0.20 threshold.
5.6 Studies using the univariate conditional evolution model
The univariate conditional model has proven useful in predicting HLA-mediated CTL escape mutations in HIV. Since the model was presented in the context of demonstrating the confounding effect of phylogeny in this domain [17], and examined in detail in [31] (from which the above sections were taken), several studies have used the conditional model in the context of HLA-mediated escape in HIV. Here we briefly review these studies.
The first paper to use the model in detail was Brumme et al. [24], who applied the approach to the Protease, RT, Vpr and Nef HIV proteins. There the authors used a cohort of about 700 individuals chronically infected with clade B virus, in the largest study of its kind. The associations predicted by the conditional model suggested a broad effect of HLA imprinting on HIV clade B protein diversity, with Nef exhibiting the broadest effect. Importantly, the fact that associations were identifiable extends previous case studies that suggest CTL escape is broadly consistent and therefore predictable (see chapter 3). With regard to method validation, 35% of associations at 20% FDR mapped to published epitopes against the same HLA allele. An additional 50% of the associations mapped to predicted epitopes, of which 20–40% (depending on protein) were broadly confirmed using independent interferon-γ (IFN-γ) ELISpot data. Furthermore, the utility of phylogenetic correction, even on a relatively homogeneous (single clade) cohort, was demonstrated by the higher proportion of conditional evolution associations that mapped to known epitopes than did associations computed via Fisher’s exact test. Finally, Brumme et al. found a weak but significant negative correlation between the number of predicted resistant substitutions and CD4 counts, an observation confirmed independently in each of the studied proteins.
In a similar follow-up study, Brumme et al. [25] studied HLA-mediated escape in Gag, a protein suspected to be particularly influential in control of HIV (see chapter 3). Here, the number of potential escape sites (as determined by HLA profile and the conditional model associations) was found to be negatively correlated with plasma viral load (pVL), suggesting that the potential to broadly target Gag is correlated with relative control of infection. Furthermore, the proportion of escaped sites was positively correlated with pVL, suggesting that viral load increased as escape mutations were selected. Although statistically significant, these trends explained only a small fraction of the pVL variance, suggesting that much of the complexity is not captured by these simple studies. At 20% FDR, 46% of associations mapped within or near published epitopes, while an additional 12% mapped within or near putative epitopes supported by either IFN-γ ELISpot or epitope prediction data (rules for epitope prediction were made stricter here than in [24] due to further studies on the false discovery rates of the prediction algorithm).
Rousseau et al. [196] looked at associations across the entire clade C genome on a similar-size cohort, using a combination of the conditional evolution model and models developed by Los Alamos National Labs (LANL) [17] and in the Mullins lab. This study compared the associations found in their clade C cohort to those found by Brumme et al. [24] and found both common and divergent escape associations, suggesting that future studies should more directly assess clade similarities and differences in this domain. In agreement with work by Goulder and others (chapter 3), Rousseau et al. found that, genome wide, HLA-B and HLA-C alleles were more likely to drive HIV diversity than were HLA-A alleles, suggesting a more active role for HLA-B and HLA-C in HIV control. Remarkably, they found a significant difference in the ratio of predicted susceptible to resistant residues in patients without the HLA on a per-protein basis, suggesting that escape mutations are more costly to viral fitness (and thus under stronger reversion pressure) in some proteins than in others. In agreement with previous work that found that targeting Gag correlates with lower pVL, Gag had the second highest susceptible-to-resistant ratio, behind only Vpr, an often ignored accessory protein. Interestingly, Rousseau et al. also analyzed 9-mers (as opposed to single codons). Although in general the power of 9-mers was greatly reduced due to the large number of observed 9-mers, they did find some escape patterns that involved two substitutions that were not detected when only a single codon was considered, suggesting the presence of alternative escape pathways that dilute the signal when only single codons are considered.
Matthews et al. [153] took the correlations with pVL one step further. Using a superset of the cohort in Rousseau et al., they focused on Gag, Pol and Nef, identified new epitopes and looked at the correlation of different types of associations with pVL. Specifically, using ELISpot data, they first confirmed previous reports that most epitopes are targeted by HLA-B alleles, and that targeting of B-restricted epitopes in Gag was correlated with lower pVL. They next examined the associations derived using the univariate conditional model. Remarkably, at 5% FDR, they found that 92% of HLA-B associations were within described epitopes. Furthermore, the number of associations in Gag per HLA-B allele was strongly correlated with median pVL for patients with that allele (r = −0.57, p = 0.0034). Comparing these results to the ELISpot results suggests that measuring escape is a good surrogate for measuring the number of epitopes actually targeted by an allele. Furthermore, Matthews et al. went on to show that most of the correlation came from reversion associations, suggesting that targeting epitopes for which escape elicits a fitness cost (and hence pressure to revert in patients without the allele) has the greatest effect on viremia control.
It is natural to assume that it would be beneficial for the immune system to target conserved regions of HIV, as these regions are presumably conserved because fitness constraints limit the amount of variation that can be tolerated [194]. To directly test this hypothesis, Wang et al. [225] analyzed escape associations in a whole-viral genome study of 98 patients and compared the conservation of escaping sites in protective and hazardous HLA alleles. Strikingly, while they found no correlation between conservation and relative hazard of HLA alleles when looking at all associations, they found a strong correlation (R = −0.57, p = 0.028) among associations in epitopes that are immunodominantly targeted during acute infection. This result suggests that targeting conserved epitopes early in infection may lead to relative control of the virus and improved prognosis.
These studies have collectively added to the vast literature, previously based largely on case studies and small cohorts, showing that patterns of immune escape are generally consistent, and that the control of viremia correlated with HLA allele and protein is determined largely by the specific epitopes that the allele presents and that CTLs target, as well as by the escape mutations that are available to the virus. From the principles of selection pressure and adaptation, we can assume that, for a given epitope, there are essentially three viral load functions of interest and that they can be ordered as:
pVL(susceptible, CTL) < pVL(resistant) ≤ pVL(susceptible, noCTL)
In cases where pVL(resistant) ≈ pVL(susceptible, noCTL), targeting the epitope will have only a transient effect that disappears after escape, and the resistant form associated with common HLA alleles may come to dominate the population (see chapter 3). However, when pVL(resistant) ≪ pVL(susceptible, noCTL), viremia control may continue even after escape, and reversion upon transmission to HLA-mismatched recipients is expected to be rapid. Thus, studies correlating pVL to various types of escape may suggest which epitopes can most effectively be targeted, and therefore which epitopes are promising targets for a T-cell-based vaccine. Moving forward, identifying those specific escape variants that do not completely restore fitness will be of vital importance.
The most direct assessment of pVL(susceptible, noCTL) − pVL(resistant) recently came from Goepfert et al. [81], who took the conditional model associations from the clade C Gag sequences and applied them to 117 matched donor-recipient pairs, with the goal of determining whether transmission of resistant variants from the donor correlated with reduced pVL in the HLA-mismatched recipient six months after infection, thus directly measuring the cumulative fitness cost of escape mutations. Similar to the previous studies, the authors found that only Gag HLA-B associations correlated with reduced pVL, suggesting that B-allele control of viremia is largely due to the fitness costs of the selected escape substitutions.
Finally, to analyze the rate of escape in acute infection and to quantify the amount of HIV evolution attributable to HLA-mediated selection pressure, Brumme et al. [23] took longitudinal samples from 100 acutely infected patients, sequenced repeatedly from within six months of infection through two years post-infection, and compared the observed substitutions to the HLA associations from [24]. They showed that rates of escape correlated with known patterns of immunodominance and that over 35% of evolution in acutely infected patients is directly attributable to HLA-mediated selection pressure.
5.7 Limitations of the univariate conditional evolution model
Despite the demonstrated effectiveness of the univariate conditional model for CTL-mediated HIV escape, one key limitation of the univariate model has become apparent from these studies: the network of correlations is such that pairwise tests find too many spurious associations. The most obvious source of confounding comes from the linkage disequilibrium (LD) structure among the HLA alleles. Because the A, B and C alleles lie in the same region on the chromosome, they tend to be inherited together. For example, in the Brumme et al. study [24], 272 HLA allele pairs are in LD at 20% FDR. Thus, if B14 leads to escape at position i, then with high probability we will find an association between C08 and position i, as all B14 patients also have C08, and half of C08 patients have B14. In the studies by Brumme and Rousseau [24, 25, 196], HLA LD was accounted for using an ad hoc post-processing step. Briefly, for every codon that had the exact same association (same amino acid and direction of correlation) with two HLA alleles H_1 and H_2 that were in LD, if one of the HLA alleles was known to bind an epitope near that codon, that allele was kept and the other discarded; otherwise, the allele with the lower q-value was kept. Although this approach generally works, it requires choosing an arbitrary definition of LD and will miss associations where both alleles interact with the same epitope.
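The ad hoc post-processing rule for a pair of HLA alleles in LD can be written as a small function. The sketch below is illustrative only: the function name, the dictionary layout, and the toy q-values are hypothetical, and the handling of the case in which both alleles have a nearby epitope (falling back to the lower q-value) is an assumption, since the rule as stated covers only the case in which one allele has epitope support.

```python
def resolve_ld_pair(assoc_1, assoc_2, alleles_with_nearby_epitope):
    """Choose which of two LD-linked HLA associations at one codon to keep.

    Each association is a dict with keys "hla" and "q". The third argument is
    the set of alleles known to bind an epitope near the codon.
    """
    near_1 = assoc_1["hla"] in alleles_with_nearby_epitope
    near_2 = assoc_2["hla"] in alleles_with_nearby_epitope
    if near_1 != near_2:
        # Exactly one allele has epitope support: keep it, discard the other.
        return assoc_1 if near_1 else assoc_2
    # Neither (or, by assumption here, both): fall back to the lower q-value.
    return min(assoc_1, assoc_2, key=lambda assoc: assoc["q"])


# Toy associations (made-up q-values) for the same amino acid at one codon.
b14 = {"hla": "B14", "q": 0.10}
c08 = {"hla": "C08", "q": 0.02}
print(resolve_ld_pair(b14, c08, {"B14"})["hla"])   # epitope support wins
print(resolve_ld_pair(b14, c08, set())["hla"])     # lower q-value wins
```

The sketch makes the weakness of the rule visible: exactly one association is always returned, so a codon at which both alleles genuinely select the same escape can never keep both.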
Thus, in Matthews et al. we implemented a Decision Tree approach that effectively automated the ad hoc procedure. Here, for a given amino acid at a given codon, we run the univariate model as described here. We then identify the HLA with the strongest association (by p-value), remove all patients who have that HLA, then repeat the procedure on the remaining patients. This procedure is iterated until the most significant association has p > 0.05. Thus, the resulting associations are independent of each other, as they measure effects in the absence of previously identified (more significant) associations. The tradeoff is loss of power, as on each iteration fewer patients are considered. Furthermore, interactions between HLA alleles are ignored. As observed in all three studies, many codons experience “push-pull” effects between two or more alleles, where each allele selects for a different amino acid. Although measuring associations in the absence of certain alleles is useful, it may be beneficial to model the combined effects.
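The iterative procedure can be sketched as follows. To keep the example self-contained, a one-sided Fisher exact test stands in for the univariate conditional model; this substitution is an assumption made purely for illustration and ignores the phylogeny entirely. The function names and the toy cohort are hypothetical.

```python
from math import comb


def fisher_upper_tail(patients, allele):
    """One-sided Fisher exact p-value for enrichment of the mutation among
    carriers of `allele` (a stand-in for the univariate conditional model)."""
    total = len(patients)
    carriers = sum(1 for p in patients if allele in p["hla"])
    mutated = sum(1 for p in patients if p["mut"])
    both = sum(1 for p in patients if allele in p["hla"] and p["mut"])
    upper = min(carriers, mutated)
    tail = sum(comb(mutated, k) * comb(total - mutated, carriers - k)
               for k in range(both, upper + 1))
    return tail / comb(total, carriers)


def stepwise_associations(patients, alleles, pvalue_fn=fisher_upper_tail,
                          alpha=0.05):
    """Repeatedly keep the most significant allele, drop its carriers, and
    re-test on the remaining patients until nothing passes `alpha`."""
    remaining = list(patients)
    kept = []
    while remaining:
        p_best, best = min((pvalue_fn(remaining, a), a) for a in sorted(alleles))
        if p_best > alpha:
            break
        kept.append((best, p_best))
        remaining = [p for p in remaining if best not in p["hla"]]
    return kept


# Toy cohort (made up): every B14 carrier also carries C08, half of the C08
# carriers carry B14, and only B14 carriers have the mutation.
cohort = ([{"hla": {"B14", "C08"}, "mut": True}] * 10
          + [{"hla": {"C08"}, "mut": False}] * 10
          + [{"hla": {"A02"}, "mut": False}] * 20)
print(fisher_upper_tail(cohort, "C08") < 0.05)   # the pairwise test flags C08
print([a for a, _ in stepwise_associations(cohort, ["A02", "B14", "C08"])])
```

In this toy cohort the pairwise test flags C08 only because of its linkage with B14; once the B14 carriers are removed, no mutated patients remain and the C08 signal disappears, which is the behavior the procedure is designed to produce.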
Similar to HLA LD, covariation among HIV codons can be expected to be a source of confounding. As seen in Application 2 above, there appears to be a dense network of covariation among HIV codons, at least in some proteins. This covariation network will lead to spurious associations. For example, if HIV codons A and B are correlated, and HLA H leads to escape in A, then we have the causal model H → A → B. If these associations are strong enough, the univariate conditional model will find the associations H → A (correctly) and H → B (incorrectly). As the density of codon covariation increases, so will the number of spurious associations. In the context of HLA-mediated escape, this may be tolerable, in the sense that such indirect associations are still “HLA-associated” and may have experimental relevance. Nevertheless, determining which associations lead to epitope escape and which are (for example) compensatory may have direct implications for vaccine design. Furthermore, when codon-codon associations are the primary goal, as in the case of the amino acid covariation problem discussed in chapter 2, indirect associations will tend to lead to false conclusions. In addition, given the density of HLA-associated polymorphisms [24, 25, 196], it is clear that these associations will play a confounding role in contexts where patterns of covariation are sought. It is therefore desirable to build a conditional model of evolution that simultaneously accounts for all known sources of selection pressure.
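The indirect association in the chain H → A → B can be made concrete with an exact calculation on binary variables. The sketch below is illustrative only: it ignores phylogeny, and the function names and probabilities are hypothetical values chosen for the example.

```python
from itertools import product


def chain_joint(p_h, p_a_given_h, p_b_given_a):
    """Joint distribution over (h, a, b) for the causal chain H -> A -> B.

    p_a_given_h[h] = P(A = 1 | H = h); p_b_given_a[a] = P(B = 1 | A = a).
    """
    joint = {}
    for h, a, b in product((0, 1), repeat=3):
        ph = p_h if h else 1 - p_h
        pa = p_a_given_h[h] if a else 1 - p_a_given_h[h]
        pb = p_b_given_a[a] if b else 1 - p_b_given_a[a]
        joint[(h, a, b)] = ph * pa * pb
    return joint


def prob_b(joint, **given):
    """P(B = 1 | given), where `given` may fix h and/or a."""
    def matches(h, a):
        return given.get("h", h) == h and given.get("a", a) == a
    num = sum(p for (h, a, b), p in joint.items() if b == 1 and matches(h, a))
    den = sum(p for (h, a, b), p in joint.items() if matches(h, a))
    return num / den


# Made-up parameters: H strongly selects A, and A strongly covaries with B.
toy = chain_joint(0.3, {0: 0.1, 1: 0.8}, {0: 0.2, 1: 0.9})
marginal_gap = prob_b(toy, h=1) - prob_b(toy, h=0)
gap_given_a = prob_b(toy, h=1, a=1) - prob_b(toy, h=0, a=1)
print(round(marginal_gap, 3), round(gap_given_a, 3))
```

A pairwise test sees only the marginal gap between HLA carriers and non-carriers at codon B, which is large here even though H has no direct effect on B; the gap vanishes once A is held fixed, which is the motivation for conditioning on all known sources of selection pressure at once.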
5.8 Conclusions
We have described two evolutionary processes that can confound association analyses and have defined two corresponding generative models for discrete data that can correct for, and even leverage, the existence of these processes. We have found that explicitly modeling evolutionary processes increases discriminatory power and results in well-calibrated estimates of one minus positive predictive value. We have implemented methods for fitting these models to data and a tool for visualizing the results of the analysis. These tools are available on the internet.
Neither the undirected joint nor the conditional model outperformed the other on all real data sets, suggesting that both models should be considered when analyzing new data. Nonetheless, the conditional model better fit most of the real data that we analyzed. The conditional model better described the effects of immune pressure on HIV evolution and, perhaps more surprisingly, it better described the correlation between HIV-1 p6 amino-acid pairs. This observation may be due to the rapid evolution of HIV and positive selection pressure from the immune response in conjunction with compensatory mutations in the observed patients.
Importantly, the conditional model performed well even in synthetic cases generated by the undirected joint model, suggesting that the conditional model may be a good approximation to the undirected joint model. This will be an important consideration in scaling up to multiple interactions, as the number of parameters in a generalized n-variable joint evolution model will be n + 2^n − 1, whereas the number of parameters in comparable conditional models will be 2 + n. Developing such conditional models is the main future work of this dissertation proposal.
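The difference in growth between the two parameter counts quoted above is easy to tabulate. The function names in this short sketch are hypothetical; the formulas are those given in the text.

```python
def joint_model_params(n):
    """Parameters in a generalized n-variable joint evolution model."""
    return n + 2 ** n - 1


def conditional_model_params(n):
    """Parameters in a comparable conditional model."""
    return 2 + n


for n in (2, 5, 10, 20):
    print(n, joint_model_params(n), conditional_model_params(n))
```

The joint count grows exponentially in n while the conditional count grows linearly, so the joint model becomes impractical to fit well before the number of interacting variables reaches the sizes seen in the HLA and codon networks discussed in this chapter.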

Chapter 6
EVALUATION OF MULTIVARIATE MODELS
We now turn to an evaluation of the two multivariate models, described in section 4.2, using synthetic data as an analysis tool. Although we go to great lengths to ensure that the synthetic data is a reasonable reflection of real data, it is never surprising when the model that generates the data is the model that performs best on that data. Nevertheless, as in the previous chapter, we can use synthetic data to compare the expressiveness of two models. In this case, we show that the Noisy Add model is superior to the Decision Tree model, because even when the data are known to follow the assumptions of the Decision Tree, Noisy Add still performs as well as Decision Tree (the reverse is not true). Furthermore, we have already argued that Noisy Add is conceptually more appropriate than other models. But even assuming the conceptual superiority of the model, it remains to be seen whether there is any practical advantage to the model. As we will see, there is indeed a profound advantage.
6.1 Technical Details
6.1.1 Computing q-values
The asymptotic conservative guarantee of Equation (4.1) (see section 4.3) requires a conservative estimate of Equation (4.2), which requires a valid (or stochastically conservative) p-value. In order to achieve a valid p-value, all model assumptions must be reasonably met. In particular, all sources of confounding must be accounted for. In principle, our multivariate models can account for these sources, provided the input phylogeny is reasonable and all other sources of confounding are provided as predictor
