Phylogenetic dependency networks: Inferring patterns of adaptation in hiv
Download 4.8 Kb. Pdf ko'rish
|
a b Figure 2.1: Examples illustrating the (a) overcounting and (b) undercounting of evi- dence for an association between X and Y . easily confound the statistical search for associations within such data. An important point that has not been emphasized in previous work is that different applications may involve different evolutionary processes leading to different kinds of confounding and requiring different solutions. For example, in one process, X and Y coevolve according to the phylogenetic tree—any change in Y during evolution influences the evolution of X, and vice versa. Here, the phylogenetic tree serves as a confounder of X and Y in the traditional sense—the tree is a hidden common cause of both X and Y , leading to spurious correlations between X and Y when the tree is ignored. In another process, only Y evolves according to the tree and the influence of X on Y occurs only at the tips of the tree. X need not evolve according to the tree or follow the tree, but instead can have any distribution, including one in which the observations of X are IID. We refer to these two processes as coevolution and conditional adaptation, respectively. 9 2.3 Related work on the comparative method In this section, we briefly review existing methods for the comparative method that account for phylogeny. The interested reader is referred to the textbooks of Harvey and Pagel [ 92 ] and Felsenstein [ 62 ], as well as the review by Martins [ 149 ], for a more detailed review of the field. 2.3.1 Correlated evolution of continuous variables One of the clearest arguments for identifying correlated traits in the context of phy- logenies come from Felsenstein [ 61 ], who laid out a similar argument to that given in section 2.2 proposed the method of independent contrasts as a remedy. In this method, the (continuous) traits are assumed to be derived from a Brownian motion (or other Gaussian) model of evolution and the n samples are converted to n − 1 differences between adjacent nodes in the tree, with the variance of the differences computed according to the branch lengths of the tree. The resulting differences are independent, allowing for regression tests between two traits. Of course, other models of neutral evolution may be more appropriate, as the random walk of Brownian motion is rather implausible for most systems in which at least some purifying selection provides a continuous draw toward some steady state. A variation of Felsenstein’s method using the Ornstein-Uhlenbeck (OU) process, which can be thought of as a random walk tethered by an elastic band to some center, provides a more realistic framework [ 89 ]. Butler and King [ 29 ], building on Hansen’s model [ 89 ], provide a natural framework in which different selective regimes can be tested. In essence, a specific source of selection pressure, hypothesized to act on a specific set of branches, is allowed to change the center to which the OU process gravitates. The univariate conditional model ( section 4.1 and chapter 5 ) can be thought of as a discrete-variable version of this OU approach. These methods are effective and well used, but are appropriate only for continuous data. Felsenstein 10 provided a method for modeling a discrete variable using an underlying continuous variable and a thresholding procedure, but likelihood estimation requires Markov chain Monte Carlo and is quite slow in practice [ 64 ]. 2.3.2 Correlated evolution of discrete variables For discrete data, early work focused on reconstructing ancestral states using parsi- mony and looking for correlations between transitions using standard methods such as the χ 2 test of independence [ 141 , 190 ]. The problem with these approaches is that they ignore the uncertainty in the assignment of ancestral states. In some cases, choosing an ancestral state when there is roughly equal evidence between two states can alter the conclusions. To account for this uncertainty, Pagel [ 171 ] presented a maximum likelihood approach that averages over all possible configurations of the internal nodes. In effect, Pagel performs a likelihood ratio test where the null model is two binary traits evolving independently using classical maximum likelihood over phylogenies [ 60 ], and the alternative model has 2 × 2 = 4 states that define correlated evolution. At about the same time, Muse independently proposed the four-state ver- sion of Pagel’s method for the purpose of detecting compensatory mutations in RNA secondary structure [ 158 ]. 2.3.3 Correlated evolution of amino acids Coevolution of discrete traits has been propelled largely by the amino acid coevolu- tion community. Here the goal is to identify coevolving pairs of codons in the same or interacting proteins (see [ 39 ] for a recent review). Early methods ignored the prob- lem of phylogeny and used simple methods of correlation such as mutual information [ 11 , 122 ] or the correlation coefficient [ 80 , 173 , 215 ], but the phylogeny was soon shown to play a confounding role [ 168 , 177 , 228 ]. A number of methods emerged that attempted to calibrate p-values based on the phylogeny. The idea behind these approaches is that the primary concern of phylogeny-based confounding is that the 11 resulting statistics do not follow the expected distribution, making any p-values that are computed on the assumption of a specific distribution invalid. To recalibrate, the null distribution is estimated from the data using some method that accounts for the phylogeny. Briefly, methods have been proposed based on global or local measures of average similarity [ 26 , 127 , 138 , 159 , 213 ] or flat clusters based on early bifurcations of the phylogenetic tree [ 11 , 27 ]. One approach that explicitly incorporates the phy- logeny is the parametric bootstrap, in which standard IID models are recalibrated by generating independently evolved amino acids according to an independence model of evolution [ 27 , 228 ] (in chapter 5 , we consider this method in more detail). One broad criticism of these approaches is that they leave the underlying ordering of the statistics unchanged. That is, it is implicitly assumed that the phylogeny does not affect the strength of the association, only the interpretation of the relative strength. As we have argued in section 2.2 , however, the phylogeny may profoundly alter the relative strength of the statistic, even changing the ordering of associations. Thus, directly incorporating the phylogeny into the derivation of the statistic may dramat- ically increase power. A more direct way to incorporate the phylogeny is to adapt one of the discrete methods from the evolutionary biology community. Several methods have derived from Ridley’s original proposal of reconstructing ancestral states [ 17 , 74 , 102 , 182 ]. Similarly, Poon et al. [ 181 ] and Pollock et al. [ 178 ], which we study in some detail in chapter 5 , develop special cases of Pagel’s method [ 171 ] that force the model to be reversible, and Yeang and Haussler [ 233 ] develop a version of Pagel’s method that does not require the collapse of amino acids into binary space, but rather achieves the necessary state-space reduction by setting the instantaneous rate of simultaneous mutation in both both codons at a nonzero constant that is the same for all amino acids. 12 2.3.4 Identifying adaptation from DNA sequence alone An alternative approach to discrete data is to simply identify codons that are under diversifying (positive) or purifying (negative) selection pressure, an approach that is particularly suited in the absence of testable hypotheses regarding the source of selection pressure. Several methods have been developed that compare the rate of synonymous mutations d s to that of nonsynonymous mutations d n to identify genes, regions, or even individual codons that experience more amino acid substitutions than expected based on the underlying neutral substitution rate [ 160 ]. One of the more common methods is PAML, which uses a codon substitution model over a phylogeny to compute d n /d s [ 165 , 231 ]. Stewart et al. [ 207 ] extended this concept in the program QUASI to identify specific residues that are positively selected, although this model assumes a star phylogeny in that only deviations from consensus are considered. Pe- ters et al. [ 174 ] identified HLA-associated polymorphisms by identifying positively selected residues in HIV using QUASI, then correlating those residues to HLA alleles using standard tests that assume independence (see chapter 3 for an introduction to the importance of HLA-derived selection pressure on HIV). Delport et al. [ 48 ] devel- oped a codon model that specifically identifies amino acids that toggle back an forth, on the assumption that these represent specific adaptations to certain environments (assumed to be HLA in their case study) that revert in the absence of the environment. Chen and Lee [ 35 ] developed a more direct approach to identifying sources of selection pressure by defining a d n /d s ratio that was conditional on specific environmental vari- ables (in their case, HIV antiretroviral drugs), but also did so using a na¨ıve definition of d n /d s based on deviation from consensus. Such definitions implicitly assume a star phylogeny and ignore branch lengths (i.e., they assume the traits are IID), and are thus likely to lead to statistical bias in cases where there is some structure in the tree, as in HIV [ 17 , 31 ]. Even when the phylogeny and branch lengths are considered, as in PAML, d n /d s ratios can be viewed as a means to calibrating statistics. That is, 13 the synonymous substitution rate serves to normalize observations regarding nonsyn- onymous substitutions. Thus, models that explicitly model evolutionary interactions between two variables (e.g. [ 31 , 102 , 141 , 158 , 171 , 178 , 181 , 190 ]) are expected to have greater statistical power than d n /d s ratios, due to the fact that they can up- weight surprising deviations between evolutionarily similar species whereas d n and d s represent summary statistics across all species and cannot leverage such information. 2.3.5 Similarities to population genetics Population structure in biological data has also been addressed in the area of genome wide association studies (GWAS). Although standard population genetic models as- sume populations mate randomly, violations of this assumption result in latent pop- ulation structure that inflates false positive rates, an effect that will only increase as study sizes increase [ 144 , 145 ]. There are two rather different approaches in this community that have been used to compensate for population structure. The more commonly used approach attempts to recalibrate standard statistics by normalizing results according to the distribution of the statistic across the entire genome [ 50 , 51 ]. As we shall see, solutions to calibration are insufficient, as population structure also affects discriminatory power. The other approach assumes population structure is flat and can be captured by a small number of (perhaps overlapping) clusters or con- tinuous hidden variables [ 185 , 186 , 198 , 204 , 217 ]. Although these methods increase discriminatory power relative to simple IID models, there is mounting evidence that populations are hierarchically structured. For example, in addition to high level ge- ographical/social constraints that impose population structure [ 195 ], structure exists within a number of subpopulations that have been studied [ 13 , 30 , 96 , 223 ]. If hi- erarchical models describe the data better than flat cluster models, then it stands to reason that such models will have higher discriminatory power. Thus, several au- thors have suggested that a more accurate approach would be to model population structure hierarchically [ 8 , 114 , 240 ]. Aranzana et al. [ 8 ] described one such model in 14 their Arabidopsis study, which we re-examine in section 5.5 . A more general approach that can capture pedigree structure is the linear mixed-effects model [ 97 , 114 ], most recently extended by Yu et al. [ 241 ]. This linear model includes a pairwise correla- tion term that models genetic relatedness, a white noise term, and an environmental impact term that is used to identify sources of selection pressure. This approach can be thought of as unifying population genetics and phylogenies, in that Brownian motion and OU processes over phylogenies can be captured in the genetic relatedness term [ 101 , 140 ]. Unfortunately, modeling discrete data with these approaches is com- putationally slow and difficult to optimize, at it requires variational approximations (H. Kang and D. Heckerman, unpublished data). 2.4 Limitations of existing methods Although there is a long and rich history of the development of methods for the comparative method, we note two major gaps in the literature, each of which will be addressed by this dissertation. 2.4.1 Chains of interaction First, we note that the traditional application of the comparative method is to look for correlations among pairs of variables. The problem with this approach can be seen by a simple example. We have been considering the case where trait X exerts selection pressure on trait Y , which can be graphically depicted as X → Y. Suppose, however, that Y in turn exerts selection pressure on another trait Z. A common example is if Y and Z are both codons in the same protein. In this scenario, a mutation in Y may serve as an adaptation in response to X, but the change in Y destabilizes the protein unless a compensatory change in Z occurs. This causal model 15 can be graphically depicted as X → Y → Z. The problem is, when we apply the comparative method to all pairs of variables, we will find that all three pairs (X, Y ), (Y, Z), and (X, Z) are correlated, even though Z is conditionally independent of X. We refer to such causal models as chains of interactions. If chains are common, then indirect associations ((X, Z) in the present example) will be common, leading to a large number of false positive results. To our knowledge, only Poon et al. [ 182 ] have addressed the problem of chains of interactions when considering discrete traits in the context of phylogeny. (Several authors have employed Bayesian networks, which solve this problem, but have done so assuming the traits are IID [ 45 , 181 ].) Poon’s approach can be described in three steps: (1) first, a phylogeny is inferred from all traits (they focus on amino acid covariation); (2) next, they infer the most likely state of each trait for each hidden node in the phylogeny in a method similar to [ 190 ]; (3) finally, they feed both the observed and inferred traits into a standard directed acyclic graphical (DAG) model inference algorithm [ 93 ], treating the inferred states as observed data. This paper represents a major advance in the field in that it was the first to both identify and address the problem of chains of interactions. We must, however, note several weakness to this approach. First, one must be cautious whenever hidden nodes are treated as observed data. In cases where there is strong evidence that the hidden node takes on a specific value, this approximation may yield good results in practice. For large phylogenies, however, it is often the case that the evidence for one state is only slightly greater than the evidence for another state, especially for nodes near the root (and thus farther removed from the observed data). Although Poon et al. showed that their method was reasonably robust to different instantiations of the hidden nodes on the data sets they considered, in general, power is typically gained when all the hidden nodes are integrated out. Second, the DAG model has some inherent constraints. For example, the acyclicity 16 constraint, which is required for computational tractability, may not be a reasonable assumption. Certainly in the case of amino acid covariation one can expect cycles to exist in the true causal model. The result can be difficulty in interpreting the independencies implied by the resulting structure [ 94 ]. Additionally, the goal of the DAG model is to maximize the joint likelihood over all the attributes. This requires an exponential number of parameters relative to the number of traits considered, leading to a substantial computational burden and a requirement for a massive amount of data to infer all but the simplest interaction networks. 2.4.2 Coevolution versus conditional adaptation As discussed in section 2.2 , there are at least two evolutionary processes that can lead to phylogenetic confounding: coevolution and conditional adaptation. To our knowledge, all of the existing approaches assume a coevolutionary process, by explic- itly incorporating the assumption in a generative model (e.g., [ 158 , 171 , 178 ]), or by mapping both traits to the same tree (e.g., [ 17 , 74 , 102 , 141 , 182 , 190 ]). In principle, most of the p-value calibration techniques could be adapted to estimate null data by randomizing only one variable with respect to the tree, though we are not aware of any explicit discussion in the literature to this effect. The problem with this assump- tion is clear. When one of the traits (typically the environmental trait) does not map well to the phylogeny of the other trait, forcing it to do so will hurt the modeling procedure. In practice, there are two solutions to this problem. (1) Many approaches can model an IID variable on a tree as a boundary condition of the model parameters. For example, the generative model of Pollock et al. [ 178 ] can, in principle, handle ei- ther or both traits being IID by setting the corresponding mutation rate parameter to infinity. In practice, however, we find that this model does not perform well on conditional adaptation data (see chapter 5 ). The difference between conditional adaptation and coevolution is, however, deeper than the fact that conditional adaptation is more likely to accommodate the two traits 17 following two different distributions. The specific interactions assumed by the causal models are different. In the coevolution case, the assumption is typically that the two traits are either positively or negative correlated in the traditional sense. By this we mean that, if the traits are positively correlated, the mutation of either trait to 1 will apply pressure for the other trait to mutate to 1. Conversely, the mutation of either trait to 0 will apply pressure for the other trait to mutate to 0. (The opposite is of course true for negative correlation.) The conditional adaptation assumption can posit two different definitions. (1) It can be a directed association. Specifically, changing trait X (the environmental variable in our example) may influence trait Y , but changing trait Y will have no influence on X. Second, the interaction may be only partially positive (or negative). For example, it maybe that X = 1 induces positive selection pressure for Y to transition to 1, but when X = 0, Y is under neutral selection. The point is not that the coevolution model is fundamentally unsound. Indeed, there are many examples where this model is likely closer to the true causal model than is conditional adaptation, with a prime example being amino acid covariation. Rather, each model appears to better describe different evolutionary processes. In this dissertation, we propose several specific conditional adaptation models, which are inspired by the process of adaptation described in section 2.1 . In chapter 5 , we explicitly compare one of these models to the coevolution model of Pollock et al [ 178 ]. As we shall see, both the coevolution and conditional adaptation models are distinct, though we also find that our conditional adaptation model is able to approximate the coevolution model reasonably well. 18 Chapter 3 HIV IMMUNE ESCAPE: INTRODUCTION AND REVIEW The extraordinary mutational capacity of HIV-1 represents a major challenge to vaccine development [ 76 , 143 ]. On average, the error-prone HIV-1 Reverse Transcrip- tase introduces one mutational “error” per replication cycle, while template-switching and recombination represent additional mechanisms for generating alternative viral species [ 143 ]. Within an infected individual, a progressive expansion of viral diversity occurs over the disease course [ 205 ], with multiple variants co-existing as a heteroge- neous swarm or quasispecies that is unique to each patient. On a global scale, HIV-1 has undergone dramatic diversification since its introduction into humans less than 100 years ago [ 121 , 229 ]: nucleotide sequences from the multiple viral subtypes and circulating recombinant forms comprising HIV-1 Group “M” strains (which account for the majority of infections worldwide) may differ by up to 35–40% [ 76 ]. Achieving a broader understanding of the factors driving viral evolution on both an individual and a global level is thus of paramount importance to vaccine design. Over the natural course of infection, the host immune response acts as a major selective force driving HIV-1 evolution in a continuous dynamic process known as im- mune escape [ 85 ]. Directed against three-dimensional epitopes on the virion surface, the role of antibodies is to neutralize free-floating virus or to tag them for destruction by effector cells or complement. Escape from the HIV-1-specific antibody response thus takes the form of amino acid substitutions within the viral Envelope protein and represents a main driver of both intra-individual and global HIV Envelope di- versity [ 226 ]. In contrast, the role of cytotoxic T-lymphocytes (CTL) is to eliminate 19 virus-infected cells through recognition of short, linear peptides processed intracellu- larly and presented on the cell surface by Human Leukocyte Antigen (HLA) class I molecules. Since peptides from all viral proteins have the capacity to bind and be presented by class I molecules, HLA-restricted CTL select for escape mutations on a proteome-wide basis. This fact, combined with the observation that CTL likely contribute more to immune control of HIV-1 infection than antibodies [ 133 ], high- lights CTL as a potentially major in vivo selective force driving genome-wide viral evolution. It is generally agreed that a successful HIV-1 vaccine will require stimulation of an effective CTL-based immune response in addition to an antibody response [ 133 ]. The recent suspension of a major CTL-based HIV-1 vaccine trial [ 161 ] underscores the need to improve our understanding of host antiviral immunity, including the impact of immune-driven viral adaptation on HIV-1 sequence diversity and its potential conse- quences for future vaccine strategies. In what follows, we summarize recent advances in our knowledge of CTL-driven HIV evolution at a population level and discuss the implications for vaccine design. 3.1 The HLA-restricted CTL response is a major selective force driving HIV-1 evolution within an infected host HLA-restricted CTL are major mediators of host antiviral control during HIV-1 infec- tion [ 85 , 86 ]. A temporal correlation exists between the appearance of HIV-1-specific CTL in vivo and the decline of acute-phase viremia [ 123 ] (antibodies appear only later [ 85 , 133 ]), and experimental depletion of CD8 + cells in rhesus macaques prior to Simian Immunodeficiency Virus (SIV) infection results in inability to control virus levels [ 199 ]. In addition, a strong epidemiological link exists between specific HLA class I alleles and differential rates of HIV-1 disease progression [ 14 , 34 , 44 ], suggesting that the quality of the CTL response and/or the characteristics of targeted epitopes strongly influences the effectiveness of antiviral control [ 85 ]. However, the strongest 20 evidence supporting CTL as a major determinant of HIV-1 control may be muta- tional immune escape. First described in 1997, selection of viral escape mutations within key CTL epitopes during primary [ 16 , 19 ] and chronic [ 84 , 85 ] HIV-1 infection identified immune-driven evolution as a continuous process occurring throughout the disease course. Although escape mutations are often selected within CTL epitopes (thus disrupting peptide-HLA binding or recognition of the peptide/HLA complex by the T-cell receptor), escape is by no means confined to epitope boundaries. Muta- tions in epitope flanking regions, which impair intracellular peptide processing and presentation, have also been described [ 3 , 54 , 239 ], as have secondary or compensatory sequence changes that can stabilize escape mutations selected elsewhere [ 41 , 85 , 201 ]. 3.2 Escape follows generally predictable patterns in response to specific immune pressures In the past decade, observational studies have identified a large number of CTL escape mutations that are reproducibly selected in the context of specific HLA restrictions [ 19 , 84 , 85 , 113 , 132 , 201 ]. This has led to a monumentally important observation: HIV-1 evolution follows generally predictable patterns when specific immune pressures are applied. This phenomenon was most strikingly demonstrated in a unique case- report of monozygotic twins infected with the same virus on the same day through needle sharing: over a three-year follow-up period, the kinetics and patterns of CTL and antibody escape mutations were nearly identical in both twins [ 53 ]. Even among unrelated individuals, kinetics and patterns of HIV-1 evolution are broadly predictable based on HLA restriction. The majority of persons expressing the “protective” HLA- B*57 allele [ 34 ], for example, will select for a T to N mutation at position three of the TW10 epitope in the p24 Gag protein within the first weeks after infection [ 85 , 132 ]. In B*27-expressing individuals, the first mutation that arises in the immunodominant Gag epitope KK10 is an L to M change at position six, followed years later by an R to K change at position two [ 85 , 113 ]. The fact that sites and pathways of escape are 21 broadly predictable indicates that despite the extensive worldwide sequence diversity of HIV-1, substantial constraints govern the evolution of this virus [ 2 , 53 ]. This raises the possibility that immunogens incorporating knowledge of common escape pathways may be designed. Until relatively recently, however, identification of escape mutations have generally been limited to smaller observational studies, and largely biased towards “protective” HLA alleles associated with long term viremic control [ 19 , 84 , 85 , 113 , 132 , 201 ]. 3.3 Immune selection pressures drive HIV evolution at the population level: but to what extent? If within-host CTL escape patterns are broadly predictable based on HLA profile, the frequency and distribution of HLA alleles in humans likely shape viral evolution at the population level in a similarly predictable manner. Indeed, the HLA footprinting hypothesis states that the circulating HIV-1 consensus sequence reflects viral adap- tation to the most commonly-expressed HLA alleles in a population [ 131 , 157 ]. One potential mechanism underlying this hypothetical footprinting effect is the repeated selection of fitness-neutral escape mutations in the context of frequently-observed HLA alleles, eventually leading to fixation of “inactive” forms of CTL epitopes in the circulating pool of viral strains and potentially rendering HIV less immunogenic as the epidemic progresses [ 131 , 157 ]. Alternatively, a bottleneck effect may have occurred early in the course of the pandemic. Under this hypothesis, escape mutations arising in the earliest patients persist in the circulating strain. In either scenario, persisting escape mutations must have minimal fitness cost in the absence of CTL to prevent reversion back to the susceptible form following transmission to patients who lack the restricting HLA allele. The role of CTL in shaping population HIV-1 sequence diversity is influenced by a complex interaction among multiple conflicting selective forces, one example being the delicate balance between the benefits of escape versus the associated costs to 22 viral fitness [ 137 ]. While escape mutations, by definition, confer a selective advantage under active CTL pressure, these mutants may not represent the most efficiently replicating species in the absence of immune pressure. For example, the dominant T242N escape mutation at position three of the B*57-restricted TW10 epitope in p24 Gag abrogates B*57-epitope binding [ 132 ], but also confers a substantial replicative cost in the absence of CTL pressure as demonstrated by rapid reversion to wild type following transmission to a B*57-negative individual [ 132 ] and in vitro assays measuring viral replicative capacity [ 21 , 147 ]. Reversion of escape mutations following transmission to an individual lacking the HLA allele, however, does not necessarily occur in all cases: the fitness cost of the substitution, the presence of compensatory mutations, and a complex array of other host and viral selective forces influence which immune-selected mutations may reach appreciable levels in the population [ 132 ]. Estimating the extent of immune imprinting on HIV-1 is additionally challenging due to the lack of a comprehensive map of HLA-associated escape sites across the viral genome. Until recently, studies of CTL escape have focused on select alleles and/or HLA-restricted epitopes in small observational studies; however, recent ad- vances in DNA sequencing technologies have facilitated the collection of HLA and HIV-1 sequence data in large cohorts of HIV-infected individuals, thus allowing the first population-based assessments of HLA-driven imprinting on the viral genome. 3.4 Assessing the extent of HLA-driven HIV-1 evolution at the popu- lation level: challenges and controversies The first study investigating HLA-mediated imprinting on HIV-1 at the population level was published by Moore et al. in 2002 [ 157 ]. Specific HLA alleles associated with the presence or absence of the consensus amino acid over codons 20-227 of the Reverse Transcriptase protein were identified in a cross-sectional analysis of over 400 clinically-derived HIV-1 sequences from Western Australia, using Fisher’s exact test and logistic regression. A total of 64 “positive” and 25 “negative” correlations (where 23 the HLA was respectively associated with divergence from, or preservation of, the consensus residue) were identified. Positive correlations confirmed the predictable selection of HLA-restricted escape mutations and highlighted the substantial portion of viral codons whose evolution is driven by escape. Negative associations, on the other hand, were interpreted as evidence to support the HLA footprinting effect; in other words, confirmation that the predominant circulating HIV-1 sequence had arisen in response to selection pressures imposed by the most commonly-expressed HLA alleles in the population [ 157 ]. There is one major concern, however, with this type of statistical association study. Evolutionary biologists have pointed out that standard tests of association that assume independence (such as Fisher’s exact test or logistic regression) are inap- propriate for the analysis of inter-species (or in this case, inter-strain) data, because sequences with a shared phylogenetic history do not represent statistically indepen- dent observations [ 61 ] (see chapter 2 for review). This problem is most apparent in the analysis of heterogeneous cohorts, where standard tests will identify correlations between subtype-specific viral polymorphisms and HLA alleles prevalent among indi- viduals infected with those subtypes. In this case, results do not necessarily indicate that these polymorphisms are selected under contemporary immune pressures; rather, they more likely reflect a correlation between HLA alleles enriched in particular hu- man populations and a subtype (or lineage)-specific viral polymorphism shared by all sequences in this branch of the tree (a so-called founder effect) [ 17 ]. Even in rela- tively homogeneous cohorts, failure to account for the evolutionary relatedness among HIV-1 sequences leads to increased variance in traditional statistical tests and thus uncertainty in interpreting the results (see ref. [ 31 ] and chapter 2 ). To address this issue, Bhattacharya et al. [ 17 ] proposed two methods to account for the underlying phylogenetic structure of HIV-1 sequences when identifying sites of CTL-mediated selection (the second method is the univariate model described and explored in Chapters 4 and 5 ). These approaches attempted to identify associations 24 that were unusual in the context of a shared lineage while statistically downplaying associations that could have been explained by neutral evolution. Their methods were used to identify HLA-associated polymorphisms in the Gag and Protease proteins in a mixed-clade data set of 96 sequences from the same cohort investigated by Moore et al. [ 17 ]. By comparing associations identified using standard versus lineage-corrected analyses, Bhattacharya et al. [ 17 ] demonstrated that a number of associations iden- tified by the original logistic regression method actually represented spurious asso- ciations that were better explained by founder effects. The authors noted however that phylogenetic correction achieves more than simply the elimination of spurious associations due to lineage effects. Consideration of tree structure also identified novel associations that were overlooked by the uncorrected analysis, thus allowing the potential for increased power compared to standard methods (though a formal power analysis was not undertaken) [ 17 ]. Overall, however, in the 96 sequences ana- lyzed by Bhattacharya et al., only 14 strong HLA allele/HIV-1 sequence associations were identified [ 17 ]. The relative paucity of associations identified by Bhattacharya et al. [ 17 ] compared to Moore et al. [ 157 ] led some to question the contribution of HLA-mediated selection pressures to viral evolution [ 118 ]. What followed was a scientific debate regarding the extent of HLA-mediated vi- ral imprinting at the population level. Was the data we previously accepted to be strong evidence of immune-driven viral evolution simply explained by founder effects? Not necessarily. Not only did the Moore [ 157 ] and Bhattacharya [ 17 ] analyses fea- ture different HIV-1 gene regions sequenced from different patient groups, but the latter was based on a data set less than one-quarter the size of the former, and thus had substantially reduced power to detect associations. Indeed, mathematical mod- eling experiments suggest that the false-negative rate (i.e. the % of true associations missed) may be over 80% in a data set of N=100 ( chapter 6 ), and Bhattacharya et al. themselves emphasized that their results should not be interpreted as evidence that immune pressure is a weak force in HIV-1 evolution [ 17 ]. Rather, the results un- 25 derscore the importance of disentangling the effects of immune selection from founder effects and emphasize the need for larger data sets to determine the extent to which immune imprinting shapes HIV-1 diversity at the population level. 3.5 HLA-associated immune pressures influence population HIV diver- sity at up to 40% of positions in some proteins Soon after the Bhattacharya et al. publication [ 17 ], we applied the phylogenetically- corrected methods to the first large-scale, population-based analysis of HLA-mediated imprinting on multiple HIV-1 genes [ 24 ]. Nearly 500 HLA-associated polymorphisms in the HIV-1 Protease, Reverse Transcriptase, Vpr and Nef proteins were identified in a cohort of ≈ 700 chronically-infected, antiretroviral na¨ıve individuals [ 24 ]. Polymor- phisms were dichotomized based on the direction of selection pressure: escape and reversion associations represented amino acids enriched in the presence or absence of a specific HLA allele, respectively. As expected, escape and reversion associations generally represented non-consensus and consensus residues, respectively, although some exceptions were noted. In addition, HIV-1 codons under diametrically opposed HLA selection pressures (where the escaped amino for one allele represented the re- verted (immunologically susceptible) form for another, and vice versa) were identified, highlighting a dynamic tug-of-war of immune selective pressures influencing HIV-1 di- versity at the population level [ 24 ]. Of note, the fact that the majority of identified associations fell outside the boundaries of known epitopes lead us to conlcude that escape is not confined to single point mutations selected within regions directly tar- geted by CTL; rather, immune pressures select for a broad range of polymorphisms on a protein- (and genome-) wide level. (As will be discussed in section 3.8 and chapter 6 , however, at least some of these association may be due to other sources of confounding.) In addition, substantial differences in the number of immune selection events were observed among HIV proteins [ 24 ]. Nef exhibited the greatest evidence for immune 26 adaptation, with ≈ 40% of its codons harboring at least one HLA association. In contrast, 10–15% of codons in Protease, Reverse Transcriptase, and Vpr exhibited evidence for HLA-mediated selection. More recent work suggests that the portion of Gag codons exhibiting HLA-associated substitutions is slightly higher than Pol [ 25 ]. Taken together, data indicate that amino acid variation at a minimum of 10–40% of HIV-1 subtype B codons is driven by HLA class I-associated immune pressures, even after correction for lineage effects [ 24 ]. In addition, we later applied the same methods to the entire clade C proteome on a cohort of 261 South Africans and found evidence for 310 distinct HLA-mediated escape events spanning the entire proteome [ 196 ], thus reconfirming that CTL immune selection substantially shapes HIV-1 diversity at the population level. 3.6 Clinical consequences of immune-mediated evolution Immune escape is believed to be a major factor limiting the immune system’s ability to control HIV-1 in the long term [ 85 ]. However, with the exception of documented loss of viremia control following escape within the B*27-restricted KK10 epitope in Gag [ 84 ], a clear relationship between CTL escape and HIV-1 disease progression has been difficult to demonstrate. A multitude of factors influence the frequency and kinetics of viral escape, making the clinical consequences of immune-driven HIV-1 evolution challenging to quantify. Even among individuals expressing the same HLA allele, substantial differences in the magnitude and frequency of epitope targeting are often observed, rendering it problematic to study the clinical consequences of escape on a population basis. A second complication is the issue of immunodominance, an incompletely understood phenomenon describing the fact that, despite the expression of up to six HLA class I alleles (and the potential to target multiple epitopes per allele), an individual’s CTL response is often initially directed against a single or few epitopes with a single HLA restriction [ 134 , 236 , 237 ]. Moreover, the breadth and specificity of targeted epitopes changes throughout the disease course. While the acute-phase 27 CTL response is generally narrowly directed, a progressive broadening of the epitopic repertoire occurs over time, likely as a consequence of viral escape within the early immunodominant epitopes [ 83 ]. In addition, the balance between increased fitness due to immune escape versus decreased fitness from the resulting substitution [ 137 ], the fact that escape is not always absolute (such that, in many cases the selected variant retains partial HLA binding and/or CTL recognition capacity [ 219 ]), and the ability of the immune response to adapt to a continuously evolving target (as demonstrated by the emergence of de novo CTL responses against newly-selected escape variants [ 4 ]), all contribute to the complexity of this issue. Finally, the fact that escape does not occur in isolation must also be considered, because mutations are likely to be selected in the context of a variety of compensatory changes and at the same time that reversion of transmitted mutations is occurring. Taken together, it seems unlikely to expect that the selection of any single mutation would be accompanied by a clinically detectable impact on HIV-1 disease progression. Nevertheless, the proteome-wide association study of Rousseau et al. [ 196 ] identified seven polymorphisms that each significantly predicted viral load, highlighting the role these mutations play in viremic control. In addition, the identification of HLA-associated polymorphisms in large clinically- derived data sets allows, for the first time, the characterization of the relationship between escape and markers of HIV-1 disease progression on a broader level. In their 2002 study, Moore et al. reported that the presence of HLA-associated polymorphisms in RT predicted pre-treatment plasma virus load (pVL) on a population basis [ 157 ]; however, this observation was not confirmed in the larger Brumme et al. [ 24 ] study that included Protease and Reverse Transcriptase. More recent work has, however, begun to identify classes of associations that contribute strongly to viremic control, with the importance of B alleles, the Gag protein and reversion associations partic- ularly highlighted (see section 5.6 for details). Further research is needed to assess whether there may be gene-specific differences in the contribution of CTL escape mu- 28 tations to markers of HIV disease, the answer to which may greatly inform vaccine design. 3.7 Strategies to cope with viral diversity in HIV-1 vaccine design Numerous strategies have been proposed to address the challenge of global sequence diversity in HIV-1 vaccine design. Immunogens based on consensus or phylogenetic reconstructions of ancestral and/or “center-of-tree” sequences attempt to minimize genetic distance between the vaccine and circulating HIV-1 strains [ 76 , 163 ]. Polyva- lent vaccine immunogens featuring maximal coverage of viral diversity and potential epitopes in compact sequence space have also been proposed [ 65 , 107 , 164 ]. To ad- dress the substantial mutational capacity of HIV-1, one proposed approach is to limit vaccine design to immunogenic yet highly conserved regions, such that escape could only occur at a substantial fitness cost [ 7 ]. A complimentary strategy would be to de- sign immunogens that incorporate both “wild-type” and “escaped” variants (as long as the variant retains its ability to bind HLA), thereby blocking preferred routes of escape in infected individuals [ 20 ]. The development of phylogenetically-informed methods to accurately identify vi- ral sites under active immune selection pressure using large clinically-derived HIV-1 sequence data sets represents a major advancement to the study of how viral genomes are shaped by human immunogenetic selection pressures. Achieving a complete pic- ture of the sites, pathways and kinetics of immune escape will not only help us gain an understanding of the extent to which host immunity shapes HIV-1 evolution, but will also inform the rational design of future vaccine immunogens. 3.8 Remaining challenges Despite recent advances in the field, a genome-wide map of HLA-associated polymor- phisms in HIV-1 remains far from complete. Although the importance of addressing the confounding effects of HIV-1 population structure are now recognized [ 17 , 24 ], 29 three important issues need to be addressed: statistical power, linkage disequilibrium (LD) among HLA alleles and coevolution of amino acids in viral sequences. As we have seen, the size of the study can dramatically effect the conclusions that are drawn. To date, the largest study analyzes a fairly homogenous cohort of approximately 700 in- dividuals, which resulted in dramatically different conclusions than studies on smaller cohorts. We will discuss this issue in the context of our proposed model in subsec- tion 6.2.5 , but the results discussed thus far underscore the importance of exploring these types of cross-sectional studies to even larger cohorts, as well as to assess other HIV-1 subtypes, where levels and patterns of HLA-mediated imprinting may differ. Indeed, one advantage of lineage-corrected methods may be the ability to combine heterogeneous cohorts to increase power, though the extent of shared epitopes among clades remains incompletely characterized. Of major concern is the potential confounding effects of linkage disequilibrium (LD) among HLA alleles. The concern over LD arises due to the fact that HLA class I alleles are situated in close proximity on the human genome and are not inherited independently [ 28 ]; therefore, an escape mutation driven by HLA-B*57, for example, may also be detected as being associated with the tightly linked Cw*06 allele. Whereas Moore et al. [ 157 ] addressed the issue of LD using logistic regression and a backwards elimination technique to identify the allele(s) that best explained the escape polymorphism [ 157 ], current lineage-corrected methods [ 17 , 24 ] do not directly address this issue. Similarly, none of the current approaches account for the effects of amino acid coevolution, such as when an amino acid substitution at one site preferentially occurs in the context of a secondary (or compensatory) mutation at another site [ 21 , 201 ]. Failure to account for this issue may result in both the primary and compensatory mutations being identified as correlated with the restricting HLA allele, when in fact only the former is directly selected. Although Moore et al. attempted to include co-varying amino acids as regressors [ 157 ], the confounding effects of phylogeny are 30 even greater when two amino acids share the same evolutionary history; indeed, a substantial body of literature exists to address the problem of covarying amino acids [ 39 ], though these methods have not previously been extended to incorporate HLA allele information or even correlations among multiple codons (see Chapters 2 and 5 ). Similarly, although Bhattacharya et al. [ 17 ] found evidence of compensatory mutations using a modified version of the evolutionary history reconstruction method proposed by Ridley in 1983 [ 190 ], the results are difficult to interpret due to the confounding of long and short range effects that occur when only pairs of variables are considered. Clearly more comprehensive approaches are needed to disentangle the complex interactions that govern HIV-1 escape pathways. 31 Chapter 4 PHYLOGENETIC DEPENDENCY NETWORKS We have argued in Chapters 2 and 3 that the study of HIV adaptation (and, by analogy, any study of adaptation) requires a framework that can model the evolu- tionary history of an amino acid in conjunction with selection pressure from one or more factors (HLA alleles, other amino acids, etc.). In this chapter, we describe the phylogenetic dependency network (PDN), an extension of the dependency network [ 94 ] that is well-suited for modeling adaptation. We will describe the PDN generally, using the area of HLA-mediated HIV adaptation to illustrate the concepts. A dependency network represents the probabilistic dependencies among a set of predictor and target attributes [ 94 ]. In our domain, target traits, denoted Y, corre- spond to the presence or absence of amino acids at all codons in an HIV protein. For a given Y in Y, the predictor traits, denoted X, correspond to the presence or absence of amino acids at all codons other than that for Y and the presence or absence of all HLA alleles. We constrain all traits to be binary, though generalizations are possible. We have found that this choice yields more statistical power in practice. A dependency network (phylogenetically corrected or otherwise) has two compo- nents. The first component, sometimes referred to as the structure of a dependency network, is a directed graph linking nodes, where each node corresponds to one of the traits in the domain. (We use the same name—e.g., Y —for the trait and its corresponding node in the graph.) An arc from X to Y in the graph is a statement that the probability distribution for Y depends on X. Thus, in our domain, a depen- dency network graphically depicts which HLA and codon traits predict each codon. The second component is a collection of conditional or local probability distributions, 32 one for every target trait of interest. The local probability distribution for target trait Y is P (Y | ˆ X), where ˆ X ⊆ X are the parents of Y in the graph. Therefore, in our domain, a dependency network contains a probability distribution for each codon trait conditioned on various HLA and codon traits. When constructing a dependency network, each local probability distribution is learned independently. This approach is computationally efficient, although it can lead to a decrease in statistical efficiency (see Discussion). A phylogenetic dependency network (PDN) for our HIV application is a depen- dency network in which each local probability distribution is corrected for the phylo- genetic structure of the HIV sequences. That is, the probability that a codon in an individual is a given amino acid depends on not only the traits ˆ X, but also on where that individual’s HIV sequence sits in the phylogeny ( Figure 4.1 ). Specifically, a PDN is a collection of the distributions P Ψ (Y | ˆ X), one for each Y in Y, where P Ψ refers to a distribution corrected for phylogeny. We use a model-selection approach to identify ˆ X, the set of parents for Y . Specif- ically, we use significance tests—False Discovery Rate (FDR) thresholds based on likelihood-ratio tests (LRTs)—to determine ˆ X. To avoid the inappropriate use of an LRT, we exclude traits as possible predictors when the corresponding predictor-target pair has a 2 × 2 contingency table that includes at least one bin where both the ob- served and expected value is at most three. This parameter was chosen based on performance with independent data (not shown). 4.1 Phylogenetically corrected distributions for one predictor trait A simple approach for identifying a set of traits that predict a given codon (i.e., for identifying the parents of a target trait in a PDN) is to test for pairwise correlations between a target codon and each predictor trait. The details of a statistical model that follows this approach, hereafter referred to as the univariate model, are described in section 4.4 and evaluated in chapter 5 . The model was first presented in [ 17 ], with 33 Figure 4.1: Phylogenetic dependency network (PDN). A PDN is a graphical model consisting of target traits whose outcome is a probabilistic function of predictor traits. Each of these probabilistic functions takes the phylogeny of the sequences into ac- count. Here, the target traits (green nodes) are binary and represent the presence or absence of amino acids at codons. These target traits may have dependencies on other codons (codon covariation) and/or on HLA alleles (HLA-mediated escape), which are denoted by blue nodes. Arcs represent the learned dependencies between target and predictor traits. All target traits are assumed to be influenced by the phylogeny (red arcs). The probability components of a PDN are the local conditional probabilities, each of which relates a single target trait to the phylogeny and a subset of the pre- dictor traits. These local conditional probabilities are learned independently for each target trait. In the hypothetical example depicted here, B*57 and B*58 predict M1 and A*02 predicts A5. A5 predicts A3, and there is a cyclical dependency among M1, G2, A3 and R4, in which most of the arcs are bidirectional. 34 the details described in [ 31 ]. We will provide a high level overview of the model here. To determine whether there is a significant pairwise correlation between predictor trait X and target amino acid Y , we compare the likelihood of a null model (sometimes referred to as the single variable model) that reflects the assertion that Y is under no selection pressure to an alternative model that reflects the assertion that Y is under selection pressure induced by a single predictor trait X. The null model assumes the target codon Y can be described completely by a model of independent evolution along a phylogenetic tree ( Figure 4.2 A). The leaves of the tree correspond to individuals in the study and are typically observed. The interior nodes of the tree correspond to unseen individuals infected by an HIV sequence that is a point of divergence. These nodes are hidden—that is, never observed. We use Y i to denote trait Y for the ith individual in the study (i = 1, . . . , N ). (Note that Y i is a variable in the ordinary statistical sense.) Because each target trait is binary, a natural null model is the two- state version of the continuous time Markov process, commonly used in phylogenetics [ 60 ]. This model assumes evolution is independent between different branches of the phylogeny and that the only informative predictor of a node in the evolutionary tree is its parent node in the tree. The alternative model adds a component of selection pressure derived from the predictor trait X ( Figure 4.2 B). We use variable X i to denote the trait X for the ith individual in the study (i = 1, . . . , N ), and do not explicitly name X for the unseen individuals represented in the interior of the phylogenetic tree. Because X may not share the same evolutionary history as Y , we assume X influences Y only at the leaves of the tree. In particular, we assume that, among the variables corresponding to trait X, only X i influences Y i for each i. This assumption was evaluated more fully by Carlson et al. [ 31 ] and found to be a reasonable approximation, even when X and Y share the same evolutionary history. To model selection pressure at the leaves, we extend the null model by adding a hidden trait H (with corresponding variables H i , i = 1, . . . , H) that represents what Y would have been had there been 35 Figure 4.2: The univariate model. (A) The null model, in which an amino acid evolves independently down the tree until it reaches a leaf. (B) The alternate model, in which an amino acid evolves independently down the tree until is reaches an individual, where it is influenced by selection pressure from the predictor. The variable H i for the ith individual represents the variable Y i had there been no influence from X i . Only the Y i and X i are observed. Conditional probability distributions are not shown. no selection pressure. The probability distribution for Y i then depends on H i and X i . When the values of H i and Y i are different, we say that a transition conditioned on X i has taken place. The precise rules governing the transitions conditioned on X i are given by the univariate leaf distribution P Ψ (Y |X) = P (Y i |H i , X i ). We assume that this leaf distribution is not a function of i—that is, this distribution is the same for each individual i = 1, . . . , N . Also note that the subscript Ψ is a reminder that Y i depends not only on X i , but also on the phylogeny through variable H i . In the univariate case, we define four possible leaf distributions. Escape means an individual may transition to Y i = 0 only when X i = 1. Reversion means an individual 36 may transition to Y i = 1 only when X i = 0. Attraction means an individual may transition to Y i = 1 only when X i = 1. Repulsion means an individual may transition to Y i = 0 only when X i = 0. Given a univariate leaf distribution, a single parameter s specifies the probability that the transition occurs given the appropriate state of X i . Note that attraction/repulsion correspond to a positive correlation between X i and Y i , whereas escape/reversion correspond to a negative correlation. The names of these leaf distributions correspond to various processes for selection pressure [ 31 ]. For example, the B*57-restricted CTL response selects for escape from the susceptible threonine at position 242 of the HIV Gag protein [ 132 ]. So, from the perspective of hidden and target traits that correspond to the presence and absence of threonine, the amino acid can transition from threonine to not threonine (H = 1, Y = 0) with a non-zero probability only when the individual has the B*57 allele (X = 1), which corresponds to the escape distribution just described. In addition, escape from threonine bears a fitness cost that leads to reversion in B*57-negative individuals [ 132 ]. Consequently, the amino acid can transition from not threonine to threonine (H = 0, Y = 1) with non-zero probability only when the individual lacks B*57 (X = 0), corresponding to the reversion distribution. The codon for threonine usually escapes to the resistant amino acid asparagine [ 132 ]. Continuing the exam- ple from the perspective of hidden and target traits that correspond to the presence and absence of asparagine, the amino acid can transition from not asparagine to as- paragine (H = 0, Y = 1) only when the individual has the B*57 allele (X = 1), which corresponds to the attraction distribution. Finally, the amino acid can transition from asparagine to not asparagine (H = 1, Y = 0) only when the individual lacks the B*57 allele (X = 0), which corresponds to the repulsion distribution. Although there is a natural pairing between escape/reversion and attraction/repulsion, in that the former indicates a negative correlation and the latter a positive correlation, the processes are each distinct and may provide information as to the underlying mechanism (see the section on distinguishing leaf distributions in Results). Furthermore, whereas the vast 37 majority of clinically-derived HIV sequences have either threonine or asparagine at codon 242, most codons are more variable, with more than one amino acid suscepti- ble to, or resistant from, CTL pressure mediated by the HLA allele. Consequently, escape/attraction and reversion/repulsion for alternate amino acids often provide ad- ditional information. Note that by restricting the univariate leaf distribution to one of these four forms, we have assumed that only one process (escape, reversion, at- traction, or repulsion) is occurring for a given predictor-target pair. Although in reality both escape and reversion (or attraction and repulsion) may occur with the same HLA-epitope combination, relaxing our assumption leads to substantial loss of power. Thus, we apply each of the four leaf distributions to the predictor-target pair and include only the most significant correlation in the model. 4.2 Phylogenetically corrected distributions for more than one predic- tor trait The univariate model works well when there are no correlations among predictor traits or among target traits [ 31 ]. As discussed, however, use of the model in the presence of linkage disequilibrium among HLA alleles and HIV codon covariation will likely lead to spurious associations. To avoid this problem, we use a multivariate model [ 33 ], in which more than one trait can be used to predict a particular target trait. In this model, for a given target trait Y , shown in Figure 4.3 , the target trait is allowed to evolve independently down the tree until it reaches a leaf in the tree corresponding to an individual in the study. At this point, selection pressure within the individual is governed by a multivariate leaf distribution, denoted P Ψ (Y | ˆ X), which depends on multiple predictor traits ˆ X. As in the univariate case, this leaf distribution is the same for each individual i = 1, . . . , N . The set of significant predictor traits can be identified by a number of meth- ods including forward, backward, and forward/backward selection. In this work, we concentrate on forward selection, wherein ˆ X is iteratively augmented with the most 38 X Hidden attribute H Observed target attribute Y Predictor attribute X Figure 4.3: The multivariate model. Here, an amino acid evolves independently down the tree until is reaches an individual, where it is influenced by one or more predictor traits. significantly associated trait at each iteration. For each added trait, we record only the most significant leaf distribution (escape, reversion, attraction, or repulsion). The significance of a predictor X with respect to target trait Y is computed using false discovery rates based on an LRT in which both the null and alternative models are conditioned on all significant predictors that were identified in previous iterations of forward selection. For practical purposes, we terminate forward selection when the most significant association has a p-value greater than or equal to some threshold to be described. There any many possibilities for the form of the multivariate leaf distribution P Ψ (Y | ˆ X). In this paper, we consider two distributions: Decision Tree and Noisy Add. 39 Decision Tree A straightforward way to represent the multivariate leaf distribution P Ψ (Y | ˆ X) is to list the probability distribution for Y given every possible instance of the traits H and ˆ X. Unfortunately, the length of this list grows exponentially with the number of predictor traits. An alternative is to use a Decision Tree, which is a compact representation of such a list. The use of the Decision Tree as a multivariate leaf distribution was recently employed by Matthews et al. [ 153 ] to account for HLA LD. Here, we describe the approach in some detail. A graphical depiction of the Decision Tree leaf distribution is shown in Figure 4.4 . Note that this tree should not be confused with the phylogenetic tree. To help avoid this confusion, we use the term tip to refer to the bottom points on the Decision Tree. Each path in the tree from root to tip defines a particular instance of a subset of the traits ˆ X, which in turn defines a conditioning event for the distribution of the target trait. For example, in Figure 4.4 , we consider the set of predictor traits ˆ X = (B57, C06, M28), with each branch labeled 0 or 1. The path that follows the value 0 for the trait B57, the value 0 for the trait C06, and the value 1 for the trait M28 corresponds to the instance (B57 = 0, C06 = 0, M28 = 1)—that is, the individual has M28 but not B57 or C06. At the tip of this path sits the corresponding conditional probability distribution P Ψ (Y i |B57 = 0, C06 = 0, M28 = 1). In general, each tip k in the Decision Tree is associated with the conditional distribution P Ψ (Y i | ˆ X = path k ), where path k is the conditioning event corresponding to the kth path. The collection of these conditional distributions over all tips constitutes the multivariate leaf distribution. A Decision Tree leaf distribution can be constructed in many ways. As mentioned, we use forward greedy search. First, we initialize the tree to a single root node, which is simply the univariate leaf distribution for the most significant trait. We then grow the tree iteratively. At each iteration, we consider extending (or splitting) a tip node k on some trait not already in the path to the tip. When splitting tip node k on 40 B57 C06 M28 Predictor variable X Local probability distribution 1 0 1 0 1 0 Figure 4.4: Decision Tree leaf distribution. Each path from root to leaf yields a distinct local probability distribution. an trait X, the node is replaced with two branches and two corresponding tip nodes. The left and right branches correspond to adding X = 1 and X = 0, respectively, to the conditioning event associated with the original tip node. The split is made if the resulting local distribution is a significantly better estimate than that prior to the split, as measured by an LRT. The LRT is computed using the univariate model applied to those individuals whose trait values match those described by path k . To make the process more efficient in our HIV application, we consider splitting the tip node only under the path X = 0 for all X in ˆ X. That is, we repeatedly apply the univariate model to all individuals for whom X = 0 for all the previously identified significant predictor traits. We iterate this process until no significant predictors are found, using a threshold of p < 0.05. Noisy Add One drawback of the Decision Tree approach is that, as the tree grows, the number of samples that we use to test for the next split decreases. Rather than consider smaller and smaller subsets of the data, the Noisy Add leaf distribution models selection pres- sure as an additive process among the predictor traits. That is, the Noisy Add leaf 41 distribution is based on the assumption that each predictor trait independently con- tributes a positive or negative selection pressure on the target trait. These pressures then sum to determine the value of the target trait. In the univariate case, each leaf distribution can be seen as representing three mu- tually exclusive and exhaustive events (for each individual): (1) the selection pressure is absent, either because the state of the predictor trait excludes selection pressure or, with probability 1 − s, no transition occurred despite the potential for selection pres- sure; (2) selection pressure leads to Y i = 1 (attraction or reversion); or (3) selection pressure leads to Y i = 0 (escape or repulsion). We can represent these three possible events by a hidden trait I that takes on the values 0, 1, and −1, respectively. Given a set of M predictor traits, we can associate a hidden variable I j i for the jth trait in the ith individual. Then, assuming that selection pressure across the predictor traits contributes independently and equally to the outcome of Y i , we can determine the outcome of Y i by summing the values of the I j i variables: Σ i = I 1 i + · · · I M i . If Σ i is 0, then it is as if no selection occurred. If Σ i < 0, then negative selection (es- cape/repulsion) has occurred, and the target variable Y i will be zero. If Σ i > 0, then positive selection (attraction/reversion) has occurred, and the target variable will be one. Of course, we don’t know the actual values of I j for each predictor variable, so we must sum over the possibilities, resulting in a probability distribution over Σ i . The strength or frequency of selection pressure contributed by each predictor trait j is captured by the parameter s j . Like the corresponding parameter s in the uni- variate model, s j is the probability that the predictor trait exerts selection pressure (I j i = 0), given the appropriate state for the predictor trait. A more precise definition of Noisy Add, including the generalization from the univariate model, specifics of learning the parameters s j , and methods for reducing computation time can be found in the section on model details. The contribution of a given predictor trait X j ∈ ˆ X as a predictor of target Y is quantified using an LRT against the null model consisting of ˆ X − X j . The most 42 significant predictor trait is added to the Noisy Add model on each iteration, stopping when the most significant predictor fails to achieve p < 0.005. (We use a more aggres- sive threshold than that for Decision Tree because Noisy Add is more computationally intensive.) 4.3 q-values We identify significance using q-values [ 211 ], which conservatively estimate the false discovery rate (FDR) [ 15 ] for each p-value. The FDR is defined to be the expected proportion of false positives among results called significant at a given threshold t. The q-value of t is the minimum FDR observed for all t ≥ t [ 211 ]. Following Storey and Tibshirani [ 211 ], we use the approximation FDR(t) = E F (t) S(t) ≈ E [F (t)] E [S(t)] , (4.1) where S(t) is the number of associations called significant at t and F (t) is the number of true nulls (false positives) at t. To estimate the numerator, we order the p-values of the association tests in increasing order p 1 , . . . , p m and use the approximation E [S(p i )] ≈ S(p i ) = i. To compute E [F (t)], Storey and Tibshirani point out that uniformity of p-values allows the approximation E [F (p i )] ≈ ˆ π 0 p i m (4.2) where ˆ π 0 is a (conservative) estimate of the proportion of all hypotheses that are truly null. In our case, we assume a priori that the vast majority of the many hypotheses tested will be null (i.e., most codons and HLA alleles have no direct effect on a given target trait), and so conservatively set ˆ π 0 = 1. It should be pointed out that E [F (p i )] = ˆ π 0 p i m (4.3) is only guaranteed if the data are continuous and the p-values are uniformly dis- tributed under the null hypothesis. The continuous assumption is certainly not met 43 for genetic data, which is discrete, and the p-values are quite often not uniformly distributed. Thus, E [F (p)] often has to be estimated directly from the data. We will discuss this in more detail in sections 5.1.2 and 6.1.1 . 4.4 Model Details In this section, we provide details regarding the univariate and Noisy Add models, in addition to a brief discussion on computational requirements for the models. 4.4.1 Details of univariate model First, let us consider the null model. Consider target trait Y that denotes the presence (Y = 1) or absence (Y = 0) of a particular amino acid at a particular codon. We use variable Y i , i = 1, . . . , N to denote the trait Y for the ith individual in the study. (We use corresponding notation for predictor traits and variables.) It is quite common to assume that the variables Y 1 , . . . , Y N are independent and identically distributed (IID). In our application, however, the variables are related through a phylogenetic tree. We can model these relationships using a probabilistic phylogenetic model as shown in Figure 4.2 A. Nodes at the leaves of the tree, labeled Y 1 , . . . , Y N correspond to the variables with the same name. (In general, we will use the same designation for both a variable and its node.) Unlabeled nodes in the interior of the tree correspond to events of divergence. We use Ψ to denote the structure (branchings and branch lengths) of the tree. Associated with each variable (or node) B in this phylogenetic tree is a condi- tional probability distribution P (B|A), where A is the parent node of B. As in the probabilistic model of Felsenstein [ 60 ] for a phylogenetic tree, we assume that the con- ditional probability table is described by a continuous time Markov process (CTMP) and parameterized by θ = (π, λ), where π is the stationary distribution of Y = 1 and λ is the rate of mutation. The conditional probability table of the CTMP from parent 44 node A to child node B along a branch of length d is given by P (B = b|A = a, d) = e −λd + π b · (1 − e −λd ) if a = b π b · (1 − e −λd ) if a = b. (4.4) where π b = π when b = 1, and π b = 1 − π when b = 0. This evolution model is reversible, making the choice of root in the tree arbitrary [ 60 ]. Given a set of observations for (typically, all of) Y 1 , . . . , Y N , there are several criteria that can be used to identify good values for the parameters π and λ and the structure Ψ of this model (or, in the Bayesian case, a distribution over these quantities). For this and all models discussed in this paper, we choose parameters and structure using the maximum likelihood criterion, as is done in (e.g.) [ 60 ]. There are a number of methods for identifying the maximum-likelihood parameters, including gradient decent and the Expectation-Maximization (EM) algorithm. In this work, we use the EM algorithm [ 49 ] to learn θ. To learn the structure Ψ, we apply PhyML to the nucleotide sequences using the general time reversible GTR model with all other parameters estimated from the data [ 87 ]. We denote this null model P Ψ (Y |θ), as it represents a phylogenetically corrected distribution for Y . Note that this model includes the situation where the observations of Y 1 , . . . , Y N are IID as a special case (i.e., the limit as λ tends to infinity.) Now let us consider the alternative model, which reflects the assumption that a codon is under selection pressure induced by a single predictor trait X. To construct this model, shown in Figure 4.2 B, we begin with the null model and first change each Y i to H i , which represents what Y i would have been had there been no influence from X i . Then, we assume that, for each individual i, the probability distribution for Y i depends on X i and H i . Further, we assume that these conditional distributions P (Y i |H i , X i ) are the same for each individual i, and collectively denote them by P ψ (Y |X). In general, this univariate leaf distribution can have four parameters corresponding to the four states of the conditional variables H i and X i . In our experience, however, use of such a distribution leads to loss of power. Consequently, we consider four separate 45 distributions (as was previously defined [ 31 ]) and, for any given association, choose the one that best fits the data: Escape P (Y i = 0|H i = 1, X i = 1) = s > 0; P (Y i = 1|H i = 0, X i = 1) = 0; P (Y i = a|H i = a, X i = 0) = 1. That is, H i and Y i can be in different states only when H i = 1 and X i = 1. Reversion P (Y i = 1|H i = 0, X i = 0) = s > 0; P (Y i = 0|H i = 1, X i = 0) = 0; P (Y i = a|H i = a, X i = 1) = 1. That is, H i and Y i can be in different states only when H i = 0 and X i = 0. Attraction P (Y i = 1|H i = 0, X i = 1) = s > 0; P (Y i = 0|H i = 1, X i = 1) = 0; P (Y i = a|H i = a, X i = 0) = 1. That is, H i and Y i can be in different states only when H i = 0 and X i = 1. Repulsion P (Y i = 0|H i = 1, X i = 0) = s > 0; P (Y i = 1|H i = 0, X i = 0) = 0; P (Y i = a|H i = a, X i = 1) = 1. That is, H i and Y i can be in different states only when H i = 1 and X i = 0. This model is reversible in the sense that the choice of root node among non-leaf nodes does not affect the likelihood of the data. We also note that, in principle, all parameters θ=(π, λ, s) and the structure Ψ can be optimized simultaneously. In practice, however, we find that using the structure Ψ learned in the absence of information about X works well, and is computationally more efficient. In addition, it may seem counter-intuitive that the HLA alleles of the individuals corresponding to the interior nodes of the phylogeny are not being taken into account. A path from one node to the next in the phylogeny, however, presumably reflects a series of infections over many individuals, some who will have the allele and some who will not. Thus, there will be some net evolution, which we account for by optimizing the parameters π and λ for each codon individually. Finally, we note that this model can be thought 46 of as a (discrete) mixed-effects model, wherein the predictor variables X i correspond to the fixed effects and the hidden variables H i correspond to the random effects [ 40 ]. Rather than being related by (e.g.) a Gaussian covariance matrix, the random effects are related by a phylogenetic tree. Both the null and alternative models are instances of what is known as a generative or directed acyclic graphical (DAG) model. In general, a generative model consists of a structure, a directed acyclic graph, in which nodes correspond to variables and missing arcs specify conditional independencies among the variables, and a set of conditional probability distributions, one distribution for each node. The conditional probability distribution for a given node is the distribution of the node given its parents. The conditional independencies specified by the structure of the graph allow the joint distribution of the data to be written as the product over the nodes of their conditional distributions. The independencies represented by the model facilitate computationally efficient inference, parameter estimation, and structure learning [ 93 ]. Importantly, given a set of parameters learned from real data, synthetic data can be easily generated from the model. When constructing PDNs, we separately learn a DAG model to encode each local probability distribution. As mentioned in the Discussion, however, one can restrict the arcs in a PDN to be acyclic, thus resulting in a single (phylogenetic) DAG model for all the traits in the data set. In the following section, we consider the multiple-predictor case and again use graphical models to represent phylogenetically corrected distributions. As we shall see, the computational efficiencies afforded by graphical models will play an even more important role. 4.4.2 Details of Noisy Add model To understand the Noisy Add leaf distribution, let us recast the univariate distribu- tion as the generative process shown in Figure 4.5 A. (Recall that this distribution is independent of i. In the text that follows, we describe this and the generalized process 47 for an arbitrary individual i. In the corresponding figures, we drop the subscript i to simplify the notation.) If X i = 1 (for escape or attraction; X i = 0 for reversion or repulsion), a coin weighted with probability s for heads is flipped. If the coin lands heads, then the intermediate variable I i gets the value 1 (for attraction or repulsion; -1 for escape and reversion). Otherwise, I i gets the value 0, corresponding to no selec- tion pressure. The value of I i is then copied to the value of another variable Σ i . (The copy is not necessary here, but will help us generalize.) Finally, the target variable Y i is assigned a value based on the deterministic function shown in Figure 4.5 B. With a little checking, it can be seen that this process produces precisely the univariate leaf distributions for escape, reversion, attraction, and repulsion. The generalization of this process to multiple predictor variables ˆ X i = X 1 i , . . . , X M i is shown in Figure 4.5 C. Here, there is an X j i and I j i node for each predictor variable X j i . The weight on the coin is possibly different for each predictor variable. We use s j to denote the weight for predictor variable X j i , and s to represent the collection of parameters (s 1 , . . . , s M ). The variable Σ i is now a sum of the intermediate variables I 1 i , . . . , I M i . Finally, as in the univariate case, Y i is a deterministic function of Σ i and H i as given in Figure 4.5 B. Applying this generative process to individuals i = 1, . . . , N , we obtain the con- ditional distribution P (Y 1 , . . . , Y N , H 1 , . . . , H N , I 1 1 , . . . , I M N |X 1 1 , . . . , X M N , θ), where θ = (s, π, λ) are the parameters of the model. Maximum likelihood values for these param- eters can be inferred efficiently. The summation Σ i = I 1 i + . . . + I M i can be grouped as Σ i = (((I 1 i + I 2 i ) + I 3 i ) + . . . + I M i ), yielding the graphical model shown in Figure 4.5 D. This grouping makes it possible to compute the distribution for Y i for any instance of the variables ˆ X i and H i in time that is quadratic in M . Furthermore, given any instance of the predictor variables ˆ X i , H i , and Y i , the probability distributions for I 1 i , . . . , I M i can be computed in time that is quadratic in M . Consequently, we can use the EM algorithm to estimate the parameters s efficiently. To estimate the full set of Noisy Add parameters θ, we embed this estimation procedure within an outer |
Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©fayllar.org 2024
ma'muriyatiga murojaat qiling
ma'muriyatiga murojaat qiling