Phylogenetic dependency networks: Inferring patterns of adaptation in HIV
Figure 7.2: Number of associations in optimal epitopes as a function of q-value rank.

additionally had direct associations. It is therefore not surprising that M_PL, which fails to account for codon-codon covariation, identified escape mutations within almost as many optimal epitopes as the full model M_PLC (Figure 7.2). We further compared the HLA-codon associations of the other three models to the optimal epitopes. Only M_L and M_FET, which fail to account for both phylogeny and codon covariation and are thus quite prone to founder effects [17], performed significantly worse than the other models (p < 0.0001). The models that roughly account for clade differences, either through codon covariation (M_LC) or phylogeny (M_P), performed slightly worse than the full model, though these differences were not significant.

7.3 Discussion

We have presented the first approach to simultaneously account for viral phylogeny, codon covariation, and HLA linkage disequilibrium in population-based association studies; it is also the first large-scale multiclade analysis of HLA-mediated escape in HIV-1. The large number of direct HLA-codon associations confirms a substantial role of the HLA-restricted CTL response in driving HIV evolution, and supports the observation that patterns of HIV evolution are broadly predictable based on host immunogenetic profiles [2, 17, 24, 25, 157, 196]. Moreover, our results demonstrate that escape and reversion mutations often arise in the context of a complex set of correlated substitutions that reflect both compensatory mutations and dependencies among epitopes. On the whole, the phylogenetic dependency network predicts that a major proportion of p17 (41%) and p24 (20%) codons are under selective pressure from at least one HLA allele, a result that confirms a dominant role of T-cell responses in driving viral evolution [2, 32, 85].

This study also represents a significant step forward by providing a statistical approach that can help differentiate direct (H → A) HLA escape polymorphisms from indirect or, more specifically, one-hop (H → B) escape polymorphisms in situations where the true interaction is the chain H → A → B. Although the direct-indirect distinction can arise under several mechanisms, the explicit statistical interpretation is as follows: a direct HLA-polymorphism association H → A means the HLA allele H is a strong predictor of the polymorphism A, whereas an indirect HLA-polymorphism association H → A → B means the polymorphism B is better predicted by the polymorphism A than by the HLA allele H. Although B is in a sense HLA-associated, the distinction between direct and indirect associations may have important biological implications. For example, many of the indirect associations identified by the dependency network for the B*57-restricted TW10 and B*27-restricted KK10 epitopes are consistent with known compensatory mutations associated with escape in these epitopes [21, 201]. In addition to these described pathways, the dependency network reports a number of covarying amino acids.
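To make the direct-versus-one-hop interpretation above concrete, the following minimal sketch simulates a chain H → A → B and asks which variable better predicts B. It is illustrative only: it ignores the phylogeny and HLA linkage disequilibrium, uses made-up rates, and is not the dissertation's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulate a chain H -> A -> B: escape A is driven by the HLA allele H,
# and the downstream polymorphism B is driven by A, not directly by H.
H = rng.binomial(1, 0.3, n)                       # host carries the HLA allele
A = rng.binomial(1, np.where(H == 1, 0.7, 0.1))   # escape polymorphism
B = rng.binomial(1, np.where(A == 1, 0.8, 0.05))  # covarying polymorphism

def bernoulli_loglik(target, predictor):
    """Log-likelihood of `target` with a separate maximum-likelihood
    Bernoulli rate for each value of the binary `predictor`."""
    ll = 0.0
    for v in (0, 1):
        y = target[predictor == v]
        p = np.clip(y.mean(), 1e-6, 1 - 1e-6)
        ll += np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return ll

# The signature of a one-hop association: B is better predicted by A than by H.
print("log L(B | H) =", round(bernoulli_loglik(B, H), 1))
print("log L(B | A) =", round(bernoulli_loglik(B, A), 1))
```

In this setting the likelihood of B given A is substantially higher than that given H, which is roughly the pattern described above for labeling the H → B association as one-hop rather than direct.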
Many of these covarying amino acids are in close physical contact, and thus likely candidates for compensatory pathways that can be tested via in vitro replication capacity assays, although distal covarying codons may also exhibit compensatory relationships [38, 142, 179, 232, 242]. Understanding the specifics of compensatory-based covariation has important implications for T-cell-based vaccine design, as escapes that require multiple compensatory mutations may take longer to arise due to chance, and the compensatory mutations may not completely recover lost fitness [21, 85, 200, 201].

Compensation is not the only potential causal mechanism of codon covariation. Other mechanisms include those associated with CTL-mediated covariation. Indeed, the PDN indicates that up to 50% of the observed codon-codon covariation occurs between epitopes restricted by the same HLA allele, suggesting much of the observed codon covariation in HIV is CTL-mediated. Two possible mechanisms of CTL-mediated covariation are inter-patient variability in the immune system's ability to target epitopes and consistent patterns of epitope targeting due to immunodominance. Distinguishing between these two mechanisms may have direct relevance to vaccine design, but will require comparing the results of the PDN to clinical response data that can measure epitope targeting and to longitudinal samples that can identify the order of escape. Although it is well known that the order in which the epitopes of some HLA alleles are targeted is broadly consistent [7, 18, 23], identifying new patterns may yield new vaccine candidates. Specifically, it is possible that HLA alleles that are currently considered non-protective target ineffective dominant epitopes during acute infection. Redefining the immunodominance hierarchy via immunogen exposure may thus increase the effectiveness of these alleles upon subsequent HIV challenge [73].

A major challenge to vaccine design is global HIV diversity [76, 143]. Although accumulating evidence suggests that patterns of escape are broadly predictable [24, 25, 32, 157, 174, 196], these studies have been limited to relatively small sample sizes or cohorts consisting predominantly of a single clade. Although a comparison of the Durban and British Columbia results showed instances of both consistency and divergence of associated escape in the two clades [196], these studies were run separately, did not account for codon covariation, and used different methods for determining associations. Thus, the extent to which escape pathways are shared across clades was largely unknown. Our results, which reflect data equally distributed between clade B and clade C sequences and are evaluated by taking HLA LD, viral lineage, and codon covariation into account, confirm the existence of common escape pathways. This similarity suggests that a broadly reactive vaccine may be possible, though more work to further characterize inter-clade similarities and differences will be necessary.

Despite the broad similarities seen between clades B and C, we noted several intriguing examples where the resistant form of an epitope matched the consensus sequence for one of the clades.
Such examples support the HLA footprinting hypothesis [131, 157], which proposes that the consensus sequences of circulating strains in a population are a result of consistent escape (and lack of reversion) from the most common HLA alleles in that population, a hypothesis that is especially well founded in cases where the consensus polymorphisms differ between populations. For example, 53% of individuals in the British Columbia cohort [25] have A*01, whereas only 24% of the Durban cohort [116, 196] have A*01, and F79 (clade B consensus) is the resistant form of the association. Furthermore, alleles A*29 and A*68 have higher frequencies in the South African cohort, and Y79 (clade C consensus) is the resistant form of their associations. Thus, at codon 79, there appears to be broad selection pressure for evolutionary fixation of F in the British Columbian cohort and fixation of Y in the South African cohort. Our analyses identified a total of 21 codons (four with independent experimental support [70]) where the predicted escape matched clade B or C consensus, adding support to the hypothesis that CTL pressure serves a broad, population-level role in shaping HIV evolution, and may even serve a key role in clade differentiation [157].

We have focused this study on the highly immunogenic Gag p17 and p24 proteins, which are believed to serve a key role in effective control of HIV [56, 77, 116, 243]. Moving forward, it will be important to extend such studies to full-length genomes, where patterns of covariation may reflect sites of protein-protein interaction [39, 88, 224] and may further reveal broad patterns of immunodominance. Furthermore, as the number and diversity of large cohorts of HIV-infected, HLA-typed individuals continue to grow [25, 116, 157, 174, 196], it will be important to combine data sets in order to increase statistical power and further detail the similarities and differences among clades that may inform broad-coverage immunogen design.

One limitation of our two-clade study is that, because the HLA data in the HOMER cohort had only 2-digit resolution, we truncated the HLA data in the Durban cohort to 2-digit types as well. Although closely related HLA alleles often target the same or similar epitopes [70], making 2-digit resolution an appropriate choice for some allele-epitope pairs, important differences do exist. An example is the distinction between B*5801, which is associated with effective viral control, and B*5802, which is associated with poor viral control [115, 162]. In cases where the prevalence of 4-digit resolution types differs substantially between cohorts (as is the case with B*58) and the 4-digit types target different epitopes, truncation to 2-digit types before combining cohorts will lead to confounding in which the 2-digit types from one cohort will tend to lead to escape whereas the 2-digit types from another cohort do not. In principle, a better approach would be to include both 2-digit and 4-digit HLA alleles (or any other grouping of alleles) as predictor attributes in our model. For example, if all B*57 alleles select for the same escape mutation, then B*57 would be chosen by the model as a stronger predictor than B*5701, whereas escape mutations selected only by B*5701 would lead to the 4-digit allele being chosen. Of course, these facts should encourage researchers to perform high-resolution typing on individuals in their cohorts. In addition, Listgarten et al. [135] have developed a statistical approach for inferring high-resolution HLA alleles from low-resolution haplotypes. Although incorporating the uncertainty of those predictions into the PDN is beyond the scope of this work, the ability to infer high-resolution HLA data will allow for more effective evaluation of large, multi-cohort studies.

The comparative method has long been used to generate hypotheses regarding traits and the environment [92, 148, 149]. Because (quasi-)species share a common history, the inherent population structure (in this case, the phylogeny) must be accounted for [61], and numerous methods that do so have been proposed (e.g., [64, 92, 148] and references therein). Our study on HLA immune escape mutations suggests, however, that simply accounting for population structure is not enough, as HLA linkage disequilibrium (structure among environmental predictors) and codon covariation (structure among target traits) are at least as important as phylogeny in both increasing statistical power and avoiding false positives.

This issue is relevant to applications beyond those studied here. Specifically, whenever chains of interactions are common, pairwise methods will tend to identify direct as well as indirect correlations. This effect was most dramatically seen in the synthetic codon-covariation tests, in which a logistic regression-like approach (which accounts for chains) dramatically outperformed the phylogenetic pairwise approach. Although many phylogeny-aware comparative methods have been developed for codon covariation [39], the problem of chains of interactions has only recently been addressed [45, 182]. The PDN provides an efficient framework in which chains of interactions can be identified in the context of both the phylogeny and confounding from external sources of selection pressure (here, the HLA-mediated CTL response).

The first approach to identifying chains of interactions in a phylogenetic context was recently provided by Poon et al. [182]. They employed a directed acyclic graphical (DAG) model rather than a dependency network. In a DAG model, arcs from predictor to target attributes form a directed acyclic graph and local distributions take the same form as in a PDN. (A DAG model is often referred to as a Bayesian network, although the latter name is misleading, as non-Bayesian procedures can be used to construct DAG models.) When learning the distributions in the DAG model, Poon et al. took phylogeny into account, although in a way different from our approach. In particular, when learning the distribution of an attribute given its parents in the DAG model, they imputed for each individual the value of the attribute corresponding to the ancestor of that individual in the phylogeny. These imputed values were then treated as observed data and fed to a standard DAG model structure learning algorithm.

The PDN provides an alternative approach that leverages the strengths of dependency networks. The most apparent difference is that dependency networks allow cycles, resulting in a network that is easier for the non-expert to interpret than is the DAG model [94]. In addition, Poon et al. used unrestricted local distributions, in contrast to our use of Noisy Add. The use of Noisy Add, where the number of parameters is linear in the number of parents rather than exponential, results in a substantial increase in power.
Finally, because the PDN is concerned only with local probabilities, only the target variable is conditioned on the phylogeny, allowing the PDN to efficiently model associations with attributes, such as HLA alleles, that are not expected to follow the phylogeny, as well as attributes, such as other codons, that are expected to follow the same phylogeny [31]. The result is an efficient method that can simultaneously incorporate a diverse range of selection pressure attributes.

One drawback of a dependency network relative to a DAG model is that the local distributions among the target attributes overlap and yet are learned independently. (For example, the local distribution for A given B and the local distribution for B given A are closely related, yet are learned independently.) This independent learning leads to a decrease in statistical efficiency. In practice, however, this decrease is typically minimal [94]. Another drawback of a dependency network is that the presence of cycles makes inference of the joint distribution cumbersome, requiring an inefficient modified Gibbs sampling procedure to estimate the joint likelihood [94]. One possible solution is to modify the method for constructing a PDN so that it yields a DAG model. In particular, we can choose a random ordering for the attributes, and then build a PDN wherein the allowed predictors of a target attribute are only those that precede the target attribute in the ordering. The resulting collection of local probability distributions defines a DAG model (where acyclicity is guaranteed by the ordering constraint). The resulting model can be improved substantially by applying the above procedure to a dozen or so random orderings, and then choosing the best model according to some criterion (e.g., a Bayesian criterion or cross validation) [104]. The resulting DAG is a generative model that can be used to perform inference on the joint distribution.

Chapter 8

SUMMARY

This dissertation has introduced the Phylogenetic Dependency Network, a framework for the identification of adaptive traits and the specific sources of selection pressure that drive adaptation. The idea of the PDN is to fit each potentially adaptive target trait to a probabilistic model of evolution that conditions the target on both the phylogeny and other predictive traits that exert selection pressure on the target. The structure of the PDN then consists of arcs connecting targets to the sources of selection pressure that putatively drive the adaptation of each target.

We have defined and explored three specific probabilistic models of conditional adaptation, which can be used as the probability components of the PDN. These models are similar in that they each make the simplifying assumption that each target trait has evolved independently of the predictor traits throughout its evolutionary history. It is only when the target trait reaches the environment in which we are able to observe it that we model the interaction between predictor and target. Although this assumption certainly does not describe the true biological process, we have provided empirical evidence that it is a useful and reasonable approximation. Specifically, we considered two extreme examples. In the first, the predictor trait(s) is IID and uniformly distributed.
In this case, models based on this assumption are indistinguishable from models that integrate out the interaction of the predictor(s) in each hidden internal node, as this integration is indistinguishable from increasing the rate of evolution in the independence model. In the opposite extreme, the predictor and target are coevolving. In our simulation studies, we found that the conditional adaptation model is nearly as good as the generative coevolution model, suggesting that conditional adaptation is a reasonable approximation even when the true causal model is coevolution.

Of the three conditional adaptation models we proposed, we found that Noisy Add is the most expressive. In this model, multiple predictor traits are combined to drive the adaptation of the target. The parameters of the model specify whether each predictor exerts positive or negative selection pressure on the trait, as well as the probability that any non-zero selection pressure is exerted. Because there is only a single parameter for each predictor trait, the number of parameters grows linearly with the number of traits in the model. This represents a major advance over previous approaches, in all of which the number of parameters grows exponentially with the number of traits in the model. This property allows us to fit a large number of traits to the model, even with a modest amount of data, and to do so in a reasonable amount of computational time.

Because the conditional adaptation assumption can reasonably model interactions of target traits with predictors that exhibit a wide range of distributions relative to the target trait's, the Noisy Add model is able to simultaneously incorporate a diverse range of predictor traits into the model. Traditionally, approaches to the comparative method have assumed that the predictor traits follow a specific, prespecified distribution. Most often, the assumption is that the predictor has coevolved with the target along the same phylogeny. Although this model is reasonable for many domains, such as the study of codon covariation in proteins, we have argued that it is not appropriate for many applications in which the predictor trait is derived from the environment. Specifically, we have considered the case where we model the adaptation of HIV to the specific HLA alleles of the human host. In this case, we have shown that a coevolution assumption does not describe the data well and is significantly outperformed by the conditional adaptation model. Furthermore, because the conditional adaptation model can describe coevolution data reasonably well, the Noisy Add model is able to simultaneously incorporate a diverse range of predictors, including those that have coevolved with the target traits and those whose distributions are quite different from that of the target. In addition, our forward selection approach to determining the predictors for a given target can effectively account for correlations among predictors, which we have shown is important for increasing accuracy.

Importantly, the ability to simultaneously model the influence of a diverse range of predictor traits has enabled us to explore HIV adaptation from a rich new perspective. Focusing on HIV adaptation to the HLA-mediated adaptive immune response, we have argued that there are three major sources of statistical confounding: (1) population structure due to the HIV phylogeny, (2) linkage disequilibrium among HLA alleles, and (3) HIV codon covariation.
Although these sources of confounding are often acknowledged, limitations of previous approaches have prevented an in-depth study of HIV adaptation that can correctly account for these problems. In contrast, we have demonstrated that a PDN that uses the Noisy Add model is able to simultaneously account for all three sources of confounding. The result is that we are able to more accurately predict which specific HLA alleles are directly correlated with specific HIV codon substitutions.

When we applied our model to a large and diverse cohort, we found a rich network of interactions. It was not uncommon for a single HIV target codon to be directly correlated with a handful of HLA alleles and a large number of other HIV codons. The density of the dependency network underscores the importance of using a model that can simultaneously account for multiple interactions. As we demonstrated with both synthetic data and arguments based on observations of the real data, failure to do so results in a large number of spurious associations, especially as the size of the data set (and the corresponding power to detect deviations from the null model) increases. Even when the goal of a study is simply to identify coevolving codons, the limitation of previous coevolution approaches to comparing pairs of traits leads to a large number of spurious associations that derive from indirect interactions.

Our application of the PDN to the largest cohort of HLA-typed, chronically infected, antiretroviral-naïve patients studied to date reveals the complexity, and yet promising consistency, of HIV adaptation to the cellular immune response. In addition to identifying several of the handful of well-studied escape pathways, we identified a plethora of putative escape adaptations, many of which lie within or adjacent to known epitopes, but many of which also lie in putative epitopes. Remarkably, many of these adaptations correlate strongly with coevolutionary adaptations elsewhere in the protein. Although we cannot infer a causal mechanism for the coevolutionary changes, the combination of recently characterized compensatory mutations, coupled with the observation that coevolving pairs of codons in our data tend to be proximal in the folded protein, suggests that many of the coevolving pairs may represent compensatory mutations.

Moving forward, a major goal of T-cell vaccine design is to identify the specific epitopes that should be included in a vaccine. Given the patterns of escape and compensation that have been characterized in some epitopes that are targeted by protective alleles, the dense map of escape and coevolution that we have identified may provide key new insights into which regions of the virus are more vulnerable to an effective immune attack and should therefore be included in a vaccine.

In conclusion, the main contributions of this dissertation can be succinctly summarized as follows:

1. The introduction of the phylogenetic dependency network framework, which uses the assumption of conditional adaptation to map sources of selection pressure to adaptive traits.

2. The definition and exploration of three probabilistic models of conditional adaptation. In particular, the Noisy Add distribution, which efficiently and parsimoniously models the effect of selection pressure derived from multiple predictor traits on a single target trait (illustrated in the sketch below).

3. The application of Noisy Add and the PDN to the study of HLA-mediated adaptation in HIV.
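To give a rough feel for how a Noisy Add-style local distribution combines per-predictor selection parameters, here is a toy sketch. It is not the Noisy Add distribution as defined in this dissertation (which is specified over the leaf distributions of chapter 4 and fit by EM); the function name, the combination rule, and the parameter values are illustrative assumptions, intended only to show why the parameterization grows linearly with the number of predictors.

```python
import numpy as np

def sample_noisy_add_style_pressure(predictors, selection, rng):
    """Toy combination of per-predictor selection pressures.

    Each predictor i that is present contributes +1 (pressure toward escape)
    with probability selection[i] when selection[i] > 0, or -1 (pressure
    toward reversion) with probability |selection[i]| when selection[i] < 0,
    and otherwise contributes 0. The contributions are summed and the sign
    of the sum gives the net direction of selection on the target trait.
    One parameter per predictor, so the parameterization grows linearly
    with the number of predictor traits.
    """
    total = 0
    for present, s in zip(predictors, selection):
        if present and rng.random() < abs(s):
            total += 1 if s > 0 else -1
    if total > 0:
        return "escape"
    if total < 0:
        return "revert"
    return "no pressure"

rng = np.random.default_rng(0)
# A target codon with one HLA predictor exerting strong escape pressure and
# one covarying codon exerting weaker, opposing pressure:
draws = [sample_noisy_add_style_pressure([1, 1], [0.9, -0.4], rng) for _ in range(10_000)]
print({k: draws.count(k) for k in ("escape", "revert", "no pressure")})
```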
The source code and executables for the PDN, the distributions discussed, and the PhyloDv viewer that we use to display the results are freely available at http://www.codeplex.com/MSCompBio.

8.1 Limitations and future directions

We now discuss some limitations of the PDN framework and the conditional adaptation models, as well as some future directions that could address many of these problems.

8.1.1 Numerical and computational issues

Perhaps the most obvious limitation is the amount of computing power required to run these models. Indeed, even the largest data set considered in this dissertation can be analyzed using Fisher's exact test with only an hour or two of CPU time, whereas running Noisy Add took the better part of a year's worth of computing time. Nevertheless, as we have shown, the vast increase in computing time brings with it a substantial increase in statistical power and confidence in the calibration of the resulting significance scores. Indeed, in our largest synthetic studies, the application of FET yielded all but unusable predictions. In addition, given the advent of cheap, high-performance computing, we feel that the computational cost is a small price to pay for the increase in accuracy. This tradeoff is especially worthwhile if the algorithm can be run on a shared resource to defray the marginal cost. As such, we are developing a web-based, cluster-backed server as a service to the biological community so that computing costs are not an issue for researchers.

On the technical side, there is always a concern that EM optimization will settle on a local optimum. In principle, this problem can be eliminated with enough random restarts, though restarts are computationally costly, and we have found that the fitness landscape tends to be such that, when a local optimum is a problem, it takes a large number of restarts to find the global optimum. In practice, we have found that the best solution is to initialize all parameters with "reasonable" starting values, which can be estimated from the data by, for example, assuming the data are IID. When local optima are a problem, there are three possibilities: (1) only the calculation of the alternative likelihood falls into a local optimum, (2) only the calculation of the null likelihood falls into a local optimum, or (3) both calculations fall into local optima. Typically the most damaging scenario is case (2), as the result can be a wildly inflated significance measure. Fortunately, the fact that the null model is nested in the alternative model implies that if case (2) occurs when there is no true correlation, then there is probably a configuration of the null model parameters that will yield a likelihood close to that of the alternative model. This fact suggests the following heuristic: for every significant association, refit the null model using the applicable parameters from the alternative model as a starting point for EM. While this heuristic has no guarantee of helping, we have found that in practice it greatly reduces the number of spurious associations arising from case (2).

8.1.2 Problems with trees

A natural criticism of our approach is that we condition on a phylogeny, which of course cannot be known with complete certainty. There are, in fact, two major criticisms that can be leveled with regard to our use of phylogenies: (1) we ignore any uncertainty in the phylogeny, and (2) we infer the phylogeny assuming there is no selection pressure, a rather circular assumption given that we then set out to identify selection pressure acting on individual traits from which the phylogeny was inferred. Let us consider each of these criticisms in turn.

Uncertainty in phylogeny inference

The myriad phylogeny inference algorithms and their parameter settings make it clear that inferring phylogenies is an imprecise art. Indeed, when we run a number of available programs on the data sets considered in this dissertation, each yields a slightly different phylogeny. Can we really just pick one and assume it is the correct one? The good news is that we have found that the precise phylogeny used in our method does not have a huge impact on the results, and what impact it does have appears to be confined to reducing statistical power; it does not appear to inflate our estimates of statistical significance (see subsection 5.2.1). In other unpublished work, we have found this observation to hold up on real data as well. Specifically, in one example, we ran the model on five "similar" trees (meaning the major topological features are preserved) and found that, of those associations identified using any of the five trees at q < 0.05, 80% were found using all five trees.

Although we can be reasonably optimistic that the model is robust to imperfect trees, a solution that incorporated the phylogenetic uncertainty might increase statistical power and would at the very least be more intellectually satisfying. One possibility is to take a Bayesian perspective and compute the posterior distribution of phylogenies (using, for example, MrBayes [103], which will provide such a distribution if given enough time). We could then sample trees from this posterior distribution and create a PDN for each tree. Because the PDN consists of local probability distributions, the question then focuses on how to integrate over the sampled trees in the local conditional probabilities. One simple possibility would be to compute the alternative and null likelihoods as the average of the alternative and null distributions over the sampled trees. The resulting likelihood ratio would then reflect the uncertainty in the tree structure.

How does selection pressure influence inference of the phylogeny?

Even if we ignore the uncertainty in inferring the phylogeny, one must still wonder about the common assumption in the inference process that each site in the sequence is evolving neutrally. Not only is this assumption violated in general, but the whole point of the PDN is to identify where and how it is violated. Indeed, some of our collaborators have recently pointed out that the strongest associations can strongly bias the inference of the phylogeny [152]. In their case, simply removing the sites known to escape in response to B*57 drastically altered the inferred phylogeny. This observation suggests a simple, iterative solution (sketched in code below):

1. Infer the phylogeny using all sites.

2. Build the PDN using the inferred phylogeny.

3. Identify the target variables involved in the strongest associations and remove the underlying codons from the DNA sequences.

4. Infer a new phylogeny using the DNA sequences modified in step three.

5. If the phylogeny is very similar to that from the last iteration, stop. Otherwise, continue with step two.
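A schematic of this loop follows. The helper callables (phylogeny inference, PDN construction, identification of the strongest associations, masking of sites, and tree comparison) are hypothetical placeholders that a user would supply, for example as wrappers around an external phylogeny program and the PDN software; they are not functions provided by the released tools.

```python
def iterative_phylogeny_pdn(sequences, infer_phylogeny, build_pdn,
                            strongest_sites, mask_sites, tree_similarity,
                            threshold=0.95, max_iters=10):
    """Sketch of the iterative procedure listed above.

    `sequences` maps individual -> aligned DNA sequence. All other
    arguments are user-supplied callables (illustrative placeholders,
    not part of the released PDN software).
    """
    working = dict(sequences)
    tree = infer_phylogeny(working)                 # step 1: use all sites
    pdn = None
    for _ in range(max_iters):
        pdn = build_pdn(tree, working)              # step 2: build the PDN
        hot_sites = strongest_sites(pdn)            # step 3: strongest targets...
        working = mask_sites(sequences, hot_sites)  # ...and drop those codons
        new_tree = infer_phylogeny(working)         # step 4: re-infer the phylogeny
        if tree_similarity(tree, new_tree) >= threshold:
            break                                   # step 5: converged
        tree = new_tree
    return tree, pdn
```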
In the end, we will be left with a phylogeny built from sites that are not under strong selection pressure (or at least, those that are under strong selection pressure contribute relatively little to the phylogeny inference procedure). The PDN from this last step can thus be considered the most reliable. One concern might be that, because the first iterations of this algorithm involve a tree that is apparently incorrect, the associations we identify from those trees are likely incorrect and thus the wrong sites are removed, leading us down a wrong path. Two responses can be given to this. First, the fact that we are reasonably robust to wrong trees should guard against this problem. Second, and perhaps more important, any association that is strong enough to alter the phylogeny inference is likely strong enough to be identified given most any phylogeny. Indeed, the associations identified in Matthews et al. were those that have long been known precisely because they can be identified using just about any method (including those that ignore the phylogeny altogether).

Nevertheless, a more principled approach can be taken to this problem. Specifically, one could infer the structure of the phylogeny and the PDN simultaneously. One approach could be to use an EM-like procedure. That is, as we have demonstrated, given a phylogeny we can easily identify associations. The reverse is also likely to be true. Given a PDN, in which we model the selection pressure acting on each target variable, we should be able to infer a phylogeny that incorporates that selection pressure. By iterating back and forth between these two steps, we may expect to improve both our phylogenetic inference and the structure of the PDN.

8.1.3 Deterministic cycles

A more fundamental limitation of the PDN is an inherent limitation of dependency networks in general. Because a dependency network consists of local conditional probability distributions, in certain circumstances true causal interactions can be impossible to detect. Consider the following causal model: X → Y → Z. Further, suppose the conditional probability table for Y → Z is the deterministic relation

Pr{Z = 1 | Y = 1} = 1
Pr{Z = 0 | Y = 0} = 1

Now when we fit the local conditional probability distributions for the target variables Y and Z, we will find that Y is fully sufficient to predict Z and Z is fully sufficient to predict Y. Thus, if the relation between X and Y is non-deterministic, our forward selection procedure will never find X → Y, because there can be no gain in likelihood over the relation Z → Y (even though in the true causal model, the direction of causation is in reverse).

An example of where we have seen this problem is the case of B27-mediated escape in the epitope KK10. In the combined HOMER-Durban cohort, we were able to recover the known escape pathway B27 → R264K → L268M, which is one of two possible escape pathways originating at position 264. When we look only at the HOMER cohort, however, we do not see this pathway. What we do see is R264K → L268M and L268M → R264K. Upon closer inspection, we find that the selection parameter in both cases is 1, meaning that if escape occurs in one position, escape must occur in the other position. Indeed, in the HOMER cohort, every patient with one of the polymorphisms has the other polymorphism. Consequently, we fail to find the association of these escape mutations with B27, which is the known cause of the escapes.
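A small simulation of this failure mode (illustrative only; plain Bernoulli likelihoods with no phylogeny, not the PDN's actual likelihood computation) shows that once Z predicts Y perfectly, adding the true cause X yields no likelihood gain, so greedy forward selection stops at Z → Y:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# True causal chain X -> Y -> Z, with the Y -> Z relation deterministic (Z = Y).
X = rng.binomial(1, 0.4, n)
Y = rng.binomial(1, np.where(X == 1, 0.8, 0.2))   # X -> Y is noisy
Z = Y.copy()                                      # Y -> Z is deterministic

def loglik(target, predictors):
    """Log-likelihood of `target` under a saturated conditional model:
    one maximum-likelihood Bernoulli rate per configuration of the predictors."""
    stacked = np.column_stack(predictors)
    ll = 0.0
    for config in {tuple(row) for row in stacked}:
        mask = np.all(stacked == config, axis=1)
        p = np.clip(target[mask].mean(), 1e-9, 1 - 1e-9)
        ll += np.sum(target[mask] * np.log(p) + (1 - target[mask]) * np.log(1 - p))
    return ll

# X alone is informative but imperfect; Z alone is already a perfect predictor,
# so adding X after Z gives no gain and the arc X -> Y is never added.
print("log L(Y | X)    =", round(loglik(Y, [X]), 2))
print("log L(Y | Z)    =", round(loglik(Y, [Z]), 2))
print("log L(Y | Z, X) =", round(loglik(Y, [Z, X]), 2))
```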
One solution to this problem is to use a directed acyclic graphical (DAG) model and compute the joint likelihood. This is the approach of Poon et al. [182]. We could still achieve the benefits of the Noisy Add model (linearity in the number of parameters and the conditional adaptation assumption) by using the procedure outlined in section 7.3. Namely, create a random ordering over all the traits, then run the PDN as described, but with the restriction that the only allowable predictors for a given target trait are those that precede the target in the random ordering. The result is a DAG, for which the joint likelihood can be easily computed. The resulting model can be improved substantially by applying the above procedure to a dozen or so random orderings, and then choosing the best model according to some criterion (e.g., a Bayesian criterion or cross validation) [104]. Because the resulting model is acyclic, we cannot learn the (incorrect) circular model Y ↔ Z, and thus are likely to learn the (correct) model X → Y → Z.

8.1.4 Binarization of data and results

On finite data sets, binarization of multistate discrete data is a simple means of increasing the power to detect associations. In our case, a model that can incorporate all 20 possible amino acid states would represent an enormous increase in the number of parameters (up to 380 for the null model, if all possible transitions were to be considered; many more to capture selection pressure). Nevertheless, binarization has its drawbacks. The most obvious drawback comes in interpreting the results. Typically, the biologist wants to know what specific mutations are likely to arise in response to a given selection pressure. In the binarization process, we create multiple traits for each codon (one for each observed amino acid at that codon). In the best case, it can be difficult to reconstruct the escape process from the set of traits at that position. In the worst case, noise in the data can lead to inconsistencies or an incomplete picture.

Furthermore, any evolutionary model that does not model the mutation of codons directly (i.e., one that either binarizes the data or uses some compression of the codon space, such as the 20 amino acids) is not a first-order continuous-time Markov process (CTMP), as our models assume. That is, the first-order CTMP assumption states that the rate of transition between two states is constant. Because of redundancy in the mapping from trinucleotides to amino acids, however, this assumption does not hold. For example, the rate of transition from Leucine (Leu) to Phenylalanine (Phe) depends on the specific trinucleotide that underlies the Leu. If the trinucleotide is CUU, then a single mutation is required to yield UUU, a Phe trinucleotide. If, however, the Leu trinucleotide is CUG, then two mutations are required to yield either UUU or UUC (the two possible Phe trinucleotides). Thus, the probability of transitioning is not constant, but rather depends on the underlying trinucleotide, which in turn is a function of the previous amino acid state that was visited. For example, if the trait had just transitioned from Phe to Leu, then it is more likely to transition back to Phe than if it had previously been a Valine, whose GUG trinucleotide is one mutation away from Leu's CUG.
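A small, self-contained check of the codon arithmetic behind this example (standard genetic code; the helper function is only for illustration):

```python
# Standard-genetic-code check of the Leu -> Phe example above.
LEU_CODONS = {"CUU", "CUC", "CUA", "CUG", "UUA", "UUG"}
PHE_CODONS = {"UUU", "UUC"}

def min_mutations(codon, targets):
    """Minimum number of single-nucleotide changes from `codon` to any codon in `targets`."""
    return min(sum(a != b for a, b in zip(codon, t)) for t in targets)

for leu_codon in ("CUU", "CUG"):
    assert leu_codon in LEU_CODONS
    print(leu_codon, "-> Phe requires", min_mutations(leu_codon, PHE_CODONS), "mutation(s)")
# CUU -> Phe requires 1 mutation, while CUG -> Phe requires 2, so the
# amino-acid-level transition rate depends on the hidden underlying codon.
```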
The violation of the Markov assumption becomes even more pronounced when the data are binarized, as the transition from not-Leu to Leu can require anywhere from one to three DNA mutations, and thus the probability of the transition is a function of how long the trait has not been a Leu (the longer it has not been a Leu, the more mutations a transition back to Leu is likely to require).

Nevertheless, we must point out that the empirical results we have presented strongly suggest that the PDN is robust to violations of the CTMP assumption. Furthermore, for many traits, unmodeled purifying selection will, in practice, constrain the range of possible amino acid states, effectively minimizing the likelihood that the rate of transition from not-Leu to Leu (for example) deviates from a constant. Indeed, for many of the codons we saw in the real data sets, only 2-4 amino acids were actually observed at that position.

Although it is not feasible to learn the structure of the PDN using multistate distributions (especially one that fully satisfies the CTMP assumption), one could use a hybrid approach. Specifically, one could first fit the structure of the PDN using binarized data. The resulting structure could then be fixed, and the binarized traits could be collapsed back into multistate traits. Finally, the parameters of the multistate model could be fit to the data, assuming the PDN structure from the binarized data. There are, of course, several open questions that have to be addressed, including how to deal with inconsistencies in the structure when binary traits are collapsed back into their original multistate traits, and the precise definition of selection pressure over multiple states (the simple definition of positive and negative selection would no longer apply). Nevertheless, the result would be the statistical confidence in the structure provided by binarizing the data, with an easier task of interpreting the parameters of the model and the specific adaptations that are selected for in specific circumstances.

In addition, this process, coupled with converting the PDN to a DAG as described in the previous section, would greatly facilitate the calculation and interpretation of joint inference. Such inference could, for example, lead to the inference of the most likely viral sequence to arise in response to the selection pressure of a particular HLA repertoire, a functionality that could prove extremely useful for vaccine design, especially if we incorporate a model of which epitopes are actually targeted in an individual (as governed by the HLA repertoire in addition to immunodominance and/or vaccination). As the model currently stands, it is not possible to determine the rate of targeting of a specific epitope. But as we shall see in the next section, such an adaptation of the model may be possible with the advent of next-generation sequencing data.

8.1.5 Extension to single genome sequencing data

The input data we used for this dissertation consisted of "bulk" sequences, which represent the consensus of the HIV population infecting an individual. The primary reason for this is that bulk sequences are cheap and easy to obtain. They are also intuitively pleasing, as they presumably represent an estimate of the "optimal" HIV sequence given the current immune system conditions, insofar as the most common virus is the most fit virus. The bulk sequence is, however, an imperfect reduction of the underlying data.
A more accurate sequencing technology is generically referred to as single genome sequencing (SGS), which involves the sequencing of individual virions. Using traditional sequencing methods, the isolation and sequencing of individual virions is quite tedious. Thus, these data sets are currently rare and tend to be much smaller than the data sets considered here. However, next-generation sequencing technology promises to provide a high-throughput means of sequencing thousands of individual virions in each patient.

Is the PDN applicable to such data sets? The answer depends on the specific question. For example, if codon coevolution is the sole interest of the study, then the PDN can be applied to the SGS sequences without modification. As we have discussed, however, the ideal scenario is to account for environmental variables in addition to covariation. Indeed, in the studies we have considered, the interaction of HIV with the host environment is of primary interest. In this case, the Noisy Add model as it currently exists is inadequate.

Consider the case where we want to compare a single HLA trait X to a single HIV trait Y using the escape leaf distribution. The parameterization of this model includes a single parameter s for selection pressure. In the notation of this dissertation,

s = Pr{Y_i = 0 | X_i = 1, H = 1};

that is, the probability that Y transitions to state 0 given that it started in state 1 and individual i has HLA X. If we consider the underlying biology, however, we will notice that s should really be decomposed further. The probability of escape is not simply a function of whether the individual has HLA X; it is also a function of whether the individual's immune system is actively targeting (using HLA X) the epitope containing Y. That is,

s = Pr{Y_i = 0 | L_i = 1, H = 1} · Pr{L_i = 1 | X_i = 1},    (8.1)

where L_i = 1 means individual i's immune system is actively targeting the epitope containing Y. The new (hidden) trait L is conditioned on X because the epitope cannot be targeted without the HLA, but Pr{L | X} is non-deterministic because immunodominance implies that individuals who have the capacity to target the epitope may not (chapter 3). (Also, note that L is independent of H, because H corresponds to the state of Y in the absence of selection pressure.)

Figure 8.1: Univariate model with linked predictors. This phylogeny represents single genome sequences taken from two individuals (three sequences for the bottom individual, two for the top individual).

When we have only one sequence per patient, the two components of Equation (8.1) are unidentifiable. With multiple sequences per patient, however, the two components are identifiable, and failure to decompose s may result in a misleading model. Indeed, continuing our univariate example, a better model is the one given in Figure 8.1. In this model, the (observed) predictor HLA X points to an (unobserved) linker variable L, which in turn points to the observed target variable Y. The selection pressure s is now decomposed into two parts, s = (s_1, s_2), such that

s_1 = Pr{L = 1 | X = 1}  and  s_2 = Pr{Y = 0 | L = 1, H = 1}.

Semantically, s_1 is the probability that the immune system targets the relevant epitope given that the patient has HLA X, and s_2 is the probability that escape occurs given that the epitope is targeted.
Note that s_2 can be defined for each of the leaf distributions as in chapter 4, but s_1 remains the same for all leaf distributions (i.e., Pr{L = 1 | X = 0} = 0 for all leaf distributions). More generally, this linked conditional adaptation model is suitable for any case in which the presence of selection pressure from a predictive source is stochastic and applies equally to groups of individuals (that is, if the selection pressure is present, or absent, for one individual in the group, then it will be present, or absent, for all individuals in the group).

Following this line of thinking, a linked version of both the univariate and Noisy Add models can be constructed (see Fig. 8.2 for a mixed linked/unlinked example of Noisy Add). The resulting model should be a more faithful representation of the underlying biological interaction between HLA alleles and SGS sequences. Furthermore, the parameters s_1 and s_2 are identifiable and may provide valuable insight into the underlying biological mechanism. For example, does an apparently moderate-to-low rate of escape s derive from a low rate s_1 of epitope targeting, or from a low rate s_2 of escape in actively targeted epitopes? Note that, in the context of Noisy Add, it is straightforward to include unlinked traits as well. In this case, other codons would be represented as unlinked traits. The interested reader is referred to Appendix A for further technical details of this model, though thorough experimentation and application are left for future work.
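As a toy illustration of why multiple sequences per patient make the two components of Equation (8.1) separately identifiable, the following simulation (an assumption-laden sketch, not the linked Noisy Add model itself; it ignores phylogeny and assumes no escape at all in untargeted patients) draws several single-genome sequences per patient, with targeting shared within a patient and escape drawn per sequence.

```python
import numpy as np

rng = np.random.default_rng(2)
s1, s2 = 0.6, 0.8              # illustrative targeting and escape-given-targeting rates
patients, seqs_per_patient = 500, 5

escapes = []
for _ in range(patients):
    targeted = rng.random() < s1                       # L: shared across a patient's virions
    # Y: escape observed independently in each single-genome sequence
    escapes.append(rng.binomial(1, s2 if targeted else 0.0, seqs_per_patient))
escapes = np.array(escapes)

# With one (bulk-like) sequence per patient, only the product s1*s2 is visible:
print("overall escape rate   ~ s1*s2 =", escapes[:, 0].mean())

# With several sequences per patient, patients separate into an untargeted group
# (no escape in any sequence) and a targeted group, so s1 and s2 disentangle:
targeted_like = escapes.any(axis=1)                    # crude proxy for L = 1
print("fraction 'targeted'   ~ s1    =", targeted_like.mean())
print("escape given targeted ~ s2    =", escapes[targeted_like].mean())
```

In the linked model these quantities would of course be estimated jointly with the phylogeny and the leaf distributions rather than by this kind of direct counting; the sketch only conveys the identifiability intuition.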