Phylogenetic dependency networks: Inferring patterns of adaptation in HIV
Phylogenetic dependency networks: Inferring patterns of adaptation in HIV

Jonathan M. Carlson

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

University of Washington
2009

Program Authorized to Offer Degree: Computer Science and Engineering

University of Washington Graduate School

This is to certify that I have examined this copy of a doctoral dissertation by Jonathan M. Carlson and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made.

Co-Chairs of the Supervisory Committee:
David Heckerman
Walter L. Ruzzo

Reading Committee:
David Heckerman
James Mullins
Walter L. Ruzzo

Date:

In presenting this dissertation in partial fulfillment of the requirements for the doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of this dissertation is allowable only for scholarly purposes, consistent with “fair use” as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to ProQuest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346, 1-800-521-0600, to whom the author has granted “the right to reproduce and sell (a) copies of the manuscript in microform and/or (b) printed copies of the manuscript made from microform.”

Signature
Date

University of Washington

Abstract

Phylogenetic dependency networks: Inferring patterns of adaptation in HIV

Jonathan M. Carlson

Co-Chairs of the Supervisory Committee:
Affiliate Professor David Heckerman, Medical Education and Biomedical Informatics, and Microbiology
Professor Walter L. Ruzzo, Computer Science and Engineering

Populations adapt to their environment through a process of natural selection.
By studying this process, one can gain insight into the specific functions of adaptive traits that provide an advantage in certain environments. HIV has proven remarkably adept at adaptation, so much so that the virus quickly adapts to each individual who is infected, effectively nullifying the immune response of most patients. By identifying the specific adaptations HIV employs against the immune system, it may be possible to identify vaccine targets that reduce HIV’s capacity to successfully adapt.

This dissertation introduces the Phylogenetic Dependency Network (PDN) for the identification of adaptive traits and the environments in which they arise. The PDN is a directed graphical model in which nodes represent measurable traits of the population and the environment and arcs represent probabilistic dependencies among traits. The probability component of the PDN consists of a model of adaptive evolution in which each population trait adapts to a set of predictors (the traits to which it is connected in the PDN). The structure of the PDN is identified through a model selection approach and can be interpreted as an estimate of which traits directly interact. We introduce a class of probabilistic adaptive evolution models called conditional adaptation models. These models assume that each trait has evolved independently of all other traits in the PDN until it reached the current environment, at which point the predictors act to influence adaptation of the trait.

One of the key benefits of this approach over traditional methods is the ability to simultaneously model multiple interactions. Existing approaches are typically constrained to consider the evolutionary interaction of two traits at a time. In complex environments in which each trait interacts with many other traits, this constrained view of adaptation blurs the distinction between traits that are truly interacting and traits that are only indirectly correlated.
By modeling these interactions using conditional adaptation models, we are able to accurately capture dense networks of interactions.

We apply our PDN approach to study adaptation of HIV to the human cellular immune response, identifying a large set of HIV adaptations that consistently arise in patients with similar immune genetics. These adaptations often take the form of multiple mutations spanning large regions of HIV proteins and indicate the presence of preferred patterns of adaptation. Although these adaptation networks are quite complex, the presence of these preferred adaptation patterns suggests weak points in viral adaptation that may be exploited by future vaccines.

TABLE OF CONTENTS

List of Figures
List of Tables
Chapter 1: Introduction
Chapter 2: Detecting Adaptation: Introduction and Review
  2.1 Selection and adaptation
  2.2 Phylogeny confounds the comparative method
  2.3 Related work on the comparative method
  2.4 Limitations of existing methods
Chapter 3: HIV Immune Escape: Introduction and Review
  3.1 The HLA-restricted CTL response is a major selective force driving HIV-1 evolution within an infected host
  3.2 Escape follows generally predictable patterns in response to specific immune pressures
  3.3 Immune selection pressures drive HIV evolution at the population level: but to what extent?
  3.4 Assessing the extent of HLA-driven HIV-1 evolution at the population level: challenges and controversies
  3.5 HLA-associated immune pressures influence population HIV diversity at up to 40% of positions in some proteins
  3.6 Clinical consequences of immune-mediated evolution
  3.7 Strategies to cope with viral diversity in HIV-1 vaccine design
  3.8 Remaining challenges
Chapter 4: Phylogenetic Dependency Networks
  4.1 Phylogenetically corrected distributions for one predictor trait
  4.2 Phylogenetically corrected distributions for more than one predictor trait
  4.3 q-values
  4.4 Model Details
Chapter 5: Evaluation and Application of the Univariate Model
  5.1 Technical details
  5.2 Experiments with synthetic data
  5.3 Application 1: Effect of immune pressure on HIV evolution
  5.4 Application 2: Pairwise correlations between amino acids in HIV
  5.5 Application 3: Genomic search for genotype-phenotype associations in Arabidopsis thaliana
  5.6 Studies using the univariate conditional evolution model
  5.7 Limitations of univariate conditional evolution model
  5.8 Conclusions
Chapter 6: Evaluation of Multivariate Models
  6.1 Technical Details
  6.2 Model validation on synthetic data
Chapter 7: Using PDNs to Infer Patterns of Immune Escape and Covariation in HIV
  7.1 Technical details
  7.2 Phylogenetic dependency network for Gag p17 and p24
  7.3 Discussion
Chapter 8: Summary
  8.1 Limitations and future directions
Glossary
Bibliography
Appendix A: Next Generation Sequencing: Extending the model to single genome sequences
  A.1 Likelihood calculation
  A.2 Expectation maximization
Appendix B: On computing FDR for Fisher’s exact test
  B.1 Examples of FET for sequence data
  B.2 Background
  B.3 Computing pFDR for Fisher’s exact test
  B.4 Numerical results
  B.5 Creating synthetic data sets
  B.6 Proofs and Remarks
  B.7 Discussion

LIST OF FIGURES

2.1 Phylogeny confounds the comparative method
4.1 Phylogenetic dependency network
4.2 The univariate model
4.3 The multivariate model
4.4 Decision Tree leaf distribution
4.5 Noisy Add leaf distribution
5.1 PR and calibration curves on synthetic data
5.2 PR and calibration curves over different trees
5.3 PR curves for the full real HLA-amino-acid data
5.4 Correlated amino-acid pairs in HIV-1 p6
5.5 GWAS for Arabidopsis bacterial response
6.1 p-value calibration
6.2 Noisy Add represents real data better than Decision Tree
6.3 Performance on HOMER data
6.4 Tree built from the combined HOMER and Durban cohorts
6.5 Performance on synthetic mixed clade data
6.6 Power to detect associations
7.1 Gag PDN for combined HOMER and Contract cohorts
7.2 Number of optimal epitopes found vs. q-value rank
8.1 Univariate model with linked predictors
8.2 Noisy Add model with linked and unlinked predictors
B.1 P-value histograms
B.2 Pooled vs. marginal p-values
B.3 The advantage of using π₀(α) computed using the filtering technique over not filtering. Because filtering only affects π̂₀, these gains result in proportionally reduced (yet conservative) pFDR estimates.
B.4 Estimated pFDR vs. true false discovery proportion
B.5 Power gains for proposed pFDR method

LIST OF TABLES

5.1 Predicted HLA-amino acid associations in Gag
7.1 HLA-codon associations in which consensus is the predicted resistant form
B.1 2 × 2 contingency table on binary variables X and Y
B.2 Outcomes when testing m hypotheses
B.3 Comparing π̂₀ estimations for synthetic data sets derived from the Epitope data with different true π₀. Storey’s method was evaluated at λ = 0.5.
B.4 Comparing π̂₀ estimations for the real data sets

ACKNOWLEDGMENTS

I would like first to thank my supervisor, David Heckerman, who has been an invaluable mentor, manager, colleague and friend, as well as Larry Ruzzo, who has been incredibly supportive of me throughout my rather non-traditional graduate school experience. I would also like to thank Jim Mullins, for his collaboration, sharing his data and post docs, and serving on my reading committee, and Elizabeth Thompson, for serving as my GSR.

This work has been the result of numerous collaborations, for which I am extremely grateful. In particular, this dissertation would be entirely different without the collaboration of Zabrina Brumme, Chanson Brumme and Richard Harrigan, who introduced me to (and continue my education on) HIV, have provided invaluable data and ideas, and have critically read and contributed to just about all the work I have done related to HIV. Special acknowledgment also goes to Philip Goulder and Philippa Matthews, who provided invaluable data and ideas for the results in chapter 7 and first suggested we try the Decision Tree model.

I have also had incredible support at MSR. In particular, Carl Kadie has been an incredible coding resource.
Much of the implementation is due directly to him, and most everything else reflects his influence. In addition, the work in Appendix B was done in close collaboration with Guy Shani, who developed the efficient algorithms and performed the experiments. And of course Jennifer Listgarten, who generally makes MSR a more interesting and enjoyable place to work.

Thanks to my parents for their unwavering support; to Richard Peterson, who introduced me to biology, to Tom Cormen, who introduced me to computer science, and to Bob Gross, who brought it together for me in computational biology and gave me incredible research opportunities as an undergraduate; to Arijit Chakravarty, for mentoring me and shaping my approach to research; to Scott Saponas, Jon Froehlich, Seth Bridges and others, who made grad school fun and helped prove that CS grad students can hold their own in IM sports; and to Kate, for absolutely everything.

This work was supported in part by funding from a Microsoft Research Graduate Fellowship.

DEDICATION

To my amazing wife Kate, who is everything to me.

Chapter 1
INTRODUCTION

Since its identification as the pathogenic cause of Acquired Immunodeficiency Syndrome (AIDS) in the early 1980s, Human Immunodeficiency Virus Type 1 (HIV-1) has emerged as a major global pandemic with an estimated 33 million infected individuals worldwide at the end of 2007 [221]. Specifically targeting the CD4+ subset of T-lymphocytes (the so-called “helper T cells”), HIV-1 causes a progressive deterioration of immune function, leaving the infected individual susceptible to a range of opportunistic infections that eventually lead to AIDS and death. Although improvements in antiretroviral therapy have dramatically reduced HIV-related morbidity and mortality among those with access to treatment [172], the search for an effective HIV-1 vaccine continues.
One of the enduring challenges facing HIV vaccine design is the remarkable rate of viral mutation and adaptation that allows the virus to evade the adaptive immune response of the host. As the immune system learns to target the virus, novel viral mutations that allow the virus to escape the attack provide an advantage over virus particles (virions) lacking the mutations. These escape mutations thus come to dominate the viral population in the host via selection, and the immune system is left learning to target what must appear to be a novel pathogen, causing the process to repeat. The result has been devastating to HIV vaccine design. The goal of a vaccine is to train the host immune system to recognize the virus before exposure, but when the virus is constantly changing, it is extremely difficult to predict what the attacking virus will look like, and thus how to train the immune system. Even when monkeys are inoculated and subsequently challenged with the same virus (so-called homologous challenge), protection is often incomplete, in part because any virions that survive the initial attack quickly adapt.

So then, is the search for a vaccine futile? Although the power of adaptation may seem insurmountable, it should not be surprising that there is a large body of evidence that the space of viable mutations is constrained. Indeed, through all the myriad mutations the result must still be an infectious virion. It would thus seem that identification and characterization of these constraints is both possible and necessary for the advancement of the field.

That, in a nutshell, is the purpose of this dissertation: to develop, test, and apply a statistical model of evolution that can robustly identify patterns and constraints in viral evolution. The result is promising.
Although the patterns are complex and will require significant follow-up to tease apart, they are also dense, suggesting a promising consistency that may provide insight into weak points in viral adaptation and suggest new targets for vaccine design.

It should be evident that adaptation is not a process unique to HIV, and thus the models explored here may find many uses outside the realm of HIV. Nevertheless, the rapid rate of HIV adaptation provides a unique opportunity to capture adaptation in near real time and to analyze distinct populations that are isolated from each other (by virtue of infecting different hosts), yet have had a chance to adapt to their environment. Thus, the broad approach we take is to analyze a large cohort of individuals who have been infected for some time (to allow the adaptations to arise) and have not been exposed to antiretroviral therapy (which introduces a tremendous source of selection pressure that may obliterate the signal induced by the immune system). We do so by employing statistical models of evolution that attempt to identify common patterns of selection and adaptation.

The statistical model we propose is the phylogenetic dependency network (PDN), so called because it is an adaptation of the dependency network [94] to an evolutionary context. In brief, the PDN is a graphical model that relies heavily on local probability distributions that are conditioned on an underlying phylogeny. The graphical structure, as well as the parameters of the probability distributions, are learned from the data, with the resulting structure indicating statistical dependencies among variables. Although these dependencies cannot be interpreted as causal, they provide a useful means of understanding potential interactions among the variables and suggest simple experiments that can confirm specific biological interactions. We describe the model in detail in chapter 4.
There are several possible probability distributions that fit naturally within the PDN framework. Although we describe the details of the models in chapter 4, it is useful to explore how the distributions work in practice. Chapter 5 is devoted to the simpler model, which we call the univariate conditional adaptation model, as it describes the adaptation of one trait in response to a single source of selection pressure. We use synthetic data sets and real-world analyses to explore the properties of the model and compare it to previous approaches. Interestingly, this simple model has proven quite useful in practice. We review some recent studies that have utilized the univariate model in section 5.6.

The more complete model we propose is the multivariate conditional adaptation model, of which there are two variations. These models incorporate multiple sources of selection pressure and are, in principle, better able to describe dense interaction networks, in which each trait is influenced by myriad sources of selection pressure. These models are explored in detail in chapter 6.

In chapter 7, we apply the PDN to the study of HIV adaptation. In this work, we combine the two largest antiretroviral-naïve cohorts, comprising a diverse set of HIV sequences and host genetics, into a single analysis. Together with the PDN, this combined data set gives us unprecedented statistical power and precision to explore the patterns of HIV adaptation in response to the immune system. To the biologist, this chapter may be read as the main result of the dissertation.

For the interested reader, the appendices provide technical discussions of tangential issues that arose as part of the dissertation research. Appendix A provides a preliminary look at the technical details around extending the models discussed in this dissertation to the case in which individual viral sequences are sampled from each individual.
In Appendix B, we consider the statistical characteristics of the false discovery rate for the simplest probability distribution we encounter: the joint distribution of two independent, identically distributed binary variables evaluated using Fisher’s exact test (FET) [66].

Chapter 2
DETECTING ADAPTATION: INTRODUCTION AND REVIEW

2.1 Selection and adaptation

Let us begin with a brief, informal introduction to natural selection and adaptation, as the concepts and terminology will be useful in later discussions. The process of natural selection and adaptation can be summarized as follows. In any population of individuals, there is natural variation among those individuals arising from genetic mutations, which arise randomly and are then passed on to offspring. Most mutations are deleterious, resulting in individuals who are less fit than the rest of the population, meaning that, on average, they tend to produce fewer offspring. Because individuals who have these deleterious mutations produce fewer offspring, over time these mutations will become rare or may even be eliminated from the population. This process is referred to as negative or purifying selection. In contrast, some mutations actually increase the ability of the individual to procreate. Over time, individuals harboring these mutations come to dominate the population and the mutation may even reach fixation, meaning that essentially all surviving individuals carry the mutation. This process is referred to as positive selection, and the mutation (or set of mutations) that resulted in increased fitness is called an adaptation. Finally, many mutations make no discernible difference to the individual. If we were to follow the frequency of such mutations over a long period of time, we would see the frequency follow a random walk, a pattern referred to as neutral evolution.
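
As a rough illustration of the three regimes just described, the sketch below simulates the frequency of a single mutation over successive generations. It assumes a simple Wright-Fisher-style population of fixed size, which is a standard textbook simplification and not a model taken from this dissertation; the function names, the selection coefficient `s` (carriers have relative fitness 1 + s), and all parameter values are hypothetical choices made only for the example.

```python
import random


def expected_frequency_after_selection(p, s):
    """Frequency of the mutation after selection, before random sampling.

    p: current frequency of the mutation, between 0 and 1.
    s: selection coefficient (assumed parameterization, s > -1): carriers
       have relative fitness 1 + s and non-carriers have fitness 1.
    """
    return p * (1.0 + s) / (1.0 + p * s)


def simulate_frequency(p0, s, population_size, generations, rng):
    """Return the mutation frequency in every generation (hypothetical helper)."""
    trajectory = [p0]
    p = p0
    for _ in range(generations):
        p_selected = expected_frequency_after_selection(p, s)
        # Each offspring independently inherits the mutation with
        # probability p_selected; this sampling step is the random drift.
        carriers = sum(
            1 for _ in range(population_size) if rng.random() < p_selected
        )
        p = carriers / population_size
        trajectory.append(p)
    return trajectory


if __name__ == "__main__":
    rng = random.Random(0)
    for label, s in [("negative", -0.1), ("neutral", 0.0), ("positive", 0.1)]:
        traj = simulate_frequency(
            0.2, s, population_size=500, generations=100, rng=rng
        )
        print(f"{label:>8} selection (s={s:+.1f}): final frequency {traj[-1]:.2f}")
```

With s = 0 the frequency changes only through random sampling, which corresponds to the random walk of neutral evolution; negative and positive values of s bias that walk downward and upward, respectively. Individual runs vary with the random seed.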
The process of selection and adaptation is thus an interactive process between the population and the environment. Certain characteristics (traits) of the environment favor some mutations over others. Such traits serve as a source of selection pressure, and the resulting adaptations arise in response to that selection pressure. Thus, when the environment changes, the selection pressures may change, favoring different adaptations.

The study of adaptation becomes particularly interesting when comparable populations are compared in different environments. In this scenario, a mutation that is neutral in environment A may be beneficial in environment B. Thus, over time, this mutation will be far more prevalent in environment B than in environment A. Furthermore, the implication is that there are specific characteristics of environment B that interact (directly or indirectly) with the mutation. In the case of HIV, these interactions prove vital for the design of an effective vaccine (see chapter 3).

To ground our example, suppose we are interested in trait Y, which can take on values in {0, 1}. In addition, suppose there exists an environmental trait X ∈ {0, 1}, which exerts selection pressure on Y, such that when X = 1, there is a selective advantage for Y = 1 over Y = 0. Then we have

Pr{Y = 1 | X = 1} > Pr{Y = 0 | X = 1}.

If we sample enough individuals from the two different environments, we will be able to see this correlation in the form of statistical dependence.

In this vein, the comparative method [92] seeks to identify correlated traits, with the assumption that correlation implies the interaction of selection pressure and adaptation, which may imply a specific function for the traits involved. In essence, one simply samples a large number of traits from individuals, as well as a number of environmental traits, then tests for correlations among all pairs of traits.
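
The pairwise testing step can be sketched as follows for a single pair of binary traits. The sketch tallies the observations into a 2 × 2 table and computes a two-sided Fisher’s exact test p-value directly from the hypergeometric distribution. The choice of test, the helper names, and the sample counts are assumptions made for illustration, and the test treats every individual as an independent observation.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table

               Y=1  Y=0
        X=1     a    b
        X=0     c    d
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def table_probability(k):
        # Hypergeometric probability of k in the top-left cell when the
        # row and column totals are held fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    observed = table_probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables that are no more likely than the
    # observed one (small tolerance guards against floating-point ties).
    return sum(
        p
        for p in (table_probability(k) for k in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


def count_table(samples):
    """Tally (x, y) pairs of binary traits into the cells a, b, c, d."""
    a = sum(1 for x, y in samples if x == 1 and y == 1)
    b = sum(1 for x, y in samples if x == 1 and y == 0)
    c = sum(1 for x, y in samples if x == 0 and y == 1)
    d = sum(1 for x, y in samples if x == 0 and y == 0)
    return a, b, c, d


if __name__ == "__main__":
    # Made-up observations: (environmental trait X, population trait Y).
    samples = [(1, 1)] * 12 + [(1, 0)] * 3 + [(0, 1)] * 4 + [(0, 0)] * 11
    a, b, c, d = count_table(samples)
    print(f"Pr(Y=1 | X=1) is about {a / (a + b):.2f}")
    print(f"Pr(Y=1 | X=0) is about {c / (c + d):.2f}")
    print(f"two-sided p-value: {fisher_exact_two_sided(a, b, c, d):.4f}")
```

A small p-value only flags the pair as a candidate interaction; it says nothing about how the correlation arose.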
Interactions that achieve some significance threshold are then considered candidates for experimental follow-up to determine the underlying process of selection and adaptation that leads to the apparent correlation. The challenge with the comparative method, as elegantly described by Felsenstein [61], lies in the confounding effect of the evolutionary history of the traits, which tends to make traits look more correlated than they really are.

2.2 Phylogeny confounds the comparative method

The relevant question, then, is how we should test for significance. Suppose we are considering the two binary traits X and Y. To determine whether these two variables are associated, we could count the number of individuals with and without each trait and apply a simple statistical test for independence, such as Fisher’s exact test. This procedure, however, ignores the phylogenetic structure among the sequences [61]. Suppose these sequences have the phylogeny shown in Figure 2.1a. In essence, there are two clusters of individuals, where individuals within a cluster are similar to each other but quite different from those in the other cluster. Now suppose we observe that traits X and Y are present in the two individuals on the top and absent in the two individuals on the bottom, as shown in Figure 2.1a. The observations of these traits are well explained by the phylogeny alone and should not be treated as independent observations. Consequently, the application of Fisher’s exact test or some other test that ignores the phylogenetic structure would overcount these observations when determining the correlation of X and Y. Such overcounting will lead to an overestimation of the statistical significance, leading to a surprising number of false positives. In contrast, suppose we make the observations shown in Figure 2.1b.
Here, observing the presence and absence of the trait Y in the same branch of the phylogeny is quite surprising, until the observations of X are taken into account. In this case, the application of a simple test would undercount the observations when determining the correlation of X and Y, leading to an underestimation of statistical significance and potentially increasing the number of false negatives.

Simple statistical methods such as Fisher’s exact test assume the data to be infinitely exchangeable or independent and identically distributed (IID). Although sequence data and other biological data are IID a priori, they are not IID once we learn their hierarchical structure. Furthermore, as we have just seen, this structure can

[Figure 2.1: Phylogeny confounds the comparative method.]