Phylogenetic dependency networks: Inferring patterns of adaptation in hiv
Download 4.8 Kb. Pdf ko'rish
|
actions of the Royal Society of London B Biological Sciences 255(1342):37–45.
10 , 11 , 13 , 16 , 89 176 [172] Palella FJ, Delaney KM, Moorman AC, Loveless MO, Fuhrer J, Satten GA, Aschman DJ, Holmberg SD and Investigators THOS (1998). Declining mor- bidity and mortality among patients with advanced human immunodeficiency virus infection. New England Journal of Medicine 338(13):853–860. 1 [173] Pazos F, Helmer-Citterich M, Ausiello G and Valencia A (1997). Correlated mutations contain information about protein-protein interaction. Journal of Molecular Biology 271(4):511–523. 10 [174] Peters HO, Mendoza MG, Capina RE, Luo M, Mao X, Gubbins M, Nagelkerke NJD, MacArthur I, Sheardown BB, Kimani J, Wachihi C, S. T and Plummer FA (2008). An integrative bioinformatic approach for studying escape mutations in human immunodeficiency virus type 1 Gag in the pumwani sex worker cohort. Journal of Virology 82(4):1980–1992. 12 , 114 , 116 [175] Peyerl FW, Barouch DH, Yeh WW, Bazick HS, Kunstman J, Kunstman KJ, Wolinsky SM and Letvin NL (2003). Simian-human immunodeficiency virus escape from cytotoxic T-lymphocyte recognition at a structurally constrained epitope. Journal of Virology 77(23):12572. 104 [176] Peyerl FW, Bazick HS, Newberg MH, Barouch DH, Sodroski J and Letvin NL (2004). Fitness costs limit viral escape from cytotoxic T lymphocytes at a structurally constrained epitope. Journal of Virology 78(24):13901. 104 [177] Pollock DD and Taylor WR (1997). Effectiveness of correlation analysis in iden- tifying protein residues undergoing correlated evolution. Protein Engineering 10(6):647–657. 10 [178] Pollock DD, Taylor WR and Goldman N (1999). Coevolving protein residues: maximum likelihood identification and relationship to structure. Journal of Molecular Biology 287(1):187–198. 11 , 13 , 16 , 17 , 51 , 52 , 65 , 89 177 [179] Poon A and Chao L (2005). The rate of compensatory mutation in the DNA bacteriophage φX174. Genetics 170(3):989–999. 106 , 108 , 114 [180] Poon A, Davis BH and Chao L (2005). The coupon collector and the suppressor mutation: estimating the number of compensatory mutations by maximum likelihood. Genetics 170(3):1323–1332. [181] Poon AFY, Lewis FI, Pond SLK and Frost SDW (2007). Evolutionary in- teractions between N-linked glycosylation sites in the HIV-1 envelope. PLoS Computational Biology 3(1):e11. 11 , 13 , 15 [182] Poon AFY, Lewis FI, Pond SLK and Frost SDW (2007). An evolutionary- network model reveals stratified interactions in the V3 loop of the HIV-1 Enve- lope. PLoS Computational Biology 3(11):e231. 11 , 15 , 16 , 89 , 117 , 129 [183] Pounds S and Cheng C (2006). Robust estimation of the false discovery rate. Bioinformatics 22(16):1979. 204 , 209 , 221 , 227 [184] Press W, Teukolsky S, Vetterling W and Flannery B (1992). Numerical Recipes in C. Cambridge University Press, New York, NY. 52 , 53 [185] Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA and Reich D (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8):904–909. 13 [186] Pritchard JK, Stephens M, Rosenberg NA and Donnelly P (2000). Association mapping in structured populations. The American Journal of Human Genetics 67(1):170–181. 13 [187] Pritchard L, Bladon P, M O Mitchell J and J Dufton M (2001). Evaluation of a novel method for the identification of coevolving protein residues. Protein Engineering 14(8):549–555. 89 178 [188] Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ and Sham PC (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 81(3):559–575. [189] Reiner AP, Ziv E, Lind DL, Nievergelt CM, Schork NJ, Cummings SR, Phong A, Burchard EG, Harris TB, Psaty BM and Kwok PY (2005). Population structure, admixture, and aging-related phenotypes in African American adults: the Cardiovascular Health Study. The American Journal of Human Genetics 76(3):463–477. [190] Ridley M (1983). The Explanation of Organic Diversity: The Comparative Method and Adaptations for Mating. Oxford University Press, Oxford. 10 , 13 , 15 , 16 , 30 , 89 [191] Risch NJ (2000). Searching for genetic determinants in the new millennium. Nature 405(6788):847–856. [192] Rodrigue N, Lartillot N, Bryant D and Philippe H (2005). Site interdepen- dence attributed to tertiary structure in amino acid sequence evolution. Gene 347(2):207–217. [193] Rodriguez F, Harkins S, Slifka MK and Whitton JL (2002). Immunodominance in virus-induced CD8 + T-cell responses is dramatically modified by DNA immu- nization and is regulated by gamma interferon. Journal of Virology 76(9):4251. [194] Rolland M, Nickle DC and Mullins JI (2007). HIV-1 group M conserved elements vaccine. PLoS Pathogens 3(11):e157. 73 [195] Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA and Feldman MW (2002). Genetic structure of human populations. Science 298(5602):2381–2385. 13 179 [196] Rousseau CM, Daniels MG, Carlson JM, Kadie C, Crawford H, Prendergast A, Matthews P, Payne R, Rolland M, Raugi DN, Maust BS, Learn GH, Nickle DC, Coovadia H, Ndung’u T, Frahm N, Brander C, Walker BD, Goulder PJR, Bhattacharya T, Heckerman DE, Korber BT and Mullins JI (2008). HLA class I-driven evolution of human immunodeficiency virus type 1 subtype C proteome: immune escape and viral load. Journal of Virology 82(13):6434–6446. 26 , 27 , 72 , 75 , 77 , 84 , 86 , 87 , 90 , 95 , 98 , 99 , 110 , 113 , 114 , 115 , 116 , 198 [197] Sanchez-Merino V, Farrow M, Brewster F, Somasundaran M and Luzuriaga K (2008). Identification and characterization of HIV-1 CD8 + T cell escape variants with impaired fitness. The Journal of Infectious Diseases 197(2):300–8. [198] Satten GA, Flanders WD and Yang Q (2001). Accounting for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. The American Journal of Human Genetics 68(2):466– 477. 13 [199] Schmitz JE, Kuroda MJ, Santra S, Sasseville VG, Simon MA, Lifton MA, Racz P, Tenner-Racz K, Dalesandro M, Scallon BJ, Ghrayeb J, Forman MA, Mon- tefiori DC, Rieber EP, Letvin NL and Reimann KA (1999). Control of viremia in simian immunodeficiency virus infection by CD8 + lymphocytes. Science 283(5403):857–860. 19 [200] Schneidewind A, Brockman MA, Sidney J, Wang YE, Chen H, Suscovich TJ, Li B, Adam RI, Allgaier RL, Mothe BR, Kuntzen T, Oniangue-Ndza C, Trocha A, Yu XG, Brander C, Sette A, Walker BD and Allen TM (2008). Structural and functional constraints limit options for cytotoxic T-lymphocyte escape in the immunodominant HLA-B27-restricted epitope in human immunodeficiency virus type 1 capsid. Journal of Virology 82(11):5594–5605. 104 , 105 , 110 , 114 [201] Schneidewind A, Brockman MA, Yang R, Adam RI, Li B, Le Gall S, Rinaldo 180 CR, Craggs SL, Allgaier RL, Power KA, Kuntzen T, Tung CS, LaBute MX, Mueller SM, Harrer T, McMichael AJ, Goulder PJR, Aiken C, Brander C, Kelle- her AD and Allen TM (2007). Escape from the dominant HLA-B27-restricted cytotoxic T-lymphocyte response in Gag is associated with a dramatic reduc- tion in human immunodeficiency virus type 1 replication. Journal of Virology 81(22):12382–12393. 20 , 21 , 29 , 104 , 105 , 110 , 113 , 114 [202] Schneidewind A, Brumme ZL, Brumme CJ, Power KA, Reyor LL, O’Sullivan K, Gladden A, Hempel U, Kuntzen T, Wang YE, Oniangue-Ndza C, Jessen H, Markowitz M, Rosenberg ES, Sekaly RP, Kelleher AD, Walker BD and Allen TM (2008). Transmission and long-term stability of compensated CD8 escape mutations. Journal of Virology in press. doi:10.1128/JVI.01108-08. [203] Self SG and Liang KY (1987). Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association 82(398):605–610. [204] Setakis E, Stirnadel H and Balding DJ (2006). Logistic regression protects against population structure in genetic association studies. Genome Research 16(2):290–296. 13 [205] Shankarappa R, Margolick JB, Gange SJ, Rodrigo AG, Upchurch D, Farzadegan H, Gupta P, Rinaldo CR, Learn GH, He X, Huang XL and Mullins JI (1999). Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. Journal of Virology 73(12):10489– 10502. 18 [206] Sinha H, Nicholson BP, Steinmetz LM and McCusker JH (2006). Complex genetic interactions in a quantitative trait locus. PLoS Genetics 2(2):e13. 181 [207] Stewart JJ, Watts P and Litwin S (2001). An algorithm for mapping positively selected members of quasispecies-type viruses. BMC Bioinformatics 2(1):1. 12 [208] Storey JD (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society - Series B: Statistical Methodology 64(3):479–498. 196 , 202 , 203 , 206 , 208 , 216 , 221 , 226 [209] Storey JD (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics 31(6):2013–2035. 196 , 205 , 208 , 209 , 221 , 226 [210] Storey JD, Taylor JE and Siegmund D (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society - Series B: Statistical Methodology 66(1):187–205. 203 [211] Storey JD and Tibshirani R (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America 100(16):9440–9445. 42 , 196 , 203 , 209 [212] Streeck H, Lichterfeld M, Alter G, Meier A, Teigen N, Yassine-Diab B, Sidhu HK, Little S, Kelleher A, Routy JP, Rosenberg ES, Sekaly RP, Walker BD and Altfeld M (2007). Recognition of a defined region within p24 Gag by CD8+ T cells during primary human immunodeficiency virus type 1 infection in individuals expressing protective HLA class I alleles. Journal of Virology 81(14):7725–7731. [213] Suel GM, Lockless SW, Wall MA and Ranganathan R (2003). Evolutionarily conserved networks of residues mediate allosteric communication in proteins. Nature Structural and Molecular Biology 10(1):59–69. 11 182 [214] Tang C, Ndassa Y and Summers MF (2002). Structure of the N-terminal 283- residue fragment of the immature HIV-1 Gag polyprotein. Nature Structural and Molecular Biology 9(7):537–543. 105 , 106 , 107 [215] Taylor WR and Hatrick K (1994). Compensating changes in protein multiple sequence alignments. Protein Engineering 7(3):341–348. 10 [216] Thomas DC, Haile RW and Duggan D (2005). Recent developments in genomewide association scans: a workshop summary and review. The American Journal of Human Genetics 77(3):337–345. [217] Thornsberry JM, Goodman MM, Doebley J, Kresovich S, Nielsen D and Buckler ES (2001). Dwarf8 polymorphisms associate with variation in flowering time. Nature Genetics 28(3):286–289. 13 [218] Tillier ERM and Lui TWH (2003). Using multiple interdependency to separate functional from phylogenetic correlations in protein alignments. Bioinformatics 19(6):750–755. [219] Turnbull EL, Lopes AR, Jones NA, Cornforth D, Newton P, Aldam D, Pellegrino P, Turner J, Williams I, Wilson CM, Goepfert PA, Maini MK and Borrow P (2006). HIV-1 epitope-specific CD8+ T cell responses strongly associated with delayed disease progression cross-recognize epitope variants efficiently. The Journal of Immunology 176(10):6130–6146. 27 [220] Ueno T, Motozono C, Dohki S, Mwimanzi P, Rauch S, Fackler OT, Oka S and Takiguchi M (2008). CTL-mediated selective pressure influences dynamic evolution and pathogenic functions of HIV-1 Nef. The Journal of Immunology 180(2):1107. [221] UNAIDS/WHO (2007). AIDS epidemic update. http://www.unaids.org . 1 183 [222] Vogel TU, Horton H, Fuller DH, Carter DK, Vielhuber K, O’Connor DH, Ship- ley T, Fuller J, Sutter G, Erfle V, Wilson N, Picker LJ and Watkins DI (2002). Differences between T cell epitopes recognized after immunization and after infection. The Journal of Immunology 169(8):4511–4521. [223] Voight BF and Pritchard JK (2005). Confounding from cryptic relatedness in case-control association studies. PLoS Genetics 1(3):e32. 13 [224] Wang Y and DeLisi C (2006). Inferring protein-protein interactions in viral proteins by co-evolution of conserved side chains. Genome 17(1):23–35. 116 [225] Wang YE, Li B, Carlson JM, Streeck H, Gladden AD, Goodman R, Schnei- dewind A, Power KA, Toth I, Frahm N, Alter G, Brander C, Carrington M, Walker BD, Altfeld M, Heckerman D and Allen TM (2009). Protective HLA class I alleles that restrict acute-phase CD8 + T-cell responses are associated with viral escape mutations located in highly conserved regions of human im- munodeficiency virus type 1. Journal of Virology 83(4):1845–1855. 73 [226] Wei X, Decker JM, Wang S, Hui H, Kappes JC, Wu X, Salazar-Gonzalez JF, Salazar MG, Kilby JM, Saag MS, Komarova NL, Nowak MA, Hahn BH, Kwong PD and Shaw GM (2003). Antibody neutralization and escape by HIV-1. Nature 422(6929):307–312. 18 [227] Williams SG and Lovell SC (2009). The effect of sequence evolution on protein structural divergence. Molecular Biology and Evolution in press. doi:10.1093/molbev/msp020. [228] Wollenberg KR and Atchley WR (2000). Separation of phylogenetic and func- tional associations in biological sequences by using the parametric bootstrap. Proceedings of the National Academy of Sciences of the United States of America 97(7):3288–3291. 10 , 11 184 [229] Worobey M, Gemmel M, Teuwen DE, Haselkorn T, Kunstman K, Bunce M, Muyembe JJ, Kabongo JMM, Kalengayi RM, Van Marck E, Gilbert MTP and Wolinsky SM (2008). Direct evidence of extensive diversity of HIV-1 in Kinshasa by 1960. Nature 455(7213):661–664. 18 , 148 [230] Xie Y, Pan W and Khodursky AB (2005). A note on using permutation-based false discovery rate estimates to compare different analysis methods for microar- ray data. Bioinformatics 21(23):4280–4288. 196 , 211 , 212 [231] Yang Z, Nielsen R, Goldman N and Pedersen AMK (2000). Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155(1):431–449. 12 [232] Yanofsky C, Horn V and Thorpe D (1964). Protein structure relationships revealed by mutational analysis. Science 146(3651):1593–1594. 114 [233] Yeang CH and Haussler D (2007). Detecting coevolution in and among protein domains. PLoS Computational Biology 3(11):e211. 11 [234] Yeh WW, Cale EM, Jaru-Ampornpan P, Lord CI, Peyerl FW and Letvin NL (2006). Compensatory substitutions restore normal core assembly in simian im- munodeficiency virus isolates with Gag epitope cytotoxic T-lymphocyte escape mutations. Journal of Virology 80(16):8168–8177. 104 [235] Yewdell J (2005). The seven dirty little secrets of major histocompatibility complex class I antigen processing. Immunological Reviews 207(1):8–18. [236] Yewdell JW (2006). Confronting complexity: real-world immunodominance in antiviral CD8 + T cell responses. Immunity 25(4):533–543. 26 [237] Yewdell JW and Bennink JR (1999). Immunodominance in major histocom- patibility complex class I-restricted T lymphocyte responses. Annual Review of Immunology 17(1):51–88. 26 , 108 185 [238] Yewdell JW and Haeryfar SM (2005). Understanding presentation of viral anti- gens to CD8 + T cells in vivo: the key to rational vaccine design. Annual Review of Immunology 23(1):651–682. [239] Yokomaku Y, Miura H, Tomiyama H, Kawana-Tachikawa A, Takiguchi M, Ko- jima A, Nagai Y, Iwamoto A, Matsuda Z and Ariyoshi K (2004). Impaired processing and presentation of cytotoxic-T-lymphocyte (CTL) epitopes are ma- jor escape mechanisms from CTL immune pressure in human immunodeficiency virus type 1 infection. Journal of Virology 78(3):1324. 20 [240] Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB, Kresovich S and Buckler ES (2006). A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics 38(2):203–208. 13 [241] Yu XG, Lichterfeld M, Chetty S, Williams KL, Mui SK, Miura T, Frahm N, Feeney ME, Tang Y, Pereyra F, LaBute MX, Pfafferott K, Leslie A, Crawford H, Allgaier R, Hildebrand W, Kaslow R, Brander C, Allen TM, Rosenberg ES, Kiepiela P, Vajpayee M, Goepfert PA, Altfeld M, Goulder PJR and Walker BD (2007). Mutually Exclusive T-Cell Receptor Induction and Differential Susceptibility to Human Immunodeficiency Virus Type 1 Mutational Escape Associated with a Two-Amino-Acid Difference between HLA Class I Subtypes. Journal of Virology 81(4):1619–1631. 14 , 110 [242] Zhang J and Rosenberg HF (2002). Complementary advantageous substitutions in the evolution of an antiviral RNase of higher primates. Proceedings of the National Academy of Sciences of the United States of America 99(8):5486–5491. 114 [243] Zu˜ niga R, Lucchetti A, Galvan P, Sanchez S, Sanchez C, Hernandez A, Sanchez H, Frahm N, Linde CH, Hewitt HS, Hildebrand W, Altfeld M, Allen TM, Walker 186 BD, Korber BT, Leitner T, Sanchez J and Brander C (2006). Relative dom- inance of Gag p24-specific cytotoxic T lymphocytes is associated with human immunodeficiency virus control. Journal of Virology 80(6):3122–3125. 115 187 Appendix A NEXT GENERATION SEQUENCING: EXTENDING THE MODEL TO SINGLE GENOME SEQUENCES In this appendix, we briefly outline the likelihood calculation and EM steps for the Single Genome Sequencing (SGS) case, which was introduced and motivated in subsection 8.1.5 . These calculations follow directly from known methods (see, e.g., [ 93 ], but we provide a derivation here to clarify the model. A thorough exploration of the behavior of the model on real and synthetic data is left for future work. A.1 Likelihood calculation The calculation of likelihood on a tree is a well studied problem, first made popular in the phylogenetics community by Felsenstein [ 60 ]. Although we heavily use this approach in all of our models, we have thus far omitted a thorough discussion of likelihood calculation as it is generally familiar to the community. We will find it useful here, however, to briefly review the likelihood calculation. A.1.1 Notation For a given phylogeny Ψ, observed target variables D = Y 1 , . . . , Y N , observed predictor variables X = X 1 , . . . , X n , and model parameters θ, the likelihood of the model with respect to θ is written L Ψ (D|θ, X) = Pr {D|θ, Ψ, X} . (A.1) As the tree is held constant throughout, we will drop Ψ for the rest of this discussion. The goal is first compute L(D|θ), then to find θ that maximize L. First, let us 188 introduce some notation. Let {V k } represent the set of nodes in the phylogeny. If we are considering the model of conditional adaptation, the {V k } will include the hidden nodes described in chapter 4 . The ordering is arbitrary, but for convenience, let V 0 be the root (which is itself arbitrary, as the models we consider are reversible), and let V 1 , . . . , V N denote the nodes corresponding to variables Y 1 , . . . , Y N (as previously, we will sometimes simply refer to node V i , 1 ≤ i ≤ N , by its corresponding variable Y i ). We denote the parent of a node V k as p(V k ) and the set of children of V k as c(V k ). Now the observed data D consists of the values of the target and predictor attributes for each individual i, each of which corresponds to a leaf on the tree. Let D(V k ) represent the data belonging to the subset of individuals in the subtree rooted at node V k . Recall that the parameters of our model are θ = (π, λ, s). A.1.2 Likelihood calculation in the unlinked case Given this notation, the likelihood can then be written as L(D|θ, X) = v∈{0,1} Pr {V 0 = v|θ} V c ∈c(V 0 ) L(D(V c )|θ, X, V 0 = v) (A.2) where L(D(V k )|θ, X, V 0 = v) is the likelihood of the subtree rooted at V k conditioned on V 0 = v. Note that Pr {V 0 = 1|θ} = π, and Pr {V 0 = 0|θ} = 1 − π. The conditional likelihood of an arbitrary node given its parent can be recursively defined as L(D(V k )|θ, X, p(V k ) = p) = v∈{0,1} Pr {V k = v|θ, p(V k ) = p} V c ∈c(V k ) L(D(V c )|θ, X, V k = v) (A.3) if V k is a branch node. Note that Pr {V k = v|p(V k ) = p, θ} is the transition probability and is a function of the parameters π and λ and the branch length t k between V k and its parent. If V k is a leaf in the phylogeny, then the likelihood is given by the leaf 189 distribution as defined in chapter 4 , which we write as L(D(V i )|θ, X, p(V i ) = p) = y∈{0,1} Pr {Y = y|θ, X, H = p} 1 {Y i = y} , (A.4) where the right hand side of the equation is the familiar leaf distribution, which computes the probability that the trait Y will be y given that the corresponding hidden node is p i . Note that we only want to sum the probabilities for the observed value of Y i . This condition is provided by the identity function 1 {·}, which evaluates to 1 if the condition is true and 0 otherwise. Thus, the likelihood is simply equal to the leaf distribution corresponding to the observed value of trait Y . A.1.3 Likelihood calculation in the SGS case Now consider the case where our data consists of SGS sequences, such that we have multiple HIV sequences for each individual. As before, our first step will be to con- struct a phylogeny for all the sequences. Due to the rapid rate of evolution of HIV, we expect that sequences sampled from a given individual will form a subtree within the larger tree, though this is not guaranteed. For simplicity, however, we will assume this is the case, though the derivation discussed in this appendix can be easily generalized to avoid this assumption. Previously, we had one HIV sequence per individual. When considering the target trait Y and the predictor trait X, we therefore labeled individual variables Y i and X i for each individual i. Now, each individual has multiple sequences. Therefore, define Y i,g to be the variable corresponding the gth sequence from the ith individual, with H i,g denoting the corresponding hidden variable. For simplicity, we will focus on the univariate model with the trait X corresponding to an HLA allele (though the results generalize to Noisy Add as well). Therefore, will will continue to index X as X i . Consider Figure 8.1 on page 134 . The addition of the linked nodes L have two implications. First, they change the leaf distribution calculation to Equation ( 8.1 ), as discussed in subsection 8.1.5 . Second, our likelihood calculation must now integrate 190 out the linked nodes. Intuitively, this can be done by first fixing the values of the linked nodes, then computing the likelihood. We then simply compute the weighted average of the likelihoods under every possible assignment of linked nodes, weighted by the prior probability of a given configuration, which is given by the parameter s 1 and the observed predictor variables X. To make this explicit, we can expand Equation ( A.1 ) around the linked nodes. Specifically, let L 1 , . . . , L N be the set of variables represent the linked trait L for each individual i ∈ {1, . . . , N }. Then the likelihood can be written as L(D|θ, X) = l 1 ∈{0,1} · · · l N ∈{0,1} L(D|θ, X, L 1 = l 1 , . . . , L N = l N ) × · · · . . . × Pr {L 1 = l 1 , . . . , L N = l N |X} . (A.5) The structure of the graph implies that each linked node L i is conditionally indepen- dent of all other linked nodes, given X. Thus, we can compute Pr {L 1 = l 1 , . . . , L N = l N |X} = i Pr {L i = l i |X i } , (A.6) where Pr {L i = l i |X i } = s 1 if X i = 1 and l i = 1 1 − s 1 if X i = 1 and l i = 0 0 if X i = 0 and l i = 1 1 if X i = 0 and l i = 0 (A.7) A straightforward computation of Equation ( A.5 ) would involve an exponential num- ber of terms. However, the independencies implied by the tree allow us to simplify the expression in terms of subtrees, much as we did in Equation ( A.3 ). Notice how each linked node points to a set of individuals. Let M i be the most recent common ancestor (MRCA) of the leaves pointed to by L i , where the MRCA is defined to be root of the smallest subtree (i.e., that subtree with the fewest nodes) that contains all of the leaves in question. For simplicity, suppose each linked node 191 corresponds to a unique MRCA. That is for i = j, M i = M j . Further, suppose that for all i = j, M i is not a descendent of M j , meaning M j is not on the path from M i to the root node. This will typically be the case for the univariate model, though it clearly is not the case for Noisy Add. Nevertheless, the results easily generalize and this assumption will simplify the following discussion. Given these assumptions, the (in)dependencies implied by the phylogeny ( Fig- ure 8.1 ) imply that L(D(M i )|θ, X, L 1 , . . . , L N ) = L(D(M i )|θ, X, L i ). (A.8) That is, the likelihood of the subtree of M i is conditionally independent of all linked nodes except L i . Thus, the conditional likelihood of the subtree rooted at M i is easily computed using L(D(M i )|θ, X, p(M i ) = p) = l∈{0,1} L(D(M i )|θ, X, p(M i ) = p, L i = l) Pr {L i = l|X i } , (A.9) where L(D(V i )|θ, X, p(V i ) = p, L i = l) = y∈{0,1} Pr {Y = y|θ, X i , H = p, L i = l} 1 {Y i = y} if V i is a leaf v∈{0,1} Pr {V i = v|θ, p(V i ) = p} × · · · · · · × V c ∈c(V i ) L(D(V c )|θ, X i , V i = v, L i = l) otherwise (A.10) This observation allows us to use each M i as a cut note. That is, the likelihood cal- culation proceeds as in the unlinked case, except when a MRCA node is encountered. When a MRCA is encountered, we calculate the likelihood of the subtree rooted at the MRCA using Equation ( A.8 ). A.2 Expectation maximization Having described the computation of the likelihood under the linked model, and assuming the reader is familiar with expectation maximization (EM) [ 49 ] on trees, 192 we can now briefly described the EM procedure for the linked model. Recall that EM consists of two steps: the Expectation/E step, in which the expected values of the hidden variables are computed with respect to their posterior probability distribution, and the Maximization/M step, in which the parameters of the model are chosen so as to maximize the likelihood of the model conditioned on the expected values of the hidden variables. In the case of a continuous time Markov process (CTMP), the M step requires the expected transition probabilities between every parent-child pair in the tree [ 167 ]. Using the notation developed in the previous section, we can write the posterior joint probability between node V k and its parent p(V k ) as Pr {p(V k ), V k |θ, X, D} . Note that the posterior is conditioned on all the observed data, not just the data in the subtree rooted at V k . For the unlinked case, these probabilities can be easily computed using standard message passing algorithms (e.g. [ 93 ]). Briefly, during the likelihood calculation, the partial posterior conditional probability Pr {V k |θ, X, D(V k ), p(V k )} = Pr {V k |θ, p(V k )} V c ∈c(V k ) L(D(V c )|θ, X, V k ) L(D(V k )|θ, X, p(V k )) (A.11) is computed and stored. Similarly, a biproduct of computing the likelihood at the root is the posterior marginal probability Pr {V 0 |θ, X, D} = Pr {V 0 } V c ∈c(V 0 ) L(D(V c )|θ, X, V 0 ) L(D|θ, X) . (A.12) Given the posterior marginal probability of any node, we can recurse back down the tree, computing posterior joint probabilities along the way. Given the posterior marginal probability of node V k ’s parent, the posterior joint probability is simply Pr {p(V k ), V k |θ, X, D} = Pr {V k |θ, X, D(V k ), p(V k )} × Pr {p(V k )|θ, X, D} (A.13) and the posterior marginal probability of V k is obtained by taking Equation ( A.13 ) and summing over all potential values of p(V k ). Note that the conditional independencies 193 afforded by the tree structure of the model mean that Pr {V k |θ, X, D(V k ), p(V k )} is independent of all observed data not in the subtree of V k , making this simple message passing scheme possible. Given these posterior joint probabilities, we can now compute the expected value of the CTMP sufficient statistics, as well as maximize the parameters with respect to those expected values, using the method of [ 167 ]. As discussed in section 4.4 , this process also yields the posterior joint probability distribution between the hidden variables H and their corresponding observed target variables Y Pr {Y i , H i |θ, D, X i } , from which we can maximize the selection parameter s according to whichever con- ditional adaptation model we are using. A.2.1 EM for the linked case EM for the linked case proceeds much like the likelihood calculation. That is, we use M i as a cut node and calculate the posterior probabilities for each node in the subtree rooted at M i for each value of L i . The first step in this process is to compute the posterior Pr {L i |θ, X, D}. To do so, we first condition on the parent of M i , isolating L i from the rest of the tree, then use Bayes’ rule to calculate the partial posterior: Pr {L i |θ, X i , D(M i ), p(M i )} = Pr {D(M i )|θ, L i , p(M i ), X i } Pr {L i |X i } L(D(M i )|θ, X i , p(M i )) . (A.14) Note that each time on the right hand side of the equation follows naturally from our linked likelihood calculations. Now we simply multiply Equation ( A.14 ) by the posterior marginal probability of p(M i ) (( A.12 )) to yield Pr {L i |θ, X, D} = p∈{0,1} Pr {L i |θ, X i , D(M i ), p(M i ) = p} × Pr {p(M i ) = p|θ, X} . (A.15) 194 The parameter s 1 can then be maximized by s 1 = i Pr {L i = 1|θ, D, X i = 1} 1 {X i = 1} i 1 {X i = 1} . (A.16) With the posterior of L i in hand, we can now easily compute the posterior joint probability between any node and it’s parent in the subtree rooted at M i by condi- tioning on L i . Specifically, Equation ( A.13 ) becomes Pr {p(V k ), V k |θ, X, D} = l∈{0,1} Pr {V k |θ, X, D(V k ), p(V k ), L i = l} × · · · · · · × Pr {p(V k )|θ, X, D, L i = l} Pr {L i = l|θ, X, D} (A.17) Note that Pr {p(M i )|θ, X, D, L i = l} = Pr {p(M i )|θ, X, D}, allowing us to restrict this process of conditioning on L i to the subtree rooted at M i . Finally, to maximize s 2 , we first compute the posterior Pr {H i,g , Y i,g |θ, X i = 1, D, L i = 1} = Pr {Y i,g |θ, X, D, H i,g , L i = 1} × · · · · · · × Pr {H i,g |θ, X, D, L i = 1} Pr {L i = 1|θ, X, D} , (A.18) then maximize s 2 according to the appropriate leaf distribution and the posterior observed for each sequence. Finally, recall that throughout this discussion we have made the simplifying as- sumption that each linked node has a unique MRCA and no MRCA is a descendent of any other MRCA. When these restrictions are removed, the result is that each linked node is no longer independent of all other linked nodes. Consider the case where two linked nodes L i and L j share the same MRCA (as will likely happen for the Noisy Add model). The above discussion can be easily extended to include this case by noting that we now must condition on L i and L j simultaneously. Likewise, if M j is a descendent of M i , then all calculations within the subtree rooted at M j will need to be conditioned on both L i and L j . In general, the computational complexity will scale exponentially with the number of linked nodes on which we must simultaneously 195 condition. In practice, this scenario is most likely to occur in the Noisy Add model, in which case the calculation will tend to be exponential in the number of linked nodes that are included as predictors for a given target variable. In the examples discussed in this dissertation, linked predictor variables correspond to HLA alleles. In our data, it is uncommon for a single target variable to be predicted by more than a handful of HLA alleles; thus, in this domain, it is unlikely that the computational burden of the linked model will be large enough to make this approach impractical. 196 Appendix B ON COMPUTING FDR FOR FISHER’S EXACT TEST The model selection approach we have used throughout this work is based on estimated false discovery rates. Although our FDR estimates have generally been well-calibrated on synthetic data, the theory of FDR estimation was developed for continuous statistics and does not always perform well on discrete data, especially when the number of observations is small. In this appendix, we digress a bit from the main focus of the work to discuss the implications of FDR on discrete data. In particular, we focus on the application of FDR estimation to two by two contingency tables, for which p-values can be exactly calculated using Fisher’s exact test (FET) [ 66 ]. Although the results discussed in this appendix are not directly applicable to phylogenetic dependency networks, owing to the impracticality of computing exact p- values over our phylogenetic models, there are a number of cases in sequence analysis in which the phylogeny has no bearing on the null distribution of variables, allowing us to use FET directly. In these cases, the results presented here provide a substantial increase in power by leading to less-biased FDR estimates while maintaining provably conservative asymptotic guarantees. The idea is as follows. A key assumption in Storey’s [ 208 , 209 , 211 ] estimation of FDR is that the p-values under the null distribution are distributed according to Uniform[0,1]. In this case, for any α ∈ [0, 1] the probability that we sample a p-value that is at most α is simply α. In discrete data, however, the p-value distribution under the null is often heavily skewed toward one. In such cases, as we have noted ( subsec- tion 5.1.2 ), one can estimate the the FDR based on an empirical null distribution that is sampled from, for example, randomization testing (see refs. [ 57 , 106 , 230 ]). In the 197 case of FET, we can analytically perform exhaustive randomization testing, allowing us to compute the exact null distribution. As well will show, this exact calculation allows us to compute a tight estimation of FDR, as well as to prove some asymptotic conservative guarantees. B.1 Examples of FET for sequence data Let us begin by describing some examples in which FET is the appropriate test for significance. Whereas the bulk of this dissertation has focused on techniques involv- ing phylogenetic correction, here we briefly outline some examples where population structure is non-existent, making FET the most appropriate test. Longitudinal HIV resistance data. The most direct way to identify sequence associations is threw longitudinal analysis. In this study design, the HIV is sequenced in each individual at the start of the trial, and is resequenced at the end of the trial. This approach is most appropriate when the researcher has complete control over the application of the predictor variable. One such example is provided by the study of Richard Harrigan and colleagues [ 91 ]. In this study, approximately 700 patients who had not been on antiretroviral drugs were enrolled and their consensus HIV sequences sequenced (this is the HOMER cohort from chapter 7 ). These patients then initiated one of several types of HIV therapy. Several years later, the consensus HIV sequences were again determined for each patient and compared to the original sequences. The goal was to identify mutations that were associated with specific antiretroviral drugs. When we used the data as an application for the PDN, we considered only the initial sequences (as we needed sequences that did not reflect drug-related adaptation). In this case, it was important to correct for phylogeny. Here, however, we can observe the transitions taking place, rendering the phylogeny irrelevant. For each amino acid a k at position k, we construct a binary variable A such that A = 1 if a k was observed only in the pre-trial sequence, and A = 0 if a k was observed in both the pre-trial and post-trial sequences. The variable A is then tested for independence against each 198 of the drugs in the trial. Here, we tested three classes of drugs over 1194 variables representing observed amino acids at positions in the HIV Protease and Reverse Transcriptase proteins, resulting in 3582 total tests, each with 281 observations. We will refer to this data set as “Resistance”. Epitope mapping. The cellular arm of the immune response identifies and destroys infected cells. Such cells can be identified by the unique strings of viral protein fragments, called epitopes, that are displayed on the surface of infected cells. These epitopes are displayed by human leukocyte antigen (HLA) proteins. Thousands of HLA variations have been identified in humans. Thus, a critical component of HIV vaccine design is the identification and characterization of the set of epitopes that each HLA allele can present on the cell surface. One high-throughput experimental method for epitope mapping is the use of over- lapping peptide (OLP) scans, in which a sliding window of 10-15 amino acid peptide fragments are created based on the viral genome of interest. Each OLP is then tested against cells from dozens of patients, each of whom contains six HLA alleles, using interferon-γ ELISpot assays [ 1 , 156 ]. We then attempt to map HLA alleles to re- sponses. Thus, we test whether patients with the given HLA allele are more likely to have a positive ELISpot assay for a given OLP than patients without the HLA allele. Here, we reanalyze the data of Kiepiela et al. [ 116 ], comparing comparing 219 HLA alleles with 343 OLPs, resulting in 74,774 total tests, with an average of 724 obser- vations per test. These data were collected as part of the Durban cohort analyzed in chapter 7 . This data set is denoted “Epitope”. Sieve analysis. In sieve analysis we try to identify positions in the viral genome at which the variability differs between two different populations [ 79 ]. For example, if the position is more variable in patients from one region than those from another, it may indicate that different forces are acting on that region. Following Gilbert, we have created a data set consisting of 567 HIV clade B sequences from the HOMER cohort [ 25 ] and 522 HIV clade C sequences from the Durban cohort [ 116 , 196 ]. For 199 each of 363 position in the HIV Gag protein, we compared the frequency at which sequences match the consensus for their respective clades. We will call this data set “Sieve”. Linkage disequilibrium mapping. Linkage disequilibrium (LD) occurs when two sites on a chromosome are not statistically independent of each other. This typically occurs when the sites are nearby on the chromosome, such that inheritance of a given allele at one site is correlated with inheritance of a given allele at the other site. When performing genome wide association studies (GWAS), it is helpful to identify a set of variations (SNPs) that are in LD with each other, so that potentially redundant results can be identified. We have taken the SNP data from a recent GWAS [ 37 ], binarized the data and tested for independence between each of 401,017 pairs of neighboring SNPs. There was an average of 1085 observations per test. We this data set “SNP”. Synthetic data To demonstrate various claims throughout this appendix, we use synthetic data sets constructed to closely mimic the above real data sets. We will describe the construction of these data sets in section B.5 , after we have introduced some useful notation and concepts. B.2 Background B.2.1 Contingency Tables and Fisher’s Exact Test Consider an experiment that tests the independence of two random binary variables X and Y . Suppose there are n observations, (x 1 , y 1 ), (x 2 , y 2 ), . . . , (x n , y n ). The results can be summarized in a contingency table t = (a, b, c, d), as defined in Table B.1 . We denote the marginal counts of t as θ t = (θ X , θ ¯ X , θ Y , θ ¯ Y ), which represent the number of times each variable is observed in each state. These counts capture the maximum likelihood estimators for the marginal probabilities Pr {X = 1} and Pr {Y = 1}. We will often drop the superscript t when it is clear from context. 200 Table B.1: 2 × 2 contingency table on binary variables X and Y . Y = 1 Y = 0 X X = 1 a b θ X = a + b X = 0 c d θ ¯ X = c + d Y θ Y = a + c θ ¯ Y = b + d n Our goal is to test the null hypothesis H = 0 that X and Y are independent. Let T be the random variable representing a table with marginals θ. Fisher [ 66 ] showed that, if X and Y are independent and each is independent and identically distributed (IID), then the probability of observing T = t is given by the hypergeometric distribution: pr(t) Pr T = t|H = 0, θ t = θ X a θ ¯ X c n θ Y . (B.1) From Equation ( B.1 ), a two-tailed marginal p-value can be computed for t using p(t) = Pr pr(T ) ≤ pr(t)|θ T = θ t = t ∈perm(t) Pr {t } · 1 {Pr {t } ≤ Pr {t}} , (B.2) where perm(t) = {t : θ t = θ t }, and 1 {·} is the indicator function that evaluates to one if the constraints are satisfied and zero otherwise. We call Equation ( B.2 ) a marginal p-value because it is conditioned on the marginals θ. Many statistics have a uniform distribution of p-values under the null. That is, if P is a continuous random variable representing a p-value, Pr {P ≤ α|H = 0} = α. It is evident from Equation ( B.1 ), however, that the distribution of FET p-values under the null may be strongly skewed toward one and therefore Pr {P ≤ α|H = 0} ≤ α. For example, the minimum achievable p-value for a given n is 1/ n n/2 , and that can only be achieved when the marginals are evenly distributed (i.e., when θ X = θ ¯ X = n 2 ). Figure B.1 shows the distribution of p-values from our various data sets. 201 0 100 200 300 400 500 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 A Resistance 0 500 1000 1500 2000 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 B Epitope 0 25 50 75 100 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 C Sieve 0 5000 10000 15000 20000 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 D SNPs Figure B.1: The histogram of marginal p-values for the various data sets. p-values are transformed using log 2 p. B.2.2 Positive False Discovery Rates Suppose that m hypothesis tests over 2 × 2 contingency tables t 1 , . . . , t m are si- multaneously tested using Fisher’s exact test, with corresponding marginal p-values p 1 , p 2 , . . . , p m , and we wish to estimate or control the false positive rate in some way. Given the values of H 1 , H 2 , . . . , H m , where H i = 0 if the ith null hypothesis is true and H i = 1 if the null hypothesis is false, and a rejection region Γ, the possible results are summarized in Table B.2 . Note that only m, R and W are observed. Here, we will focus on nested rejection regions defined by the Type I error rate α. That is, given an α ∈ [0, 1], we will reject all tests with P ≤ α. For convenience, we will use α to denote this rejection region. 202 Table B.2: Outcomes when testing m hypotheses. Hypothesis Accept Reject Total Null true U (Γ) V (Γ) m 0 Alternative true T (Γ) S(Γ) m 1 Total W (Γ) R(Γ) m Benjamini and Hochberg [ 15 ] proposed controlling the False Discovery Rate (FDR), which they defined to be F DR(α) E V (α) R(α) . (B.3) When R(α) = 0, this quantity is undefined. In this case, Benjamini and Hochberg defined F DR(α) to be 0, which is equivalent to defining F DR(α) E V (α) R(α) R(α) > 0 · Pr {R(α) > 0} . (B.4) In practice, we are only interested in p-value thresholds that result in the rejection of at least one test in our data. Thus, Storey [ 208 ] proposed the positive false discovery rate (pFDR), which is conditioned on the rejection of at least one test: pF DR(α) = E V (α) R(α) Pr {R(α) > 0} (B.5) = 1 Pr {R(α) > 0} · F DR(α). (B.6) In order to estimate pF DR(α), we need to make some assumptions about the hypothesis tests. One useful set of assumptions proposed by Storey [ 208 ] is that the random variables H i are IID Bernoulli variables with prior probability Pr {H i = 1} = 1 − π 0 and that the p-values are IID with P i |H i ∼ (1 − H i ) · F 0 + H i · F 1 , (B.7) for some null distribution F 0 and some alternative distribution F 1 . Under these as- 203 sumptions, Storey [ 208 ] showed that pF DR(α) = Pr {H = 0|P ≤ α} (B.8) = π 0 · Pr {P ≤ α|H = 0} Pr {P ≤ α} (B.9) = E [V (α)] E [R(α)] . (B.10) For small samples, it may be quite likely that no tests are rejected at threshold α. Therefore, Storey [ 208 ] proposed the following estimator for pFDR: pF DR(α) ˆ π 0 · Pr{P ≤ α|H = 0} Pr{P ≤ α} · Pr{R(α) > 0} . (B.11) Equation ( B.11 ) provides estimators for each term of Equation ( B.9 ), with an added correction term that estimates Pr {R(α) > 0}. (Note that we did not employ this term in the previous chapters, as we were concerned with large FDR estimates over a large number of samples, meaning Pr {R(α) > 0} was close to 1.) For p-values that are not biased towards 0 under the null distribution, Storey [ 208 ] showed that pF DR(α) W (λ) · α (1 − λ)R(α) (1 − (1 − α) m ) , (B.12) for some well-chosen λ, 0 ≤ λ < 1, is conservative both asymptotically and in ex- pectation for finite samples using the intuition that Pr {P ≤ α|H = 0} ≤ α and π 0 ≈ W (λ) (1−λ)m . Storey [ 208 , 211 ] and others (e.g., [ 126 ]) have proposed methods for choosing λ. In practice, Storey suggested estimating pF DR(α) for each observed p-value p i and proved that, under the aforementioned independence assumptions, the estimates are simultaneously conservative for all p i [ 210 ]. When the tests are continuous and uni- formly distributed, Pr {P ≤ p i |H = 0} = p i , making Storey’s estimator quite tight in practice. When the tests are discrete, however, Pr {P ≤ p i |H = 0} can be sig- nificantly less than α. For example, in the case of Fisher’s exact test, each ob- served p-value p i is a marginal p-value dependent on the marginals θ i . Although 204 Pr {P ≤ p i |H = 0, θ i } = p i , for some test j = i, Pr {P ≤ p i |H = 0, θ j } ≤ p i . Thus, the overall probability Pr {P ≤ P i |H = 0} ≤ p i , making Storey’s estimate overly con- servative. Although the marginal p-values do not provide good estimators of the overall probability Pr {P ≤ α|H = 0}, precluding a simple procedure of computing pFDR for each p i using Storey’s approach, the null distribution can be estimated by ran- domization tests, as has been discussed in the microarray literature (for review, see [ 36 ]). Nevertheless, Pounds and Cheng [ 183 ] point out that, even assuming Pr {P ≤ α|H = 0} = α, estimation methods for π 0 that rely on continuous assump- tions are not applicable to discrete data. Instead, they propose defining ˆ π 0 2¯ p, (B.13) where ¯ p is the arithmetic mean of the observed marginal p-values. Although Equa- tion ( B.13 ) also assumes a uniform p-value distribution under the null hypothesis, this conservative estimator tends to yield tighter estimates than other methods when the true π 0 is sufficiently small [ 183 ]. B.3 Computing pFDR for Fisher’s exact test Fisher’s exact test provides an efficient means of computing Pr {P ≤ α|H = 0, θ} ex- actly. Although the resulting marginal p-values should not be applied directly to pFDR estimation, the computation can be efficiently leveraged to provide a tight estimate for each of the terms in Equation ( B.11 ). In this section, we define each of these estimates and show that each estimate is unbiased or conservative (in this context, an estimator is conservative if it is expected to overestimate terms in the numerator of Equation ( B.11 ) or underestimate terms in the denominator of Equa- tion ( B.11 )). We evaluate the asymptotic convergence properties of Equation ( B.11 ), showing that it results in a conservative FDR estimate and characterize conditions under which the estimates will be closer to the true values. In addition, we discuss 205 ways in which the exact test can be further leveraged to increase power by filtering irrelevant tests. In what follows, we begin with Storey’s [ 209 ] assumptions. Specifically, we assume that p-values are IID and follow the mixture model Equation ( B.7 ) and the H i are IID Bernoulli random variables. Because we are using Fisher’s exact test, we also need to consider the marginals θ, which we assume are IID Although we we make no assumptions about the specific distribution of θ, we assume that the p-values depend on the marginals according to the mixture model P |H, θ ∼ (1 − H) · F 0 (θ) + H · F 1 (θ), (B.14) where F 0 (θ) follows the hypergeometric distribution as defined in Equation ( B.2 ) and F 1 (θ) is the generating function of the alternative model. We further assume that θ is independent of H, though we will later relax this assumption. B.3.1 Estimating Pr {P ≤ α|H = 0} from pooled p-values To estimate Pr {P ≤ α|H = 0}, we use a pooled p-value estimate, which is the average probability that each test has P ≤ α: Pr{P ≤ α|H = 0} 1 m m i=1 Pr {P ≤ α|H = 0, θ i } . (B.15) Because FET computes Pr {P ≤ α|H = 0, θ i } by considering every permutation of the data that is consistent with θ i , it can be seen that Equation ( B.15 ) is the result of computing every possible permutation of the data. Furthermore, given that H and θ are independent, it follows that the pooled p-values are unbiased estimates of Pr {P ≤ α|H = 0} (see Lemma 1 in section B.6 ). Finally, because Pr {P ≤ α|H = 0, θ} α for many marginals observed in prac- tice, Equation ( B.15 ) can provide a substantial gain in statistical power over the estimate Pr{P ≤ α|H = 0} α. Figure B.2 shows the advantage of using pooled p-values over α. 206 0 0.2 0.4 0.6 0 0.5 A Resistance 0 0.05 0.1 0.15 0 0.5 B Epitope 0 0.2 0.4 0.6 0.8 0 0.5 C Sieve 0 0.2 0.4 0.6 0.8 0 0.5 D SNPs Figure B.2: Pooled vs. marginal p-values. The dashed line shows the α estimator. The distance between the dashed line and the solid line is the gain from using the pooled p-values. B.3.2 Estimating Pr {P ≤ α} and Pr {R(α) > 0} Given that R(α) is the number of observed tests with p ≤ α, it follows that E R(α) m = Pr {P ≤ α} . (B.16) We therefore follow Storey [ 208 ] in defining Pr{P ≤ α} R(α) m . (B.17) Because Equation ( B.11 ) would be undefined for R = 0, Storey [ 208 ] defines Pr{P ≤ α} R(α) ∨ 1 m , (B.18) 207 where R(α) ∨ 1 = max(R(α), 1), though in practice we are typically only interested in values for α that were observed in our data set. To determine pF DR(α), we also require the estimate Pr{R(α) > 0}. It follows from our independence assumptions that Pr {R(α) > 0} = 1 − (1 − Pr {P ≤ α}) m (B.19) ≤ 1 − (1 − Pr {P ≤ α|H = 0}) m , (B.20) making Pr{R(α) > 0} 1 − 1 − Pr{P ≤ α|H = 0} m (B.21) a conservative estimate of Pr {R(α) > 0}. B.3.3 Estimating π 0 The final step in the pF DR computation is to estimate π 0 . In this section, we use a general framework for conservatively estimating π 0 [ 42 ] and show how existing methods fit within this framework. Given the mixture model Equation ( B.14 ), we can write Pr {P = p} = π 0 · Pr {P = p|H = 0} + π 1 · Pr {P = p|H = 1} , (B.22) where π 1 = Pr {H = 1} = 1 − π 0 [ 42 , 78 , 126 ]. Thus, it follows that Pr {P = p} ≥ π 0 · Pr {P = p|H = 0} (B.23) and π 0 ≤ Pr {P = p} Pr {P = p|H = 0} . (B.24) Moreover, for any non-negative function ρ(·) we can also write π 0 ≤ p ρ(p) Pr {P = p} p ρ(p) Pr {P = p|H = 0} . (B.25) 208 Therefore, estimating π 0 using ˆ π 0 m i=1 p ρ(p) · 1 {p i = p} m i=1 p ρ(p) · Pr {P = p|H = 0, θ i } , (B.26) where ρ(·) is any non-negative function, is always a conservative estimate in expec- tation (see Lemma 2 ). Furthermore, in the limit, Equation ( B.26 ) asymptotically converges to π 0 + π 1 · E [ ρ(p)| H = 1] E [ ρ(p)| H = 0] , which implies that the ρ(·) function minimizing E[ ρ(p)|H=1] E[ ρ(p)|H=0] will yield the least biased estimator (see Lemma 3 ). Equation ( B.26 ) gives us great flexibility in computing π 0 estimates. One such estimation method is given by Storey: [ 208 , 209 ] ˆ π 0 (λ) = #{p i > λ} (1 − λ)m (B.27) for some tuning parameter 0 ≤ λ < 1. For uniformly distributed statistics, (1 − λ)m = E [ #{π > λ}| H = 0] . (B.28) For contingency tables, we can estimate Pr {P > λ|H = 0} using Pr{P ≤ α|H = 0}, which results in an unbiased estimate for E [ #{π > λ}| H = 0]. Therefore, Equa- tion ( B.27 ) is a special case of Equation ( B.26 ) in which ρ(p) = 0 if p ≤ λ, 1 otherwise. (B.29) As λ → 0, we have increasingly conservative bias, with ˆ π 0 = 1 when λ = 0, whereas the variance of the ˆ π 0 estimate increases as λ → 1 due to the decreasing number of observations. Indeed, different heuristic approaches have been proposed to balance the bias-variance tradeoff inherent in picking λ. Equation ( B.26 ) suggests an orthogonal heuristic that may be useful in estimating ˆ π 0 : choosing a weighting function ρ(·) 209 such that more weight is applied to tests with high p-value. A natural choice for a weighting function is ρ(p) = Pr {P ≤ p|H = 0, θ i } = p i , which is equivalent to ˆ π 0 E [P ] E [P |H = 0] . (B.30) Under this weighting function, tests with low p-values will still contribute to the π 0 estimate, but not as much as tests with high p-values. In principle, we could define ρ(·) such that we are effectively summing p-values over the range λ < p ≤ 1; in practice, however, we have found the ˆ π 0 estimate to be quite stable over a wide range of λ, and so simply set λ = 0. Pounds and Cheng [ 183 ] suggested the estimator ˆ π 0 2¯ p for discrete statistics, where ¯ p is the average observed marginal p-value. Estimator ( B.30 ) is similar to their estimate, except that for contingency tables we can compute E [P |H = 0] exactly, rather than assuming E [P |H = 0] = 0.5. Table B.3 compares the true π 0 for synthetic data against the π 0 estimators of Equation ( B.26 ) and of Storey [ 209 ] (evaluated for λ = 0.5 1 ) and Pounds and Cheng [ 183 ] computed using marginal p-values. Equation ( B.26 ) provides a tight and con- servative estimate, substantially increasing power of methods that assume a uniform p-value distribution. Similarly, accounting for the exact p-value distribution results in a substantially lower π 0 estimate on each of the real data sets ( Table B.4 ). B.3.4 Convergence properties We have presented estimates for each component of pF DR, showing how each is either unbiased or conservative. It follows that pF DR is asymptotically conservative as m → ∞. We can, however, be more precise in estimating the convergence properties of our pF DR estimate. Specifically 1 We used the available R code from http://genomics.princeton.edu/storeylab/qvalue/index.html. Results are truncated at 1. λ = 0.5 was chosen because the spline-fitting method [ 211 ] always results in estimates of π 0 = 1 for these discrete data sets. 210 Table B.3: Comparing ˆ π 0 estimations for synthetic data sets derived from the Epitope data with different true π 0 . Storey’s method was evaluated at λ = 0.5. π 0 E[P ] E[P |H=0] 2¯ p Storey 0.001 0.086 0.12 0.11 0.05 0.13 0.18 0.17 0.1 0.18 0.25 0.24 0.2 0.27 0.38 0.37 0.3 0.36 0.52 0.51 0.4 0.45 0.64 0.63 0.5 0.58 0.69 0.68 0.6 0.66 0.79 0.79 0.7 0.75 0.91 0.90 0.8 0.83 1.01 1 0.9 0.92 1.13 1 1 1.003 1.24 1 Theorem 1. Given m tests, in which the P-values are IID and distributed according to the mixture Equation ( B.14 ), the H are IID Bernoulli random variables, and for each test i, θ i is independent of H i , and given a non-negative function ρ(·): lim m→∞ pF DR(α) a.s. = π 0 + π 1 E[ ρ(p)|H=1] E[ ρ(p)|H=0] π 0 · pF DR(α) (B.31) Proof. Provided in section B.6 . This convergence theorem shows that, for large samples, our pFDR estimate is conservative and becomes tightest when ˆ π 0 is computed using a ρ(·) function that is expected to be much lower for true alternative hypotheses than for true null hypothe- ses. 211 Table B.4: Comparing ˆ π 0 estimations for the real data sets. data set E[P ] E[P |H=0] 2¯ p Storey Sieve 0.63 0.82 0.69 SNPs 0.34 0.44 0.41 Epitope 0.99 1.84 1 Resistance 0.97 1.38 1 B.3.5 Dependent marginals Until now, we have assumed the following mixture model: P |H, θ ∼ (1 − H) · F 0 (θ) + H · F 1 (θ), (B.32) where θ is independent of H. It is conceivable, however, that true alternative tests will tend to have more balanced marginals that permit lower p-values. For example, in the Resistance data, it is possible that positions in which no mutations confer drug resistance may be more conserved due to purifying selection, an evolutionary process that results in less observed variation. Under these conditions, we might expect the pFDR estimate to become more conservative because a substantial proportion of the balanced marginals included in our pooled-p-value estimate belong to true alternative tests causing us to overestimate Pr {P ≤ α|H = 0}. This conservative bias is analogous to that observed in the microarray community—a bias that arises from permutation testing over alternative data that has a higher variance than the null data [ 106 , 230 ]. In this section, we formalize the problem for discrete statistics and show that our pFDR estimate becomes asymptotically more conservative as the tendency increases for true alternative tests to have more evenly distributed marginals than null tests. Note that these results hold only for π 0 estimators in which ρ(·) is non- decreasing, a slightly more restrictive definition than we used in the previous sections. In principle, the conservative bias could be reduced using heuristic measures similar 212 to those proposed in the microarray community [ 106 , 230 ]; however, these heuristics are not guaranteed to result in a conservative pFDR estimate, so we will not explore them further here. Let us consider a special case in which Pr {θ|H = 1} = Pr {θ|H = 0}. Specifically, we shall consider the case where the marginals of true alternative hypotheses tend to have larger n and/or are more balanced, meaning that the marginals of the true alternative hypotheses tend to permit smaller p-values than the marginals of the true null hypotheses. In this case, θ Pr {P ≤ α|H = 0, θ} · Pr {θ|H = 0} assumed ≤ θ Pr {P ≤ α|H = 0, θ} · Pr {θ|H = 1} . (B.33) Under these assumptions, we can derive the following large sample theorem: Theorem 2. Given m tests that follow the assumptions of Theorem 1 , a non-decreasing function ρ(·), and Assumption ( B.33 ), then lim m→∞ pF DR(α) ≥ π 0 + π 1 E[ ρ(p)|H=1] E[ ρ(p)|H=0] π 0 · pF DR(α). (B.34) Hence, under the above stated assumptions, and provided ρ(p) is non-decreasing in p, our pFDR and FDR estimates will be asymptotically more conservative than the case where the marginals are independent of H. In practice, it is not known whether H and θ are independent. Consequently, we recommend using the more restricted class of ρ(·) functions allowed by Theorem 2 . B.3.6 Filtering irrelevant tests The discreteness of the data provides a unique opportunity to increase power further. For any discrete test, there exists an α such that Pr {P ≤ α} = 0. It turns out that including such tests in our pFDR estimate will typically increase the conservative bias. Thus, power can often be improved by first filtering out all all tests that couldn’t 213 possibly achieve the significance threshold α. As we shall see, such filtering will still result in an asymptotically conservative estimate and, in some cases, will provably increase power. Because the range of the data is finite, computation of the most significant achiev- able p-value given the marginals is possible. We can define the minimum achievable p-value for fixed marginals as p ∗ (θ) min t∈perm(θ) p(t). (B.35) When computing pF DR(α), we can ignore (filter) all tests i such that p ∗ (θ i ) > α. (B.36) For a set of contingency tables T, we can now write T = T + α ∪ T − α (B.37) for the disjoint sets T + α = {t i : t i ∈ T ∧ p ∗ (θ i ) > α}, and (B.38) T − α = {t i : t i ∈ T ∧ p ∗ (θ i ) ≤ α}. (B.39) Filtering on p ∗ (θ) > α can be seen as estimating pF DR over T − α . Let pF DR ∗ (α) and pF DR ∗ (α) be the true and estimated pFDR, respectively, for T − α . Then the following theorem holds. Theorem 3. Given m tests that follow the assumptions of Theorem 1 , pF DR ∗ (α) = pF DR(α). (B.40) Proof. Recall that, for independent, identically distributed tests, pF DR(α) = E [V (α)] E [R(α)] . (B.41) 214 Because E [ R(α)| Pr {p(T ) ≤ α} = 0] = 0, (B.42) it follows that pF DR(α) will be the same for T and T − α . Therefore, pF DR ∗ (α) = pF DR(α). Therefore, filtering on p ∗ (θ i ) > α) does not change the true pF DR. Furthermore, under the assumptions of Theorem 2 , we have pF DR ∗ (α) a.s. ≤ lim m→∞ pF DR ∗ (α), (B.43) Consequently, we can compute pF DR ∗ (α) instead of pF DR(α). Moreover, if true alternative tests are monotonically more concentrated at lower p-values; that is, if Pr {P ≤ α|H = 1} Pr {P ≤ α|H = 0} (B.44) is non-increasing in α, then lim m→∞ pF DR ∗ (α) a.s. ≤ lim m→∞ pF DR(α), (B.45) meaning that filtering is asymptotically guaranteed to provide a tighter estimate and therefore increase power (see Lemma 4 in Appendix for details). Although this condition is often not met for discrete data, it is often approximately met and provides a good rationale for filtering. In addition, because both pF DR(α) and pF DR ∗ (α) are asymptotically greater than pF DR(α), for large samples we can compute both the filtered and unfiltered estimates and choose whichever yields the lower value. The only estimate that filtering changes is ˆ π 0 . Let ˆ π 0 (α) denote the estimated π 0 over T − α (α), and ˆ π 0 (1) denote the estimated π 0 over T. It may be that ˆ π 0 (α) ≤ ˆ π 0 (1), and ˆ π 0 (α) is not a conservative estimate of the true π 0 of the original set of tests T. ˆ π 0 (α) is, however, a conservative estimate of π 0 among the filtered set of tests T − α (α). That is, ˆ π 0 (α) is a conservative estimate of the proportion of tests that are truly null among those tests that could achieve significance level α. In addition to |
Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©fayllar.org 2024
ma'muriyatiga murojaat qiling
ma'muriyatiga murojaat qiling