Phylogenetic dependency networks: Inferring patterns of adaptation in HIV
Figure 5.1: PR and calibration curves on synthetic data. (A) PR curves for synthetic coevolution data. (B) Calibration on synthetic coevolution data. (C) PR curves for synthetic conditional data. (D) Calibration on synthetic conditional data. PR panels plot precision against recall; calibration panels plot 1 − precision against q. Series: Conditional, Undirected Joint, and FET (panel B also includes FET Parametric Bootstrap).

5.2.1 Sensitivity to tree structure

Our approach raises the question of how sensitive the results are to the structure of the tree used by the models. To address this question, we ran the conditional model on the synthetic conditional evolution data using four different trees: the tree used to generate the data (T_gen), a tree with the same structure as T_gen but with the leaf-to-patient assignments randomized (T_rand), and two trees reconstructed from the synthetic amino acids using either a generalized maximum likelihood method (T_ML, the method we use throughout this work) or a naïve parsimony method (T_pars).

As expected, the conditional model performed best using T_gen, though the discrimination curves were not significantly different from those of T_ML and T_pars, indicating that the conditional model is robust to variations of the tree on this data set (Figure 5.2A). Importantly, although the discrimination curve was significantly worse using T_rand rather than T_gen (p = 0.016), the conditional model was calibrated on all four trees, indicating that, on this data set, poor trees resulted in a loss of power but not in an inflation of false discovery rates (Figure 5.2B).
This point is reinforced by the number of associations identified at q < 0.20 for the different trees: T_gen yielded eighty-nine predictions, whereas T_rand yielded only sixty-five, T_ML yielded seventy-eight, and T_pars yielded eighty-two. The positive predictive value for these predictions ranged from 0.80 to 0.85. Although it may seem counter-intuitive that the randomized tree could find any associations, we note that the problem with the conditional model using T_rand is analogous to that of an IID model. Namely, whereas an IID model will over- or under-count observations by not accounting for hierarchical structure that exists in the data, a randomized tree will over- or under-count observations by assuming false hierarchical structure. In addition, the conditional model can compensate for a tree that fits the target data poorly by setting the mutation rate to infinity and thereby assuming the data to be IID. Indeed, the median λ value under T_rand was an order of magnitude higher than that under T_gen.

Figure 5.2: PR and calibration curves on conditional evolution data when run using different trees. (A) PR curves (precision against recall). (B) Calibration curves (1 − precision against q). Series: Generating Tree, PhyML Tree, Parsimony Tree, and Randomized Generating Tree.

5.3 Application 1: Effect of immune pressure on HIV evolution

To investigate the effects of immune pressure on HIV evolution, Moore et al. [157] obtained HIV sequences from 234 individuals along with the HLA-A and HLA-B alleles of the infected individuals.
Performing several analyses, all of which assumed the data to be IID, they found strong correlations between the presence or absence of amino acids at particular positions and the presence or absence of particular HLA alleles in the infected patients, presumably reflecting the “escape” of amino acids under immune pressure. In Bhattacharya et al. [17], we analyzed a similar data set (N = 96, HLA-I and HLA-II alleles) and showed that use of the conditional model substantially improved the accuracy of identification of such HLA-codon associations. Here, we analyzed a superset (N = 205) of the data used in [17] (HLA-I alleles only) using both the conditional and undirected joint models.

First, we constructed a phylogenetic tree from the full set of sequences (subsection 5.1.4). We then used the single-variable model to determine whether any HLA alleles followed the tree. We found that two pairs of HLA-I alleles (B*4201 and Cw*1701; A*0207 and B*4601), each pair in tight linkage disequilibrium, followed the tree. We therefore separated the HLA data into two sets, (1) “C1701”, consisting of these four alleles, and (2) “notC1701”, consisting of the remaining alleles, and analyzed the two sets separately.

Our results using BIC show that the conditional model better explains the notC1701 data (p = 0.0001, N = 256296), whereas the undirected joint model better explains the C1701 data (p = 1.9 × 10^−24, N = 5664). In the case of the C1701 data, it seems that the phylogenetic tree is more a confounder of the data in the traditional sense, wherein the tree is associated with both the HLA and the sequences and induces false correlations between HLA and sequence.

In this application, we were fortunate that additional information was available to help confirm the HLA-sequence associations that we found.
In particular, a known epitope in the vicinity of a found association supports the validity of that association, as immune pressure is focused on epitopes and the immediate surrounding regions that participate in the presentation of those epitopes on the HLA molecules at the cell surface [119]. Thus, we constructed discrimination curves where an HLA-sequence association was considered “true” if it was within three amino acids of a known epitope with the corresponding HLA and “false” otherwise. This bronze standard does not take into account undiscovered epitopes or linkage disequilibrium, but should nonetheless be unbiased with respect to a comparison of the alternative methods for identifying associations.

The discrimination curve in Figure 5.3 for the notC1701 data is consistent with the BIC and synthetic results, indicating that the conditional model best fits these data. We could not construct a discrimination curve for the C1701 data, as there are no known A*0207, B*4201, B*4601, or Cw*1701 epitopes in Gag. The associations found by the conditional model with q < 0.2 on the real data are shown in Table 5.1.

Figure 5.3: PR curves (precision against recall) for the full real HLA-amino-acid data. Series: Conditional, Undirected Joint, and FET. Ground truth was estimated by identifying known epitopes within three residues of the predicted association. This definition is known to miss a large number of real epitopes.

5.4 Application 2: Pairwise correlations between amino acids in HIV

Identification of pairwise correlations between amino acids is important to many areas of biology, as correlations can indicate structural or functional interaction [27, 67]. Many methods, including the undirected joint model [178], have been developed to identify correlated residues.
Continuing our focus on HIV, we applied both the undirected joint and the conditional model to the sequence data from the Western Australia cohort [157]. We concentrated on the HIV-1 p6 protein, which is cleaved from the Gag 55 polyprotein. This fifty-two-amino-acid protein was chosen because it is the shortest HIV protein, making pairwise amino-acid tests feasible for all models. We fit the conditional model in both directions (making both X and Y target variables), and selected the best model according to BIC.

Remarkably, the BIC scores of the conditional model are significantly higher than those of the joint model (p < 10^−100, N = 52767). We suspect that the conditional model may be better because many mutations could be compensating for other mutations driven by HLA immune pressure. The conditional model finds that 893 of the 52767 (1.7%) amino-acid pairs, and 310 of 1300 (24%) position pairs, are correlated at q < 0.2. This dense network of interactions is consistent with the idea that many of the mutations are compensatory in nature. For example, the conditional model identifies two HLA-mediated escape mutations in p6 (Table 5.1). Mutations at these two positions account for forty-two (13.5%) of the position-pair correlations.

Table 5.1: Predicted HLA-amino acid associations in Gag.

Pos   HLA     p         q
242   B5701   4.3E-08   0
28    A0301   1.5E-07   0
242   B5801   3.2E-06   0.03
147   C0602   5.0E-06   0.03
26    C0303   6.9E-06   0.05
482   B4001   2.8E-05   0.1
397   A3101   3.8E-05   0.13
495   B4701   6.9E-05   0.17

We have developed a tool for visualizing the network of dependencies (Figure 5.4). The visualization highlights at least one potentially interesting set of interactions. In particular, R16 is strongly correlated with residues at positions 21 through 36, many of which are correlated with each other as well as with residues throughout the protein. This complex network of interactions connects the two α-helix domains of p6 [68] and may be of structural or functional significance.
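The direction-selection step used in this application (fit the conditional model with each variable as the target, then keep the fit with the better BIC score) can be sketched as follows. The sketch assumes the higher-is-better form of BIC, log-likelihood minus (k/2) ln N, which matches the way scores are compared in the text but is an assumption about the exact definition. The function names, log-likelihoods, parameter counts, and sample size are hypothetical placeholders, not values from the study.

```python
import math


def bic_score(log_likelihood, num_params, num_samples):
    """Higher-is-better BIC: log-likelihood minus a complexity penalty.

    Assumed form: logL - (k / 2) * ln(N).
    """
    return log_likelihood - 0.5 * num_params * math.log(num_samples)


def select_direction(fits, num_samples):
    """Pick the direction whose fitted model has the best BIC score.

    `fits` maps a direction label to (log_likelihood, num_params).
    Returns the winning label and the score of every direction.
    """
    scores = {
        direction: bic_score(log_lik, k, num_samples)
        for direction, (log_lik, k) in fits.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical fits for one amino-acid pair (illustrative numbers only).
fits = {
    "X->Y": (-120.4, 3),  # Y is the target, X the predictor
    "Y->X": (-118.9, 3),  # X is the target, Y the predictor
}
best, scores = select_direction(fits, num_samples=200)
print(best, scores)
```

When both directions have the same number of parameters, as in this toy case, the penalty cancels and the choice reduces to comparing log-likelihoods; the penalty matters only when the candidate models differ in size.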
Figure 5.4: Correlated amino-acid pairs in HIV-1 p6. The fifty-two consensus amino acids of p6 are drawn as a circle, with the N-terminal end shown at the far right and the protein extending counter-clockwise. Each arc represents an association predicted by the conditional model that is significant at q < 0.2. Arc color reflects the q-value of the association. Dark gray consensus residues denote positions where there are fewer than three sequences with a non-consensus residue.

5.5 Application 3: Genomic search for genotype-phenotype associations in Arabidopsis thaliana

Aranzana et al. recently demonstrated the potential utility of genome-wide association studies (GWAS), as well as the importance of accounting for hierarchical population structure [8]. In this study, the authors genotyped 848 loci in ninety-six Arabidopsis thaliana strains and looked for haplotypes that were correlated with hypersensitive response to P. syringae strains expressing one of three avirulence (avr) genes (avrRpm1, avrRpt2, or avrPph3). In plants, each avr bacterial protein is recognized by a corresponding resistance (R) gene. If both plant and pathogen have active copies of the respective avr-R genes, a biochemical cascade is triggered at the point of infection, leading to massive programmed cell death and containment of the infection (for review, see [43]). Using both an IID-based model and a method that used the hierarchical population structure that was constructed from the sequenced loci, the authors showed that loci adjacent to the known R genes are highly correlated with the corresponding avr phenotypes. Unfortunately, the authors noted that their statistics were poorly calibrated, precluding confident predictions of the other pathogen-response proteins that are involved in the hypersensitive response cascade. Here, we apply our well-calibrated methods to the same data, using a genetic-similarity tree constructed from the sequence data.
Although Arabidopsis is a sexually reproducing species, it is highly selfing, meaning that organisms primarily mate with themselves. As a result, the population structure induced is hierarchical and bears striking resemblance to a phylogenetic tree. Aranzana et al. found that a tree built from pairwise similarity matrices on shared alleles provided a good qualitative description of both the geographic distribution of the organisms and the distribution of avr and flowering time alleles [8]. Quantitatively, we found that sixty-one percent of the haplotypes and two of the three phenotypes followed the “phylogenetic” tree constructed from the sequence data.

When applying our conditional model to this application, it is not clear whether the target variables should be haplotypes or phenotypes. In general, genetic variations directly influence phenotypes, but phenotypes also indirectly influence haplotypes through selection pressure. As two thirds of both variables followed the tree, we ran the conditional model in both directions, once using the phenotypes as the target and once using them as the predictor, using BIC to determine which direction was best for any given haplotype-phenotype pair.

We found that the BIC scores for the conditional and undirected joint models were not significantly different (p = 0.70, N = 14043). Consequently, we arbitrarily chose to examine the results of the conditional model in detail. Figure 5.5 shows the genome-wide distribution of conditional evolution q-values for each of the three phenotypes. For each phenotype, the most significant association is a locus near the corresponding R gene. We constructed this figure to be similar to the one in Aranzana et al. to facilitate comparison.

Our synthetic tests indicate that the conditional method is well calibrated, implying that roughly 80% of the associations we find with a q < 0.2 cutoff should be legitimate.
To explore this implication, we took the fifty-one genotypic associations (comprising forty unique loci) that correlate with these three hypersensitive phenotypes at this cutoff, and noted which of the associated loci were near known or putative bacterial response proteins according to http://www.arabidopsis.org . Our standard of “true positive” was defined to be proximity to a protein whose description included the phrase “disease response” (see subsection 5.1.7). We found that twenty-three (45%) of the predictions were within fifty kilobases of such proteins. This bronze standard undoubtedly contains false positives and false negatives, and therefore cannot be used to confirm that our methods are calibrated. Nonetheless, we can easily reject the null hypothesis that these twenty-three associations are found near disease-response proteins by chance (p < 0.0001).

Figure 5.5: Genomic distribution of genotype-phenotype association scores for Arabidopsis bacterial response. 4681 haplotypes were compared against each of the three bacterial response phenotypes, Rpm1 (Top), Rpt2 (Middle) and Pph3 (Bottom). For each haplotype, the four conditional models were run and the negative log10 of the most significant q-value is plotted. For each phenotype, the most significant association is a locus within 10 Kb of the corresponding R gene. The dotted line shows the q = 0.20 threshold.

5.6 Studies using the univariate conditional evolution model

The univariate conditional model has proven useful in predicting HLA-mediated CTL escape mutations in HIV. After presenting the model in the context of demonstrating the confounding effect of phylogeny in this domain [17], as well as a detailed examination of the model [31] (from which the above sections were taken), several studies have used the conditional model in the context of HLA-mediated escape in HIV. Here we briefly review these studies.

The first paper to use the model in detail was Brumme et al.
[24], who applied the approach to the Protease, RT, Vpr and Nef HIV proteins. There the authors used a cohort of about 700 individuals chronically infected with clade B virus, in the largest study of its kind. The associations predicted by the conditional model suggested a broad effect of HLA imprinting on HIV clade B protein diversity, with Nef exhibiting the broadest effect. Importantly, the fact that associations were identifiable extends previous case studies that suggest CTL escape is broadly consistent and therefore predictable (see chapter 3). With regard to method validation, 35% of associations at 20% FDR mapped to published epitopes against the same HLA allele. An additional 50% of the associations mapped to predicted epitopes, of which 20–40% (depending on protein) were broadly confirmed using independent interferon-γ (IFN-γ) ELISpot data. Furthermore, the utility of phylogenetic correction, even on a relatively homogeneous (single-clade) cohort, was demonstrated by the higher proportion of conditional evolution associations that mapped to known epitopes than did associations computed via Fisher’s exact test. Finally, Brumme et al. found a weak but significant negative correlation between the number of predicted resistant substitutions and CD4 counts, an observation confirmed independently in each of the studied proteins.

In a similar follow-up study, Brumme et al. [25] studied HLA-mediated escape in Gag, a protein suspected to be particularly influential in control of HIV (see chapter 3). Here, the number of potential escape sites (as determined by HLA profile and the conditional model associations) was found to be negatively correlated with pVL, suggesting that the potential to broadly target Gag is correlated with relative control of infection. Furthermore, the proportion of escaped sites was positively correlated with pVL, suggesting that viral load increased as escape mutations were selected.
Although statistically significant, these trends explained only a small fraction of the pVL variance, suggesting that much of the complexity is not captured by these simple studies. At 20% FDR, 46% of associations mapped within or near published epitopes, while an additional 12% mapped within or near putative epitopes supported by either IFN-γ ELISpot or epitope prediction data (rules for epitope prediction were made stricter here than in [24] due to further studies on the false discovery rates of the prediction algorithm).

Rousseau et al. [196] looked at associations across the entire clade C genome on a similar-size cohort, using a combination of the conditional evolution model and models developed by Los Alamos National Labs (LANL) [17] and in the Mullins lab. This study compared the associations found in their clade C cohort to those found by Brumme et al. [24] and found both common and divergent escape associations, suggesting that future studies should more directly assess clade similarities and differences in this domain. In agreement with work by Goulder and others (chapter 3), Rousseau et al. found that, genome wide, HLA-B and HLA-C alleles were more likely to drive HIV diversity than were HLA-A alleles, suggesting a more active role for HLA-B and HLA-C in HIV control. Remarkably, they found significant differences in the ratio of predicted susceptible to resistant residues in patients without the HLA on a per-protein basis, suggesting that escape mutations are more costly to viral fitness (and thus under stronger reversion pressure) in some proteins than in others. In agreement with previous work that found that targeting Gag correlates with lower pVL, Gag had the second highest susceptible-to-resistant ratio, behind only Vpr, an often ignored accessory protein. Interestingly, Rousseau et al. also analyzed 9-mers (as opposed to
single codons). Although in general the power of 9-mers was greatly reduced due to the large number of observed 9-mers, they did find some escape patterns that involved two substitutions that were not detected when only a single codon was considered, suggesting the presence of alternative escape pathways that dilute the signal when only single codons are considered.

Matthews et al. [153] took the correlations with pVL one step further. Using a superset of the cohort in Rousseau et al., they focused on Gag, Pol and Nef, identified new epitopes and looked at the correlation of different types of associations with pVL. Specifically, using ELISpot data, they first confirmed previous reports that most epitopes are targeted by HLA-B alleles, and that targeting of B-restricted epitopes in Gag was correlated with lower pVL. They next examined the associations derived using the univariate conditional model. Remarkably, at 5% FDR, they found that 92% of HLA-B associations were within described epitopes. Furthermore, the number of associations in Gag per HLA-B allele was strongly correlated with median pVL for patients with that allele (r = −0.57, p = 0.0034). Comparing these results to the ELISpot results suggests that measuring escape is a good surrogate for measuring the number of epitopes actually targeted by an allele. Furthermore, Matthews et al. went on to show that most of the correlation came from reversion associations, suggesting that targeting epitopes for which escape elicits a fitness cost (and hence pressure to revert in patients without the allele) has the most effect on viremia control.

It is natural to assume that it would be beneficial for the immune system to target conserved regions of HIV, as these regions are presumably conserved because fitness constraints limit the amount of variation that can be tolerated [194]. To directly test this hypothesis, Wang et al.
[225] analyzed escape associations in a whole-viral-genome study of 98 patients and compared the conservation of escaping sites in protective and hazardous HLA alleles. Strikingly, while they found no correlation between conservation and relative hazard of HLA alleles when looking at all associations, they found a strong correlation (R = −0.57, p = 0.028) among associations in epitopes that are immunodominantly targeted during acute infection. This result suggests that targeting conserved epitopes early in infection may lead to relative control of the virus and improved prognosis.

Collectively, these studies add to a vast literature that was previously based largely on case studies and small cohorts. They show that patterns of immune escape are generally consistent, and that the control of viremia associated with a given HLA allele and protein is determined largely by the specific epitopes that the allele presents and CTLs target, as well as by the escape mutations that are available to the virus. From the principles of selection pressure and adaptation, we can assume that, for a given epitope, there are essentially three viral load functions of interest and that they can be ordered as:

pVL(susceptible, CTL) < pVL(resistant) ≤ pVL(susceptible, noCTL)

In cases where pVL(resistant) ≈ pVL(susceptible, noCTL), targeting the epitope will have only a transient effect that disappears after escape, and the resistant form of common HLA alleles may come to dominate the population (see chapter 3). However, when pVL(resistant) ≪ pVL(susceptible, noCTL), viremia control may continue even after escape, and reversion upon transmission to HLA-mismatched recipients is expected to be rapid. Thus, studies correlating pVL to various types of escape may suggest which epitopes can most effectively be targeted, and thus, which epitopes are promising targets for a T-cell-based vaccine.
Moving forward, identifying those specific escape variants that do not completely restore fitness will be of vital importance.

The most direct assessment of pVL(susceptible, noCTL) − pVL(resistant) recently came from Goepfert et al. [81], who took the conditional model associations from the clade C Gag sequences and applied them to 117 matched donor-recipient pairs, with the goal of determining whether transmission of resistant variants from the donor correlated with reduced pVL in the HLA-mismatched recipient six months after infection, thus directly measuring the cumulative fitness cost of escape mutations. Similar to the previous studies, the authors found that only Gag HLA-B associations correlated with reduced pVL, suggesting that B-allele control of viremia is largely due to the fitness costs of the selected escape substitutions.

Finally, to analyze the rate of escape in acute infection and to quantify the amount of HIV evolution attributable to HLA-mediated selection pressure, Brumme et al. [23] took longitudinal samples from 100 acutely infected patients sequenced repeatedly from within six months of infection through two years post infection and compared the observed substitutions to the HLA associations from [24]. They showed that rates of escape correlated with known patterns of immunodominance and that over 35% of evolution in acutely infected patients is directly attributable to HLA-mediated selection pressure.

5.7 Limitations of univariate conditional evolution model

Despite the demonstrated effectiveness of the univariate conditional model for CTL-mediated HIV escape, one key limitation of the univariate model has become apparent from these studies: the network of correlations is such that pairwise tests find too many spurious associations. The most obvious source of confounding comes from the linkage disequilibrium (LD) structure among the HLA alleles.
Because the A, B and C alleles lie in the same region on the chromosome, they tend to be inherited together. For example, in the Brumme et al. study [24], 272 HLA allele pairs are in LD at 20% FDR. Thus, for example, if B14 leads to escape at position i, then with high probability, we will find an association between C08 and position i, as all B14 patients also have C08, and half of C08 patients have B14.

In the studies by Brumme and Rousseau [24, 25, 196], HLA LD was accounted for using an ad hoc post-processing step. Briefly, for every codon that had the exact same association (same amino acid and direction of correlation) with two HLA alleles H_1 and H_2 that were in LD, if one of the HLA alleles was known to bind an epitope near that codon, that allele was kept and the other discarded; otherwise, the allele with the lower q-value was kept. Although this approach generally works, it requires an arbitrary definition of LD and will miss associations where both alleles interact with the same epitope.

Thus, in Matthews et al. we implemented a Decision Tree approach that effectively automated the ad hoc procedure. Here, for a given amino acid at a given codon, we run the univariate model as described here. We then identify the HLA with the strongest association (by p-value), remove all patients who have that HLA, then repeat the procedure on the remaining patients. This procedure is iterated until the most significant association has p > 0.05. Thus, the resulting associations are independent of each other, as they measure effects in the absence of previously defined (more significant) associations. The tradeoff is loss of power, as on each iteration fewer patients are considered. Furthermore, interactions between HLA alleles are ignored. As observed in all three studies, many codons experience “push-pull” effects between two or more alleles, where each allele selects for a different amino acid.
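The iterative exclusion procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the studies: a two-sided Fisher's exact test stands in for the phylogenetically corrected univariate model, and the function names, toy cohort, and allele labels are hypothetical, chosen to mirror the B14/C08 example (every B14 patient carries C08, and half of the C08 patients carry B14).

```python
from math import comb


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def iterative_associations(patients, alleles, alpha=0.05):
    """Repeatedly take the strongest allele, then drop its carriers.

    `patients` is a list of (set_of_alleles, has_amino_acid) pairs.
    Stand-in test: Fisher's exact test, NOT the phylogenetic model.
    """
    remaining = list(patients)
    found = []
    while remaining:
        best = None
        for allele in alleles:
            a = sum(1 for hla, aa in remaining if allele in hla and aa)
            b = sum(1 for hla, aa in remaining if allele in hla and not aa)
            c = sum(1 for hla, aa in remaining if allele not in hla and aa)
            d = sum(1 for hla, aa in remaining if allele not in hla and not aa)
            if a + b == 0 or c + d == 0:
                continue  # allele absent from, or universal in, this subset
            p = fisher_exact_p(a, b, c, d)
            if best is None or p < best[1]:
                best = (allele, p)
        if best is None or best[1] > alpha:
            break  # stop once the top association is no longer significant
        found.append(best)
        remaining = [(hla, aa) for hla, aa in remaining if best[0] not in hla]
    return found


# Toy cohort: escape is driven by B14 only; C08 merely travels with it.
patients = (
    [({"B14", "C08"}, True)] * 10
    + [({"C08"}, False)] * 10
    + [({"A02"}, False)] * 20
)
# On the full cohort, C08 looks associated only because of its LD with B14.
p_c08_full = fisher_exact_p(10, 10, 0, 20)
found = iterative_associations(patients, ["B14", "C08", "A02"])
print(p_c08_full, found)
```

In this toy cohort the pairwise test on all patients flags C08 as significant, whereas the iterative procedure reports only B14, because the C08 signal vanishes once B14 carriers are removed. The loss of power noted in the text is also visible here: each round tests a smaller set of patients.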
Although measuring associations in the absence of certain alleles is useful, it may be beneficial to model the combined effects.

Similar to HLA LD, covariation among HIV codons can be expected to be a source of confounding. As seen in Application 2 above, there appears to be a dense network of covariation among HIV codons, at least in some proteins. This covariation network will lead to spurious associations. For example, if HIV codons A and B are correlated, and HLA H leads to escape in A, then we have the causal model H → A → B. If these associations are strong enough, the univariate conditional model will find the associations H → A (correctly) and H → B (incorrectly). As the density of codon covariation increases, so will the number of spurious associations. In the context of HLA-mediated escape, this may be tolerable, in the sense that such indirect associations are still “HLA-associated” and may have experimental relevance. Nevertheless, determining which associations lead to epitope escape and which are (for example) compensatory may have direct implications on vaccine design. Furthermore, when codon-codon associations are the primary goal, as in the case of the amino acid covariation problem discussed in chapter 2, indirect associations will tend to lead to false conclusions. In addition, given the density of HLA-associated polymorphisms [24, 25, 196], it is clear that these associations will play a confounding role in contexts where patterns of covariation are sought. It is therefore desirable to build a conditional model of evolution that simultaneously accounts for all known sources of selection pressure.

5.8 Conclusions

We have described two evolutionary processes that can confound association analyses and have defined two corresponding generative models for discrete data that can correct for, and even leverage, the existence of these processes.
We have found that explicitly modeling evolutionary processes increases discriminatory power and results in well-calibrated estimates of one minus positive predictive value. We have implemented methods for fitting these models to data and a tool for visualizing the results of the analysis. These tools are available on the internet.

Neither the undirected joint nor the conditional model outperformed the other on all real data sets, suggesting that both models should be considered when analyzing new data. Nonetheless, the conditional model better fit most of the real data that we analyzed. The conditional model better described the effects of immune pressure on HIV evolution, and perhaps more surprisingly, it better described the correlation between HIV-1 p6 amino-acid pairs. This observation may be due to the rapid evolution of HIV and positive selection pressure from the immune response in conjunction with compensatory mutations in the observed patients.

Importantly, the conditional model performed well even in synthetic cases generated by the undirected joint model, suggesting that the conditional model may be a good approximation to the undirected joint model. This will be an important consideration in scaling up to multiple interactions, as the number of parameters in a generalized n-variable joint evolution model will be n + 2^n − 1, whereas the number of parameters in comparable conditional models will be 2 + n. Developing such conditional models is the main future work of this dissertation proposal.

Chapter 6

EVALUATION OF MULTIVARIATE MODELS

We now turn to an evaluation of the two multivariate models, described in section 4.2, using synthetic data as an analysis tool. Although we go to great lengths to ensure that the synthetic data is a reasonable reflection of real data, it is never surprising when the model that generates the data is the model that performs best on that data.
Nevertheless, as in the previous chapter, we can use synthetic data to compare the expressiveness of two models. In this case, we show that the Noisy Add model is superior to the Decision Tree model, because even when the data are known to follow the assumption of the Decision Tree, Noisy Add still performs as well as Decision Tree (the reverse is not true). Furthermore, we have already argued that Noisy Add is conceptually more appropriate than other models. But even assuming the conceptual superiority of the model, it remains to be seen whether there is any practical advantage to the model. As we will see, there is indeed a profound advantage.

6.1 Technical Details

6.1.1 Computing q-values

The asymptotic conservative guarantee of Equation (4.1) (see section 4.3) requires a conservative estimate of Equation (4.2), which requires a valid (or stochastically conservative) p-value. In order to achieve a valid p-value, all model assumptions must be reasonably met. In particular, all sources of confounding must be accounted for. In principle, our multivariate models can account for these sources, provided the input phylogeny is reasonable and all other sources of confounding are provided as predictor