Phylogenetic dependency networks: Inferring patterns of adaptation in HIV
Figure B.3: The advantage of using $\hat\pi_0(\alpha)$ computed using the filtering technique over not filtering. Because filtering only affects $\hat\pi_0$, these gains result in proportionally reduced (yet conservative) pFDR estimates. (Panels: Epitope, Resistance, SNPs, Sieve.)

In addition to providing increased power, $\hat\pi_0(\alpha)$ may provide valuable information in cases where a large proportion of tests could not achieve $\alpha$. In such cases, the overall $\pi_0$ may be quite high, but the $\pi_0$ among tests that could achieve $\alpha$ (those that we are interested in) may be much lower. Figure B.3 demonstrates the advantage of using $\hat\pi_0(\alpha)$ over $\hat\pi_0(1)$.

B.4 Numerical results

To explore the applicability of our proposed pFDR estimator, we created a number of Epitope-derived synthetic data sets with different numbers of tables that follow the mixture model assumptions above, allowing for an unequal distribution of marginals as defined in Assumption (B.33) (see the Appendix for details). For each of these data sets, we plotted the estimated $\mathrm{pFDR}(\alpha)$ against the true proportion of false discoveries using $p < \alpha$ as the threshold (Figure B.4).

Figure B.4: Estimated pFDR vs. true false discovery proportion for synthetic data with an increasing number of tables (10k, 35k, 70k) generated from the Epitope data set. Estimates above the dashed line are conservative.

In practice, it is often the case that $\mathrm{pFDR}(\alpha) > \mathrm{pFDR}(\beta)$ for some $\beta > \alpha$. Therefore, there is no reason to choose $\alpha$ as the rejection region, because choosing $\beta$ will result in more rejected tests and a lower proportion of false positives among those rejected tests. For this reason, Storey [208] proposed the q-value, defined to be

$$q(\alpha) \triangleq \min_{\beta \ge \alpha} \mathrm{pFDR}(\beta). \quad (B.46)$$

To demonstrate the power gains of our method in practice, we conclude by comparing the number of significant results for each of our example data sets as a function of the q-value threshold (Figure B.5).
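The running minimum in Equation (B.46) is straightforward to compute over a grid of candidate thresholds. A minimal sketch (the `pfdr` values are assumed to be precomputed pFDR estimates at each threshold; the grid itself is illustrative):

```python
def q_values(alphas, pfdr):
    """q(alpha) = min over beta >= alpha of pFDR(beta), Equation (B.46),
    evaluated on a grid of candidate thresholds.

    alphas must be sorted in increasing order; pfdr[i] is the pFDR
    estimate at threshold alphas[i]."""
    assert all(a <= b for a, b in zip(alphas, alphas[1:]))
    q = list(pfdr)
    # Sweep from the largest threshold down, keeping a running minimum:
    # each q-value is the smallest pFDR at any threshold >= its alpha.
    for i in range(len(q) - 2, -1, -1):
        q[i] = min(q[i], q[i + 1])
    return q
```

By construction the resulting q-values are non-decreasing in $\alpha$, so rejecting at a q-value threshold never discards a test that a looser threshold would have accepted.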
As can be seen, our conservative estimates result in a substantial increase in the number of tests called significant at a variety of thresholds.

Figure B.5: Plotting the portion of rejected cases vs. q-values for the real data sets (A: Resistance, B: Epitope, C: Sieve, D: SNPs). The solid line is the proposed method for discrete data and the dotted line is the S&T method using marginal p-values.

B.5 Creating synthetic data sets

In real data sets, the properties of the data that we estimate through our pFDR computation, such as $\pi_0$ or the true pFDR, are of course unknown. In such cases, synthetic data sets that allow manipulation of these properties can provide insight into the usefulness of various estimators. It is important, however, to create synthetic data sets that are as close as possible to the real data. In this section we explain the procedure that was used to create the synthetic data sets.

To simulate the real data, we used only marginals that were observed. Given such marginals, we first decide whether the synthetic table that we create will be null or alternative. For example, if we are interested in fixing $\pi_0$, we can thus ensure that a $\pi_0$ fraction of the tables that we create are nulls.

B.5.1 Creating null and alternative tables from given marginals

To create a null table given a set of marginals $\theta = \{\theta_X, \theta_{\bar X}, \theta_Y, \theta_{\bar Y}\}$, we simulate $n$ tests where each test has a result $\{X, Y\}$ such that $X$ is independent of $Y$. For each such test we select the result $X \in \{1, 0\}$ following $\Pr\{X = 1 \mid H_0\} = \theta_X / n$, and select $Y \in \{1, 0\}$ following $\Pr\{Y = 1 \mid H_0\} = \theta_Y / n$.
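The null-table procedure above can be sketched as follows. The table orientation (rows indexed by $X$, columns by $Y$) and the fixed seed are illustrative choices, not the implementation used in this work:

```python
import random

def simulate_null_table(theta_X, theta_Y, n, seed=0):
    """Simulate one 2x2 contingency table under the null hypothesis:
    X and Y are drawn independently for each of n observations, with
    Pr{X=1} = theta_X/n and Pr{Y=1} = theta_Y/n, so the expected
    marginals match the given ones.

    Returns counts [[n11, n10], [n01, n00]] (rows: X=1, X=0)."""
    rng = random.Random(seed)
    table = [[0, 0], [0, 0]]
    for _ in range(n):
        x = 1 if rng.random() < theta_X / n else 0
        y = 1 if rng.random() < theta_Y / n else 0  # independent of x
        table[1 - x][1 - y] += 1
    return table
```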
To create an alternative table, in which $X$ and $Y$ are not independent, we simulate tests by first selecting $X$ using the same procedure as above, then selecting $Y \mid X$ using the following distribution: $\Pr\{Y = 1 \mid H_1, X = 1\} = a / \theta_X$ and $\Pr\{Y = 1 \mid H_1, X = 0\} = c / \theta_{\bar X}$.

B.5.2 Selecting marginals

We have created two different types of data sets: one where all the marginals come from the same distribution, and one where the marginals' distribution depends on whether the table is null or alternative.

In the case of a single distribution of marginals, we divided the observed marginals into 10 exponential bins $[1, 1/2], (1/2, 1/4], (1/4, 1/8], \ldots$ and placed each marginal $\theta$ into a bin according to $\min\{\theta_X, \theta_{\bar X}, \theta_Y, \theta_{\bar Y}\} / \max\{\theta_X, \theta_{\bar X}, \theta_Y, \theta_{\bar Y}\}$. We then choose a bin uniformly, and select a set of marginals uniformly from the bin. We then designate the selected marginal as null with probability $\pi_0$ and generate the table accordingly. This approach biases us towards choosing marginals that permit lower p-values, which enables us to generate interesting alternative tables, even when we force $\pi_0$ to be much lower than it is in the real data.

When the distribution of marginals depends on whether the table is null or alternative, we draw $\theta$ from bin $b \in \{1, \ldots, 10\}$ with probability proportional to $1/2^{10-b-1}$ for a null table and proportional to $1/2^{b}$ for an alternative table.

B.6 Proofs and Remarks

In this appendix, we formalize the theoretical results from the main paper. For brevity, we will write $H_0$ to mean the event $H = 0$ and $H_1$ to mean the event $H = 1$.

Lemma 1. Given $m$ tests, in which the p-values are IID and distributed according to the mixture Equation (B.14), the $H$ are IID Bernoulli random variables, and for each test $i$, $\theta_i$ is independent of $H_i$,

$$\mathbb{E}\left[\frac{1}{m} \sum_{i=1}^m \Pr\{P \le \alpha \mid H_0, \theta_i\}\right] = \Pr\{P \le \alpha \mid H_0\}. \quad (B.47)$$

Proof of Lemma 1.
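The alternative-table construction can be sketched analogously. Here $a$ and $c$ are interpreted as target counts for the $(X{=}1, Y{=}1)$ and $(X{=}0, Y{=}1)$ cells, and $\theta_{\bar X} = n - \theta_X$ is assumed; the text does not define these explicitly, so this is an illustrative reading:

```python
import random

def simulate_alt_table(theta_X, a, c, n, seed=0):
    """Simulate one 2x2 table under the alternative: X is drawn as in
    the null case, but Y depends on X via Pr{Y=1|X=1} = a/theta_X and
    Pr{Y=1|X=0} = c/(n - theta_X).

    Returns counts [[n11, n10], [n01, n00]] (rows: X=1, X=0)."""
    rng = random.Random(seed)
    table = [[0, 0], [0, 0]]
    for _ in range(n):
        x = 1 if rng.random() < theta_X / n else 0
        # Y's distribution now depends on the realized X.
        p_y = a / theta_X if x == 1 else c / (n - theta_X)
        y = 1 if rng.random() < p_y else 0
        table[1 - x][1 - y] += 1
    return table
```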
Because $\theta$ is independent of $H$, we can write

$$\Pr\{P \le \alpha \mid H_0\} = \sum_\theta \Pr\{P \le \alpha \mid H_0, \theta\} \cdot \Pr\{\theta \mid H_0\} \quad (B.48)$$
$$= \sum_\theta \Pr\{P \le \alpha \mid H_0, \theta\} \cdot \Pr\{\theta\}, \quad (B.49)$$

where the summation is over all possible marginals. Furthermore,

$$\mathbb{E}\left[\frac{1}{m} \sum_{j=1}^m \mathbf{1}\{\theta_j = \theta\}\right] = \Pr\{\theta\}. \quad (B.50)$$

Thus,

$$\Pr\{P \le \alpha \mid H_0\} = \sum_\theta \Pr\{P \le \alpha \mid H_0, \theta\} \, \mathbb{E}\left[\frac{1}{m} \sum_{j=1}^m \mathbf{1}\{\theta_j = \theta\}\right] \quad (B.51)$$
$$= \mathbb{E}\left[\frac{1}{m} \sum_{j=1}^m \sum_\theta \Pr\{P \le \alpha \mid H_0, \theta\} \cdot \mathbf{1}\{\theta_j = \theta\}\right] \quad (B.52)$$
$$= \mathbb{E}\left[\frac{1}{m} \sum_{i=1}^m \Pr\{P \le \alpha \mid H_0, \theta_i\}\right].$$

Lemma 2. Let $\rho(\cdot)$ be any non-negative function. Then, under the assumptions of Lemma 1,

$$\mathbb{E}\left[\frac{\sum_{i=1}^m \sum_p \rho(p) \cdot \mathbf{1}\{p_i = p\}}{\sum_{i=1}^m \sum_p \rho(p) \cdot \Pr\{P = p \mid H_0, \theta_i\}}\right] \ge \pi_0. \quad (B.53)$$

Proof of Lemma 2. Recall that

$$\Pr\{P = p\} = \pi_0 \cdot \Pr\{P = p \mid H_0\} + \pi_1 \cdot \Pr\{P = p \mid H_1\}, \quad (B.54)$$

where $\pi_1 = \Pr\{H_1\} = 1 - \pi_0$. Thus, it follows that

$$\Pr\{P = p\} \ge \pi_0 \cdot \Pr\{P = p \mid H_0\} \quad (B.55)$$

and

$$\sum_p \rho(p) \Pr\{P = p\} \ge \pi_0 \sum_p \rho(p) \Pr\{P = p \mid H_0\} \quad (B.56)$$

for any non-negative function $\rho(\cdot)$. Thus, it follows that

$$\pi_0 \le \frac{\sum_p \rho(p) \Pr\{P = p\}}{\sum_p \rho(p) \Pr\{P = p \mid H_0\}}. \quad (B.57)$$

It follows analogously to the proof of Lemma 1 that

$$\Pr\{P = p \mid H_0\} = \frac{1}{m} \mathbb{E}\left[\sum_{i=1}^m \Pr\{P = p \mid H_0, \theta_i\}\right] \quad (B.58)$$

and

$$\Pr\{P = p\} = \frac{1}{m} \mathbb{E}\left[\sum_{i=1}^m \mathbf{1}\{p_i = p\}\right]. \quad (B.59)$$

Thus, it follows that

$$\frac{\sum_p \rho(p) \Pr\{P = p\}}{\sum_p \rho(p) \Pr\{P = p \mid H_0\}} = \frac{\sum_p \rho(p) \frac{1}{m} \mathbb{E}\left[\sum_{i=1}^m \mathbf{1}\{p_i = p\}\right]}{\sum_p \rho(p) \frac{1}{m} \mathbb{E}\left[\sum_{i=1}^m \Pr\{P = p \mid H_0, \theta_i\}\right]} \quad (B.60)$$
$$= \frac{\mathbb{E}\left[\sum_{i=1}^m \sum_p \rho(p) \mathbf{1}\{p_i = p\}\right]}{\mathbb{E}\left[\sum_{i=1}^m \sum_p \rho(p) \Pr\{P = p \mid H_0, \theta_i\}\right]}. \quad (B.61)$$

Because $\sum_p \rho(p) \Pr\{P = p\}$ is a linearly increasing function of $\sum_p \rho(p) \Pr\{P = p \mid H_0\}$, it follows from Jensen's inequality that

$$\frac{\mathbb{E}\left[\sum_{i=1}^m \sum_p \rho(p) \mathbf{1}\{p_i = p\}\right]}{\mathbb{E}\left[\sum_{i=1}^m \sum_p \rho(p) \Pr\{P = p \mid H_0, \theta_i\}\right]} \le \mathbb{E}\left[\frac{\sum_{i=1}^m \sum_p \rho(p) \mathbf{1}\{p_i = p\}}{\sum_{i=1}^m \sum_p \rho(p) \Pr\{P = p \mid H_0, \theta_i\}}\right]. \quad (B.62)$$

Thus, $\mathbb{E}[\hat\pi_0] \ge \pi_0$.

Remark 1. Storey [208, 209] argued that, for continuous statistics, we would expect most of the observations with $p$ close to 1 to be true nulls, and thus a natural estimate for $\pi_0$ is

$$\hat\pi_0(\lambda) = \frac{\#\{p_i > \lambda\}}{(1 - \lambda) m} \quad (B.63)$$

for some tuning parameter $0 \le \lambda < 1$.
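Equation (B.63) is a one-line computation. A minimal sketch, assuming only a flat list of observed p-values:

```python
def storey_pi0(p_values, lam=0.5):
    """Storey's pi0 estimate, Equation (B.63): the fraction of p-values
    above the tuning parameter lambda, scaled by the null probability
    of exceeding lambda (1 - lambda under a continuous uniform null)."""
    m = len(p_values)
    return sum(1 for p in p_values if p > lam) / ((1 - lam) * m)
```

Note that for discrete statistics the estimator can exceed 1, since $\Pr\{p_i > \lambda \mid H_0\}$ may be larger than $1 - \lambda$.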
This procedure assumes a continuous underlying distribution, such that $(1 - \lambda) = \Pr\{p_i > \lambda \mid H_i = 0\}$ for all $i$. It can be shown that Equation (B.27) is a special case of Equation (B.26) in which

$$\rho(p) = \begin{cases} 0 & \text{if } p \le \lambda, \\ 1 & \text{otherwise.} \end{cases} \quad (B.64)$$

Proof.

$$\hat\pi_0 = \frac{\sum_p \sum_{i=1}^m \rho(p) \, \mathbf{1}\{p_i = p\}}{\sum_p \sum_{i=1}^m \rho(p) \Pr\{P = p \mid H_0, \theta_i\}} \quad (B.65)$$
$$= \frac{\sum_{i=1}^m \sum_{p > \lambda} \mathbf{1}\{p_i = p\}}{\sum_{i=1}^m \sum_{p > \lambda} \Pr\{P = p \mid H_0, \theta_i\}} \quad (B.66)$$
$$= \frac{\#\{p_i > \lambda\}}{\sum_{i=1}^m \Pr\{P > \lambda \mid H_0, \theta_i\}} \quad (B.67)$$
$$= \frac{\#\{p_i > \lambda\}}{m \cdot \widehat{\Pr}\{P > \lambda \mid H_0\}}. \quad (B.68)$$

For discrete statistics,

$$\Pr\{p_i > \lambda \mid H_0\} \ge (1 - \lambda); \quad (B.69)$$

thus, it follows that

$$\hat\pi_0 \le \frac{\#\{p_i > \lambda\}}{(1 - \lambda) m}, \quad (B.70)$$

making it a tighter estimate of $\pi_0$ than we get when assuming the statistics are continuous.

Remark 2. In an argument similar to that which led to Equation (B.26), Pounds and Cheng [183] pointed out that

$$\pi_0 \le \frac{\mathbb{E}[P]}{\mathbb{E}[P \mid H_0]}. \quad (B.71)$$

Assuming

$$\mathbb{E}[P] = \bar p \triangleq \frac{1}{m} \sum_{i=1}^m p_i \quad (B.72)$$

and $\mathbb{E}[P \mid H_0] \ge \frac{1}{2}$, Pounds and Cheng suggest defining

$$\hat\pi_0 \triangleq 2 \bar p. \quad (B.73)$$

It turns out that this Pounds-Cheng approach is a special case of Equation (B.26), with a conservative approximation for $\mathbb{E}[P \mid H_0]$.

Proof. Let $\rho(p) = p$. Then

$$\hat\pi_0 = \frac{\sum_{i=1}^m \sum_p p \cdot \mathbf{1}\{p_i = p\}}{\sum_{i=1}^m \sum_p p \cdot \Pr\{P = p \mid H_0, \theta_i\}} \quad (B.74)$$
$$= \frac{\frac{1}{m} \sum_{i=1}^m p_i}{\frac{1}{m} \sum_{i=1}^m \mathbb{E}[P \mid H_0, \theta_i]} \quad (B.75)$$
$$= \frac{\bar p}{\sum_\theta \mathbb{E}[P \mid H_0, \theta] \cdot \widehat{\Pr}\{\theta\}} \quad (B.76)$$
$$= \frac{\bar p}{\widehat{\mathbb{E}}[P \mid H_0]} \quad (B.77)$$
$$\le \frac{\bar p}{0.5}, \quad (B.78)$$

where $\widehat{\mathbb{E}}[P \mid H_0]$ is our unbiased estimate of $\mathbb{E}[P \mid H_0]$ and we define $\widehat{\Pr}\{\theta\}$ as in Equation (B.50).

Lemma 3. Under the assumptions of Lemma 2,

$$\lim_{m\to\infty} \hat\pi_0 \stackrel{a.s.}{=} \pi_0 + \pi_1 \frac{\mathbb{E}[\rho(p) \mid H_1]}{\mathbb{E}[\rho(p) \mid H_0]}. \quad (B.79)$$

Proof. By the strong law of large numbers, Equations (B.58) and (B.59) imply that $\widehat{\Pr}\{P = p \mid H_0\}$ converges almost surely to $\Pr\{P = p \mid H_0\}$ and $\widehat{\Pr}\{P = p\}$ converges almost surely to $\Pr\{P = p\}$. Thus, it follows from Equation (B.60) that

$$\lim_{m\to\infty} \hat\pi_0 \stackrel{a.s.}{=} \frac{\sum_p \rho(p) \Pr\{P = p\}}{\sum_p \rho(p) \Pr\{P = p \mid H_0\}}. \quad (B.80)$$

Furthermore, it follows from Equation (B.22) that

$$\frac{\sum_p \rho(p) \Pr\{P = p\}}{\sum_p \rho(p) \Pr\{P = p \mid H_0\}} = \pi_0 + \pi_1 \cdot \frac{\sum_p \rho(p) \Pr\{P = p \mid H_1\}}{\sum_p \rho(p) \Pr\{P = p \mid H_0\}}. \quad (B.81)$$

Thus,

$$\lim_{m\to\infty} \hat\pi_0 \stackrel{a.s.}{=} \pi_0 + \pi_1 \cdot \frac{\sum_p \rho(p) \Pr\{P = p \mid H_1\}}{\sum_p \rho(p) \Pr\{P = p \mid H_0\}} \quad (B.82)$$
$$= \pi_0 + \pi_1 \cdot \frac{\mathbb{E}[\rho(p) \mid H_1]}{\mathbb{E}[\rho(p) \mid H_0]}.$$

Proof of Theorem 1.

$$\lim_{m\to\infty} \widehat{\mathrm{pFDR}}(\alpha) = \lim_{m\to\infty} \frac{\hat\pi_0 \cdot \widehat{\Pr}\{P \le \alpha \mid H_0\}}{\widehat{\Pr}\{P \le \alpha\} \cdot \widehat{\Pr}\{R(\alpha) > 0\}} \quad (B.83)$$
$$= \frac{\lim_{m\to\infty} \hat\pi_0 \cdot \lim_{m\to\infty} \widehat{\Pr}\{P \le \alpha \mid H_0\}}{\lim_{m\to\infty} \widehat{\Pr}\{P \le \alpha\} \cdot \lim_{m\to\infty} \widehat{\Pr}\{R(\alpha) > 0\}}. \quad (B.84)$$

By the strong law of large numbers, Lemma 1 implies that $\widehat{\Pr}\{P \le \alpha \mid H_0\}$ converges almost surely to $\Pr\{P \le \alpha \mid H_0\}$, Equation (B.16) implies that $\widehat{\Pr}\{P \le \alpha\}$ converges almost surely to $\Pr\{P \le \alpha\}$, and $\widehat{\Pr}\{R(\alpha) > 0\}$ converges almost surely to 1. Thus,

$$\lim_{m\to\infty} \widehat{\mathrm{pFDR}}(\alpha) \stackrel{a.s.}{=} \lim_{m\to\infty} \hat\pi_0 \cdot \frac{\Pr\{P \le \alpha \mid H_0\}}{\Pr\{P \le \alpha\}} \quad (B.85)$$
$$= \frac{\lim_{m\to\infty} \hat\pi_0}{\pi_0} \cdot \mathrm{pFDR}(\alpha). \quad (B.86)$$

Finally, it follows from Lemma 3 that

$$\lim_{m\to\infty} \widehat{\mathrm{pFDR}}(\alpha) \stackrel{a.s.}{=} \frac{\pi_0 + \pi_1 \frac{\mathbb{E}[\rho(p) \mid H_1]}{\mathbb{E}[\rho(p) \mid H_0]}}{\pi_0} \cdot \mathrm{pFDR}(\alpha). \quad (B.87)$$

Proof of Theorem 2. The proof follows analogously to that of Theorem 1 by noting that the present assumptions lead to

$$\lim_{m\to\infty} \widehat{\Pr}\{P \le \alpha\} \stackrel{a.s.}{=} \Pr\{P \le \alpha\}, \quad (B.88)$$
$$\lim_{m\to\infty} \widehat{\Pr}\{P \le \alpha \mid H_0\} \stackrel{a.s.}{\ge} \Pr\{P \le \alpha \mid H_0\}, \quad (B.89)$$
$$\lim_{m\to\infty} \hat\pi_0 \stackrel{a.s.}{\ge} \pi_0 + \pi_1 \cdot \frac{\mathbb{E}[\rho(p) \mid H_1]}{\mathbb{E}[\rho(p) \mid H_0]}. \quad (B.90)$$

We shall prove each of these statements in turn. Equation (B.88) follows immediately by noting that our estimate $\widehat{\Pr}\{P \le \alpha\} \triangleq \frac{R \vee 1}{m}$ is not affected by the distribution of $\theta$. Equation (B.89) can be seen by noting that we can no longer use the equality in Equation (B.49) and must instead use

$$\Pr\{P \le \alpha \mid H_0\} = \sum_{\theta'} \Pr\{P \le \alpha \mid H_0, \theta'\} \cdot \Pr\{\theta' \mid H_0\}. \quad (B.91)$$

Thus, we have

$$\lim_{m\to\infty} \widehat{\Pr}\{P \le \alpha \mid H_0\} \quad (B.92)$$
$$\stackrel{a.s.}{=} \sum_{\theta'} \Pr\{P \le \alpha \mid H_0, \theta'\} \cdot \Pr\{\theta'\} \quad (B.93)$$
$$= \sum_{\theta'} \Pr\{P \le \alpha \mid H_0, \theta'\} \cdot \left( \Pr\{\theta' \mid H_0\} \cdot \pi_0 + \Pr\{\theta' \mid H_1\} \cdot \pi_1 \right) \quad (B.94)$$
$$= \pi_0 \sum_{\theta'} \Pr\{P \le \alpha \mid H_0, \theta'\} \cdot \Pr\{\theta' \mid H_0\} + \pi_1 \sum_{\theta'} \Pr\{P \le \alpha \mid H_0, \theta'\} \cdot \Pr\{\theta' \mid H_1\} \quad (B.95)$$
$$\ge \pi_0 \sum_{\theta'} \Pr\{P \le \alpha \mid H_0, \theta'\} \cdot \Pr\{\theta' \mid H_0\} + \pi_1 \sum_{\theta'} \Pr\{P \le \alpha \mid H_0, \theta'\} \cdot \Pr\{\theta' \mid H_0\} \quad (B.96)$$
$$= \Pr\{P \le \alpha \mid H_0\}, \quad (B.97)$$

where the inequality follows from Assumption (B.33).

Finally, Inequality (B.90) follows from the fact that the added assumptions of Theorem 2 only affect the denominator of our $\pi_0$ estimate, Equation (B.26). Furthermore, Inequality (B.89) implies

$$\lim_{m\to\infty} \widehat{\Pr}\{P \ge \alpha \mid H_0\} \stackrel{a.s.}{\le} \Pr\{P \ge \alpha \mid H_0\}, \quad (B.98)$$

from which it follows that

$$\lim_{m\to\infty} \sum_p \rho(p) \cdot \widehat{\Pr}\{P = p \mid H_0\} \stackrel{a.s.}{\le} \sum_p \rho(p) \cdot \Pr\{P = p \mid H_0\} \quad (B.99)$$

for any non-decreasing function $\rho(p)$. Thus, it follows that

$$\lim_{m\to\infty} \hat\pi_0 \stackrel{a.s.}{\ge} \frac{\mathbb{E}[\rho(p)]}{\mathbb{E}[\rho(p) \mid H_0]} \quad (B.100)$$
$$= \pi_0 + \pi_1 \cdot \frac{\mathbb{E}[\rho(p) \mid H_1]}{\mathbb{E}[\rho(p) \mid H_0]}.$$

Lemma 4. Under the assumptions of Theorem 3, if

$$\frac{\Pr\{P \le \alpha \mid H_1\}}{\Pr\{P \le \alpha \mid H_0\}} \quad (B.101)$$

is non-increasing in $\alpha$, then

$$\lim_{m\to\infty} \widehat{\mathrm{pFDR}}^*(\alpha) \stackrel{a.s.}{\le} \lim_{m\to\infty} \widehat{\mathrm{pFDR}}(\alpha). \quad (B.102)$$

Proof. Recall our large-sample estimate

$$\widehat{\mathrm{pFDR}}(\alpha) = \hat\pi_0 \cdot \frac{\widehat{\Pr}\{P \le \alpha \mid H_0\}}{\widehat{\Pr}\{P \le \alpha\}} \quad (B.103)$$
$$= \hat\pi_0 \cdot \frac{m \cdot \frac{1}{m} \sum_{i=1}^m \Pr\{P \le \alpha \mid H_0, \theta_i\}}{R(\alpha) \vee 1} \quad (B.104)$$
$$= \hat\pi_0 \cdot \frac{\sum_{i=1}^m \Pr\{P \le \alpha \mid H_0, \theta_i\}}{R(\alpha) \vee 1}. \quad (B.105)$$

Removing $n$ tests with $p^*(\theta) > \alpha$ will have no effect on $(R(\alpha) \vee 1)$ or on $\sum_{i=1}^m \Pr\{P \le \alpha \mid H_0, \theta_i\}$. We will show, however, that, under the present assumptions, our $\pi_0$ estimate under filtering will almost surely be lower than our $\pi_0$ estimate without filtering.

Let $p^+$ denote the event $p^*(\theta) > \alpha$ and $p^-$ denote the event $p^*(\theta) \le \alpha$. From Equation (B.81), we can write

$$\lim_{m\to\infty} \hat\pi_0 \stackrel{a.s.}{=} \pi_0 + \pi_1 \cdot \frac{\mathbb{E}[\rho(p) \mid H_1]}{\mathbb{E}[\rho(p) \mid H_0]} = \pi_0 + \pi_1 \cdot \frac{\mathbb{E}[\rho(p) \mid H_1, p^+] \Pr\{p^+\} + \mathbb{E}[\rho(p) \mid H_1, p^-] \Pr\{p^-\}}{\mathbb{E}[\rho(p) \mid H_0, p^+] \Pr\{p^+\} + \mathbb{E}[\rho(p) \mid H_0, p^-] \Pr\{p^-\}}. \quad (B.106)$$

Let

$$\hat\pi_0(\alpha) \triangleq \frac{\mathbb{E}[\rho(p) \mid p^-]}{\mathbb{E}[\rho(p) \mid H_0, p^-]} \quad (B.107)$$

be the estimated $\pi_0$ over $T^-_\alpha$. We wish to show that

$$\lim_{m\to\infty} \hat\pi_0(\alpha) \le \lim_{m\to\infty} \hat\pi_0(1), \quad (B.108)$$

which, by Equation (B.106), is true if and only if

$$\frac{\mathbb{E}[\rho(p) \mid H_1, p^+] \Pr\{p^+\} + \mathbb{E}[\rho(p) \mid H_1, p^-] \Pr\{p^-\}}{\mathbb{E}[\rho(p) \mid H_0, p^+] \Pr\{p^+\} + \mathbb{E}[\rho(p) \mid H_0, p^-] \Pr\{p^-\}} \ge \frac{\mathbb{E}[\rho(p) \mid H_1, p^-]}{\mathbb{E}[\rho(p) \mid H_0, p^-]}. \quad (B.109)$$

Thus, it follows that Equation (B.108) is true if and only if

$$\frac{\mathbb{E}[\rho(p) \mid H_1, p^+]}{\mathbb{E}[\rho(p) \mid H_0, p^+]} \ge \frac{\mathbb{E}[\rho(p) \mid H_1, p^-]}{\mathbb{E}[\rho(p) \mid H_0, p^-]}. \quad (B.110)$$

Now Assumption (B.44) implies that

$$\frac{\Pr\{P > \alpha \mid H_1, p^+\}}{\Pr\{P > \alpha \mid H_0, p^+\}} \ge \frac{\Pr\{P > \alpha \mid H_1, p^-\}}{\Pr\{P > \alpha \mid H_0, p^-\}}, \quad (B.111)$$

from which Inequality (B.110), and hence Lemma 4, follows from the constraint that $\rho(\cdot)$ is non-decreasing.

B.7 Discussion

The false discovery rate has proven to be an extremely useful tool when testing large numbers of hypotheses, as it allows the researcher to balance the number of significant results against an estimate of the proportion of those results that are truly null. Storey presented novel methods for estimating pFDR and q-values for general test statistics [208, 209]. He factored the pFDR computation into several components and suggested estimators for each component. Perhaps the most discussed component is $\pi_0$, the proportion of tests that are expected to be null over the entire data set. For example, Dalmasso and colleagues [42] derived a class of $\pi_0$ estimators for continuous distributions that take the same form as Equation (B.26) and explored properties of $\rho(\cdot)$. They proved that a certain class of convex $\rho(\cdot)$ functions yielded provably less biased $\pi_0$ estimators than $\rho(p) = p$.
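The $\rho$-parameterized estimator family of Equation (B.26) can be sketched generically. Here `null_pmfs` is an assumed input giving, for each test, the exact null distribution of its p-value over that test's marginals:

```python
def pi0_hat(p_values, null_pmfs, rho):
    """Generic pi0 estimator in the form of Equation (B.26):
    sum_i sum_p rho(p) 1{p_i = p}  over  sum_i sum_p rho(p) Pr{P=p|H0, theta_i}.

    null_pmfs[i] is a list of (p, prob) pairs giving the exact null
    p-value distribution for test i's marginals; rho is a non-negative
    weight function on p-values."""
    # The numerator's double sum collapses to sum_i rho(p_i).
    num = sum(rho(p) for p in p_values)
    den = sum(rho(p) * prob for pmf in null_pmfs for p, prob in pmf)
    return num / den
```

With `rho = lambda p: p` this yields the discrete analogue of the Pounds-Cheng estimator; an indicator weight recovers the Storey-style special case of Equation (B.64).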
Similarly, Genovese and Wasserman [78] explore several estimators under a mixture model framework that assumes a uniform continuous null distribution and provide estimates of confidence intervals, and Langaas and colleagues [126] use the mixture model to define $\pi_0$ estimators that perform particularly well under certain continuous convexity assumptions.

When the data are finite, however, some of the underlying assumptions used by the above methods, such as the uniform distribution of p-values under the null and the convexity and monotone distribution of p-values under the alternative, are violated [183]. In such cases, some of the methods developed for general statistics become overly conservative, and some may provide anti-conservative estimates. For example, the estimators of Dalmasso et al. [42] assume that the null distribution is non-increasing in $p$. As we have seen, contingency tables provide a common example where these assumptions are grossly violated, even when the number of observations in each table is quite high. In these cases, the use of marginal p-values leads to severe conservative bias in the FDR estimation.

Pounds and Cheng [183] addressed the conservative bias of FDR estimation on finite data by proposing a new $\pi_0$ estimator. This estimator avoids the extreme conservative bias of Storey's spline-fitting method on finite data, in which $\pi_0$ estimates at $\lambda = 1$ may have more bias rather than less. On our data sets, the method of Pounds and Cheng was comparable to Storey's estimator at $\lambda = 0.5$. A key assumption in the method of Pounds and Cheng is that the expected p-value under the null hypothesis is 0.5, which was grossly violated in all of our contingency table data sets. Replacing this assumption with the exact null distribution substantially decreased the bias in all our tests.
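The exact-null correction just described can be illustrated for Fisher's exact test, where the null p-value distribution for fixed margins follows from the hypergeometric distribution. This is a sketch under the two-sided ("no more likely") p-value convention; the function names are illustrative:

```python
from math import comb

def fisher_null_dist(r, c, n):
    """Exact null distribution of the two-sided Fisher exact test
    p-value for 2x2 tables with row margin r, column margin c, total n:
    under H0 the top-left count k is hypergeometric. Returns a list of
    (p_value, probability) pairs, one per attainable k."""
    lo, hi = max(0, r + c - n), min(r, c)
    probs = {k: comb(c, k) * comb(n - c, r - k) / comb(n, r)
             for k in range(lo, hi + 1)}
    # Two-sided p-value of outcome k: total probability of all outcomes
    # no more likely than k (with a small tolerance for float ties).
    pval = {k: sum(q for q in probs.values() if q <= probs[k] + 1e-12)
            for k in probs}
    return [(pval[k], probs[k]) for k in probs]

def pi0_pounds_cheng_exact(p_bar, null_dist):
    """Pounds-Cheng-style estimate with the uniform-null assumption
    E[P|H0] = 1/2 replaced by the exact null mean of the p-value."""
    e_p_null = sum(p * q for p, q in null_dist)
    return p_bar / e_p_null
```

For margins (2, 2, 4), for instance, the exact null mean of the p-value is $7/9$, well above the $1/2$ that the uniform-null assumption would use, so dividing by the exact mean rather than doubling $\bar p$ gives a visibly smaller (less conservatively biased) $\pi_0$ estimate.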
Our theoretical results indicate that the optimal $\rho(\cdot)$ is that which minimizes the ratio of the expected $\rho(\cdot)$ under the alternative hypothesis to the expected $\rho(\cdot)$ under the null hypothesis. Other $\rho(\cdot)$ functions than those described here may thus yield less biased estimates.

Several authors have proposed randomization testing as a means of dealing with non-uniform or unknown p-value distributions, with a focus on non-uniform continuous distributions (see [36] for a review). Focusing on Fisher's exact test allows us to implement exact permutation tests efficiently even for very large data sets, resulting in exact estimation of the pooled null distribution, a straightforward analysis of the convergence properties, and the removal of numerical error from the estimation.

Furthermore, the exact null distribution allows us to identify and remove tests that cannot be called significant, thereby increasing power. This approach was first proposed by Gilbert [79], who suggested choosing a p-value threshold $p_0$ and removing a priori all tests for which no permutation of the contingency table results in $p \le p_0$. To choose $p_0$, Gilbert suggested using a derivative of the Bonferroni-adjusted p-value. Unfortunately, it can be shown that this threshold is too aggressive and will often remove tests that should be considered significant. In contrast, choosing $p_0 = \alpha$ leaves the true pFDR unchanged while often achieving an increase in statistical power.

This paper provides estimators for the various components of the pFDR, based on a permutation testing approach. We combine several ideas that were previously suggested, adapting them to the important case of contingency tables. As we have shown above, our methods can rapidly provide tight estimates of pFDR and q-values for very large data sets. Although we have chosen to focus on Fisher's exact test, analogous results can be derived for any discrete test for which all permutations of the data can be efficiently computed.
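The filtering idea discussed above hinges on the minimum attainable p-value for a set of margins. A sketch of that computation and of the $p_0 = \alpha$ filter, again under the two-sided Fisher p-value convention (function names illustrative):

```python
from math import comb

def min_attainable_p(r, c, n):
    """Smallest two-sided Fisher exact p-value attainable by any 2x2
    table with row margin r, column margin c, and total n -- a sketch
    of the filtering threshold p*(theta)."""
    lo, hi = max(0, r + c - n), min(r, c)
    probs = [comb(c, k) * comb(n - c, r - k) / comb(n, r)
             for k in range(lo, hi + 1)]
    pmin = min(probs)
    # The most extreme outcome's p-value: the mass of all outcomes no
    # more likely than it (with a float-tie tolerance).
    return sum(q for q in probs if q <= pmin + 1e-12)

def filter_tests(margins, alpha):
    """Keep only tests whose margins can possibly achieve p <= alpha,
    i.e. the p0 = alpha choice that leaves the true pFDR unchanged."""
    return [(r, c, n) for (r, c, n) in margins
            if min_attainable_p(r, c, n) <= alpha]
```

For example, margins (2, 2, 4) can never yield a p-value below $1/3$, so at $\alpha = 0.05$ such a test is removed a priori, while margins (5, 5, 10) can reach roughly $0.008$ and are retained.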
VITA

Jonathan Carlson was born and raised in Beaverton, Oregon. In 2003, he graduated from Dartmouth, where he met a beautiful girl named Kate, pole vaulted, and batted cleanup for the Fighting Mullets. Although the Mullets made the intramural championships several times, they had a propensity for choking and never came away with a T-shirt. In 2004, Jonathan married Kate, finally said goodbye to Dartmouth, and moved back west, hoping to find a team that could come through in the clutch. In 2006, he signed with the Infrared Sox in the University of Washington co-rec league and with the Fleas in the men's league. He went on to win two T-shirts with the IR Sox, setting team records for home runs and slugging percentage, and one T-shirt with the Fleas. In 2009 he graduated with his Ph.D. in computer science and engineering from the University of Washington. He currently resides in Marina del Rey, California, is a researcher for the eScience group of Microsoft Research, and is a free agent.