Phylogenetic dependency networks: Inferring patterns of adaptation in hiv

bet	11/12
Sana	23.11.2017
Hajmi	4,8 Kb.
	#20686

1 ... 4 5 6 7 8 9 10 11 12

actions of the Royal Society of London B Biological Sciences 255(1342):37–45.
10
,
11
,
13
,
16
,
89

176
[172] Palella FJ, Delaney KM, Moorman AC, Loveless MO, Fuhrer J, Satten GA,
Aschman DJ, Holmberg SD and Investigators THOS (1998). Declining mor-
bidity and mortality among patients with advanced human immunodeﬁciency
virus infection. New England Journal of Medicine 338(13):853–860.
1
[173] Pazos F, Helmer-Citterich M, Ausiello G and Valencia A (1997). Correlated
mutations contain information about protein-protein interaction. Journal of
Molecular Biology 271(4):511–523.
10
[174] Peters HO, Mendoza MG, Capina RE, Luo M, Mao X, Gubbins M, Nagelkerke
NJD, MacArthur I, Sheardown BB, Kimani J, Wachihi C, S. T and Plummer FA
(2008). An integrative bioinformatic approach for studying escape mutations in
human immunodeﬁciency virus type 1 Gag in the pumwani sex worker cohort.
Journal of Virology 82(4):1980–1992.
12
,
114
,
116
[175] Peyerl FW, Barouch DH, Yeh WW, Bazick HS, Kunstman J, Kunstman KJ,
Wolinsky SM and Letvin NL (2003). Simian-human immunodeﬁciency virus
escape from cytotoxic T-lymphocyte recognition at a structurally constrained
epitope. Journal of Virology 77(23):12572.
104
[176] Peyerl FW, Bazick HS, Newberg MH, Barouch DH, Sodroski J and Letvin NL
(2004). Fitness costs limit viral escape from cytotoxic T lymphocytes at a
structurally constrained epitope. Journal of Virology 78(24):13901.
104
[177] Pollock DD and Taylor WR (1997). Eﬀectiveness of correlation analysis in iden-
tifying protein residues undergoing correlated evolution. Protein Engineering
10(6):647–657.
10
[178] Pollock DD, Taylor WR and Goldman N (1999). Coevolving protein residues:
maximum likelihood identiﬁcation and relationship to structure. Journal of
Molecular Biology 287(1):187–198.
11
,
13
,
16
,
17
,
51
,
52
,
65
,
89

177
[179] Poon A and Chao L (2005). The rate of compensatory mutation in the DNA
bacteriophage φX174. Genetics 170(3):989–999.
106
,
108
,
114
[180] Poon A, Davis BH and Chao L (2005). The coupon collector and the suppressor
mutation: estimating the number of compensatory mutations by maximum
likelihood. Genetics 170(3):1323–1332.
[181] Poon AFY, Lewis FI, Pond SLK and Frost SDW (2007). Evolutionary in-
teractions between N-linked glycosylation sites in the HIV-1 envelope. PLoS
Computational Biology 3(1):e11.
11
,
13
,
15
[182] Poon AFY, Lewis FI, Pond SLK and Frost SDW (2007). An evolutionary-
network model reveals stratiﬁed interactions in the V3 loop of the HIV-1 Enve-
lope. PLoS Computational Biology 3(11):e231.
11
,
15
,
16
,
89
,
117
,
129
[183] Pounds S and Cheng C (2006). Robust estimation of the false discovery rate.
Bioinformatics 22(16):1979.
204
,
209
,
221
,
227
[184] Press W, Teukolsky S, Vetterling W and Flannery B (1992). Numerical Recipes
in C. Cambridge University Press, New York, NY.
52
,
53
[185] Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA and Reich D
(2006). Principal components analysis corrects for stratiﬁcation in genome-wide
association studies. Nature Genetics 38(8):904–909.
13
[186] Pritchard JK, Stephens M, Rosenberg NA and Donnelly P (2000). Association
mapping in structured populations. The American Journal of Human Genetics
67(1):170–181.
13
[187] Pritchard L, Bladon P, M O Mitchell J and J Dufton M (2001). Evaluation
of a novel method for the identiﬁcation of coevolving protein residues. Protein
Engineering 14(8):549–555.
89

178
[188] Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller
J, Sklar P, de Bakker PI, Daly MJ and Sham PC (2007). PLINK: a tool set for
whole-genome association and population-based linkage analyses. The American
Journal of Human Genetics 81(3):559–575.
[189] Reiner AP, Ziv E, Lind DL, Nievergelt CM, Schork NJ, Cummings SR, Phong
A, Burchard EG, Harris TB, Psaty BM and Kwok PY (2005). Population
structure, admixture, and aging-related phenotypes in African American adults:
the Cardiovascular Health Study. The American Journal of Human Genetics
76(3):463–477.
[190] Ridley M (1983). The Explanation of Organic Diversity: The Comparative
Method and Adaptations for Mating. Oxford University Press, Oxford.
10
,
13
,
15
,
16
,
30
,
89
[191] Risch NJ (2000). Searching for genetic determinants in the new millennium.
Nature 405(6788):847–856.
[192] Rodrigue N, Lartillot N, Bryant D and Philippe H (2005). Site interdepen-
dence attributed to tertiary structure in amino acid sequence evolution. Gene
347(2):207–217.
[193] Rodriguez F, Harkins S, Slifka MK and Whitton JL (2002). Immunodominance
in virus-induced CD8
+
T-cell responses is dramatically modiﬁed by DNA immu-
nization and is regulated by gamma interferon. Journal of Virology 76(9):4251.
[194] Rolland M, Nickle DC and Mullins JI (2007). HIV-1 group M conserved elements
vaccine. PLoS Pathogens 3(11):e157.
73
[195] Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA
and Feldman MW (2002). Genetic structure of human populations. Science
298(5602):2381–2385.
13

179
[196] Rousseau CM, Daniels MG, Carlson JM, Kadie C, Crawford H, Prendergast
A, Matthews P, Payne R, Rolland M, Raugi DN, Maust BS, Learn GH, Nickle
DC, Coovadia H, Ndung’u T, Frahm N, Brander C, Walker BD, Goulder PJR,
Bhattacharya T, Heckerman DE, Korber BT and Mullins JI (2008). HLA class
I-driven evolution of human immunodeﬁciency virus type 1 subtype C proteome:
immune escape and viral load. Journal of Virology 82(13):6434–6446.
26
,
27
,
72
,
75
,
77
,
84
,
86
,
87
,
90
,
95
,
98
,
99
,
110
,
113
,
114
,
115
,
116
,
198
[197] Sanchez-Merino V, Farrow M, Brewster F, Somasundaran M and Luzuriaga K
(2008). Identiﬁcation and characterization of HIV-1 CD8
+
T cell escape variants
with impaired ﬁtness. The Journal of Infectious Diseases 197(2):300–8.
[198] Satten GA, Flanders WD and Yang Q (2001). Accounting for unmeasured
population substructure in case-control studies of genetic association using a
novel latent-class model. The American Journal of Human Genetics 68(2):466–
477.
13
[199] Schmitz JE, Kuroda MJ, Santra S, Sasseville VG, Simon MA, Lifton MA, Racz
P, Tenner-Racz K, Dalesandro M, Scallon BJ, Ghrayeb J, Forman MA, Mon-
teﬁori DC, Rieber EP, Letvin NL and Reimann KA (1999). Control of viremia
in simian immunodeﬁciency virus infection by CD8
+
lymphocytes.
Science
283(5403):857–860.
19
[200] Schneidewind A, Brockman MA, Sidney J, Wang YE, Chen H, Suscovich TJ,
Li B, Adam RI, Allgaier RL, Mothe BR, Kuntzen T, Oniangue-Ndza C, Trocha
A, Yu XG, Brander C, Sette A, Walker BD and Allen TM (2008). Structural
and functional constraints limit options for cytotoxic T-lymphocyte escape in
the immunodominant HLA-B27-restricted epitope in human immunodeﬁciency
virus type 1 capsid. Journal of Virology 82(11):5594–5605.
104
,
105
,
110
,
114
[201] Schneidewind A, Brockman MA, Yang R, Adam RI, Li B, Le Gall S, Rinaldo

180
CR, Craggs SL, Allgaier RL, Power KA, Kuntzen T, Tung CS, LaBute MX,
Mueller SM, Harrer T, McMichael AJ, Goulder PJR, Aiken C, Brander C, Kelle-
her AD and Allen TM (2007). Escape from the dominant HLA-B27-restricted
cytotoxic T-lymphocyte response in Gag is associated with a dramatic reduc-
tion in human immunodeﬁciency virus type 1 replication. Journal of Virology
81(22):12382–12393.
20
,
21
,
29
,
104
,
105
,
110
,
113
,
114
[202] Schneidewind A, Brumme ZL, Brumme CJ, Power KA, Reyor LL, O’Sullivan
K, Gladden A, Hempel U, Kuntzen T, Wang YE, Oniangue-Ndza C, Jessen H,
Markowitz M, Rosenberg ES, Sekaly RP, Kelleher AD, Walker BD and Allen
TM (2008). Transmission and long-term stability of compensated CD8 escape
mutations. Journal of Virology in press. doi:10.1128/JVI.01108-08.
[203] Self SG and Liang KY (1987). Asymptotic properties of maximum likelihood
estimators and likelihood ratio tests under nonstandard conditions. Journal of
the American Statistical Association 82(398):605–610.
[204] Setakis E, Stirnadel H and Balding DJ (2006). Logistic regression protects
against population structure in genetic association studies. Genome Research
16(2):290–296.
13
[205] Shankarappa R, Margolick JB, Gange SJ, Rodrigo AG, Upchurch D, Farzadegan
H, Gupta P, Rinaldo CR, Learn GH, He X, Huang XL and Mullins JI (1999).
Consistent viral evolutionary changes associated with the progression of human
immunodeﬁciency virus type 1 infection. Journal of Virology 73(12):10489–
10502.
18
[206] Sinha H, Nicholson BP, Steinmetz LM and McCusker JH (2006). Complex
genetic interactions in a quantitative trait locus. PLoS Genetics 2(2):e13.

181
[207] Stewart JJ, Watts P and Litwin S (2001). An algorithm for mapping positively
selected members of quasispecies-type viruses. BMC Bioinformatics 2(1):1.
12
[208] Storey JD (2002). A direct approach to false discovery rates. Journal of the
Royal Statistical Society - Series B: Statistical Methodology 64(3):479–498.
196
,
202
,
203
,
206
,
208
,
216
,
221
,
226
[209] Storey JD (2003). The positive false discovery rate: a Bayesian interpretation
and the q-value. Annals of Statistics 31(6):2013–2035.
196
,
205
,
208
,
209
,
221
,
226
[210] Storey JD, Taylor JE and Siegmund D (2004). Strong control, conservative
point estimation and simultaneous conservative consistency of false discovery
rates: a uniﬁed approach. Journal of the Royal Statistical Society - Series B:
Statistical Methodology 66(1):187–205.
203
[211] Storey JD and Tibshirani R (2003). Statistical signiﬁcance for genomewide
studies. Proceedings of the National Academy of Sciences of the United States
of America 100(16):9440–9445.
42
,
196
,
203
,
209
[212] Streeck H, Lichterfeld M, Alter G, Meier A, Teigen N, Yassine-Diab B, Sidhu
HK, Little S, Kelleher A, Routy JP, Rosenberg ES, Sekaly RP, Walker BD
and Altfeld M (2007).
Recognition of a deﬁned region within p24 Gag by
CD8+ T cells during primary human immunodeﬁciency virus type 1 infection
in individuals expressing protective HLA class I alleles. Journal of Virology
81(14):7725–7731.
[213] Suel GM, Lockless SW, Wall MA and Ranganathan R (2003). Evolutionarily
conserved networks of residues mediate allosteric communication in proteins.
Nature Structural and Molecular Biology 10(1):59–69.
11

182
[214] Tang C, Ndassa Y and Summers MF (2002). Structure of the N-terminal 283-
residue fragment of the immature HIV-1 Gag polyprotein. Nature Structural
and Molecular Biology 9(7):537–543.
105
,
106
,
107
[215] Taylor WR and Hatrick K (1994). Compensating changes in protein multiple
sequence alignments. Protein Engineering 7(3):341–348.
10
[216] Thomas DC, Haile RW and Duggan D (2005).
Recent developments in
genomewide association scans: a workshop summary and review. The American
Journal of Human Genetics 77(3):337–345.
[217] Thornsberry JM, Goodman MM, Doebley J, Kresovich S, Nielsen D and Buckler
ES (2001). Dwarf8 polymorphisms associate with variation in ﬂowering time.
Nature Genetics 28(3):286–289.
13
[218] Tillier ERM and Lui TWH (2003). Using multiple interdependency to separate
functional from phylogenetic correlations in protein alignments. Bioinformatics
19(6):750–755.
[219] Turnbull EL, Lopes AR, Jones NA, Cornforth D, Newton P, Aldam D, Pellegrino
P, Turner J, Williams I, Wilson CM, Goepfert PA, Maini MK and Borrow
P (2006). HIV-1 epitope-speciﬁc CD8+ T cell responses strongly associated
with delayed disease progression cross-recognize epitope variants eﬃciently. The
Journal of Immunology 176(10):6130–6146.
27
[220] Ueno T, Motozono C, Dohki S, Mwimanzi P, Rauch S, Fackler OT, Oka S
and Takiguchi M (2008). CTL-mediated selective pressure inﬂuences dynamic
evolution and pathogenic functions of HIV-1 Nef. The Journal of Immunology
180(2):1107.
[221] UNAIDS/WHO (2007). AIDS epidemic update.
http://www.unaids.org
.
1

183
[222] Vogel TU, Horton H, Fuller DH, Carter DK, Vielhuber K, O’Connor DH, Ship-
ley T, Fuller J, Sutter G, Erﬂe V, Wilson N, Picker LJ and Watkins DI (2002).
Diﬀerences between T cell epitopes recognized after immunization and after
infection. The Journal of Immunology 169(8):4511–4521.
[223] Voight BF and Pritchard JK (2005). Confounding from cryptic relatedness in
case-control association studies. PLoS Genetics 1(3):e32.
13
[224] Wang Y and DeLisi C (2006). Inferring protein-protein interactions in viral
proteins by co-evolution of conserved side chains. Genome 17(1):23–35.
116
[225] Wang YE, Li B, Carlson JM, Streeck H, Gladden AD, Goodman R, Schnei-
dewind A, Power KA, Toth I, Frahm N, Alter G, Brander C, Carrington M,
Walker BD, Altfeld M, Heckerman D and Allen TM (2009). Protective HLA
class I alleles that restrict acute-phase CD8
+
T-cell responses are associated
with viral escape mutations located in highly conserved regions of human im-
munodeﬁciency virus type 1. Journal of Virology 83(4):1845–1855.
73
[226] Wei X, Decker JM, Wang S, Hui H, Kappes JC, Wu X, Salazar-Gonzalez JF,
Salazar MG, Kilby JM, Saag MS, Komarova NL, Nowak MA, Hahn BH, Kwong
PD and Shaw GM (2003). Antibody neutralization and escape by HIV-1. Nature
422(6929):307–312.
18
[227] Williams SG and Lovell SC (2009).
The eﬀect of sequence evolution on
protein structural divergence.
Molecular Biology and Evolution in press.
doi:10.1093/molbev/msp020.
[228] Wollenberg KR and Atchley WR (2000). Separation of phylogenetic and func-
tional associations in biological sequences by using the parametric bootstrap.
Proceedings of the National Academy of Sciences of the United States of America
97(7):3288–3291.
10
,
11

184
[229] Worobey M, Gemmel M, Teuwen DE, Haselkorn T, Kunstman K, Bunce M,
Muyembe JJ, Kabongo JMM, Kalengayi RM, Van Marck E, Gilbert MTP and
Wolinsky SM (2008). Direct evidence of extensive diversity of HIV-1 in Kinshasa
by 1960. Nature 455(7213):661–664.
18
,
148
[230] Xie Y, Pan W and Khodursky AB (2005). A note on using permutation-based
false discovery rate estimates to compare diﬀerent analysis methods for microar-
ray data. Bioinformatics 21(23):4280–4288.
196
,
211
,
212
[231] Yang Z, Nielsen R, Goldman N and Pedersen AMK (2000). Codon-substitution
models for heterogeneous selection pressure at amino acid sites.
Genetics
155(1):431–449.
12
[232] Yanofsky C, Horn V and Thorpe D (1964). Protein structure relationships
revealed by mutational analysis. Science 146(3651):1593–1594.
114
[233] Yeang CH and Haussler D (2007). Detecting coevolution in and among protein
domains. PLoS Computational Biology 3(11):e211.
11
[234] Yeh WW, Cale EM, Jaru-Ampornpan P, Lord CI, Peyerl FW and Letvin NL
(2006). Compensatory substitutions restore normal core assembly in simian im-
munodeﬁciency virus isolates with Gag epitope cytotoxic T-lymphocyte escape
mutations. Journal of Virology 80(16):8168–8177.
104
[235] Yewdell J (2005). The seven dirty little secrets of major histocompatibility
complex class I antigen processing. Immunological Reviews 207(1):8–18.
[236] Yewdell JW (2006). Confronting complexity: real-world immunodominance in
antiviral CD8
+
T cell responses. Immunity 25(4):533–543.
26
[237] Yewdell JW and Bennink JR (1999). Immunodominance in major histocom-
patibility complex class I-restricted T lymphocyte responses. Annual Review of
Immunology 17(1):51–88.
26
,
108

185
[238] Yewdell JW and Haeryfar SM (2005). Understanding presentation of viral anti-
gens to CD8
+
T cells in vivo: the key to rational vaccine design. Annual Review
of Immunology 23(1):651–682.
[239] Yokomaku Y, Miura H, Tomiyama H, Kawana-Tachikawa A, Takiguchi M, Ko-
jima A, Nagai Y, Iwamoto A, Matsuda Z and Ariyoshi K (2004). Impaired
processing and presentation of cytotoxic-T-lymphocyte (CTL) epitopes are ma-
jor escape mechanisms from CTL immune pressure in human immunodeﬁciency
virus type 1 infection. Journal of Virology 78(3):1324.
20
[240] Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF, McMullen
MD, Gaut BS, Nielsen DM, Holland JB, Kresovich S and Buckler ES (2006). A
uniﬁed mixed-model method for association mapping that accounts for multiple
levels of relatedness. Nature Genetics 38(2):203–208.
13
[241] Yu XG, Lichterfeld M, Chetty S, Williams KL, Mui SK, Miura T, Frahm N,
Feeney ME, Tang Y, Pereyra F, LaBute MX, Pfaﬀerott K, Leslie A, Crawford
H, Allgaier R, Hildebrand W, Kaslow R, Brander C, Allen TM, Rosenberg ES,
Kiepiela P, Vajpayee M, Goepfert PA, Altfeld M, Goulder PJR and Walker
BD (2007).
Mutually Exclusive T-Cell Receptor Induction and Diﬀerential
Susceptibility to Human Immunodeﬁciency Virus Type 1 Mutational Escape
Associated with a Two-Amino-Acid Diﬀerence between HLA Class I Subtypes.
Journal of Virology 81(4):1619–1631.
14
,
110
[242] Zhang J and Rosenberg HF (2002). Complementary advantageous substitutions
in the evolution of an antiviral RNase of higher primates. Proceedings of the
National Academy of Sciences of the United States of America 99(8):5486–5491.
114
[243] Zu˜
niga R, Lucchetti A, Galvan P, Sanchez S, Sanchez C, Hernandez A, Sanchez
H, Frahm N, Linde CH, Hewitt HS, Hildebrand W, Altfeld M, Allen TM, Walker

186
BD, Korber BT, Leitner T, Sanchez J and Brander C (2006). Relative dom-
inance of Gag p24-speciﬁc cytotoxic T lymphocytes is associated with human
immunodeﬁciency virus control. Journal of Virology 80(6):3122–3125.
115

187
Appendix A
NEXT GENERATION SEQUENCING:
EXTENDING THE MODEL TO
SINGLE GENOME SEQUENCES
In this appendix, we brieﬂy outline the likelihood calculation and EM steps for
the Single Genome Sequencing (SGS) case, which was introduced and motivated in
subsection 8.1.5
. These calculations follow directly from known methods (see, e.g.,
[
93
], but we provide a derivation here to clarify the model. A thorough exploration
of the behavior of the model on real and synthetic data is left for future work.
A.1
Likelihood calculation
The calculation of likelihood on a tree is a well studied problem, ﬁrst made popular
in the phylogenetics community by Felsenstein [
60
]. Although we heavily use this
approach in all of our models, we have thus far omitted a thorough discussion of
likelihood calculation as it is generally familiar to the community. We will ﬁnd it
useful here, however, to brieﬂy review the likelihood calculation.
A.1.1
Notation
For a given phylogeny Ψ, observed target variables D = Y
1
, . . . , Y
N
, observed predictor
variables X = X
1
, . . . , X
n
, and model parameters θ, the likelihood of the model with
respect to θ is written
L
Ψ
(D|θ, X) = Pr {D|θ, Ψ, X} .
(A.1)
As the tree is held constant throughout, we will drop Ψ for the rest of this discussion.
The goal is ﬁrst compute L(D|θ), then to ﬁnd θ that maximize L. First, let us

188
introduce some notation.
Let {V
k
} represent the set of nodes in the phylogeny. If we are considering the
model of conditional adaptation, the {V
k
} will include the hidden nodes described in
chapter 4
. The ordering is arbitrary, but for convenience, let V
0
be the root (which is
itself arbitrary, as the models we consider are reversible), and let V
1
, . . . , V
N
denote
the nodes corresponding to variables Y
1
, . . . , Y
N
(as previously, we will sometimes
simply refer to node V
i
, 1 ≤ i ≤ N , by its corresponding variable Y
i
). We denote
the parent of a node V
k
as p(V
k
) and the set of children of V
k
as c(V
k
). Now the
observed data D consists of the values of the target and predictor attributes for each
individual i, each of which corresponds to a leaf on the tree. Let D(V
k
) represent the
data belonging to the subset of individuals in the subtree rooted at node V
k
. Recall
that the parameters of our model are θ = (π, λ, s).
A.1.2
Likelihood calculation in the unlinked case
Given this notation, the likelihood can then be written as
L(D|θ, X) =
v∈{0,1}
Pr {V
0
= v|θ}
V
c
∈c(V
0
)
L(D(V
c
)|θ, X, V
0
= v)
(A.2)
where L(D(V
k
)|θ, X, V
0
= v) is the likelihood of the subtree rooted at V
k
conditioned
on V
0
= v. Note that Pr {V
0
= 1|θ} = π, and Pr {V
0
= 0|θ} = 1 − π. The conditional
likelihood of an arbitrary node given its parent can be recursively deﬁned as
L(D(V
k
)|θ, X, p(V
k
) = p) =
v∈{0,1}
Pr {V
k
= v|θ, p(V
k
) = p}
V
c
∈c(V
k
)
L(D(V
c
)|θ, X, V
k
= v) (A.3)
if V
k
is a branch node. Note that Pr {V
k
= v|p(V
k
) = p, θ} is the transition probability
and is a function of the parameters π and λ and the branch length t
k
between V
k
and
its parent. If V
k
is a leaf in the phylogeny, then the likelihood is given by the leaf

189
distribution as deﬁned in
chapter 4
, which we write as
L(D(V
i
)|θ, X, p(V
i
) = p) =
y∈{0,1}
Pr {Y = y|θ, X, H = p}
1 {Y
i
= y} ,
(A.4)
where the right hand side of the equation is the familiar leaf distribution, which
computes the probability that the trait Y will be y given that the corresponding
hidden node is p
i
. Note that we only want to sum the probabilities for the observed
value of Y
i
. This condition is provided by the identity function
1 {·}, which evaluates
to 1 if the condition is true and 0 otherwise. Thus, the likelihood is simply equal to
the leaf distribution corresponding to the observed value of trait Y .
A.1.3
Likelihood calculation in the SGS case
Now consider the case where our data consists of SGS sequences, such that we have
multiple HIV sequences for each individual. As before, our ﬁrst step will be to con-
struct a phylogeny for all the sequences. Due to the rapid rate of evolution of HIV, we
expect that sequences sampled from a given individual will form a subtree within the
larger tree, though this is not guaranteed. For simplicity, however, we will assume this
is the case, though the derivation discussed in this appendix can be easily generalized
to avoid this assumption.
Previously, we had one HIV sequence per individual. When considering the target
trait Y and the predictor trait X, we therefore labeled individual variables Y
i
and X
i
for each individual i. Now, each individual has multiple sequences. Therefore, deﬁne
Y
i,g
to be the variable corresponding the gth sequence from the ith individual, with
H
i,g
denoting the corresponding hidden variable. For simplicity, we will focus on the
univariate model with the trait X corresponding to an HLA allele (though the results
generalize to Noisy Add as well). Therefore, will will continue to index X as X
i
.
Consider
Figure 8.1
on page
134
. The addition of the linked nodes L have two
implications. First, they change the leaf distribution calculation to Equation (
8.1
), as
discussed in
subsection 8.1.5
. Second, our likelihood calculation must now integrate

190
out the linked nodes. Intuitively, this can be done by ﬁrst ﬁxing the values of the
linked nodes, then computing the likelihood. We then simply compute the weighted
average of the likelihoods under every possible assignment of linked nodes, weighted
by the prior probability of a given conﬁguration, which is given by the parameter
s
1
and the observed predictor variables X. To make this explicit, we can expand
Equation (
A.1
) around the linked nodes. Speciﬁcally, let L
1
, . . . , L
N
be the set of
variables represent the linked trait L for each individual i ∈ {1, . . . , N }. Then the
likelihood can be written as
L(D|θ, X) =
l
1
∈{0,1}
· · ·
l
N
∈{0,1}
L(D|θ, X, L
1
= l
1
, . . . , L
N
= l
N
) × · · ·
. . . × Pr {L
1
= l
1
, . . . , L
N
= l
N
|X} . (A.5)
The structure of the graph implies that each linked node L
i
is conditionally indepen-
dent of all other linked nodes, given X. Thus, we can compute
Pr {L
1
= l
1
, . . . , L
N
= l
N
|X} =
i
Pr {L
i
= l
i
|X
i
} ,
(A.6)
where
Pr {L
i
= l
i
|X
i
} =



















s
1
if X
i
= 1 and l
i
= 1
1 − s
1
if X
i
= 1 and l
i
= 0
0
if X
i
= 0 and l
i
= 1
1
if X
i
= 0 and l
i
= 0
(A.7)
A straightforward computation of Equation (
A.5
) would involve an exponential num-
ber of terms. However, the independencies implied by the tree allow us to simplify
the expression in terms of subtrees, much as we did in Equation (
A.3
).
Notice how each linked node points to a set of individuals. Let M
i
be the most
recent common ancestor (MRCA) of the leaves pointed to by L
i
, where the MRCA
is deﬁned to be root of the smallest subtree (i.e., that subtree with the fewest nodes)
that contains all of the leaves in question. For simplicity, suppose each linked node

191
corresponds to a unique MRCA. That is for i = j, M
i
= M
j
. Further, suppose that
for all i = j, M
i
is not a descendent of M
j
, meaning M
j
is not on the path from M
i
to the root node. This will typically be the case for the univariate model, though it
clearly is not the case for Noisy Add. Nevertheless, the results easily generalize and
this assumption will simplify the following discussion.
Given these assumptions, the (in)dependencies implied by the phylogeny (
Fig-
ure 8.1
) imply that
L(D(M
i
)|θ, X, L
1
, . . . , L
N
) = L(D(M
i
)|θ, X, L
i
).
(A.8)
That is, the likelihood of the subtree of M
i
is conditionally independent of all linked
nodes except L
i
. Thus, the conditional likelihood of the subtree rooted at M
i
is easily
computed using
L(D(M
i
)|θ, X, p(M
i
) = p) =
l∈{0,1}
L(D(M
i
)|θ, X, p(M
i
) = p, L
i
= l) Pr {L
i
= l|X
i
} ,
(A.9)
where
L(D(V
i
)|θ, X, p(V
i
) = p, L
i
= l) =









y∈{0,1}
Pr {Y = y|θ, X
i
, H = p, L
i
= l}
1 {Y
i
= y}
if V
i
is a leaf
v∈{0,1}
Pr {V
i
= v|θ, p(V
i
) = p} × · · ·
· · · ×
V
c
∈c(V
i
)
L(D(V
c
)|θ, X
i
, V
i
= v, L
i
= l)
otherwise
(A.10)
This observation allows us to use each M
i
as a cut note. That is, the likelihood cal-
culation proceeds as in the unlinked case, except when a MRCA node is encountered.
When a MRCA is encountered, we calculate the likelihood of the subtree rooted at
the MRCA using Equation (
A.8
).
A.2
Expectation maximization
Having described the computation of the likelihood under the linked model, and
assuming the reader is familiar with expectation maximization (EM) [
49
] on trees,

192
we can now brieﬂy described the EM procedure for the linked model. Recall that
EM consists of two steps: the Expectation/E step, in which the expected values of the
hidden variables are computed with respect to their posterior probability distribution,
and the Maximization/M step, in which the parameters of the model are chosen so
as to maximize the likelihood of the model conditioned on the expected values of the
hidden variables. In the case of a continuous time Markov process (CTMP), the M
step requires the expected transition probabilities between every parent-child pair in
the tree [
167
]. Using the notation developed in the previous section, we can write the
posterior joint probability between node V
k
and its parent p(V
k
) as
Pr {p(V
k
), V
k
|θ, X, D} .
Note that the posterior is conditioned on all the observed data, not just the data
in the subtree rooted at V
k
. For the unlinked case, these probabilities can be easily
computed using standard message passing algorithms (e.g. [
93
]). Brieﬂy, during the
likelihood calculation, the partial posterior conditional probability
Pr {V
k
|θ, X, D(V
k
), p(V
k
)} =
Pr {V
k
|θ, p(V
k
)}
V
c
∈c(V
k
)
L(D(V
c
)|θ, X, V
k
)
L(D(V
k
)|θ, X, p(V
k
))
(A.11)
is computed and stored. Similarly, a biproduct of computing the likelihood at the
root is the posterior marginal probability
Pr {V
0
|θ, X, D} =
Pr {V
0
}
V
c
∈c(V
0
)
L(D(V
c
)|θ, X, V
0
)
L(D|θ, X)
.
(A.12)
Given the posterior marginal probability of any node, we can recurse back down
the tree, computing posterior joint probabilities along the way. Given the posterior
marginal probability of node V
k
’s parent, the posterior joint probability is simply
Pr {p(V
k
), V
k
|θ, X, D} = Pr {V
k
|θ, X, D(V
k
), p(V
k
)} × Pr {p(V
k
)|θ, X, D}
(A.13)
and the posterior marginal probability of V
k
is obtained by taking Equation (
A.13
) and
summing over all potential values of p(V
k
). Note that the conditional independencies

193
aﬀorded by the tree structure of the model mean that Pr {V
k
|θ, X, D(V
k
), p(V
k
)} is
independent of all observed data not in the subtree of V
k
, making this simple message
passing scheme possible.
Given these posterior joint probabilities, we can now compute the expected value
of the CTMP suﬃcient statistics, as well as maximize the parameters with respect
to those expected values, using the method of [
167
]. As discussed in
section 4.4
, this
process also yields the posterior joint probability distribution between the hidden
variables H and their corresponding observed target variables Y
Pr {Y
i
, H
i
|θ, D, X
i
} ,
from which we can maximize the selection parameter s according to whichever con-
ditional adaptation model we are using.
A.2.1
EM for the linked case
EM for the linked case proceeds much like the likelihood calculation. That is, we use
M
i
as a cut node and calculate the posterior probabilities for each node in the subtree
rooted at M
i
for each value of L
i
.
The ﬁrst step in this process is to compute the posterior Pr {L
i
|θ, X, D}. To do
so, we ﬁrst condition on the parent of M
i
, isolating L
i
from the rest of the tree, then
use Bayes’ rule to calculate the partial posterior:
Pr {L
i
|θ, X
i
, D(M
i
), p(M
i
)} =
Pr {D(M
i
)|θ, L
i
, p(M
i
), X
i
} Pr {L
i
|X
i
}
L(D(M
i
)|θ, X
i
, p(M
i
))
.
(A.14)
Note that each time on the right hand side of the equation follows naturally from
our linked likelihood calculations. Now we simply multiply Equation (
A.14
) by the
posterior marginal probability of p(M
i
) ((
A.12
)) to yield
Pr {L
i
|θ, X, D} =
p∈{0,1}
Pr {L
i
|θ, X
i
, D(M
i
), p(M
i
) = p} × Pr {p(M
i
) = p|θ, X} .
(A.15)

194
The parameter s
1
can then be maximized by
s
1
=
i
Pr {L
i
= 1|θ, D, X
i
= 1}
1 {X
i
= 1}
i
1 {X
i
= 1}
.
(A.16)
With the posterior of L
i
in hand, we can now easily compute the posterior joint
probability between any node and it’s parent in the subtree rooted at M
i
by condi-
tioning on L
i
. Speciﬁcally, Equation (
A.13
) becomes
Pr {p(V
k
), V
k
|θ, X, D} =
l∈{0,1}
Pr {V
k
|θ, X, D(V
k
), p(V
k
), L
i
= l} × · · ·
· · · × Pr {p(V
k
)|θ, X, D, L
i
= l} Pr {L
i
= l|θ, X, D}
(A.17)
Note that Pr {p(M
i
)|θ, X, D, L
i
= l} = Pr {p(M
i
)|θ, X, D}, allowing us to restrict this
process of conditioning on L
i
to the subtree rooted at M
i
.
Finally, to maximize s
2
, we ﬁrst compute the posterior
Pr {H
i,g
, Y
i,g
|θ, X
i
= 1, D, L
i
= 1} = Pr {Y
i,g
|θ, X, D, H
i,g
, L
i
= 1} × · · ·
· · · × Pr {H
i,g
|θ, X, D, L
i
= 1} Pr {L
i
= 1|θ, X, D} , (A.18)
then maximize s
2
according to the appropriate leaf distribution and the posterior
observed for each sequence.
Finally, recall that throughout this discussion we have made the simplifying as-
sumption that each linked node has a unique MRCA and no MRCA is a descendent of
any other MRCA. When these restrictions are removed, the result is that each linked
node is no longer independent of all other linked nodes. Consider the case where two
linked nodes L
i
and L
j
share the same MRCA (as will likely happen for the Noisy
Add model). The above discussion can be easily extended to include this case by
noting that we now must condition on L
i
and L
j
simultaneously. Likewise, if M
j
is
a descendent of M
i
, then all calculations within the subtree rooted at M
j
will need
to be conditioned on both L
i
and L
j
. In general, the computational complexity will
scale exponentially with the number of linked nodes on which we must simultaneously

195
condition. In practice, this scenario is most likely to occur in the Noisy Add model, in
which case the calculation will tend to be exponential in the number of linked nodes
that are included as predictors for a given target variable. In the examples discussed
in this dissertation, linked predictor variables correspond to HLA alleles. In our data,
it is uncommon for a single target variable to be predicted by more than a handful
of HLA alleles; thus, in this domain, it is unlikely that the computational burden of
the linked model will be large enough to make this approach impractical.

196
Appendix B
ON COMPUTING FDR FOR FISHER’S EXACT TEST
The model selection approach we have used throughout this work is based on
estimated false discovery rates. Although our FDR estimates have generally been
well-calibrated on synthetic data, the theory of FDR estimation was developed for
continuous statistics and does not always perform well on discrete data, especially
when the number of observations is small. In this appendix, we digress a bit from
the main focus of the work to discuss the implications of FDR on discrete data. In
particular, we focus on the application of FDR estimation to two by two contingency
tables, for which p-values can be exactly calculated using Fisher’s exact test (FET)
[
66
]. Although the results discussed in this appendix are not directly applicable to
phylogenetic dependency networks, owing to the impracticality of computing exact p-
values over our phylogenetic models, there are a number of cases in sequence analysis
in which the phylogeny has no bearing on the null distribution of variables, allowing
us to use FET directly. In these cases, the results presented here provide a substantial
increase in power by leading to less-biased FDR estimates while maintaining provably
conservative asymptotic guarantees.
The idea is as follows. A key assumption in Storey’s [
208
,
209
,
211
] estimation
of FDR is that the p-values under the null distribution are distributed according to
Uniform[0,1]. In this case, for any α ∈ [0, 1] the probability that we sample a p-value
that is at most α is simply α. In discrete data, however, the p-value distribution under
the null is often heavily skewed toward one. In such cases, as we have noted (
subsec-
tion 5.1.2
), one can estimate the the FDR based on an empirical null distribution that
is sampled from, for example, randomization testing (see refs. [
57
,
106
,
230
]). In the

197
case of FET, we can analytically perform exhaustive randomization testing, allowing
us to compute the exact null distribution. As well will show, this exact calculation
allows us to compute a tight estimation of FDR, as well as to prove some asymptotic
conservative guarantees.
B.1
Examples of FET for sequence data
Let us begin by describing some examples in which FET is the appropriate test for
signiﬁcance. Whereas the bulk of this dissertation has focused on techniques involv-
ing phylogenetic correction, here we brieﬂy outline some examples where population
structure is non-existent, making FET the most appropriate test.
Longitudinal HIV resistance data. The most direct way to identify sequence
associations is threw longitudinal analysis. In this study design, the HIV is sequenced
in each individual at the start of the trial, and is resequenced at the end of the trial.
This approach is most appropriate when the researcher has complete control over the
application of the predictor variable. One such example is provided by the study of
Richard Harrigan and colleagues [
91
]. In this study, approximately 700 patients who
had not been on antiretroviral drugs were enrolled and their consensus HIV sequences
sequenced (this is the HOMER cohort from
chapter 7
). These patients then initiated
one of several types of HIV therapy. Several years later, the consensus HIV sequences
were again determined for each patient and compared to the original sequences. The
goal was to identify mutations that were associated with speciﬁc antiretroviral drugs.
When we used the data as an application for the PDN, we considered only the initial
sequences (as we needed sequences that did not reﬂect drug-related adaptation). In
this case, it was important to correct for phylogeny. Here, however, we can observe
the transitions taking place, rendering the phylogeny irrelevant. For each amino acid
a
k
at position k, we construct a binary variable A such that A = 1 if a
k
was observed
only in the pre-trial sequence, and A = 0 if a
k
was observed in both the pre-trial
and post-trial sequences. The variable A is then tested for independence against each

198
of the drugs in the trial. Here, we tested three classes of drugs over 1194 variables
representing observed amino acids at positions in the HIV Protease and Reverse
Transcriptase proteins, resulting in 3582 total tests, each with 281 observations. We
will refer to this data set as “Resistance”.
Epitope mapping.
The cellular arm of the immune response identiﬁes and
destroys infected cells. Such cells can be identiﬁed by the unique strings of viral
protein fragments, called epitopes, that are displayed on the surface of infected cells.
These epitopes are displayed by human leukocyte antigen (HLA) proteins. Thousands
of HLA variations have been identiﬁed in humans. Thus, a critical component of HIV
vaccine design is the identiﬁcation and characterization of the set of epitopes that
each HLA allele can present on the cell surface.
One high-throughput experimental method for epitope mapping is the use of over-
lapping peptide (OLP) scans, in which a sliding window of 10-15 amino acid peptide
fragments are created based on the viral genome of interest. Each OLP is then tested
against cells from dozens of patients, each of whom contains six HLA alleles, using
interferon-γ ELISpot assays [
1
,
156
]. We then attempt to map HLA alleles to re-
sponses. Thus, we test whether patients with the given HLA allele are more likely to
have a positive ELISpot assay for a given OLP than patients without the HLA allele.
Here, we reanalyze the data of Kiepiela et al. [
116
], comparing comparing 219 HLA
alleles with 343 OLPs, resulting in 74,774 total tests, with an average of 724 obser-
vations per test. These data were collected as part of the Durban cohort analyzed in
chapter 7
. This data set is denoted “Epitope”.
Sieve analysis. In sieve analysis we try to identify positions in the viral genome
at which the variability diﬀers between two diﬀerent populations [
79
]. For example,
if the position is more variable in patients from one region than those from another,
it may indicate that diﬀerent forces are acting on that region. Following Gilbert, we
have created a data set consisting of 567 HIV clade B sequences from the HOMER
cohort [
25
] and 522 HIV clade C sequences from the Durban cohort [
116
,
196
]. For

199
each of 363 position in the HIV Gag protein, we compared the frequency at which
sequences match the consensus for their respective clades. We will call this data set
“Sieve”.
Linkage disequilibrium mapping. Linkage disequilibrium (LD) occurs when
two sites on a chromosome are not statistically independent of each other. This
typically occurs when the sites are nearby on the chromosome, such that inheritance
of a given allele at one site is correlated with inheritance of a given allele at the other
site. When performing genome wide association studies (GWAS), it is helpful to
identify a set of variations (SNPs) that are in LD with each other, so that potentially
redundant results can be identiﬁed. We have taken the SNP data from a recent GWAS
[
37
], binarized the data and tested for independence between each of 401,017 pairs of
neighboring SNPs. There was an average of 1085 observations per test. We this data
set “SNP”.
Synthetic data To demonstrate various claims throughout this appendix, we use
synthetic data sets constructed to closely mimic the above real data sets. We will
describe the construction of these data sets in
section B.5
, after we have introduced
some useful notation and concepts.
B.2
Background
B.2.1
Contingency Tables and Fisher’s Exact Test
Consider an experiment that tests the independence of two random binary variables
X and Y . Suppose there are n observations, (x
1
, y
1
), (x
2
, y
2
), . . . , (x
n
, y
n
). The results
can be summarized in a contingency table t = (a, b, c, d), as deﬁned in
Table B.1
. We
denote the marginal counts of t as θ
t
= (θ
X
, θ
¯
X
, θ
Y
, θ
¯
Y
), which represent the number
of times each variable is observed in each state. These counts capture the maximum
likelihood estimators for the marginal probabilities Pr {X = 1} and Pr {Y = 1}. We
will often drop the superscript t when it is clear from context.

200
Table B.1: 2 × 2 contingency table on binary variables X and Y .
Y = 1
Y = 0
X
X = 1
a
b
θ
X
= a + b
X = 0
c
d
θ
¯
X
= c + d
Y
θ
Y
= a + c
θ
¯
Y
= b + d
n
Our goal is to test the null hypothesis H = 0 that X and Y are independent. Let T
be the random variable representing a table with marginals θ. Fisher [
66
] showed that,
if X and Y are independent and each is independent and identically distributed (IID),
then the probability of observing T = t is given by the hypergeometric distribution:
pr(t)
Pr T = t|H = 0, θ
t
=
θ
X
a
θ
¯
X
c
n
θ
Y
.
(B.1)
From Equation (
B.1
), a two-tailed marginal p-value can be computed for t using
p(t) = Pr pr(T ) ≤ pr(t)|θ
T
= θ
t
=
t ∈perm(t)
Pr {t } ·
1 {Pr {t } ≤ Pr {t}} , (B.2)
where perm(t) = {t : θ
t
= θ
t
}, and
1 {·} is the indicator function that evaluates
to one if the constraints are satisﬁed and zero otherwise. We call Equation (
B.2
) a
marginal p-value because it is conditioned on the marginals θ.
Many statistics have a uniform distribution of p-values under the null. That is, if
P is a continuous random variable representing a p-value, Pr {P ≤ α|H = 0} = α. It
is evident from Equation (
B.1
), however, that the distribution of FET p-values under
the null may be strongly skewed toward one and therefore Pr {P ≤ α|H = 0} ≤ α.
For example, the minimum achievable p-value for a given n is 1/
n
n/2
, and that can
only be achieved when the marginals are evenly distributed (i.e., when θ
X
= θ
¯
X
=
n
2
).
Figure B.1
shows the distribution of p-values from our various data sets.

201

0
100
200
300
400
500
0
-1
-2
-3
-4
-5
-6
-7
-8
-9
A Resistance

0
500
1000
1500
2000
0
-1
-2
-3
-4
-5
-6
-7
-8
-9
B Epitope

0
25
50
75
100
0
-1
-2
-3
-4
-5
-6
-7
-8
-9
C Sieve

0
5000
10000
15000
20000
0
-1
-2
-3
-4
-5
-6
-7
-8
-9
D SNPs
Figure B.1: The histogram of marginal p-values for the various data sets. p-values
are transformed using log
2
p.
B.2.2
Positive False Discovery Rates
Suppose that m hypothesis tests over 2 × 2 contingency tables t
1
, . . . , t
m
are si-
multaneously tested using Fisher’s exact test, with corresponding marginal p-values
p
1
, p
2
, . . . , p
m
, and we wish to estimate or control the false positive rate in some way.
Given the values of H
1
, H
2
, . . . , H
m
, where H
i
= 0 if the ith null hypothesis is true
and H
i
= 1 if the null hypothesis is false, and a rejection region Γ, the possible results
are summarized in
Table B.2
.
Note that only m, R and W are observed. Here, we will focus on nested rejection
regions deﬁned by the Type I error rate α. That is, given an α ∈ [0, 1], we will reject
all tests with P ≤ α. For convenience, we will use α to denote this rejection region.

202
Table B.2: Outcomes when testing m hypotheses.
Hypothesis
Accept
Reject
Total
Null true
U (Γ)
V (Γ)
m
0
Alternative true
T (Γ)
S(Γ)
m
1
Total
W (Γ)
R(Γ)
m
Benjamini and Hochberg [
15
] proposed controlling the False Discovery Rate (FDR),
which they deﬁned to be
F DR(α)
E
V (α)
R(α)
.
(B.3)
When R(α) = 0, this quantity is undeﬁned. In this case, Benjamini and Hochberg
deﬁned F DR(α) to be 0, which is equivalent to deﬁning
F DR(α)
E
V (α)
R(α)
R(α) > 0 · Pr {R(α) > 0} .
(B.4)
In practice, we are only interested in p-value thresholds that result in the rejection of
at least one test in our data. Thus, Storey [
208
] proposed the positive false discovery
rate (pFDR), which is conditioned on the rejection of at least one test:
pF DR(α) = E
V (α)
R(α)
Pr {R(α) > 0}
(B.5)
=
1
Pr {R(α) > 0}
· F DR(α).
(B.6)
In order to estimate pF DR(α), we need to make some assumptions about the
hypothesis tests. One useful set of assumptions proposed by Storey [
208
] is that the
random variables H
i
are IID Bernoulli variables with prior probability Pr {H
i
= 1} =
1 − π
0
and that the p-values are IID with
P
i
|H
i
∼ (1 − H
i
) · F
0
+ H
i
· F
1
,
(B.7)
for some null distribution F
0
and some alternative distribution F
1
. Under these as-

203
sumptions, Storey [
208
] showed that
pF DR(α) = Pr {H = 0|P ≤ α}
(B.8)
=
π
0
· Pr {P ≤ α|H = 0}
Pr {P ≤ α}
(B.9)
=
E [V (α)]
E [R(α)]
.
(B.10)
For small samples, it may be quite likely that no tests are rejected at threshold α.
Therefore, Storey [
208
] proposed the following estimator for pFDR:
pF DR(α)
ˆ
π
0
· Pr{P ≤ α|H = 0}
Pr{P ≤ α} · Pr{R(α) > 0}
.
(B.11)
Equation (
B.11
) provides estimators for each term of Equation (
B.9
), with an added
correction term that estimates Pr {R(α) > 0}. (Note that we did not employ this
term in the previous chapters, as we were concerned with large FDR estimates over
a large number of samples, meaning Pr {R(α) > 0} was close to 1.)
For p-values that are not biased towards 0 under the null distribution, Storey [
208
]
showed that
pF DR(α)
W (λ) · α
(1 − λ)R(α) (1 − (1 − α)
m
)
,
(B.12)
for some well-chosen λ, 0 ≤ λ < 1, is conservative both asymptotically and in ex-
pectation for ﬁnite samples using the intuition that Pr {P ≤ α|H = 0} ≤ α and
π
0
≈
W (λ)
(1−λ)m
. Storey [
208
,
211
] and others (e.g., [
126
]) have proposed methods for
choosing λ.
In practice, Storey suggested estimating pF DR(α) for each observed p-value p
i
and
proved that, under the aforementioned independence assumptions, the estimates are
simultaneously conservative for all p
i
[
210
]. When the tests are continuous and uni-
formly distributed, Pr {P ≤ p
i
|H = 0} = p
i
, making Storey’s estimator quite tight
in practice. When the tests are discrete, however, Pr {P ≤ p
i
|H = 0} can be sig-
niﬁcantly less than α.
For example, in the case of Fisher’s exact test, each ob-
served p-value p
i
is a marginal p-value dependent on the marginals θ
i
. Although

204
Pr {P ≤ p
i
|H = 0, θ
i
} = p
i
, for some test j = i, Pr {P ≤ p
i
|H = 0, θ
j
} ≤ p
i
. Thus,
the overall probability Pr {P ≤ P
i
|H = 0} ≤ p
i
, making Storey’s estimate overly con-
servative.
Although the marginal p-values do not provide good estimators of the overall
probability Pr {P ≤ α|H = 0}, precluding a simple procedure of computing pFDR
for each p
i
using Storey’s approach, the null distribution can be estimated by ran-
domization tests, as has been discussed in the microarray literature (for review,
see [
36
]).
Nevertheless, Pounds and Cheng [
183
] point out that, even assuming
Pr {P ≤ α|H = 0} = α, estimation methods for π
0
that rely on continuous assump-
tions are not applicable to discrete data. Instead, they propose deﬁning
ˆ
π
0
2¯
p,
(B.13)
where ¯
p is the arithmetic mean of the observed marginal p-values. Although Equa-
tion (
B.13
) also assumes a uniform p-value distribution under the null hypothesis,
this conservative estimator tends to yield tighter estimates than other methods when
the true π
0
is suﬃciently small [
183
].
B.3
Computing pFDR for Fisher’s exact test
Fisher’s exact test provides an eﬃcient means of computing Pr {P ≤ α|H = 0, θ} ex-
actly. Although the resulting marginal p-values should not be applied directly to
pFDR estimation, the computation can be eﬃciently leveraged to provide a tight
estimate for each of the terms in Equation (
B.11
). In this section, we deﬁne each
of these estimates and show that each estimate is unbiased or conservative (in this
context, an estimator is conservative if it is expected to overestimate terms in the
numerator of Equation (
B.11
) or underestimate terms in the denominator of Equa-
tion (
B.11
)). We evaluate the asymptotic convergence properties of Equation (
B.11
),
showing that it results in a conservative FDR estimate and characterize conditions
under which the estimates will be closer to the true values. In addition, we discuss

205
ways in which the exact test can be further leveraged to increase power by ﬁltering
irrelevant tests.
In what follows, we begin with Storey’s [
209
] assumptions. Speciﬁcally, we assume
that p-values are IID and follow the mixture model Equation (
B.7
) and the H
i
are
IID Bernoulli random variables. Because we are using Fisher’s exact test, we also
need to consider the marginals θ, which we assume are IID Although we we make no
assumptions about the speciﬁc distribution of θ, we assume that the p-values depend
on the marginals according to the mixture model
P |H, θ ∼ (1 − H) · F
0
(θ) + H · F
1
(θ),
(B.14)
where F
0
(θ) follows the hypergeometric distribution as deﬁned in Equation (
B.2
) and
F
1
(θ) is the generating function of the alternative model. We further assume that θ
is independent of H, though we will later relax this assumption.
B.3.1
Estimating
Pr {P ≤ α|H = 0}
from pooled p-values
To estimate Pr {P ≤ α|H = 0}, we use a pooled p-value estimate, which is the average
probability that each test has P ≤ α:
Pr{P ≤ α|H = 0}
1
m
m
i=1
Pr {P ≤ α|H = 0, θ
i
} .
(B.15)
Because FET computes Pr {P ≤ α|H = 0, θ
i
} by considering every permutation
of the data that is consistent with θ
i
, it can be seen that Equation (
B.15
) is the result
of computing every possible permutation of the data. Furthermore, given that H
and θ are independent, it follows that the pooled p-values are unbiased estimates of
Pr {P ≤ α|H = 0} (see Lemma
1
in
section B.6
).
Finally, because Pr {P ≤ α|H = 0, θ}
α for many marginals observed in prac-
tice, Equation (
B.15
) can provide a substantial gain in statistical power over the
estimate Pr{P ≤ α|H = 0}
α.
Figure B.2
shows the advantage of using pooled
p-values over α.

206

0
0.2
0.4
0.6
0
0.5
A Resistance

0
0.05
0.1
0.15
0
0.5
B Epitope

0
0.2
0.4
0.6
0.8
0
0.5
C Sieve

0
0.2
0.4
0.6
0.8
0
0.5
D SNPs
Figure B.2: Pooled vs. marginal p-values. The dashed line shows the α estimator.
The distance between the dashed line and the solid line is the gain from using the
pooled p-values.
B.3.2
Estimating
Pr {P ≤ α}
and
Pr {R(α) > 0}
Given that R(α) is the number of observed tests with p ≤ α, it follows that
E
R(α)
m
= Pr {P ≤ α} .
(B.16)
We therefore follow Storey [
208
] in deﬁning
Pr{P ≤ α}
R(α)
m
.
(B.17)
Because Equation (
B.11
) would be undeﬁned for R = 0, Storey [
208
] deﬁnes
Pr{P ≤ α}
R(α) ∨ 1
m
,
(B.18)

207
where R(α) ∨ 1 = max(R(α), 1), though in practice we are typically only interested
in values for α that were observed in our data set.
To determine pF DR(α), we also require the estimate Pr{R(α) > 0}. It follows
from our independence assumptions that
Pr {R(α) > 0} = 1 − (1 − Pr {P ≤ α})
m
(B.19)
≤ 1 − (1 − Pr {P ≤ α|H = 0})
m
,
(B.20)
making
Pr{R(α) > 0}
1 − 1 − Pr{P ≤ α|H = 0}
m
(B.21)
a conservative estimate of Pr {R(α) > 0}.
B.3.3
Estimating π
0
The ﬁnal step in the pF DR computation is to estimate π
0
. In this section, we use
a general framework for conservatively estimating π
0
[
42
] and show how existing
methods ﬁt within this framework.
Given the mixture model Equation (
B.14
), we can write
Pr {P = p} = π
0
· Pr {P = p|H = 0} + π
1
· Pr {P = p|H = 1} ,
(B.22)
where π
1
= Pr {H = 1} = 1 − π
0
[
42
,
78
,
126
]. Thus, it follows that
Pr {P = p} ≥ π
0
· Pr {P = p|H = 0}
(B.23)
and
π
0
≤
Pr {P = p}
Pr {P = p|H = 0}
.
(B.24)
Moreover, for any non-negative function ρ(·) we can also write
π
0
≤
p
ρ(p) Pr {P = p}
p
ρ(p) Pr {P = p|H = 0}
.
(B.25)

208
Therefore, estimating π
0
using
ˆ
π
0
m
i=1
p
ρ(p) ·
1 {p
i
= p}
m
i=1
p
ρ(p) · Pr {P = p|H = 0, θ
i
}
,
(B.26)
where ρ(·) is any non-negative function, is always a conservative estimate in expec-
tation (see Lemma
2
). Furthermore, in the limit, Equation (
B.26
) asymptotically
converges to
π
0
+ π
1
·
E [ ρ(p)| H = 1]
E [ ρ(p)| H = 0]
,
which implies that the ρ(·) function minimizing
E[ ρ(p)|H=1]
E[ ρ(p)|H=0]
will yield the least biased
estimator (see Lemma
3
).
Equation (
B.26
) gives us great ﬂexibility in computing π
0
estimates. One such
estimation method is given by Storey: [
208
,
209
]
ˆ
π
0
(λ) =
#{p
i
> λ}
(1 − λ)m
(B.27)
for some tuning parameter 0 ≤ λ < 1. For uniformly distributed statistics,
(1 − λ)m = E [ #{π > λ}| H = 0] .
(B.28)
For contingency tables, we can estimate Pr {P > λ|H = 0} using Pr{P ≤ α|H = 0},
which results in an unbiased estimate for E [ #{π > λ}| H = 0]. Therefore, Equa-
tion (
B.27
) is a special case of Equation (
B.26
) in which
ρ(p) =





0
if p ≤ λ,
1
otherwise.
(B.29)
As λ → 0, we have increasingly conservative bias, with ˆ
π
0
= 1 when λ = 0, whereas
the variance of the ˆ
π
0
estimate increases as λ → 1 due to the decreasing number of
observations. Indeed, diﬀerent heuristic approaches have been proposed to balance the
bias-variance tradeoﬀ inherent in picking λ. Equation (
B.26
) suggests an orthogonal
heuristic that may be useful in estimating ˆ
π
0
: choosing a weighting function ρ(·)

209
such that more weight is applied to tests with high p-value. A natural choice for a
weighting function is ρ(p) = Pr {P ≤ p|H = 0, θ
i
} = p
i
, which is equivalent to
ˆ
π
0
E [P ]
E [P |H = 0]
.
(B.30)
Under this weighting function, tests with low p-values will still contribute to the π
0
estimate, but not as much as tests with high p-values. In principle, we could deﬁne
ρ(·) such that we are eﬀectively summing p-values over the range λ < p ≤ 1; in
practice, however, we have found the ˆ
π
0
estimate to be quite stable over a wide range
of λ, and so simply set λ = 0.
Pounds and Cheng [
183
] suggested the estimator ˆ
π
0
2¯
p for discrete statistics,
where ¯
p is the average observed marginal p-value. Estimator (
B.30
) is similar to their
estimate, except that for contingency tables we can compute E [P |H = 0] exactly,
rather than assuming E [P |H = 0] = 0.5.
Table B.3
compares the true π
0
for synthetic data against the π
0
estimators of
Equation (
B.26
) and of Storey [
209
] (evaluated for λ = 0.5
1
) and Pounds and Cheng
[
183
] computed using marginal p-values. Equation (
B.26
) provides a tight and con-
servative estimate, substantially increasing power of methods that assume a uniform
p-value distribution. Similarly, accounting for the exact p-value distribution results
in a substantially lower π
0
estimate on each of the real data sets (
Table B.4
).
B.3.4
Convergence properties
We have presented estimates for each component of pF DR, showing how each is either
unbiased or conservative. It follows that pF DR is asymptotically conservative as
m → ∞. We can, however, be more precise in estimating the convergence properties
of our pF DR estimate. Speciﬁcally
1
We used the available R code from http://genomics.princeton.edu/storeylab/qvalue/index.html.
Results are truncated at 1. λ = 0.5 was chosen because the spline-ﬁtting method [
211
] always
results in estimates of π
0
= 1 for these discrete data sets.

210
Table B.3: Comparing ˆ
π
0
estimations for synthetic data sets derived from the Epitope
data with diﬀerent true π
0
. Storey’s method was evaluated at λ = 0.5.
π
0
E[P ]
E[P |H=0]
2¯
p
Storey
0.001
0.086
0.12
0.11
0.05
0.13
0.18
0.17
0.1
0.18
0.25
0.24
0.2
0.27
0.38
0.37
0.3
0.36
0.52
0.51
0.4
0.45
0.64
0.63
0.5
0.58
0.69
0.68
0.6
0.66
0.79
0.79
0.7
0.75
0.91
0.90
0.8
0.83
1.01
1
0.9
0.92
1.13
1
1
1.003
1.24
1
Theorem 1. Given m tests, in which the P-values are IID and distributed according
to the mixture Equation (
B.14
), the H are IID Bernoulli random variables, and for
each test i, θ
i
is independent of H
i
, and given a non-negative function ρ(·):
lim
m→∞
pF DR(α)
a.s.
=
π
0
+ π
1
E[ ρ(p)|H=1]
E[ ρ(p)|H=0]
π
0
· pF DR(α)
(B.31)
Proof. Provided in
section B.6
.
This convergence theorem shows that, for large samples, our pFDR estimate is
conservative and becomes tightest when ˆ
π
0
is computed using a ρ(·) function that is
expected to be much lower for true alternative hypotheses than for true null hypothe-
ses.

211
Table B.4: Comparing ˆ
π
0
estimations for the real data sets.
data set
E[P ]
E[P |H=0]
2¯
p
Storey
Sieve
0.63
0.82
0.69
SNPs
0.34
0.44
0.41
Epitope
0.99
1.84
1
Resistance
0.97
1.38
1
B.3.5
Dependent marginals
Until now, we have assumed the following mixture model:
P |H, θ ∼ (1 − H) · F
0
(θ) + H · F
1
(θ),
(B.32)
where θ is independent of H. It is conceivable, however, that true alternative tests
will tend to have more balanced marginals that permit lower p-values. For example,
in the Resistance data, it is possible that positions in which no mutations confer drug
resistance may be more conserved due to purifying selection, an evolutionary process
that results in less observed variation.
Under these conditions, we might expect
the pFDR estimate to become more conservative because a substantial proportion
of the balanced marginals included in our pooled-p-value estimate belong to true
alternative tests causing us to overestimate Pr {P ≤ α|H = 0}. This conservative
bias is analogous to that observed in the microarray community—a bias that arises
from permutation testing over alternative data that has a higher variance than the
null data [
106
,
230
]. In this section, we formalize the problem for discrete statistics
and show that our pFDR estimate becomes asymptotically more conservative as the
tendency increases for true alternative tests to have more evenly distributed marginals
than null tests. Note that these results hold only for π
0
estimators in which ρ(·) is non-
decreasing, a slightly more restrictive deﬁnition than we used in the previous sections.
In principle, the conservative bias could be reduced using heuristic measures similar

212
to those proposed in the microarray community [
106
,
230
]; however, these heuristics
are not guaranteed to result in a conservative pFDR estimate, so we will not explore
them further here.
Let us consider a special case in which Pr {θ|H = 1} = Pr {θ|H = 0}. Speciﬁcally,
we shall consider the case where the marginals of true alternative hypotheses tend
to have larger n and/or are more balanced, meaning that the marginals of the true
alternative hypotheses tend to permit smaller p-values than the marginals of the true
null hypotheses. In this case,
θ
Pr {P ≤ α|H = 0, θ} · Pr {θ|H = 0}
assumed
≤
θ
Pr {P ≤ α|H = 0, θ} · Pr {θ|H = 1} . (B.33)
Under these assumptions, we can derive the following large sample theorem:
Theorem 2. Given m tests that follow the assumptions of Theorem
1
, a non-decreasing
function ρ(·), and Assumption (
B.33
), then
lim
m→∞
pF DR(α) ≥
π
0
+ π
1
E[ ρ(p)|H=1]
E[ ρ(p)|H=0]
π
0
· pF DR(α).
(B.34)
Hence, under the above stated assumptions, and provided ρ(p) is non-decreasing
in p, our pFDR and FDR estimates will be asymptotically more conservative than
the case where the marginals are independent of H.
In practice, it is not known whether H and θ are independent. Consequently, we
recommend using the more restricted class of ρ(·) functions allowed by Theorem
2
.
B.3.6
Filtering irrelevant tests
The discreteness of the data provides a unique opportunity to increase power further.
For any discrete test, there exists an α such that Pr {P ≤ α} = 0. It turns out that
including such tests in our pFDR estimate will typically increase the conservative
bias. Thus, power can often be improved by ﬁrst ﬁltering out all all tests that couldn’t

213
possibly achieve the signiﬁcance threshold α. As we shall see, such ﬁltering will still
result in an asymptotically conservative estimate and, in some cases, will provably
increase power.
Because the range of the data is ﬁnite, computation of the most signiﬁcant achiev-
able p-value given the marginals is possible. We can deﬁne the minimum achievable
p-value for ﬁxed marginals as
p
∗
(θ)
min
t∈perm(θ)
p(t).
(B.35)
When computing pF DR(α), we can ignore (ﬁlter) all tests i such that
p
∗
(θ
i
) > α.
(B.36)
For a set of contingency tables T, we can now write
T = T
+
α
∪ T
−
α
(B.37)
for the disjoint sets
T
+
α
= {t
i
: t
i
∈ T ∧ p
∗
(θ
i
) > α}, and
(B.38)
T
−
α
= {t
i
: t
i
∈ T ∧ p
∗
(θ
i
) ≤ α}.
(B.39)
Filtering on p
∗
(θ) > α can be seen as estimating pF DR over T
−
α
. Let pF DR
∗
(α) and
pF DR
∗
(α) be the true and estimated pFDR, respectively, for T
−
α
. Then the following
theorem holds.
Theorem 3. Given m tests that follow the assumptions of Theorem
1
,
pF DR
∗
(α) = pF DR(α).
(B.40)
Proof. Recall that, for independent, identically distributed tests,
pF DR(α) =
E [V (α)]
E [R(α)]
.
(B.41)

214
Because
E [ R(α)| Pr {p(T ) ≤ α} = 0] = 0,
(B.42)
it follows that pF DR(α) will be the same for T and T
−
α
. Therefore, pF DR
∗
(α) =
pF DR(α).
Therefore, ﬁltering on p
∗
(θ
i
) > α) does not change the true pF DR. Furthermore,
under the assumptions of Theorem
2
, we have
pF DR
∗
(α)
a.s.
≤ lim
m→∞
pF DR
∗
(α),
(B.43)
Consequently, we can compute pF DR
∗
(α) instead of pF DR(α). Moreover, if true
alternative tests are monotonically more concentrated at lower p-values; that is, if
Pr {P ≤ α|H = 1}
Pr {P ≤ α|H = 0}
(B.44)
is non-increasing in α, then
lim
m→∞
pF DR
∗
(α)
a.s.
≤ lim
m→∞
pF DR(α),
(B.45)
meaning that ﬁltering is asymptotically guaranteed to provide a tighter estimate
and therefore increase power (see Lemma
4
in Appendix for details). Although this
condition is often not met for discrete data, it is often approximately met and provides
a good rationale for ﬁltering. In addition, because both pF DR(α) and pF DR
∗
(α)
are asymptotically greater than pF DR(α), for large samples we can compute both
the ﬁltered and unﬁltered estimates and choose whichever yields the lower value.
The only estimate that ﬁltering changes is ˆ
π
0
. Let ˆ
π
0
(α) denote the estimated π
0
over T
−
α
(α), and ˆ
π
0
(1) denote the estimated π
0
over T. It may be that ˆ
π
0
(α) ≤ ˆ
π
0
(1),
and ˆ
π
0
(α) is not a conservative estimate of the true π
0
of the original set of tests
T. ˆ
π
0
(α) is, however, a conservative estimate of π
0
among the ﬁltered set of tests
T
−
α
(α). That is, ˆ
π
0
(α) is a conservative estimate of the proportion of tests that are
truly null among those tests that could achieve signiﬁcance level α. In addition to

215

0.4
0.5
0.6
0.7
0.8
0.9
1
0.0001
0.001
0.01
0.1
1

Download 4,8 Kb.

Do'stlaringiz bilan baham:

1 ... 4 5 6 7 8 9 10 11 12