Phylogenetic dependency networks: Inferring patterns of adaptation in HIV


Phylogenetic dependency networks:
Inferring patterns of adaptation in HIV
Jonathan M. Carlson
A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
University of Washington
2009
Program Authorized to Offer Degree: Computer Science and Engineering

University of Washington
Graduate School
This is to certify that I have examined this copy of a doctoral dissertation by
Jonathan M. Carlson
and have found that it is complete and satisfactory in all respects,
and that any and all revisions required by the final
examining committee have been made.
Co-Chairs of the Supervisory Committee:
David Heckerman
Walter L. Ruzzo
Reading Committee:
David Heckerman
James Mullins
Walter L. Ruzzo
Date:

In presenting this dissertation in partial fulfillment of the requirements for the doctoral
degree at the University of Washington, I agree that the Library shall make its
copies freely available for inspection. I further agree that extensive copying of this
dissertation is allowable only for scholarly purposes, consistent with “fair use” as
prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this
dissertation may be referred to Proquest Information and Learning, 300 North Zeeb
Road, Ann Arbor, MI 48106-1346, 1-800-521-0600, to whom the author has granted
“the right to reproduce and sell (a) copies of the manuscript in microform and/or (b)
printed copies of the manuscript made from microform.”
Signature
Date

University of Washington
Abstract
Phylogenetic dependency networks:
Inferring patterns of adaptation in HIV
Jonathan M. Carlson
Co-Chairs of the Supervisory Committee:
Affiliate Professor David Heckerman
Medical Education and Biomedical Informatics, and Microbiology
Professor Walter L. Ruzzo
Computer Science and Engineering
Populations adapt to their environment through a process of natural selection. By
studying this process, one can gain insight into the specific functions of adaptive traits
that provide an advantage in certain environments. HIV has proven to be remarkably
adept at adaptation, so much so that the virus quickly adapts to each individual who
is infected, effectively nullifying the immune response of most patients. By identifying
the specific adaptations HIV employs against the immune system, it may be possible
to identify vaccine targets that reduce HIV’s capacity to successfully adapt.
This dissertation introduces the Phylogenetic Dependency Network (PDN) for the
identification of adaptive traits and the environments in which they arise. The PDN is
a directed graphical model in which nodes represent measurable traits of the population
and the environment and arcs represent probabilistic dependencies among traits.
The probability component of the PDN consists of a model of adaptive evolution in
which each population trait adapts to a set of predictors, traits to which it is connected
in the PDN. The structure of the PDN is identified through a model selection
approach and can be interpreted as an estimate of which traits directly interact. We
introduce a class of probabilistic adaptive evolution models called conditional adaptation
models. These models assume that each trait has evolved independently of all
other traits in the PDN until it reached the current environment, at which point the
predictors act to influence adaptation of the trait.
One of the key benefits of this approach over traditional methods is the ability to
simultaneously model multiple interactions. Existing approaches are typically constrained
to consider the evolutionary interaction of two traits at a time. In complex
environments in which each trait interacts with many other traits, this constrained
view of adaptation blurs the distinction of which traits are truly interacting and which
are only indirectly correlated. By modeling these interactions using conditional adaptation
models, we are able to accurately capture dense networks of interactions.
We apply our PDN approach to study adaptation of HIV to the human cellular
immune response, identifying a large set of HIV adaptations that consistently arise
in patients with similar immune genetics. These adaptations often take the form of
multiple mutations spanning large regions of HIV proteins and indicate the presence
of preferred patterns of adaptation. Although these adaptation networks are quite
complex, the presence of these preferred adaptation patterns suggests weak points in
viral adaptation that may be exploited by future vaccines.

TABLE OF CONTENTS
List of Figures
List of Tables
Chapter 1: Introduction
Chapter 2: Detecting Adaptation: Introduction and Review
    2.1 Selection and adaptation
    2.2 Phylogeny confounds the comparative method
    2.3 Related work on the comparative method
    2.4 Limitations of existing methods
Chapter 3: HIV Immune Escape: Introduction and Review
    3.1 The HLA-restricted CTL response is a major selective force driving HIV-1 evolution within an infected host
    3.2 Escape follows generally predictable patterns in response to specific immune pressures
    3.3 Immune selection pressures drive HIV evolution at the population level: but to what extent?
    3.4 Assessing the extent of HLA-driven HIV-1 evolution at the population level: challenges and controversies
    3.5 HLA-associated immune pressures influence population HIV diversity at up to 40% of positions in some proteins
    3.6 Clinical consequences of immune-mediated evolution
    3.7 Strategies to cope with viral diversity in HIV-1 vaccine design
    3.8 Remaining challenges
Chapter 4: Phylogenetic Dependency Networks
    4.1 Phylogenetically corrected distributions for one predictor trait
    4.2 Phylogenetically corrected distributions for more than one predictor trait
    4.3 q-values
    4.4 Model Details
Chapter 5: Evaluation and Application of the Univariate Model
    5.1 Technical details
    5.2 Experiments with synthetic data
    5.3 Application 1: Effect of immune pressure on HIV evolution
    5.4 Application 2: Pairwise correlations between amino acids in HIV
    5.5 Application 3: Genomic search for genotype-phenotype associations in Arabidopsis thaliana
    5.6 Studies using the univariate conditional evolution model
    5.7 Limitations of univariate conditional evolution model
    5.8 Conclusions
Chapter 6: Evaluation of Multivariate Models
    6.1 Technical Details
    6.2 Model validation on synthetic data
Chapter 7: Using PDNs to Infer Patterns of Immune Escape and Covariation in HIV
    7.1 Technical details
    7.2 Phylogenetic dependency network for Gag p17 and p24
    7.3 Discussion
Chapter 8: Summary
    8.1 Limitations and future directions
Glossary
Bibliography
Appendix A: Next Generation Sequencing: Extending the model to single genome sequences
    A.1 Likelihood calculation
    A.2 Expectation maximization
Appendix B: On computing FDR for Fisher’s exact test
    B.1 Examples of FET for sequence data
    B.2 Background
    B.3 Computing pFDR for Fisher’s exact test
    B.4 Numerical results
    B.5 Creating synthetic data sets
    B.6 Proofs and Remarks
    B.7 Discussion

LIST OF FIGURES
2.1 Phylogeny confounds the comparative method
4.1 Phylogenetic dependency network
4.2 The univariate model
4.3 The multivariate model
4.4 Decision Tree leaf distribution
4.5 Noisy Add leaf distribution
5.1 PR and calibration curves on synthetic data
5.2 PR and calibration curves over different trees
5.3 PR curves for the real, full HLA-amino-acid data
5.4 Correlated amino-acid pairs in HIV-1 p6
5.5 GWAS for Arabidopsis bacterial response
6.1 p-value calibration
6.2 Noisy Add represents real data better than Decision Tree
6.3 Performance on HOMER data
6.4 Tree built from the combined HOMER and Durban cohorts
6.5 Performance on synthetic mixed clade data
6.6 Power to detect associations
7.1 Gag PDN for combined HOMER and Contract cohorts
7.2 Number of optimal epitopes found vs. q-value rank
8.1 Univariate model with linked predictors
8.2 Noisy Add model with linked and unlinked predictors
B.1 P-value histograms
B.2 Pooled vs. marginal p-values
B.3 The advantage of using π_0(α) computed using the filtering technique over not filtering. Because filtering only affects π̂_0, these gains result in proportionally reduced (yet conservative) pFDR estimates.
B.4 Estimated pFDR vs. true false discovery proportion
B.5 Power gains for proposed pFDR method

LIST OF TABLES
5.1 Predicted HLA-amino acid associations in Gag
7.1 HLA-codon associations in which consensus is the predicted resistant form
B.1 2 × 2 contingency table on binary variables X and Y
B.2 Outcomes when testing m hypotheses
B.3 Comparing π̂_0 estimations for synthetic data sets derived from the Epitope data with different true π_0. Storey’s method was evaluated at λ = 0.5.
B.4 Comparing π̂_0 estimations for the real data sets

ACKNOWLEDGMENTS
I would like first to thank my supervisor, David Heckerman, who has been an
invaluable mentor, manager, colleague and friend, as well as Larry Ruzzo, who has been
incredibly supportive of me throughout my rather non-traditional graduate school
experience. I would also like to thank Jim Mullins, for his collaboration, sharing his
data and post docs, and serving on my reading committee, and Elizabeth Thompson,
for serving as my GSR.
This work has been the result of numerous collaborations, for which I am
extremely grateful. In particular, this dissertation would be entirely different without
the collaboration of Zabrina Brumme, Chanson Brumme and Richard Harrigan, who
introduced me to (and continue my education on) HIV, have provided invaluable data
and ideas, and have critically read and contributed to just about all the work I have
done related to HIV. Special acknowledgment also goes to Philip Goulder and Philippa
Matthews, who provided invaluable data and ideas for the results in chapter 7 and
first suggested we try the Decision Tree model.
I have also had incredible support at MSR. In particular, Carl Kadie has been an
incredible coding resource. Much of the implementation is due directly to him, and
most everything else reflects his influence. In addition, the work in Appendix B was
done in close collaboration with Guy Shani, who developed the efficient algorithms
and performed the experiments. And of course Jennifer Listgarten, who generally
makes MSR a more interesting and enjoyable place to work.
Thanks to my parents for their unwavering support; to Richard Peterson, who
introduced me to biology; to Tom Cormen, who introduced me to computer science;
and to Bob Gross, who brought it together for me in computational biology and gave
me incredible research opportunities as an undergraduate; to Arijit Chakravarty, for
mentoring me and shaping my approach to research; to Scott Saponas, Jon Froehlich,
Seth Bridges and others, who made grad school fun and helped prove that CS grad
students can hold their own in IM sports; and to Kate, for absolutely everything.
This work was supported in part by funding from a Microsoft Research Graduate
Fellowship.

DEDICATION
To my amazing wife Kate, who is everything to me.

Chapter 1
INTRODUCTION
Since its identification as the pathogenic cause of Acquired Immunodeficiency Syndrome
(AIDS) in the early 1980s, Human Immunodeficiency Virus Type 1 (HIV-1)
has emerged as a major global pandemic with an estimated 33 million infected individuals
worldwide at the end of 2007 [221]. Specifically targeting the CD4+ subset
of T-lymphocytes (the so-called “helper T cells”), HIV-1 causes a progressive deterioration
of immune function, leaving the infected individual susceptible to a range of
opportunistic infections that eventually lead to AIDS and death. Although improvements
in antiretroviral therapy have dramatically reduced HIV-related morbidity and
mortality among those with access to treatment [172], the search for an effective
HIV-1 vaccine continues.
One of the enduring challenges facing HIV vaccine design is the remarkable rate
of viral mutation and adaptation that allows the virus to evade the adaptive immune
response of the host. As the immune system learns to target the virus, novel viral
mutations that allow the virus to escape the attack provide an advantage over virus
particles (virions) lacking the mutations. These escape mutations thus come to dominate
the viral population in the host via selection, and the immune system is left
learning to target what must appear to be a novel pathogen, causing the process to
repeat. The result has been devastating to HIV vaccine design. The goal of a vaccine
is to train the host immune system to recognize the virus before exposure, but when
the virus is constantly changing, it is extremely difficult to predict what the attacking
virus will look like, and thus, how to train the immune system. Even when monkeys
are inoculated and subsequently challenged with the same virus (so-called homologous
challenge), protection is often incomplete, in part because any virions that survive
the initial attack quickly adapt. So then, is the search for a vaccine futile?
Although the power of adaptation may seem insurmountable, it should not be
surprising that there is a large body of evidence that the space of viable mutations
is constrained. Indeed, through all the myriad mutations the result must still be
an infectious virion. It would thus seem that identification and characterization of
these constraints is both possible and necessary for the advancement of the field.
That, in a nutshell, is the purpose of this dissertation: to develop, test, and apply a
statistical model of evolution that can robustly identify patterns and constraints in
viral evolution. The result is promising. Although the patterns are complex and will
require significant follow-up to tease apart, they are also dense, suggesting a promising
consistency that may provide insight into weak points in viral adaptation and suggest
new targets for vaccine design.
It should be evident that adaptation is not a process unique to HIV, and thus the
models explored here may find many uses outside the realm of HIV. Nevertheless, the
rapid rate of HIV adaptation provides a unique opportunity to capture adaptation
in near real time and to analyze distinct populations that are isolated from each
other (by virtue of infecting different hosts), yet have had a chance to adapt to
their environment. Thus, the broad approach we take is to analyze a large cohort of
individuals who have been infected for some time (to allow the adaptations to arise)
and have not been exposed to antiretroviral therapy (which introduces a tremendous
source of selection pressure that may obliterate the signal induced by the immune
system). We do so by employing statistical models of evolution that attempt to
identify common patterns of selection and adaptation.
The statistical model we propose is the phylogenetic dependency network (PDN),
so-called because it is an adaptation of the dependency network [94] to an evolutionary
context. In brief, the PDN is a graphical model that relies heavily on local probability
distributions that are conditioned on an underlying phylogeny. The graphical
structure, as well as the parameters of the probability distributions, are learned from
the data, with the resulting structure indicating statistical dependencies among variables.
Although these dependencies cannot be interpreted as causal, they provide a
useful means to understanding potential interactions among the variables and suggest
simple experiments that can confirm specific biological interactions. We describe the
model in detail in chapter 4.
There are several possible probability distributions that fit naturally within the
PDN framework. Although we describe the details of the models in chapter 4, it
is useful to explore how the distributions work in practice. Chapter 5 is devoted to
the simpler model, which we call the univariate conditional adaptation model, as
it describes the adaptation of one trait in response to a single source of selection
pressure. We use synthetic data sets and real world analyses to explore the properties
of the model and compare it to previous approaches. Interestingly, this simple model
has proven quite useful in practice. We review some recent studies that have utilized
the univariate model in section 5.6. The more complete model we propose is the
multivariate conditional adaptation model, of which there are two variations. These
models incorporate multiple sources of selection pressure and are, in principle, better
able to describe dense interaction networks, in which each trait is influenced by myriad
sources of selection pressure. These models are explored in detail in chapter 6.
In chapter 7, we apply the PDN to the study of HIV adaptation. In this work,
we combine the two largest antiretroviral-naïve cohorts, comprising a diverse set of
HIV sequences and host genetics, into a single analysis. Combined with the PDN,
we have unprecedented statistical power and precision to explore the patterns of HIV
adaptation in response to the immune system. To the biologist, this chapter may be
read as the main result of the dissertation.
For the interested reader, the appendices provide technical discussions of tangential
issues that arose as part of the dissertation research. Appendix A provides a
preliminary look at the technical details around extending the models discussed in
this dissertation to the case in which individual viral sequences are sampled from
each individual. In Appendix B, we consider the statistical characteristics of the false
discovery rate for the simplest probability distribution we encounter: the joint distribution
of two independent, identically distributed binary variables evaluated using
Fisher’s exact test (FET) [66].

Chapter 2
DETECTING ADAPTATION:
INTRODUCTION AND REVIEW
2.1 Selection and adaptation
Let us begin with a brief, informal introduction to natural selection and adaptation,
as the concepts and terminology will be useful in later discussions. The process of
natural selection and adaptation can be summarized as follows. In any population
of individuals, there is natural variation among those individuals arising from genetic
mutations, which arise randomly and are then passed on to offspring. Most
mutations are deleterious, resulting in individuals who are less fit than the rest of
the population, meaning that, on average, they tend to produce fewer offspring. Because
individuals who have these deleterious mutations produce fewer offspring, over
time these mutations will remain at low frequency or may even be eliminated
from the population. This process is referred to as negative or purifying selection. In
contrast, some mutations actually increase the ability of the individual to procreate.
Over time, individuals harboring these mutations come to dominate the population
and the mutation may even reach fixation, meaning that essentially all surviving
individuals have the mutation. This process is referred to as positive selection, and the
mutation (or set of mutations) that resulted in increased fitness is called an adaptation.
Finally, many mutations make no discernible difference to the individual. If
we were to follow the frequency of such mutations over a long period of time, we
would see the frequency of the mutation follow a random walk pattern, referred to
as neutral evolution. The process of selection and adaptation is thus an interactive
process between the population and the environment. Certain characteristics (traits)
of the environment favor some mutations over others. Such traits serve as a source of
selection pressure, and the resulting adaptations are thus in response to that selection
pressure. Thus, when the environment changes, the selection pressures may change,
favoring different adaptations.
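The three regimes described above (negative, positive and neutral selection) can be illustrated with a small simulation. The sketch below is not part of the dissertation's method: it assumes a simple Wright-Fisher-style haploid population, and the names expected_next_freq and simulate, the selection coefficient s, and all parameter values are hypothetical choices made for illustration only.

```python
import random


def expected_next_freq(p, s):
    """Expected frequency after one generation of selection.

    Assumption: carriers of the mutation have relative fitness 1 + s and
    non-carriers have fitness 1 (s < 0 deleterious, s = 0 neutral,
    s > 0 beneficial).
    """
    return p * (1 + s) / (1 + p * s)


def simulate(p0, s, pop_size, generations, rng):
    """Trajectory of the mutation frequency under selection plus drift.

    Each generation applies selection, then resamples pop_size individuals
    at random, which adds the random-walk component.
    """
    freqs = [p0]
    p = p0
    for _ in range(generations):
        p_sel = expected_next_freq(p, s)
        carriers = sum(rng.random() < p_sel for _ in range(pop_size))
        p = carriers / pop_size
        freqs.append(p)
    return freqs


rng = random.Random(0)  # fixed seed so that a run is repeatable
neutral = simulate(0.5, 0.0, 200, 50, rng)      # wanders with no trend
beneficial = simulate(0.5, 0.1, 200, 50, rng)   # tends to drift upward
deleterious = simulate(0.5, -0.1, 200, 50, rng) # tends to drift downward
```

With s = 0 the expected frequency is unchanged from one generation to the next, so any movement in the trajectory comes from resampling alone, which is the random walk referred to above.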
Where the study of adaptation becomes particularly interesting is when comparable
populations are compared in different environments. In this scenario, a mutation
that is neutral in environment A may be beneficial in environment B. Thus, over
time, this mutation will be far more prevalent in environment B than in environment
A. Furthermore, the implication is that there are specific characteristics of environment
B that interact (directly or indirectly) with the mutation. In the case of HIV,
these interactions prove vital for the design of an effective vaccine (see chapter 3).
To ground our example, suppose we are interested in trait Y, which can take on
values in {0, 1}. In addition, suppose there exists an environmental trait X ∈ {0, 1},
which exerts selection pressure on Y, such that when X = 1, there is a selective
advantage for Y = 1 over Y = 0. Then we have,
Pr{Y = 1 | X = 1} > Pr{Y = 0 | X = 1}.
If we sample enough individuals from the two different environments, we will be able
to see this correlation in the form of statistical dependence.
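As a small worked example of what such dependence looks like in sampled data, the sketch below tabulates individuals by (X, Y) and compares the empirical conditional frequencies. The counts are invented for illustration and do not come from any data set in this dissertation; conditional_freq is a hypothetical helper name.

```python
from fractions import Fraction


def conditional_freq(counts, x):
    """Empirical Pr{Y = 1 | X = x} from a dict of counts keyed by (x, y)."""
    with_y = counts[(x, 1)]
    without_y = counts[(x, 0)]
    return Fraction(with_y, with_y + without_y)


# Hypothetical numbers of sampled individuals, keyed by (X, Y).
counts = {(1, 1): 30, (1, 0): 10, (0, 1): 12, (0, 0): 28}

p_y1_given_x1 = conditional_freq(counts, 1)  # 30 of 40 individuals
p_y1_given_x0 = conditional_freq(counts, 0)  # 12 of 40 individuals

# The odds ratio equals 1 when X and Y are independent in the sample.
odds_ratio = Fraction(counts[(1, 1)] * counts[(0, 0)],
                      counts[(1, 0)] * counts[(0, 1)])
```

In this invented sample, Y = 1 is more common than Y = 0 when X = 1, and Y = 1 is also more common under X = 1 than under X = 0; the second comparison is what a test of association between X and Y examines.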
In this vein, the comparative method [92] seeks to identify correlated traits, with
the assumption that correlation implies the interaction of selection pressure and adaptation,
which may imply a specific function for the traits involved. In essence, one
simply samples a large number of traits from individuals, as well as a number of environmental
traits, then tests for correlations among all pairs of traits. Interactions that
achieve some significance threshold are then considered candidates for experimental
follow-up to determine the underlying process of selection and adaptation that leads
to the apparent correlation. The challenge with the comparative method, as elegantly
described by Felsenstein [61], lies in the confounding effect of the evolutionary history
of the traits, which tends to make traits look more correlated than they really are.
2.2 Phylogeny confounds the comparative method
The relevant question then is how should we test for significance? Suppose we are
considering the two binary traits X and Y. To determine whether these two variables
are associated, we could count the number of individuals with and without each trait
and apply a simple statistical test for independence, such as Fisher’s exact test. This
procedure, however, ignores the phylogenetic structure among the sequences [61].
Suppose these sequences have the phylogeny shown in Figure 2.1a. In essence,
there are two clusters of individuals where individuals within a cluster are similar
to each other but quite different from those in the other cluster. Now suppose we
observe that traits X and Y are present in the two individuals on the top and absent
in the two individuals on the bottom, as shown in Figure 2.1a. The observations of
the traits are well explained by the phylogeny alone and should not be treated as
independent observations. Consequently, the application of Fisher’s exact test or some
other test that ignores the phylogenetic structure would overcount these observations
when determining the correlation of X and Y. Such overcounting will lead to an
overestimation of the statistical significance, leading to a surprising number of false
positives. In contrast, suppose we make the observations shown in Figure 2.1b. Here,
observing the presence and absence of the trait Y in the same branch of the phylogeny
is quite surprising, until the observations of X are taken into account. In this case,
the application of a simple test would undercount the observations when determining
the correlation of X and Y, leading to an underestimation of statistical significance
and potentially increasing the number of false negatives.
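The overcounting problem can be made concrete with a short calculation. The sketch below implements a two-sided Fisher's exact test using only the Python standard library and applies it to two hypothetical tables: four individuals arranged as in the first scenario above (both traits present in one cluster, both absent in the other), and the same two clusters sampled ten individuals deep. The function name and the sample sizes are assumptions made for illustration; they are not taken from the dissertation's data or code.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Rows index X (present, absent) and columns index Y (present, absent).
    The p-value sums the hypergeometric probabilities of all tables with
    the same margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(k):
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= p_obs * (1 + 1e-9))


# Two individuals per cluster: X and Y present in one cluster only.
p_four_tips = fisher_exact_two_sided(2, 0, 0, 2)

# Same two clusters, ten individuals sampled from each.
p_twenty_tips = fisher_exact_two_sided(10, 0, 0, 10)
```

If X and Y each changed only once, on the branch separating the two clusters, then both tables reflect the same single evolutionary event. The test nonetheless treats every individual as an independent observation, so its p-value falls from about 0.33 to about 1e-5 simply because more descendants of the same ancestors were sampled.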
Simple statistical methods such as Fisher’s exact test assume the data to be infinitely
exchangeable or independent and identically distributed (IID). Although sequence
data and other biological data are IID a priori, they are not IID once we learn
their hierarchical structure. Furthermore, as we have just seen, this structure can
[Figure 2.1: Phylogeny confounds the comparative method (panels a and b; tips labeled with traits x and y).]
