"Frontmatter". In: Plant Genomics and Proteomics


Christopher A. Cullis - Plant Genomics and Proteomics-J. Wiley & Sons (2004)

4. GENE DISCOVERY

IDENTIFICATION OF GENES FROM SEQUENCE DATA

EXPRESSED SEQUENCE TAGS
One measure of whether a sequence within the genome is a gene is its
identification within an RNA population. Therefore, the cloning and sequencing
of RNAs is one way to identify genes. Short stretches of RNA sequences
derived from cDNAs are referred to as expressed sequence tags (ESTs). Gene
expression can be dependent both on the type of tissue and on the environment
in which that tissue finds itself, so a wide sampling of many tissues in
various growth and challenge conditions must be undertaken to identify all,
or most of, the genes. As described in Chapter 2, the population of sequences
represented in a cDNA library is a reflection of the abundance of the RNAs
present in the tissue sampled. Therefore, genes that are expressed at low
levels may be missed in projects that sequence cDNAs that are generated
from unfractionated RNA.
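The effect of abundance on detection can be illustrated with a simple sampling model. The sketch below assumes, as an idealization that is not part of the text, that every sequenced clone is an independent random draw from the library and that a transcript makes up a fixed fraction of the clones. The function names and the example abundances are hypothetical.

```python
import math


def prob_missed(fraction: float, clones_sequenced: int) -> float:
    """Probability that a transcript is absent from every sequenced clone.

    Assumes independent draws, with the transcript accounting for
    `fraction` of the clones in the library (an idealized model).
    """
    return (1.0 - fraction) ** clones_sequenced


def clones_needed(fraction: float, detection_prob: float) -> int:
    """Smallest number of clones that gives at least `detection_prob`
    chance of seeing the transcript once or more under the same model."""
    return math.ceil(math.log(1.0 - detection_prob) / math.log(1.0 - fraction))


if __name__ == "__main__":
    # Hypothetical abundances: 1 clone in 100 versus 1 clone in 100,000.
    for f in (1e-2, 1e-5):
        print(f, prob_missed(f, 5000), clones_needed(f, 0.95))
```

Under these assumptions a transcript present in 1 of every 100,000 clones is still more likely than not to be absent after 5,000 single-pass reads, which is the situation the paragraph above describes.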
THE GENERATION OF ESTs
A pipeline for the informatic analysis of EST sequences is shown in
Figure 4.1 (adapted from http://www.zmdb.iastate.edu/zmdb/EST/assembly.html).
The RNA is isolated and reverse transcribed into cDNA. The cDNA clones are
sequenced by performing single-pass sequencing reactions from either the 5′
or 3′ ends of the cDNA or from both ends. The sequences are then clustered
to identify a series of tentative unique genes (TUGs) or tentative contigs
(TCs) that are present in the RNA population that is being sequenced. This
clustering will identify the number of different RNAs present in the initial
sample. The TUGs/TCs can then be compared with the current databases to
identify which of these have already been described in the species under
consideration and which are still absent from the current databases. Where
hits to previously reported sequences occur, the new assembly is collapsed
into a single consensus sequence and added to the database. Where hits occur
to ESTs from other organisms, a possible function may be ascribable to the
sequence.
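The clustering step can be sketched in a few lines. The example below is a toy stand-in, not the method used by the pipeline in Figure 4.1: it links two ESTs whenever they share one exact k-mer and merges the links with a union-find structure. Real assemblers rely on alignment and also handle reverse-complemented reads, which this sketch ignores. The function names, the value of `k`, and the sequences are all hypothetical.

```python
from collections import defaultdict


def cluster_ests(ests: dict[str, str], k: int = 12) -> list[set[str]]:
    """Group ESTs that share at least one exact k-mer (toy similarity rule)."""
    parent = {name: name for name in ests}

    def find(x: str) -> str:
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen: dict[str, str] = {}  # k-mer -> first EST in which it occurred
    for name, seq in ests.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in first_seen:
                # Shared k-mer: merge the two clusters.
                parent[find(name)] = find(first_seen[kmer])
            else:
                first_seen[kmer] = name

    clusters: dict[str, set[str]] = defaultdict(set)
    for name in ests:
        clusters[find(name)].add(name)
    return list(clusters.values())


def split_contigs_singlets(clusters: list[set[str]]):
    """Contigs have two or more member ESTs; singlets have exactly one."""
    contigs = [c for c in clusters if len(c) >= 2]
    singlets = [c for c in clusters if len(c) == 1]
    return contigs, singlets


if __name__ == "__main__":
    gene_a = "ATGGCGTACGTTAGCCTAGGCTAAC"  # made-up transcript
    ests = {
        "est1": gene_a[:18],               # read from one end
        "est2": gene_a[8:],                # overlapping read from the other end
        "est3": "TTTTGGGGCCCCAAAATTTGGC",  # unrelated made-up read
    }
    print(split_contigs_singlets(cluster_ests(ests, k=8)))
```

The contigs and singlets returned here correspond to the tentative unique genes of the text; each would then be compared against existing databases.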
The sequencing of any given sample is continued until the rate of finding
new sequences drops below an acceptable level (for example, until 50% of all
new sequences are already present either in these data or in previous EST
collections).
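This stopping rule can be expressed as a small batch-wise check. The sketch below assumes that each read has already been assigned to a cluster (for instance by a clustering step) and that reads are examined in fixed-size batches; the batch size, the function name, and the parameter `max_known_fraction` are illustrative assumptions, with the default of 0.5 mirroring the 50% example above.

```python
def sequencing_stop_point(cluster_ids, known_clusters=(), batch_size=100,
                          max_known_fraction=0.5):
    """Return how many reads were processed when the rule says to stop.

    A read counts as "known" if its cluster was seen earlier in this run
    or is listed in `known_clusters` (e.g. previous EST collections).
    Returns None if no batch reaches the threshold.
    """
    seen = set(known_clusters)
    for start in range(0, len(cluster_ids), batch_size):
        batch = cluster_ids[start:start + batch_size]
        known = 0
        for cid in batch:
            if cid in seen:
                known += 1
            else:
                seen.add(cid)
        if known / len(batch) >= max_known_fraction:
            return start + len(batch)
    return None


if __name__ == "__main__":
    # Hypothetical cluster assignments for twelve reads, in sequencing order.
    reads = ["a", "b", "c", "d", "a", "b", "e", "f", "a", "b", "c", "g"]
    print(sequencing_stop_point(reads, batch_size=4))
```

A stricter or looser threshold simply moves the point at which sequencing of that sample would be halted.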
The cDNAs from various tissues or treatments are sequenced to determine
the level of novel sequences that are in each sample. As before,
sequencing continues until the rate of gene discovery drops below an
acceptable level. This method will generate a huge redundancy of sequences
from highly abundant RNAs. What are likely to be missed are those RNAs that
are present in low abundances and those genes that are only expressed in
specialized cells. Therefore, techniques facilitating the isolation of
specific tissues or cells, such as laser capture microdissection and RNA
amplification, may help in the identification of genes that are expressed at
low levels or in very few cells. The dissection or isolation of specific
tissues such as the peltate trichome glands (which are aggregates of 1–9
specialized cells suspended on a stalk above the aerial surfaces of many
plants), where important secondary products are synthesized, should lead to
the isolation of the genes involved in these metabolic pathways (Wang et al.,
2001).
[Figure 4.1: flow diagram. Labels: Leaves, Roots, Disease leaves, Trichomes;
RNA isolation; cDNA generation; Cloning; Single-pass sequencing; Sequences;
Clustering; Assembled ESTs (Contig, Singlet); BLAST (Hit(s), No hits);
Collapse the contig; Tentative unique genes.]
FIGURE 4.1. EST pipeline for the acquisition and assembly of sequences. Contigs
are EST clusters with two or more member ESTs. Singlets are ESTs that are not
significantly similar to any other ESTs. The combined contigs and singlets and
those with no BLAST hits represent a set of unique EST clusters called tentative
unique genes. They are labeled “tentative” to indicate that they are still
subject to changes as new ESTs are added to the assembly.

The high-throughput EST sequencing approach represents a relatively
low-cost method to identify a large number of transcripts in an organism
as well as generating information about the patterns of gene expression
specific to certain tissues, developmental stages, and physiological
conditions. The value and importance of ESTs is indicated by the numbers in
GenBank dbEST (http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html).
Release 2/14/2003 lists 14,411,241 ESTs, of which about 2,800,000 are
from plants, with wheat, barley, and soybeans leading the list (Table 4.1).
As mentioned above, an inherent problem of EST sequencing projects is
the generation of redundant sequences. One way of reducing redundant
sequencing is to enrich the RNA populations for low-abundance transcripts.
A number of normalization and subtraction methodologies for enrichment
of these low-abundance RNAs before cloning are described in Chapter 2.
Alternatively, abundant cDNA clones can be removed before sequencing by
screening high-density cDNA filters with labeled RNA. The clones that
hybridize strongly are eliminated, and the minimally hybridizing clones
are rearrayed and sequenced. There will always be some redundancy
irrespective of the method of enrichment, but this can be managed
informatically. The clustering and assembly of individual ESTs into TUGs/TCs
will result in decreased sequence redundancy and a final consensus sequence
that should be both more accurate and longer than any of the underlying
individual ESTs in the database. The ultimate goal of EST projects is the
development of a unigene set. The unigene set should eventually contain all
the genes for the organism, but this complete compendium is unlikely to be
assembled from just EST data because of the need to sample every possible
tissue and find every transcript. However, the clustering algorithms will
group together the transcripts from all members of a gene family and
generate a single consensus sequence from the EST data. Therefore, any
information in the EST data that identifies the differing members of a gene
family whose expression is restricted to a particular cell or tissue type
will be lost in the development of the unigene set and will have to be
recaptured elsewhere.
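Both effects of building a consensus, the gain in accuracy and the loss of family-member information, can be seen in a small example. The sketch below uses a column-wise majority vote over reads that are assumed to be already aligned and of equal length; this is a simplification chosen for illustration, not the assembly method of any particular pipeline, and the reads and function names are made up.

```python
from collections import Counter


def majority_consensus(aligned_reads: list[str]) -> str:
    """Column-wise majority vote over equal-length, pre-aligned reads."""
    if not aligned_reads:
        raise ValueError("need at least one read")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be pre-aligned to the same length")
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


def variable_columns(aligned_reads: list[str]) -> list[int]:
    """Positions at which the reads disagree with one another."""
    return [i for i, col in enumerate(zip(*aligned_reads)) if len(set(col)) > 1]


if __name__ == "__main__":
    reads = [
        "ACGTACGT",  # family member 1
        "ACGTACGT",  # family member 1
        "ACCTACGT",  # family member 1 with a sequencing error at position 2
        "ACGTTCGT",  # hypothetical family member 2, differs at position 4
    ]
    print(majority_consensus(reads))
    print(variable_columns(reads))
```

The vote removes the single-read error, but it also removes the one position that distinguished the second family member. Recording the variable columns alongside the consensus is one possible way to keep a pointer back to that information.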
