"Frontmatter". In: Plant Genomics and Proteomics
Christopher A. Cullis - Plant Genomics and Proteomics-J. Wiley & Sons (2004)
4. GENE DISCOVERY

IDENTIFICATION OF GENES FROM SEQUENCE DATA

EXPRESSED SEQUENCE TAGS

One measure of whether a sequence within the genome is a gene is its identification within an RNA population. Therefore, the cloning and sequencing of RNAs is one way to identify genes. Short stretches of sequence derived from cDNAs are referred to as expressed sequence tags (ESTs). Gene expression can depend both on the type of tissue and on the environment in which that tissue finds itself, so many tissues must be sampled under a wide range of growth and challenge conditions to identify all, or most, of the genes. As described in Chapter 2, the population of sequences represented in a cDNA library reflects the abundance of the RNAs present in the tissue sampled. Therefore, genes that are expressed at low levels may be missed in projects that sequence cDNAs generated from unfractionated RNA.

THE GENERATION OF ESTs

A pipeline for the informatic analysis of EST sequences is shown in Figure 4.1 (adapted from http://www.zmdb.iastate.edu/zmdb/EST/assembly.html). The RNA is isolated and reverse transcribed into cDNA. The cDNA clones are sequenced by single-pass sequencing reactions from the 5′ end, the 3′ end, or both ends of the cDNA. The sequences are then clustered to identify a series of tentative unique genes (TUGs) or tentative contigs (TCs) that are present in the RNA population being sequenced. This clustering identifies the number of different RNAs present in the initial sample. The TUGs/TCs can then be compared with the current databases to identify which of them have already been described in the species under consideration and which are still absent from the databases. Where hits to previously reported sequences occur, the new assembly is collapsed into a single consensus sequence and added to the database.
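The clustering step can be illustrated with a toy sketch. The Python below is not the pipeline behind Figure 4.1; it assumes, purely for illustration, that two ESTs belong to the same cluster when they share at least one exact k-mer (here k = 8), and it groups reads by single linkage. The function names, the read names, the sequences, and the k-mer criterion are all hypothetical; real EST assemblers rely on overlap alignment and quality-aware consensus calling.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8):
    """Group ESTs that share at least one exact k-mer (toy criterion).

    ests: dict mapping a read name to its sequence.
    Returns (contigs, singlets): contigs is a list of sorted name lists
    with two or more members; singlets is a sorted list of lone reads.
    """
    parent = {name: name for name in ests}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Remember the first read seen for each k-mer; any later read that
    # contains the same k-mer is merged into that read's cluster.
    first_owner = {}
    for name, seq in ests.items():
        for km in kmers(seq.upper(), k):
            if km in first_owner:
                parent[find(name)] = find(first_owner[km])
            else:
                first_owner[km] = name

    groups = defaultdict(list)
    for name in ests:
        groups[find(name)].append(name)

    contigs = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    singlets = sorted(g[0] for g in groups.values() if len(g) == 1)
    return contigs, singlets


# Hypothetical reads: the two leaf reads overlap by 12 bases.
example_ests = {
    "leaf_001": "ATGGCGTACGTTAGCCTAGGA",
    "leaf_002": "GTTAGCCTAGGATTTCGCAA",
    "root_001": "CCCCAAAATTTTGGGGCCAATT",
}
example_contigs, example_singlets = cluster_ests(example_ests)
print(example_contigs)   # [['leaf_001', 'leaf_002']]
print(example_singlets)  # ['root_001']
```

Clusters with two or more members correspond to contigs and unclustered reads to singlets, the two classes that together make up the tentative unique genes.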
Where hits occur to ESTs from other organisms, a possible function may be ascribed to the sequence. The sequencing of any given sample is continued until the rate of finding new sequences drops below an acceptable level (for example, when 50% of all sequences are already present either in these data or in previous EST collections).

The cDNAs from various tissues or treatments are sequenced to determine the level of novel sequences in each sample. As before, sequencing continues until the rate of gene discovery drops below an acceptable level. This method will generate a huge redundancy of highly abundant RNAs. What are likely to be missed are those RNAs that are present in low abundance and those genes that are expressed only in specialized cells. Therefore, techniques facilitating the isolation of specific tissues or cells, such as laser capture microscopy and RNA amplification, may help in the identification of genes that are expressed at low levels or in very few cells. The dissection or isolation of specific tissues such as the peltate trichome glands (which are aggregates of 1–9 specialized cells suspended on a stalk above the aerial surfaces of many plants), where important secondary products are synthesized, should lead to the isolation of the genes involved in these metabolic pathways (Wang et al., 2001).

The high-throughput EST sequencing approach represents a relatively low-cost method to identify a large number of transcripts in an organism as well as generating information about the patterns of gene expression specific to certain tissues, developmental stages, and physiological conditions. The value and importance of ESTs is indicated by the numbers in GenBank dbEST (http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html). The release of 2/14/2003 lists 14,411,241 ESTs, of which about 2,800,000 are from plants, with wheat, barley, and soybeans leading the list (Table 4.1).

[Figure 4.1 flowchart labels: Leaves, Roots, Disease leaves, Trichomes; RNA isolation; cDNA generation; Cloning; Single-pass sequencing; Sequences; Clustering; Assembled ESTs; Singlet; Contig; BLAST; No hits; Hit(s); Collapse the contig; Tentative unique genes.]

FIGURE 4.1. EST pipeline for the acquisition and assembly of sequences. Contigs are EST clusters with two or more member ESTs. Singlets are ESTs that are not significantly similar to any other ESTs. The combined contigs and singlets and those with no BLAST hits represent a set of unique EST clusters called tentative unique genes. They are labeled "tentative" to indicate that they are still subject to change as new ESTs are added to the assembly.

As mentioned above, an inherent problem of EST sequencing projects is the generation of redundant sequences. One way of reducing redundant sequencing is to enrich the RNA populations for low-abundance transcripts. A number of normalization and subtraction methodologies for enriching these low-abundance RNAs before cloning are described in Chapter 2. Alternatively, abundant cDNA clones can be removed before sequencing by screening high-density cDNA filters with labeled RNA. The clones that hybridize strongly are eliminated, and the minimally hybridizing clones are rearrayed and sequenced. There will always be some redundancy irrespective of the method of enrichment, but this can be managed informatically. The clustering and assembly of individual ESTs into TUGs/TCs will result in decreased sequence redundancy and a final consensus sequence that should be both more accurate and longer than any of the underlying individual ESTs in the database.

The ultimate goal of EST projects is the development of a unigene set.
The unigene set should eventually contain all the genes for the organism, but this complete compendium is unlikely to be assembled from EST data alone because of the need to sample every possible tissue and find every transcript. However, the clustering algorithms will group all the transcripts from a gene family together and generate a single consensus sequence from the EST data. Therefore, any information in the EST data that distinguishes the members of a gene family whose expression is restricted to a particular cell or tissue type will be lost in the development of the unigene set and will have to be recaptured elsewhere.
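The loss of family-member information can be seen in a small sketch. The Python below assumes, for illustration only, that reads from two hypothetical members of a gene family have already been aligned to equal length, and it builds the consensus by a simple column-wise majority vote. That vote is a stand-in, not the consensus method of any particular assembler, and the function names and sequences are invented for the example.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Column-wise majority consensus of equal-length, pre-aligned reads."""
    lengths = {len(read) for read in aligned_reads}
    if len(lengths) != 1:
        raise ValueError("reads must be pre-aligned to the same length")
    consensus = []
    for column in zip(*aligned_reads):
        # most_common(1) gives the most frequent base in this column.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


def positions_lost(aligned_reads, consensus):
    """Positions where at least one read disagrees with the consensus."""
    return [
        i
        for i, column in enumerate(zip(*aligned_reads))
        if any(base != consensus[i] for base in column)
    ]


# Hypothetical family: three reads from member A, one read from member B
# that differs at a single position (index 6).
reads = [
    "ATGGCTAAGC",  # member A
    "ATGGCTAAGC",  # member A
    "ATGGCTAAGC",  # member A
    "ATGGCTCAGC",  # member B
]
consensus = majority_consensus(reads)
print(consensus)                         # ATGGCTAAGC
print(positions_lost(reads, consensus))  # [6]
```

The consensus is identical to member A, so the base that marks member B survives only in the underlying reads, which is why such differences have to be recovered from the individual ESTs rather than from the unigene set.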