"Frontmatter". In: Plant Genomics and Proteomics
Christopher A. Cullis - Plant Genomics and Proteomics-J. Wiley & Sons (2004)
INFORMATICS TOOLS

Once the data have been generated and stored in a suitable database, they usually need to be analyzed in some form. A large number of tools have been, and continue to be, developed for this purpose, covering the whole range of genomics data being acquired. Many of these tools are freely available over the Internet, and many of the sites that house the data also provide access to the tools, so that an entire analysis can be performed through a single site. In addition, many of the tools produce graphical representations of the results to ease display and interpretation.

INFORMATICS AND ANALYSES OF PLANT GENOMES

Genomic approaches to plant biology have provided a tremendous wealth of gene and gene sequence data for a wide variety of plant species. The availability of these large bodies of data provides both opportunities and challenges. Cross-referencing the sequence data with the expression data available from functional genomics and proteomics projects provides an opportunity to develop a complete catalog of gene function across the plant kingdom. A start has been made on this with the 2010 program at the NSF and its counterparts in Europe and Japan. Diverse data and data types must be integrated to provide a more complete and consistent view of all this information, so the annotation of the data must be consistent and identifiable. However, the current crop of bioinformatics software does not generate a single unambiguous conclusion, because different programs use different criteria either for identifying the features associated with a particular genomic sequence (for example, the identity of splice sites) or for recognizing the component open reading frames in a spliced gene.

SEQUENCE ASSEMBLY

Sequence assembly is a crucial early step in many genomics projects.
This is because most of the initial sequence “reads” are much shorter than the functional or important region of the genome being characterized. As described in Chapter 3, whole genome sequencing projects use two different approaches to generate the sequence: whole genome shotgun sequencing and minimum tiling path methods. In each case, however, the sequence reads must be assembled into large contiguous regions of genomic sequence.

The draft genomic sequences must then be finished. Finishing is the process of turning a rough draft assembly into a highly accurate contiguous DNA sequence with a defined maximum error rate. It therefore involves closing the remaining gaps, resolving ambiguities, and validating the assembly. The aim is for the overall sequence quality to conform to the Bermuda standard: every base confirmed by at least two templates, an error rate of no more than 1 bp in 10,000, and no gaps. Finishing usually applies multiple software tools iteratively to obtain contiguous sequence that meets the standards for double-strand coverage and sequence quality (http://www.genome.gov/10001812). Once the finished assembly is complete, it can be verified experimentally by PCR with selected primer pairs. These primers are often the same ones that were generated to close gaps in the original assembly. The process of finishing a draft assembly is shown in Figure 9.1.

[FIGURE 9.1. Genome sequence finishing. The rough draft assembly is converted into a fully assembled contiguous sequence by resequencing and gap closure. Flowchart steps: import data; data storage and assembly; resequencing of low-quality regions; reads into gaps; gap closure (Contig 1, gap, Contig 2). If clones are available that cover the gap, sequence these; if not, PCR across the gap and sequence from the PCR product or from subclones. The new sequence is added to data storage and assembly, followed by data review and quality control and submission to GenBank. (Adapted from http://www-shgc.stanford.edu/Seq/doepages/methodology.html).]

EST ASSEMBLY

Sequence assembly is also used to collapse the large libraries of expressed sequence tags (ESTs) to produce tentative contigs (TCs) or unigene sets (TUGs). Such assembly procedures eliminate the redundancy that exists in the EST data, where multiple ESTs may represent different parts of the same gene. The clustering can also extend the length of the sequence and would ideally result in the full length of the gene being included in the final assembled product. Collapsing the data sets in this way is important. For example, the 415,000 wheat ESTs in GenBank are unlikely to represent an equivalent number of separate genes, given our knowledge of the gene content of other plant genomes.

The unigene set derived from the assembly of ESTs is not a complete set of genes for that organism, because the cDNA libraries used to generate the ESTs will not reflect all the genes in a particular genome. The expression of some genes is too low or too transient to result in capture, whereas other genes are expressed only under certain growth conditions or in particular tissues. The unigene set is therefore simply the least redundant set of expressed sequences that can be arrived at by using all the available data.

The extent of the coding regions included in the unigene set can be extended by the use of gene predictions from complete genomic sequences. The matching region can be aligned, and gene prediction programs used to identify the possible transcription and translation start sites. This information can be used to design primers for reverse transcription PCR (RT-PCR) to test whether the predicted fragments can be amplified from mRNA populations.
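The way overlapping reads or ESTs collapse into a smaller set of contigs can be illustrated with a toy example. The sketch below is a minimal greedy suffix-prefix merge, written only for illustration: it is not the algorithm of any particular assembler or EST clustering pipeline, the function names and the minimum overlap of 5 bases are assumptions, and it ignores sequencing errors, reverse complements, and quality scores.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=5):
    """Collapse reads into contigs by repeatedly merging the best-overlapping pair."""
    # Remove the simplest redundancy first: exact duplicates and reads that
    # lie wholly inside another read.
    unique = sorted(set(reads))
    contigs = [r for r in unique if not any(r != o and r in o for o in unique)]
    while True:
        best_size, best_pair = 0, None
        for a, b in permutations(contigs, 2):
            size = overlap(a, b, min_len)
            if size > best_size:
                best_size, best_pair = size, (a, b)
        if best_pair is None:
            # No remaining pair overlaps by min_len or more.
            return sorted(contigs)
        a, b = best_pair
        contigs.remove(a)
        contigs.remove(b)
        contigs.append(a + b[best_size:])


if __name__ == "__main__":
    # Hypothetical transcript and five redundant reads sampled from it,
    # plus one read from an unrelated transcript.
    gene = "ATGGCGTACGTTAGCCTAGGATCCGATAA"
    reads = [gene[0:12], gene[7:20], gene[15:29], gene[7:20], gene[9:18],
             "TTGACCGGTTAACC"]
    for contig in greedy_assemble(reads):
        print(len(contig), contig)
```

In this toy data set six reads reduce to two contigs, one per transcript, which is the same kind of reduction that makes a unigene set much smaller than the EST collection it was built from.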
Again, informatic analysis of the regions 5′ to the tentative expressed region may indicate where and/or when such a transcript is expressed, thereby narrowing the range of tissues or developmental stages that must be screened to generate experimental evidence for the predicted expression. The 5′ regions are the least represented transcribed sequences, because cDNA, SAGE, and MPSS libraries are all biased toward the 3′ ends of the mRNAs as a result of the oligo-dT priming of first-strand synthesis.

THE RIGHT TOOL FOR THE RIGHT JOB

Many of the database sites contain a range of tools for sequence characterization and identification. The NCBI site (http://www.ncbi.nlm.nih.gov/BLAST/) makes available a range of tools, with regular updates of the documentation describing the programs, their uses, and the interpretation of the resulting data. Many of the specific databases also include BLAST as an integral option, although without the full range of functions or the tutorials.

BLAST

BLAST® (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases, regardless of whether the query is a protein or a nucleotide sequence. The BLAST programs have been designed for speed, with a minimal sacrifice of sensitivity to distant sequence relationships. The scores assigned in a BLAST search have a well-defined statistical interpretation, making real matches easier to distinguish from random background hits (http://www.ncbi.nlm.nih.gov/BLAST/blast_overview.html).

The BLAST algorithm does not return only those matches that support a correct functional assignment. BLAST has a large set of variable parameters that can be altered to produce a range of different matches. The database that is the target of the BLAST search also changes constantly as the available set of sequenced genes increases. This will also change the output, sometimes dramatically.
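The effect of a growing database on the statistics can be made concrete with the standard Karlin-Altschul relations, in which a raw score S is normalized to a bit score S' = (λS − ln K) / ln 2 and the expected number of chance hits is E = m · n · 2^(−S'), where m is the query length and n the database length. The sketch below is a simplified illustration rather than a reimplementation of BLAST: it ignores the effective-length corrections that the programs apply, and the function names and the numbers in the usage example are assumptions chosen for illustration.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw alignment score S to bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits scoring at least `bits`: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


if __name__ == "__main__":
    # Hypothetical hit: a 300-residue query and a 40-bit alignment, searched
    # against a database before and after a 100-fold growth in size.
    for db_len in (10**8, 10**10):
        print(db_len, expect_value(40.0, 300, db_len))
```

Because E is proportional to the database length, the same alignment becomes 100 times less significant after a 100-fold growth of the database, even though nothing about the query or the hit has changed.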
Thus even a program as basic as BLAST will not reproduce the same results unless users fix and record all of the program’s inputs, including the parameter settings and the version of the database searched. Although the NCBI site has a BLAST Program Selection Guide (reproduced in Tables 9.1 and 9.2) and a series of tutorials to aid in the use and interpretation of the data, the results generally still must be inspected manually to determine which matches are meaningful.

SEQUENCE MATCHING

Sequence similarity is a very general tool that forms the basis of many different biological sequence analyses. The basic tool for generating sequence matches is BLAST. Its usefulness is limited, however, by the traditional alignment style in which the results are presented. An alternative program such as Miropeats (http://www.genome.ou.edu/miropeats.html) discovers regions of sequence similarity within or among any set of DNA sequences and then represents the similar regions graphically. An example is given in Figure 9.2, which shows the distribution of repeated segments over a short stretch of flax sequence, with a pair of inverted repeats, a tandem oligo repeat, and a palindrome. Figure 9.3 shows the same region of Arabidopsis chromosome 2 (BAC F16P2) as Figure 1.4, but as a Miropeats pattern rather than with the repeats indicated by directed arrows. The enhancement that Miropeats offers over conventional DNA sequence comparisons is the summary of extensive large-scale sequence similarities on a single page of graphics. Miropeats can be used to compare the repeat structures of entire chromosomes, to visualize overlapping sequence fragments in a contig assembly project, or to compare the products of different contig assembly programs.

ANNOTATIONS OF GENOME SEQUENCE

Once the genome sequence has been assembled, the important features, such as genes, transposons, and repeats, must be placed on the sequence. This

TABLE 9.1. WHICH BLAST TOOL IS APPROPRIATE FOR SPECIFIC SEARCHES?

If your sequence is NUCLEOTIDE

Length | Database | Purpose | BLAST program
------ | -------- | ------- | -------------
20 bp or longer | Nucleotide | Identify the query sequence | MEGABLAST (accepts batch queries); Standard BLAST (blastn)
 | | Find sequences similar to query sequence | Standard BLAST (blastn)
 | | Find proteins similar to translated query in a translated database | Translated BLAST (tblastx)
 | Protein | Find proteins similar to translated query in a protein database | Translated BLAST (blastx)
7–20 bp | Nucleotide | Find primer binding sites or map short contiguous motifs | Search for short, nearly exact matches

If your sequence is PROTEIN

Length | Database | Purpose | BLAST program
------ | -------- | ------- | -------------
15 residues or longer | Protein | Identify the query sequence or find protein sequences similar to query | Standard Protein BLAST (blastp)
 | | Find members of a protein family or build a custom position-specific score matrix | PSI-BLAST
 | | Find proteins similar to the query around a given pattern | PHI-BLAST
 | Conserved Domains | Find conserved domains in the query | CD-search (RPS-BLAST)
 | Conserved Domains | Find conserved domains in the query and identify other proteins with similar domain architectures | Conserved Domain Architecture Retrieval Tool (CDART)
 | Nucleotide | Find similar proteins in a translated nucleotide database | Translated BLAST (tblastn)
5–15 residues | Protein | Search for peptide motifs | Search for short, nearly exact matches

From http://www.ncbi.nlm.nih.gov/BLAST/producttable.html

TABLE 9.2. SPECIALIZED DATABASES AVAILABLE FROM NCBI

Specialized Database Searches

Query | Database | Purpose | BLAST program
----- | -------- | ------- | -------------
Nucleotide or Protein | None | Compare the query and second sequence directly | BLAST 2 Sequences
 | The NCBI Draft Human Genome | Map the query sequence. Determine the genomic structure. Identify novel genes. | Human Genome BLAST
 | Mouse Genome | Map the query sequence. Determine the genomic structure. Identify novel genes. | Mouse Genome BLAST
 | Rat | Map the query sequence. Determine the genomic structure. Identify novel genes. | Rat Genome BLAST page
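
The selection logic of Table 9.1 can be written as a small lookup. The sketch below is only one possible encoding of the table: the function name, the dictionary layout, and the lower-case keys are assumptions, and because the table's length ranges share their boundary values (20 bp, 15 residues), the sketch arbitrarily assigns a query of exactly that length to the standard programs.

```python
# Entry used by Table 9.1 for short queries.
SHORT_QUERY_PAGE = "Search for short, nearly exact matches"

# query type -> (minimum length for the standard programs,
#                minimum length covered by the table at all)
LENGTH_LIMITS = {"nucleotide": (20, 7), "protein": (15, 5)}

# (query type, database type) -> programs listed in Table 9.1
STANDARD_PROGRAMS = {
    ("nucleotide", "nucleotide"): ["MEGABLAST", "blastn", "tblastx"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "protein"): ["blastp", "PSI-BLAST", "PHI-BLAST"],
    ("protein", "conserved domains"): ["CD-search (RPS-BLAST)", "CDART"],
    ("protein", "nucleotide"): ["tblastn"],
}


def suggest_programs(query_type: str, database: str, length: int) -> list:
    """Return the Table 9.1 entries for a query, or [] if the table has none."""
    standard_min, short_min = LENGTH_LIMITS[query_type]
    if length >= standard_min:
        return list(STANDARD_PROGRAMS.get((query_type, database), []))
    # Short queries are listed only against a database of the same type.
    if length >= short_min and database == query_type:
        return [SHORT_QUERY_PAGE]
    return []


if __name__ == "__main__":
    print(suggest_programs("protein", "nucleotide", 300))
    print(suggest_programs("nucleotide", "nucleotide", 12))
```

The lookup returns candidate programs only; choosing among them still depends on the purpose column of the table, and the matches themselves still need manual inspection.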