"Frontmatter". In: Plant Genomics and Proteomics
Download 1.13 Mb. Pdf ko'rish
|
Christopher A. Cullis, Plant Genomics and Proteomics, J. Wiley & Sons (2004)
DATABASES AND INFORMATICS

The high-throughput methodologies that have been developed for both DNA and proteomics have highlighted the need for sophisticated informatics to deal with the data. It is also essential to define the appropriate structures for the databases that house all this information, so that the data are accessible both immediately and in an archival form.

DATABASES

The creation of biological databases represents a fundamental change in the way biological information is disseminated. Future advances in biology will depend, in large part, on the improvement of critical databases. However, organism-specific researchers want to generate and use data locally, annotate them as needed, and answer very specific questions driven by physical experimentation, even though these experimental data and findings are often shared with the global community.

[FIGURE 2.7. Proteomics experimental flow. The isolation of proteins is followed by their separation by 2-D gel electrophoresis. The patterns of proteins are compared, and those of interest are excised, fragmented, and separated by mass spectrometry. The amino acid composition or sequence is determined (depending on the particular MS technology applied), and the databases are searched to identify the protein. The actual amino acid sequence is important for those species for which there is little genomic or cDNA sequence.]
This larger group of researchers, namely the global community, requires access to large amounts of data to address questions that might be of limited interest to the original producers. Furthermore, the tools and data formats applicable to this high-level sharing may be very different from those applied in the context of the original data production. The underlying challenge, therefore, is to integrate diverse data and data types in order to provide a more complete and consistent view of the information contained therein, and to provide the means to increase the utility of these resources as the quantity of data increases in the future.

Currently, however, the distributed data resources of biology, and in particular of plant genomics, share a number of characteristics that make interrogation and analysis problematic. In many instances the data are held in flat-file structures, and either there is no separate schema for the metadata or such a schema is not available. The data are accessed through varying call-based interfaces rather than through a declarative query language, which precludes the use of software agents and requires human-computer interaction.

Indeed, the data collections that result from the genomics revolution demand a change in the way biological data are disseminated. All databases should allow the broadest access possible to accommodate these altered data usages, and a standard format must therefore be developed to facilitate this access. One consideration for such formats is that the associated metadata must also be accessible, to enable queries by machine agents as well as by individuals. Wherever possible, the development and use of organism- and/or database-independent software should be encouraged. This will not diminish the importance of organism-dependent repositories but will enable the data to be used more widely and will increase their importance.
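The contrast drawn above between call-based interfaces and a declarative query language can be sketched briefly. In the example below, the table and column names (and the marker records) are invented for illustration; the point is only that a declarative query states *what* is wanted and leaves retrieval to the engine, which is exactly what a machine agent needs.

```python
# Illustrative only: a hypothetical marker table queried declaratively,
# as opposed to hand-written parsing of an undocumented flat file.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE markers (name TEXT, chromosome TEXT, position REAL)")
conn.executemany(
    "INSERT INTO markers VALUES (?, ?, ?)",
    [("umc1234", "1", 45.2), ("bnlg1556", "1", 61.7), ("phi029", "3", 12.0)],
)

# Declarative: state what is wanted; the engine plans the retrieval.
# A software agent can issue this query without human mediation.
rows = conn.execute(
    "SELECT name FROM markers WHERE chromosome = '1' ORDER BY position"
).fetchall()
print([r[0] for r in rows])  # ['umc1234', 'bnlg1556']
```

A call-based flat-file interface, by contrast, forces every consumer to re-implement the parsing and filtering logic, which is why the text argues for declarative access plus published metadata schemas.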
The essential information associated with these data repositories, or with particular entries within the databases, should also be reported. For example, the information associated with comparative genetic mapping in plants could include the raw segregation data for the individual mapping populations, the explicit criteria used to determine whether two markers represent orthologous loci, and the sequences of all DNA-based markers as GenBank files. Trace files for sequence-based polymorphisms should be archived by mapping laboratories for future access, and the underlying information used to construct the physical maps (e.g., FPC fingerprints, BAC end sequences, BAC hybridization) could be made publicly available in a project database.

The structure of the databases and the information contained therein must be supported by appropriate documentation and standard operating procedures for both experimentation and analysis. A set of criteria somewhat like those developed for microarrays, perhaps in the form of the minimal information associated with a functional genomics experiment (MIAFGE), would go some way toward solving this problem. A start on generating such metrics has been made by the Plant Genome Research Program (http://plantgenome.sdsc.edu/Awardees Meeting/Bioinformatics_and_Databases/).

Given a collective effort, the bioinformatics landscape could be transformed from a small number of insular database projects into a large number of open, interoperable data services that together would form the fabric of a new biological data infrastructure. This transformation would require a change of emphasis, reducing the effort on species-specific databases and redeploying it to the development of a biological data service infrastructure whose deliverables would be portable, general-purpose software and standards, made freely available to the academic community as well as to industry.
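The kind of machine-readable experiment record argued for above might look like the following sketch. Every field name and value here is invented for illustration; MIAFGE is only a suggestion in the text, not a published standard, and the GenBank accessions are hypothetical.

```python
# A purely illustrative metadata record for a comparative mapping
# experiment; field names are invented, not part of any standard.
import json

mapping_experiment = {
    "experiment_type": "comparative genetic mapping",
    "mapping_population": {
        "species": "Zea mays",          # example organism
        "size": 94,                      # number of individuals
        "raw_segregation_data": "segregation_scores.tsv",
    },
    # Explicit, queryable orthology criteria, as the text recommends:
    "orthology_criteria": "reciprocal best BLAST hit, E < 1e-20",
    "marker_sequences": ["GenBank:AY123456"],  # hypothetical accession
    "archived_trace_files": "traces/",   # for sequence-based polymorphisms
    "physical_map_evidence": ["FPC fingerprints", "BAC end sequences"],
}

# Serializing to a standard format makes the record accessible to
# machine agents as well as to individual researchers.
print(json.dumps(mapping_experiment, indent=2))
```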
Databases such as ZMDB and Gramene, which are developing both the tools and the open structures to allow the broadest access, are examples of these efforts.

INFORMATICS TOOLS

A range of informatics tools is needed to analyze this massive generation of data. For nucleic acids, these tools range from those dealing with sequence quality and assembly to search engines for comparing sequences, and they also include programs, trained to identify genes, that can be used to annotate DNA sequences.

SEQUENCE QUALITY

Generally, sequence quality is checked by the length of the reads and by quality score assignment. Phred reads DNA sequencer trace data, calls bases, assigns quality values to the bases, and writes the base calls and quality values to output files (Ewing et al., 1998). These files can be in any one of three formats: FASTA, PHD, or SCF. Quality values for the bases are written to both FASTA and PHD format files. These files can be used by the Phrap sequence assembly program to increase the accuracy of the assembled sequence.

Phrap is a leading program for DNA sequence assembly. It is used to locate overlapping regions within individual sequences and assemble them into longer contiguous sequences (contigs). Phrap is most commonly used for assembling data from shotgun sequencing but can also be used for EST clustering, genotyping, and identifying sequence polymorphisms. Phrap uses Phred's quality scores to determine a highly accurate consensus sequence by examining all the individual sequences at a given position. This approach is especially important in regions of low coverage or of systematic errors. The quality of the consensus sequence is also estimated from the quality information of the individual sequences.

A more common program for EST clustering is CLUSTAL W (Thompson et al., 1994).
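The Phred quality values mentioned above have a simple meaning: a quality value Q encodes the estimated probability p that the base call is wrong, via Q = -10 log10(p). A minimal sketch of the conversion (the function names are my own, not Phred's):

```python
import math

def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality value Q to the estimated
    probability that the base call is wrong: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> int:
    """Inverse mapping: Q = -10 * log10(p)."""
    return round(-10 * math.log10(p))

# Q20 means a 1-in-100 chance of a wrong call; Q30 means 1 in 1,000.
print(phred_to_error_prob(20))     # 0.01
print(error_prob_to_phred(0.001))  # 30
```

This is why Phrap can weight individual reads by their Phred scores when it builds a consensus: a Q40 base (1-in-10,000 error odds) outvotes a Q10 base at the same position.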
This is a freely available and portable program for multiple sequence alignment. Because EST projects are essentially one-pass sequencing of cDNAs, it is important to cluster the derived sequences into contiguous sets that come from the same gene or from members of a gene family, yielding a consensus sequence of that transcript. The consensus sequence can be used, for example, to design overgo oligos for screening related BAC libraries and to design primers to test whether nucleotide polymorphisms within the clustered sequence are the result of single-nucleotide polymorphisms (SNPs) or of sequencing errors. The clustered consensus can also be longer than any of the individual ESTs, thereby extending the length of the known transcribed sequence.

As well as clustering ESTs, shotgun reads of genomic sequences must also be assembled. The program CAP4 (Huang et al., 2000) utilizes base quality values, forward-reverse constraints, and automatic clipping of poor-quality ends based on overlaps to assist in assembly and in the production of more accurate contigs. CAP4 generates contigs and consensus sequences that can be viewed and edited with Paracel's AssemblyView, the University of Washington's Consed (Gordon et al., 1998), or Staden's gap4 Contig Editor. Furthermore, CAP4 also generates valuable information concerning scaffolds, that is, which contigs are linked together based on constraints. This feature is especially important for ordering the contigs in low-pass sequencing projects and for the finishing phases, by providing information on which subclones are necessary to bridge the gaps.

The whole suite of informatics resources and needs is described in more detail in Chapter 9.
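The idea of a clustered consensus extending beyond any single EST can be sketched in a few lines. This is not CLUSTAL W's actual algorithm, only a toy majority-vote over already-aligned, gap-padded reads; columns where a minority base appears are the candidate SNPs (or sequencing errors) that primer-based tests would need to resolve.

```python
from collections import Counter

def consensus(aligned_reads):
    """Derive a consensus sequence from gap-padded, aligned EST reads
    by taking the most common non-gap character at each column."""
    length = max(len(r) for r in aligned_reads)
    out = []
    for i in range(length):
        column = [r[i] for r in aligned_reads if i < len(r) and r[i] != "-"]
        out.append(Counter(column).most_common(1)[0][0] if column else "-")
    return "".join(out)

# Three hypothetical one-pass EST reads from the same transcript:
# the consensus spans all of them, so it is longer than any single read.
reads = ["ACGTACGT----", "-CGTACGTAA--", "--GTACGTAACC"]
print(consensus(reads))  # ACGTACGTAACC
```

Note how the overlapping reads jointly cover 12 bases even though each individual read covers only 8, which is exactly the transcript-extension effect described above.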