"Frontmatter". In: Plant Genomics and Proteomics


Christopher A. Cullis - Plant Genomics and Proteomics-J. Wiley & Sons (2004)

DATABASES
The necessary components of a database's structure are covered to some
extent in Chapter 2. The large volume of data must be organized in an
informative and relational manner so that it can be easily linked with other
relevant data sets. The data sets must also be structured to facilitate
automated data mining, which is the extraction of hidden predictive
information from databases.
The data, and the tools for mining and analyzing the data, can be made
available either through a general data warehouse, for example, the National
Center for Biotechnology Information (NCBI), or one focused on a specific
species or set of species, examples of which are The Arabidopsis Information
Resource (TAIR), Gramene, ZmDB, MaizeDB, and the Legume Information
System. 
The ultimate goal of data management is to provide data access, data
mining, and modeling support. Data mining, also known as knowledge
discovery in databases, uses sophisticated statistical analysis and modeling
techniques to uncover predictive patterns and relationships hidden in
organizational databases, patterns that ordinary methods might miss. This
activity usually includes a combination of machine learning, statistical
analysis, modeling techniques, and database technology.
The information necessary before setting up a database includes:
• The types of data that will be collected and stored: gel images,
sequence data including trace files, phenotypes, genetic maps, etc.
• The metadata associated with each form of stored data to be included:
a. That describing the collected data: all the necessary experimental
information relating to the stored images, the growth conditions,
and the developmental stages of the sampled tissue for RNAs used
in microarray experiments, etc.
b. That describing the derived data: the versions of the analysis
programs that produced the reported alignments, the versions
of the databases included in a clustering of ESTs, etc.
If the data are used in a subsequent analysis that modifies the previous
conclusions, both the original and the new conclusions should be stored and
versioned, so a returning user can trace the changes from his or her
previous download. In this way subsequent searches will be interpretable by
outside investigators.
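The versioning requirement can be made concrete with a small sketch. The following Python example is only one possible way to keep both the original and the revised conclusions together with the derived-data metadata listed above; the class names, field names, and sample values are hypothetical and do not come from any particular genomics database.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class DerivedResult:
    """One versioned conclusion plus the metadata needed to interpret it."""
    version: int
    conclusion: str
    program: str                        # analysis program that produced the result
    program_version: str
    source_db_versions: Dict[str, str]  # database name -> release that was searched


@dataclass
class ResultHistory:
    """Append-only history: earlier conclusions are never overwritten."""
    record_id: str
    versions: List[DerivedResult] = field(default_factory=list)

    def add(self, conclusion, program, program_version, source_db_versions):
        """Store a new conclusion as the next version and return it."""
        result = DerivedResult(
            version=len(self.versions) + 1,
            conclusion=conclusion,
            program=program,
            program_version=program_version,
            source_db_versions=dict(source_db_versions),
        )
        self.versions.append(result)
        return result

    def latest(self):
        """Return the most recent conclusion."""
        return self.versions[-1]

    def changes_since(self, version):
        """Return every result added after the version a user last downloaded."""
        return [r for r in self.versions if r.version > version]


# Hypothetical record: the same EST cluster analyzed twice.
history = ResultHistory("EST_cluster_0001")
history.add("similar to a putative kinase", "aligner", "1.0",
            {"protein_db": "release 1"})
history.add("similar to a receptor-like kinase", "aligner", "1.1",
            {"protein_db": "release 2"})
```

A user who downloaded version 1 could call `history.changes_since(1)` to see only what has changed, while the original conclusion remains available in `history.versions[0]`.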
Therefore, any specific genomics database must be developed so that it
is compatible with:
• The existing data generation and reporting systems
• The databases to be queried
• The target database for dissemination of results
The database must also be designed to enable analysis of the data along
multiple dimensions, integrating the variation associated with development,
tissue type, genotype, and growth conditions, among other possibilities.
That is, cross-queries must be facilitated both within and across databases.
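As an illustration of querying along several dimensions at once, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The two-table schema, the column names, and the sample values are assumptions made for the example, not a schema taken from any of the databases named in this chapter.

```python
import sqlite3

# In-memory database with a hypothetical two-table schema:
# one row per sample (the "dimensions") and one row per measurement.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (
    sample_id        INTEGER PRIMARY KEY,
    genotype         TEXT,
    tissue           TEXT,
    stage            TEXT,
    growth_condition TEXT
);
CREATE TABLE expression (
    sample_id INTEGER REFERENCES sample(sample_id),
    gene_id   TEXT,
    signal    REAL
);
""")
conn.executemany("INSERT INTO sample VALUES (?, ?, ?, ?, ?)", [
    (1, "wild type", "leaf", "seedling", "control"),
    (2, "wild type", "root", "seedling", "control"),
    (3, "mutant", "leaf", "seedling", "drought"),
])
conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", [
    (1, "gene_A", 10.0), (2, "gene_A", 2.0), (3, "gene_A", 30.0),
    (1, "gene_B", 5.0), (3, "gene_B", 5.5),
])

ALLOWED_DIMENSIONS = {"genotype", "tissue", "stage", "growth_condition"}


def mean_signal(conn, gene_id, **filters):
    """Average signal for a gene, restricted by any combination of dimensions."""
    unknown = set(filters) - ALLOWED_DIMENSIONS
    if unknown:
        raise ValueError(f"unknown dimension(s): {sorted(unknown)}")
    where = ["e.gene_id = ?"]
    params = [gene_id]
    for column, value in filters.items():
        # Column names are checked against the whitelist above;
        # values are always passed as bound parameters.
        where.append(f"s.{column} = ?")
        params.append(value)
    sql = ("SELECT AVG(e.signal) FROM expression e "
           "JOIN sample s ON s.sample_id = e.sample_id "
           "WHERE " + " AND ".join(where))
    return conn.execute(sql, params).fetchone()[0]
```

Because the sample descriptors are stored as separate columns rather than free text, the same function can answer `mean_signal(conn, "gene_A", tissue="leaf")` or narrow further with `growth_condition="drought"` without any change to the schema.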
Clearly, the ultimate purpose of the data is to provide the basis for
further investigations, and therefore the data must be available to the
community of scientists who may be able to make use of them. The need to
access and download data may extend from a single item to a bulk download, for
example, of complete EST collections, so the database structure must be
sufficiently flexible to accommodate such diverse needs. Beyond data
accessibility, there is the problem of defining a common vocabulary
applicable to the data. If various groups use different terms to describe
the same data, comparison of those data sets becomes much more
difficult, especially for automated queries. Efforts at developing gene
ontologies, for example, the Gene Ontology (GO) project, therefore
aim to produce a dynamic controlled vocabulary that can be
applied to all organisms even as knowledge of gene and protein roles in cells
accumulates and changes (http://www.geneontology.org/) (Ashburner
et al., 2000).
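The value of a controlled vocabulary for automated comparison can be shown with a toy example. In the sketch below, the term identifiers (`TERM:0001`, `TERM:0002`) and the synonym table are invented for illustration; they are not real GO identifiers, and a real project would load terms and synonyms from the ontology itself.

```python
# Hypothetical controlled vocabulary: identifier -> preferred name.
CONTROLLED_TERMS = {
    "TERM:0001": "protein kinase activity",
    "TERM:0002": "transcription factor activity",
}

# Hypothetical lab-specific labels mapped to the shared identifiers.
SYNONYMS = {
    "protein kinase": "TERM:0001",
    "kinase activity, protein": "TERM:0001",
    "transcription factor": "TERM:0002",
    "tf activity": "TERM:0002",
}


def normalize(label):
    """Map a free-text label to a controlled term ID, or None if unknown."""
    key = " ".join(label.lower().split())  # fold case and collapse whitespace
    for term_id, name in CONTROLLED_TERMS.items():
        if key == name:
            return term_id
    return SYNONYMS.get(key)


def shared_terms(annotations_a, annotations_b):
    """Controlled terms that two differently worded annotation sets have in common."""
    ids_a = {normalize(label) for label in annotations_a} - {None}
    ids_b = {normalize(label) for label in annotations_b} - {None}
    return ids_a & ids_b
```

Two groups that describe the same function as "Protein kinase" and "kinase activity, protein" share no literal string, yet `shared_terms` reports a match once both labels are mapped to the same identifier.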
Once the data have been generated in a suitable format, some or
all of them must be stored. Two categories of data, static and dynamic, can be
archived. Examples of static data, or data that do not change frequently, are
sequences, publication records, and germplasm descriptions. Dynamic data,
in contrast, undergo frequent changes and comprise mainly derived data
such as sequence similarity and genetic or physical maps. 
Three possible levels of data storage are:
• Individual laboratory or project databases
• Specialized databases
• Central public or private databases and archives
The individual databases are databases that manage data from individual
groups or projects as they are generated. Here all the detailed information
for those specific data is stored. Much of the data in these databases is likely
to be dynamic data, with the static data being transferred to specialized or
public databases.
Specialized databases are frequently either organism-specific databases
or databases focused on a few closely related species. Examples of these
specialized databases are TAIR, Gramene, ZmDB, MaizeDB, and the Legume
Information System. They are likely to contain both some of the static data
and the dynamic data because they also function as a central resource for
that group of organisms. They are designed to house the most current and
accurate data and to establish standards for data exchange for complex,
dynamic data types appropriate to that community. They hold only a subset
of all the available genomics and genetics data: even if they contain all
the data for the set of organisms they cover, they do not store the totality
of the plant genomics data. It is therefore
important that these specialized databases be compatible among themselves
and with the wider universe of databases so that automated queries can be
facilitated. The current state of genomics information and analysis has
resulted in multiple versions of even the specific databases. For
Arabidopsis thaliana, for example, there are databases maintained by TAIR,
MIPS, and TIGR that may have the same static data but different versions of
some of the dynamic data, such as the composition of the current unigene set.
GenBank is an example of a public archive, especially for nucleic acid
sequence data, because it is the only place to which all such data are expected
to be submitted. GenBank is expertly managed in the areas of data storage,
handling, Internet access, retrieval, and analysis of nucleic acid and protein
data. 
Assuming that other central databases are developed for storage, the
question that arises is what should be stored and for how long. These same
questions arise with any specialized database as well, although the curation
associated with these specialized databases can probably deal with this 
question adequately. However, how can the data housed in individual
laboratory or project databases be stored over the long term, given that
funding (and investigators) have finite lifetimes? Once again, the importance
of database standards becomes clear. Provided the structure of any boutique
database is appropriate for its importation into a data warehouse, if
necessary, the data can be relatively easily archived. However, the question
as to who curates the public archives, and who is responsible for supporting
this curation and storage, is not as easily answered.
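As a minimal sketch of what archiving a boutique database might involve, the example below dumps a small laboratory database to plain SQL text and reloads it elsewhere, using only Python's built-in sqlite3 module. The table name, accession, and description are hypothetical, and a real warehouse would normally define its own import format and standards.

```python
import sqlite3


def export_sql(conn):
    """Serialize the schema and data of a SQLite database as portable SQL text."""
    return "\n".join(conn.iterdump())


def import_sql(sql_text):
    """Rebuild the database from exported SQL text in a fresh in-memory archive."""
    archive = sqlite3.connect(":memory:")
    archive.executescript(sql_text)
    return archive


# Hypothetical laboratory database holding static germplasm descriptions.
lab = sqlite3.connect(":memory:")
lab.execute("CREATE TABLE germplasm (accession TEXT PRIMARY KEY, description TEXT)")
lab.execute("INSERT INTO germplasm VALUES ('ACC-1', 'hypothetical line')")
lab.commit()

dump = export_sql(lab)      # plain text that can outlive the project
archive = import_sql(dump)  # the same tables and rows, restored elsewhere
```

The exported text carries both the structure and the content, which is the property that makes later importation possible even after the originating project has ended.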
