Software for Improving Scientific Data Access Infrastructure Russ Rew Unidata Program Center




Software for Improving Scientific Data Access Infrastructure

  • Russ Rew

  • Unidata Program Center


Overview

  • Problems in scientific data management

  • Some efforts toward finding solutions

    • Distributing near-real time data
    • A data model for the earth sciences
    • Advancing a metadata standard for the earth system science community
    • Serving metadata and data
    • Visualizing and analyzing geoscience data
  • Thoughts on the value of infrastructure



Thanks to:

  • GFD Dennou Club and members who visited Unidata in 2004

  • Research Institute for Sustainable Humanosphere

  • The Japanese Meteorological Agency

  • National Science Foundation and UCAR

  • Unidata Program Center staff and associated community



Unidata

  • Funded primarily by the U.S. National Science Foundation

  • Mission: To provide data, tools, and community leadership for improving Earth-system education and research

  • At the Unidata Program Center, we

    • Provide access to data (via push and pull systems)
    • Develop open source tools and infrastructure for data access, analysis, visualization, and data management
    • Support users of our technologies: faculty, students, and researchers
    • Help to build, represent, and advocate for a community


Background

  • Science increasingly advances through collaborations and synthesis of data across disciplines

  • Tools are not keeping up with need to analyze and combine data from different disciplines

  • We need to continue improving tools for data distribution, data modeling, metadata, remote access, and visualization

    • Within geosciences
    • Across scientific disciplines
    • For collaborations, including international scale


Problems in the Current Scientific Data Landscape

  • Data volumes are approximately doubling every year

  • Data tools are not keeping pace with data volumes

  • Very large datasets demand new techniques for data management

  • Integrating data across organizations and disciplines is very difficult

  • Input/Output bandwidth is not keeping up with storage capacity

  • No single data model has achieved critical mass in the scientific community; each discipline has its own

  • There are many metadata models for each discipline and a lack of common conventions for metadata



Some Earth Science Data Characteristics

  • Large multidimensional arrays from forecast models

  • Coordinate systems to access data by region and time

  • Need for near-real time data

  • Read-only or read-mostly access to existing data

  • Security often a minor issue: data is usually freely available and shared, but data systems must

    • Protect data from accidental or malicious changes
    • Provide restricted access for a few data collections
    • Protect from denial of service caused by users asking for too much


Five Areas of Unidata Involvement

    • Distributing near-real time data
    • A data model for the earth sciences
    • Advancing a metadata standard for the earth system science community
    • Serving metadata and data
    • Visualizing and analyzing geoscience data


Distributing Near-Real Time Data

  • Unidata’s Internet Data Distribution system

    • Delivers near-real time data: model outputs, surface, radar, upper-air, and satellite observations, lightning, observations from aircraft, observations from mesoscale networks, …
    • A collaboration of universities and other institutions
    • No central data center: data products are injected from multiple sources
    • Unidata’s part includes development of client-server software, providing support, training workshops, coordination, negotiating data agreements


Why Not Just Use FTP?

  • Designed to send whole files, not many small products

  • If client pulls data from server, delays result from repeatedly asking server “is new data available yet?”

  • If server pushes data to clients, server must maintain connection information and state for each client

  • FTP can be slow for sending many small data products

  • In spite of these shortcomings, FTP is still used successfully in many data distribution systems, when:

    • Delay not an issue
    • Small number of clients per server
    • Only large products in whole files are distributed
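To make the polling trade-off concrete, here is a rough sketch in Python. It assumes products arrive at times uniformly distributed between polls, so the expected detection delay is half the polling interval. The function names and the figure of 100 clients are illustrative and are not taken from any FTP or LDM implementation.

```python
def mean_polling_delay(poll_interval_s: float) -> float:
    """Expected wait before a poll notices a product that arrived at a random time."""
    return poll_interval_s / 2.0


def polls_per_hour(poll_interval_s: float, clients: int) -> float:
    """Total 'is new data available yet?' requests the server answers per hour."""
    return clients * 3600.0 / poll_interval_s


# Shorter intervals reduce delay but increase the request load on the server.
for interval in (60.0, 10.0, 1.0):
    delay = mean_polling_delay(interval)
    load = polls_per_hour(interval, clients=100)  # 100 clients is an assumed figure
    print(f"interval={interval:5.1f}s  mean delay={delay:5.1f}s  polls/hour={load:9.0f}")
```

Shortening the interval cuts the delay but multiplies the requests the server must answer, which is the tension an event-driven push design avoids.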


Unidata’s LDM (Local Data Manager)

  • An alternative to FTP for data distribution

  • Protocols and client-server software for capturing, distributing, and organizing data in near-real time using reliable, event-driven data distribution

  • Supports subscriptions to subsets of data feeds

  • Suitable for pushing many small products, as well as large products

  • Highly configurable: can inject, distribute, capture, filter, and process arbitrary data products

  • Requires Unix system

  • Heart of the Internet Data Distribution system
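The subscription idea can be sketched as a toy publish/subscribe relay. This is not the LDM protocol or its configuration syntax: the class names, feed names, and product identifiers below are hypothetical, and selecting products by matching a regular expression against an identifier is an assumption made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Product:
    feed: str    # hypothetical feed name, e.g. "RADAR"
    ident: str   # hypothetical product identifier
    data: bytes


class Relay:
    """Toy event-driven relay: handlers run as soon as a matching product is injected."""

    def __init__(self):
        self._subscriptions = []

    def subscribe(self, feed, pattern, handler):
        # A subscription selects a subset of one feed by identifier pattern.
        self._subscriptions.append((feed, re.compile(pattern), handler))

    def inject(self, product):
        delivered = 0
        for feed, pattern, handler in self._subscriptions:
            if feed == product.feed and pattern.search(product.ident):
                handler(product)
                delivered += 1
        return delivered


received = []
relay = Relay()
relay.subscribe("RADAR", r"^KFTG/", received.append)
relay.inject(Product("RADAR", "KFTG/reflectivity/0001", b"\x00"))
relay.inject(Product("RADAR", "KLOT/reflectivity/0001", b"\x00"))
relay.inject(Product("SATELLITE", "KFTG/visible/0001", b"\x00"))
print([p.ident for p in received])
```

Only the first product reaches the subscriber; the other two are filtered out by identifier and by feed, without the subscriber ever polling.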



IDD (Internet Data Distribution)



Real-Time Data Flows

  • 30 data feeds provide radar, satellite, text bulletins, lightning, model forecasts, surface and upper air observations, …

  • LDM-6 commonly handles 5 GB/hour input, with as many as 140,000 products/hour

  • LDM-6 was recently selected for data collection for the THORPEX Interactive Grand Global Ensemble (TIGGE)

  • A cluster LDM configuration can handle 400 downstream connections

  • Currently over 300 machines at 170 sites run LDM-6 continuously

  • Redundant feeds support reliability in case of failure of “upstream” machines or network

  • The National Weather Service uses LDM-6 to collect and relay NEXRAD level 2 radar data operationally for over 150 radars
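A quick back-of-the-envelope calculation shows what the quoted LDM-6 figures imply. It assumes 1 GB = 10^9 bytes and, for simplicity, combines the common input volume with the peak product count, so the mean product size is only a rough lower-end estimate.

```python
BYTES_PER_HOUR = 5e9          # "5 GB/hour input", taking 1 GB = 10**9 bytes (assumption)
PRODUCTS_PER_HOUR = 140_000   # "as many as 140,000 products/hour"

mean_product_bytes = BYTES_PER_HOUR / PRODUCTS_PER_HOUR
input_mbit_per_s = BYTES_PER_HOUR * 8 / 3600 / 1e6
products_per_s = PRODUCTS_PER_HOUR / 3600

print(f"mean product size: {mean_product_bytes / 1e3:.1f} kB")
print(f"sustained input:   {input_mbit_per_s:.1f} Mbit/s")
print(f"product rate:      {products_per_s:.1f} products/s")
```

Under these assumptions the typical product is a few tens of kilobytes arriving dozens of times per second, which is the "many small products" regime that whole-file transfer handles poorly.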



IDD 2007

  • Participants

  • United States

  • Canada

  • Puerto Rico

  • Costa Rica

  • Barbados

  • Venezuela

  • Chile

  • Brazil

  • Argentina

  • England

  • Portugal

  • Spain

  • Austria

  • Russia

  • Vietnam

  • China (Hong Kong)

  • South Korea

  • Antarctica (incipient)





The IDD-Brasil

  • Began as a collaboration under the Meteoforum project

    • Participants: Unidata Program Center/UCAR, CPTEC/INPE, UFRJ, UFPA and USP
    • Inaugurated in January of 2004 with 4 nodes


IDD-Brasil Participation

  • The working paradigm for IDD-Brasil is:

    • You get free access to global data, tools and support
    • You give free access to your data-sets, provide infrastructure and support
  • Free data access and cooperation is a major topic of discussion in every Brazilian meteorological meeting



CPTEC’s Data Ingestion into IDD-Brasil

  • ETA regional model, 40 km resolution (operational)

  • Automatic data-collecting network (operational)

  • GOES satellite imagery, full-resolution for South America (under testing)

  • Global T213 model (under testing)

  • Ensemble T126 Global model – 15 members (under testing)



INPE’s Automatic Network: Data Collecting Platforms

  • More than 524 automated stations from INPE and cooperating Institutions

  • 50 Stations reporting Atmospheric Pressure

  • These were the first new data shared through the IDD-Brasil; soon they will also be on the GTS (in BUFR format)





Conclusions I:

  • The IDD extension to Brazil (IDD-Brasil) is changing Brazilian Meteorology through:

    • Easier access to Global Data
    • Free availability of good analysis tools
    • Spreading ideas and practices


Conclusions II:

  • The IDD-Brasil is growing rapidly

    • Today Brazil is the largest international IDD-user community
  • Numerical model output from Brazilian institutions, distributed by the IDD/IDD-Brasil, is easily available to national and international users



Conclusions III:

  • The data from several Brazilian mesonets may be distributed by the IDD/IDD-Brasil

    • These data are not available on GTS
    • They are very important because the data network in South America is sparse.
  • As a result of the IDD expansion to Brazil, more Brazilian data are becoming available to the international community



Contact Information

  • Waldenio Gambi de Almeida gambi@cptec.inpe.br

  • CPTEC/INPE cptec.inpe.br

  • Maria G.A. Justi da Silva justi@igeo.ufrj.br

  • UFRJ www.meteorologia.ufrj.br

  • Tom Yoksas yoksas@unidata.ucar.edu

  • Unidata/UCAR www.unidata.ucar.edu





How Adequate is the Relational Database Model for Scientific Data?

  • Designed and optimized for

    • Data in tables
    • Online transaction processing systems
    • Other business and enterprise problems
  • Successful in Geospatial Information Systems integration, such as ESRI ArcGIS

  • Also very successful in other disciplines, such as astronomy (Sloan Digital Sky Survey, U.S. National Virtual Observatory)

  • Not adequate for earth sciences data

    • N-dimensional arrays
    • Event-oriented systems, such as sensor webs or high-speed data streams
    • Indexing unstructured data, for example metadata in XML form
    • Supporting scientific analysis and visualization tools


Open Questions in Modeling Scientific Data

  • For how wide a realm is the relational database model adequate?

  • Is any data model that unifies data collections from many disciplines too complex to be useful?

  • Can one scientific data model be useful across many scientific disciplines?

  • Is one scientific data model even practical for earth sciences?



Network Common Data Form

  • A simple data model for scientific datasets

  • A format for portable, self-describing data

  • A programming library that provides efficient direct access to, and subsetting of, multidimensional arrays

  • Several programming interfaces: C, Fortran, C++, Java, Python, Perl, Ruby, ...

  • Support for appending, sharing, and archiving data
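The "simple data model" can be illustrated with a toy sketch: named dimensions, variables whose shapes are defined by shared dimensions, and attributes that make the data self-describing. The classes below are hypothetical and are not the API of any netCDF library. The variable name, sizes, and attribute values are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Variable:
    name: str
    dtype: str                    # e.g. "float", "short", "char"
    dimensions: Tuple[str, ...]   # names of shared dimensions, slowest-varying first
    attributes: Dict[str, object] = field(default_factory=dict)


@dataclass
class Dataset:
    # A size of None marks the single unlimited (appendable) dimension.
    dimensions: Dict[str, Optional[int]] = field(default_factory=dict)
    variables: Dict[str, Variable] = field(default_factory=dict)
    attributes: Dict[str, object] = field(default_factory=dict)

    def add_variable(self, var: Variable) -> None:
        for dim in var.dimensions:
            if dim not in self.dimensions:
                raise KeyError(f"undefined dimension: {dim}")
        self.variables[var.name] = var

    def shape(self, name: str, records: int = 0) -> Tuple[int, ...]:
        """Shape of a variable once `records` entries exist along the unlimited dimension."""
        var = self.variables[name]
        return tuple(
            records if self.dimensions[d] is None else self.dimensions[d]
            for d in var.dimensions
        )


ds = Dataset(dimensions={"time": None, "lat": 73, "lon": 144})
ds.add_variable(Variable("temperature", "float", ("time", "lat", "lon"), {"units": "K"}))
print(ds.shape("temperature", records=4))
```

Appending data corresponds to increasing the record count along the unlimited dimension; the fixed dimensions keep their declared sizes.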



The NetCDF-3 Data Model



Limitations of the NetCDF-3 Data Model

  • Too simple to represent some data structures and relationships

  • No real data structures, just scalars and multidimensional arrays

  • No “ragged arrays” or nested structures

  • Only one shared unlimited dimension, along which data can be appended

  • A flat name space for dimensions and variables

  • No strings, just arrays of characters

  • A limited set of numeric types

  • Only ASCII characters in names



The NetCDF-4 Data Model



NetCDF’s Future

  • NetCDF-4 integrates netCDF with HDF5, another major standard format and data model

  • Parallel netCDF has proved suitable for high-performance computing

  • NetCDF-4 data model (CDM) improves interoperability with other scientific data representations

  • NetCDF-Java has advanced features, including access to remote data



NetCDF-4 Features

  • User-defined compound types (portable structs)

  • User-defined variable-length types

  • Groups for nested scopes

  • Multiple unlimited dimensions

  • String type

  • Additional numeric types
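"Groups for nested scopes" means that a name such as a dimension, once defined in a group, is visible in that group and in its descendants. A minimal sketch of that lookup rule follows. The Group class and the group and dimension names are hypothetical and do not correspond to the netCDF-4 library interface.

```python
from typing import Dict, Optional


class Group:
    """Toy nested scope: dimensions defined here are visible in all child groups."""

    def __init__(self, name: str, parent: Optional["Group"] = None):
        self.name = name
        self.parent = parent
        self.dimensions: Dict[str, Optional[int]] = {}  # None marks an unlimited dimension
        self.groups: Dict[str, "Group"] = {}
        if parent is not None:
            parent.groups[name] = self

    def find_dimension(self, dim_name: str) -> Optional[int]:
        # Search this group first, then walk up through the enclosing groups.
        group = self
        while group is not None:
            if dim_name in group.dimensions:
                return group.dimensions[dim_name]
            group = group.parent
        raise KeyError(dim_name)

    def path(self) -> str:
        if self.parent is None:
            return "/"
        return self.parent.path().rstrip("/") + "/" + self.name


root = Group("")
root.dimensions["time"] = None        # unlimited, shared by every group
obs = Group("obs", parent=root)
radar = Group("radar", parent=obs)
radar.dimensions["gate"] = 460        # a second dimension, local to this group

print(radar.path(), radar.find_dimension("gate"), radar.find_dimension("time"))
```

Because each group has its own name space, two groups can reuse the same variable or dimension name without conflict, which the flat netCDF-3 name space does not allow.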



The Unidata Common Data Model

  • For a common subset of abstractions in OPeNDAP, HDF5, and netCDF-4

  • User-defined compound types (portable structs)

  • User-defined variable-length types

  • Groups for nested scopes

  • String type

  • Additional numeric types

  • Prototype implemented in netCDF-Java

  • Attempts a balance between simplicity and power of representation



NetCDF-Java

  • 100% Java library with more advanced features than the C-based interfaces

  • Prototype implementation of Common Data Model for access to netCDF-4, OPeNDAP, HDF5

    • Provides netCDF interfaces to other formats: Grids (GRIB1, GRIB2), Radar (NEXRAD, NIDS, DORADE), Satellite (DMSP, GINI), Point Observations (BUFR)
    • Provides uniform coordinate systems layer
  • Includes access to THREDDS inventory catalogs



Common Data Model





Some Metadata Issues

  • Every scientific discipline has its own metadata standard. Is any convergence likely?

  • How do you choose among the multiple candidates for metadata standards?

  • How can metadata be improved for existing data without rewriting the data?



Climate and Forecast (CF) Conventions

  • A widely used metadata standard for atmospheric, ocean, and climate data, based on netCDF

  • Specifies coordinate systems used in models, data cell properties and methods, packing, standard names for quantities, and grid mappings

  • CF-aware software can automatically determine space-time location of data variables

  • Originally intended for climate model output conventions, but use has broadened to weather and ocean models and observational data

  • Community governance structure now in place for maintaining and advancing the CF conventions, WMO Working Group on Coupled Modeling (WGCM)
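Packing is one of the simpler CF conventions to show in code. Values are stored as small integers together with scale_factor and add_offset attributes, and a reader recovers each value as packed * scale_factor + add_offset. The sketch below uses plain Python lists. The helper names and sample numbers are illustrative, and None stands in for a missing value.

```python
def pack(values, scale_factor, add_offset, fill_value):
    """Convert physical values to packed integers; None becomes the fill value."""
    return [
        fill_value if v is None else round((v - add_offset) / scale_factor)
        for v in values
    ]


def unpack(packed, scale_factor=1.0, add_offset=0.0, fill_value=None):
    """Recover physical values: packed * scale_factor + add_offset."""
    return [
        None if p == fill_value else p * scale_factor + add_offset
        for p in packed
    ]


# Temperatures in kelvin stored as 16-bit integers with 0.01 K resolution.
attrs = {"scale_factor": 0.01, "add_offset": 273.15, "fill_value": -32768}
packed = pack([273.15, 300.0, None], **attrs)
print(packed)
print(unpack(packed, **attrs))
```

The packed integers take less space than floating-point values, at the cost of a resolution fixed by scale_factor.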



Libcf

  • Purpose: to ease the creation and use of datasets conforming to the CF Conventions

  • In early stages of development and testing

  • C and Fortran interfaces available from Unidata in alpha release



Udunits (Unidata Units)

  • Library for manipulating units of physical quantities

    • Conversion of unit specifications between formatted and binary forms
    • Arithmetic manipulation of unit specifications
    • Conversion of values between compatible scales of measurement
  • C, Fortran, and Java interfaces

  • Required by CF conventions

  • May soon be available as part of netCDF release
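The third capability, converting values between compatible scales of measurement, can be sketched with a small table that maps each unit to a base unit by a scale and an offset. This is a toy illustration and not the Udunits API. The table, the unit names, and the convert function are assumptions made for the example.

```python
# unit name -> (dimension, scale, offset), where base_value = value * scale + offset
UNITS = {
    "kelvin": ("temperature", 1.0, 0.0),
    "celsius": ("temperature", 1.0, 273.15),
    "meter": ("length", 1.0, 0.0),
    "kilometer": ("length", 1000.0, 0.0),
}


def convert(value, from_unit, to_unit):
    """Convert a value between two units of the same dimension."""
    from_dim, from_scale, from_offset = UNITS[from_unit]
    to_dim, to_scale, to_offset = UNITS[to_unit]
    if from_dim != to_dim:
        raise ValueError(f"cannot convert {from_unit} to {to_unit}")
    base = value * from_scale + from_offset   # express in the base unit
    return (base - to_offset) / to_scale      # re-express in the target unit


print(convert(25.0, "celsius", "kelvin"))
print(convert(1.5, "kilometer", "meter"))
```

Rejecting conversions between different dimensions is what "compatible" means here: a length cannot be converted to a temperature.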



NcML (NetCDF Markup Language)

  • An XML representation of netCDF metadata, similar to CDL

  • A schema language for Earth science data

    • To get NcML from netCDF data, use ncdump -x or Java ToolsUI program
    • To create netCDF from NcML, use ToolsUI or (eventually) ncgen
  • Provides a way to add to or change metadata without rewriting, by referencing and overriding metadata in a file

  • Also supports aggregation of multiple files
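As a sketch of changing metadata without rewriting data, the following Python builds a small NcML document that references an existing file and overrides one variable attribute. The file name, variable name, and attribute value are hypothetical, and the helper function is an illustration rather than part of any Unidata tool.

```python
import xml.etree.ElementTree as ET

NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"


def ncml_override(location, variable, attributes):
    """Return NcML text that references `location` and overrides attributes of one variable."""
    ET.register_namespace("", NCML_NS)  # serialize with NcML as the default namespace
    root = ET.Element(f"{{{NCML_NS}}}netcdf", {"location": location})
    var = ET.SubElement(root, f"{{{NCML_NS}}}variable", {"name": variable})
    for name, value in attributes.items():
        ET.SubElement(var, f"{{{NCML_NS}}}attribute", {"name": name, "value": value})
    return ET.tostring(root, encoding="unicode")


# Hypothetical file and variable: correct the units without touching the data file.
doc = ncml_override("example.nc", "T", {"units": "K"})
print(doc)
```

The original file is left unchanged. A client that reads the NcML document sees the data from the referenced file with the overridden metadata applied.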





OPeNDAP

  • Open-source Project for a Network Data Access Protocol; see opendap.org

  • A discipline-neutral protocol to get remote scientific data and metadata (not files)

  • Allows requests for subsets and aggregations

  • Software reference implementations for many kinds of data: netCDF, SQL (databases), HDF, FITS, JGOFS, …

  • In use in earth sciences, astronomy, medicine, …

  • Serves IPCC model output



FTP (File Transfer) versus OPeNDAP for Access to Remote Data

  • FTP accesses only whole files

  • OPeNDAP includes services for

    • Selected subset of data from a file
    • Aggregation of data in multiple files
    • Selected subset of aggregated data
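A subset request in OPeNDAP version 2 is expressed as a constraint appended to the dataset URL, with one [start:stride:stop] index range per dimension and an inclusive stop index. The sketch below only builds such a URL and makes no network request. The server address and variable name are hypothetical.

```python
from urllib.parse import quote


def hyperslab(start, stop, stride=1):
    """One index range in DAP2 syntax; the stop index is inclusive."""
    return f"[{start}:{stop}]" if stride == 1 else f"[{start}:{stride}:{stop}]"


def dap_url(dataset_url, variable, slabs, response="ascii"):
    """Build a request URL for a subset of one variable.

    `response` is the suffix selecting the reply type, e.g. "dds", "das", "dods", "ascii".
    """
    constraint = variable + "".join(hyperslab(*s) for s in slabs)
    return f"{dataset_url}.{response}?{quote(constraint, safe='[]:,')}"


# Hypothetical dataset: first time step, every second latitude row 10..20, all 144 longitudes.
url = dap_url(
    "http://example.org/dods/model.nc",
    "temperature",
    [(0, 0), (10, 20, 2), (0, 143)],
)
print(url)
```

Only the requested slab crosses the network, whereas FTP would have to transfer the whole file before any subsetting could happen on the client.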


OPeNDAP Protocol, Servers, and Clients

  • Protocol uses URLs and HTTP

  • Unidata provides OPeNDAP support

  • Several OPeNDAP servers available: pyDAP, FDS, GDS, DAPPER, THREDDS Data Server

  • OPeNDAP clients include: Ferret, GrADS, Matlab, IDL, ArcGIS, netCDF-Java, IDV

  • OPeNDAP version 2 now a NASA standard

  • Version 4 under development with a test version available: adds XML, new types, new functions, THREDDS catalogs, SOAP, outputs in HTML and ASCII

  • Will add authentication, more server-side processing



Thematic Real-time Environmental Distributed Data Services (THREDDS)

  • For data providers, implements data catalogs to present to users and applications

  • Catalogs are XML documents (metadata) describing and pointing to datasets accessible via client/server protocols (OPeNDAP, ADDE, WCS, HTTP)

  • Datasets may be found by discovery centers (master directories, digital libraries, data portals) via catalogs

  • Catalog hierarchy provides places to hang common metadata

  • Unidata coordinates THREDDS activities, community implements servers

  • Many partners as data providers, tool builders, interoperability experts from academia, government, industry
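Because catalogs are XML documents, a client can walk one with a standard XML parser. The sketch below parses a small hand-written catalog and resolves dataset access URLs. The catalog content and host name are made up, and the code assumes, as a simplification, that the single service listed applies to every dataset.

```python
import xml.etree.ElementTree as ET

THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"

# Hypothetical catalog with one service and two accessible datasets.
CATALOG = f"""<catalog xmlns="{THREDDS_NS}" name="Example catalog">
  <service name="dap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
  <dataset name="Model output">
    <dataset name="Run 00Z" urlPath="model/run00.nc"/>
    <dataset name="Run 12Z" urlPath="model/run12.nc"/>
  </dataset>
</catalog>"""


def access_urls(catalog_xml, host):
    """Map dataset names to access URLs (assumes the first service serves all datasets)."""
    root = ET.fromstring(catalog_xml)
    base = root.find(f"{{{THREDDS_NS}}}service").get("base")
    urls = {}
    for dataset in root.iter(f"{{{THREDDS_NS}}}dataset"):
        url_path = dataset.get("urlPath")
        if url_path is not None:  # container datasets have no urlPath
            urls[dataset.get("name")] = host + base + url_path
    return urls


for name, url in access_urls(CATALOG, "http://example.org").items():
    print(name, url)
```

The nesting of dataset elements is the catalog hierarchy: metadata attached to a container can describe everything beneath it.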



Motherlode Portal Catalog of Catalogs



NCDC Server



NCEP NAM Individual Run



THREDDS Data Server (TDS)

  • Serves data, THREDDS catalogs, and metadata

  • Reads and serves several kinds of data through a uniform CDM interface: netCDF, OPeNDAP, HDF5, GRIB, NEXRAD, …

  • Adds Earth-location coordinate systems to data

  • Provides OPeNDAP access and subsetting of any data readable with NetCDF-Java library

  • An integrated server provides data access through the Open Geospatial Consortium Web Coverage Service (OGC WCS)

  • Easy to install, 100% Java, freely available

  • Supports dynamic generation of catalogs



THREDDS Data Server





Integrated Data Viewer (IDV)

  • Unidata’s newest scientific analysis and visualization tool

  • Freely available 100% Java framework and reference application

  • Provides 2- and 3-D displays of geoscience data

  • Stand-alone or networked application

  • Integrates data from different sources

  • Provides an end-to-end test for technologies



Some IDV Features

  • Client-server data access from remote systems

  • Suite of data probes for interactive exploration (slice and dice)

  • Animations (temporal and spatial)

  • HTML interface for pedagogic materials

  • XML configuration and bundling allows collaboration with other educators

  • Java-based framework supports extensions built via plug-ins: for example, for the Geosciences Network (GEON) solid earth community



Catalog of catalogs in IDV (Catalog from within a Client)



Summary and Tentative Conclusions

  • “One size fits all” databases are not a comprehensive solution for scientific data or metadata.

  • Similarly, the old file-FTP approach is running out of steam for distributed data systems.

  • There is a limited time for establishing interdisciplinary data tools and services, before each discipline crystallizes on its own solutions.

  • The resulting islands of non-interoperability would be unfortunate.

  • The flexibility to adapt to better solutions is required, maybe even before the best solutions are evident.



A Few Last Thoughts on Infrastructure …



What is Infrastructure?

  • The basic facilities, services, and installations needed for the functioning of a community

    • Utilities: water and power lines
    • Transportation and communications systems
  • Good infrastructure is reliable, sturdy, useful, long lasting, standardized, widely used, and invisible



Infrastructure: Stones in a Wall

  • Higher layers are built on lower layers

  • Stones may be replaced with other stones of similar size and shape

  • From the top, lower layers are invisible



Cyberinfrastructure: the Middle Layers



Is Developing Infrastructure Rewarding?

  • It’s abstract, so hard to explain at a party

  • You can’t take a picture or movie about it

  • If it works well, it is invisible

  • End users are often not aware of it

  • It doesn’t get referenced in scientific papers

  • It can be expensive to evolve and support

  • If not maintained, it eventually crumbles

  • You can’t sell it, so you have to give it away



Earth Science Infrastructure: Bricks in a Wall of Acronyms



Visible and Invisible Infrastructure



What Is Good Infrastructure?

  • Provides a useful service

  • Makes abstractions at the right level

  • Cloaks invisible details with a simple interface

  • Binds loosely to other infrastructure

  • Behaves reliably

  • Adapts easily to changes



An Example of Great Infrastructure: Popular Programming Languages

  • Base of huge collection of higher layers of infrastructure

  • People continue to build on top of this infrastructure

  • The opportunity to create a long-lasting and popular programming language is rare

  • John Backus (Fortran), John McCarthy (Lisp), Dennis Ritchie (C), Bjarne Stroustrup (C++), James Gosling (Java), Yukihiro “Matz” Matsumoto (Ruby)

  • Other great infrastructures: Unix, TCP/IP, HTTP, …



Rewards of Developing Infrastructure?

  • It “raises the level” for other developers

  • Beautiful and useful new layers and applications are built on top of it

  • You can feel a part of everything it supports

  • If it’s long lasting and widely used, you have made a difference for future generations

  • So, it’s one way to get closer to immortality

  • Infrastructure is abstract, but rewards can also be real

  • … like this trip to Japan!



For More Information

  • http://www.unidata.ucar.edu/

  • support@unidata.ucar.edu

  • russ@ucar.edu



