GEOPHYSICAL DATA STEWARDSHIP IN THE 21ST CENTURY AT THE NATIONAL GEOPHYSICAL DATA CENTER (NGDC)

The World Data Center for Geophysics in Boulder, Colorado, is hosted by the National Geophysical Data Center (NGDC). NGDC's vision is to be the world's leading provider of geophysical and environmental data, information, and products. NGDC's mission is to provide long-term scientific data stewardship for geophysical data, ensuring quality, integrity, and accessibility. Faced with ever-expanding data volumes and types of data, NGDC is developing innovative techniques for science data stewardship based in part on data mining and fuzzy logic. Use of these techniques will allow NGDC to provide data stewardship more effectively for its own scientific data archives and perhaps for the broader World Data System.


INTRODUCTION
The National Geophysical Data Center (NGDC) (http://www.ngdc.noaa.gov/ngdcinfo/aboutngdc.html) is one of three data centers operated by the National Oceanic and Atmospheric Administration (NOAA) to archive and disseminate data collected in executing its environmental mission. NGDC has two primary science divisions, each focused on a different domain: the Solar and Terrestrial Physics (STP) Division, which focuses on space-related and space-derived products and information, and the Marine Geology and Geophysics (MGG) Division, which focuses primarily on data from the sea floor as well as main-field magnetics. A sample listing of the data and applications from each is available in Table 1.

Table 1. Sample data products and their application areas
A primary component of NGDC's mission is to provide scientific stewardship for the data archived at the center.
Here "scientific stewardship" means that, in addition to preserving the data for the long term, NGDC focuses on providing calibrated data sets that can reach a broader audience; creating products from raw data, thereby exposing the data to a larger audience; providing long-term quality control for data sets to create "research quality holdings"; and finally, propagating the knowledge derived from the data to the community at large. As can be seen in Figure 1, because each of the higher-level activities is labor intensive, it is performed on a proportionally smaller percentage of the overall data archive, thereby reducing the return on the investment made in archiving the data. NGDC is developing tools and techniques that allow the center to address more of the data at a higher level without increasing overall staff, even in the face of increasing data volumes and diversity. The goal is to develop automated "expert systems" that provide stewardship functions without the need for direct staff involvement. The sections below describe the NGDC vision and some early implementations in pursuit of more automated and improved data stewardship.

DATA MINING
Data mining is one possible solution in support of stewardship activities. By data mining we mean using mathematical and computational tools to extract previously unknown, and potentially useful, information from the archived data. Data mining uses techniques such as machine learning and statistical analysis to summarize and present knowledge in a form that is easily comprehensible to humans. By filtering through the vast archives and pointing trained scientists to the more interesting bits of information, data mining enables management of larger and more diverse archives. Some possible applications of these techniques are summarized in Table 2.
The first two are addressed with specific examples below.
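Before turning to those examples, the core filtering idea can be illustrated with a minimal sketch. The following Python fragment is purely illustrative and is not NGDC's production code; the window size, threshold, and synthetic data are invented for the example. It screens a long one-dimensional record against a rolling statistical baseline and returns only the small set of indices an analyst would need to review.

    import numpy as np

    def flag_interesting(series, window=100, n_sigmas=4.0):
        """Return indices where a value deviates strongly from its local baseline."""
        flagged = []
        for i in range(window, len(series)):
            baseline = series[i - window:i]
            mu, sigma = baseline.mean(), baseline.std()
            if sigma > 0 and abs(series[i] - mu) > n_sigmas * sigma:
                flagged.append(i)
        return flagged

    # Example: a long, mostly quiet record with a few injected spikes.
    rng = np.random.default_rng(0)
    record = rng.normal(10.0, 1.0, 50_000)
    record[[12_345, 40_000]] += 15.0      # synthetic "events"
    print(flag_interesting(record))       # analyst reviews only these points

The point of such screening is the reduction in scale: a 50,000-sample record collapses to a handful of candidate events for expert review.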

HUMAN LINGUISTIC TRANSLATION
When attempting to mine data for information, we find that natural language is not easily translated into the computer-friendly terms of simple 0s and 1s. However, natural language is typically how scientists prefer to ask questions when interacting with data: Is the sample "hotter" on average? Is this observation outside of the "norm"? Is the sample "changing" with time? Fuzzy logic lets us map human thought and language into computer functions much closer to the way the brain works. We can aggregate data and form a number of partial truths, which we consider when certain thresholds are exceeded, initiating an action such as flagging the data as suspect or identifying a significant trend. Fuzzy logic is a superset of conventional (Boolean) logic that has been extended to handle the concept of partial truth, that is, truth values between "completely true" and "completely false". It was introduced by Dr. Lotfi Zadeh (Zadeh, 1965) of UC/Berkeley in the 1960s as a means to model the uncertainty of natural language. The use of "fuzzy" logic allows automated systems to capture some of the natural thought process of a data manager and to apply it to an archive. Applying these techniques, one can search an entire 40-year archive for events described by "high" winds, "average" temperature, and "about" 60% humidity (perhaps a storm description) and quickly identify when such events are occurring, detect any changes over time, and display the results to a user (Figure 2). Notice that because the language used is natural, the same query would work for data in Alaska or Florida, although what constitutes "average" temperature is obviously quite different between the two. Natural language processing is key to handling large and diverse data volumes and will be expanded at NGDC as ever more automated systems are fielded.
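A minimal Python sketch of how such a fuzzy query might be evaluated is given below. It is illustrative only: the membership breakpoints, units, climatology band, and 0.7 threshold are all invented for the example rather than taken from NGDC's systems. Each linguistic term ("high" wind, "average" temperature, "about" 60% humidity) becomes a membership function mapping a raw observation to a degree of truth in [0, 1], and the terms are combined with the fuzzy AND (minimum).

    def trapezoid(x, a, b, c, d):
        """Degree to which x belongs to a fuzzy set rising over [a, b],
        flat over [b, c], and falling over [c, d]."""
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        return (x - a) / (b - a) if x < b else (d - x) / (d - c)

    def storm_degree(wind, temp, humidity, climatology):
        """Fuzzy AND (min) of 'high' wind, 'average' temp, 'about 60%' humidity."""
        t_lo, t_hi = climatology                         # site-dependent "average" band
        high_wind = trapezoid(wind, 15, 25, 200, 201)    # m/s; open-ended "high"
        avg_temp = trapezoid(temp, t_lo - 5, t_lo, t_hi, t_hi + 5)
        about_60 = trapezoid(humidity, 50, 57, 63, 70)   # "about" 60% RH
        return min(high_wind, avg_temp, about_60)

    # Flag archive records whose combined degree of truth exceeds a threshold.
    obs = [(28.0, 2.0, 61.0), (5.0, 2.0, 61.0), (30.0, 1.0, 58.0)]
    alaska_avg = (-5.0, 5.0)                             # hypothetical climatological band
    for wind, temp, rh in obs:
        if storm_degree(wind, temp, rh, alaska_avg) > 0.7:
            print(f"candidate event: wind={wind}, temp={temp}, rh={rh}")

Note how the site adaptation described above falls out of the design: the same query runs unchanged in Alaska or Florida because only the climatology argument, not the linguistic query, differs between sites.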

DATA QUALITY CONTROL
The Space Weather Reanalysis (SWR) (Kihn, 2007) is a long-term reanalysis of space weather data that requires careful quality control of a huge volume of diverse data. The SWR involves taking raw observational data and processing it through linked physical models that produce a higher-order product capable of summarizing the state of the space environment. A single instance of bad data can have ripple effects throughout the entire model run. Working with satellite and station data in particular can be tricky, with spikes, baseline shifts, and dropouts all prominent in the data stream (Figure 3). In a typical small-scale study it would be possible for a researcher to hand screen the data, but here the data volume requires the application of "intelligent" computer techniques based on fuzzy logic, neural computing, and other mathematical functions. In particular, for this application of data quality control, NGDC developed a system capable of "peer matching"; that is, each station was analyzed to determine a group of peer stations based on location, instrument type, and dynamic range. The data mining application was then set to look at the entire 15-year data stream for instances when a given station observed data "unlike" its peers. This much smaller subset of data could then be reviewed directly by an analyst. Notice that in this instance the data mining helps in two ways: by determining a set of peer stations and by allowing a linguistic search for data "unlike" its peers. Application of these techniques not only allowed for the integration of over 15 years of data into the model runs but also left behind a vastly improved data archive, with each station and observation having been screened for quality. NGDC will look to expand usage of such systems as data volumes and diversity increase.
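To make the peer-matching idea concrete, a minimal Python sketch under stated assumptions is given below; the station names and metadata are synthetic, the similarity weighting is invented, and the use of the median absolute deviation as the yardstick for "unlike" is this example's choice, not a documented detail of the SWR system.

    import numpy as np

    def peer_group(stations, target, k=3):
        """Pick the k stations most similar to `target` by location and dynamic range.

        `stations` maps name -> (lat, lon, dynamic_range); similarity here is a
        simple weighted Euclidean distance over those metadata fields."""
        lat0, lon0, dr0 = stations[target]
        scores = {
            name: np.hypot(lat - lat0, lon - lon0) + 0.1 * abs(dr - dr0)
            for name, (lat, lon, dr) in stations.items() if name != target
        }
        return sorted(scores, key=scores.get)[:k]

    def unlike_peers(target_series, peer_series, n_mads=6.0):
        """Flag samples where the target deviates strongly from the peer median.

        The median absolute deviation (MAD) across peers serves as a robust
        per-sample yardstick for what "unlike" means."""
        peers = np.asarray(peer_series)              # shape (n_peers, n_samples)
        consensus = np.median(peers, axis=0)
        mad = np.median(np.abs(peers - consensus), axis=0) + 1e-9
        deviation = np.abs(np.asarray(target_series) - consensus)
        return np.where(deviation > n_mads * mad)[0]

    # Hypothetical metadata: name -> (lat, lon, dynamic range).
    stations = {"BOU": (40.1, -105.2, 500.0), "FRD": (38.2, -77.4, 480.0),
                "CMO": (64.9, -147.8, 900.0), "TUC": (32.2, -110.8, 510.0)}
    print(peer_group(stations, "BOU", k=2))          # picks TUC and FRD over CMO

An analyst applying this pattern would review only the indices returned by unlike_peers rather than the full multi-year stream, which is the workload reduction the paragraph above describes.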

Figure 1. The stewardship pyramid, showing decreasing volume for higher-order products

Figure 2. A sample search for a typical weather event

Table 2. Applications of data mining