WHAT IS PALEOSTRAT?

  

PaleoStrat is a community digital information system for sedimentary geology, stratigraphy, biostratigraphy, and paleontology, and related geochemical, geochronologic, and other data. PaleoStrat is a research environment and facility to support Earth science research that is being built by scientists for scientists. PaleoStrat is designed to provide: 1) a database for all relevant data and metadata necessary for the investigation of Earth processes as recorded in stratigraphic successions, 2) user-friendly ways to capture these data, 3) a secure area for users and projects to store and work with unpublished data, 4) simple but powerful ways to find the information, 5) convenient mechanisms for users to extract, synthesize and work with data, including access to analytical and visualization tools, and 6) educational modules that will use the dynamic research data to help educate and inform undergraduates and a broad public audience.

No other such comprehensive, integrated data system exists, and the challenges for design, implementation, maintenance, and migration are immense. Therefore we are working particularly closely with other geoinformatics efforts to share ideas and technologies and, where possible, utilize partners' services. In direct collaboration with SedDB (www.seddb.org), SESAR (www.geosamples.org), GEON (www.geongrid.org), EarthChem (www.earthchem.org), and also the Paleobiology Database (www.pbdb.org), CHRONOS (www.chronos.org), iGeoInfo (www.igeoinfo.org), and others, we are working to ensure that PaleoStrat is part of a larger system. PaleoStrat supports research and education in the broad categories of sedimentary geology and paleobiology, but these data are also critical for many allied studies in structure, tectonics, paleoclimate, and other issues dealing with the systematics of Earth processes. Finally, whereas PaleoStrat is being developed explicitly for the terrestrial Earth science community, we have built partnerships with our marine counterparts, specifically SedDB, to ensure that data from the ancient marine systems are available to everyone regardless of whether they are working in the PaleoStrat or SedDB environment.

  

This is your System

It is important to emphasize that this is your geoinformatics system.  It is designed as a service to the community.  This means we need your feedback on how to make improvements.  This means we need your data. Please feel free to contact us at any time!

  

SCIENCE DRIVERS

  

PaleoStrat is a platform that supports science - it does not do science. Our purpose is to make available integrated, multidisciplinary data sets and, with our partners, tools to visualize and analyze these data. PaleoStrat enables scientific inquiry and provides the level of data required for the next generation of cutting-edge science research. It allows for the capture of highly resolved data within their complete geologic context. Without this detail, without this ability to capture data and place it in its full geologic context, without a reliable way to ensure data integrity and provide the reproducibility and level of resolution necessary, these science issues cannot move forward as effectively or efficiently as needed. In short, PaleoStrat helps the researcher conduct detailed analysis of multifaceted data within their full geologic and biologic context.

  

PaleoStrat helps to democratize science, in the sense that the best minds can be brought to bear on research questions, not merely researchers and organizations that happen to hold their own large data sets and resources. It should be emphasized that no such comprehensive data system exists now in the U.S. or elsewhere. Also, agencies such as the USGS, on their own, won’t be able to capture the relevant data because of their mandated missions - missions that typically do not include serving the academic community.  Thus, PaleoStrat provides a critical research platform for the paleontologic, geochemical, geochronologic, and sedimentary geology community, including building a bridge between biologic and paleobiologic, and paleoclimate and present-day climate research, and others. It also provides support for structural geology, tectonics, geochronology, petrology, paleomagnetics, and other fields related to understanding the tectonostratigraphic history of the lithosphere through time. Below are several examples of specific science drivers, but it is not an exhaustive review of any one topic, or a thorough listing of all possible science drivers.

  

Time calibration of the geologic record and EARTHTIME 

A broad array of cutting-edge research topics requires the development of a more highly resolved time scale. This research demands that we be able to date events and determine the rates of deep-time geologic processes with a resolution of hundreds of thousands of years even 400 million years ago. EARTHTIME (http://www.earth-time.org) is an unprecedented interdisciplinary effort by geochronologists, stratigraphers, biostratigraphers, chemostratigraphers, and magnetostratigraphers to accurately and precisely develop a numerical time scale at a level of resolution of 0.1% or better. This is now possible because of advances in analytical methods and because of the realization that volcanic tuffs/tuffaceous sediments are more common in the stratigraphic record than previously appreciated. This time scale calibration requires a disciplined approach by geochronologists to standardize their methodologies, develop common analytical standards, conduct interlaboratory calibrations, rectify differences between geochronologic methods (e.g., U-Pb and Ar-Ar), and reach agreement on what data and metadata must be provided with each analysis. Stratigraphers, biostratigraphers, chemostratigraphers, and magnetostratigraphers, too, must learn to standardize their approach to recording stratigraphic successions in outcrop and core, providing accurate sample location information and a complete geologic context of each succession, and agreeing on what data and metadata are required for these “calibration” sections.
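As a rough illustration of what a relative resolution of 0.1% means in absolute terms, the short Python sketch below converts a relative resolution into years for a few ages. The function name and the example ages other than 400 Ma are illustrative assumptions; they are not part of EARTHTIME or PaleoStrat.

```python
def absolute_resolution_years(age_years: float, relative_resolution: float = 0.001) -> float:
    """Return the absolute time resolution (in years) implied by a
    relative resolution (0.001 = 0.1%) at a given age (in years)."""
    return age_years * relative_resolution


if __name__ == "__main__":
    # Example ages in millions of years (Ma); 400 Ma is the case named in the text.
    for age_ma in (100, 250, 400):
        resolution = absolute_resolution_years(age_ma * 1e6)
        print(f"{age_ma} Ma at 0.1%: about {resolution:,.0f} years")
```

At 400 million years, 0.1% corresponds to roughly 400,000 years, which is the "hundreds of thousands of years" resolution described above.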

  

How is PaleoStrat contributing to this endeavor? The EARTHTIME effort requires a data system to handle this wide array of data. The system must be comprehensive; hosting only the geochronologic data is insufficient because it is the rock record that is being calibrated. Thus, geochronologic samples must be placed within their complete geologic context. The data system must be as easy to use as possible, yet provide the ability to capture the full breadth of data at the resolution necessary for the task. Some data could be captured in the laboratory, others in the field as the sections are being described and sampled; all must be available over the web, and EARTHTIME cannot proceed without such a system in place. PaleoStrat is working with geochronologists and bio-, physical, and chemo-stratigraphers to continue to modify and expand the system to provide just these services.

  

Biologic Extinction, Origination, Radiation, and Diversity

Lane et al. (2001; note 1 below) summarize a workshop attended by over 100 international scientists where several major paleontologic questions were discussed that remain unsolved today:

  

What rules govern biodiversity dynamics and do those rules apply at all temporal and spatial scales?

  

Why are major evolutionary innovations unevenly distributed in space and time?

  

How have biological systems influenced the physical and chemical nature of the Earth's surface, and how has biogeochemical cycling changed through time?

  

How does the biosphere respond to environmental perturbations at regional and global scales?

  

It is perhaps surprising to the nonpaleontologist that we still do not fully understand the processes that drive biodiversity, extinction, origination, radiation, faunal dispersal, nor the manner in which the biological realm interacts with and influences Earth's physical and chemical worlds.  As for issues of climate change, the Earth's deep-time record of life is critical for understanding recent and future environmental changes. The paleontologic record uniquely houses information about the biosphere's reaction to environmental change that must be factored into our musings about the modern world and predictions about future environmental changes.

  

What these scientists concluded is that they must sample the fossil record at spatial and temporal levels of sufficient resolution to allow the problems to be adequately addressed. This is precisely what PaleoStrat provides - a data system that can capture the breadth of paleontologic data within the full sedimentologic and stratigraphic context at resolutions from global to microscopic. In this regard, PaleoStrat is complementary to data provided by museums and those stored in the Paleobiology Database (paleodb.org). Indeed, one goal is to allow a researcher working at the Paleobiology Database site to seamlessly extract relevant data from PaleoStrat into their working environment - but that will have to wait for the next phase of development. PaleoStrat also recognizes the community's need to assess and correlate these data, and thus we will provide access to tools at CHRONOS (www.chronos.org; age-depth plots, constrained optimization and statistical packages), graphic correlation from GEON (www.geongrid.org), and others as appropriate. The output from such analytical packages is highly dependent on the quality of data utilized in the analysis - and PaleoStrat's function is to allow the user to assess and set the level of quality desired for the particular study being undertaken. PaleoStrat will provide seamless links to these and other resources as we continue to work with the paleontologic community to improve the system.

  

1. R.H. Lane, F.F. Steininger, R.L. Kaesler, W. Ziegler, and J.H. Lipps (eds.). 2001. Fossils and the Future: Paleontology in the 21st Century. Senckenberg-Buch Nr. 74, 290 pp.

  

Deep-time Paleoclimate Research & GeoSystems

Earth’s climate system is complex - far more complex than the modern, or even Quaternary, records would suggest. Over 99% of the record of Earth’s climate-system behavior is housed in the pre-Quaternary record, including the record of the processes driving the current and foreseeable-future states of atmospheric CO2. This “deep-time” record showcases environmental disturbances that are unknown from the Recent and beyond the comfort zone of human experience, but within the comfort zone of Earth experience. Study of this rich and diverse record offers hope that a more holistic understanding, and thus prediction, of Earth’s climate system is within reach.

  

GeoSystems (www.geosystems.org) is an interdisciplinary, community-based initiative stemming from the growing recognition that a full understanding of Earth's climate system - and our climate future - lies in examining the wealth of "alternative-Earth" climatic extremes archived in the pre-Quaternary geologic record. The GeoSystems approach recognizes and embraces the importance of the deep-time perspective for understanding the complexities of Earth's atmosphere, hydrosphere, biosphere and surficial lithosphere using climate as the nexus. Recent and powerful developments in our abilities to read, date, and model Earth's deep-time history are enabling reconstruction and study of past scenarios at unprecedented detail. Only the deep-time (pre-Quaternary) climate record provides comprehensive clues to the climate patterns and drivers of a “greenhouse” Earth with tropical temperatures at high latitudes, an “icehouse” Earth with continental glaciers in the tropics, climate-altering gas hydrate releases, and mass extinctions. Through anthropogenic activities, we appear to be forcing the planet into a greenhouse mode during an icehouse interval. If we are to truly understand and predict potential catastrophic climatic changes, we have to look at the only record of such events that we have - the deep-time record. 

  

Investigation of these deep-time events requires integration of many types of records - sedimentological, paleobiological, and geochemical - archived in sediments on land and sea. As summarized by the GeoSystems effort, we need a system that: 1) allows for the enhanced development and resolution of paleoclimate proxy data; 2) provides better geochronologic resolution, because abrupt changes are more common than previously imagined; 3) acquires interdisciplinary, high-resolution data from key stratigraphic successions; 4) allows climate modelers to develop and refine models applicable to deep-time slices and make available the necessary data to test these models; and 5) promotes interdisciplinary communication and collaboration.  PaleoStrat is providing the database to support this initiative.

  

Cyclostratigraphy and the Deep-time Extrapolation of Orbital Parameters

Cyclostratigraphy has emerged as a powerful tool to “astronomically tune” the age of stratigraphic sections and even the geologic time scale into approximately 20,000-year intervals. As these high-resolution subdivisions are extrapolated from the Recent record downward into deep time, they appear to be an exciting new means to track the rates and duration of many Earth processes. The key will be independent testing to resolve some disagreements between the estimates of age and duration of sedimentary cycles delineated by cyclostratigraphy and those provided by independent physical stratigraphic, biostratigraphic, and geochronologic data. Evaluation of the cyclostratigraphic estimates of ages and cycle durations requires more detailed temporal and spatial data than are presently available. In particular, independent age estimates need to be provided for cycles, and each cycle must be fully documented with respect to lithology, sedimentology, paleontology, geochemistry, paleomagnetic values, etc. More importantly, other workers need to be able to add to existing data for each section to easily and efficiently build on previous results; doing this manually, where every researcher must “rediscover” previous data, simply is not an acceptable approach. PaleoStrat has been designed to accommodate the complete spectrum of physical stratigraphic data, from lithostratigraphic to sequence stratigraphic and cyclostratigraphic data. These, coupled with biostratigraphic, magnetostratigraphic, chemostratigraphic, and chronostratigraphic data, provide a complete suite of data types for stratigraphic time series analysis. By accommodating these data in such a comprehensive public database, PaleoStrat can provide the critical platform to rigorously test the cyclostratigraphic concept of astronomical tuning.
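The kind of independent test described above can be sketched in a few lines of Python. This is a minimal illustration, not a PaleoStrat tool: the class and function names, the example ages, and the two-sigma agreement criterion are all assumptions made for the example; only the approximate 20,000-year cycle length comes from the text.

```python
from dataclasses import dataclass


@dataclass
class DatedHorizon:
    """An independently dated horizon (hypothetical structure)."""
    age_ma: float    # radiometric age, millions of years
    sigma_ma: float  # one-sigma uncertainty, millions of years


def cycle_duration_myr(n_cycles: int, period_kyr: float = 20.0) -> float:
    """Duration implied by counting n_cycles of an assumed period."""
    return n_cycles * period_kyr / 1000.0


def is_consistent(lower: DatedHorizon, upper: DatedHorizon, n_cycles: int,
                  period_kyr: float = 20.0, n_sigma: float = 2.0) -> bool:
    """Compare the cycle-count duration with the radiometric duration.

    The two estimates are treated as consistent when they differ by no
    more than n_sigma times the combined age uncertainty (an assumed,
    simplified criterion).
    """
    radiometric_myr = lower.age_ma - upper.age_ma
    combined_sigma = (lower.sigma_ma ** 2 + upper.sigma_ma ** 2) ** 0.5
    difference = abs(cycle_duration_myr(n_cycles, period_kyr) - radiometric_myr)
    return difference <= n_sigma * combined_sigma


if __name__ == "__main__":
    base = DatedHorizon(age_ma=300.0, sigma_ma=0.1)  # made-up example ages
    top = DatedHorizon(age_ma=299.0, sigma_ma=0.1)
    for count in (50, 80):
        print(count, "cycles consistent:", is_consistent(base, top, count))
```

A check like this is only as good as the documentation of each cycle and each dated horizon, which is why the full geologic context needs to be captured alongside the ages.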

  

Dynamics of Orogenic Systems

Sedimentary successions harbor critical data about the evolution of continental crust that cannot be found elsewhere within orogenic systems; however, these data have been underutilized in our studies of crustal deformation. Most large sedimentary basins are linked parts of orogenic systems. Some basins are peripheral to the core mountain belt - as foreland or fore-arc basins, intra-arc or “piggy-back” basins - and some cap the main phase of tectonic activity as so-called “successor” basins. The creation of the sedimentary basin itself records significant to dramatic crustal deformation. Locked within the sedimentary succession are critical data about the crustal deformation: detrital zircons can track unroofing; helium isotopes combined with fission track and Ar/Ar dating of apatite and zircon define rates of uplift through multiple isotherms. Tectonostratigraphy can provide timing of deformation events not well recorded in the igneous and metamorphic assemblages of the orogenic core; volcanic flows and tuffs mark magmatic episodes. In addition, paleoclimate proxies may suggest the magnitude of uplifts, and biotic changes may reflect the construction or destruction of physical barriers to biotic migration, changes in the nature of the marine water mass, etc. Whereas the general linkages between sedimentary systems and crustal dynamics are well known, here the emphasis is on the more subtle signatures of tectonism recorded in the sedimentary record of the basins that are peripheral to the orogenic heartland. For example, by using a very high-resolution stratigraphic and biostratigraphic data set, Trexler et al. (2003, 2004) have demonstrated that the upper Paleozoic of the western U.S. was not, as commonly believed, a period of tectonic quiescence. This is but one example of the value of more detailed data in terms of detecting previously unrecognized tectonic episodes.
This entire tectonostratigraphic approach needs to be expanded, tested, and applied to other orogenic systems, but it demands a level of detail in stratigraphic, petrologic, geochemical, geochronologic, and paleontologic data typically not yet described from many sedimentary successions, particularly those that are highly deformed. PaleoStrat is explicitly being developed to host such detailed, interdisciplinary data sets - data sets not now available, but which will allow the researcher to peer through the deformation and pull out tectonic signatures that have not been previously recognized. Also, the explosion in the number of zircons coarsely dated by the SHRIMP technique requires an archival system, like PaleoStrat, that also allows the interpretation of these data in light of the best biostratigraphic controls of age.

  

THE USER

  

What does the user get out of PaleoStrat?

How does PaleoStrat support the community?

  

As outlined in our mission statement, PaleoStrat is designed to provide:

1) a database that includes all relevant data and metadata types for sedimentary geology, paleontology and related disciplines,

2) user-friendly ways to capture these data, 

3) a secure area for users and projects to store unpublished data,

4) simple, but powerful ways for the user to find the information they need,

5) access to the analytical and assessment tools necessary to address thematic science questions, and

6) convenient ways for the user to save and download data.

  

Although it can be stated that no other such comprehensive, integrated data system exists for sedimentary geology and paleontology - what does this mean for the user?  What does the user get out of it? How does PaleoStrat support the community? (These questions are separate from those of how the user will work with the system; see: "Working with the System").

  

The working environment

PaleoStrat is dedicated to providing a comprehensive working environment for the researcher. First, that means that we will continue to work with others to provide easy access to other data and tools, thus providing the user with seamless access to global data sets. In particular, we are working closely with SedDB, PetDB and EarthChem, JANUS, the Paleobiology Database, and CHRONOS to ensure that the researcher has access to the full suite of data they desire, whether that be from the PaleoStrat site or, for example, from the Paleobiology Database site. Internationally, we are founding members of iGeoInfo (www.igeoinfo.org) and will work with the Commission for the Management and Application of Geoscience Information (CGI; www.bgs.ac.uk/cgi_web) of the International Union of Geological Sciences (IUGS; www.iugs.org) to better link our efforts to the global needs of the science. Similarly, we are working with the US Geological Survey, Association of American State Geologists, and other state and federal agencies to help provide a two-way flow of sedimentary geology and paleontology information. We will host data sets specifically tailored for geoscience groups and initiatives, including GeoSystems, the International Commission on Stratigraphy (ICS; www.stratigraphy.org), and three of its subcommissions (Carboniferous, Permian, and Triassic). We welcome working with all groups and individuals to meet their geoinformatics needs in sedimentary geology and paleontology.

  

Crossing the waterline

We have partnered with SedDB to build a combined system to seamlessly "cross the waterline" and integrate data from the modern oceans with that from the marine rocks now incorporated in terrestrial settings. This is, of course, a reciprocal agreement - users studying, for example, the chemistry of Miocene siliceous sedimentary rocks from SedDB will be able to pull information from such rocks that are hosted in PaleoStrat without having to leave the SedDB site. In addition, the researcher can readily intermix secure data from their personal workspace with large amounts of publicly available data as they address their particular research problems.

  

Supporting user science

What does PaleoStrat provide the user in terms of their science research, in terms of the quality of their science? The following table summarizes the answer to this question. Some of these are conceptual and inherent parts of a data system, others are specific features.

  

Reproducibility and the complete geologic context

Data integrity and authenticity

User-driven data quality and reference data sets

Legacy data capture

Mechanisms for data input, personal work space, searching and data assessment

Long-term data stewardship

Support of publication of data sets

Meeting agency and project data requirements

Time to think

  

Note: some of the following features are only in the process of being implemented; others will require continued community input and work with other groups, organizations, agencies, and publishing houses to accomplish. Nevertheless, PaleoStrat is committed to working with the community and geoinformatics partners to accomplish all these goals.

  

Data, Reproducibility and the Complete Geologic Context

Science runs on data - the more data that support a given interpretation, the more robust that interpretation is; the opposite is also true. The amount of data and metadata that can be presented in a published article is limited, whereas that which can be stored and accessed in a data system is virtually unlimited. Data systems such as PaleoStrat provide a mechanism to link together the published synthesis and discussion with vastly more data than can appear on the "printed" page. They also facilitate the reproducibility of the results presented in the paper. Reproducibility is the cornerstone of all science. However, without access to all data and metadata that were utilized in a published work, key portions of those data often must be unnecessarily recompiled by subsequent researchers, wasting time and precious research funding. Linkage between publications and databases will become even more important as all major journals move to digital, online publication, and with that, a more efficient and effective communication of science and science results. Data input into PaleoStrat by the researcher become a permanent and expansive record of the research and, because the system allows for the capture of the full context of the research, meet the needs of reproducibility.

  

In this way, a data system such as PaleoStrat fosters more robust publications both by archiving data and by providing a means to present a more complete data/metadata set than is possible in the publication itself. It is also important to note that all data within PaleoStrat are also hosted at the GEON site, thus ensuring the permanent archiving of all data. Permanent identifiers are established between PaleoStrat and the publication, thus providing a link that can always be utilized to retrieve the data associated with each particular published article.

  

Data integrity and authenticity

Data integrity is something the average geoscience user does not want to think about or deal with. Data integrity means that data systems must be designed such that sufficient data and metadata types are captured to fully attribute the science that is being represented by the database. Authenticity means that each bit of data is fully attributed with its source and the nature of its limiting characteristics (this is primarily metadata). The connections among various data and metadata types must therefore be established correctly and maintained. It is far too easy for a geoscience user to think: "I am only interested in X or Y - so why all this other information (required or not)?" That other information and the relationships among all the bits of data/metadata are precisely what define the integrity of the data. Without this integrity the quality of all the data is suspect and reliability can be questioned. This underscores the fact that building an integrated, broadly based database is not a simple task. It also emphasizes that merely storing stacks of Excel spreadsheets can't provide the structure needed to capture and maintain data integrity.

  

Data quality

"Data quality" is a thorny issue because it means different things to different people. Some data sets are compiled by select groups of people to ensure that a consistent and defined set of quality standards is met. Other systems indiscriminately capture all available data. Still others might stand in the middle ground, where standards are in place that must be met before data are accepted from the public. Note that here "standards" refers to the much more rigorous, but undefined, standards of the science and scientists versus those of oversight groups such as the Federal Geographic Data Committee (FGDC).

  

To meet its mission, PaleoStrat must be a true, open, community data system. Although we will have people on staff to input data, we cannot do so for all geoscience users - the staff requirements would be prohibitive. Therefore, we must encourage and help researchers to input their own data (note: this will allow them to easily meet their funding agency requirements). In short, we cannot and will not act as gate-keepers for the system. A more robust, and perhaps more open, way to address data quality is to allow users to make those assessments from data available within the system. For example, if certain metadata are missing from a data set, those data can be assigned a lower level of quality by the user. We will also assist users and user groups in the development of reference data sets, which is another approach to ensuring the quality of the data. Our challenge is to make such assessments and reference set construction as easy as possible - and this will be a central part of PaleoStrat's next phase of development - one that will require close collaboration with the Earth science community, computer scientists, and experts in knowledge system analysis.
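The idea of user-driven quality assessment can be sketched as follows. This is a minimal Python illustration under stated assumptions: the metadata field names, the three quality levels, and the threshold of two missing fields are hypothetical choices for the example and do not describe PaleoStrat's actual fields or rules.

```python
# Hypothetical list of metadata fields a user decides to require for a study.
REQUIRED_METADATA = ("collector", "collection_date", "latitude",
                     "longitude", "section_name")


def missing_metadata(record: dict, required=REQUIRED_METADATA) -> list:
    """Return the required fields that are absent or empty in a record."""
    return [name for name in required if record.get(name) in (None, "")]


def quality_level(record: dict, required=REQUIRED_METADATA) -> str:
    """Assign a user-defined quality level from the count of missing fields."""
    n_missing = len(missing_metadata(record, required))
    if n_missing == 0:
        return "high"
    if n_missing <= 2:
        return "medium"
    return "low"


if __name__ == "__main__":
    sample = {"collector": "A. Geologist", "collection_date": "1998-07-14",
              "latitude": 39.5, "longitude": -116.0, "section_name": ""}
    print(missing_metadata(sample), quality_level(sample))
```

Because the required fields are passed in as a parameter, a different study can apply a stricter or looser list to the same records without changing the data themselves.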

  

Legacy data 

PaleoStrat is dedicated to helping the community capture critical legacy data that are at risk of being lost and that are not readily available to the scientific community - please see the separate section on this subject ("Legacy Data").

  

Long-term data stewardship

The long-term curation of all data input to PaleoStrat is guaranteed by our partnership with GEON / San Diego Supercomputer Center (www.geongrid.org). All PaleoStrat data are backed up and archived at GEON to ensure that no data will be lost.

  

Agency and project data requirements

One of PaleoStrat's goals is to provide the user with a convenient mechanism to meet data policies of NSF and other agencies and those of specific programs and projects. Researchers who do not meet these policies may jeopardize the funding of new proposals.  PaleoStrat can make it as easy as possible for the user to meet these existing and future policies. For additional discussion of this issue, please see "The Agency".

  

Data input

Data input is perhaps the most onerous task of a data system. Users don’t want to be bothered, and staffing limitations can limit the rate at which data can be input by PaleoStrat staff. Our goal is to make this task as easy as possible for the user, with as much personal help as is possible given staffing constraints. We therefore provide web-based forms and Excel template download and upload, and will work with individuals to help get their data into the system.

  

The Personal/Project Working Space

PaleoStrat provides monitored, restricted access to personal and project data. This helps with various moratorium periods for public access to data generated by NSF funds.  During these moratorium periods, the user may access their own and/or project data plus all public data for analysis.

  

Searching

Searching complex data sets is a difficult task. Forcing the user to write SQL statements won’t work; providing predetermined searches has limited applicability. Hence, PaleoStrat provides multiple ways to search our inherently very complex data set.  We allow for simple to nested complex queries, the ability to save and reuse a search, and automatic updates of searches (future feature).
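One way to picture "simple to nested complex queries" that can be saved and reused, without the user writing SQL, is as small search conditions that combine with AND/OR. The Python sketch below is an illustration only: every function name, field name, and sample record is hypothetical, and it does not describe how PaleoStrat implements its search.

```python
from typing import Callable

Record = dict
Predicate = Callable[[Record], bool]


def field_equals(name: str, value) -> Predicate:
    """Match records whose field `name` equals `value`."""
    return lambda record: record.get(name) == value


def field_between(name: str, low: float, high: float) -> Predicate:
    """Match records whose numeric field `name` lies in [low, high]."""
    def check(record: Record) -> bool:
        value = record.get(name)
        return value is not None and low <= value <= high
    return check


def all_of(*predicates: Predicate) -> Predicate:
    """Logical AND of any number of conditions (may be nested)."""
    return lambda record: all(p(record) for p in predicates)


def any_of(*predicates: Predicate) -> Predicate:
    """Logical OR of any number of conditions (may be nested)."""
    return lambda record: any(p(record) for p in predicates)


def search(records: list, predicate: Predicate) -> list:
    """Return the records that satisfy the predicate."""
    return [r for r in records if predicate(r)]


# Made-up sample records for the example.
SAMPLES = [
    {"id": "S1", "period": "Permian", "lithology": "limestone", "age_ma": 280.0},
    {"id": "S2", "period": "Permian", "lithology": "sandstone", "age_ma": 275.0},
    {"id": "S3", "period": "Carboniferous", "lithology": "chert", "age_ma": 320.0},
    {"id": "S4", "period": "Permian", "lithology": "chert", "age_ma": 260.0},
]

# A "saved search" is just a named, reusable condition.
SAVED_SEARCHES = {
    "permian_limestone_or_chert": all_of(
        field_equals("period", "Permian"),
        any_of(field_equals("lithology", "limestone"),
               field_equals("lithology", "chert")),
    ),
}

if __name__ == "__main__":
    hits = search(SAMPLES, SAVED_SEARCHES["permian_limestone_or_chert"])
    print([r["id"] for r in hits])
```

Rerunning a saved condition against an updated record list is one simple way to think about the automatic search updates mentioned as a future feature.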

  

Data assessment and analysis

It is clear that the community needs to be able to assess and analyze data extracted via a search in as easy a manner as possible. As geoinformatics moves forward, and the community becomes fully engaged, these activities will include both familiar and unfamiliar approaches, which is why all data systems must be open to user suggestions. The tools for this assessment and analysis will be available within the PaleoStrat working environment, but most will be provided through seamless, online links to our partner geoinformatics efforts. We will continue to work with the community to ensure that we meet its needs. For example, we need to discover what desktop programs users currently utilize for their analysis, and then custom build the ability to translate PaleoStrat-hosted data into the formats required by these software packages.

  

Time to think

The digital information age has already dramatically changed our everyday lives - think of the Internet, of Google. So too is geoinformatics beginning to do so for the geosciences - although, until now, at a much slower pace and in a rather inconsistent way. The situation is changing, but the change won’t be easy because of cultural issues - the “I’ve never needed this before” or “I don’t have time for this” opinions. These opinions are real, but are not valid. So too is the notion that “we should spend money on science not geoinformatics”. Why? Because geoinformatics is the platform for our future science - without it, much cutting-edge science simply cannot be done (see “Science Drivers”). It does cost to provide these facilities - but it is an investment we must make (see “Agency” for more discussion on cost). How often do we, as researchers, spend too much time compiling and recompiling information for a publication or a proposal? Wouldn’t it be far better to not have to compile the data, but merely select, assemble and then synthesize data quickly and easily? Why spend time manipulating and re-manipulating data when that time is better spent thinking about the data and its implications, and crafting better manuscripts and proposals? Geoinformatics also democratizes the science. This means that every scientist, regardless of where they work, has access to the level of data and information typically available mostly to the larger research institutions. This allows more creative minds to think about the science, to push it forward. PaleoStrat is excited to be part of this bigger picture, to be part of the change in how we are doing science. It will happen, it is happening, and it will help push our science to new levels.

  

  

DATA, METADATA AND ONTOLOGIES

Data, metadata and ontologies; what are they and why are they important? 

  

Data and Metadata: Data and metadata provide the core context of a database system. There is a difference between "data" and "metadata", but to the average domain science user (i.e., the geoscientist) of PaleoStrat the difference is not all that important. The Federal Geographic Data Committee (FGDC) describes metadata as "data about data" that describe the content, quality, condition, and other characteristics of data. That includes such items as the author of a reference, the name of the originator/owner of a sample, name of a stratigraphic section, date a sample was collected, location, what standards and constants were used in a batch of analyses, etc. However, to some, perhaps most users, the distinction between "data" and "metadata" is moot - it doesn't matter as long as all relevant information is captured that allows the system to provide a fully attributed suite of data. PaleoStrat's approach is not to burden the user with worrying about what is or is not metadata, but rather to have a minimum level of "required" information (metadata). Nevertheless, in the background we will continue to work with others to ensure that the metadata captured are sufficient to meet national and international standards, and to provide the basis for the interaction of PaleoStrat with the larger geoinformatics system. Most importantly, through our working groups, we will continue to work with the user community to allow them to define metadata standards for their subdisciplines.

  

Ontology: The definition of "ontology" is difficult - there are many variations. However, with respect to an Earth science database, an ontology is a formal definition of the concepts and relationships that exist among the various data elements. This can be as simple as the relationships between the framework grains, cement and matrix in a sandstone. Another way to think of this is as a formal set of relationships that define the data model for the database. Again, this is generally something the Earth scientist is not concerned with - but it is important because the development of comprehensive ontologies facilitates data mining and discovery by the user, supports database interoperability, and aids the development of tools to better model the data. The challenge with an ontology is to capture as much of the meaning of the science as possible - and until you start to build one, you generally are not aware of the full complexity of all the relationships.  Databases and ontologies are thus intertwined in the sense that both require a fundamental understanding of the logic of the science they are attempting to describe. Databases can be constructed without first developing a formal ontology, simply because a database implicitly uses the logic of the domain science to develop the data tables, the data fields, and the relationships among the tables. It is the complexity of the data being captured that makes this either a relatively easy or an extremely difficult task. The quantity of data to be captured is not the problem; the complexity generates the challenges. Although the PaleoStrat data model was not explicitly built with ontologies in mind, it was constructed by Earth scientists to capture the data and metadata they felt were needed. The next generation of the PaleoStrat data model will be built around an ontology, one that we will develop in partnership with other databases, such as SedDB, PetDB, and the Paleobiology Database, and with the expertise at GEON.

  

Conceptual Model for a Sedimentary-Ancient Life Information & Modeling System

  

Relationships Among Earth Science Data

  

A fully implemented geoinformatics system must be comprehensive, modular, and extensible. Modularity and extensibility are necessary to accommodate growth and evolution of thought in both the Geosciences and Computer Sciences. To better allow for this modularity and to more clearly communicate the needs of the geoscientists to the computer scientists, a conceptual model of the geosciences (in this case for sedimentary-paleontology systems) must be developed that is suitable for designing the information system.  This conceptual model should group and connect the relevant data and metadata into a logical geologic framework.

  

This does not have to be a “final” model - indeed it cannot be so, simply because the model must grow with the science. Nor does it have to encompass the entire breadth of the geosciences - it can be done in a “modular” way, provided that the linkages to other modules are articulated.

  

With this in mind, we outline here a basic conceptual model for sedimentary-ancient life systems. Topically it encompasses:

  

paleontology, biostratigraphy, paleobiology, lithostratigraphy, sequence stratigraphy, cyclostratigraphy, chemostratigraphy, chronostratigraphy, magnetostratigraphy, paleogeography, basin analysis, tectonostratigraphy, and tectonics.

  

The idea is to keep the conceptual model simple enough to be useful for database and tool design, but sufficiently comprehensive to encompass all of the above disciplines - in total or in part.  It is important to separate the database from the more thematic questions, such as the evolution of life (speciation, extinction), Earth’s chemical evolution, deep-time sea level changes, etc.  Such themes require only two components: 1) a database that includes all relevant data and metadata types, and 2) the analytical and assessment tools necessary to address the thematic questions.  Thus, the conceptual model must accommodate not only the physical, chemical, and biologic features of the rock record, but also the intended uses for these data.  Again, because it is impossible to make a “final” statement on the latter, the system must be extensible. Furthermore, this model is put forward as an initial attempt that can only be improved as subdiscipline experts make corrections and add more detail.

  

The basis for the conceptual framework presented reflects the following:

  


  

1. scale dependence
2. time series
3. spatial patterns
4. age and location
5. time proxies
6. samples and features
7. physical, chemical, and biologic characteristics
8. temporal "location"
9. spatial distribution


  

 

  

  


Scale dependence: "Scale dependence" means that data types; physical, chemical and biologic characteristics; resolutions; and relationships among data types all depend on scale. It can be addressed using the “zoom-in” and “zoom-out” metaphor: zooming in reveals different data types and more detail; zooming out provides more spatial coverage, but fewer details and a different suite of data types.

  

Time series: "Time series" reflects the fact that the vertical succession of sedimentary strata records a time series of events. Each overlying stratum, and all its contents, is younger than the one beneath it.  Hence a time sequence is recorded within the stratigraphic stack, and all objects and data collected through the stack reflect a time series.  This is complicated when crustal deformation (folds, faults) disrupts this ideal layer-cake geometry and reverses the age order in present-day geometries; i.e., faulting may place older rocks on top of younger rocks.

  

Spatial patterns: "Spatial patterns" encapsulates the notion that there are lateral and vertical changes in the physical, chemical, and biologic characteristics of sedimentary strata. Spatial patterns reflect the fact that laterally, along any one time line through the rock record (think of a single stratigraphic plane or one card in the deck), objects change in systematic, predictable ways - even if that change is chaotic. In a vertical spatial sense (assuming no structural disruption; see "time series"), we have similar changes in geologic characteristics as geologic conditions change through time. Think of a beach facies that migrates seaward as sea level drops, overstepping the adjacent nearshore marine facies.  This may or may not encompass a simple “time series” of features.

  

Age and location: "Age and location" underscores the fact that every object collected or described from the rock record has an “age” of origination, a “location” in modern coordinates, and a “paleo-location”. “Age” is apparent from "Time series". “Location” refers not only to location in present-day coordinates, but also to paleo-coordinates within the original sedimentary basin and on the globe.  This is important for reconstructing the history of each basin, mountain system, continent, etc. Location can be recorded in decimal degrees or in stratigraphic meters from an arbitrary point within a stratigraphic section, well, or drill hole (which are themselves located via latitude and longitude). Present-day locations do not typically represent the origination location. Present-day locations have to be extrapolated backward to reconstruct the migration pathway from the point of origin of any given feature or object (this is paleogeography).
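
To make the two ways of recording location concrete, the following Python sketch stores a stratigraphic height relative to a section that is itself located by latitude and longitude. The class and field names are illustrative assumptions, not part of the PaleoStrat data model, and the paleo-coordinates are left empty because they come from a reconstruction rather than from the field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Section:
    """A stratigraphic section, well, or drill hole with a map position."""
    name: str
    latitude: float   # present-day, decimal degrees
    longitude: float  # present-day, decimal degrees


@dataclass
class Location:
    """Where an object sits today, plus an optional paleo-location."""
    section: Section
    height_m: Optional[float] = None         # stratigraphic meters above an arbitrary datum
    paleo_latitude: Optional[float] = None   # supplied later by a reconstruction
    paleo_longitude: Optional[float] = None

    def present_day(self) -> tuple[float, float]:
        """Present-day coordinates are inherited from the host section."""
        return (self.section.latitude, self.section.longitude)


section = Section("Example Section", latitude=43.6, longitude=-116.2)
sample_location = Location(section, height_m=12.5)
print(sample_location.present_day())
```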

  

Time proxy: "Time proxy" refers to the fact that the rock record is an incomplete proxy for time; the completeness of the record varies laterally as well as vertically. This says that there are gaps in the rock record - that the geology at any one location is an incomplete record of that location's geologic history, and hence, the rock record in general is an incomplete proxy of time.  Because these gaps are different at different places, one goal of sedimentary geology is to correlate the geologic records of many sites, globally if possible, to better piece together a more complete geologic history of the Earth.

  

Samples and features: This notion reflects that sedimentary data can generally be divided into "samples" or "features", each attributed with different characteristics. Samples of rock are taken for a variety of purposes (fossils, geochemistry, petrographic analysis, etc.).  Features include those of cyclostratigraphy (cyclic changes in lithology, organic content, geochemistry, etc.), event stratigraphy (discrete physical and/or biological stratal units or surfaces; e.g., storm beds (tempestites), sediment gravity flow deposits, fossil accumulations (mass kills to longer-term winnowing), omission surfaces, ash fall deposits), sequence stratigraphy (systems tracts, parasequence boundaries, etc.), and tectonostratigraphy (faults, folds, unconformities, etc.). These attributes are data or metadata, and can be virtually anything that is associated with a particular sample or feature. They are scale-dependent. Populations of samples can be depicted on graphs and charts; diagrams and photographs may be used to depict features and samples. Geologic maps, cross-sections, stratigraphic sections, and well-logs also depict features - geologic features that intersect the land surface, a vertical slice through the crust, or an effectively one-dimensional line through the crust (sedimentary succession). These must also be attributed - e.g., formations, strikes and dips, description of a photo, location of a photo or diagram, etc.  Annotations are forms of attribution.

  

Physical, chemical, and biologic characteristics: Samples and features may have, in addition to time and location “attributes”, physical, chemical, and/or biologic characteristics or “attributes” obtained by description, measurement or analysis. Physical characteristics include geometry, texture, fabric, etc.  Chemical characteristics have values that reflect the analytical method plus the specific object analyzed, e.g., radiogenic or stable isotopes; trace and major elements; whole rock, mineral, grain, fossil, or matrix "objects".  Biologic characteristics include taxon types and abundance, details of taxon morphology, preservation modes, etc. These types of characteristics record the processes that have dictated the origin, transport, and deposition of the sediment. Any model or hypothesis that purports to explain these processes must be able to successfully predict all of these characteristics at all their spatial and temporal scales.

  

Temporal location: "Temporal location" as a point or an interval of time reflects a combination of issues raised in the discussion of "time series", "spatial patterns", and "age and location". Fundamentally, "temporal location" can be denoted as a "point" in time or as an "interval" of time.  A time "point" connotes exactness - a precise and accurate age determination (typically with a stated uncertainty). An "interval of time" reflects: a) an exact range in age for an object, or b) an age range that reflects less exact levels of precision or accuracy.  For example, a radiogenic isotope age of 354 ± 2 m.y. (million years) indicates that the mean age is 354 m.y., but the range (or interval) is 356 to 352 m.y.  Similarly, a fossil may be given an age of "lower Sakmarian" (an "exact age") or "lower Sakmarian to upper Artinskian" (an age range or interval). Of course "lower Sakmarian" is itself an "interval", encompassing approximately 1.5 to 2 m.y. Likewise, ages in millions of years appear to be "exact" in that they are stated as a discrete number, but this depends on the definition of "exact" - if your desired level of accuracy is 100,000 years, then an age of 354 m.y. is not "exact".  Hence, the user must accept that "exact" is relative to the desired level of precision.  "Sakmarian" is a stage of the Permian system - the accuracy is then assumed to be within the boundaries of the stage as defined by some time scale (which is why the time scale used must be specified). For radiometric ages, it is a bit more difficult, because an analytical precision can be stated as 354.25 ± 1.25 m.y., but that does not necessarily reflect the "accuracy". Accuracy is influenced by many factors: not only analytical issues, but also sampling errors, sample handling errors, etc. The geochronology community needs to address these issues in terms of data handling and interpretation.
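
One way to treat "points" and "intervals" uniformly is to store every age as an interval and let a point be an interval built from a mean and its stated uncertainty. The Python sketch below does this for the 354 ± 2 m.y. example; the AgeInterval class and its method names are assumptions made for illustration, and the uncertainty here is analytical precision only, not accuracy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgeInterval:
    """An age in millions of years (m.y.), stored as an older/younger pair."""
    older: float
    younger: float

    @classmethod
    def from_point(cls, age: float, uncertainty: float = 0.0) -> "AgeInterval":
        """Build an interval from a 'point' age and its stated uncertainty."""
        return cls(older=age + uncertainty, younger=age - uncertainty)

    @property
    def duration(self) -> float:
        """Width of the interval in m.y."""
        return self.older - self.younger

    def is_exact_at(self, resolution: float) -> bool:
        """'Exact' only means no wider than the resolution the user wants."""
        return self.duration <= resolution

    def overlaps(self, other: "AgeInterval") -> bool:
        """True if the two intervals share any span of time."""
        return self.younger <= other.older and other.younger <= self.older


radiometric = AgeInterval.from_point(354.0, 2.0)
print(radiometric.older, radiometric.younger)  # 356.0 352.0
print(radiometric.is_exact_at(0.1))            # False at 100,000-year resolution
```

A stage name such as "lower Sakmarian" would map to an interval in the same way, but its numerical boundaries depend on the time scale used, so that lookup is deliberately not shown.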

  

Spatial Distribution: The "spatial distribution" of an object can vary from microscopic to local to global. That features and samples have spatial distributions should be obvious, but it is important to separate "known" global distributions from presumed ones.  For example, sequence stratigraphy assumes the global distribution of features related to sea level rise and fall. The issue is to develop an independent data set that can test such assumptions rather than merely compile such interpretations; in the latter case it is far too easy to forget, and begin to think of interpretations as "data".

  

Processes yield Products and Products have Properties

  

A host of geologic/geophysical/biologic processes produce the products we see, describe, measure and sample in the rock record.   These processes include plate motions and interactions, magmatism, metamorphism, deformation in all its types and scales, climate, the construction, filling, and destruction of sedimentary basins, rise and fall of sea level and its imprint on the stratigraphic record, faunal origination, radiation, migration, and extinction, formation of faunal provinces and the fossils they produce via taphonomic processes, sedimentary facies, and the diagenesis of these sediments to form sedimentary rocks.

  

You can describe a hierarchy to the products generated by these processes.  These products are interrelated, to various degrees, on some spatial/temporal scale. Products have properties that can be recorded.

  

Hierarchy of Products

  

Sedimentary_system: encompasses all of the lower classes - it does not, by itself have any properties.  It really means: “sedimentary-paleobiologic system” because you simply cannot separate the fossil from the physical sedimentary record - they are genetically linked through a number of processes, and they are spatially and temporally linked in the rock record.

  

Tectonic_Setting: defines the larger-scale setting of the sedimentary-paleobiologic system; it evolves through time, and encompasses sedimentary basins of all scales.

  

Tectonic_Events: Tectonic events, singly, in succession, or repeated over longer spans of time, produce and disrupt the sedimentary basins that host the sedimentary and paleobiologic records.  In the modern oceans, subduction may be the first important tectonic event to disrupt the stratigraphic succession; in continental foreland basins, development of the hinterland fold-thrust belt will progressively deform the adjacent basin and its sedimentary succession.

  

Sedimentary_Basin: (Or just “Basin”); these represent the crustal bowls that accumulate the sediments and their fossils.  To accumulate a significant thickness of strata, they must be tectonically derived, hence the superclass link to “Tectonic_Setting” and “Tectonic_Events”.  Non-tectonic basins do exist, but they do not accumulate more than a few hundred meters of strata, and are rarely preserved in the rock record.

  

Geologic_Units: These are the “defined” units recognized within the stratigraphic succession of a sedimentary basin. They include classic lithostratigraphic units (e.g., formations, etc.), but are not restricted to them: newer sequence stratigraphic (allostratigraphic), cyclostratigraphic, magnetostratigraphic, and chemostratigraphic units are also formally “defined” and described units.  Also included, typically at smaller scales, are “genetic units” - those that reflect an interpretation (from physical, chemical and biologic features) of their origin, e.g., lithofacies, biofacies, water mass facies, and perhaps a “climate facies.”

  

Earth_Material_Samples:   The term “earth materials” comes from the USGS-Canadian Geological Survey’s draft version of a geologic map ontology, and seems appropriate as a modifier for “Sample”.  Samples merely reflect the smaller scale features of the sedimentary rocks - ones we typically take physical samples of, or measure in the lab or directly on the outcrop.  In this regard, a strike and dip are types of “samples” - but that is not relevant to the sedimentary system we are addressing here.
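
The hierarchy above can be sketched as a chain of linked objects, each pointing to the class above it. This Python sketch is only an illustration under simplifying assumptions: the Product class and its lineage method are hypothetical, each product has a single parent, and Tectonic_Events is left out of the chain even though the text links basins to both settings and events. A fuller model would allow more than one superclass link.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Product:
    """One node in the hierarchy of products; `parent` is its superclass link."""
    kind: str
    name: str
    parent: Optional["Product"] = None

    def lineage(self) -> list[str]:
        """Walk upward from this product to the top of the hierarchy."""
        kinds = []
        node = self
        while node is not None:
            kinds.append(node.kind)
            node = node.parent
        return kinds


# Placeholder names; only the `kind` labels come from the text.
system = Product("Sedimentary_system", "example system")
setting = Product("Tectonic_Setting", "example foreland setting", parent=system)
basin = Product("Sedimentary_Basin", "example basin", parent=setting)
unit = Product("Geologic_Units", "example formation", parent=basin)
sample = Product("Earth_Material_Samples", "example hand sample", parent=unit)

print(" -> ".join(sample.lineage()))
```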

  

The Properties

  

Typically we immediately jump to those properties we are familiar with, such as “lithology”, or some “geochemical” analysis number.  In an ontological sense, however, we need to view properties in two ways: by their physical, geological relationships to each other, and by their “types”.

  

Physical_Geological_Relations: This encompasses the fact that physical properties have spatial, temporal, directional and magnitude values.  Temporal values include those derived from geometric, stratigraphic, intrusive, and structural relationships as well as “geologic age” as measured from fossils, radiometrically, etc.  All of these yield a series of other values that can be dealt with as outlined in “primary values.”

  

Property_Types: There are only 3 general property types: primary, derived, or defined. Primary includes physical, chemical, and biologic properties, and these have “primary_values” (see below).  Models of various types, explicit or implicit, are used to produce “derived” properties; these include charts, diagrams, numeric and statistical values, geological maps (arguably a model or visualization), and others.  A “defined” property is one where a general definition exists and is applied by the worker to a product (feature); for example, “lithofacies” could be a “delta front facies” - there is a definition for this facies that is then used for the interpretation.

  

Primary_Values: These include: location, described, estimated, measured, calculated, and portrayed.  Note that location is highlighted separately for emphasis only; it is either a measured and/or described value.  Age is similarly fundamental to the sedimentary system, although it is not highlighted separately in this discussion; it is a critical "primary value" that is either a measured or derived product.  Measured includes a variety of “values” such as those listed.

Metadata: Data about data; these include items such as cruise or expedition (field trip) during which a hole (core) was drilled, a stratigraphic section measured, a sample was taken, sample storage location, person who took the sample, etc., a wide range of information necessary to track data and assess its quality and completeness. Often, metadata can be and is encompassed in the categories above, but there may be some information that cannot so be accommodated.

  

All of the above are encompassed in the ontology as specific items, but their roots as a physical relationship or property type are implicit.
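
The property types and primary values described above lend themselves to simple enumerations. The Python sketch below is one hypothetical encoding, not the PaleoStrat ontology itself: the class names, the Property record, and the rule that a primary property must state how its value was obtained are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PropertyType(Enum):
    PRIMARY = "primary"
    DERIVED = "derived"
    DEFINED = "defined"


class PrimaryValue(Enum):
    LOCATION = "location"
    DESCRIBED = "described"
    ESTIMATED = "estimated"
    MEASURED = "measured"
    CALCULATED = "calculated"
    PORTRAYED = "portrayed"


@dataclass
class Property:
    """A single property attached to a product (sample or feature)."""
    name: str
    property_type: PropertyType
    value: object
    primary_value: Optional[PrimaryValue] = None  # used for primary properties

    def __post_init__(self) -> None:
        # Assumed rule: a primary property must say how its value was obtained.
        if self.property_type is PropertyType.PRIMARY and self.primary_value is None:
            raise ValueError("a primary property needs a primary_value kind")


grain_size = Property("grain size (mm)", PropertyType.PRIMARY, 0.25,
                      PrimaryValue.MEASURED)
facies = Property("lithofacies", PropertyType.DEFINED, "delta front facies")
print(grain_size.primary_value.value, facies.property_type.value)
```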

  

LEGACY DATA

  

The Problem of Legacy Data

Legacy data are perhaps the most difficult data to capture in a database.  Nevertheless, a motto for PaleoStrat could be: “Built for the future - but working to capture the past.”  This reflects the importance not only of providing a mechanism for capturing and delivering data from new research, but also of capturing the legacy data from past research. Much of the legacy remains as valuable today as when it was generated. Some is irreplaceable when descriptions cannot be reconstructed because the author has died, notebooks have been lost, or samples cannot be recollected because outcrops no longer exist or are inaccessible. Our legacy is one of incomplete data sets - and most of these missing data and metadata may not be recoverable. In part, this is because historically we have been limited by the length of published articles, often without the benefit of “supplemental data” archives. The advent of geoinformatics has begun to remove scientists’ reluctance to share unpublished information, but missing data and metadata are still difficult to recover.  Often, the researchers who generated the data are, as they must be, very busy doing new science, teaching, administrative work, and management.

 

Some other databases, such as SedDB, PetDB, NAVDAT, and the Paleobiology Database, have taken on this challenge by loading data, and whatever metadata are available, directly from the published literature. This is a huge task for the principal investigators on these database projects, but it has proven fruitful. The challenge is to find, load and then quality-check the data; often the author of the publication must be contacted for additional information.  It is a heroic effort, but the approach has proven workable because of the vision and efforts of the principal investigators, although it is burdened by the lack of sufficient funding for the people who load and quality-check the data.

  

The Solution for Legacy Data in PaleoStrat

One motivation for developing PaleoStrat was the need to eliminate the tiresome duplication of effort in compiling and recompiling the same old data into many new projects. Most projects still need to combine information from both printed and electronic sources, but NSF and journals are more actively encouraging the electronic archiving of these data.  Thus there is a passive and effective prioritization in effect – those legacy data which are most useful will be the most likely to be archived. PaleoStrat has had limited success in attracting voluntary contributions of legacy data, but the experiment has been in place for only a very short time.  We propose various strategies for the more efficient capture of legacy data, to make the contribution of data as attractive as possible for scientists. To simplify the process, we continue to offer web-based forms and Excel templates that can be downloaded and uploaded. However, while some may take advantage of these resources, many others will only have isolated Excel files; old, home-grown databases (e.g., in Access); or merely paper copies of published articles and paper data in their file cabinets.  In these instances, we also offer logistical support by setting up a data loading system composed of a supervisor and undergraduate geoscience students. We will work with all interested scientists, but such data loading will depend on the availability of our resources. Finally, we are enthusiastic partners of the SESAR project (spearheaded by K. Lehnert and S. Goldstein, Lamont-Doherty Earth Observatory) to develop an international system for a unique sample identifier (IGSN - International Geologic Sample Number – www.geosamples.org), which will greatly improve the quality of captured legacy data.

  

  

THE GOVERNMENT AGENCY

How do data systems such as PaleoStrat help government agencies and why should the community user care?

                                                                                                     

Agency and Project Data Policies

The return on investments by government agencies in science research is minimized by the reliance solely or mainly on the printed article as a means of documenting research results. Digital information data systems are important mechanisms that will ensure the maximum return on those investments. These systems, which are more than just databases, must be distributed to ensure that adequate expertise is available to correctly construct, maintain, and migrate them to new technologies. This expertise must come from combined teams of domain and computer scientists, assembled to meet the different agency missions and mandates. We are just now seeing the initial development of such systems by the user scientific community, and it is this user community which is putting pressure on the funding agencies for support. Many federal and state agencies are therefore in the middle of assessing how to deal with these issues - issues with which they have little prior experience.

  

      Some research programs funded by government agencies have always required the placing of first-level data in databases, e.g., the U.S. component of the International Ocean Drilling Project (ODP and now IODP) requires data gathered during or shortly after a cruise to be placed into Janus. Others have data policies in place but these policies are effectively not enforced, and others have no policy.  The US National Science Foundation (NSF) is one agency where many of its research programs already require that all data generated by their funds be placed in public data repositories. It now appears that NSF as a whole, as well as other US federal agencies, is rapidly moving toward similar requirements. However, many of the existing data policies are not fully enforced, and thus much of the investment in science research is lost (see "Legacy Data").  This is exacerbated by a general lack of data systems that can host these data and metadata. Such systems, when designed to capture the full context of the science research, are extremely complex and difficult to build, maintain and migrate to new technologies, and simply have not yet been put in place, in part because of the cost (see "Cost").

  

      One of PaleoStrat's goals is to provide the user with a convenient mechanism to meet both NSF's data policies and the needs of the science research. The data policy statements for NSF's Earth Science (EAR) and Ocean Sciences (OCE) divisions, and that of Polar Programs (OPP), are provided for reference. Although EAR is PaleoStrat's main home, we are working with partner geoinformatics efforts and researchers whose primary or significant funding comes from OCE and OPP (e.g., SedDB, CHRONOS, etc.); this cross-division effort helps to build a common platform and interface.

  

      NSF's National Science Board (NSB; www.nsf.gov/nsb) has started to address agency-wide data issues. The NSB oversees and guides the activities of, and establishes policies for, NSF, and also serves as an independent national science policy body providing advice to the President and Congress on policy issues related to science and engineering. The prepublication draft of the NSB position on long-lived data sets underscores the growing awareness of the importance of digital data and the need to gather, maintain, and make accessible vast amounts of NSF-sponsored data (see:  http://www.nsf.gov/nsb/documents/2005/LLDDC_report.pdf).  This NSB document also clearly indicates that NSF's data policies will be strengthened in the near future.

  

"The National Science Board (NSB, the Board) recognizes the growing importance of these digital data collections for research and education, their potential for broadening participation in research at all levels, the ever increasing National Science Foundation (NSF, the Foundation) investment in creating and maintaining the collections, and the rapid multiplication of collections with a potential for decades of curation."

  

In the United States, Executive Order 12906 calls for the establishment of the National Spatial Data Infrastructure (NSDI), defined as the technologies, policies, and people necessary to promote sharing of geospatial data throughout all levels of government, the private and non-profit sectors, and the academic community. The Federal Geographic Data Committee (FGDC; www.fgdc.gov/nsdi/nsdi.html) is implementing this order. Circular A-16, issued by the Office of Management and Budget (OMB) of the Executive Office of the President, explains the Executive Order (www.whitehouse.gov/omb/circulars/a016/a016_rev.html). Circular A-16 charges the US Geological Survey with designing and implementing methods to handle geologic data: ". . . geologic spatial data theme includes all geologic mapping information and related geoscience spatial data (including associated geophysical, geochemical, geochronologic, and paleontologic data) that can contribute to the National Geologic Map Database as pursuant to Public Law 106-148." The USGS is carrying out this task in collaboration with other agencies and the Association of American State Geologists. Although these data do not encompass all that NSF-based researchers utilize or produce, it is very important that PaleoStrat, and all geoinformatics efforts, ensure that our data systems are compatible with those of the USGS, including meeting minimum standards set forth by the FGDC.

  

PaleoStrat's role

It is PaleoStrat's responsibility, working with NSF management and other geoinformatics efforts, to ensure that all rules and policies are met, whether they come from NSF or the FGDC or are additional activities mandated by OMB and/or the US Congress.  This insulates geoscience researchers from unnecessary burdens while keeping them, and their research, in compliance.  Thus, PaleoStrat represents a repository that is being developed to meet NSF's current and future data policy standards.  Because PaleoStrat is community-focused, this helps ensure that the system meets not only agency and project requirements but also the needs of the larger research community.  By working closely not only with NSF's management, but also with cyberinfrastructure efforts by other agencies that host sedimentary geology-paleontology (sensu lato) data, PaleoStrat will ensure that what we build for our community is interoperable and meets the standards of not only NSF, but also other federal agencies, in particular the US Geological Survey and state geological surveys (through the Association of American State Geologists).

  

Cost

This discussion applies only to the US National Science Foundation (NSF), but similar issues are faced by other federal and state agencies.

  

The issue

The development, implementation and stewardship of data systems is expensive not because of the technology issues, but because it is a people-intensive endeavor. There is no way to circumvent this reliance on people.  In particular this is the case for the next 10 years, as the intensive interaction of the domain scientist and their computer science colleagues is required to construct the complex data systems required by the domain science research. It is, and will continue to be an iterative process that will progressively produce a mature product, but not quickly, and it won't be easy.

  

A "data system" is here defined as the database and all associated technologies necessary to capture, store, deliver and assess data. NSF's initial report on "cyberinfrastructure" (Atkins et al., 2003; http://www.nsf.gov/od/oci/reports/atkins.pdf) suggested that $1 billion would have to be invested to develop cyberinfrastructure.  Although data systems were not fully assessed in that report, they are a significant part of such expenditures. It is important to emphasize that data are the core of cyberinfrastructure (which, for the geosciences, we refer to as "geoinformatics").  For example, without data, the continued development of better methodologies and technologies for database interoperability, global connectivity, data mining and knowledge concepts, etc., is not justified, and the need for high-end computing is significantly lessened.  Furthermore, it is not just about large volumes of data derived from sensors, sensor arrays and other forms of remote sensing, but also about all those data generated by the vast majority of scientists, spanning field work to detailed laboratory analyses. And it is about capturing not only data generated by future research but also the vast amount of legacy data we are at risk of losing (see "Legacy Data"). In fact, it can be argued that the biggest cyberinfrastructure challenge is the development of data systems - for here is where the need to deal with the full spectrum of the science complexity occurs and where, simultaneously, you must deal with the human interface issues.

  

How much?

A thorough assessment of costs has not yet been done, in part because the nature of the geoinformatics system has not been fully defined. Nevertheless, here, for the Geosciences Directorate, it is estimated that each division will need to spend an additional $40M/yr to achieve minimum goals for the overall cyberinfrastructure.  The data systems portion could easily be $20M/yr or more of that total.

  

The cost of each data system varies depending on which aspects of the science and/or data assessment tools it focuses upon. Again, it is important to stress that a centralized approach will not work, at least for the next 10 years, until the overall system matures to the point that such centralization is feasible. A distributed structure is required because of the need to capture the complexity of the science. Each component must, at least initially, be as close to the domain science subcommunity as possible. It is thus both fiscally and practically more efficient to start with a distributed but coordinated suite of efforts, and then to migrate them, as necessary and desirable, to a more centralized structure in the future via wise and concerted agency management.

  

The challenge for the geoscience community

The geoscience community faces many challenges regarding geoinformatics - and perhaps the most difficult of these are cultural issues. But, assuming that both the directorate and a sufficient core of the community understand the importance of cyberinfrastructure/geoinformatics to the future of their science, the question becomes:

  

How do we raise the additional funding necessary to support the geoinformatics efforts?

  

This question directly confronts the preference of many scientists to see all available budgetary money directly supporting their science research.  But this is a mistaken view - at best a short-sighted one.  NSF has always funded facilities that support the science research.  Without such facilities - ships, analytical facilities, airplanes, experimental facilities of various types, etc. - the science could not be done.  A certain portion of NSF's portfolio is thus mandated for facility funding. Geoinformatics is a facility, a distributed one, but a facility nonetheless.  It does not, any more than the other supported facilities, compete with doing science - indeed, it is required for the conduct of that science. So the question is not "should we", but "how do we" support the geoinformatics facilities?

  

It is clear that NSF will be compelled to move toward support for cyberinfrastructure. The notion of return on investment introduced in the first paragraph is real.  For less than 3% of what NSF has spent on science and engineering research over the last 20 years, it can capture both the legacy and future data streams to ensure the maximum return on that investment. This estimate focuses on the needs for data and related computation at the directorate level - the level where effective cyberinfrastructure for the domain science (e.g., geoinformatics in our case) must occur.  Some people in the US Congress are already aware of this.  But that too is an overly simplistic statement, because it is a huge number on a per-year basis. If NSF were to spend, over the next 10 years, the stated $1B/yr to develop and deploy cyberinfrastructure that would capture data from 1995 as well as new data streams, that would be a cost of $10B versus total NSF budgets of approximately $100B - 10% of the total, and 20% of the yearly budget. Accomplishing this will require an infusion of new money to NSF. It is up to the community to support NSF as it seeks that budget increase.
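
The arithmetic behind these percentages can be laid out as a short sketch. It uses only the figures stated above ($1B/yr, 10 years, roughly $100B in total NSF budgets, and a 20% share of the yearly budget); the variable names are illustrative, and the yearly NSF budget is not given in the text, so it is back-calculated here as an assumption implied by the 20% figure.

```python
# Figures stated in the text, in billions of US dollars.
annual_ci_cost = 1.0       # stated cyberinfrastructure cost per year
years = 10                 # stated deployment period
total_nsf_budgets = 100.0  # stated approximate total NSF budgets
share_of_yearly = 0.20     # stated share of a single yearly budget

# Total cost of the program over the full period.
total_ci_cost = annual_ci_cost * years

# Share of the stated total NSF budgets that this cost represents.
share_of_total = total_ci_cost / total_nsf_budgets

# Assumption: the yearly NSF budget is not stated in the text; this is
# the value implied by $1B/yr being 20% of a yearly budget.
implied_yearly_budget = annual_ci_cost / share_of_yearly

print(f"Total cost over {years} years: ${total_ci_cost:.0f}B")
print(f"Share of total NSF budgets: {share_of_total:.0%}")
print(f"Implied yearly NSF budget: ${implied_yearly_budget:.0f}B")
```

Read this way, the 10% figure compares the ten-year cost with the approximate total of NSF budgets, while the 20% figure compares a single year's cost with a single year's budget.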