DataStandards



Intro

This is a copy of the original post at http://twiki.sura.org/bin/view/Main/DataStandards

By data standards I intend to talk more conceptually about data organization, rather than about a final implementation format such as a relational database, netCDF, HDF, XML, etc.

How are data, metadata and data transport standards different?

Data, Metadata, Data transport.

Data is the data as it exists in its structure and implementation format. Metadata is the data as it might be described or associated. Data transport is the data as it might be transported.

Both DODS/OPeNDAP and the OGC standards (WMS, WFS, WCS) seek to provide a common descriptive method of retrieving data (data transport). DODS/OPeNDAP approaches this by developing server-side software which presents a common HTTP interface over all possible underlying data implementations (a netCDF server, RDBMS server, Matlab server, etc.). The OGC standards approach this by specifying a set of web services which are left to the community to implement as needed.

Neither DODS/OPeNDAP nor OGC clients address specifics about the underlying data structure or data implementation format (netCDF, RDBMS, XML, etc.). This is by design, to keep these tools at the abstract layer of data description and transport and not tied to any particular data structure or format.

But in the complete data management workflow, the questions of data structure and format become critical as well.

Well-developed programmatic conventions on data structure can reduce the need for data transformation, increase storage efficiency and increase the longevity of developed processes for the observational and modeling communities.

Data format (netCDF, HDF, RDBMS, XML, CSV) will continue to be a difficult choice, as the resources on hand for collection, storage and application create a broad number of decision scenarios. Data format is closely tied to application and performance issues, but tools to abstract and map data structure from data format should continue to be pursued.

Case Study: Southeast Atlantic Coastal Ocean Observing System (SeaCOOS)

http://www.seacoos.org

SeaCOOS presented the chance to define data standards and integrate in-situ observational, modeling and remote sensing data products from several institutions and programs. The GIS visualization provides immediate feedback and validation as to the effectiveness of integration efforts, while also raising further questions about the quality and display of data.

The in-situ and remote sensing layers are presented in the MapServer GIS and allow for raster query - drilling through data layers to obtain the point values. This allows calibration/validation between these products. GIS animation and time series graphs are also provided for all products.

For the in-situ and modeling products, each involved institution was able to set up a DODS/OPeNDAP netCDF server to share a netCDF file representing the past two weeks' worth of data. While this data is available via DODS/OPeNDAP, the amount of data being transmitted was only a few kilobytes, so for performance reasons, when aggregating the data at the central relational database, the file was transferred directly (not utilizing the DODS/OPeNDAP API) and parsed with Perl netCDF libraries http://www.carocoops.org/bb/viewtopic.php?t=104 .

In-situ observational

MapServer GIS http://nautilus.baruch.sc.edu/portal_obs

DODS/OPeNDAP data access http://trident.baruch.sc.edu/dods/seacoos

CSV (Comma Separated Value/Excel) files and Perl generation programs http://nautilus.baruch.sc.edu/seacoos_data/CSV/

After many telephone and email discussions between institutions, a few data standard proposals for a SeaCOOS in-situ netCDF format were developed. While the format currently focuses on a few different cases (point, moving point, grid), an example of this format can be seen at http://seacoos.marine.usf.edu/cgi-bin/nph-dods/data/seacoos_rt/

This site lists all the canonical forms which were under consideration
http://nccoos.unc.edu/data/test_cdl/

Here is documentation on the current SeaCOOS CDL v2.0 format:
SEACOOSv2.0

Charlton's Perl programs, which gather the data from providers on a periodic basis (and which have come to be termed the data 'scout'), are listed here http://nautilus.baruch.sc.edu/seacoos_data/CSV/data_scout/

One of the main points developed was to allow for the most complex data model and let everything else fall out as a subset of that case. With this thinking in mind, the best way to model the data is to make all variables (including latitude, longitude and depth) a function of time. The other data forms allowed for programmatic 'shortcuts' based on the types of dimensions presented in the file - for instance, if latitude and longitude were each a dimension of 1 point, then the file was processed as a fixed station. Most of the debates centered around whether description should be carried in the variable attributes or in the dimension or variable naming convention. COARDS and CF conventions were adopted where possible.

Modeling

SeaCOOS model output integration website (3 merged area domains covering the Southeast) http://nautilus.baruch.sc.edu/model_rs

The following is a link to the model output for the top-right section
http://nccoos.unc.edu/data/nc-coos/model_data/quoddy/forecast/

Modelers needed to resolve within their group the resolution and interpolation issues regarding time and other variables as they related to the model output, model region overlaps and their display. Display of the results via the GIS helped articulate various projection, alignment and resolution issues. Since all the model data was homogeneous area/field-oriented data, deciding on a common netCDF representation was fairly straightforward.

Remote Sensing

MapServer GIS http://nautilus.baruch.sc.edu/rs
Remote sensing listserv http://caro-coops.org/cgi-bin/wilma/remotesensing

Aggregation Process

In-situ and Modeling

The process for in situ and model data which is being formatted to a netCDF file is:

1) Pick a physical variable of interest (like wind speed & direction, or sea surface temperature).

Each variable is defined within its own table (one record for each measurement). So there is a table which contains row_id, station_id, time, latitude, longitude, depth, wind_speed, wind_direction and associated metadata fields; another table which contains row_id, station_id, time, latitude, longitude, depth, sea_surface_temperature and associated metadata fields; and so on.

SQL allows us to join separate table data into views of combined data according to user query needs (so far we haven't needed to do this). Currently the separate variable tables are used to generate separate variable data maps, and these maps can be superimposed via the GIS.
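
As a sketch of what such a view could look like (table and column names are hypothetical, following the per-variable layout described above):

-- hypothetical: join two per-variable tables on station and time
-- to present wind and sea surface temperature side by side
CREATE VIEW wind_and_sst AS
SELECT w.station_id,
       w.measurement_date,
       w.wind_speed,
       w.wind_direction,
       s.sea_surface_temperature
FROM winds w
JOIN sst s ON s.station_id = w.station_id
          AND s.measurement_date = w.measurement_date;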

2) Define how the variable will be referenced in time and location. For SeaCOOS, everyone has decided to use the same time frame of seconds since 1970-01-01 (this can be a floating point number to indicate subsecond intervals). For location, SeaCOOS has developed a framework of datatypes (relating to the degrees of locational freedom of the measurement point(s)) which gives guidelines on how location should be defined in a netCDF file via dimensions and attributes.
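
As a quick illustration of the time convention in PostgreSQL (a minimal sketch; the epoch value is arbitrary):

-- convert the agreed seconds-since-1970 representation to a timestamp
SELECT TIMESTAMP WITH TIME ZONE 'epoch' + 1105758720 * INTERVAL '1 second';
-- returns 2005-01-15 03:12:00+00

-- and back again
SELECT EXTRACT(EPOCH FROM TIMESTAMP WITH TIME ZONE '2005-01-15 03:12:00+00');
-- returns 1105758720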

3) Considerations for display purposes - metadata fields are added which take into consideration, for each measurement:

  • whether it should be shown in the display
  • whether it can be normalized, given an agreed-upon normalization formula
  • its orientation as it relates to the agreed group locational framework
  • how data is interpolated and chosen for display

The above steps result in the SeaCOOS netCDF convention: a specific syntax of dimensions and attributes which can be programmatically parsed by Charlton's Perl script (the scout), which downloads the netCDF files via HTTP from data providers and populates the variable tables, or alerts the provider when there is a problem with the data being input.

Charlton then uses MapServer functions to convert the table data to a raster image which is displayed as a data layer in the GIS.

For each new physical in situ variable to be added, aside from deciding what the naming convention will be for the new variable, step 3 is the only step which should be repeated. Steps 1 & 2 are one-time initial group discussion/decision processes which are subject to periodic consideration and revision if need be.

Step 3 also tends to take the product (the GIS in this case) into consideration, whereas the work accomplished in steps 1 & 2 should be universally applicable for aggregation needs across a variety of products.

Remote Sensing

The process for remote sensing data using PNG files is:

The PNG files themselves contain locational metadata used for placement/interpretation of the raster image within the GIS. The filenames contain timestamp information, and this timestamp is used by Charlton's code to determine which image should be displayed for given temporal conditions. No underlying relational database support is needed for the initial display of the images within the GIS.

Relational database support is currently used to support raster query functions (the raster image is converted to a table lookup of RGB values which correspond to the variable measurement). This functionality should be temporary, as MapServer should better support raster query in the future.
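
A minimal sketch of that lookup (table and column names are hypothetical):

-- hypothetical lookup table mapping raster RGB values back to the
-- physical measurement they encode
CREATE TABLE sst_rgb_lookup
(red INT,
 green INT,
 blue INT,
 sst_celsius FLOAT);

-- raster query: recover the measurement behind a clicked pixel
SELECT sst_celsius
FROM sst_rgb_lookup
WHERE red = 128 AND green = 64 AND blue = 255;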

Biological

The process for biological data is still up for discussion. One way of thinking of datatypes is as point, raster and vector/coverage. I think we have working models for point (in situ) and raster (PNG, etc.) data, but for biological datasets my thinking would be to possibly add a time-dependent vector/coverage which defines a collection or population locational boundary. As a table representation, this would mean that each population/collection would have its own time-indexed table with a link to a vector/coverage subtable. This would keep time as a rigid searchable index and allow a population to be defined as a collection of points or a boundary of an area. Populations could still be described at a point or as part of a raster (image); the vector/coverage datatype would be more to accommodate population datatypes which don't fit well into our existing framework of point and raster datatypes.
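
A sketch of the table layout this suggests, reusing the AddGeometryColumn pattern from the pointForm section below (all names hypothetical):

-- one time-indexed table per population/collection
CREATE TABLE population_obs
(row_id SERIAL PRIMARY KEY,
 measurement_date TIMESTAMP with time zone NOT NULL,
 boundary_id INT,        -- link to the vector/coverage subtable
 population_count INT);

-- vector/coverage subtable holding the locational boundary
CREATE TABLE population_boundary
(boundary_id SERIAL PRIMARY KEY);

-- add the boundary polygon as a PostGIS geometry column
select AddGeometryColumn('<my_dbinstance>','population_boundary','the_geom',-1,'POLYGON',2);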

Data Dictionary

SEA-COOS listing SEACOOS_Data_SetDictionary.xls - includes standard names that are coordinate-related, time_date names, atmospheric data, physical oceanographic, chemical oceanographic, electromagnetic radiation, ADCP, sensor-related, platform-related, M_aeri data-related, and navigation. For each name, there is a definition, short name, long name, data type, data class, units, axis, valid range, parameter reference, instrument general, equivalent standard name, and standard. Very thorough, but only in spreadsheet form.

also see DataDictionary

Application Performance

By and large, the application response for this system is good. We are able to leverage

  • a Dell PowerEdge 2650 dual Xeon processor with 2 gigabytes of RAM
  • a spatially indexed relational database (PostgreSQL + PostGIS)
  • a relatively simple and straightforward table design which supports either in situ measurements or raster images (see the 'Data Structures' topic below)

and all software components are open source to boot.

New data is collected periodically from all of the remote distributed sites, but queries against the GIS application are run against the already collected data on the central server database.

Our overall architecture incorporates 3 separate servers: a webserver which alternately points to one of two database servers. The two database servers switch roles between 1) collecting and populating the database (write operations) and 2) servicing queries (read operations). This allows us to keep a steady query response time that is not impacted by data upload.

Update: We are currently working towards developing our setup on a dual AMD Opteron based system with 8 gigabytes of RAM, hoping to optimize whatever software we can (PostgreSQL has 64-bit optimizations, for example) and see how this compares to our other systems. We are trying the Gentoo Linux OS.

Data Structures / Canonical forms

The structure of temporal, geospatial data as it is stored in various formats should hopefully be capable of having its structural elements described in a handful of forms. Describing and labeling these forms (and deciding what should be abstracted away) are the beginning steps before automated programmatic conventions, labels and processing can be utilized in data transformation.

As an example, two predictable forms for storing buoy data are:

  • 'by station', where the tablename is that of the station and each row corresponds to all the variable readings for a given time measurement
  • 'by variable', where the tablename is that of the variable measured and each row corresponds to a measurement time, station id and possibly lat, long and depth describing the measurement point, plus the measurand value

Currently the GIS favors a 'by variable' approach, which corresponds to variable data layers. I appreciate this format for its conciseness and amenability to query and resultset packaging (the ability to mix and match variables which have a similar reference scheme on each variable table). Issues of varying temporal sampling resolutions across multiple stations are better handled in this form. We are developing programs to convert other table formats to this format.
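
A sketch of such a conversion, assuming a hypothetical 'by station' table frp2_by_station with one wide row per time (the station id, coordinates and depth are made-up values; the target is the pointForm 'sst' table described below):

-- pivot the sst readings out of a 'by station' table into 'by variable' form
INSERT INTO sst (station_id, measurement_date, sst, latitude, longitude, z)
SELECT 'FRP2', measurement_date, sst, 33.84, -78.48, -2.0
FROM frp2_by_station
WHERE sst IS NOT NULL;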

see also Going from Datalogger files to 'Point' form on a relational database

pointForm

(by Variable form, used with point and moving point data)

The following represents a basic table layout which might be implemented on a PostgreSQL database.

CREATE TABLE <my_table>
(row_id SERIAL PRIMARY KEY,
 row_entry_date TIMESTAMP with time zone,
 row_update_date TIMESTAMP with time zone,
 station_id VARCHAR(32) NOT NULL,
 sensor_id VARCHAR(32),
 measurement_date TIMESTAMP with time zone,
 <my_var> FLOAT,
 latitude FLOAT,
 longitude FLOAT,
 z FLOAT,
 qc_level INT,
 qc_flag VARCHAR(32));

<my_table> and <my_var> could be the same for simple variables like 'sst' ( <my_table> = 'sst', <my_var> = 'sst' ),
or different for tandemly collected variables like winds, where ( <my_table> = 'winds', <my_var_1> = 'speed', <my_var_2> = 'direction', <my_var_3> = 'gust' ).

row_id, row_entry_date, row_update_date - my preferences for internal table row references

station_id (or platform/container id) - a string label with value as foreign key reference

sensor_id - not necessarily required, but adds an additional degree of detail and flexibility at the sensor level

measurement_date - the time the variable was measured, with time zone reference

<my_var> - the value(s) of the measured variable(s)

latitude, longitude - decimal degree reference
z - depth or height reference, in meters

qc_level - integer value in [0,1,2,3,-9] corresponding to a high-level QC process
qc_flag - space-separated listing of QC flags which may apply to this measurement
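
For the tandem-variable case mentioned above, the 'winds' table would simply carry all three measurands in one row; a sketch otherwise identical to the generic layout:

CREATE TABLE winds
(row_id SERIAL PRIMARY KEY,
 row_entry_date TIMESTAMP with time zone,
 row_update_date TIMESTAMP with time zone,
 station_id VARCHAR(32) NOT NULL,
 sensor_id VARCHAR(32),
 measurement_date TIMESTAMP with time zone,
 speed FLOAT,
 direction FLOAT,
 gust FLOAT,
 latitude FLOAT,
 longitude FLOAT,
 z FLOAT,
 qc_level INT,
 qc_flag VARCHAR(32));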

Adding a general index (assuming a station-level reference to individual variables) to prevent duplicate measurements:

CREATE UNIQUE INDEX i_<my_table> ON <my_table> (station_id, measurement_date);
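
With this index in place, a repeated report for the same station and time is rejected at insert time rather than silently duplicated (hypothetical values, using an 'sst' instance of the layout):

-- the first report loads normally
INSERT INTO sst (station_id, measurement_date, sst)
VALUES ('CAP2', '2005-01-15 03:00:00+00', 12.4);
-- re-running the same INSERT then fails with a duplicate key
-- violation against the unique index i_sst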

Adding a PostGIS 2-dimensional geospatial reference column type (see this link on PostGIS-enabling a PostgreSQL database: http://www.carocoops.org/misc/phpBB2/viewtopic.php?t=229):

-- add the geometries
select AddGeometryColumn('<my_dbinstance>','<my_table>','the_geom',-1,'POINT',2);

-- add geometry index
create index <my_table>_gist on <my_table> using gist(the_geom);
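
The geometry column can then be populated from the latitude/longitude columns and used for spatially indexed queries; a sketch assuming the PostGIS function names of that era (newer PostGIS versions use the ST_-prefixed equivalents):

-- populate the geometry from the latitude/longitude columns
update <my_table>
set the_geom = GeomFromText('POINT(' || longitude || ' ' || latitude || ')', -1);

-- bounding-box query over the southeast coast, using the GiST index
select station_id, measurement_date, <my_var>
from <my_table>
where the_geom && GeomFromText('POLYGON((-82 25,-82 37,-74 37,-74 25,-82 25))', -1);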

Here are database descriptions of the wind and sst tables which Charlton currently utilizes. They match the structure listed above, with additional fields added for

  • GIS display purposes (for example, should this be displayed in the GIS or not)
  • unit converted fields (for example, celsius to fahrenheit conversions)
  • frame of reference (for example, does a positive z value correspond to up or down)
  • contact or standards metadata (for example, provider URL, standard XML fields for WMS, WFS or other standards metadata)

We make efforts to keep from 'normalizing' the table into subtables, preferring a single-table approach with redundancy in certain fields. Since the storage needs are low to begin with, this means we can keep things conceptually and operationally simple. Table performance can be further optimized by partitioning, and by familiarity with and use of the VACUUM, COPY and CLUSTER commands and other indexing schemes which can be applied similarly across these repeated table structures.
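
For example (a sketch; COPY's CSV mode requires PostgreSQL 8.0 or later, and newer releases spell the last command CLUSTER <table> USING <index>):

-- bulk load a batch of rows far faster than row-at-a-time INSERTs
COPY sst (station_id, measurement_date, sst, latitude, longitude, z)
FROM '/tmp/sst_upload.csv' WITH CSV;

-- reclaim dead rows and refresh planner statistics
VACUUM ANALYZE sst;

-- physically reorder the table along its station/time index
CLUSTER i_sst ON sst;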

sea_coos_obs=# \d wind_prod
                                      Table "public.wind_prod"
           Column            |            Type             |               Modifiers                
-----------------------------+-----------------------------+----------------------------------------
 station_id                  | character varying           | not null
 time_stamp                  | timestamp without time zone | not null
 z                           | double precision            | not null
 wind_speed                  | double precision            | 
 wind_from_direction         | double precision            | 
 label_theta                 | double precision            | 
 label_char                  | character(1)                | 
 lon                         | double precision            | 
 lat                         | double precision            | 
 title                       | character varying           | 
 institution                 | character varying           | 
 institution_url             | character varying           | 
 institution_dods_url        | character varying           | 
 source                      | character varying           | 
 refs                        | character varying           | 
 contact                     | character varying           | 
 report_time_stamp           | timestamp without time zone | 
 show                        | integer                     | 
 secs_from_report_time_stamp | integer                     | 
 secs_from_time_stamp        | integer                     | 
 the_geom                    | geometry                    | not null
 wind_speed_knots            | double precision            | 
 can_be_normalized           | character varying           | 
 normalized_wind_speed       | double precision            | 
 normalized_wind_speed_knots | double precision            | 
 normalized_label_char       | character(1)                | 
 positive                    | character varying           | 
 label_z                     | double precision            | 
 wind_from_direction_compass | character varying           | 
 wind_speed_mph              | double precision            | 
 normalized_wind_speed_mph   | double precision            | 
 label_char_knots            | character(1)                | 
 label_char_mph              | character(1)                | 
 normalized_label_char_knots | character(1)                | 
 normalized_label_char_mph   | character(1)                | 
 value                       | character varying           | 
 value_knots                 | character varying           | 
 value_mph                   | character varying           | 
 normalized_value            | character varying           | 
 normalized_value_knots      | character varying           | 
 normalized_value_mph        | character varying           | 
 seq                         | integer                     | default nextval('wind_prod_seq'::text)
Indexes:
    "pk_id_time_stamp_z" primary key, btree (station_id, time_stamp, z)
    "wind_prod__gist" gist (the_geom)
    "wind_prod__id_z_rep_time" btree (station_id, z, report_time_stamp)
    "wind_prod__oid" btree (oid)
    "wind_prod__report_time_stamp" btree (report_time_stamp)
    "wind_prod__seq" btree (seq)
    "wind_prod__show" btree ("show")
    "wind_prod__station_id" btree (station_id)
    "wind_prod__time_stamp" btree (time_stamp)
Check constraints:
    "$1" CHECK (srid(the_geom) = -1)
    "$2" CHECK (geometrytype(the_geom) = 'POINT'::text OR the_geom IS NULL)
    "can_be_normalized_yes_no" CHECK (can_be_normalized::text = 'yes'::character varying::text OR can_be_normalized::text = 'no'::character va
rying::text OR can_be_normalized::text = ''::character varying::text)
    "positive_up_down" CHECK (positive::text = 'up'::character varying::text OR positive::text = 'down'::character varying::text OR positive::
text = ''::character varying::text)
Triggers:
    wind_labels_and_times_trig BEFORE INSERT OR UPDATE ON wind_prod FOR EACH ROW EXECUTE PROCEDURE wind_labels_and_times()

sea_coos_obs=# \d sst_prod
                                      Table "public.sst_prod"
            Column            |            Type             |               Modifiers               
------------------------------+-----------------------------+---------------------------------------
 station_id                   | character varying           | not null
 time_stamp                   | timestamp without time zone | not null
 z                            | double precision            | 
 temperature_celcius          | double precision            | 
 lon                          | double precision            | 
 lat                          | double precision            | 
 title                        | character varying           | 
 institution                  | character varying           | 
 institution_url              | character varying           | 
 institution_dods_url         | character varying           | 
 source                       | character varying           | 
 refs                         | character varying           | 
 contact                      | character varying           | 
 report_time_stamp            | timestamp without time zone | 
 show                         | integer                     | 
 secs_from_report_time_stamp  | integer                     | 
 secs_from_time_stamp         | integer                     | 
 the_geom                     | geometry                    | not null
 positive                     | character varying           | 
 label_z                      | double precision            | 
 temperature_fahrenheit       | double precision            | 
 value_temperature_celcius    | character varying           | 
 value_temperature_fahrenheit | character varying           | 
 seq                          | integer                     | default nextval('sst_prod_seq'::text)
Indexes:
    "pk_id_time_stamp" primary key, btree (station_id, time_stamp)
    "sst_prod__gist" gist (the_geom)
    "sst_prod__id_z_rep_time" btree (station_id, z, report_time_stamp)
    "sst_prod__oid" btree (oid)
    "sst_prod__report_time_stamp" btree (report_time_stamp)
    "sst_prod__seq" btree (seq)
    "sst_prod__show" btree ("show")
    "sst_prod__station_id" btree (station_id)
    "sst_prod__time_stamp" btree (time_stamp)
Check constraints:
    "$1" CHECK (srid(the_geom) = -1)
    "$2" CHECK (geometrytype(the_geom) = 'POINT'::text OR the_geom IS NULL)
    "positive_up_down" CHECK (positive::text = 'up'::character varying::text OR positive::text = 'down'::character varying::text OR positive::
text = ''::character varying::text)
Triggers:
    sst_labels_and_times_trig BEFORE INSERT OR UPDATE ON sst_prod FOR EACH ROW EXECUTE PROCEDURE sst_labels_and_times()

collection of points

biology – population tracking

to be discussed...

collection of particle trajectories

We've started to think about creating a netCDF format for particle trajectories in SeaCOOS where the number of particles is not consistent. Here's my discussion on the matter. I can also provide on request some example quick-and-dirty Perl code which I use to transform existing gridded ASCII model data to a netCDF format.

-- JeremyCothran - 01 Jul 2004

The reasons which come to mind for using CF-1.0/SeaCOOS-compliant netCDF instead of ASCII are as follows:

- standard time and spatial dimension references and conversion libraries
- existing applications such as ncBrowse which allow visualization of the netCDF file online directly
- existing applications such as the DODS netCDF servers which allow query and subsetting of the netCDF file online directly
- data validation in the conversion to the format itself
- compressed storage (binary over text) - not really an issue currently, but potentially one as archiving is addressed
- whatever else is listed at Unidata (the keepers of netCDF) http://my.unidata.ucar.edu/content/software/netcdf/guidec/guidec-6.html#HEADING6-0

For a particle format where time and the number of particles can vary, I would suggest something along the lines of the following format:

row_id - relational database only
time
particle_id
lat
long
depth - optional
various metadata fields related to this particle and its display

If the number of particles were the same throughout the fileset, then you could treat each particle as its own dimension and maintain a regular array - but since the number of particles is changing, all particles are folded into the 'particle_id' field, which is regular while ranging over all the possible particles.
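
In relational terms the same layout might look like the following sketch (names hypothetical):

CREATE TABLE particle_track
(row_id SERIAL PRIMARY KEY,  -- relational database only
 measurement_date TIMESTAMP with time zone NOT NULL,
 particle_id INT NOT NULL,   -- folds the varying particle count into one regular field
 latitude FLOAT,
 longitude FLOAT,
 z FLOAT);                   -- depth, optional

-- one row per particle per time step; duplicate reports rejected
CREATE UNIQUE INDEX i_particle_track
ON particle_track (particle_id, measurement_date);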

We can always supply simple conversion programs/tools as well to facilitate model ASCII output to a chosen netCDF format.

-- JeremyCothran - 01 Jul 2004

spaceForm

(raster form, used with remotely sensed raster .PNG files derived from HDF)

row_id (sequence_id)
entry_date
update_date

measurement_date
binary file (.PNG raster file for example)

static world file

  • station_id (or platform, container id – a string label with value as foreign key reference)
  • bounding latitude
  • bounding longitude
  • bounding depth
  • variable measured (related variables as well)

  • Associated metadata for display, contact…
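
A relational sketch of this spaceForm layout (hypothetical names; the raster itself is referenced by file path here, though it could equally be stored in a binary column):

CREATE TABLE space_form
(row_id SERIAL PRIMARY KEY,
 row_entry_date TIMESTAMP with time zone,
 row_update_date TIMESTAMP with time zone,
 station_id VARCHAR(32),         -- platform/container id, foreign key reference
 measurement_date TIMESTAMP with time zone,
 image_path VARCHAR(256),        -- the .PNG raster file
 min_latitude FLOAT,             -- bounding box from the static world file
 max_latitude FLOAT,
 min_longitude FLOAT,
 max_longitude FLOAT,
 min_z FLOAT,                    -- bounding depth
 max_z FLOAT,
 variable VARCHAR(32));          -- variable measured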

collection of spaces

biology - species collection

to be discussed...

Future directions

  • tools for generating netCDF, XML, CSV from relational databases in convention forms
  • by_variable/by_data_layer time sequenced data as 'atomic' form and programmatic expectations for modeling and visualization/analysis
  • breaking things into modular, pipelinable web services

Related posts:

Technology Soup

Charlton's SeaCOOS and Mapserver presentations (August 2004)

-- JeremyCothran - 15 Jan 2005