How should we publish our Open Data?

Now that the MELODIES project is past the half-way stage, and a number of the services are starting to produce new and valuable open datasets, we are thinking seriously about how we should publish these to the wider community. We are aiming to produce 5-star Linked Open Data as best we can, which means that we will publish in open, machine-readable formats using standards like RDF, using appropriate open licences and linking to other relevant datasets. We will also be registering data in the GEOSS DataCORE.

But this still leaves many questions open. Here they are, together with some of our current thoughts. 

  1. Where should we host data that we publish? We'd like a long-term home for the data and, ideally, the site in question would support web service access to the data, e.g. via SPARQL. In some cases, the solution is already available to us (e.g. output from WP3 can be hosted in the NERC data centre at the Centre for Ecology and Hydrology), but in other cases the solution is less clear. Suggestions (used previously by the partners in projects such as TELEIOS) include datahub.io, but I don't think we yet know what our best option is.
  1. What vocabularies should we use to describe our data? There are a number of RDF vocabularies that are able to describe datasets, each aiming at a different community. The interesting thing about RDF is that we can potentially use all of them at the same time! We are looking at:
  1. How do we make data discoverable by both specialist and mass-market search engines? To solve this we need answers to both (1) and (2), and we will also need to register our data with discovery services such as GEOSS.
  1. What licence(s) should we use? I'm not an expert, but it seems that Creative Commons licences (version 4.0 or later) will do the job for us. An article from the Open Data Institute describes nicely how CC has been adapted for sharing data and databases.
  1. How do we handle data that is not easily described in RDF (such as raster data)? I think the only answer to this is to describe the metadata in RDF and point to means of distribution (e.g. FTP, OPeNDAP, Web Coverage Service) that offer access to the data in more efficient forms. The RDF Data Cube Vocabulary could theoretically describe the data, but is not likely to be efficient enough for practical use with large datasets.
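
To make the last point a little more concrete, here is a rough sketch of what "metadata in RDF, data elsewhere" might look like. It assumes we use the W3C DCAT vocabulary together with Dublin Core terms, which is only one of several possible choices, and every URL under example.org, the dataset title and the helper names (`Dataset`, `Distribution`, `to_turtle`) are made up for illustration. It uses plain string formatting rather than an RDF library to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import List

# Namespace prefixes for DCAT and Dublin Core terms (both real vocabularies).
PREFIXES = """@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
"""


@dataclass
class Distribution:
    """One way of getting at the data, e.g. an OPeNDAP or WCS endpoint."""
    access_url: str
    label: str


@dataclass
class Dataset:
    """Hypothetical minimal dataset record; real metadata would hold more."""
    uri: str
    title: str
    licence: str
    distributions: List[Distribution] = field(default_factory=list)


def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_turtle(ds: Dataset) -> str:
    """Serialise the dataset metadata as Turtle; the data itself stays put."""
    lines = [PREFIXES]
    lines.append(f"<{ds.uri}> a dcat:Dataset ;")
    lines.append(f'    dct:title "{escape_literal(ds.title)}" ;')
    lines.append(f"    dct:license <{ds.licence}>")
    for dist in ds.distributions:
        lines[-1] += " ;"  # more predicates follow for the same subject
        lines.append("    dcat:distribution [")
        lines.append("        a dcat:Distribution ;")
        lines.append(f'        dct:title "{escape_literal(dist.label)}" ;')
        lines.append(f"        dcat:accessURL <{dist.access_url}>")
        lines.append("    ]")
    lines[-1] += " ."  # close the statement
    return "\n".join(lines) + "\n"


# Illustrative values only: none of these URLs point at real services.
example = Dataset(
    uri="http://example.org/dataset/land-cover",
    title="Example land cover raster",
    licence="https://creativecommons.org/licenses/by/4.0/",
    distributions=[
        Distribution("http://example.org/opendap/land-cover", "OPeNDAP"),
        Distribution("http://example.org/wcs?coverage=land-cover", "Web Coverage Service"),
    ],
)
print(to_turtle(example))
```

The raster itself never enters the triple store: the RDF carries only the title, the licence and pointers to the services that can deliver the data efficiently.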

Have you tackled any of these problems in your domain? Your feedback in the comments is very welcome!
