As Mikelangelo focuses on performance issues in Cloud and HPC technology stacks, traceability of our experiments is of significant importance. We need to enable both ourselves, in the later stages of the project, and others to reproduce our experiments in a fully controlled way. To this end, the Mikelangelo consortium approached the Open Data Pilot as something that is very practical to have, encourages commitment within the project and helps us get our ducks in a row.
So, to get practical: the EC Guideline requires the following five attributes for each dataset:
- name
- description
- standards used
- sharing
- archiving and preservation
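As a rough illustration, the five attributes can be thought of as a small record with a completeness check. This is only a sketch under our own assumptions: the class name, field names and the example dataset name are made up for illustration and are not part of the EC Guideline or any tool.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical record mirroring the five required attributes."""
    name: str = ""
    description: str = ""
    standards_used: str = ""
    sharing: str = ""
    archiving_and_preservation: str = ""


def missing_attributes(record: DatasetRecord) -> List[str]:
    """Return the names of attributes that are empty or whitespace-only."""
    return [f.name for f in fields(record)
            if not getattr(record, f.name).strip()]


# Example: a dataset for which only two attributes have been filled in so far.
record = DatasetRecord(name="example-benchmark-run",
                       description="Raw results of one benchmark run")
print(missing_attributes(record))
# prints ['standards_used', 'sharing', 'archiving_and_preservation']
```

A check like this is handy as a reminder before publishing, but it says nothing about whether the answers themselves are any good.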
Filling in these five attributes may be a rather simple task, but the good folks at the Digital Curation Centre (DCC) still provide a very good online tool that lets you go through all of your datasets and makes sure you provide all the required answers, while offering very useful help (for example, they maintain a repository of standards used in various fields, easily linked from the online tool).
Another thing many people overlook is the contractual obligation to actually deposit the data in a research data repository (Article 29.3 of the Grant Agreement). In practice, this means going for a well-known online repository that takes the data, papers, etc. and handles the storage and sharing for us. Choosing one, however, requires some thought about what you need for publishing your data. Some of these repositories originate from the more standards-oriented areas of research and may make us adopt standards we’re not particularly fluent in; one such example is the OAI-PMH harvesting protocol.
Enter Zenodo.org, an OpenAIRE-funded repository, which overcomes this obstacle in an elegant way. Whatever we upload and sufficiently describe (according to the Data Management Plan created with the DCC tool) is then offered to OpenAIRE, which solves a lot of practical issues on our/your side.
Finally, there’s one last quirk. For scientific results to be repeatable, every dataset used for published results must have a permanent link. Thankfully, Zenodo assigns a Digital Object Identifier (DOI) to each public upload. This, in turn, brings up two smaller issues:
- datasets must be uploaded before the paper is published, so that authors can properly cite the data used
- changing a dataset becomes a cumbersome task (changing a DOI is not trivial).
To get around these two issues, we’ve decided, and elaborated in the Data Management Plan, to employ a local data staging mechanism: our Mikelangelo OwnCloud instance hosts datasets deemed not stable enough, and only after the paper is published are the paper and the final dataset pushed to Zenodo.org.
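The staging rule can be written down as a tiny decision function. This is a minimal sketch and not our actual tooling: the `Dataset` class, its flags and the location labels are hypothetical, and in practice the decision is made by people rather than by a script.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a dataset moving through staging."""
    name: str
    is_final: bool          # assumption: someone marks the dataset as stable
    paper_published: bool   # assumption: tracked manually per dataset


def hosting_location(dataset: Dataset) -> str:
    """Return where the dataset should live under the staging rule."""
    if dataset.is_final and dataset.paper_published:
        return "zenodo"    # public upload with a DOI, should no longer change
    return "owncloud"      # staging area, the dataset may still change


draft = Dataset("example-dataset", is_final=False, paper_published=False)
print(hosting_location(draft))  # prints owncloud
```

Keeping unstable data in staging means a DOI is only ever attached to a dataset we do not expect to touch again.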
How do you intend to contribute to the Open Data Pilot? Feel free to share in the comments.