Data provenance

Data processes convert input data to output data. They include simulations as well as data conversion, evaluation, aggregation, and visualisation. They should be atomic, i.e. not consist of subprocesses, but this is not a requirement.

In SciMesh, data processes are of class “Process” just like experimental processes. You can chain them with “cause” relations, meaning that a data process has as its potential input all the data produced by its predecessors.

Bulk data

By “bulk data”, we mean an opaque octet stream of data at a specific URL. For bulk data to be referenced in a SciMesh graph, the response from the web server should include a correct content type. Moreover, the URL must contain the checksum of the data. If the URL scheme itself does not provide this (IPFS URLs, for example, do), the URL fragment (the part after the “#”) must contain a hash using Multiformats. In particular, the format is:

<base>base(<version><multihash>)

In other words, the <version> byte followed by the binary <multihash> is encoded by the function “base()” (e.g. base32), and the character <base> denoting that function (“b” in the case of base32) is prepended. <version> is always the byte 0x01.
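The format above can be illustrated with a minimal Python sketch. It assumes SHA-256 as the hash function (multihash code 0x12) and base32 as the encoding (multibase prefix “b”, lower case, without padding); SciMesh does not prescribe these choices here. The function names and the example URL are hypothetical.

```python
import base64
import hashlib
from urllib.parse import urldefrag


def multihash_fragment(data: bytes) -> str:
    """Build <base>base(<version><multihash>) for the given octets.

    Assumptions: SHA-256 as hash function, base32 as encoding.
    """
    digest = hashlib.sha256(data).digest()
    # Multihash layout: <hash function code><digest length><digest>.
    # 0x12 is the multihash code for sha2-256; the length is 32 (0x20).
    multihash = bytes([0x12, len(digest)]) + digest
    # <version> is always the byte 0x01.
    payload = b"\x01" + multihash
    # Multibase "b" means RFC 4648 base32, lower case, no padding.
    encoded = base64.b32encode(payload).decode("ascii").lower().rstrip("=")
    return "b" + encoded


def matches_fragment(data: bytes, url: str) -> bool:
    """Check downloaded octets against the checksum in the URL fragment."""
    _, fragment = urldefrag(url)
    return fragment == multihash_fragment(data)


if __name__ == "__main__":
    content = b"hello"
    # Hypothetical URL, for illustration only.
    url = "https://example.org/data.bin#" + multihash_fragment(content)
    print(url)
    print(matches_fragment(content, url))
```

With these assumptions, every fragment starts with “baejc”, because the first three bytes (0x01, 0x12, 0x20) are the same for all SHA-256 checksums.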

Data input

To see the exact data that is used, you have to take a deeper look into the process (e.g. by inspecting the inputs manifest in the processing program). In SciMesh, URLs to bulk input data are not explicit. (Of course, you can make them explicit with your own vocabulary.) Analogously to physical samples, the input is the whole graph of processes (and in particular, their data outputs) that led to this process.

While the program that does the data processing can technically download any input data, a valid SciMesh graph makes sure that all of it is the output of a preceding process. Violating this is like omitting sample-influencing parameters from a physical process.

In some cases, this may mean that you have to create a preceding process just to connect it with bulk output data URLs. This is perfectly fine.
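The rule that all input must be the output of a preceding process can be sketched as a small check. This is only an illustration under simplifying assumptions: the graph is reduced to two plain dictionaries instead of RDF triples, “cause” relations are stored as a mapping from each process to its direct predecessors, and all process names, URLs, and function names are hypothetical.

```python
def potential_inputs(process, causes, outputs):
    """Collect the bulk data URLs produced by all predecessors of a process.

    causes:  maps a process to the set of its direct predecessors
    outputs: maps a process to the set of its bulk output data URLs
    """
    urls = set()
    seen = set()
    stack = list(causes.get(process, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        urls |= outputs.get(current, set())
        # Follow the chain further back to indirect predecessors.
        stack.extend(causes.get(current, ()))
    return urls


def undeclared_inputs(process, used_urls, causes, outputs):
    """Return the URLs a program uses that no preceding process produced."""
    return set(used_urls) - potential_inputs(process, causes, outputs)


# Hypothetical example graph; checksum fragments are omitted for brevity.
causes = {
    "evaluate": {"simulate", "convert"},
    "convert": {"import_raw"},
}
outputs = {
    "simulate": {"https://example.org/sim.h5"},
    "convert": {"https://example.org/converted.parquet"},
    "import_raw": {"https://example.org/raw.csv"},
}

if __name__ == "__main__":
    used = {"https://example.org/sim.h5", "https://example.org/extra.csv"}
    print(undeclared_inputs("evaluate", used, causes, outputs))
```

In this sketch, “import_raw” plays the role of a process that exists only to attach bulk output data URLs to the graph. The URL “https://example.org/extra.csv” is reported because no preceding process produces it.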

Data output

Any data output is represented by URIs that resolve to URLs from which the data can be retrieved. These URIs are connected with the process using custom vocabulary (as is the case with measurement data for experimental processes). The process must be the subject of such triples.

_images/bulk_data.svg

Fig. 12 Representation of bulk output data in SciMesh. Here, “sm” is the namespace “http://schema.org/”.

Fig. 12 shows