SciMesh at a glance
Introduction
SciMesh represents scientific insights as a knowledge graph. In its current realisation, it focuses on sample-based workflows, in which samples (physical ones or data artefacts) undergo a sequence of processing and measurement steps. However, it is not limited to that. Most generally, it represents scientific insight by declaring a relationship between cause and effect. In other words, if certain prerequisites are set, then certain observations will be made.
Fig. 1 shows a graph illustrating this line of thought. Starting from an initial state called “nil”, processes change this state. There is a monotonic increase in time from left to right. Consequently, this time axis also is inherent in a chain of processes. A process may be “take silicon substrate out of rack”, “heat the sample”, “mix two substances together”, or “wait for solar eclipse”. It is really that general. Every process has got one or more causes. If it is a single one, it may be the initial state. The initial state is totally void. It bears no information at all, which is why the chain of processes must define the states in-between as completely as necessary to be useful for scientific conclusions. Conversely, one process may be the cause of multiple others. This way, chains can be branched off, possibly by different scientists years after the work on the main trunk.
Fig. 2 shows how the essence of scientific work, the relation of effects to causes, is represented in the graph: The processes up to a certain point are the cause, and the observations at that point are the effect. Therefore, it is valid to call that point an “insight”, which is a graph node of its own. Two things are important here. First, the process to which the observational data is attached is a measurement, and quite frequently, a measurement does not change state. Still, it is a process proper, with all bells and whistles. We think that it is not necessary to distinguish between measurements and processes which actually change something. Besides, a measurement does change state, albeit the change may not be significant. And secondly, a certain graph of processes might have many insights pointing to it. In particular, the processes following a measurement may lead to a second measurement with new results, meaning a new insight.
Speaking of insights, Fig. 3 shows the relationships of things that may point to a certain state (a.k.a. process) in a graph. All of them are insights but if it is about a chain of concrete processes at certain points in time, this should be labelled as an experiment. If all of this happened to the same sample, the experiment may be identified with this sample.
If the processes are not concrete but generic blueprints of processes, the experiment is actually a recipe for experiments. Or, by bringing cause and effect together, it is a scientific hypothesis.
To illustrate experiment versus hypothesis, consider the following two examples, respectively:
On 21 October 2019, John threw a ball from the Eiffel Tower, and it moved downwards.
A body is released in a gravitational field, and it moves according to the force field.
Current scope of SciMesh
The currently specified RDF data model of SciMesh is narrower than the very general concept outlined above. The reason is simple: our work is in its early stages, and we have to focus on specific domains of research in order to avoid frittering away our resources. Since we work in task area “Caden” of NFDI4Ing, the domain of research of choice is sample-based workflows.
We hope to be able to give guidelines for how to extend SciMesh to other domains, eventually.
Data model overview
The RDF data model is built around two concepts: sample and process. Both are meant in their broadest senses. A sample may represent a state of a physical specimen as well as a data set. A process may create a new sample state (i.e., change the sample, for example an etching process), or create new data (e.g. a measurement on a sample), or both.
Fig. 4 gives an overview of the anatomy of a knowledge graph in SciMesh. It is a very simple graph yet contains most of the basic concepts. At its heart, there is a sequence of processes (at the bottom) that work on the sample. The first process, “substrate”, creates the sample (the sample starts its life as a bare substrate). Its RDF properties determine the basic physical properties of the sample (e.g. material and size). Then, further layers of material are deposited on the substrate in the deposition process. Common properties of most processes are name, method, timestamp, operator, and comments.
Note that three different namespace domains are involved:
The domain of the ELN instance: the processes, the process types, the sample and its intermediate instances. This is in blue.
The domain of the ELN software: the sample type. This is in green.
External domains like BFO/OBO and RDF. This is in red.
Note
Some may have noticed a striking resemblance to Git’s data model. Indeed, a very close mapping between the two is possible. The processes are commits, and the sample is a branch. You may split a sample in half, creating a new branch. At the same time, one process may have more than one cause. This way, pouring two chemical substances together corresponds to merging in Git.
SciMesh vocabulary
Default namespace: http://scimesh.org/SciMesh/
- Process (class)
- Insight (class)
- Experiment (class)
A specialised insight (see Fig. 3) which represents a concrete experiment. It points to a graph of equally concrete processes, i.e. processes that happened at a certain place and point in time, with certain ingredients and result state and/or observational results. Normally, all such processes bear a timestamp.
- Hypothesis (class)
A specialised insight (see Fig. 3) which represents a general hypothesis, in contrast to a concrete experiment. It points to a graph of non-concrete processes which represent a sequence of generic state changes and observations.
- Recipe (class)
A specialised insight (see Fig. 3) which represents a general recipe, in contrast to a concrete experiment. It points to a graph of non-concrete processes which represent a sequence of generic state changes.
- Sample (class)
A sample (see also Fig. 3), which may be a physical specimen as well as a simulation result artefact. It is represented by the process it points to, and all of its ancestors. This process chain may also contain processes after the sample. Those are not part of the sample.
- Concurrent (class)
This is a concurrent process, a subclass of Process. This means that in absence of schemas, you have to tag an entity as a Concurrent by tagging it as both Concurrent and Process. A concurrent process does not represent a well-defined state. Therefore, you should not refer to its URI from elsewhere for communicating something that others can or should reproduce. It can only be used as a container for further process data for another process that points to it with a “cause” relation. See section “Concurrency” for further information.
- cause (property)
This contains a cause of the subject in the object. Both subject and object are of type “Process”.
The object may also be rdf:nil. In this case, it must be the only cause apart from concurrents, and it documents that no causes of this process are known or ever will be known.
- state (property)
This connects an insight to a process of its process chain, in particular, the latest process. If the insight is a sample, the latest process represents the current state of the sample.
However, there may be more than one state of an insight. Multiple states help to cope with gaps in the process graph (e.g. because external servers are down).
- operator (property)
The operator of a process. The object may be a string with the person’s name or email address, or the URI of the operator.
- timestamp (property)
The point in time when the process was done. The object of this triple is a blank node which must have at least the property time:inXSDDateTimeStamp. The time namespace stems from https://www.w3.org/TR/owl-time/. Concretely, in Turtle, it may look like this:

    <http://inm.example.com/5-chamber_depositions/14S-005> a sm:Process, …
        sm:timestamp [ time:inXSDDateTimeStamp "2014-10-02T14:10:00+00:00"^^xsd:dateTime ] .
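The lexical form of the timestamp literal in the example above can be produced with Python's standard library. The following is only a sketch: the helper name `scimesh_timestamp` is hypothetical and not part of SciMesh or JuliaBase, and it assumes that timestamps should always carry an explicit UTC offset, as the one in the example does.

```python
from datetime import datetime, timezone


def scimesh_timestamp(moment: datetime) -> str:
    """Return the lexical form for the time:inXSDDateTimeStamp literal.

    Hypothetical helper. Naive datetimes are rejected because this sketch
    assumes that every timestamp carries an explicit UTC offset.
    """
    if moment.tzinfo is None or moment.utcoffset() is None:
        raise ValueError("timestamp needs an explicit time zone")
    return moment.isoformat(timespec="seconds")


stamp = scimesh_timestamp(datetime(2014, 10, 2, 14, 10, tzinfo=timezone.utc))
print(stamp)  # 2014-10-02T14:10:00+00:00
```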
External vocabulary used in SciMesh
- http://www.w3.org/2000/01/rdf-schema#label (property)
The name of a process or sample instance. It should be human-readable.
Detailed description of the data model
The core
The basic component of SciMesh is the Process. It describes some action that leads to a new state of a certain specimen. This new state may or may not differ from the state before. For example, if the specimen hasn’t actually been changed, it is probably a non-invasive measurement process with data output. In either case, each process has a distinct URI. This URI also represents the state that the process generates.
Processes are connected to all of their parameters, data outputs, measurement devices, methods, operators, timestamps etc. The process must be the subject of such triples. [1]
Moreover, processes are chained. Every “Process” points with “cause” relationships to its immediate cause processes, of which the current process is an effect. This reflects the cause–effect relationships of nature.
At some point in the past, no further causes are known. There are some, of course, because only the creation of the universe has no cause. (Probably.) But there are no known causes, and if it is certain that they will never be known, one can make this explicit in SciMesh by adding a “cause” relationship to rdf:nil.
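The vocabulary requires that rdf:nil be the only cause of a process apart from concurrents. A validator for this rule might look like the following sketch. The plain dictionary representation and all `ex:` names are assumptions made for illustration; a real implementation would query an RDF store instead.

```python
RDF_NIL = "rdf:nil"


def nil_rule_violations(causes, concurrents):
    """Return the processes that break the rdf:nil rule.

    causes maps a process to the set of its "cause" objects;
    concurrents is the set of entities typed as Concurrent.
    """
    offenders = []
    for process, objects in causes.items():
        if RDF_NIL not in objects:
            continue
        # Besides rdf:nil, only Concurrents are allowed as causes.
        others = objects - {RDF_NIL} - concurrents
        if others:
            offenders.append(process)
    return sorted(offenders)


# Hypothetical example data
causes = {
    "ex:substrate": {RDF_NIL, "ex:vacuum"},    # fine: the vacuum is a Concurrent
    "ex:deposition": {"ex:substrate"},         # fine: ordinary cause
    "ex:broken": {RDF_NIL, "ex:deposition"},   # violates the rule
}
violations = nil_rule_violations(causes, concurrents={"ex:vacuum"})
print(violations)  # ['ex:broken']
```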
At the other end of the chain, i.e. in the present, there may be a sample pointing with a “state” relationship to the current state of the sample, which is the latest process that happened to the sample. This way, the current state as well as its complete provenance is documented in detail.
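Reading the provenance of a sample then amounts to following the “state” relation once and the “cause” relations repeatedly. The sketch below shows this with plain dictionaries; the `ex:` names and the function `provenance` are hypothetical, and an RDF library would normally take the place of the dictionaries.

```python
def provenance(state, causes):
    """Collect a state (process URI) and all of its ancestors.

    causes maps each process to the list of its "cause" objects.
    """
    seen = []
    stack = [state]
    while stack:
        process = stack.pop()
        # rdf:nil marks the end of the known history.
        if process == "rdf:nil" or process in seen:
            continue
        seen.append(process)
        stack.extend(causes.get(process, []))
    return seen


# Hypothetical example data
causes = {
    "ex:substrate": ["rdf:nil"],
    "ex:deposition": ["ex:substrate"],
    "ex:measurement": ["ex:deposition"],
}
sample_state = {"ex:sample-1": "ex:measurement"}  # the "state" relation

history = provenance(sample_state["ex:sample-1"], causes)
print(history)  # ['ex:measurement', 'ex:deposition', 'ex:substrate']
```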
This data model is simple. Too simple, in fact. Therefore, we introduce an additional concept: concurrent processes.
Concurrency
So far, we’ve had sequential processes. Each process has a timespan which does not overlap with that of any other process. While this reflects the course of time perfectly, it is flawed for purely practical reasons: Sometimes, we need to document actions which work on a specimen at the same time in different processes. For instance, if some processes take place in the same vacuum, the vacuum should be its own process running in parallel to the things happening in it. Then, vacuum parameters can be connected with that process and need not be replicated.
This is expressed with a “cause” relation to a subclass of Process: “Concurrent”. This is a process the running time of which overlaps with all of its effect processes. In the following, \(P\) denotes a Process proper (i.e., not a Concurrent), \(C\) a Concurrent, and \(X\) a Process or Concurrent. Further, \(T(X)\) is the timespan of \(X\), i.e. the set of points in time at which \(X\) has effect. \((s, p, o)\) is an RDF triple consisting of subject, predicate, object. Then:
If the cause is a Concurrent, SciMesh makes no stronger assertion! You may consider \(C\) a super- or a subprocess of \(X\), and \(C\) may start before, with, or during \(X\). It may even be that only a subset of \(T(C)\) is known, as long as the condition (1) holds.
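If the overlap requirement for a Concurrent is read as “the timespans share at least one point in time”, a consistency check could look like the following sketch. This reading, the interval representation and the function names are assumptions for illustration. SciMesh itself does not prescribe them.

```python
def overlaps(span_a, span_b):
    """True if two closed intervals (start, end) share at least one instant."""
    return span_a[0] <= span_b[1] and span_b[0] <= span_a[1]


def concurrent_is_consistent(concurrent_span, effect_spans):
    """Check that a Concurrent overlaps with every process it is a cause of."""
    return all(overlaps(concurrent_span, span) for span in effect_spans)


# Hypothetical timespans in arbitrary time units
vacuum = (0, 100)                    # the known part of T(C)
depositions = [(10, 20), (30, 45)]   # T(X) of the effect processes

print(concurrent_is_consistent(vacuum, depositions))    # True
print(concurrent_is_consistent((50, 60), depositions))  # False
```

Since only a subset of \(T(C)\) may be known, a failed check on partial data need not mean that the graph is wrong.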
A Concurrent does not represent a well-defined state. Thus, do not put a reference to a Concurrent anywhere if you want people to reproduce it. They couldn’t. Instead, cite the URI of the closest following (in time) Process proper.
The reason why a Concurrent is not a well-defined state is that it holds only partial aspects of the subsequent Process(es) proper. While its influence on those Processes is well-defined and leads to states, the Concurrent alone does not. For instance, a vacuum is a well-defined ambient condition for many kinds of processing, and may be one ingredient for a concrete, referenceable sample. However, just the vacuum itself does not end in a referenceable result.
Application of concurrency: Subprocesses
Fig. 6 contains a process \(C_3\) which runs at least while \(P_1\) and \(P_2\) are running. Moreover, because there is a “cause” relationship between \(P_{1,2}\) and \(C_3\), \(C_3\) is also acting on the sample.
Note that the “cause” relationship from \(C_3\) to \(P_0\) only makes sense if \(C_3\) is significantly (in the sense of this research) influenced by the sample. For example, if \(C_3\) contains sensor data of sample properties, the relation is necessary. Otherwise, leave it out. You must not add this relation at all if \(C_3\) contains multiple samples (see compound processes below) because then you would intertwine the histories of all the samples.
Also note that the Concurrent has two processes pointing to it, which a visualising agent may interpret as a relationship between a super-process (\(C_3\)) and subprocesses (\(P_{1,2}\)).
Application of concurrency: Compound processes with multiple samples
SciMesh’s explicit states for each sample make an extra step necessary for compound processes. By “compound process” we mean a process that has been applied to more than one sample. Since every sample has its very own provenance, the compound process would have to be duplicated for each sample. This would be wasteful and should be avoided. Instead, place a so-called sample-specific process in the process graph which only contains information that is really special to the sample, e. g. its position in the apparatus. Then, give all these sample-specific processes a common Concurrent in a “cause” relation, with all the details of the compound process.
Properties of the sample-specific process like the above-mentioned “position in the apparatus” must be in this sample-specific process even if only one sample was processed in a particular run. It might be tempting to merge the sample-specific process and the concurrent into a single process and not create a concurrent in the first place. However, this is only possible if (1) only one sample was processed and (2) none of the fields of the sample-specific process are needed. Otherwise, queries on the graph would have to be different in both cases.
Fig. 7 gives an example. The “only position” processes contain only the position of the sample (e. g. in the fridge). At the same time, the compound process \(C_0\) contains fridge brand, temperature, whether the fan was active, etc. Cooling duration may be attached to the compound process or the sample processes \(P_{2,4}\), whatever makes more sense.
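The division of labour between the compound process and the sample-specific processes can be sketched as follows. All names and values are made up for illustration, and the rule that sample-specific values take precedence over shared ones is an assumption of this sketch, not something SciMesh specifies.

```python
# Hypothetical example data: one fridge run with two samples
properties = {
    "ex:cooling": {"fridge_brand": "Acme", "temperature_celsius": 4, "fan_active": True},
    "ex:position-A": {"position": "top shelf"},
    "ex:position-B": {"position": "bottom shelf"},
}
causes = {
    "ex:position-A": ["ex:sample-A-previous", "ex:cooling"],
    "ex:position-B": ["ex:sample-B-previous", "ex:cooling"],
}
concurrents = {"ex:cooling"}


def effective_properties(process):
    """Merge the properties of a process with those of its Concurrent causes."""
    merged = {}
    for cause in causes.get(process, []):
        if cause in concurrents:
            merged.update(properties.get(cause, {}))
    # Assumption: sample-specific values win over the shared ones.
    merged.update(properties.get(process, {}))
    return merged


sample_a = effective_properties("ex:position-A")
sample_b = effective_properties("ex:position-B")
print(sample_a["position"], sample_b["position"], sample_a["temperature_celsius"])
```

The fridge details are stored once, yet a query for either sample sees them together with its own position.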
Warning
Do not make the compound process the effect of older sample processes as it is done by the “cause (optional)” relation in Fig. 6! This would have to be done with all samples in the fridge, and this would mix together the provenance of all samples. It is highly unlikely that this is what you want. Besides, it is wrong conceptually: While indeed all samples are together in the fridge and will have influence on each other in an extremely slight way, they do not affect each other from the point of view of the respective research. And SciMesh maps the latter.
Application of concurrency: Sample splits
Fig. 8 shows how to represent a sample split. The concurrent \(C_0\) must be the effect of the last process of the parent sample. Then, there are detail processes \(P_{1,2}\), each of which represents the state of one child after the split. As with other applications of concurrents, the concurrent contains general properties, and the sample processes individual properties of the split. See the figure caption for examples.
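A split arranged this way lets both children share the history of the parent. The sketch below models the split with a plain dictionary of “cause” relations; the `ex:` names are hypothetical stand-ins for \(C_0\), \(P_1\) and \(P_2\).

```python
# Hypothetical example data
causes = {
    "ex:parent-last": ["ex:parent-first"],
    "ex:split": ["ex:parent-last"],   # the Concurrent C_0
    "ex:child-1": ["ex:split"],       # P_1
    "ex:child-2": ["ex:split"],       # P_2
}


def ancestors(process):
    """Return all processes reachable via "cause" relations."""
    found = set()
    todo = list(causes.get(process, []))
    while todo:
        current = todo.pop()
        if current not in found:
            found.add(current)
            todo.extend(causes.get(current, []))
    return found


shared = ancestors("ex:child-1") & ancestors("ex:child-2")
print(sorted(shared))  # ['ex:parent-first', 'ex:parent-last', 'ex:split']
```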
Unknown chronological order
If it is unknown which process comes first in a group of processes, they are organised in parallel in the graph. Note that this documents that information is missing. In particular, if the processes are invasive, i. e. they may change the sample significantly enough to alter results of subsequent processes, this means that the trustworthiness of the following processes is reduced.
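Whether the graph fixes the order of two processes can be tested by checking whether one is an ancestor of the other. The following sketch uses hypothetical `ex:` names; “cleaning” and “annealing” are arranged in parallel because their order is unknown.

```python
# Hypothetical example data
causes = {
    "ex:cleaning": ["ex:substrate"],
    "ex:annealing": ["ex:substrate"],
    "ex:measurement": ["ex:cleaning", "ex:annealing"],
}


def ancestors(process):
    """Return all processes reachable via "cause" relations."""
    found = set()
    todo = list(causes.get(process, []))
    while todo:
        current = todo.pop()
        if current not in found:
            found.add(current)
            todo.extend(causes.get(current, []))
    return found


def order_is_known(first, second):
    """True if the graph fixes which of the two processes came earlier."""
    return first in ancestors(second) or second in ancestors(first)


print(order_is_known("ex:cleaning", "ex:annealing"))     # False
print(order_is_known("ex:substrate", "ex:measurement"))  # True
```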
Prototype: JuliaBase
JuliaBase is a Python/Django framework for creating ELNs, or sample databases, with a high degree of customisability. Because it realises a process/sample-based workflow in a highly structured manner, it is a good candidate for prototyping SciMesh.
Fig. 10 shows a simple sample data sheet. In chronological order, you can see what has been done with the sample. In this case, only one thing: The “5-chamber deposition” is the only experiment here. It consists of three layers of silicon which have been deposited on the substrate, each with its setup configuration (temperature, gas flow rates).
Although this sample data sheet is so simple, its RDF representation is rather complex, see Fig. 11. This RDF is in Turtle format, which is human-readable (well, with some experience). After the namespace prefixes (the lines that start with @prefix), which serve merely to abbreviate common prefixes with very short names, you can see the sample with its properties, living in the jb-s namespace. The cause property points to the last process made with this sample.
This process is the only one at the same time (sm:cause: ()). It is the deposition process. It ends with the empty line. By the way, the instance-specific entities (in other words, the things that are special for the respective institute like the experimental methods) live in the namespace ns1.
What follows is the first layer with its data. It links to its deposition with the jb:isSubprocess property. The data of the second layer and the complete third layer are elided here for clarity.
Give it a try!
The prototypical implementation is kept in sync with this document. It is made against the JuliaBase software in its graphs branch. There is also a short howto for getting RDF data out of a JuliaBase test instance.
Note about content addressing
“Content addressing” means naming a content after its … well … content. By using the checksum of the content as the name, name and content become one. This is the most persistent identifier for data you can get. No ambiguities, no manipulations.
Such identifiers can be used to cite an element of a SciMesh graph. For example, one can cite a SciMesh insight in a paper. Or, one can use such identifiers in a blockchain to document absolutely reliably that a certain scientist made a certain discovery at a certain point in time.
The previously mentioned Git, for example, uses hash values for naming each commit. Those hashes are not random: They are the checksum of the current source tree content, and all changes that have led to this content. So, the complete history of a project is encoded into the commit’s name. This way, one cannot tamper with a Git repository without changing all names, too. Conversely, if I refer to a certain Git commit by its name, I can reliably check that I got unmanipulated content. Content addressing at its best.
Similar methods are used in IPFS or various (all?) blockchains.
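The principle behind such hash-based names can be sketched in a few lines. This is not Git's actual object format, nor that of IPFS or IPLD. It merely illustrates, with a hypothetical function `content_name`, how a name that covers both the content and the names of its parents makes tampering with the history visible.

```python
import hashlib
import json


def content_name(content, parent_names):
    """Name a node after its own content and the names of its parents."""
    payload = json.dumps(
        {"content": content, "parents": sorted(parent_names)}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical two-step history
substrate = content_name({"material": "silicon"}, [])
deposition = content_name({"layers": 3}, [substrate])

# Changing anything in the history changes every name that follows.
tampered_substrate = content_name({"material": "glass"}, [])
tampered_deposition = content_name({"layers": 3}, [tampered_substrate])
print(deposition != tampered_deposition)  # True
```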
However, in an RDF triple store, you can change anything you like at any time (provided you have the necessary permissions).
In order to get content addressing in SciMesh, one may capture a snapshot of a SciMesh graph in IPLD. IPLD would return a name for it, which will be persistently linked to that graph with exactly that content. However, this method has never been tried so far, so for the time being, content addressing and immutability in SciMesh should be considered future tech.