SciMesh at a glance
===================


Introduction
------------

SciMesh represents scientific insights as a knowledge graph.  In its current
realisation, it focuses on sample-based workflows, in which samples (physical
ones or data artefacts) undergo a sequence of processing and measurement steps.
However, it is not limited to that.  Most generally, it represents scientific
insight by declaring a relationship between cause and effect.  In other words,
*if* certain prerequisites are set, *then* certain observations will be made.

.. figure:: graph.*
   :width: 70%
   :name: graph

   Example graph of processes.

:numref:`graph` shows a graph illustrating this line of thought.  Starting from
an initial state called “nil”, processes change this state.  There is a
monotonic increase in time from left to right.  Consequently, this time axis
is also inherent in every chain of processes.  A process may be “take glass
substrate out of rack”, “heat sample”, “mix two substances together”, or “wait
for solar eclipse”.  It is really that general.  Every process has one or
more causes.  If it is a single one, it may be the initial state.  The initial
state is totally void.  It bears no information at all, which is why the chain
of processes must define the states in-between as completely as necessary to be
useful for scientific conclusions.  Conversely, one process may be the cause of
multiple others.  This way, chains can be branched off, possibly by different
scientists years after the work on the main trunk.

.. figure:: cause_effect.*
   :width: 70%
   :name: cause_effect

   Example graph of processes, showing cause and effect.

:numref:`cause_effect` shows how the essence of scientific work, the relation
of effects to causes, is represented in the graph: The processes up to a
certain point are the cause, and the observations at that point are the effect.
Therefore, it is valid to call that point an “insight”, which is a graph node
of its own.  Two things are important here.  First, the process to which the
observational data is attached is a measurement, and quite frequently, a
measurement does not change state.  Still, it is a process proper, with all
bells and whistles.  We think that it is not necessary to distinguish between
measurements and processes which actually change something.  Besides, a
measurement does change state, albeit the change may not be significant.  And
secondly, a certain graph of processes might have many insights pointing to it.
In particular, the processes following a measurement may lead to a second
measurement with new results, meaning a new insight.

.. figure:: insight.*
   :width: 50%
   :name: insight

   Is-a relations of insight-like entity classes.

Speaking of insights, :numref:`insight` shows the relationships of things that
may point to a certain state (a.k.a. process) in a graph.  All of them are
*insights*, but if one refers to a chain of concrete processes at certain
points in time, it should be labelled as an *experiment*.  If all of this happened
to the same sample, the experiment may be identified with this *sample*.

If the processes are not concrete but generic blueprints of processes, the
experiment is actually a *recipe* for experiments.  Or, by bringing cause and
effect together, it is a scientific *hypothesis*.

To illustrate experiment versus hypothesis, consider the following two
examples, respectively:

1. On 21 October 2019, John threw a ball from the Eiffel Tower, and it moved
   downwards.
2. A body is released in a gravitational field, and it moves according to the
   force field.


Current scope of SciMesh
------------------------

The currently specified RDF data model of SciMesh is narrower than the very
general concept outlined above.  The reason is very simple: Our work is in its
early stages, and we have to focus on specific domains of research in order to
avoid frittering away our resources.  Since we work in `task area “Caden” of
NFDI4Ing`_, our research domain of choice is sample-based workflows.

We hope to be able to give guidelines for how to extend SciMesh to other
domains, eventually.

.. _`task area “Caden” of NFDI4Ing`: https://nfdi4ing.de/archetypes/caden/


Data model overview
-------------------

The RDF data model is built around two concepts: *sample* and *process*.  Both
are meant in their broadest senses.  A sample may represent a state of a
physical specimen as well as a data set.  A process may create a new sample
state (i.e., change the sample, for example an etching process), or create new
data (e.g. a measurement on a sample), or both.

.. figure:: model.*
   :width: 90%
   :name: model

   Simplified example topology of a SciMesh knowledge graph.  Colours denote
   namespaces: Blue is the ELN namespace, green is the SciMesh namespace, and red
   are external namespaces (RDF, OWL, OBO etc).

:numref:`model` gives an overview of the anatomy of a knowledge graph in
SciMesh.  It is a very simple graph yet contains most of the basic concepts.
At its heart, there is a sequence of processes (at the bottom) that work on the
sample.  The first process, “substrate”, creates the sample (the sample starts
its life as a bare substrate).  Its RDF properties determine the basic physical
properties of the sample (e.g. material and size).  Then, further layers of
material are deposited on the substrate in the deposition process.  Common
properties of most processes are name, method, timestamp, operator, and
comments.

Note that three different namespace domains are involved:

1. The domain of the ELN instance: the processes, the process types, the sample
   and its intermediate instances.  This is in blue.
2. The domain of the ELN software: the sample type.  This is in green.
3. External domains like BFO/OBO and RDF.  This is in red.

.. note::

   Some may have noticed a striking resemblance to Git’s data model.  Indeed,
   there is a very close mapping possible between both.  The processes are
   commits, and the sample is a branch.  You may split a sample in half,
   creating a new branch.  At the same time, one process may have more than one
   cause.  This way, pouring two chemical substances together corresponds to
   merging in Git.


SciMesh vocabulary
==================

Default namespace: ``http://scimesh.org/SciMesh/``

**Process** (class)
  A process in the sense of :numref:`graph` and :numref:`cause_effect`.

**Insight** (class)
  An insight in the sense of :numref:`graph` and :numref:`cause_effect`.

**Experiment** (class)
  A specialised insight (see :numref:`insight`) which represents a concrete
  experiment.  It points to a graph of equally concrete processes,
  i.e. processes that happened at a certain place and point in time, with
  certain ingredients and result state and/or observational results.  Normally,
  all such processes bear a timestamp.

**Hypothesis** (class)
  A specialised insight (see :numref:`insight`) which represents a general
  hypothesis, in contrast to a concrete experiment.  It points to a graph of
  non-concrete processes which represent a sequence of generic state changes
  and observations.

**Recipe** (class)
  A specialised insight (see :numref:`insight`) which represents a general
  recipe, in contrast to a concrete experiment.  It points to a graph of
  non-concrete processes which represent a sequence of generic state changes.

**Sample** (class)
  A sample (see also :numref:`insight`), which may be a physical specimen as
  well as a simulation result artefact.  It is represented by the process it
  points to, and all of its ancestors.  This process chain may also contain
  processes *after* the sample.  Those are not part of the sample.

**Concurrent** (class)
  This is a concurrent process, a subclass of Process.  This means that in
  absence of schemas, you have to tag an entity as a Concurrent by tagging it
  as both Concurrent and Process.  A concurrent process does not represent a
  well-defined state.  Therefore, you should not refer to its URI from
  elsewhere for communicating something that others can or should reproduce.
  It can only be used as a container for further process data for another
  process that points to it with a “cause” relation.  See section
  “`Concurrency`_” for further information.

**cause** (property)
  The object is a cause of the subject.  Both subject and object are of type
  “Process”.

  The object may also be ``rdf:nil``.  In this case, it must be the only cause
  apart from concurrents, and it documents that no causes of this process are
  known or ever will be known.

**state** (property)
  This connects an insight to a process of its process chain, in particular,
  the latest process.  If the insight is a sample, the latest process
  represents the current state of the sample.

  However, there may be more than one state of an insight.  Multiple states
  help to cope with gaps in the process graph (e.g. because external servers
  are down).

**operator** (property)
  The operator of a process.  The object may be a string with the person’s name
  or email address, or the URI of the operator.

**timestamp** (property)
  The point in time when the process was done.  The object of this triple is a
  blank node which must have at least the property ``time:inXSDDateTimeStamp``.
  The ``time`` namespace stems from https://www.w3.org/TR/owl-time/.
  Concretely, in Turtle, it may look like this:

  .. code-block:: turtle

    <http://inm.example.com/5-chamber_depositions/14S-005> a sm:Process,
        …
        sm:timestamp [ time:inXSDDateTimeStamp "2014-10-02T14:10:00+00:00"^^xsd:dateTime ] .


External vocabulary used in SciMesh
-----------------------------------

**http://www.w3.org/2000/01/rdf-schema#label** (property)
  The name of a process or sample instance.  It should be human-readable.
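
As a rough sketch of how these terms might fit together, the following Turtle
fragment encodes the first example from the introduction (John’s ball) as an
``sm:Experiment``.  Only the ``sm:``, ``rdfs:`` and ``time:`` terms are taken
from this document.  The ``ex:`` namespace, all URIs in it, the property
``ex:observedDirection``, and the times of day are made up for illustration.

.. code-block:: turtle

   @prefix sm:   <http://scimesh.org/SciMesh/> .
   @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
   @prefix time: <http://www.w3.org/2006/time#> .
   @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
   @prefix ex:   <http://eln.example.org/> .   # hypothetical ELN instance

   # The throw: a concrete process with a name, an operator, and a timestamp.
   ex:throw-1 a sm:Process ;
       rdfs:label "Throw ball from the Eiffel Tower" ;
       sm:operator "John" ;                       # operator given as a string
       sm:timestamp [ time:inXSDDateTimeStamp
                      "2019-10-21T12:00:00+00:00"^^xsd:dateTime ] .

   # The observation: a measurement-like process caused by the throw.
   ex:observation-1 a sm:Process ;
       rdfs:label "Observe motion of the ball" ;
       sm:operator ex:john ;                      # operator given as a URI
       sm:timestamp [ time:inXSDDateTimeStamp
                      "2019-10-21T12:00:05+00:00"^^xsd:dateTime ] ;
       ex:observedDirection "downwards" ;         # hypothetical property
       sm:cause ex:throw-1 .

   # The experiment is an insight whose state is the latest process.
   ex:experiment-1 a sm:Experiment ;
       sm:state ex:observation-1 .

The two ``sm:operator`` statements merely show both permitted forms of the
object, a plain string and a URI.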


Detailed description of the data model
======================================

The core
--------

The basic component of SciMesh is the *Process*.  It describes some action that
leads to a new state of a certain specimen.  This state may or may not differ
from the state before.  For example, if the specimen hasn’t actually been
changed, it is probably a non-invasive measurement process with data output.
In either case, each process has a distinct URI.  This URI also represents the
state which that process generates.

Processes are connected to all of their parameters, data outputs, measurement
devices, methods, operators, timestamps etc.  The process must be the subject
of such triples. [#]_

Moreover, processes are chained.  Every “Process” points with “cause”
relationships to its immediate cause processes, of which the current process
is an effect.  This reflects the cause–effect relationships of nature.

At some point in the past, no further causes are known.  There are some, of
course, because only the creation of the universe has no cause.  (Probably.)
But there are no *known* causes, and if it is certain that they will never be
known, one can make this explicit in SciMesh by adding a “cause” relationship
to ``rdf:nil``.

At the other end of the chain, i.e. in the present, there may be a sample
pointing with a “state” relationship to the current state of the sample, which
is the latest process that happened to the sample.  This way, the current state
as well as its complete provenance is documented in detail.
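
A minimal Turtle sketch of such a chain could look as follows.  It assumes a
hypothetical ELN namespace ``ex:`` with made-up process and sample names; only
the ``sm:`` terms, ``rdfs:label`` and ``rdf:nil`` are taken from this
document.

.. code-block:: turtle

   @prefix sm:   <http://scimesh.org/SciMesh/> .
   @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
   @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
   @prefix ex:   <http://eln.example.org/> .   # hypothetical ELN instance

   # Oldest process: its causes are unknown and will stay unknown.
   ex:substrate-1 a sm:Process ;
       rdfs:label "Substrate" ;
       sm:cause rdf:nil .

   # Each later process points back to its immediate cause.
   ex:deposition-1 a sm:Process ;
       rdfs:label "Deposition" ;
       sm:cause ex:substrate-1 .

   ex:measurement-1 a sm:Process ;
       rdfs:label "Conductivity measurement" ;
       sm:cause ex:deposition-1 .

   # The sample points to its current state, i.e. the latest process.
   ex:sample-1 a sm:Sample ;
       rdfs:label "Sample 1" ;
       sm:state ex:measurement-1 .

Following the ``sm:cause`` links from ``ex:measurement-1`` back to ``rdf:nil``
yields the complete provenance of the sample.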

.. figure:: simple_chain.*
   :width: 65%
   :name: simple_chain

   Basic process chain with sample.


This data model is simple.  Too simple, in fact.  Therefore, we introduce an
additional concept: concurrent processes.


.. _Concurrency:

Concurrency
-----------

So far, we’ve had *sequential* processes.  Each process has a timespan which
does not overlap with that of any other process.  While this reflects the
course of time perfectly, it is flawed for purely practical reasons: Sometimes,
we need to document actions which work on a specimen at the same time in
different processes.  For instance, if some processes take place in the same
vacuum, the vacuum should be its own process running in parallel with the
things happening in it.  Then, vacuum parameters can be connected with that
process and need not be replicated.

This is expressed with a “cause” relation to a subclass of Process:
“Concurrent”.  This is a process the running time of which overlaps with all of
its effect processes.  In the following, :math:`P` denotes a Process proper
(i.e., *not* a Concurrent), :math:`C` a Concurrent, and :math:`X` a Process or
Concurrent.  Further, :math:`T(X)` is the timespan of :math:`X`, i.e. the set
of points in time at which :math:`X` has effect.  :math:`(s, p, o)` is an RDF
triple consisting of subject, predicate, object.  Then:

.. math::
   :label: concurrent_timespan

   (X, \text{cause}, C) &\Rightarrow T(X) \cap T(C) \ne \varnothing, \qquad\text{(concurrent)}\\
   (X, \text{cause}, P) &\Rightarrow T(X) \cap T(P) = \varnothing. \qquad\text{(sequential)}

If the cause is a Concurrent, SciMesh makes no stronger assertion!  You may
consider :math:`C` a super- or a subprocess of :math:`X`, and :math:`C` may
start before, with, or during :math:`X`.  It may even be that only a subset of
:math:`T(C)` is known, as long as the condition :eq:`concurrent_timespan`
holds.
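
As a worked example with made-up numbers, let all times be hours of one day.
Suppose a vacuum :math:`C` is held from 9 to 12, a first process :math:`P_1`
runs from 9.5 to 10, and a second process :math:`P_2` from 10.5 to 11.  With
the triples :math:`(P_1, \text{cause}, C)`, :math:`(P_2, \text{cause}, C)`,
and :math:`(P_2, \text{cause}, P_1)`, both conditions are met:

.. math::

   T(C) &= [9, 12], \quad T(P_1) = [9.5, 10], \quad T(P_2) = [10.5, 11],\\
   T(P_1) \cap T(C) &= [9.5, 10] \ne \varnothing, \qquad\text{(concurrent)}\\
   T(P_2) \cap T(C) &= [10.5, 11] \ne \varnothing, \qquad\text{(concurrent)}\\
   T(P_2) \cap T(P_1) &= \varnothing. \qquad\text{(sequential)}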

A Concurrent does not represent a well-defined state.  Thus, do not put a
reference to a Concurrent anywhere if you want people to reproduce it.  They
couldn’t.  Instead, cite the URI of the closest following (in time) Process
proper.

The reason why a Concurrent is not a well-defined state is that it holds
only partial aspects of the subsequent Process(es) proper.  While its influence
on those Processes is well-defined and leads to states, the Concurrent alone
does not.  For instance, a vacuum is a well-defined ambient condition for many
kinds of processing, and may be one ingredient for a concrete, referenceable
sample.  However, just the vacuum itself does not end in a referenceable
result.


Application of concurrency: Subprocesses
........................................

.. figure:: concurrent.*
   :width: 60%
   :name: concurrent

   Basic concurrency example.

:numref:`concurrent` contains a process :math:`C_3` which runs at least while
:math:`P_1` and :math:`P_2` are running.  Moreover, because there is a “cause”
relationship between :math:`P_{1,2}` and :math:`C_3`, :math:`C_3` is also
acting on the sample.

Note that the “cause” relationship from :math:`C_3` to :math:`P_0` only makes
sense if :math:`C_3` is significantly (in the sense of this research)
influenced by the sample.  For example, if :math:`C_3` contains sensor data of
sample properties, the relation is necessary.  Otherwise, leave it out.  You
must not add this relation at all if :math:`C_3` involves multiple samples (see
compound process below) because then you would intertwine the history of all
the samples.

Also note that the Concurrent has two processes pointing to it, which a
visualising agent may interpret as a relationship between a super-process
(:math:`C_3`) and subprocesses (:math:`P_{1,2}`).
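
The topology described above might be written in Turtle roughly as follows.
This is only a sketch: it assumes that :math:`P_1` follows :math:`P_0` and
that :math:`P_2` follows :math:`P_1`, and the ``ex:`` namespace, the label,
and the property ``ex:basePressure`` are hypothetical.

.. code-block:: turtle

   @prefix sm:   <http://scimesh.org/SciMesh/> .
   @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
   @prefix ex:   <http://eln.example.org/> .   # hypothetical ELN instance

   ex:P0 a sm:Process .

   # A Concurrent is tagged as both Concurrent and Process.
   ex:C3 a sm:Concurrent, sm:Process ;
       rdfs:label "Vacuum" ;
       ex:basePressure "1e-6 mbar" ;   # hypothetical parameter, stored once
       sm:cause ex:P0 .                # optional, see the note on P_0 above

   # Both processes proper run while C3 is in effect.
   ex:P1 a sm:Process ;
       sm:cause ex:P0, ex:C3 .

   ex:P2 a sm:Process ;
       sm:cause ex:P1, ex:C3 .

References meant for reproduction would cite ``ex:P1`` or ``ex:P2``, not
``ex:C3``.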


Application of concurrency: Compound processes with multiple samples
....................................................................

SciMesh’s explicit states for each sample make an extra step necessary for
compound processes.  With “compound process” we mean a process that has been
applied to more than one sample.  Since every sample has its very own
provenance, this means that the compound process would need to be duplicated
multiple times.  This would be wasteful and should be avoided.  Instead, place
a so-called sample-specific process in the process graph which only contains
information that is really special to the sample, e. g. its position in the
apparatus.  Then, give all these sample-specific processes a common Concurrent
in a “cause” relation, with all the details of the compound process.

Properties of the sample-specific process like the above-mentioned “position in
the apparatus” must be in this sample-specific process even if only one sample
was processed in a particular run.  It might be tempting to merge the
sample-specific process and the concurrent into a single process and not to
create a concurrent in the first place.  However, this is only possible if (1)
only one sample was processed and (2) none of the fields of the sample-specific
process are needed.  Otherwise, queries on the graph would have to differ
between the two cases.

.. figure:: compound_process.*
   :width: 55%
   :name: compound-process

   Representation of a compound process: Each sample has only sample-specific
   process data in its direct graph (:math:`P_{2,4}`, here only the position in
   the apparatus, e. g. a fridge).  These sample-specific processes have a
   common predecessor :math:`C_0` which contains all process details
   (temperature, heating time etc).

:numref:`compound-process` gives an example.  The “only position” processes
contain only the position of the sample (e. g. in the fridge).  At the same
time, the compound process :math:`C_0` contains fridge brand, temperature,
whether the fan was active, etc.  Cooling duration may be attached to the
compound process or the sample processes :math:`P_{2,4}`, whatever makes more
sense.
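
In Turtle, the fridge example might be sketched like this.  The ``ex:``
namespace, the names of the earlier processes and of the samples, and all
``ex:`` properties and values are assumptions made for illustration.

.. code-block:: turtle

   @prefix sm:   <http://scimesh.org/SciMesh/> .
   @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
   @prefix ex:   <http://eln.example.org/> .   # hypothetical ELN instance

   # The compound process holds everything the samples have in common.
   # It has no "cause" relation to any sample process.
   ex:C0 a sm:Concurrent, sm:Process ;
       rdfs:label "Cooling in fridge" ;
       ex:temperatureCelsius 4.0 ;     # hypothetical properties
       ex:fanActive true .

   # Latest processes of the two samples before the cooling.
   ex:previous-A a sm:Process .
   ex:previous-B a sm:Process .

   # Sample-specific processes: only the position differs.
   ex:P2 a sm:Process ;
       ex:position "top shelf" ;
       sm:cause ex:previous-A, ex:C0 .

   ex:P4 a sm:Process ;
       ex:position "bottom shelf" ;
       sm:cause ex:previous-B, ex:C0 .

   ex:sample-A a sm:Sample ; sm:state ex:P2 .
   ex:sample-B a sm:Sample ; sm:state ex:P4 .

Because ``ex:C0`` points to neither ``ex:previous-A`` nor ``ex:previous-B``,
the provenance of the two samples stays separate.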

.. warning::

   Do not make the compound process the effect of older sample processes as it
   is done by the “cause (optional)” relation in :numref:`concurrent`!  This
   would have to be done with all samples in the fridge, and this would mix
   together the provenance of all samples.  It is highly unlikely that this is
   what you want.  Besides, it is wrong conceptually: While indeed all samples
   are together in the fridge and will have influence on each other in an
   extremely slight way, they do not affect each other from the point of view
   of the respective research.  And SciMesh maps the latter.


Application of concurrency: Sample splits
.........................................

.. figure:: split.*
   :width: 55%
   :name: split

   Representation of a sample split: Each sample has only sample-specific
   process data in its after-split processes (:math:`P_{1,2}`, the position
   within the parent sample, the size after the split, etc).  These
   sample-specific processes have a common predecessor :math:`C_0`, which
   contains all general split properties (split method, whether there still is
   remaining material of the parent sample, etc).

:numref:`split` shows how to represent a sample split.  The concurrent
:math:`C_0` must be the effect of the last process of the parent sample.  Then,
there are detail processes :math:`P_{1,2}`, each of which represents the state
of one child after the split.  As with other applications of
concurrents, the concurrent contains general properties, and the sample
processes individual properties of the split.  See the figure caption for
examples.
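
A split could be sketched in Turtle as follows.  The ``ex:`` namespace, the
process and sample names, and all ``ex:`` properties are hypothetical.  The
sketch follows the text literally, so the child processes point only to
:math:`C_0`; whether they should additionally point to the last process of the
parent is not stated here and is therefore left out.

.. code-block:: turtle

   @prefix sm:   <http://scimesh.org/SciMesh/> .
   @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
   @prefix ex:   <http://eln.example.org/> .   # hypothetical ELN instance

   # Last process of the parent sample.
   ex:parent-last a sm:Process .

   # General properties of the split live in the concurrent.
   ex:C0 a sm:Concurrent, sm:Process ;
       rdfs:label "Sample split" ;
       ex:splitMethod "cleaving" ;          # hypothetical properties
       ex:parentMaterialRemains false ;
       sm:cause ex:parent-last .

   # One detail process per child, with its individual properties.
   ex:P1 a sm:Process ;
       ex:positionInParent "left half" ;
       sm:cause ex:C0 .

   ex:P2 a sm:Process ;
       ex:positionInParent "right half" ;
       sm:cause ex:C0 .

   ex:child-1 a sm:Sample ; sm:state ex:P1 .
   ex:child-2 a sm:Sample ; sm:state ex:P2 .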


Unknown chronological order
---------------------------

If it is unknown which process comes first in a group of processes, they are
organised in parallel in the graph.  Note that this documents that information
is missing.  In particular, if the processes are invasive, i. e. they may
change the sample significantly enough to alter results of subsequent
processes, this means that the trustworthiness of the following processes is
reduced.

.. figure:: unknown_order.*
   :width: 60%
   :name: unknown-order

   Representation of unknown chronological order: The affected processes are
   placed in parallel in the graph.  All arrows are “cause” relations.
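
A minimal Turtle sketch of this situation, assuming that “in parallel” means
that the affected processes share the same cause and are jointly the causes of
the next process, might look like this.  The ``ex:`` namespace and all names
in it are hypothetical.

.. code-block:: turtle

   @prefix sm: <http://scimesh.org/SciMesh/> .
   @prefix ex: <http://eln.example.org/> .   # hypothetical ELN instance

   ex:before a sm:Process .

   # Order of A and B is unknown: neither is a cause of the other.
   ex:A a sm:Process ; sm:cause ex:before .
   ex:B a sm:Process ; sm:cause ex:before .

   # The following process is an effect of both.
   ex:after a sm:Process ; sm:cause ex:A, ex:B .

Since there is no “cause” relation between ``ex:A`` and ``ex:B``, the graph
makes no claim about which of them happened first.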


Prototype: JuliaBase
====================

`JuliaBase`_ is a Python/Django framework for creating ELNs, or sample
databases, with a high degree of customisability.  Because it realises a
process/sample-based workflow in a highly structured manner, it is a good
candidate for prototyping SciMesh.

.. _Juliabase: https://juliabase.org


.. figure:: Sample_14S-005.*
   :width: 60%
   :name: data-sheet

   Data sheet of sample “14S-005”, as seen in a JuliaBase instance by the
   browser.

:numref:`data-sheet` shows a simple sample data sheet.  In chronological order,
you can see what has been done with the sample.  In this case, only one thing:
The “5-chamber deposition” is the only experiment here.  It consists of three
layers of silicon which have been deposited on the substrate, each with its
setup configuration (temperature, gas flow rates).

.. figure:: turtle.*
   :width: 100%
   :name: turtle

   Turtle representation of sample “14S-005”.

..
   Convert turtle output to SVG: pygmentize -o toll.svg -l turtle sample.rdf

Although this sample data sheet is so simple, its RDF representation is rather
complex, see :numref:`turtle`.  This RDF is in Turtle format, which is human
readable (well, with some experience).  After the namespace prefixes (the lines
that start with ``@prefix``), which serve merely to abbreviate common prefixes
with very short names, you can see the sample with its properties, living in
the ``jb-s`` namespace.  The ``cause`` property points to the last process made
with this sample.

This process is at the same time the only one (``sm:cause ()``, where ``()`` is
Turtle shorthand for ``rdf:nil``).  It is the deposition process.  Its
description ends with the empty line.  By the way, the
instance-specific entities (in other words, the things that are special for the
respective institute like the experimental methods) live in the namespace
``ns1``.

What follows is the first layer with its data.  It links to its deposition with
the ``jb:isSubprocess`` property.  The data of the second layer and the
complete third layer are elided here for clarity.


Give it a try!
--------------

The prototypical implementation is kept in sync with this document.  It is made
against the JuliaBase_ software in its `graphs branch`_.  There is also a
`short howto`_ for getting RDF data out of a JuliaBase test instance.

.. _Juliabase: https://juliabase.org
.. _graphs branch: https://github.com/juliabase/juliabase/tree/graphs
.. _short howto: https://github.com/juliabase/juliabase/blob/graphs/Graph%20export.rst


Note about content addressing
=============================

“Content addressing” means naming a piece of content after its … well …
content.  By using the checksum of the content as the name, name and content
become one.
This is the most persistent identifier for data you can get.  No ambiguities,
no manipulations.

Such identifiers can be used to cite an element of a SciMesh graph.  For
example, one can cite a SciMesh insight in a paper.  Or, one can use such
identifiers in a blockchain to document absolutely reliably that a certain
scientist made a certain discovery at a certain point in time.

The previously mentioned Git, for example, uses hash values for naming each
commit.  Those hashes are not random: They are the checksum of the current
source tree content, and *all* changes that have led to this content.  So, the
complete history of a project is encoded into the commit’s name.  This way, one
cannot tamper with a Git repository without changing all names, too.
Conversely, if I refer to a certain Git commit by its name, I can reliably
check that I got unmanipulated content.  Content addressing at its best.

Similar methods are used in IPFS or various (all?) blockchains.

However, in an RDF triple store, you can change anything as you like at any
time (given you have the necessary permissions).

In order to get content addressing in SciMesh, one may capture a snapshot of a
SciMesh graph in IPLD.  IPLD would return a name for it, which will be
persistently linked to that graph with exactly that content.  However, this
method has never been tried so far, so for the time being, content addressing
and immutability in SciMesh should be considered future tech.


.. [#] The underlying reason is that SciMesh relations always point to the
       past.  This makes it easier to detect whether additions to a graph
       manipulate causality in retrospect.

..  LocalWords:  SciMesh SciMesh’s
..  Local Variables:
..  eval: (auto-fill-mode)
..  eval: (ispell-change-dictionary "en_GB")
..  eval: (flyspell-mode)
..  eval: (flyspell-buffer)
..  End: