On deriving data flows from scientific workflows – By Enrico Daga

Workflow models are designed to support the reuse of processing components across multiple executions.

This approach to encoding process knowledge has been particularly successful in the open science research community, as a way of supporting the recording and reuse of scientific experiments.

One example of this is the myexperiment.org portal.

Workflows are centered on the notion of reusable processing units, which can be linked to each other by means of input and output ports, in a graph-like structure leading from the initial main input to the final output.


“LipidMaps Query” from http://www.myexperiment.org/workflows/1052

 

However, there are scenarios in which what we need to represent is the data itself, and how it is affected by the actions of the workflow.

Data flows were recently introduced as representations that focus on data nodes, connected by arcs that express the semantic relations between them.

The Datanode ontology (http://purl.org/datanode/ns/) is a hierarchy of the possible relations between data objects that may occur within a workflow execution. It can be used, for example, to reason on the propagation of policies (see also https://dsg.kmi.open.ac.uk/data-exploitability-how-to-achieve-it/).


In this post we focus on how to derive such data flows from the representation of existing workflows.

The approach we will take is to learn these relations from the descriptions of the actions in the traditional workflow model.

Our assumption is that there is a relation between the features of the actions and the possible data-to-data annotations.

However, learning processes require a training set, which we do not have.

How to approach the cold start problem?

Our proposal is a supervised annotation process based on an incremental learning method that relies on Formal Concept Analysis (FCA) and an association rule mining (ARM) technique.

The process is structured as follows. We generate the data flow structure from the workflow by extracting all the input/output pairs from each action. The resulting graph is composed of data nodes linked by anonymous arcs (the I/O port pairs) (see the figure).


The items to be annotated are the single I/O port pairs (arcs), which we enrich with a set of features extracted from the description of the related action in the workflow model.
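As a minimal sketch of this extraction step (the workflow format and all names below are illustrative, not Dinowolf's actual data model), one I/O port pair is emitted per combination of an action's inputs and outputs, carrying the action's features:

```python
# Hypothetical sketch: derive the anonymous arcs of the data flow from a
# workflow description. The dictionary format and feature names are invented
# for illustration, not taken from Dinowolf.

def extract_arcs(actions):
    """Emit one (input port, output port) pair per combination of an
    action's inputs and outputs. Each pair is an anonymous arc of the
    data flow, enriched with the action's features for later annotation."""
    arcs = []
    for action in actions:
        for src in action["inputs"]:
            for dst in action["outputs"]:
                arcs.append({"from": src, "to": dst,
                             "features": action["features"]})
    return arcs

# A toy two-step workflow in the spirit of the LipidMaps example.
workflow = [
    {"name": "LipidMaps Query",
     "inputs": ["query_term"], "outputs": ["xml_result"],
     "features": {"feat:web_service", "feat:xml_output"}},
    {"name": "Extract IDs",
     "inputs": ["xml_result"], "outputs": ["lipid_ids"],
     "features": {"feat:xpath", "feat:xml_input"}},
]

for arc in extract_arcs(workflow):
    print(arc["from"], "->", arc["to"])
```

Each printed arc is one item to annotate; its feature set is what the learning step later correlates with the chosen Datanode relation.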

At the beginning, the user has no support in choosing the annotation, which she selects from the entire Datanode hierarchy.

As soon as annotations are produced, an FCA lattice is augmented with the sets of related items, associated with their features and annotations.

FCA clusters items sharing the same attributes into so-called closed itemsets.

From these sets, association rules can be derived in order to produce recommendations.

For this purpose, the lattice is reconstructed on each iteration to generate new association rules to be exploited for future recommendations.
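To illustrate the idea with a simplified sketch (a real FCA implementation builds the full concept lattice; the attribute names here are invented), the closed attribute sets can be computed as intersections of the annotated items' attribute sets, and candidate rules read off by splitting each closed set into features (body) and Datanode annotations (head):

```python
from itertools import combinations

# Simplified sketch of the FCA step. Items are annotated I/O pairs; their
# attributes mix extracted features ("feat:...") and chosen Datanode
# annotations ("dn:..."). All identifiers are illustrative.

items = {
    "arc1": {"feat:xml_input", "feat:xpath", "dn:hasPortion"},
    "arc2": {"feat:xml_input", "dn:hasPortion"},
    "arc3": {"feat:string_merge", "dn:combinationFrom"},
}

def closed_intents(items):
    """Intents of the concept lattice: all non-empty intersections of the
    items' attribute sets (including each single item's own set)."""
    intents = set()
    for r in range(1, len(items) + 1):
        for group in combinations(items.values(), r):
            common = frozenset.intersection(*map(frozenset, group))
            if common:
                intents.add(common)
    return intents

def rules_from_intents(intents):
    """Turn each closed set into a candidate rule: features imply annotations."""
    rules = []
    for intent in intents:
        body = frozenset(a for a in intent if a.startswith("feat:"))
        head = frozenset(a for a in intent if a.startswith("dn:"))
        if body and head:
            rules.append((body, head))
    return rules

for body, head in rules_from_intents(closed_intents(items)):
    print(sorted(body), "->", sorted(head))
```

Here the two XML-splitting arcs yield, among others, the rule {feat:xml_input} → {dn:hasPortion}, which can then be ranked and recommended for new arcs with similar features.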

However, generating all association rules on each iteration can be a waste of resources.

Therefore, our approach includes a rule mining technique that retrieves only the rules useful for annotating a given I/O port pair, ranked by three measures:

Support s(b → h): the ratio of I/O pairs satisfying b ∪ h to all the I/O pairs in the lattice

Confidence k(b → h): the ratio of I/O pairs satisfying b ∪ h to the I/O pairs satisfying b

Relevance r(b → h): the degree of overlap between the body of the rule b and the set of features of the candidate I/O pair.
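Assuming each annotated I/O pair is represented as its set of attributes (features plus annotation), the three measures can be sketched as follows (function and attribute names are illustrative, not the paper's implementation):

```python
# Sketch of the three ranking measures. Each I/O pair is modelled as a set
# of attributes; rules have a body b (features) and a head h (annotations).
# All names and example data are illustrative.

def support(body, head, pairs):
    """Ratio of I/O pairs satisfying b ∪ h to all I/O pairs."""
    both = body | head
    return sum(both <= p for p in pairs) / len(pairs)

def confidence(body, head, pairs):
    """Ratio of I/O pairs satisfying b ∪ h to those satisfying b."""
    matching_body = [p for p in pairs if body <= p]
    both = body | head
    return sum(both <= p for p in matching_body) / len(matching_body)

def relevance(body, candidate_features):
    """Degree of overlap between the rule body and the candidate's features."""
    return len(body & candidate_features) / len(body)

pairs = [
    {"feat:xml", "feat:split", "dn:hasPortion"},
    {"feat:xml", "dn:hasPortion"},
    {"feat:merge", "dn:combinationFrom"},
]
body, head = {"feat:xml"}, {"dn:hasPortion"}

print(support(body, head, pairs))                    # 2 of 3 pairs satisfy b ∪ h
print(confidence(body, head, pairs))                 # both pairs matching b also match h
print(relevance(body, {"feat:xml", "feat:other"}))   # full overlap with the body
```

Rules with high relevance to the candidate pair's features, and high support and confidence in the lattice, are the ones surfaced as recommendations.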

The hypothesis we make is that the quality of the recommendations will improve over time!

We designed an experiment to evaluate our approach, implemented in a tool named Dinowolf (Datanode in workflows) – http://github.com/enridaga/dinowolf .

We asked six users to annotate 20 workflows downloaded from www.myexperiment.org, for a total of 260 I/O port pairs.

We observed that the time required to choose an annotation, sometimes more than 5 minutes, decreased substantially after the first 30-40 items annotated.

Indeed the effort of annotating workflows is reduced thanks to the recommendations.

Evolution of the time spent by each user on a given annotation page of the tool before a decision was made.


The cold start problem is tackled: users selected annotations from the recommendations as soon as these became available.

Moreover, the rank of the picked recommendations improves over time, as does the related relevance score, demonstrating that the quality of the recommendations increases.

Supporting users in annotating workflows with data-to-data relations via recommendations is problematic because of the lack of an initial training set (the cold start problem). We tackled this issue by means of an incremental learning process that leverages FCA and an information retrieval approach to ARM.

Future work includes the integration of this approach into Data Hub metadata management to support policy propagation, as described in this related post.

Moreover, it would be interesting to study the quality and consistency of annotations between users, including agreement and disagreement in the usage of Datanode.

Finally, the solution is domain independent, and we expect it could be easily applied to other scenarios.

This work was presented at the 20th International Conference on Knowledge Engineering and Knowledge Management – EKAW, Bologna, Italy 19-23 November 2016.

http://link.springer.com/chapter/10.1007/978-3-319-49004-5_9