Learning to assess Linked Data relationship using Genetic Programming – By Ilaria Tiddi

A very well-known issue for  tasks such as text-mining, named-entity disambiguation, ontology population or query expansion is how to identify and measure how strongly two entities are related. We tackled this problem in our work  “Learning to assess Linked Data relationships using Genetic Programming”, to be presented at the 15th International  Semantic Web Conference (ISWC 2016) in Kobe, Japan.

In the Web of Data, such relationship could be identified as a semantic path (i.e. a chain of entities and properties) between two entities, and graph search techniques could be used to reveal them. Applying such techniques to the Linked Data graph is challenging, because entities and properties of a path might come from a number of different, unknown data sources.

The natural approach to avoid indexing and local pre-processing is to rely on link traversal (Based on the fourth Linked Data principle), which allows to incrementally (and agnostically) explore the graph from entity to entity until paths between them are found. In other words, finding relationships between entities in the Linked Data graph requires a uniformed (or blind) search, which does not need pre-computation or knowledge over the entire graph. However, an uniformed search needs a “cost-function” to drive the search through the most promising paths in the space.

We focused this work on designing a cost-function to drive a Linked Data blind search that identifies the best relationships between two entities. We investigated the hypothesis that some Linked Data structural information, such as  topological and semantic features of the entities and properties traversed, could be used by a cost-function to efficiently find the best paths, i.e. the strongest relationships, between entities.  While one could intuitively think that the shortest paths reveal the strongest connections, this assumption does not necessarily hold within the Linked Data space, where entities of different datasets are connected by multiple paths of similar lengths.

To achieve this, we proposed to use a supervised method based on Genetic Programming [1] with the aim of learning the cost-function to integrate in the blind search. Because the results of a Genetic Programming are cost-fuctions that we could analyse and directly implement in our graph search, this approach was more suited when compared to other supervised learning techniques (e.g.  SVMs, Neural Networks, Linear Regression or learning-to-rank). We started from a randomly generated population of cost-functions created from a set of topological and semantic characteristics of the Linked Data graph (e.g. node indegree and outdegree, variety of namespaces, type of RDF properties of a URI), and used the evolutionary algorithm to reveal which cost-functions best compared with human evaluations. The cost-functions that were learnt were then compared in the experiments to some baselines found in the literature. All the results can be found online.

screencapture-mail-google-mail-u-0-1473710779809

Our work obtained three major results. First, we revealed a measure to detect strong entity relationships, which can be integrated in uninformed searches over Linked Data (therefore avoiding data pre-processing). Second, we demonstrated that a cost-function can be derived empirically, rather than being domain-specific or manually-defined as in the state-of-the-art. Finally, our results showed that a cost-function based on the basic topological features of the nodes of the paths as they are being traversed obtains good results, but that these can be improved through introducing a very small amount of knowledge about the vocabularies used to label the edges connecting the nodes.

[1] Riccardo Poli, William B. Langdon,Nicholas F. McPhee, John R. Koza, “A Field Guide to Genetic Programming” (2008)