MINTE: Semantically Integrating RDF Graphs

Abstract

The RDF data model allows the same entity to be described in many different ways. For example, different RDF vocabularies may be used to describe pharmacogenomic data, and the same drug or gene is represented by different RDF graphs in DBpedia or DrugBank. To provide a unified representation of the same real-world entity, RDF graphs need to be semantically integrated. Semantic integration requires managing the knowledge encoded in RDF vocabularies, e.g., axiomatic definitions of vocabulary properties or resource equivalences, to determine the relatedness of different RDF representations of the same entity. We devise MINTE, an integration technique that relies on both the knowledge stated in RDF vocabularies and semantic similarity measures to merge semantically equivalent RDF graphs, i.e., graphs corresponding to the same real-world entity. MINTE follows a two-step approach to integrating RDF graphs. In the first step, MINTE applies a 1-1 weighted perfect matching algorithm to identify semantically equivalent RDF entities in different graphs. In the second step, MINTE relies on different fusion policies to merge the triples of these semantically equivalent RDF entities. We empirically evaluate the performance of MINTE on data from DBpedia, Wikidata, and DrugBank. The experimental results suggest that MINTE is able to accurately integrate semantically equivalent RDF graphs.
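The two steps described above can be illustrated with a minimal sketch. It is not MINTE's implementation and rests on several assumptions: entities are modelled as plain Python sets of (property, value) pairs, a Jaccard overlap stands in for the semantic similarity measure, the matching is found by brute force over permutations, and a simple union is used as the only fusion policy. All identifiers, names, and the threshold value are hypothetical.

```python
from itertools import permutations


def jaccard(a, b):
    """Stand-in similarity (assumption): overlap of (property, value) pairs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_one_to_one_matching(left, right, sim):
    """Brute-force maximum-weight 1-1 perfect matching; only viable for tiny inputs."""
    assert len(left) == len(right), "a perfect matching needs equal-sized sides"
    left_ids, right_ids = sorted(left), sorted(right)
    best_score, best_pairs = -1.0, []
    for perm in permutations(right_ids):
        pairs = list(zip(left_ids, perm))
        score = sum(sim(left[l], right[r]) for l, r in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_pairs


def union_fusion(left, right, pairs, sim, threshold):
    """Union policy (assumption): merge matched entities at or above the threshold."""
    return {
        (l, r): left[l] | right[r]
        for l, r in pairs
        if sim(left[l], right[r]) >= threshold
    }


# Toy descriptions of the same drug and gene in two hypothetical graphs.
graph_a = {
    "a:Aspirin": {("label", "aspirin"), ("type", "Drug"), ("formula", "C9H8O4")},
    "a:BRCA1": {("label", "brca1"), ("type", "Gene")},
}
graph_b = {
    "b:acetylsalicylic_acid": {("label", "aspirin"), ("type", "Drug"), ("mass", "180.16")},
    "b:BRCA1_gene": {("label", "brca1"), ("type", "Gene"), ("chromosome", "17")},
}

pairs = best_one_to_one_matching(graph_a, graph_b, jaccard)
merged = union_fusion(graph_a, graph_b, pairs, jaccard, threshold=0.4)
```

In this toy example the drug descriptions are paired with each other, as are the gene descriptions, and each merged entity holds the union of both property sets. For graphs of realistic size, the brute-force search would have to be replaced by a polynomial-time assignment solver, such as `scipy.optimize.linear_sum_assignment`.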

Introduction

The original vision of the Semantic Web put strong emphasis on the distributed, federated nature of this next evolution of the Web. While there have been some efforts in that direction, such as federated SPARQL queries, semantic search, or (meta-)data registries, we still see an imbalance: a large part of semantic technologies mimic traditional data management techniques. Meanwhile, knowledge about entities is spread across different datasets on the Internet or on the intranets of organizations. For example, information about chemical components and drugs is published by different data providers, e.g., DrugBank or KEGG.