While we’ve been testing Neo4j with all FlyMine data and with PhytoMine to verify how well it performs and scales with big databases, we started exploring another open source implementation for graph databases: Blazegraph.
Blazegraph is a open source high-performance graph database supporting the RDF data model.
RDF is a model to describe and store data: in this model, you express facts, also known as “statements”, composed by three parts knowns as triples. Each triple is composed of a subject (the resource), the predicate (the property name of the resource) and the object (the property value). For this reasons, Blazegraph is also called a “triples store”.
Blazegraph supports SPARQL (pronounced “sparkle”), a rich and expressive query language for RDF, which is extremely standardized. Using query operations like union, sort, filter and aggregation, the user can query the data in a very flexible way. With federated queries, the user can aggregate information executing queries distributed over different SPARQL endpoints and consequently discover more data across the web.
Blazegraph provides a SPARQL endpoint where the user can remotely explore, access, and download the data stored using SPARQL language; Blazegraph workbench provides a graphical interface for the REST APIs.
Blazegraph and Neo4j: different graph modelling
In Neo4j, a node in the graph corresponds to an entity in a domain. A node, but also the relationships between the nodes, can contain properties describing the object that it represents.
By contrast, in Blazegraph, the nodes don’t contain properties but primitive data like string, integer, date.
In Neo4j we’ve represented the gene entity and its relation with the organism in this way:
In Blazegraph the same concept will be represented as:
with the following statements:
Only one statement represents the relation between the gene and the organism (that one containing the predicate hasOrganism), the others describe the properties of the two entities.
The resources represented in RDF are identified by unique HTTP URIs (in the example http: //flymine.intermine.org/flymine/1007664).
Exporting FlyMine data: Intermine-RDFizer
We have exported all FlyMine data using Intermine-RDFizer.
The Intermine-RDFizer can query any InterMine endpoint via InterMine API, download the tables in tsv files and transform them into RDF nquads based on the XML object model file.
The InterMine-RDFizer script converts every row in a table into a RDF resource. The resource type is based on the class name (e.g. Gene, Organism) and the resource URI is built using the column “id”. The script converts the columns in resource properties and builds a RDF literal typed with the column’s name.
For FlyMine, we have created roughly 365 million triples and imported them into Blazegraph using the REST APIs provided.
We’ve started testing Blazegraph performance using all FlyMine data imported via InterMine-RDFizer and comparing the results with Neo4j.
As usual, we will keep you updated!