Exploring Blazegraph

While we’ve been testing Neo4j with all FlyMine data and with PhytoMine to verify how well it performs and scales with big databases, we started exploring another open source implementation for graph databases: Blazegraph.

Blazegraph overview

Blazegraph is an open source, high-performance graph database supporting the RDF data model.

RDF is a model for describing and storing data. In this model you express facts, also known as “statements”, each made up of three parts and therefore known as a triple. Each triple is composed of a subject (the resource), a predicate (the name of a property of the resource) and an object (the property value). For this reason, Blazegraph is also called a “triple store”.

Subject                                        Predicate    Object
http://flymine.intermine.org/flymine/1007664   :hasSymbol   “zen”
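As a rough illustration of the triple model (a sketch in plain Python, not Blazegraph's API), the statement above can be held as a three-part tuple and looked up by pattern. The `Triple` type and the `match` helper are hypothetical names introduced only for this example.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The single statement from the example, stored as one triple.
STORE = [
    Triple("http://flymine.intermine.org/flymine/1007664", ":hasSymbol", "zen"),
]


def match(
    store: Iterable[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield the triples that equal every argument that is not None."""
    for triple in store:
        if subject is not None and triple.subject != subject:
            continue
        if predicate is not None and triple.predicate != predicate:
            continue
        if obj is not None and triple.obj != obj:
            continue
        yield triple


# Which symbols are recorded in the store?
symbols = [t.obj for t in match(STORE, predicate=":hasSymbol")]
```

Leaving an argument as `None` plays the role of a variable in a query pattern: any value is accepted in that position.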

Blazegraph supports SPARQL (pronounced “sparkle”), a rich and expressive query language for RDF that is standardized by the W3C. Using query operations such as union, sort, filter and aggregation, the user can query the data very flexibly. With federated queries, the user can combine the results of queries distributed over different SPARQL endpoints and so discover more data across the web.

Blazegraph provides a SPARQL endpoint through which the user can remotely explore, access and download the stored data using SPARQL; the Blazegraph workbench provides a graphical interface to the REST APIs.
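To make the endpoint idea concrete, here is a minimal sketch of how a SPARQL query could be packaged as an HTTP GET request following the SPARQL Protocol. The endpoint URL (a local Blazegraph with a namespace called `kb`) and the namespace bound to the `:` prefix are assumptions for illustration, not details taken from our setup. The request is only built, not sent.

```python
import urllib.parse
import urllib.request

# Assumption: a local Blazegraph instance exposing a namespace called "kb".
ENDPOINT = "http://localhost:9999/blazegraph/namespace/kb/sparql"

# Assumption: the ":" prefix is bound to a hypothetical vocabulary namespace.
QUERY = """
PREFIX : <http://example.org/vocab/>
SELECT ?gene WHERE { ?gene :hasSymbol "zen" } LIMIT 10
"""


def build_query_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a SPARQL Protocol GET request."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    # Ask for the standard JSON serialisation of SPARQL results.
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


request = build_query_request(ENDPOINT, QUERY)
# Against a live endpoint, urllib.request.urlopen(request) would return the results.
```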

Blazegraph and Neo4j: different graph modelling

In Neo4j, a node in the graph corresponds to an entity in a domain. A node, but also the relationships between the nodes, can contain properties describing the object that it represents.

By contrast, in Blazegraph the nodes don’t contain properties: a property value is itself a node holding primitive data such as a string, an integer or a date.

In Neo4j we’ve represented the gene entity and its relation with the organism in this way:

[Image: node1]

[Image: neo4jrelation]

In Blazegraph the same concept will be represented as:

[Image: blazegraph-post]

with the following statements:

[Image: triples]

Only one statement represents the relation between the gene and the organism (the one containing the predicate hasOrganism); the others describe the properties of the two entities.

The resources represented in RDF are identified by unique HTTP URIs (in the example, http://flymine.intermine.org/flymine/1007664).
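The difference between the two models can be sketched in a few lines of Python: a node that carries its properties is flattened into one triple per property, plus a single triple for the relation. The organism id, the organism name property and the rule for deriving predicate names (`has` plus the capitalised property name) are assumptions made for this example; only `hasSymbol`, `hasOrganism` and the gene URI come from the text.

```python
BASE = "http://flymine.intermine.org/flymine/"

# Property-graph view: each node carries its own properties.
gene = {"id": "1007664", "properties": {"symbol": "zen"}}
# Hypothetical organism id and property, for illustration only.
organism = {"id": "1000001", "properties": {"name": "example organism"}}


def node_to_triples(node):
    """Emit one (subject, predicate, literal) triple per node property."""
    subject = BASE + node["id"]
    return [
        (subject, ":has" + key[0].upper() + key[1:], value)
        for key, value in node["properties"].items()
    ]


def relation_to_triple(source, predicate, target):
    """A relation is a triple whose object is another resource URI."""
    return (BASE + source["id"], predicate, BASE + target["id"])


triples = (
    node_to_triples(gene)
    + node_to_triples(organism)
    + [relation_to_triple(gene, ":hasOrganism", organism)]
)

# Only one statement links the two resources; the rest are property values.
links = [t for t in triples if t[1] == ":hasOrganism"]
```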

Exporting FlyMine data: InterMine-RDFizer

We have exported all FlyMine data using InterMine-RDFizer.

The InterMine-RDFizer can query any InterMine endpoint via the InterMine API, download the tables as TSV files and transform them into RDF N-Quads based on the XML object model file.

[Image: InterMine-RDFizer]

The InterMine-RDFizer script converts every row in a table into an RDF resource. The resource type is based on the class name (e.g. Gene, Organism) and the resource URI is built using the column “id”. The script converts the columns into resource properties and builds an RDF literal typed with the column’s name.
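The row-to-resource mapping can be sketched as follows. This is not the InterMine-RDFizer code, only a simplified illustration of the idea: the vocabulary and graph namespaces are hypothetical, and the column values are written as plain literals under a predicate named after the column, which may differ from how the real script types them.

```python
import csv
import io

BASE = "http://flymine.intermine.org/flymine/"  # as in the example URI
VOCAB = "http://example.org/vocab/"             # hypothetical vocabulary namespace
GRAPH = "http://example.org/graph/flymine"      # hypothetical named graph
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape(value):
    """Escape the characters that N-Quads does not allow raw in a literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_nquads(tsv_text, class_name):
    """Yield one type quad per row and one literal quad per non-empty column."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        subject = f"<{BASE}{row['id']}>"
        yield f"{subject} <{RDF_TYPE}> <{VOCAB}{class_name}> <{GRAPH}> ."
        for column, value in row.items():
            if column == "id" or not value:
                continue
            yield f'{subject} <{VOCAB}{column}> "{escape(value)}" <{GRAPH}> .'


# A one-row table with the columns "id" and "symbol".
SAMPLE = "id\tsymbol\n1007664\tzen\n"
quads = list(rows_to_nquads(SAMPLE, "Gene"))
```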

[Image: blazegraph-triples]

For FlyMine, we created roughly 365 million triples and imported them into Blazegraph using the REST APIs provided.
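An import of this kind could look roughly like the sketch below, which builds an HTTP POST carrying N-Quads in the request body. The endpoint URL, the `text/x-nquads` content type and the sample quad are assumptions to be checked against the Blazegraph REST API documentation; the target namespace would also need to be able to store quads. The request is only constructed here, not sent.

```python
import urllib.request

# Assumption: a local Blazegraph instance exposing a namespace called "kb".
ENDPOINT = "http://localhost:9999/blazegraph/namespace/kb/sparql"

# Hypothetical quad, using made-up vocabulary and graph namespaces.
NQUADS = (
    "<http://flymine.intermine.org/flymine/1007664> "
    '<http://example.org/vocab/symbol> "zen" '
    "<http://example.org/graph/flymine> .\n"
)


def build_insert_request(endpoint, nquads):
    """Build (but do not send) a POST whose body is a chunk of N-Quads."""
    return urllib.request.Request(
        endpoint,
        data=nquads.encode("utf-8"),
        method="POST",
        # Assumed MIME type for N-Quads.
        headers={"Content-Type": "text/x-nquads"},
    )


insert_request = build_insert_request(ENDPOINT, NQUADS)
# Against a live server, urllib.request.urlopen(insert_request) would load the data.
```

With hundreds of millions of triples, the data would in practice be sent in many such chunks rather than in a single request.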

Benchmarking

We’ve started testing Blazegraph performance using all FlyMine data imported via InterMine-RDFizer and comparing the results with Neo4j.

As usual, we will keep you updated!

 

Exploring graph databases for biological data models

[Image: graph]

In order to keep InterMine up to date with the latest technologies and integrated with the best solutions offered by the open source community, we always keep an eye on emerging products and explore new tools and platforms. These days, NoSQL databases were bound to catch our attention.

What is NoSQL?

As the name suggests, NoSQL originally referred to “non SQL” or “non relational” databases, that is, databases that do not organise the data into one or more tables as relational databases do. More recently, the term has also come to mean “not only SQL”, because some tools have started introducing SQL-like query languages.

NoSQL databases take many approaches to managing data, using different structures:

  • key-value databases, the simplest NoSQL databases, where every single item is stored as an attribute name (or “key”) together with its value;
  • wide-column databases, which use tables, rows and columns, but where the column names and formats can change from row to row within the same table;
  • document databases, which pair each key with a complex data structure known as a document;
  • graph databases, where the data are modeled as graphs composed of nodes and edges (or “relations”).
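To compare the four structures side by side, here is the same small piece of gene data laid out in each shape using plain Python dictionaries. This is only a sketch of the data layouts, not of any particular database product; apart from the gene id and the symbol “zen”, all keys and values are made up for illustration.

```python
# Key-value: one opaque value per key.
key_value = {"gene:1007664:symbol": "zen"}

# Wide-column: rows of the same table may carry different columns.
wide_column = {
    "gene": {
        "1007664": {"symbol": "zen"},
        "row-2": {"symbol": "example", "note": "an extra column"},
    }
}

# Document: each key maps to a nested structure.
document = {
    "gene:1007664": {"symbol": "zen", "organism": {"id": "organism-1"}},
}

# Graph: nodes plus edges (relations) between them.
graph = {
    "nodes": {"1007664": {"symbol": "zen"}, "organism-1": {}},
    "edges": [("1007664", "PART_OF", "organism-1")],
}


def neighbours(graph, node_id):
    """Follow the outgoing edges of a node."""
    return [end for start, _, end in graph["edges"] if start == node_id]


related = neighbours(graph, "1007664")
```

In the graph layout the relation is a first-class item that can be followed directly, which is what makes this shape attractive for heavily connected data.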

As usual, there is no silver bullet, and the best approach depends on the specific data model. If we needed to implement a content management system or blogging platform, we would avoid key-value databases, which are better suited to storing simple data (e.g. session information), and we’d be more inclined toward document databases.

In our specific case, because we have to handle complex biological data and relations, graph databases seem to be the most suitable candidate, worth considering as a possible alternative to the current relational database.

Experiment: InterMine + Neo4j

There are several open source implementations of graph databases; we decided to start by evaluating Neo4j, the most popular. It is very well established and simple to use, with good documentation, a big and active community supporting it, and regular meetups and events organized around the world.

The Neo4j Browser is a great tool to query data (using the simple Cypher language) and visualise them in different formats: graph, table, and text. In particular, the graph view is really neat and intuitive. In just a few clicks you have a lot of information: clicking on any node or relationship shows the properties of that element, and starting from a node you can expand all the relations associated with it. It is possible to rearrange the graph by dragging or deleting nodes from the view, or to customize the colours, sizes and titles of the nodes. Amazing!

Any time you run a Cypher query in the editor at the top, the result is displayed in a new frame below; type another query, get another frame. Love it! The “history” command is also very useful, and the history persists across browser restarts. A really delightful and intuitive user interface.

But let us explain, in more detail, how the data are organized.

The Neo4j graphs are composed of nodes and relationships: the nodes, in general, represent the entities and they are connected by the relationships. Both of them can contain properties.

For example, the “zen” gene, represented as a row in the “gene” table in the current relational model, will be re-modeled as a node in the new graph model, and it’ll contain properties such as symbol, primaryidentifier, and secondaryidentifier. The same applies to the organism the gene belongs to: it is now also a node (in Postgres, organism is a separate table). The relationship PART_OF connects the gene node with its organism, whereas Postgres requires a JOIN to query these two tables.

[Image: node1]

Relationships can also have properties: the fact that a gene is located in a specific position within the chromosome could be represented by the relationship LOCATED_ON with properties: start, end and strand.

[Image: node2.png]

Each node can have a label, so the node containing the gene will have label “Gene” and the node with the organism, the label “Organism”. Nice!

A node can have more than one label, so the node for a gene will have the labels BioEntity, SequenceFeature and Gene. No more duplication of the same gene across the tables BioEntity, SequenceFeature and Gene, as we have in the current model, but just one node with several labels. This will save some database space, certainly.
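The pieces described above (nodes, labels, relationships and properties on both) can be sketched with two small Python classes. This is only a model of the idea, not how Neo4j stores data; the `Chromosome` label and the coordinate values are placeholders invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    type: str
    end: Node
    properties: dict = field(default_factory=dict)


# One node carries all three labels instead of one row in each of three tables.
gene = Node({"BioEntity", "SequenceFeature", "Gene"}, {"symbol": "zen"})
organism = Node({"Organism"})
chromosome = Node({"Chromosome"})  # hypothetical label

part_of = Relationship(gene, "PART_OF", organism)
# Placeholder coordinates, not real genome positions.
located_on = Relationship(
    gene, "LOCATED_ON", chromosome, {"start": 1, "end": 100, "strand": 1}
)


def nodes_with_label(nodes, label):
    """Select nodes by label, much as a label filter does in a graph query."""
    return [node for node in nodes if label in node.labels]


features = nodes_with_label([gene, organism, chromosome], "SequenceFeature")
```

In Cypher, the gene-to-organism link would be written as a pattern along the lines of `(g:Gene)-[:PART_OF]->(o:Organism)`.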

Modelling the data

We have imported part of the FlyMine data into a new Neo4j database, using the Neo4j-shell tool and writing new Cypher scripts.

Importing FlyMine data was not only a necessary step before benchmarking, but also very useful in showing us the importance of re-thinking our data model.

  • Some associative tables have disappeared, replaced by relationships (e.g. the table genegoannotation has been replaced by the ANNOTATED_WITH relationship between the node Gene and the node GoAnnotation)
  • Some tables have been replaced by multiple relationships (e.g. the table homologue has been substituted by the relations IS_ORTHOLOGOUS, IS_PARALOGOUS, and IS_LEAST_DIVERGED_ORTHOLOGOUS depending on the type) while the table’s columns have become a relationship’s properties (e.g. LOCATED_ON in the picture above)
  • The view overlappingfeaturessequencefeature has been replaced by the OVERLAPS relationship between two genes.
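The table-to-relationship changes listed above can be sketched as a single mapping function: two columns identify the start and end nodes, and whatever columns remain become relationship properties. The column names, the ids and the values of the homologue “type” column are assumptions made for this illustration and may not match the real schema.

```python
# Assumed values of the homologue table's "type" column.
HOMOLOGUE_TYPES = {
    "orthologue": "IS_ORTHOLOGOUS",
    "paralogue": "IS_PARALOGOUS",
    "least diverged orthologue": "IS_LEAST_DIVERGED_ORTHOLOGOUS",
}


def row_to_relationship(row, rel_type, start_col, end_col, skip=()):
    """Map one table row to (start_id, TYPE, end_id, properties)."""
    used = {start_col, end_col, *skip}
    properties = {k: v for k, v in row.items() if k not in used}
    return (row[start_col], rel_type, row[end_col], properties)


# An associative table row becomes a bare relationship.
annotation = row_to_relationship(
    {"gene": 1, "goannotation": 2}, "ANNOTATED_WITH", "gene", "goannotation"
)

# The "type" column selects the relationship type instead of being stored.
homologue_row = {"gene": 1, "homologue": 3, "type": "orthologue"}
homology = row_to_relationship(
    homologue_row,
    HOMOLOGUE_TYPES[homologue_row["type"]],
    "gene",
    "homologue",
    skip=("type",),
)

# Remaining columns become relationship properties.
location = row_to_relationship(
    {"feature": 1, "locatedon": 4, "start": 10, "end": 20, "strand": 1},
    "LOCATED_ON",
    "feature",
    "locatedon",
)
```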

Summary

These are just examples and maybe not the best approach to modelling our data, but they have helped us to imagine how our model could be represented in the Neo4j graph world and…we liked it!

[Image: graph2]

Our first impressions of Neo4j have been very positive! We are very excited.

We are currently benchmarking the query execution times against PostgreSQL. We still have a lot of tuning and configuration settings to try out in order to obtain the best from Neo4j, which will be a challenge, but it is certainly worth the effort!

We will keep you updated.