Exploring Blazegraph

While we’ve been testing Neo4j with all FlyMine data and with PhytoMine to verify how well it performs and scales with big databases, we started exploring another open source implementation for graph databases: Blazegraph.

Blazegraph overview

Blazegraph is a open source high-performance graph database supporting the RDF data model.

RDF is a model to describe and store data: in this model, you express facts, also known as “statements”, composed by three parts knowns as triples. Each triple is composed of a subject (the resource), the predicate (the property name of the resource) and the object (the property value). For this reasons, Blazegraph is also called a “triples store”.

Subject Predicate Object
http: //flymine.intermine.org/flymine/1007664 :hasSymbol “zen”

Blazegraph supports SPARQL (pronounced “sparkle”), a rich and expressive query language for RDF, which is extremely standardized. Using query operations like union, sort, filter and aggregation, the user can query the data in a very flexible way. With federated queries, the user can aggregate information executing queries distributed over different SPARQL endpoints and consequently discover more data across the web.

Blazegraph provides a SPARQL endpoint where the user can remotely explore, access, and download the data stored using SPARQL language; Blazegraph workbench provides a graphical interface for the REST APIs.

Blazegraph and Neo4j: different graph modelling

In Neo4j, a node in the graph corresponds to an entity in a domain. A node, but also the relationships between the nodes, can contain properties describing the object that it represents.

By contrast, in Blazegraph, the nodes don’t contain properties but primitive data like string, integer, date.

In Neo4j we’ve represented the gene entity and its relation with the organism in this way:

node1

neo4jrelation

In Blazegraph the same concept will be represented as:

blazegraph-post

with the following statements:

triplesOnly one statement represents the relation between the gene and the organism (that one containing the predicate hasOrganism), the others describe the properties of the two entities.

The resources represented in RDF are identified by unique HTTP URIs (in the example http: //flymine.intermine.org/flymine/1007664).

Exporting FlyMine data: Intermine-RDFizer

We have exported all FlyMine data using Intermine-RDFizer.

The Intermine-RDFizer can query any InterMine endpoint via InterMine API, download the tables in tsv files and transform them into RDF nquads based on the XML object model file.

Intermine-RDFizer

The InterMine-RDFizer script converts every row in a table into a RDF resource. The resource type is based on the class name (e.g. Gene, Organism) and the resource URI is built using the column “id”. The script converts the columns in resource properties and builds a RDF literal typed with the column’s name.

blazegrah-triplesFor FlyMine, we have created roughly 365 million triples and imported them into Blazegraph using the REST APIs provided.

Benchmarking

We’ve started testing Blazegraph performance using all FlyMine data imported via InterMine-RDFizer and comparing the results with Neo4j.

As usual, we will keep you updated!

 

InterMine in Orlando: TAGC16

As many of you may know, TAGC (The Allied Genetics Conference) is happening in July,  in Orlando, Florida, USA, covering multiple model organisms.
We’ll be there, with a booth at stand 403: (Link to interactive floorplan)

Blank Flowchart - GSA (1).png

Come say hi at the booth and maybe grab some swag, like these fabulous stickers.

We’re planning to arrive and set things up late-ish Wednesday the 13th. The booth will always be staffed, but you’ll be able to catch us at the following additional sessions:

Thursday 14th July: 

  • 1:30 – 2:30 PM: Cypress Ballroom – Poster Session

Friday 15th July:

  • 2:50 – 3:30 PM: Cypress Ballroom – Poster Session

Saturday 17th July:

 

 

HumanMine 3.0

HumanMine has been updated to the latest version of NCBI Entrez Gene. All other data sets have also been updated to the newest versions and we have fixed a few bugs. See the data sources page for a full list of data and their versions. All data can be accessed through our comprehensive library of template searches or by building your own queries using the query builder.

 

New Data Source: ClinVar

We added a new data source linking genes with their alleles and associated diseases. Here’s an example query:

http://www.humanmine.org/humanmine/template.do?name=Gene_Alleles_Disease

Human Data Sources Switch

We switched Entrez and Ensembl gene identifiers around. Please see our blog for details. If you have questions or problems, please contact us.

Complex Interaction Viewer

We’ve added a nice viewer for complexes. Source: http://interactionviewer.org

71e3498e-18f5-11e6-8422-5e4486ab4b67

 

We have docs and videos, and for a full list of data sources available in HumanMine see the data sources list.

However, please do not hesitate to contact us should you require any further assistance. For all types of help and feedback email support@humanmine.org

FlyMine 43.0

FlyMine has been updated to the latest version of FlyBase. All other data sets have also been updated to the newest versions and we have fixed a few bugs. See the data sources page for a full list of data and their versions. All data can be accessed through our comprehensive library of template searches or by building your own queries using the query builder.

If you have any questions, please see our docs and videos. Please do not hesitate to contact us should you require any further assistance. For all types of help and feedback email support@flymine.org

Queryathlon: racing Neo4j against PostgreSQL

eb1911_greek_art_-_foot-race_-_panathenaic_vaseAs discussed in a previous post Exploring graph databases for biological data models, we’ve started evaluating Neo4j as a possible alternative to the current relational database for the InterMine system.

In the post we talked about the features provided by Neo4j we really liked and found to be a really good fit for our project, such as:

  • The Neo4j Browser UI,  which is very neat and clear;
  • The way in which biological data could be represented as a graph structure in an intuitive way that is easy to browse;
  • The fact that a gene node which is a “Gene” is also a “BioEntity” and a “SequenceFeature” (parent classes of “Gene”) — which is supported by the multi-labels feature. In the current InterMine PostgreSQL database, Gene, BioEntity and Sequence feature are three separate tables.

This is all very well, but in the end, we all know that once you start crunching the real data it’s all about performance. So, after several weeks spent exploring Neo4j features, it was time to start benchmarking Neo4j performance against PostgreSQL.

Use cases

We identified the following queries to be part of our benchmark:

  • Simple basic queries: return all genes, return genes given an organism;
  • Typical queries: return genes associated with a specific GO term, return GO terms applied to orthologues of a specific gene;
  • Overlapping queries: return the sequence features overlapping the coordinates of a specific gene.

We imported FlyMine data that is the subset involved in the queries used for benchmarking; we created 3.7 million nodes.

For the overlapping queries, we use a “view”, a sort of temporary table. For this test we only included genes (~ 600,000) and not all sequence features in FlyMine.

We created indexes only on properties relevant to the queries we run for the comparison. Unfortunately we couldn’t create either indexes using functions ( e.g. lower(gene.name) ) or composite indexes as this is not possible using the Cypher query language.

Method

Neo4j provides different tools and languages to retrieve the data stored. We used the Neo4j’s REST API endpoint allowing querying with Cypher, the Neo4j’s query language.

All the queries have been executed 5 times after warming up the Neo4j cache. The values are average values over the 5 executions.

We used some curl options to check how long queries took. The execution time has been calculated as time_starttransfertime_pretransfer.

For PostgreSQL, we’ve used psql and turned on the timing.

In some cases, we have not been able to compare Cypher and SQL queries on a strictly like-for-like basis; for example, in the current system, to retrieve the GO terms applied to orthologue genes, more than one SQL query is executed versus one only Cypher query executed in Neo4j.

In these cases, we wrote Neo4j server REST extensions using Neo4j Java APIs to implement the queries. We compared them with the InterMine web services. We clearly know that it’s not a fair comparison: the Neo4j server extension has been implemented to execute only a specific query where InterMine Web service (WS) is able to run any query, but we wanted to experiment and see how far apart Neo4j and Postgres are in term of performance. For Neo4J, we’d also eventually need to add a Java layer to manage dynamic models and queries. This will necessarily slow down the query execution time.

Scripts and server REST extensions wrote for benchmarking are in github.

Results

All genes

Show all genes.

psql (SQL) Neo4j endpoint (Cypher) Notes
1200 ms 5 ms Return all properties
1400 ms 1400 ms Return all properties order by primary identifier
360 ms 12 ms Return primary identifier and symbol
85 ms 5 ms Return genes count

Genes given an organism

Show all genes given a specific organism: Drosophila melanogaster.

gene-organism
Representative example of the gene query – the real one has thousands of results!
psql (SQL) Neo4j endpoint (Cypher) Notes
80 ms 4 ms Return all properties
110 ms 84 ms Return all properties order by primary identifier
20 ms 10 ms Return primary identifier and symbol

GOterm -> Gene

Show genes annotated with a specified GO term: protein binding, cellular_component and nucleoplasm.

gene-goterm

psql (SQL) Neo4j endpoint (Cypher) InterMine Web services Notes
15 ms 16 ms 37 ms protein binding
28 ms 15 ms 38 ms cellular_component
4.7 ms 6 ms 29 ms nucleoplasm

Gene -> Orthologue + Go term

Show GO terms applied to orthologues of a specific gene.

orthologue-gotermWe can not compare the complete queries exactly, but we can compare a simplified version of this. The table below shows the execution time to retrieve all the orthologues (and the organism which the orthologues belong to) of the gene with symbol “tws” but not the GO terms.

psql (SQL) Neo4j endpoint (Cypher) Notes
2 ms 3 ms No JOIN with organism
3 ms 4 ms JOIN with organism

To obtain the GO terms associated with the orthologues, we’ve run the Cypher query, using the Neo4j endpoint, and the server REST extension, implemented using Neo4j Java APIs and compared with the InterMine WS.

Neo4j endpoint (Cypher) Server extension (Java API) Intermine Web services
11.3 ms 12 ms 35 ms

As we said before, we have to keep in mind that InterMine WS accepts any query and the comparison is not the most appropriate.

Gene -> Overlapping Genes

For a particular gene, search for overlapping genes.

overlapping

 

Created 32405 OVERLAPS relationships (only for Gene) to replace the view in the current database. Using OVERLAPS relations is faster than doing calculations on the the query.

The table below shows the execution time using the constraint lookup=CG11566.

Neo4j endpoint (Cypher) Server extension (Java API) Intermine WS
3.5 ms 3.5 ms 30 ms

Conclusions

Given the way we were able to run the experiments, with the “runners” sometimes having to run different routes or under different conditions, we cannot really draw any definitive conclusion based on hard evidence; having said this, what we have seen is quite encouraging as Neo4j has performed well enough with real InterMine data and typical queries to warrant further and more thorough investigations.

 

Exploring graph databases for biological data models

graph

In order to keep InterMine updated to the latest technologies and integrated with the best solutions offered by the open source community, we always keep an eye on the emerging products and explore new tools/platforms. These days, our attention couldn’t not be caught by NoSQL databases.

What is NoSQL?

As the word says, NoSQL databases, refer, at least originally, to “non SQL” or “non relational” databases where the data are organised into one or more tables, however, most recently, the term NoSQL stands also for “not only SQL” because some tools have started introducing SQL-like query languages.

In NoSQL databases, there are many approaches to managing data using different structures:

key-value databases, the simplest NoSQL databases, where every single item is stored as an attribute name (or “key”), together with its value;

wide-column databases using tables, rows and columns, where the columns name and format can change from row to row within the same table;

document databases pairing each key with a complex data structure known as a document;

graph databases where the data are modeled into graphs, composed by nodes and edges (or “relations”).

As usual, there is no silver bullet and the best approach depends on the specific data model. So if we needed to implement a content management system or blogging platform, we would avoid using key-value databases, which are more suitable to store simple data (e.g. session information) and we’d be more inclined toward document databases.

In our specific case, because we have to handle complex biological data and relations, graph databases seem to be the most suitable candidate, worth considering as a possible alternative to the current relational database.

Experiment: InterMine + Neo4j

There are several open source implementations for graph databases; we have decided to start evaluating Neo4j, the most popular: very well established, good documentation, a big and active community supporting it, simple to use, regular meetups and events organized around the world.

The Neo4j Browser is a great tool to query data (using the simple Cypher language) and visualise them in different formats: graph, table, and text. In particular, the graph view is really neat and intuitive, in just few clicks you have a lot of information: clicking on any node or relationship you see the properties of that element and starting from a node you can expand all the relations associated to it. It is possible rearrange the graph, dragging or deleting nodes from the view, or to customize settings for colours, sizes and title nodes. Amazing!

Any time you run the Cypher queries in the editor at the top, the result is displayed in a new frame below; type another query, get another frame. Love it! And also the “history” command is so useful and persists across browser restarts. A really delightful and intuitive user interface.

But let us explain, in more detail, how the data are organized.

The Neo4j graphs are composed of nodes and relationships: the nodes, in general, represent the entities and they are connected by the relationships. Both of them can contain properties.

For example, the “zen” gene, represented as a row in the “gene” table in the current relational model, will be re-modeled as a node in the new graph model, and it’ll contain properties such as symbol, primaryidentifier, and secondaryidentifier. The same applies to the organism which the gene belongs to, it’s also now a node (in Postgres, organism is a separate table). The relationship PART_OF connects the gene node with its organism. Postgres requires a JOIN to query these two tables.

node1

Relationships can also have properties: the fact that a gene is located in a specific position within the chromosome could be represented by the relationship LOCATED_ON with properties: start, end and strand.

node2.png

Each node can have a label, so the node containing the gene will have label “Gene” and the node with the organism, the label “Organism”. Nice!

A node can have more than one label; so the node with genes will have labels: BioEntity, SequenceFeature, Gene. No more duplication of the same gene along the tables BioEntity, SequenceFeature, Gene, as we have in the current model, but just one node with several labels. This will save some database space, certainly.

Modelling the data

We have imported a part of FlyMine data into a new Neo4j database, using the Neo4j-shell tool and implementing new Cypher scripts.

Importing FlyMine data has been not only a necessary step before starting benchmarking, but also very useful to recognize the importance of re-thinking our data model.

  • Some associative tables have disappeared, replaced by relationships (e.g. the table genegoannotation has been replaced by the ANNOTATED_WITH relationship between the node Gene and the node GoAnnotation)
  • Some tables have been replaced by multiple relationships (e.g. the table homologue has been substituted by the relations IS_ORTHOLOGOUS, IS_PARALOGOUS, and IS_LEAST_DIVERGED_ORTHOLOGOUS depending on the type) while the table’s columns have become a relationship’s properties (e.g. LOCATED_ON in the picture above)
  • The view overlappingfeaturessequencefeature has been replaced by the OVERLAPS relationship between two genes.

Summary

These are just examples and maybe not the best approach to modelling our data, but they have helped us to imagine how our model could be represented in the Neo4j graph world and…we liked it!

graph2

Our first impressions of Neo4j have been very positive! We are very excited.

We are currently benchmarking the query execution times against PostgreSQL. We still have a lot of tuning and configuration settings to try out in order to obtain the best from Neo4j, which will be a challenge, but it is certainly worth the effort!

We will keep you updated.

Planned outage, May 9th 2016

May 9th 2016, the chillers that help to cool our server room will be undergoing planned maintenance.

 

The following services will be down all day:

  • InterMine email, including personal email addresses and the mailing list (incidentally, we’ve announced this on other channels, but here’s a reminder: the address has changed to dev@lists.intermine.org).
  • BlueGenes

The following services may be down, particularly towards the end of the day:

We’ll try to leave the mines up as long as possible, with our sysadmins monitoring the temperature of the machine they’re on. If it gets too hot, they’ll get switched off too. Webservices and the webapp will be affected.

Things we hope won’t go down:

  • The CDN

As we mentioned earlier, we’re working on moving our CDN to be hosted on Amazon. We’ll keep this post updated, but at the moment we’re planning to migrate it before the downtime, so with any luck this’ll be up and happy. (Update, 28 April 2016: We’ve successfully set up a fallback CDN on Amazon! By default, CDN requests will still go to the CDN hosted in our local machine room, but if it’s ever down, Amazon will act as a failover server. Yay!!)

Keep an eye on our twitter feed for live updates as they happen.