Queryathlon: racing Neo4j against PostgreSQL

As discussed in a previous post, Exploring graph databases for biological data models, we’ve started evaluating Neo4j as a possible alternative to the current relational database for the InterMine system.

In that post we talked about the Neo4j features we really liked and found to be a good fit for our project, such as:

  • The Neo4j Browser UI, which is very neat and clear;
  • The way in which biological data can be represented as a graph structure that is intuitive and easy to browse;
  • The fact that a gene node which is a “Gene” is also a “BioEntity” and a “SequenceFeature” (parent classes of “Gene”), which is supported by the multi-labels feature. In the current InterMine PostgreSQL database, Gene, BioEntity and SequenceFeature are three separate tables.

This is all very well, but in the end, we all know that once you start crunching the real data it’s all about performance. So, after several weeks spent exploring Neo4j features, it was time to start benchmarking Neo4j performance against PostgreSQL.

Use cases

We identified the following queries to be part of our benchmark:

  • Simple basic queries: return all genes, return genes given an organism;
  • Typical queries: return genes associated with a specific GO term, return GO terms applied to orthologues of a specific gene;
  • Overlapping queries: return the sequence features overlapping the coordinates of a specific gene.

We imported the subset of FlyMine data involved in the benchmark queries, which created 3.7 million nodes.

For the overlapping queries, the current database uses a “view”, a sort of temporary table. For this test we only included genes (~600,000) and not all sequence features in FlyMine.

We created indexes only on properties relevant to the queries we ran for the comparison. Unfortunately we couldn’t create either indexes using functions (e.g. lower(gene.name)) or composite indexes, as neither is possible using the Cypher query language.

Method

Neo4j provides different tools and languages to retrieve the stored data. We used Neo4j’s REST API endpoint, which allows querying with Cypher, Neo4j’s query language.
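
As a rough illustration of what a request to the Cypher endpoint looks like, the sketch below builds the JSON body for Neo4j’s transactional HTTP endpoint (/db/data/transaction/commit in Neo4j 2.x/3.x servers). The Gene and Organism labels, the PART_OF relationship, and the property names are assumptions made for illustration and not necessarily the model we used. The {organism} placeholder is the parameter syntax of those Neo4j versions. The request is only built here, not sent.

```python
import json

# Path of the transactional Cypher endpoint in Neo4j 2.x/3.x servers.
CYPHER_ENDPOINT = "/db/data/transaction/commit"


def build_cypher_request(statement, parameters=None):
    """Return the JSON body for a single-statement Cypher request."""
    return json.dumps(
        {"statements": [{"statement": statement, "parameters": parameters or {}}]}
    )


# Hypothetical query: labels, relationship type and property names are assumed.
GENES_BY_ORGANISM = (
    "MATCH (g:Gene)-[:PART_OF]->(o:Organism {name: {organism}}) "
    "RETURN g.primaryIdentifier, g.symbol"
)

body = build_cypher_request(GENES_BY_ORGANISM, {"organism": "Drosophila melanogaster"})
```

The resulting body would be POSTed to the endpoint with a Content-Type of application/json, for example with curl.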

All the queries were executed 5 times after warming up the Neo4j cache. The values reported are averages over the 5 executions.

We used curl’s timing options to check how long queries took. The execution time was calculated as time_starttransfer - time_pretransfer.
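
The calculation can be sketched as follows, assuming each run’s timings were captured with curl’s -w write-out variables time_pretransfer and time_starttransfer (both reported in seconds). The function name and the sample numbers are invented for illustration; they are not our measurements.

```python
from statistics import mean


def mean_execution_ms(runs):
    """Average (time_starttransfer - time_pretransfer) over runs, in ms.

    `runs` is a list of (time_pretransfer, time_starttransfer) pairs in
    seconds, e.g. from curl -w "%{time_pretransfer} %{time_starttransfer}".
    """
    return mean((start - pre) * 1000.0 for pre, start in runs)


# Made-up timings for five runs after warm-up (not measured values).
sample_runs = [
    (0.001, 0.005),
    (0.001, 0.006),
    (0.001, 0.005),
    (0.001, 0.004),
    (0.001, 0.005),
]
average_ms = mean_execution_ms(sample_runs)
```

Subtracting time_pretransfer removes connection set-up from the figure, so what remains is roughly the time the server took to start answering.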

For PostgreSQL, we used psql with timing turned on.

In some cases, we have not been able to compare Cypher and SQL queries on a strictly like-for-like basis; for example, in the current system, retrieving the GO terms applied to orthologue genes takes more than one SQL query, versus a single Cypher query in Neo4j.

In these cases, we wrote Neo4j server REST extensions using the Neo4j Java APIs to implement the queries, and compared them with the InterMine web services. We know that this is not a fair comparison: the Neo4j server extension has been implemented to execute only one specific query, whereas the InterMine web service (WS) is able to run any query. Still, we wanted to experiment and see how far apart Neo4j and Postgres are in terms of performance. For Neo4j, we’d also eventually need to add a Java layer to manage dynamic models and queries, which will necessarily slow down query execution.

The scripts and server REST extensions written for the benchmarking are on GitHub.

Results

All genes

Show all genes.

psql (SQL) | Neo4j endpoint (Cypher) | Notes
1200 ms | 5 ms | Return all properties
1400 ms | 1400 ms | Return all properties, ordered by primary identifier
360 ms | 12 ms | Return primary identifier and symbol
85 ms | 5 ms | Return genes count

Genes given an organism

Show all genes given a specific organism: Drosophila melanogaster.

Figure: representative example of the gene query (the real one has thousands of results).

psql (SQL) | Neo4j endpoint (Cypher) | Notes
80 ms | 4 ms | Return all properties
110 ms | 84 ms | Return all properties, ordered by primary identifier
20 ms | 10 ms | Return primary identifier and symbol

GOterm -> Gene

Show genes annotated with a specified GO term: protein binding, cellular_component and nucleoplasm.

psql (SQL) | Neo4j endpoint (Cypher) | InterMine Web services | Notes
15 ms | 16 ms | 37 ms | protein binding
28 ms | 15 ms | 38 ms | cellular_component
4.7 ms | 6 ms | 29 ms | nucleoplasm

Gene -> Orthologue + GO term

Show GO terms applied to orthologues of a specific gene.

We cannot compare the complete queries exactly, but we can compare a simplified version. The table below shows the execution time to retrieve all the orthologues of the gene with symbol “tws” (and the organism to which each orthologue belongs), but not the GO terms.

psql (SQL) | Neo4j endpoint (Cypher) | Notes
2 ms | 3 ms | No JOIN with organism
3 ms | 4 ms | JOIN with organism

To obtain the GO terms associated with the orthologues, we ran the Cypher query through the Neo4j endpoint and through the server REST extension (implemented using the Neo4j Java APIs), and compared both with the InterMine WS.

Neo4j endpoint (Cypher) | Server extension (Java API) | InterMine Web services
11.3 ms | 12 ms | 35 ms

As we said before, we have to keep in mind that the InterMine WS accepts any query, so this comparison is not like-for-like.

Gene -> Overlapping Genes

For a particular gene, search for overlapping genes.


We created 32405 OVERLAPS relationships (only for Gene) to replace the view in the current database. Using OVERLAPS relationships is faster than doing the calculation in the query.
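
The idea of precomputing overlaps can be illustrated with a small sketch. It assumes each gene has a chromosome plus inclusive start and end coordinates, and treats two genes as overlapping when they share a chromosome and their ranges intersect. The function name and the toy data are hypothetical, and this is not the script we used to create the relationships.

```python
from collections import defaultdict


def overlapping_pairs(genes):
    """Return sorted (id_a, id_b) pairs of genes whose ranges intersect.

    `genes` is a list of (gene_id, chromosome, start, end) tuples with
    inclusive coordinates. Genes are grouped by chromosome and then
    scanned in start order.
    """
    by_chromosome = defaultdict(list)
    for gene_id, chromosome, start, end in genes:
        by_chromosome[chromosome].append((start, end, gene_id))

    pairs = []
    for features in by_chromosome.values():
        features.sort()
        for i, (start_a, end_a, id_a) in enumerate(features):
            for start_b, end_b, id_b in features[i + 1:]:
                if start_b > end_a:
                    break  # sorted by start: no later gene can overlap id_a
                pairs.append(tuple(sorted((id_a, id_b))))
    return sorted(pairs)


# Toy data, invented for illustration.
toy_genes = [
    ("g1", "2L", 100, 200),
    ("g2", "2L", 150, 300),
    ("g3", "2L", 400, 500),
    ("g4", "3R", 150, 250),
]
pairs = overlapping_pairs(toy_genes)
```

Each resulting pair could then be stored once as an OVERLAPS relationship, so that a query only has to follow relationships instead of comparing coordinates.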

The table below shows the execution time using the constraint lookup=CG11566.

Neo4j endpoint (Cypher) | Server extension (Java API) | InterMine WS
3.5 ms | 3.5 ms | 30 ms

Conclusions

Given the way we were able to run the experiments, with the “runners” sometimes having to run different routes or under different conditions, we cannot draw any definitive conclusion based on hard evidence. Having said this, what we have seen is quite encouraging: Neo4j performed well enough with real InterMine data and typical queries to warrant further and more thorough investigation.

BioJS Workshop Dec 2015

After the excitement around BiVi, I’d be remiss if I didn’t discuss all the work put into both the BioJS presentation at BiVi and the BioJS Workshop held the afternoon after BiVi.

We’re already avid BioJS fans at InterMine, because BioJS provides easy plug-in visualisations (for example, cytoscape). I’d expected that a Venn diagram depicting the BioJS crowd would intersect almost perfectly with the BiVi crowd, so I was surprised to find that they were actually completely separate groups.

The difference was explained to me by Manny Corpas as follows: while BioJS, given the (mostly) browser-based nature of JavaScript, is indeed largely about visualisations, not all of it is dedicated to visual things. BioJS modules can be related to data parsing, for example.

On the other side of things, BiVi is about visualisation, no matter the language. Indeed, quite a few of the demos we saw at BiVi were desktop or server based, and unrelated to JavaScript altogether.

The workshop covered the basics of JavaScript development and showed how to include and interact with BioJS components on a webpage, but the most interesting session for me (as someone who makes a living out of writing JavaScript, among other things) was definitely the one at the end, where we were talked through creating our very own BioJS component.

Dennis Schwartz bravely live-coded a pie chart using d3.js on a projector – not an easy task! We started by setting up the scaffolding of the project using the BioJS Slush generator. This created examples, set up a build process, and ensured the BioJS prerequisites were present, such as the licence and tags (the tags allow the BioJS registry sniffer to pick up BioJS packages from the npm registry). Despite only having an hour or so to get it all done, by the end we had each coded a functioning basic component.

The workshop finished off nicely with group pizza to feed hungry BioJS hackers. Unfortunately I was unable to attend the hackathon the next day, but if its quality was anything like the workshop’s, I’m sure it was a fabulous success.