Queryathlon: racing Neo4j against PostgreSQL

eb1911_greek_art_-_foot-race_-_panathenaic_vaseAs discussed in a previous post Exploring graph databases for biological data models, we’ve started evaluating Neo4j as a possible alternative to the current relational database for the InterMine system.

In the post we talked about the features provided by Neo4j we really liked and found to be a really good fit for our project, such as:

  • The Neo4j Browser UI,  which is very neat and clear;
  • The way in which biological data could be represented as a graph structure in an intuitive way that is easy to browse;
  • The fact that a gene node which is a “Gene” is also a “BioEntity” and a “SequenceFeature” (parent classes of “Gene”) — which is supported by the multi-labels feature. In the current InterMine PostgreSQL database, Gene, BioEntity and Sequence feature are three separate tables.

This is all very well, but in the end, we all know that once you start crunching the real data it’s all about performance. So, after several weeks spent exploring Neo4j features, it was time to start benchmarking Neo4j performance against PostgreSQL.

Use cases

We identified the following queries to be part of our benchmark:

  • Simple basic queries: return all genes, return genes given an organism;
  • Typical queries: return genes associated with a specific GO term, return GO terms applied to orthologues of a specific gene;
  • Overlapping queries: return the sequence features overlapping the coordinates of a specific gene.

We imported FlyMine data that is the subset involved in the queries used for benchmarking; we created 3.7 million nodes.

For the overlapping queries, we use a “view”, a sort of temporary table. For this test we only included genes (~ 600,000) and not all sequence features in FlyMine.

We created indexes only on properties relevant to the queries we run for the comparison. Unfortunately we couldn’t create either indexes using functions ( e.g. lower(gene.name) ) or composite indexes as this is not possible using the Cypher query language.

Method

Neo4j provides different tools and languages to retrieve the data stored. We used the Neo4j’s REST API endpoint allowing querying with Cypher, the Neo4j’s query language.

All the queries have been executed 5 times after warming up the Neo4j cache. The values are average values over the 5 executions.

We used some curl options to check how long queries took. The execution time has been calculated as time_starttransfertime_pretransfer.

For PostgreSQL, we’ve used psql and turned on the timing.

In some cases, we have not been able to compare Cypher and SQL queries on a strictly like-for-like basis; for example, in the current system, to retrieve the GO terms applied to orthologue genes, more than one SQL query is executed versus one only Cypher query executed in Neo4j.

In these cases, we wrote Neo4j server REST extensions using Neo4j Java APIs to implement the queries. We compared them with the InterMine web services. We clearly know that it’s not a fair comparison: the Neo4j server extension has been implemented to execute only a specific query where InterMine Web service (WS) is able to run any query, but we wanted to experiment and see how far apart Neo4j and Postgres are in term of performance. For Neo4J, we’d also eventually need to add a Java layer to manage dynamic models and queries. This will necessarily slow down the query execution time.

Scripts and server REST extensions wrote for benchmarking are in github.

Results

All genes

Show all genes.

psql (SQL) Neo4j endpoint (Cypher) Notes
1200 ms 5 ms Return all properties
1400 ms 1400 ms Return all properties order by primary identifier
360 ms 12 ms Return primary identifier and symbol
85 ms 5 ms Return genes count

Genes given an organism

Show all genes given a specific organism: Drosophila melanogaster.

gene-organism
Representative example of the gene query – the real one has thousands of results!
psql (SQL) Neo4j endpoint (Cypher) Notes
80 ms 4 ms Return all properties
110 ms 84 ms Return all properties order by primary identifier
20 ms 10 ms Return primary identifier and symbol

GOterm -> Gene

Show genes annotated with a specified GO term: protein binding, cellular_component and nucleoplasm.

gene-goterm

psql (SQL) Neo4j endpoint (Cypher) InterMine Web services Notes
15 ms 16 ms 37 ms protein binding
28 ms 15 ms 38 ms cellular_component
4.7 ms 6 ms 29 ms nucleoplasm

Gene -> Orthologue + Go term

Show GO terms applied to orthologues of a specific gene.

orthologue-gotermWe can not compare the complete queries exactly, but we can compare a simplified version of this. The table below shows the execution time to retrieve all the orthologues (and the organism which the orthologues belong to) of the gene with symbol “tws” but not the GO terms.

psql (SQL) Neo4j endpoint (Cypher) Notes
2 ms 3 ms No JOIN with organism
3 ms 4 ms JOIN with organism

To obtain the GO terms associated with the orthologues, we’ve run the Cypher query, using the Neo4j endpoint, and the server REST extension, implemented using Neo4j Java APIs and compared with the InterMine WS.

Neo4j endpoint (Cypher) Server extension (Java API) Intermine Web services
11.3 ms 12 ms 35 ms

As we said before, we have to keep in mind that InterMine WS accepts any query and the comparison is not the most appropriate.

Gene -> Overlapping Genes

For a particular gene, search for overlapping genes.

overlapping

 

Created 32405 OVERLAPS relationships (only for Gene) to replace the view in the current database. Using OVERLAPS relations is faster than doing calculations on the the query.

The table below shows the execution time using the constraint lookup=CG11566.

Neo4j endpoint (Cypher) Server extension (Java API) Intermine WS
3.5 ms 3.5 ms 30 ms

Conclusions

Given the way we were able to run the experiments, with the “runners” sometimes having to run different routes or under different conditions, we cannot really draw any definitive conclusion based on hard evidence; having said this, what we have seen is quite encouraging as Neo4j has performed well enough with real InterMine data and typical queries to warrant further and more thorough investigations.

 

Advertisements

2 thoughts on “Queryathlon: racing Neo4j against PostgreSQL”

    1. Hi Samuel,
      thank you for your support and enthusiasm!
      It will be a long journey….so far we have only tested a subset of FlyMine data, the next step will be benchmark the same queries with larger databases.
      Any suggestion/feedback/previous experience with Neo4j is welcome!
      Daniela

      Like

Comments are closed.