As discussed in a previous post Exploring graph databases for biological data models, we’ve started evaluating Neo4j as a possible alternative to the current relational database for the InterMine system.
In the post we talked about the features provided by Neo4j we really liked and found to be a really good fit for our project, such as:
- The Neo4j Browser UI, which is very neat and clear;
- The way in which biological data could be represented as a graph structure in an intuitive way that is easy to browse;
- The fact that a gene node which is a “Gene” is also a “BioEntity” and a “SequenceFeature” (parent classes of “Gene”) — which is supported by the multi-labels feature. In the current InterMine PostgreSQL database, Gene, BioEntity and Sequence feature are three separate tables.
This is all very well, but in the end, we all know that once you start crunching the real data it’s all about performance. So, after several weeks spent exploring Neo4j features, it was time to start benchmarking Neo4j performance against PostgreSQL.
Use cases
We identified the following queries to be part of our benchmark:
- Simple basic queries: return all genes, return genes given an organism;
- Typical queries: return genes associated with a specific GO term, return GO terms applied to orthologues of a specific gene;
- Overlapping queries: return the sequence features overlapping the coordinates of a specific gene.
We imported FlyMine data that is the subset involved in the queries used for benchmarking; we created 3.7 million nodes.
For the overlapping queries, we use a “view”, a sort of temporary table. For this test we only included genes (~ 600,000) and not all sequence features in FlyMine.
We created indexes only on properties relevant to the queries we run for the comparison. Unfortunately we couldn’t create either indexes using functions ( e.g. lower(gene.name) ) or composite indexes as this is not possible using the Cypher query language.
Method
Neo4j provides different tools and languages to retrieve the data stored. We used the Neo4j’s REST API endpoint allowing querying with Cypher, the Neo4j’s query language.
All the queries have been executed 5 times after warming up the Neo4j cache. The values are average values over the 5 executions.
We used some curl options to check how long queries took. The execution time has been calculated as time_starttransfer – time_pretransfer.
For PostgreSQL, we’ve used psql and turned on the timing.
In some cases, we have not been able to compare Cypher and SQL queries on a strictly like-for-like basis; for example, in the current system, to retrieve the GO terms applied to orthologue genes, more than one SQL query is executed versus one only Cypher query executed in Neo4j.
In these cases, we wrote Neo4j server REST extensions using Neo4j Java APIs to implement the queries. We compared them with the InterMine web services. We clearly know that it’s not a fair comparison: the Neo4j server extension has been implemented to execute only a specific query where InterMine Web service (WS) is able to run any query, but we wanted to experiment and see how far apart Neo4j and Postgres are in term of performance. For Neo4J, we’d also eventually need to add a Java layer to manage dynamic models and queries. This will necessarily slow down the query execution time.
Scripts and server REST extensions wrote for benchmarking are in github.
Results
All genes
Show all genes.
psql (SQL) | Neo4j endpoint (Cypher) | Notes |
1200 ms | 5 ms | Return all properties |
1400 ms | 1400 ms | Return all properties order by primary identifier |
360 ms | 12 ms | Return primary identifier and symbol |
85 ms | 5 ms | Return genes count |
Genes given an organism
Show all genes given a specific organism: Drosophila melanogaster.

psql (SQL) | Neo4j endpoint (Cypher) | Notes |
80 ms | 4 ms | Return all properties |
110 ms | 84 ms | Return all properties order by primary identifier |
20 ms | 10 ms | Return primary identifier and symbol |
GOterm -> Gene
Show genes annotated with a specified GO term: protein binding, cellular_component and nucleoplasm.
psql (SQL) | Neo4j endpoint (Cypher) | InterMine Web services | Notes |
15 ms | 16 ms | 37 ms | protein binding |
28 ms | 15 ms | 38 ms | cellular_component |
4.7 ms | 6 ms | 29 ms | nucleoplasm |
Gene -> Orthologue + Go term
Show GO terms applied to orthologues of a specific gene.
We can not compare the complete queries exactly, but we can compare a simplified version of this. The table below shows the execution time to retrieve all the orthologues (and the organism which the orthologues belong to) of the gene with symbol “tws” but not the GO terms.
psql (SQL) | Neo4j endpoint (Cypher) | Notes |
2 ms | 3 ms | No JOIN with organism |
3 ms | 4 ms | JOIN with organism |
To obtain the GO terms associated with the orthologues, we’ve run the Cypher query, using the Neo4j endpoint, and the server REST extension, implemented using Neo4j Java APIs and compared with the InterMine WS.
Neo4j endpoint (Cypher) | Server extension (Java API) | Intermine Web services |
11.3 ms | 12 ms | 35 ms |
As we said before, we have to keep in mind that InterMine WS accepts any query and the comparison is not the most appropriate.
Gene -> Overlapping Genes
For a particular gene, search for overlapping genes.
Created 32405 OVERLAPS relationships (only for Gene) to replace the view in the current database. Using OVERLAPS relations is faster than doing calculations on the the query.
The table below shows the execution time using the constraint lookup=CG11566.
Neo4j endpoint (Cypher) | Server extension (Java API) | Intermine WS |
3.5 ms | 3.5 ms | 30 ms |
Conclusions
Given the way we were able to run the experiments, with the “runners” sometimes having to run different routes or under different conditions, we cannot really draw any definitive conclusion based on hard evidence; having said this, what we have seen is quite encouraging as Neo4j has performed well enough with real InterMine data and typical queries to warrant further and more thorough investigations.
You must be logged in to post a comment.