GraphConnect – a Neo4j conference


We were in London to attend GraphConnect, the annual conference organised by Neo4j.
It was fantastic to meet so many people around the world enthusiastic about graph databases, and a lot of people that, like us, are prototyping/exploring Neo4j as possible alternative to relational databases.

They have announced the release of Neo4j 3.2 which promises to bring a huge improvement in term of performance; the compiled Cypher runtime has improved the speed by ~300% for a subset of basic queries and the introduction of native label indexes has also improved the write speed.

They have also added the composite indexes (that InterMine uses a lot) and the use of indexes with the OR operator. We highlighted the problem months ago on stackoverflow and we were so surprised to see it fixed. We have to update our “What we didn’t like about Neo4j” list by removing 2 items. We’re really happy about that!

It was a pleasure to attend Jesus Barrasa’s talk on debunking some RDF versus property graph alternative facts. He demoed how a RDF resource does not necessarily have to live in a triple store but can also be stored in Neo4j. Here are part1 and part2 of “Neo4j is your RDF store”, a nice post where he describes his work in more detail.

Another nice tool they have implemented is the ETL tool to import data from a relational database into Neo4j by applying some simple rules.

The solution-based talks demonstrated how Neo4j is being used to solve complex, real world problems ranging from travel recommendation engines to measuring the impact of slot machine locations on casino floors. While the topics were diverse, a common theme across their respective architectures was the use of GraphAware’s plugins, some of which are free. One plugin that looks particularly interesting is the Neo4j2Elastic tool which transparently pushes data from Neo4j to ElasticSearch.

During the conference, we discovered that there is a Neo4j Startup Program that allows to have Neo4j enterprise edition for free. Not sure if we count as a start up though!

Overall, we’re super happy with the improvements Neo4j has made, and super impressed with Neo4j’s growing community. Looking forward to meeting with Neo4j team in London, at their meetup, and sharing our small experience with the community!

GSoCers Assemble! Announcing the InterMine GSOC 2017 students

Google Summer of Code is officially open as of 16:00 UTC today! This year InterMine will have five students coding over the summer, with five projects:


  • InterMineR will be getting better docs and hopefully submitted to R repos. Konstantinos Kyritsis will be working on this with the help of InterMine mentors Julie and Rachel.
  • Our Android App will get a younger sibling in the form of an iOS app, thanks to Nadia Yudina. I’ll be the primary mentor for this project.
  • We’ll finally have a proper registry of all the great InterMines out there, brought to you by Leonardo Kuffo with Daniela mentoring the project.
  • Samyadeep Basu will be looking at an ‘InterMine Similarity project’ – given a Gene (or other entity) from InterMine – are there any other interesting entities related to it in some way? Josh is the lead mentor on this project.
  • Yash Sharma will be working on creating Neo4j-InterMine API endpoints under Sam Hokin‘s mentorship.

We wish we could have accepted more of you. In total we had more than 40 students interested in GSoC 2017 with InterMine, resulting in around 30 finalised applications. Many of the applications were brilliant – far more than we could possibly have accepted. Deciding who to accept was really tough, and even if you didn’t get a place in GSoC with us you’re still entirely welcome to contribute to any of our projects if you had any ideas.

Suggestions for accepted students

Congratulations on being accepted. We’re really glad to have you on board. Please have a quick read through our GSoC guidelines to get started.

During the community bonding period, here are a few ideas for getting involved.

  • Find out more details that might pertain to your project (obviously) – investigate the API or work on bugs
  • Project management – in your project’s GitHub repo create milestones, tickets, project boards as appropriate.
  • Write an intro blog post about yourself & your planned work (to be posted here and/or a personal blog we could link to).
  • Come hang in the chat (below).

Non-GSoC InterMine community: you can play too!

We’ve created a couple of chat rooms at We’ll be encouraging our GSoC students to hang out in the #general channel, and you’re welcome to, as well. The students are from all around the world – come make them feel at home!

Exploring Blazegraph

While we’ve been testing Neo4j with all FlyMine data and with PhytoMine to verify how well it performs and scales with big databases, we started exploring another open source implementation for graph databases: Blazegraph.

Blazegraph overview

Blazegraph is a open source high-performance graph database supporting the RDF data model.

RDF is a model to describe and store data: in this model, you express facts, also known as “statements”, composed by three parts knowns as triples. Each triple is composed of a subject (the resource), the predicate (the property name of the resource) and the object (the property value). For this reasons, Blazegraph is also called a “triples store”.

Subject Predicate Object
http: // :hasSymbol “zen”

Blazegraph supports SPARQL (pronounced “sparkle”), a rich and expressive query language for RDF, which is extremely standardized. Using query operations like union, sort, filter and aggregation, the user can query the data in a very flexible way. With federated queries, the user can aggregate information executing queries distributed over different SPARQL endpoints and consequently discover more data across the web.

Blazegraph provides a SPARQL endpoint where the user can remotely explore, access, and download the data stored using SPARQL language; Blazegraph workbench provides a graphical interface for the REST APIs.

Blazegraph and Neo4j: different graph modelling

In Neo4j, a node in the graph corresponds to an entity in a domain. A node, but also the relationships between the nodes, can contain properties describing the object that it represents.

By contrast, in Blazegraph, the nodes don’t contain properties but primitive data like string, integer, date.

In Neo4j we’ve represented the gene entity and its relation with the organism in this way:



In Blazegraph the same concept will be represented as:


with the following statements:

triplesOnly one statement represents the relation between the gene and the organism (that one containing the predicate hasOrganism), the others describe the properties of the two entities.

The resources represented in RDF are identified by unique HTTP URIs (in the example http: //

Exporting FlyMine data: Intermine-RDFizer

We have exported all FlyMine data using Intermine-RDFizer.

The Intermine-RDFizer can query any InterMine endpoint via InterMine API, download the tables in tsv files and transform them into RDF nquads based on the XML object model file.


The InterMine-RDFizer script converts every row in a table into a RDF resource. The resource type is based on the class name (e.g. Gene, Organism) and the resource URI is built using the column “id”. The script converts the columns in resource properties and builds a RDF literal typed with the column’s name.

blazegrah-triplesFor FlyMine, we have created roughly 365 million triples and imported them into Blazegraph using the REST APIs provided.


We’ve started testing Blazegraph performance using all FlyMine data imported via InterMine-RDFizer and comparing the results with Neo4j.

As usual, we will keep you updated!


Exploring graph databases for biological data models


In order to keep InterMine updated to the latest technologies and integrated with the best solutions offered by the open source community, we always keep an eye on the emerging products and explore new tools/platforms. These days, our attention couldn’t not be caught by NoSQL databases.

What is NoSQL?

As the word says, NoSQL databases, refer, at least originally, to “non SQL” or “non relational” databases where the data are organised into one or more tables, however, most recently, the term NoSQL stands also for “not only SQL” because some tools have started introducing SQL-like query languages.

In NoSQL databases, there are many approaches to managing data using different structures:

key-value databases, the simplest NoSQL databases, where every single item is stored as an attribute name (or “key”), together with its value;

wide-column databases using tables, rows and columns, where the columns name and format can change from row to row within the same table;

document databases pairing each key with a complex data structure known as a document;

graph databases where the data are modeled into graphs, composed by nodes and edges (or “relations”).

As usual, there is no silver bullet and the best approach depends on the specific data model. So if we needed to implement a content management system or blogging platform, we would avoid using key-value databases, which are more suitable to store simple data (e.g. session information) and we’d be more inclined toward document databases.

In our specific case, because we have to handle complex biological data and relations, graph databases seem to be the most suitable candidate, worth considering as a possible alternative to the current relational database.

Experiment: InterMine + Neo4j

There are several open source implementations for graph databases; we have decided to start evaluating Neo4j, the most popular: very well established, good documentation, a big and active community supporting it, simple to use, regular meetups and events organized around the world.

The Neo4j Browser is a great tool to query data (using the simple Cypher language) and visualise them in different formats: graph, table, and text. In particular, the graph view is really neat and intuitive, in just few clicks you have a lot of information: clicking on any node or relationship you see the properties of that element and starting from a node you can expand all the relations associated to it. It is possible rearrange the graph, dragging or deleting nodes from the view, or to customize settings for colours, sizes and title nodes. Amazing!

Any time you run the Cypher queries in the editor at the top, the result is displayed in a new frame below; type another query, get another frame. Love it! And also the “history” command is so useful and persists across browser restarts. A really delightful and intuitive user interface.

But let us explain, in more detail, how the data are organized.

The Neo4j graphs are composed of nodes and relationships: the nodes, in general, represent the entities and they are connected by the relationships. Both of them can contain properties.

For example, the “zen” gene, represented as a row in the “gene” table in the current relational model, will be re-modeled as a node in the new graph model, and it’ll contain properties such as symbol, primaryidentifier, and secondaryidentifier. The same applies to the organism which the gene belongs to, it’s also now a node (in Postgres, organism is a separate table). The relationship PART_OF connects the gene node with its organism. Postgres requires a JOIN to query these two tables.


Relationships can also have properties: the fact that a gene is located in a specific position within the chromosome could be represented by the relationship LOCATED_ON with properties: start, end and strand.


Each node can have a label, so the node containing the gene will have label “Gene” and the node with the organism, the label “Organism”. Nice!

A node can have more than one label; so the node with genes will have labels: BioEntity, SequenceFeature, Gene. No more duplication of the same gene along the tables BioEntity, SequenceFeature, Gene, as we have in the current model, but just one node with several labels. This will save some database space, certainly.

Modelling the data

We have imported a part of FlyMine data into a new Neo4j database, using the Neo4j-shell tool and implementing new Cypher scripts.

Importing FlyMine data has been not only a necessary step before starting benchmarking, but also very useful to recognize the importance of re-thinking our data model.

  • Some associative tables have disappeared, replaced by relationships (e.g. the table genegoannotation has been replaced by the ANNOTATED_WITH relationship between the node Gene and the node GoAnnotation)
  • Some tables have been replaced by multiple relationships (e.g. the table homologue has been substituted by the relations IS_ORTHOLOGOUS, IS_PARALOGOUS, and IS_LEAST_DIVERGED_ORTHOLOGOUS depending on the type) while the table’s columns have become a relationship’s properties (e.g. LOCATED_ON in the picture above)
  • The view overlappingfeaturessequencefeature has been replaced by the OVERLAPS relationship between two genes.


These are just examples and maybe not the best approach to modelling our data, but they have helped us to imagine how our model could be represented in the Neo4j graph world and…we liked it!


Our first impressions of Neo4j have been very positive! We are very excited.

We are currently benchmarking the query execution times against PostgreSQL. We still have a lot of tuning and configuration settings to try out in order to obtain the best from Neo4j, which will be a challenge, but it is certainly worth the effort!

We will keep you updated.