GraphConnect – a Neo4j conference

neo4jconference

We were in London to attend GraphConnect, the annual conference organised by Neo4j.
It was fantastic to meet so many people around the world enthusiastic about graph databases, and a lot of people that, like us, are prototyping/exploring Neo4j as possible alternative to relational databases.

They have announced the release of Neo4j 3.2 which promises to bring a huge improvement in term of performance; the compiled Cypher runtime has improved the speed by ~300% for a subset of basic queries and the introduction of native label indexes has also improved the write speed.

They have also added the composite indexes (that InterMine uses a lot) and the use of indexes with the OR operator. We highlighted the problem months ago on stackoverflow and we were so surprised to see it fixed. We have to update our “What we didn’t like about Neo4j” list by removing 2 items. We’re really happy about that!

It was a pleasure to attend Jesus Barrasa’s talk on debunking some RDF versus property graph alternative facts. He demoed how a RDF resource does not necessarily have to live in a triple store but can also be stored in Neo4j. Here are part1 and part2 of “Neo4j is your RDF store”, a nice post where he describes his work in more detail.

Another nice tool they have implemented is the ETL tool to import data from a relational database into Neo4j by applying some simple rules.

The solution-based talks demonstrated how Neo4j is being used to solve complex, real world problems ranging from travel recommendation engines to measuring the impact of slot machine locations on casino floors. While the topics were diverse, a common theme across their respective architectures was the use of GraphAware’s plugins, some of which are free. One plugin that looks particularly interesting is the Neo4j2Elastic tool which transparently pushes data from Neo4j to ElasticSearch.

During the conference, we discovered that there is a Neo4j Startup Program that allows to have Neo4j enterprise edition for free. Not sure if we count as a start up though!

Overall, we’re super happy with the improvements Neo4j has made, and super impressed with Neo4j’s growing community. Looking forward to meeting with Neo4j team in London, at their meetup, and sharing our small experience with the community!

GSoCers Assemble! Announcing the InterMine GSOC 2017 students

Google Summer of Code is officially open as of 16:00 UTC today! This year InterMine will have five students coding over the summer, with five projects:

gsoc-icon-192

  • InterMineR will be getting better docs and hopefully submitted to R repos. Konstantinos Kyritsis will be working on this with the help of InterMine mentors Julie and Rachel.
  • Our Android App will get a younger sibling in the form of an iOS app, thanks to Nadia Yudina. I’ll be the primary mentor for this project.
  • We’ll finally have a proper registry of all the great InterMines out there, brought to you by Leonardo Kuffo with Daniela mentoring the project.
  • Samyadeep Basu will be looking at an ‘InterMine Similarity project’ – given a Gene (or other entity) from InterMine – are there any other interesting entities related to it in some way? Josh is the lead mentor on this project.
  • Yash Sharma will be working on creating Neo4j-InterMine API endpoints under Sam Hokin‘s mentorship.

We wish we could have accepted more of you. In total we had more than 40 students interested in GSoC 2017 with InterMine, resulting in around 30 finalised applications. Many of the applications were brilliant – far more than we could possibly have accepted. Deciding who to accept was really tough, and even if you didn’t get a place in GSoC with us you’re still entirely welcome to contribute to any of our projects if you had any ideas.

Suggestions for accepted students

Congratulations on being accepted. We’re really glad to have you on board. Please have a quick read through our GSoC guidelines to get started.

During the community bonding period, here are a few ideas for getting involved.

  • Find out more details that might pertain to your project (obviously) – investigate the API or work on bugs
  • Project management – in your project’s GitHub repo create milestones, tickets, project boards as appropriate.
  • Write an intro blog post about yourself & your planned work (to be posted here and/or a personal blog we could link to).
  • Come hang in the chat (below).

Non-GSoC InterMine community: you can play too!

We’ve created a couple of chat rooms at chat.intermine.org. We’ll be encouraging our GSoC students to hang out in the #general channel, and you’re welcome to, as well. The students are from all around the world – come make them feel at home!

California Dreaming: InterMine Dev Conf 2017 Report – Day 1

2017’s developer conference has been and gone; time to pay my dues in a blog post or two.

Day 0: Welcome dinner, 29 March 2017

The Cambridge InterMine arrived at Walnut Creek without a hitch, and after a jetlagged attempt at a night’s sleep we sat down to a mega-grant-writing session in the hotel lobby, fuelled by several pots of coffee and plates of nachos.

By 7PM, people had begun to gather in the lobby to head to the inaugural conference dinner at the delicious Walnut Creek Yacht Club. We had to change the venue quite late on in the game, meaning we decided to wander down the street to collect some of the InterMiners who had ended up at the original venue (sorry!!). By the end of the meal, most of the UK contingent was dead on their feet – 10pm California time worked out to be 6am according to our body clocks, so when Joe offered to give several of us a lift back to the hotel, it was impossible to decline.

20170329_221945

Day 1: Workshop Intro

The day started with intros from our PI, Gos, and our host, David Goodstein. 

Josh and I followed up by introducing BlueGenes, the UI we’ve been working on to replace InterMine’s older JSP-based UI. You can view Josh’s slide deck , try out a live demoor browse / check out the source on GitHub.

Next came one of my favourite parts: short talks from InterMiners.

Short community talks

Doppelgangers – Joel Richardson, MGI

Joel gave a great presentation about Doppelgangers in InterMine – that is, occasionally, depending on your data sets and config, you can end up with duplicate or strange / incomplete InterMine objects in your mine. He follows up with explanations of the root causes and mitigation methods – a great resource for any InterMiner who is working in data source integration! 

Genetic data in Mines – Sam Hokin, NCGR/LegFed

Next up was Sam’s talk about his various beany mines, including CowpeaMine, which has only genetics data, rather than the more typical InterMine genomic data. He’s also implemented several custom data visualisations on gene report pages – check out the slides or mines for more details.

JBrowse and Inter-mine communication – Vivek Krishnakumar, JCVI

Vivek focused on some great cross-InterMine collaborations (slides here), including the technical challenges integrating JBrowse into InterMine, as well as a method to link to other InterMines using synteny rather than InterMine’s typical homology approach.

InterMine at JGI – Joe Carlson, Phytozome, JGI

Joe has the privilege to run the biggest InterMine, covering (currently) 72 data sets on 69 organisms. Compared to most InterMines, this is massive! Unsurprisingly, this scale comes with a few hitches many of the other mines don’t encounter. Joe’s slides give a great overview of the problems you might encounter in a large-scale InterMine and their solutions.

Afternoon sessions

FAIR and the semantic web – Daniela & Justin

After a yummy lunch at a nearby cafe, Justin introduced the concept of FAIR, and discussed InterMine’s plans for a FAIRer future (slides). Discussion topics included:

  • How to make stable URIs (InterMine object IDs are transient and will change between builds)
  • Enhanced embedded metadata in webpages and query results (data provenance, licencing)
  • Better Findablility (the F in FAIR) by registering InterMine resources with external registries
  • RDF generation / SPARQL querying

This was followed up by Daniela’s introduction to RDF and SPARQL, which provided a great basic intro to the two concepts in an easily-understood manner. I really loved these slides, and I reckon they’d be a good introduction for anyone interested in learning more about what RDF and SPARQL are, whether or not you’re interested in InterMine .

Extending the InterMine Core Data Model – Sergio

Sergio ran the final session, “Extending the InterMine Core Data Model“. Shared models allow for easier cross-InterMine queries, as demoed in the GO tool prototype:

This discussion raised several interesting talking points:

  • Should model extensions be created via community RFC?
  • If so, who is involved? Developers, community members, curators, other?
  • Homologue or homolog? Who knew a simple “ue” could cause incompatibility problems? Most InterMine use the “ue” variation, with the exception of PhytoMine. An answer to this problem was presented in the “friendly mine” section of Vivek’s talk earlier in the day.

Another great output was Siddartha Basu’s gist on setting up InterMine – outlining some pain points and noting the good bits.

Most of us met up for dinner afterwards at Kevin’s Noodle House – highly recommended for meat eaters, less so for veggies.

The FAIR journey

backpack

It’s time to celebrate. After some hectic weeks preparing and organizing the InterMine conference we can now take a deep breath and and get ready for our new BBSRC funded project to make data more FAIR.

It will be a short rest, just the time to check that we have everything we need to start this long journey but also to savour the excitement before the departure towards FAIRness, a destination that will enhance InterMine compliance with FAIR principles; InterMine has been at the forefront of getting research data into the hands of scientists for over 10 years and we’re excited to support the formalisation of these principles.

As team, we recognized with no doubts, the need to implement the FAIR data principles, making biological data stored in InterMine instances more Findable, Accessible, Interoperable, and Reusable by both humans and machines,  as well as the huge impact that this achievement might have on the quality of biological data served by InterMine. Implementing some mechanisms that make data stored in InterMine FAIR, we provide a unique opportunity to make ALL data collection, served by nearly 30 public biological InterMine instances worldwide, FAIR.

This is a great chance and we didn’t want to miss it!

Here some important milestones we want to achieve along the journey:

  • Generate globally unique and stable URIs to identify InterMine data objects and register them in community bioinformatics repositories (for instance bio.tools and Identifiers.org) in order to provide more findable and accessible data.
  • Apply suitable ontologies to the core InterMine data model to make the semantic of InterMine data explicit and facilitate data exchange and interoperability
  • Embed metadata in InterMine web pages and add extra metadata to InterMine’s existing structured query results to make data more findable
  • Provide a RDF representation of data stored, lists and query results, and the bulk download of all InterMine in RDF form, in order to allow the users to import InterMine resources into their local triplestore
  • Provide an infrastructure for a SPARQL endpoint where the user can perform federated queries over multiple data sets
  • Improve accessibility of data licenses for integrated sources via web interface and REST web-service.

It will be an exciting challenge.  Follow us on this blog.

A flurry of deadlines: Grants, GSoC, workshops, and more…

We blogged in February commenting that we had a lot of events over the March / April period. Here’s a re-cap:

  • Attending conferences: Amongst the team we attended Bioschemas, the Elixir all-hands, and the Cambridge Scientific Computation Day.
  • InterMine training: We delivered a training workshop about using InterMine at the EBI, part of their Introduction to Omics data integration week-long course.
    • This went well despite a server-room meltdown which conveniently timed itself for the morning of the same day (the training session was in the afternoon, so we thankfully had time to get the servers back up!).
    • In contrast to previous years, every single hand went up when we asked if the participants wrote code as part of their job. Next time, we will try to allow for a longer session on using InterMine web services, rather than the 15 minute slot we allocated this time!
  • Developer Workshop and Hackathon: 5 days in sunny California, spending time with InterMiners from around the world. Longer blog posts to follow, but in the meantime you can browse the agenda for links to slides from each session, or the storify summary of tweets.
  • Google Summer of Code: We’re participating in Google Summer of Code (GSoC) this year (previously) as a mentoring organisation. We had over 50 interested students and 30 distinct applications, many of which were simply brilliant. The deadline for students applying, naturally, was the day after the hackathon, making finding time to provide student feedback a challenge. Maybe there’s a reason to be grateful for jet-lag induced wakefulness at odd hours!
  • Grants: A tale of two grants… :
    • New application: We had a grant application deadline that was, once again, the day after the hackathon. Uh-oh! Feverish figure fixes, tentative typo tweaks and word-count winnowing was squeezed in at every opportunity.
    • Good news about an old application: Meanwhile, we got the news that we’d been fortunate enough to have our hard work pay off: a grant we’d applied for last year as part of the BBSRC BBR 2016 call was agreed to! Hint: the future of InterMine is looking very FAIR, possibly even SPARQLing. More details in a later post.

Events coming up soon:

Bioschemas

Justin and Gos attended the BioSchemas kick-off meeting at Hinxton this week. As well as giving a short talk about InterMine (the slides are on FigShare), Justin managed to jot down a few thoughts about the event:


The aim of the Bioschemas schema is to come up with simple metadata that can be embedded in webpages (JSON-LD, RDFa, etc.) to make it easier to find data. For instance, suppose you wanted datasets that concerned the effect of a certain drug.  At the moment, if you search in Google for that drug name, you will find relevant datasets, but also possibly datasets that happen to use that drug as part of their protocol and more things besides.

But if you can embed a schema which names the subject of the dataset in a structured way (e.g. puts an ontology term URL in a dataset.subject field) then you can pull up more relevant data.

However, there is a strong concern with keeping such markup as light as possible so that people aren’t put off annotating their datasets.  Hence, a notional rule that there should only be 6 properties per class (e.g. just name, description, url, keywords, variableMeasured, creator.name for dataset).

As such, bioschemas won’t be replacing any of the existing ‘heavyweight’ schemas (DATS, DataCite’s model, OMICSdi model, our own InterMine data model), as it’s not meant to be used as an internal data model.

  • Bioschemas is about getting properties into schema.org (as supported by Google, Bing, Yahoo, etc.).  However, schema.org is a big bag of potential metadata with few constraints (no cardinality, for instance!).  It’s up to Bioschemas to come up with constraints for our purposes, especially on generic metadata such as DataSet and DataCatalog.
  • Bioschemas is also largely about general areas at the moment (datasets, data sources, etc.) though there is some specific work on protein annotations and samples.  But not, for instance, on genes and proteins at this stage (though presumably protein annotations would need some metadata for proteins….)
  • Things are at an early stage – the next year will (hopefully) see some Bioschemas definitions.  There is still debate about exactly what it can be used for beyond search.  For instance, we (InterMine) may be able to use them to improve our data integration process if all sources start embedding common metadata in their download files as well as on webpages.
  • Bioschemas is more than schema work. The initiative covers topics such as identifiers, citation, metrics and tools too (which will be relevant to us in the future).  We can get value from these other areas too – for instance there was discussion of a standard way of providing notification that a data source (uniprot, etc.) had updated, which would be very useful to us in building mines that automatically update.  There was also talk of having metadata that specifies when a data source changes its format – another thing that would be tremendously useful for us.
  • The presentations were interesting and the group friendly.  InterMine seemed to be well received.  The group doesn’t have a many actual data consumers and integrators, so I think that we can make a valuable contribution from that perspective.

Google Summer of Code at InterMine

We’re pleased to announce that we’ll be participating in Google Summer of Code 2017 as a mentor organisation, under the umbrella of the Open Genome Informatics. Here’s the full ideas list for Open Genome Informatics Projects – InterMine projects are numbers 3 to 9.

Information for students:

About us:

InterMine is an open source biological data warehouse, based in the University of Cambridge. There are nearly thirty instances of public InterMines, covering a range of subjects from organisms like mice and rats, mines dedicated to plants such as the soybean, insects like the fruitfly or bees and wasps, and even mines dedicated to mitochondrial DNA and discovering drug targets.

We’re interested in mentoring students from a bioinformatics, computational biology, or computer science background.

You don’t have to be a biologist to work on InterMine related projects – many of the full time developers on the team didn’t come from a biology background – but biological knowledge is an advantage.

We use a range of languages in our projects, but most commonly you’ll see Java, PostgreSQL, Clojure/ClojureScript, and JavaScript. Each instance of InterMine has its own set of web services, and there are client libraries in five different languages, with a sixth in final stages of development.

Browse through our GitHub repos to see more of our projects: https://github.com/intermine

Getting started:

If you’re interested in applying for one of our projects, drop an email to the people named in the project description to introduce yourself, and explain which of the project(s) you’re interested in. There’s already been quite a lot of interest in the Similarity project from multiple students, so you might want to consider one of the other projects as a backup if you think you’d particularly like InterMine.

When you mail us, please make sure to include as many of the following as possible:

  • A CV / Resume. Tell us about yourself!
  • Links to GitHub, BitBucket, LinkedIn or similar.
  • Sample code. If you don’t have GitHub/Bitbucket etc. we’d still like to see what you can do. A class coding assignment or personal project you’re proud of is a great alternative.

A great way to familiarise yourself with the basics of building InterMine is to run through our tutorial: http://intermine.readthedocs.io/en/latest/get-started/tutorial/ – or alternatively you could try familiarising yourself with the web interface for your preferred InterMine. You can find the full list of InterMines at intermine.org, or try our experimental interface here: http://redgenes.apps.intermine.org/

We’ve also set up a few tickets on the core InterMine repo with the tag “Good first bug” if you’d like to get your hands dirty. Pop a note on the ticket and make a pull request when you think you’re ready. We have some guidelines for contributing that you should read before you make the pull request.

Finally, if you have any ideas or questions, please don’t hesitate to email us.

Useful links:

– Our twitter feed: https://twitter.com/intermineorg
– Here’s a blog post about some of the cool things the community has done with InterMine resources: https://intermineorg.wordpress.com/2016/11/22/cool-intermine-features-roundup/
– Our interactive web services docs: http://iodocs.apps.intermine.org/
– Our very in-the-works new ClojureScript UI. Demo: http://redgenes.apps.intermine.org/ repo: github.com/intermine/redgenes
– Developer documentation: http://intermine.readthedocs.io/en/latest/