In July 2018 Innovate UK awarded InterMine at the University of Cambridge and STORM Therapeutics a Knowledge Transfer Partnership (KTP). A KTP is a government program that helps businesses in the UK by linking them with an academic organisation — enabling them to bring in new skills and the latest academic thinking to deliver a specific, strategic innovation project.
The key objective of this particular project is to develop an analysis platform using the data warehouse InterMine to help STORM advance their cancer research.
Here we talk with Hendrik Weisser, Senior Bioinformatician at STORM, about this collaboration.
Can you tell me about this project?
Sure, my company (STORM) is partnering with InterMine in this project. We are going to develop a computational knowledge base for cancer drug discovery and RNA epigenetics, based on InterMine’s HumanMine database. We will extend InterMine by adding analysis tools, more biomedical data, and so on, to make it a bespoke platform to help us identify and validate drug targets.
Can you tell me more about STORM?
STORM Therapeutics is a drug discovery company focused on RNA epigenetics, developing small-molecule inhibitors of RNA-modifying enzymes for the treatment of cancer. We are a spin-out of Cambridge University, founded in 2015 by professors Eric Miska and Tony Kouzarides from the Gurdon Institute. You can find more information – and a cool animated video about RNA epigenetics – on our website, www.stormtherapeutics.com.
What do you hope to achieve?
For STORM, convenient access to available data on RNA-modifying enzymes, their roles in RNA epigenetics, and their associations with different cancers – both direct and via interaction partners – is vital for our efforts in target validation, indication prioritisation and patient stratification. A large amount of relevant data is publicly available but is scattered over many sources and not integrated, making it difficult and time-consuming to utilise fully. STORM’s vision is to develop an integrated database of relevant human biomedical data that enables our scientists to quickly view and interrogate the most pertinent data on target genes/proteins, but also allows us to easily perform bioinformatic analyses on these data.
What attracted you to InterMine? What makes InterMine a useful tool for drug discovery?
I found out about InterMine’s existence by chance and then quickly signed up to an InterMine training course at Cambridge University to learn more. I was impressed by the wealth of functionality offered by InterMine and by its sophisticated architecture that enables huge flexibility in dealing with different kinds of biological data. InterMine really represents the state of the art in terms of large-scale complex biomedical data integration. By focusing on extensibility and customisation and on enabling local installations, InterMine is able to serve a variety of research communities. These capabilities also make it an ideal fit for STORM’s requirements for an internal data management system that integrates diverse public data. The fact that InterMine is open-source, i.e. the code is and will stay available, is also important for us because it helps to ensure long-term maintainability.
Throughout the InterMine code, the InterMine version number is set via system properties. Here’s an example dependency declaration from a Gradle build file:
// Gradle will download the bio-core JAR with the correct version from Maven
compile group: 'org.intermine', name: 'bio-core', version: System.getProperty("bioVersion")
To change which InterMine version you are using, update the values of the system properties “imVersion” and “bioVersion”. These are located in the “gradle.properties” file for your mine:
# gradle.properties in your mine
systemProp.imVersion=2.+
systemProp.bioVersion=2.+
Gradle will now download the latest matching version of each JAR from Maven, e.g. “bio-core-2.1.0.jar”.
If you set the property to “2.1.+” you will get any small point releases that are published in the future. Set the property to “2.1.0” if you ONLY want to use version 2.1.0 and do not want to receive updates:
# gradle.properties to only get a specific version
systemProp.imVersion=2.1.0
systemProp.bioVersion=2.1.0
What is an RSE (Research Software Engineer), you may ask? It’s a role that has existed for decades, but has only gone by this name for a few years. As RSEs, we tend to be software engineers who work in academia, or perhaps academics who write production-ready code – or maybe both.
A common theme seems to be universities establishing RSE groups who work in consultancy-style ways – academics who have code, or have a need for code, approach the groups and are helped through their tasks, whether it be refactoring some old/messy/slow code, providing suggestions, or writing code to make their research easier. The RSE group may also provide training in programming languages, version control, best practices and other relevant computational basics that support researchers’ work.
Whilst I think most or all of us at InterMine would consider ourselves to be RSEs, we don’t really fit this model – we all write code, we all contribute to papers, but all of our sub-projects and work focus around a single primary project – some of us are working to make InterMine more FAIR, others to make it easy to launch InterMine on the cloud, but it’s still all InterMine. I’m sure we’re not the only group like this, and it makes me wonder if there should be names for the different flavours of RSE groups out there. Central RSE groups vs. dedicated RSE groups? Consultancy / support / advocacy RSEs vs. RSE specialist groups? I’m not sure if any of these are quite right, and I’d be curious to hear what others think.
RSE 2018: a grassroots conference for research software engineers
Moving on from musing about job titles, though – a bit about the recent conference. RSE2018 is the third annual UK conference for Research Software Engineers, but it’s the first time I’ve attended, personally. It made a change to have a conference where everyone around was working in research and software development, but not all of it was open source or bioinformatics related. I relished the chance to meet and discuss career paths with others, and enjoyed it perhaps too much when the late-night conference dinner descended into attempts to assign poetry genres to different programming languages. Java is obviously epic poetry, but others get trickier. Terse Clojure might be a haiku, and perhaps Python, with its structured whitespace, is a form of concrete poetry?
The conference keynotes were varied: there was an introduction to a digital humanities project, Oracc, which hosts annotated and transcribed cuneiform; an introduction to the Microsoft HoloLens and some of the challenges and history of its creation; and a talk about Google DeepMind. I particularly enjoyed the keynote on the sustainability of research software. Given how chaotic dependencies make everything, it’s no wonder that maintaining software takes a significant amount of time and money!
You could think of software dependencies like ripples on rainy water: all spreading out and interacting, becoming beautiful chaos as ripples interact with one another, says @jameshowison.
There were some hands-on tutorials and workshops, but I mostly attended RSE-community related sessions. A couple that stood out to me, in no particular order:
Diversity in recruiting RSEs. We had speakers from Microsoft talking about their efforts to make their research staffing pool more diverse, which included gruelling-sounding half-day sessions where candidates were interviewed by four different interviewers in an attempt to remove bias. Somewhat entertainingly, the room this was conducted in – the senate chamber – had red throne-like seats and eight large portraits on the walls, every single one depicting an older white male. The irony was not lost upon the session attendees!
The RSE community AGM. Rather than being an informal gathering of individuals, the UK RSE group will soon be re-launching as an official society that members can join for a nominal fee. The AGM gave us a chance to hear about some of their plans (you can sign up to hear about the launch date), as well as the opportunity to share our wish lists of likes, dislikes, and comments on the activities the group performs. I’m looking forward to interacting with the society and seeing where they head!
It’s a conference I’d definitely like to attend again. If you missed out, you can catch up with many of the relevant points on Twitter, under the hashtag #RSE18.
One of the goals of Google Summer of Code (GSoC) is to help turn students into long-term open source maintainers and contributors. I suspect we’ve managed this with our current batch of students, who have contributed to our projects across a broad range of topics, whether it was querying InterMine using natural language sentences, updating our search capabilities (both UIs and search backends), or adding new features to the InterMine python client.
From the start of the application process, our fabulous pool of applicants spent time interacting with each other and even helping each other out before anyone had been officially accepted. We received numerous PRs, tickets, and suggestions on our GitHub repos, and for this year we had returning GSoC mentors who previously had been students. It’s almost hard to believe we hadn’t participated before 2017, seeing all of the great work and enthusiasm GSoC brings, all while being able to pay students for their time and give them valuable work experience.
To wrap up this year’s great set of projects we had a community call [agenda & notes here] where our students presented their work in roughly five-minute slots. You can catch up on each of the recorded presentations in our GSoC 2018 playlist, or here are direct links to each of the videos:
Currently InterMine uses the Apache Lucene library (v3.0.2) to index the data and provide a keyword-style search over all data. The goal of this project is to introduce Apache Solr into InterMine so that indexing and searching can happen even faster. Unlike Lucene, which is a library, Apache Solr is a separate server application, similar to a database server. We set up and configure Solr (v7.2.1) independently from the application and use Solr clients to communicate between the application and the Solr instance.
Here, SolrJ (v7.2.1), a Java client for Solr, is used to communicate between InterMine and Solr. We also removed the Bobo facet library that was used with Lucene, since Solr itself provides faceted search. The implementation has been designed so that InterMine is not heavily coupled to Solr: if you want to change your search engine to something else in the future, you just have to provide different implementations of the defined interfaces.
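To illustrate the decoupling idea, here is a minimal sketch in Python. The interface and class names below are hypothetical, chosen for illustration only – they are not InterMine’s actual API, and the "index" here is an in-memory stand-in rather than a real Solr server:

```python
from abc import ABC, abstractmethod

class SearchEngine(ABC):
    """Hypothetical interface that the rest of the application codes against."""

    @abstractmethod
    def index(self, documents):
        """Add documents to the search index."""

    @abstractmethod
    def search(self, query):
        """Return documents matching a keyword query."""

class SolrSearchEngine(SearchEngine):
    """One possible implementation, nominally backed by a Solr server.

    In this sketch the remote index is faked with a local list so the
    example is self-contained.
    """

    def __init__(self, solr_url):
        self.solr_url = solr_url
        self._docs = []  # stand-in for the remote Solr index

    def index(self, documents):
        self._docs.extend(documents)

    def search(self, query):
        return [d for d in self._docs if query.lower() in d.lower()]

# Callers depend only on the SearchEngine interface, so swapping in a
# different backend later just means writing another subclass.
engine = SolrSearchEngine("http://localhost:8983/solr/intermine")
engine.index(["zen gene", "eve gene"])
print(engine.search("zen"))  # -> ['zen gene']
```

The point is the shape, not the backend: because callers only see `SearchEngine`, replacing Solr with another engine touches the implementation class alone.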
Currently both the search index and the autocomplete index processes use Solr to index the data. The indexing time has improved significantly compared with previous indexing times. For example, FlyMine previously took around 1900 seconds (32 mins) to index the data, but with Solr it takes only 1250 seconds (21 mins), a 34% reduction. Query time has also improved: a query of “*:*” in FlyMine used to take around 30-40 seconds, but with Solr it takes less than 1 second. Previously, with Lucene, the indexed data had to be retrieved from the database during the first search after starting the webapp, which took some time; with Solr this is no longer the case and the results are returned instantly.
In addition to the above, two web services have been implemented. The Facet service returns only the facet counts for a particular query, rather than returning all the results. The other, the Facet List service, is similar but returns all the facets available in a mine; it is useful when you want to know all the facets in a mine before you run an actual search.
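A client would call these services over HTTP like any other InterMine web service. As a rough sketch, assuming a JSON response – note that the endpoint path and parameter names below are illustrative guesses, not the documented service URLs, so check the InterMine documentation for the real ones:

```python
from urllib.parse import urlencode

# Hypothetical facet-service endpoint and parameters, for illustration only.
base = "http://www.flymine.org/flymine/service/search/facets"
params = {"q": "eve", "format": "json"}  # query term and response format
url = base + "?" + urlencode(params)

# A client would now GET this URL and read the facet counts from the JSON body.
print(url)  # -> http://www.flymine.org/flymine/service/search/facets?q=eve&format=json
```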
All these changes were made against InterMine 2.0. They will be included in an InterMine release in the near future, but if you want to try them immediately you can check out this branch on GitHub and follow these instructions. All these changes have been tested with Apache Solr (v7.2.1).
This release represents a large milestone for the InterMine team! Not only because we made big fundamental changes to the core InterMine data model and build system, but also because this release represents a major shift in philosophy for us. Previously InterMine was a big, monolithic, single piece of software. You downloaded the whole InterMine, you compiled the whole InterMine, you got the whole of InterMine. Instead, we are moving towards modularity and responsiveness: smaller, independent libraries that are interconnected but can be used separately for tools and features, or linked together.
Smaller, decoupled InterMine packages will allow us to develop more features faster with fewer errors. InterMine maintainers will then have the flexibility to include (or not) features in their mine, plug in their own tools, etc.
Version 2.0 represents a big step towards this goal!
A New Interface
A new feature in InterMine 2.0 is the ability to run our new UI, nicknamed “Blue Genes”. This app is in addition to the current webapp and offers a new and responsive search environment for your InterMine data.
Blue Genes is built in Clojure and provides a modern user experience, including:
Super fast response times
Interactive list upload
Redesigned “My account” section
Template and query builder result previews
… and lots more!
Once you have updated your InterMine to 2.0, a single command will launch Blue Genes for your mine.
We are actively seeking feedback on Blue Genes – it’s still very much in the beta phase – so please get in touch once you have formed some opinions!
Thanks to everyone who helped test this release! Thanks Howie Motenko at MGI for your alpha testing and model insights. And a BIG thank you goes to Sam Hokin from the NCGR who spent a lot of time and effort helping improve InterMine! Thanks Sam and Howie! You are much appreciated.
GSoC 2018 is coming to an end, and after three months of hard work I can proudly present a summary of all our achievements during this summer of code.
Summary of Project Goals
InterMine is an open-source data warehouse intended for the integration and analysis of complex biological data. With InterMine, you can explore organism and other research data provided by different organisations, moving between databases using criteria such as homology.
The existing query builder in InterMine requires some experience to obtain the desired data from a mine, which can be overwhelming for new users. For instance, a user interested in searching for data in HumanMine using its query builder would need to browse through the different classes and attributes, choosing between the available fields and adding constraints over each of them, in order to get the desired output.
For example, a simple piece of information you might want to glean from InterMine is: which genes in other organisms are homologues of a particular human gene?
Problem: This query sounds simple-ish, but building it in our query builder requires a strong familiarity with the data model, and can be confusing for anyone new to InterMine. We would like the data browser to be more powerful than a simple keyword search, but less complex than the current query builder. For context, here’s the HumanMine query builder: http://www.humanmine.org/humanmine/customQuery.do. We have attached a screenshot of what it looks like for the homology query mentioned above, which shows why it can look a little intimidating.
This requires the user to have a decent knowledge of the model schema in order to build a query that returns the expected results. For new users, this workflow can indeed make searching for specific information in the data overwhelming.
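For a sense of what the query builder is producing under the hood, a homology query like the one above corresponds to a PathQuery, which can be expressed in InterMine’s query XML. The sketch below is illustrative only – the exact paths and values depend on the mine’s data model:

```xml
<query model="genomic"
       view="Gene.symbol Gene.homologues.homologue.symbol Gene.homologues.homologue.organism.name">
  <!-- Constrain to a single human gene; "ABO" is an arbitrary example symbol -->
  <constraint path="Gene.organism.name" op="=" value="Homo sapiens"/>
  <constraint path="Gene.symbol" op="=" value="ABO"/>
</query>
```

Writing this by hand (or via the query builder) is exactly the kind of model-schema knowledge the data browser aims to make unnecessary.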
For this reason, the goal of this project is to implement a faceted search tool to display the data from the InterMine database, allowing users to search easily within the different mines available around InterMine without needing an extensive knowledge of the data model.
Summary of Project Achievements
In order to maintain a good workflow, the project was divided into three major versions or milestones, with the deadline of each coinciding with a GSoC evaluation phase. The main developments in each milestone are listed below, comprising a total of 67 closed issues and 194 commits.
Milestone 1 (June 11). In the first milestone (related GitHub issue here), the following features were added:
In the following figure, the different elements available on the browser interface are displayed and further explained.
As depicted above, the user can switch between the different mines available in the InterMine registry using the dropdown box at (1). Next, at (2) the viewed class can be changed; currently it can be either Genes or Proteins. At (3), the different filters available in the currently explored mine are displayed, which the user can apply to narrow down the data shown in the table at (6). Some plots of the data in the table are displayed at (4); currently this is a pie chart of individuals per organism, but it will be extended with more plots in the future. Finally, the user has the option to save the table as a list, generate the code to embed it elsewhere, or export the results using the options at (5).
There are already some features to be added to the browser after GSoC. For instance, allowing users to add their personal InterMine API tokens for each mine and use them with the table’s Save as List functionality (link). Another useful feature that I wasn’t able to implement, due to the temporary disabling of the InterMine ontology, was a Phenotypes filter (link). Next, a new histogram of gene length will be added to the top section (link). Furthermore, a “current class” filter will be added to the sidebar (link). Finally, another desirable feature would be to refactor the per-mine filters to use path queries (link).
In conclusion, the fact that the final product has been tested and will be genuinely helpful to its target community of users is enough for me to be proud of the tool developed during this Google Summer of Code. The results of this project will also, hopefully, allow us to publish a paper describing the new InterMine browser.