GSoC 2018 – Improved InterMine Search with Solr

Currently, InterMine uses the Apache Lucene (v3.0.2) library to index the data and provide a keyword-style search over all data. The goal of this project is to introduce Apache Solr into InterMine so that indexing and searching become faster. Unlike Lucene, which is a library, Apache Solr is a separate server application, similar to a database server. We set up and configure Solr (v7.2.1) independently from the application, and use Solr clients to communicate between the application and the Solr instance.

Here SolrJ (v7.2.1), a Java client for Solr, is used to communicate between InterMine and Solr. We also removed the bobo facet library that was used with Lucene, since Solr itself provides faceted search. The implementation has been designed so that InterMine is not tightly coupled to Solr. If you want to change your search engine to something else in future, you just have to write different implementations of the interfaces defined.
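The decoupling idea can be sketched roughly as follows. This is written in Python for brevity rather than InterMine's Java, and the names `SearchEngine`, `InMemoryEngine` and `run_search` are hypothetical, not taken from the InterMine code base. The point is only that calling code depends on an interface, so a Solr-backed class could be swapped for another engine without touching the callers.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SearchEngine(ABC):
    """Hypothetical interface: the application only ever talks to this."""

    @abstractmethod
    def index(self, doc_id: str, text: str) -> None:
        """Add or replace one document in the index."""

    @abstractmethod
    def search(self, term: str) -> List[str]:
        """Return the ids of documents matching the term."""


class InMemoryEngine(SearchEngine):
    """Toy stand-in for a real backend such as a Solr-backed implementation."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def index(self, doc_id: str, text: str) -> None:
        # Store lower-cased text so that matching is case-insensitive.
        self._docs[doc_id] = text.lower()

    def search(self, term: str) -> List[str]:
        needle = term.lower()
        return sorted(d for d, t in self._docs.items() if needle in t)


def run_search(engine: SearchEngine, term: str) -> List[str]:
    # Application code depends on the interface, not on a concrete engine.
    return engine.search(term)
```

Switching engines would then mean writing another subclass of `SearchEngine` and passing it to `run_search`, with no change to the calling code.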

Currently the search index and the autocomplete index processes use Solr to index the data. Indexing time has improved significantly compared to before. For example, FlyMine currently takes around 1900 seconds (32 mins) to index the data, but with Solr it takes only 1250 seconds (21 mins), which is a 34% reduction in time. Query time has also improved: a query of “*:*” in FlyMine used to take around 30-40 seconds, whereas with Solr it takes less than 1 second. Previously, with Lucene, the indexed data had to be retrieved from the database during the first search after starting the webapp. This took some time, but with Solr this is no longer the case and the results are returned instantly.
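As a quick check of the indexing figures quoted above, the reduction can be recomputed from the two timings. The helper name `percent_reduction` is just an illustrative choice.

```python
def percent_reduction(before_s: float, after_s: float) -> float:
    """Relative time saved, as a percentage of the original duration."""
    return (before_s - after_s) / before_s * 100


lucene_s, solr_s = 1900, 1250  # FlyMine indexing times quoted in the post
reduction = percent_reduction(lucene_s, solr_s)

# prints "32 min -> 21 min, 34% less"
print(f"{lucene_s / 60:.0f} min -> {solr_s / 60:.0f} min, {reduction:.0f}% less")
```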

In addition to the above, two web services have been implemented. The Facet service returns only the facet counts for a particular query rather than all the results. The Facet List service is similar, but returns all the facets available in a mine. It is useful when you want to know all the facets in a mine before you run an actual search.
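To give a feel for what "facet counts only" means at the Solr level, here is a minimal sketch that builds a request URL for Solr's standard select handler with `rows=0`, so no documents are returned, only counts. It illustrates plain Solr query parameters, not the InterMine web service itself. The host, the core name `intermine-search` and the field names `Category` and `Organism` are assumptions made for the example.

```python
from typing import List
from urllib.parse import parse_qs, urlencode, urlsplit


def facet_only_url(base_url: str, core: str, query: str, fields: List[str]) -> str:
    """Build a Solr select URL that asks for facet counts and no documents."""
    params = [
        ("q", query),
        ("rows", 0),        # return no documents, only the counts
        ("facet", "true"),  # switch faceting on
        ("wt", "json"),     # ask for a JSON response
    ]
    # One facet.field parameter per field to count on.
    params += [("facet.field", f) for f in fields]
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical host, core and field names, for illustration only.
url = facet_only_url(
    "http://localhost:8983", "intermine-search", "*:*", ["Category", "Organism"]
)
print(url)
```

Fetching such a URL from a running Solr instance would return the counts under `facet_counts` in the JSON response. The sketch stops at building the URL so that it runs without a server.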

All these changes were made against InterMine 2.0 and will be included in an InterMine release in the near future. Those who want to try them immediately can check out this branch on GitHub and follow these instructions. All these changes were tested with Apache Solr (v7.2.1).


GSoC Student Interview spotlight: ElasticSearch / Solr Project + Arunan Sugunakumar

This is our blog series interviewing our 2018 Google Summer of Code students, who will be working remotely for InterMine for 3 months on a variety of projects. We’ve interviewed Arunan Sugunakumar, who will be working on upgrading InterMine’s search facilities.


Hi Arunan! We’re really excited to have you on board as part of the team this summer. Can you introduce yourself?

I am Arunan Sugunakumar, an undergraduate from the Department of Computer Science and Engineering, University of Moratuwa. I am attracted to the concept of open source because I get to learn a lot by seeing contributions from other people all over the world, and I learn by contributing myself. I did my internship at WSO2, an open source middleware company. I mostly contribute to Java, Python and JavaScript related projects. I am also interested in Internet of Things and Big Data stuff.

I like to read books in my spare time. It helps me to clear my mind. I also like to play Scrabble, which is a popular word game.

What interested you about GSoC with InterMine?

I came to know about InterMine through a friend, and when I went through the project ideas and the community, I made up my mind to try to become part of this organization. Most of the project ideas were associated with the core InterMine product rather than trial-and-error projects, so I knew that if I became a part of it, my contributions would be there in all InterMine instances. That gave me most of the excitement, and the mentors were also very friendly and supportive.

Tell us about the project you’re planning to do for InterMine this summer.

Currently InterMine uses an outdated library to handle bio data search. My project aims to improve the search feature using modern search engines like Apache Solr / ElasticSearch. The existing architecture in InterMine has to be modified to handle the new approach, and it should reduce the complexity for the user.

Are there any challenges you anticipate for your project? How do you plan to overcome them?

The main challenge for me is to understand the existing code base so that I can change it without breaking the workflow. I need to work closely with my mentor and need to update them with every change I make. Also I have to communicate my doubts to the community in a friendly manner so that I can get input from everyone.

Another challenge that I might face is choosing the appropriate search engine. There are many open source search engines out there and all of them are best in their own way. So I need to discuss with my mentor to select an appropriate search engine that would be suitable for the project.

Share a meme or gif that represents your project


Cool InterMine features roundup

I’ve said this before, but I’ll proudly say it again: one of the greatest things about being open source is the community. People are continually creative and resourceful with the tools we’ve built, and we love seeing all the different things you guys do with InterMine. Here’s a quick roundup of some of the things we’ve seen so far this year:

TargetMine’s Auxiliary Toolkit

TargetMine’s Auxiliary toolkit offers advanced analysis for networks and enrichment

TargetMine links out from report pages to provide external enrichment and interaction tools. Read more about it here, or browse the tutorials: [Enrichment] [Interaction Network].

The Beany Mines:

The beany mines (Soy, Peanut, Legume, and Bean) recently added a shared motif search, as well as a couple of other great visualisations.


R and Solr

Colin of HymenopteraMine and BovineMine did a great blog post about using our R client, InterMineR, and then continued to impress by making efforts to upgrade InterMine to use Solr.


Ever wondered what Model Organism Linked Data might look like? MOLD includes a queryable SPARQL endpoint and draws from multiple different InterMines to create a single dataset.


Tip: Make it generic

Generic tools are ones that aren’t hard-coded to a specific Mine or model. We’re always on the lookout for new and exciting features, whether it’s a visualisation or a web service or a database tweak. If you think it’s good, you can email us to discuss it or simply create a pull request, and bask in glory forever after.

We’d love to see more!

This list is awesome (thanks everyone!!) but by no means exhaustive. If you think we’ve missed something out, or you’re doing something new at the moment, drop us a line and we’ll add you to the next roundup. We’d also love to hear from others who might be interested in guest-blogging an InterMine-related feature.