GSOC 2018 – Improved InterMine Search with Solr

Currently InterMine uses Apache Lucene (v3.0.2) library to index the data and provide a key-word style search over all data. The goal of this project is to introduce Apache Solr in InterMine so that indexing and searching can happen even quicker. Unlike Lucene which is a library, Apache Solr is a separate server application which is similar to a database server. We setup and configure Solr (v.7.2.1) independently from the application. We use Solr clients to communicate between the application and the Solr instance.

Here, SolrJ (v.7.2.1), a java client for solr is used to communicate between the InterMine and Solr. We also removed the bobo facet library which is used with Lucene since Solr itself provides faceted search. The implementations has been designed in a manner that InterMine would not be heavily coupled with Solr. When you want to change your search engine to something else in future, you just have make different implementations for the interfaces defined.

Currently the search index and the autocomplete index processes use Solr to index the data. The index time has improved significantly with compared to previous indexing times. For example, currently FlyMine takes around around 1900 seconds (32 mins) to index the data. But with Solr we see that it takes only 1250 seconds (21 mins) which is 34% reduction in time. Query time has also improved with Solr where a query of  “*:*” in FlyMine would take around 30-40 seconds which with Solr takes less than 1 second. Previously with Lucene, the indexed data has to be retrieved from the database during the first search after starting the webapp. This took some time but with Solr, it is not the case and the results are instantly returned.

Addition to the above, two web services have been implemented. A Facet service has been implemented which will return only the facet counts for a particular query rather than returning all the results. The other web service is Facet List service which is similar to the previous one but it will return all the facets available in a mine. It will be useful when you want to know all the facets in a mine before you run an actual search.

All these changes are made against InterMine 2.0 version. These changes will be included in an InterMine release in near future, but for those who want to try these changes immediately, can checkout this branch in Github and follow these instructions. All these changes are tested with Apache Solr (v7.2.1).

References :

Advertisements

Google Summer of Code 2018 - Wrapping up the final results for the Data Browser

 

The GSoC 2018 is coming to its end, and after 3 months of hard work, I can proudly present a summary of all our achievements during this summer of code.

 

Summary of Project Goals

InterMine is a open source Data Warehouse intended to be used for the integration and analysis of complex biological data. With InterMine, you can explore organism and other research data provided by different organizations, moving between databases using criteria such as homology.

The existing query builder in InterMine requires some experience to obtain the desired data in a mine, which can become overwhelming for new users. For instance, for a user interested on searching data in HumanMine using its query builder, he or she would need to browse through the different classes and attributes, choosing between the available fields and adding the different constraints over each of them, in order to get the desired output.

For example, a simple query you might want to glean from InterMine might be as follows:

Query: Given the human gene symbol “GATA1”, show all homologous genes in other organisms. (More on Homologues, also spelled homologs: https://en.wikipedia.org/wiki/Homology_(biology)

Problem: This query sounds simple-ish, but building it in our query builder requires a strong familiarity with the data model, and can be confusing for anyone new to InterMine. We would like the data browser to be more complex than the simplicity of a simple keyword search, but less complex than the current query builder. For context, here’s the humanmine query builder: http://www.humanmine.org/humanmine/customQuery.do. We have attached a screenshot of what it looks like for the homology query mentioned above, where it can be seen why it looks a little intimidating.

This requires the user to have a decent knowledge of the model schema in order to successfully build a correct query for the expected query results. For new users this workflow can become, indeed, overwhelming to search for specific information in the data.

For this reason the goal of this project is to implement a faceted search tool to display the data from InterMine database, allowing the users to search easily within the different mines available around InterMine, without the requirement of having an extensive knowledge of the data model.

 

Summary of Project Achievements

In order to maintain a good workflow, the project was divided into three major versions or milestones, coinciding the deadline of each one with GSoC evaluation phases. The main developments in each milestone are listed below, and comprises a total of 67 closed issues with 194 commits.

Milestone 1 (June 11). In the first milestone (related GitHub issue here), the following features were added:

  1. Initial environment setup (#1)
  2. Server routes to query for ‘counts’ of data (#2)
  3. Use InterMine im-tables for dynamically loading the data on the views (#3)
  4. Unit testing for the server-side routes (#4)
  5. Functional general statistics of the data with graphics and plots (#6)
  6. Searching data on HumanMine (basic ontology concepts) + Unit testing (#8, #9)
  7. Further User Interface improvements (#22, #23, #24, #25, #26, #27, #28)
  8. Code documentation and optimization (#7)

Milestone 2 (July 9). In the second milestone (related GitHub issue here), the following features were added:

  1. Query builder with automatically filled selectable fields + Unit testing (#11)
  2. Filtering options when searching for data + Unit testing (#13)
  3. Improvements to adapt to small devices (#29)
  4. Handling SSL errors (#30, #31)
  5. User interface improvements (#32, #34, #35, #36, #38)
  6. Show counts beside typeahead filters (#39)
  7. Allow users to add and remove filters (only one) (#40)
  8. Making the Dataset filter to be multiple checkbox (#42)
  9. Adding formal documentation using documentation.js (#43)
  10. Interactions filter (#44)
  11. Chromosome Locations filter
  12. Code documentation and optimization

Milestone 3 (August 6). In the third milestone (related GitHub issue here), the following features were added:

  1. Finish main dashboard interface (#5)
  2. Use InterMine color palette (#10)
  3. Enable save as list functionality (#16, #15)
  4. Add a ‘switch’ functionality to change between mines (#21)
  5. Option to show multiple filters for the same filter type at once (#41)
  6. ClinVar filter (#52)
  7. OMIM Disease filter (#53)
  8. Expression: Illumina bodymap filter (#55)
  9. Protein localisation: Protein Atlas filter (#56)
  10. JSON config file for mines to handle extra filters (#57)
  11. Show only default filters for mines not defined in JSON config and available in the registry (#58)
  12. Add running instructions (#61)
  13. Deep link to specific mine (#62)
  14. Remember which mine you were looking at last time (#63)
  15. User interface improvements (#66, #68)
  16. Handle an InterMine being down (#67)

 

Brief Overview of the Final Product

In the following figure, the different elements available on the browser interface are displayed and further explained.

As depicted above, the user is able to change between the different mines available in the InterMine registry by using the dropdown box in (1). Next, in (2) the viewed class can be changed, and currently it can be either Genes or Proteins. Moreover, in (3), the different filters available in the currently explored mine are displayed, where the user can filter the data shown in the table at (6) that better fulfills his/her requirements. Furthermore, some plots regarding to the data in the table are displayed on (4). Currently it shows a pie chart of individuals per different organism, but it will be extended with more plots in the future. Finally, the user has the option to save the table as list, generate the code to embed it elsewhere, or to export the results by using the options in (5).

 

More Screenshots of the Final Product

 

 

Related Blog Posts (in chronological order)

 

Links to Final Tool Code and Deployments

 

Future Work and Plans

There are already some features to be added to the browser after GSoC, some of them are, for instance, to allow users to add their personal InterMine’s API tokens for each mine and use them for the Save as List functionality of the table (link). Another useful feature that I wasn’t able to implement due to a temporary disabling in the InterMine ontology was a Phenotypes filter (link). Next, a new histogram plot to the top section about Gene length will be added (link). Furthermore, a “current class” filter will be added to the sidebar (link). Finally, another desirable feature would be to refactor the per-mine filters to use path query (link).

 

As a conclusion, the fact that the final product has been tested and is going to be truly helpful for the target community of users, is enough for me to be proud of the developed tool during this Google Summer of Code. Also the results of this project will allow us to, hopefully, publish a paper describing the new InterMine browser.

GSoC’18- Cross InterMine Search Tool (Progress so far)


It has been three weeks since the commencement of the GSoC’18 Coding Phase and a lot of progress has been made till now. The project has been deployed here.

This week has been really very productive. I have worked upon several parts of the application this week. My main focus for this week was to add filters into the application. But I ended up with some more cool stuff here 😉
Let’s have a look at all of them one by one.

  1. Search Rating/Relevance Score
The application knows what you are searching for 😉

Relevance score is a value which is calculated by the InterMine QuickSearch API endpoint for every keyword which is searched. This rating determines how much relevant is the result item with the search keyword. The QuickSearch API response provides a floating value of ‘relevance’ parameter. InterMine uses a formula for determining the Search score in the specific InterMine Search Portals. I have used the same formula to convert the floating relevance score into a score out of 5. The formula is here:
=> Math.round(Math.max(0.1, Math.min(1, relevance)) * 5)

2. Different colors for different Categories in Results

The world is colorful and hence our app too 😉

This was a feedback received from the InterMine community. Having different category results shaded in different colors helps in exploring the results easily. At times, there may be a long list of result items returned by the application. Then having separate colors for separate categories helps us to explore faster. Our eyes are meant to perceive colors more quickly than text.

3. External links to reports

Every result item has a link which opens the result report on its particular InterMine portal. This report page contains more detailed information about the result item. On clicking the icon, a new tab with the given result report opens in the browser. The result link is generated dynamically:

4. Metadata about search results

Metadata for BMAP mine (search: ‘brca1′)

Having metadata about the results returned is always handy at work. Every tab in the application loads a set of metadata as attached above. This can certainly help in understanding the presence of a search term in the mine in a unified way.

5. Search/Relevance Score Filter

Score filter on sidebar of the application

With addition of score/relevance in the application, it was very much necessary to add an option to filter the results based on that. This section provides radio buttons to filter out the results based on the relevance score of the result item. I hope this feature will be extremely beneficial for the community members out there. 🙂

6. Category Filter

Every mine contains data from various diverse categories. So, this feature too was a necessary requirement of a full fledged searching tool. This section is loaded dynamically based on the types of categories returned by the API. The application uses the search result metadata received from the API to generate these category checkboxes dynamically. So, we need not worry about hard-coding any of these.


So, this was all about progress so far. Almost everything in the application is mobile optimized and ready to use. Most probably, I will also be extending the scope of this project and add a REST API service for searching multiple mines. So, the project will be a full fledged Cross InterMine Search Tool in future, i.e. a package of, a client driven search interface & a back-end REST API service. You can find the project repository here. Please have a look at the application here. I would love to have more feedback from the community. (For providing your feedback/suggestions, kindly email me at dwivedi.aman96@gmail.com) Thanks for reading. Happy Coding! 🙂

 

GSoC Student Interview spotlight: Natural Language to InterMine Queries + Jake Macneal

This is our blog series interviewing our 2018 Google Summer of Code students, who will be working remotely for InterMine for 3 months on a variety of projects. We’ve interviewed Jake Macneal, who will be working on converting natural language phrases to InterMine PathQuery.

Hi Jake! We’re really excited to have you on board as part of the team this summer. Can you introduce yourself?

jakeExcited to be joining the team! I’m an undergraduate studying computer engineering at McGill University in Montreal, about to enter my final semester in the fall. I’m originally from Philadelphia in the US, and I’ll be hopping around North America a little bit this summer (currently in Toronto). I’ve got a passion for robotics and artificial intelligence, which led to me joining my university’s robotics team to help design and build a Mars rover. Additionally, I’ve had the opportunity to intern at NASA Johnson Space Center in Texas, where I worked on a project which uses machine learning to track sensors around the space station (hopefully it’ll be put into use soon).

Aside from those technical interests, I enjoy soccer/football (both playing and watching), classical guitar, analog synthesizers (just getting into this but they’re really fun and fascinating), and the field of space exploration. Part of me is still holding on to the hope of becoming an astronaut some day.

What interested you about GSoC with InterMine?

I searched the GSoC organizations page for projects looking for a Clojure developer, and this was unsurprisingly one of the only ones. However, a language is hardly enough motivation to become passionate about a project. I’ve never had the chance to work in bioinformatics, but I did have a beloved computer science professor (Matthieu Blanchette) whose research was in that field, and he often spoke during lectures about his research. When I read through the organization and task descriptions I immediately thought of him, and knew that this would be a cool project to join. Nothing is more rewarding to me than the thought of using software as a tool to help others do good.

Tell us about the project you’re planning to do for InterMine this summer.

InterMine uses a graph query language (PathQuery) to retrieve information from the database. My project is to implement a more user-friendly alternative, allowing non-technical users to interact with an InterMine database without the need for esoteric queries crafted by an experienced programmer or system administrator. This will take the form of a natural language to PathQuery translation tool, written as a Clojure library. In addition, I’ll be building a proof-of-concept interface allowing novice users to submit English queries which will be translated and then submitted to the query engine. This simple app will be integrated with the InterMine web app, similar to the graphical query builder.

Are there any challenges you anticipate for your project? How do you plan to overcome them?

Natural language processing is a difficult field, and from working on a compiler, I learned that the key to building a correct parser and code generator is a huge number of test cases. Fortunately, the basic principle behind testing a translation tool is simple: assemble a set of English queries, along with the intended output (in the form of a PathQuery string). However, actually assembling such a set of tests which are useful and demonstrate realistic/important queries requires interacting with actual users in the community. I hope to spend much of my initial weeks working with the community to figure out the syntax they’d like to see supported, as well as the types of queries already being written in PathQuery.

Share a meme or gif that represents your project

jake-meme

GSoC Student Interview spotlight: Buzzbang Bioschemas search + Ankit Lohani

This is our blog series interviewing our 2018 Google Summer of Code students, who will be working remotely for InterMine for 3 months on a variety of projects. We’ve interviewed Ankit Kumar Lohani, who will be working on Buzzbang.

Hi Ankit! We’re really excited to have you on board as part of the team this summer. Can you introduce yourself?

Hello InterMiners, I’m Ankit Lohani, a final year undergraduate student, Indian Institute of Technology, Kharagpur, India.  I will be completing my undergraduate studies in a few months with my major in Chemical Engineering.

Right from my first year, I have been interested in robotics and programming and my interest in this field has only grown with time. Initially, I spent over a year working on the hardware front and then shifted to making path planners for our soccer playing bots. Though my academic background has been completely different, it has only pushed me forward to work harder and to learn more. My interests are inclined towards natural language processing and information retrieval.

Apart from these, I love travelling and trekking. I am also planning to complete my 3rd trek this summer, this time above 14,000 feet.

What interested you about GSoC with InterMine?

I have never worked on an open source project and I realized that GSoC is the best place to start learning and seeing my stuff at work. Honestly, while looking for organisations, in which I may be able to contribute, I came across InterMine and the various projects enlisted here. The application domain of InterMine is very appealing and I could relate myself with this organisation because of two key reasons – first, my past internship was on information retrieval on clinicaltrials.gov data. I touched upon various topics like – semantics, ontologies, UMLS (Unified Medical Language System), PubMed, Named-Entity Recognition for biological terms etc. Secondly, because the technologies used in this project were something I have been familiar with as a part of my course and term projects, like Solr, elasticsearch, docker. Apart from these, the project itself has got a unique potential to create a breakthrough in the way complex scientific data is organised on the internet.

Tell us about the project you’re planning to do for InterMine this summer.

My project – Buzzbang, is significantly different from all other InterMine instances and it focuses on scraping all the data we have on internet marked with bioschemas.org and indexing them in a search tool – Apache Solr. So far, a basic scraping module and an indexing engine are up and running. I am planning to integrate “Scrapy” for crawling and indexing new paths and upgrading the Solr search tool in this project. Towards the end of this project, I will make sure all the changes are reflected in the front-end as well.
Are there any challenges you anticipate for your project? How do you plan to overcome them?

I believe there could be some serious challenges that I might face with Scrapy. Making a generalised scraping tool looks easy with the data having bioschemas.org markup, but, the organisation of this data on various domains varies, and crawling across some of those domains might not be a simple task. Moreover, we are also planning to introduce some degree of parallel processing to this module. Though my focus would be on EBI biosamples domain, which should make my task easier, I will try to keep the crawler as general and powerful as I can. Additionally, I suspect I would need some help in planning the architecture for the re-crawling and re-indexing part from the community. I am not very sure about what level of automation would be desirable in this project with respect to the previous point.

GSoC Student Interview spotlight: Cross InterMine Search Tool + Aman Dwivedi

This is our blog series interviewing our 2018 Google Summer of Code students, who will be working remotely for InterMine for 3 months on a variety of projects. We’ve interviewed Aman Dwivedi, who will be working on the Cross-InterMine Search tool.

Hi Aman! We’re really excited to have you on board as part of the team this summer. Can you introduce yourself?

Heya! I’m Aman Dwivedi, a final year undergraduate student from Jabalpur Engineering College, India. I’m a web enthusiast and a Javascript lover (JS is love <3). I have worked with two startup companies as a Full Stack Node.js Developer Intern in the past. I’m also a proud member of the Mozilla Open Source Community (I have worked on the renowned Mozilla Firefox project). I have worked with many great programmers in the past and I’m extremely excited to work with the InterMine team.

What interested you about GSoC with InterMine?

I believe in the fact that a good open source community comes with its members sharing ideas and helping each other throughout. The sign of a good team is a friendly, yet productive environment. The best thing about InterMine is its team and its proud contributors. Everyone has a great helping attitude. The Application Phase was awesome, and I never had such a great experience in any of the past teams I worked with. Everyone is so much enthusiastic about new features and new implementations all the time. Also, one more brownie point is that my project work here will affect a very large scale of society (this is the most important motivating factor for me <3).

Tell us about the project you’re planning to do for InterMine this summer.

I will be working on the Cross InterMine Search Tool. This project will be developed from scratch. It will use the InterMine APIs and the registry to fire concurrent requests to all the selected InterMines for a search query. The project will be developed in Vue.js. It will have a great impact as currently there is no such tool which is capable of searching multiple mines at once. It will make life of all InterMiners and researchers very easy to search and browse through genomic data in all the InterMine instances.

Are there any challenges you anticipate for your project? How do you plan to overcome them?

The most important thing in the development of an open source project is the community. I will need suggestions and user reviews from the community to make the project better. My first priority is always the Community User experience. Suggestions will be really valuable throughout the project development, testing and the documentation phase.

mamandebug.png

GSoC Student Interview spotlight: ElasticSearch / Solr Project + Arunan Sugunakumar

This is our blog series interviewing our 2018 Google Summer of Code students, who will be working remotely for InterMine for 3 months on a variety of projects. We’ve interviewed Arunan Sugunakumar, who will be working on upgrading InterMine’s search facilities.

arunan-architecture.png

Hi Arunan! We’re really excited to have you on board as part of the team this summer. Can you introduce yourself?

I am Arunan Sugunakumar, an undergraduate from Department of Computer Science and Engineering, University of Moratuwa. I am attracted to the concept of open source because I get to learn a lot by seeing contributions from other people all over the world and I learn by contributing myself. I did my internship in WSO2, a open source middleware company. I mostly contribute to Java, Python and JavaScript related projects. I am also interested in Internet of Things and Big Data stuff.

I like to read books in my spare time. It helps me to clear my mind. Also I like to play scrabble which is a popular word game.

What interested you about GSoC with InterMine?

I came to know about InterMine through a friend, and when I went through the project ideas and the community, I fixated in my mind that I should give a try to be part of this organization. Most of the project ideas were associated with core InterMine product rather than trial and error projects. So I know if I become a part of it, my contributions would be there in all InterMine instances. That gave me most of the excitement and the mentors were also very friendly and supportive.

Tell us about the project you’re planning to do for InterMine this summer.

Currently InterMine uses an outdated library to handle bio data search. My project aims to improve the search feature using modern search engines like Apache Solr / ElasticSearch. The existing architecture in InterMine has to be modified to handle the new approach and it should reduce the complexity to the user.

Are there any challenges you anticipate for your project? How do you plan to overcome them?

The main challenge for me is to understand the existing code base so that I can change it without breaking the workflow. I need to work closely with my mentor and need to update them with every change I make. Also I have to communicate my doubts to the community in a friendly manner so that I can get input from everyone.

Another challenge that I might face is choosing the appropriate search engine. There are many open source search engines out there and all of them are best in their own way. So I need to discuss with my mentor to select an appropriate search engine that would be suitable for the project.

Share a meme or gif that represents your project

apache-solr-spongebob.gif