This is our blog series interviewing our 2019 Google Summer of Code students, who are working remotely for InterMine for 3 months on a variety of projects. We’ve interviewed Rahul Yadav, who will be working on the InterMine single sign-in project.
Hi Rahul! We’re really excited to have you on board as part of the team this summer. Can you introduce yourself?
Hi! Excited to be on the team. I am a third-year undergraduate student, pursuing my Bachelor of Technology in Computer Science at USICT (GGSIPU, Delhi). I love being in front of my laptop. I can certainly spend more time writing code than doing anything else, but football and basketball have always been an exception.
I did many projects during the past academic year in order to use and explore my skill set. I have always loved contributing to open source because it is such a huge community of amazing developers who are always there to help you out. Apart from this, I worked on an OAuth2 implementation during my internship last summer, where I used Java to connect Google services like G-Drive, Hangouts and others with the company codebase. I have always been fascinated by cloud services, so I keep working with GCP, AWS, Azure and others.
What interested you about GSoC with InterMine?
To be honest, I never thought I would get an opportunity to work with a community like InterMine. But when I saw the list of projects, it intrigued me, and I was drawn to one very interesting project: single sign-in. The project requirements and the tech seemed very familiar to me, so I kept digging into the requirements and did lots of research. With every minute spent on this, my interest grew, and Eureka! I finally came up with a solution which helped me become a part of this amazing community.
Tell us about the project you’re planning to do for InterMine this summer.
In the current scenario, a user logs in to the desired InterMine and saves their results and the required data. The problem arises when the same user wants to access a different InterMine: he/she has to register again on this new mine and log in again. Currently, the InterMine community does not have a single common sign-in mechanism, so each mine authenticates users with tokens (temporary and permanent ones) or by using a Google service to log in. This project will modify the existing token mechanism by making InterMine an OAuth2 provider with a single common authorization server for all 30 mines, so that a user can access all the mines with a single set of credentials, i.e. just one registration.
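To make the "one authorization server, many mines" idea concrete, here is a minimal Python sketch. It is not the project's implementation and it is not a full OAuth2 flow: it only illustrates one central issuer signing a token that any mine can verify without keeping its own user database. The function names, the shared signing key and the token layout are all assumptions made for illustration.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical shared signing key; a real deployment would never hard-code it.
SECRET = b"demo-signing-key"


def issue_token(username, lifetime_seconds=3600, now=None):
    """Central authorization server (hypothetical): sign a small claims payload."""
    now = time.time() if now is None else now
    claims = {"sub": username, "exp": now + lifetime_seconds}
    payload = base64.urlsafe_b64encode(json.dumps(claims).encode())
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def verify_token(token, now=None):
    """Any mine (hypothetical): return the username if the token is valid, else None."""
    now = time.time() if now is None else now
    try:
        payload, signature = token.rsplit(".", 1)
    except ValueError:
        return None  # no "." separator, so not a token we issued
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid leaking how much of the signature matched.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if claims["exp"] < now:
        return None  # token has expired
    return claims["sub"]


# One registration and login at the central server...
token = issue_token("alice")
# ...and every mine can accept the same token.
print(verify_token(token))
```

In a real OAuth2 setup the mines would more likely validate tokens against the authorization server or with its public key, rather than share a secret as this sketch does.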
Are there any challenges you anticipate for your project? How do you plan to overcome them?
This project is related to security, and the most important part is that it is all about user credentials, which means a single wrong piece of logic or a single wrong step can expose our security. So implementing a fully secure system is a major challenge for this project.
I’m going to consider all the possible threats and vulnerabilities during the development phase of the system, and will focus on lots of testing and debugging in search of any kind of loophole, fixing any that turn up before deployment.
TL;DR: Send us your awesome project ideas and/or volunteer to be a mentor!
We need your ideas!
GSoC 2019 has been announced, and as in 2017 and 2018, InterMine will be applying again to become a mentor organisation. This means we’re back at the “we want your project ideas!” phase – and we do! If you work with or use an InterMine and have ideas for its improvement – might it be something big enough for a student to work on for three months? Any of these types of ideas would be great:
An interesting exploratory project that answers a question – “is x likely to be possible or practical with InterMine?”
Fixing something that’s always bothered you – in 2018, we managed this with the Solr Search update!
A well-scoped application like the InterMine iOS app
In 2017 and 2018 we’ve had mentors from the community, including mentors who were previously GSoC students. Ideally, interested mentors should be known to us, perhaps because you are an InterMine user, developer/maintainer/administrator, or a previous student. If you don’t have any project ideas of your own, you may be able to pick one from the project list that suits your skills and interests.
What is mentoring like, you might ask? We have the basics set out in our mentor terms and conditions (which isn’t as dry as the title suggests). Some things to note:
The busiest period is the application phase, when multiple students will be interacting with you to learn about and contribute to the community.
After this, things calm down a lot. You’re expected to meet (virtually) with your student at least once a week for the three months of the coding phase. Proactive students often only need an hour or two here or there, but other students may need more hands-on attention.
Student wages are paid by Google. Mentors also get a small stipend and thank-you gift after GSoC is complete.
You’ll be paired with a Cambridge-based mentor for support, guidance, and cover while on vacation.
A great opportunity all around
GSoC as a program ends up being an incredibly valuable two-way exchange. Students get three months of paid work experience at an open source organisation, and on the other side InterMine and InterMine mentors end up with the chance to guide projects and see some truly fantastic work implemented. Promising students might even end up applying for vacancies when they come up. It’s a great way to broaden your community!
One of the goals of Google Summer of Code (GSoC) is to help turn students into long-term open source maintainers and contributors. I suspect we’ve managed this with our current batch of students, who have contributed to our projects across a broad range of topics, whether it was querying InterMine using natural language sentences, updating our search capabilities (both UIs and search backends), or adding new features to the InterMine Python client.
From the start of the application process, our fabulous pool of applicants spent time interacting with each other and even helping each other out before anyone had been officially accepted. We received numerous PRs, tickets, and suggestions on our GitHub repos, and for this year we had returning GSoC mentors who previously had been students. It’s almost hard to believe we hadn’t participated before 2017, seeing all of the great work and enthusiasm GSoC brings, all while being able to pay students for their time and give them valuable work experience.
To wrap up this year’s great set of projects we had a community call [agenda & notes here] where our students presented their work in roughly 5 minute slots. You can catch up on each of the recorded presentations in our GSoC 2018 playlist, or here are direct links to each of the videos:
Currently InterMine uses the Apache Lucene (v3.0.2) library to index the data and provide a keyword-style search over all data. The goal of this project is to introduce Apache Solr in InterMine so that indexing and searching can happen even quicker. Unlike Lucene, which is a library, Apache Solr is a separate server application, similar to a database server. We set up and configure Solr (v7.2.1) independently from the application, and use Solr clients to communicate between the application and the Solr instance.
Here, SolrJ (v7.2.1), a Java client for Solr, is used to communicate between InterMine and Solr. We also removed the bobo facet library, which was used with Lucene, since Solr itself provides faceted search. The implementation has been designed so that InterMine is not heavily coupled with Solr. If you want to change your search engine to something else in future, you just have to write different implementations of the interfaces defined.
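The decoupling described above can be sketched as follows. This is only an illustration in Python (InterMine itself is written in Java), and the class and method names are hypothetical rather than the interfaces the project defines. The calling code depends on an abstract search handler, so a Solr-backed class could be swapped for any other engine; here a toy in-memory engine stands in.

```python
from abc import ABC, abstractmethod


class KeywordSearchHandler(ABC):
    """Hypothetical interface: the only thing the webapp code depends on."""

    @abstractmethod
    def index(self, documents):
        """Add a list of dict documents to the search index."""

    @abstractmethod
    def search(self, query):
        """Return the list of documents matching the query string."""


class InMemorySearchHandler(KeywordSearchHandler):
    """Toy stand-in for a real engine; a Solr-backed class would go here."""

    def __init__(self):
        self._docs = []

    def index(self, documents):
        self._docs.extend(documents)

    def search(self, query):
        if query == "*:*":  # match-all query, as in the Solr syntax
            return list(self._docs)
        needle = query.lower()
        return [
            doc for doc in self._docs
            if any(needle in str(value).lower() for value in doc.values())
        ]


def run_search(handler, query):
    # Calling code only knows about the interface, not the engine behind it.
    return handler.search(query)


handler = InMemorySearchHandler()
handler.index([{"symbol": "zen", "type": "Gene"}, {"symbol": "eve", "type": "Gene"}])
print(run_search(handler, "zen"))
```

Changing the search engine then means writing another subclass and passing it to the same calling code.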
Currently the search index and the autocomplete index processes use Solr to index the data. Indexing time has improved significantly compared with previous indexing times. For example, with Lucene FlyMine takes around 1900 seconds (32 mins) to index the data, but with Solr it takes only 1250 seconds (21 mins), which is a 34% reduction in time. Query time has also improved: a query of “*:*” in FlyMine used to take around 30-40 seconds, and with Solr it takes less than 1 second. Previously, with Lucene, the indexed data had to be retrieved from the database during the first search after starting the webapp. This took some time, but with Solr this is not the case and the results are returned instantly.
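For readers who want to check the arithmetic behind the 34% figure, a short Python calculation using only the two timings quoted above (the helper name is just for illustration):

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


lucene_seconds = 1900  # FlyMine indexing time with Lucene, from the text
solr_seconds = 1250    # FlyMine indexing time with Solr, from the text

reduction = percent_reduction(lucene_seconds, solr_seconds)
print(f"{lucene_seconds / 60:.0f} mins -> {solr_seconds / 60:.0f} mins")  # 32 mins -> 21 mins
print(f"{reduction:.1f}% reduction")  # 34.2% reduction
```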
In addition to the above, two web services have been implemented. The Facet service returns only the facet counts for a particular query rather than returning all the results. The Facet List service is similar, but returns all the facets available in a mine. It is useful when you want to know all the facets in a mine before you run an actual search.
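As a rough illustration of what the two services return, here is a small Python sketch. It does not call the real web services: the facet names, the sample documents and the function names are assumptions, and the counting is done in memory rather than by Solr.

```python
from collections import Counter

# Hypothetical facet fields; a real mine defines its own in its configuration.
CONFIGURED_FACETS = ["Category", "Organism"]


def facet_list():
    """Facet List idea: every facet available, without running a search."""
    return list(CONFIGURED_FACETS)


def facet_counts(matching_docs, facets=CONFIGURED_FACETS):
    """Facet idea: counts per facet value for a query, not the results themselves."""
    counts = {facet: Counter() for facet in facets}
    for doc in matching_docs:
        for facet in facets:
            value = doc.get(facet)
            if value is not None:
                counts[facet][value] += 1
    return {facet: dict(counter) for facet, counter in counts.items()}


# Made-up documents standing in for the hits of some query.
docs = [
    {"Category": "Gene", "Organism": "D. melanogaster"},
    {"Category": "Gene", "Organism": "H. sapiens"},
    {"Category": "Protein", "Organism": "D. melanogaster"},
]
print(facet_list())
print(facet_counts(docs))
```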
All these changes were made against InterMine version 2.0. They will be included in an InterMine release in the near future, but those who want to try them immediately can check out this branch on GitHub and follow these instructions. All these changes have been tested with Apache Solr (v7.2.1).
GSoC 2018 is coming to an end, and after 3 months of hard work, I can proudly present a summary of all our achievements during this summer of code.
Summary of Project Goals
InterMine is an open source data warehouse intended to be used for the integration and analysis of complex biological data. With InterMine, you can explore organism and other research data provided by different organizations, moving between databases using criteria such as homology.
The existing query builder in InterMine requires some experience to obtain the desired data in a mine, which can become overwhelming for new users. For instance, a user interested in searching data in HumanMine using its query builder would need to browse through the different classes and attributes, choosing between the available fields and adding constraints on each of them, in order to get the desired output.
For example, a simple query you might want to glean from InterMine might be as follows:
Problem: This query sounds simple-ish, but building it in our query builder requires a strong familiarity with the data model, and can be confusing for anyone new to InterMine. We would like the data browser to be more powerful than a simple keyword search, but less complex than the current query builder. For context, here’s the HumanMine query builder: http://www.humanmine.org/humanmine/customQuery.do. We have attached a screenshot of what it looks like for the homology query mentioned above, where it can be seen why it looks a little intimidating.
This requires the user to have a decent knowledge of the model schema in order to build a correct query that returns the expected results. For new users, searching for specific information in the data this way can indeed become overwhelming.
For this reason, the goal of this project is to implement a faceted search tool to display the data from the InterMine database, allowing users to search easily within the different mines available across InterMine, without requiring extensive knowledge of the data model.
Summary of Project Achievements
In order to maintain a good workflow, the project was divided into three major versions or milestones, with the deadline of each one coinciding with a GSoC evaluation phase. The main developments in each milestone are listed below, and comprise a total of 67 closed issues with 194 commits.
Milestone 1 (June 11). In the first milestone (related GitHub issue here), the following features were added:
In the following figure, the different elements available on the browser interface are displayed and further explained.
As depicted above, the user is able to switch between the different mines available in the InterMine registry by using the dropdown box in (1). Next, in (2) the viewed class can be changed; currently it can be either Genes or Proteins. Moreover, in (3), the different filters available in the currently explored mine are displayed, where the user can filter the data shown in the table at (6) to better fit his/her requirements. Furthermore, some plots of the data in the table are displayed in (4). Currently it shows a pie chart of individuals per organism, but it will be extended with more plots in the future. Finally, the user has the option to save the table as a list, generate the code to embed it elsewhere, or export the results by using the options in (5).
There are already some features planned for the browser after GSoC. One is to allow users to add their personal InterMine API tokens for each mine and use them for the Save as List functionality of the table (link). Another useful feature, which I wasn’t able to implement because of a temporary disabling in the InterMine ontology, is a Phenotypes filter (link). Next, a new histogram plot of gene length will be added to the top section (link). Furthermore, a “current class” filter will be added to the sidebar (link). Finally, another desirable feature would be to refactor the per-mine filters to use path queries (link).
In conclusion, the fact that the final product has been tested and is going to be truly helpful for its target community of users is enough for me to be proud of the tool developed during this Google Summer of Code. The results of this project will also, hopefully, allow us to publish a paper describing the new InterMine browser.