This is our blog series interviewing our 2018 Google Summer of Code students, who will be working remotely for InterMine for 3 months on a variety of projects. We’ve interviewed Ankit Kumar Lohani, who will be working on Buzzbang.
Hi Ankit! We’re really excited to have you on board as part of the team this summer. Can you introduce yourself?
Hello InterMiners, I’m Ankit Lohani, a final-year undergraduate student at the Indian Institute of Technology, Kharagpur, India. I will complete my undergraduate studies in a few months with a major in Chemical Engineering.
Right from my first year, I have been interested in robotics and programming, and my interest in this field has only grown with time. Initially, I spent over a year working on the hardware front, and then shifted to building path planners for our soccer-playing bots. Though my academic background is in a completely different field, that has only pushed me to work harder and to learn more. My interests lean towards natural language processing and information retrieval.
Apart from these, I love travelling and trekking. I am also planning to complete my third trek this summer, this time above 14,000 feet.
What interested you about GSoC with InterMine?
I had never worked on an open source project before, and I realised that GSoC is the best place to start learning and to see my work in action. Honestly, while looking for organisations I might be able to contribute to, I came across InterMine and the various projects listed there. The application domain of InterMine is very appealing, and I could relate to this organisation for two key reasons. First, my past internship was on information retrieval over clinicaltrials.gov data, where I touched upon topics like semantics, ontologies, UMLS (Unified Medical Language System), PubMed, and named-entity recognition for biological terms. Second, the technologies used in this project – Solr, Elasticsearch, Docker – are ones I am familiar with from my course and term projects. Beyond that, the project itself has unique potential to transform the way complex scientific data is organised on the internet.
Tell us about the project you’re planning to do for InterMine this summer.
My project, Buzzbang, is significantly different from other InterMine projects: it focuses on scraping all the data on the internet marked up with bioschemas.org and indexing it in a search tool, Apache Solr. So far, a basic scraping module and an indexing engine are up and running. I am planning to integrate Scrapy for crawling and indexing new paths, and to upgrade the Solr search tool. Towards the end of the project, I will make sure all the changes are reflected in the front-end as well.
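To give a feel for the extraction step described above, here is a minimal sketch of pulling bioschemas.org markup out of a page. Bioschemas annotations are commonly embedded as JSON-LD in `<script type="application/ld+json">` tags; this stdlib-only illustration (function and class names are my own, not Buzzbang's) extracts those blocks, which in the real pipeline a Scrapy spider would collect and push into Solr:

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects JSON-LD blocks, the usual carrier of bioschemas.org markup."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False   # True while inside a JSON-LD <script> tag
        self.items = []           # parsed JSON-LD objects found so far

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            self.items.append(json.loads(data))


def extract_bioschemas(html):
    """Return all JSON-LD objects embedded in an HTML page."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return parser.items


if __name__ == "__main__":
    page = """
    <html><head>
      <script type="application/ld+json">
        {"@context": "http://schema.org", "@type": "Sample", "name": "sample-1"}
      </script>
    </head><body>...</body></html>
    """
    for item in extract_bioschemas(page):
        print(item["@type"], item["name"])
```

In the actual project, each extracted object would become a document sent to Solr's update endpoint for indexing; this sketch only covers the markup-extraction half of that pipeline.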
Are there any challenges you anticipate for your project? How do you plan to overcome them?
I believe I might face some serious challenges with Scrapy. Making a generalised scraping tool looks easy when the data carries bioschemas.org markup, but the way that data is organised varies across domains, and crawling some of those domains might not be a simple task. Moreover, we are also planning to introduce some degree of parallel processing to this module. Though my focus will be on the EBI BioSamples domain, which should make my task easier, I will try to keep the crawler as general and powerful as I can. Additionally, I suspect I will need some help from the community in planning the architecture for the re-crawling and re-indexing part, and I am not yet sure what level of automation would be desirable there.