This is our blog series interviewing our 2018 Google Summer of Code students, who are working remotely for InterMine for 3 months on a variety of projects. We’ve interviewed Nupur Gunwant, who will be working on the InterMine Python Client.
Hi Nupur! We’re really excited to have you on board as part of the team this summer. Can you introduce yourself?
I am Nupur Gunwant, a student from IIT Kharagpur, India. I am pursuing an Integrated Masters degree in Mathematics and Computing. I am an open source enthusiast and a maths lover. I love to solve problems and talk about ideas. I firmly believe in the power of Python and admire its versatility.
Apart from that, I am a lover of art. I want to further pursue my studies in the intersection of my artistic and technical interests. And most importantly, I always carry a book wherever I am.
What interested you about GSoC with InterMine?
I was deeply intrigued by the work InterMine does and, as a student, I wanted to work with an organization with such a huge impact on society. Another thing that motivated me to prepare hard to work with InterMine as a student developer was the fact that it’s such a healthy and friendly community, where ideas are appreciated and one is always motivated to work on them. I think that made InterMine the most desirable place to work with.
Tell us about the project you’re planning to do for InterMine this summer.
I will be working on adding functionality to the Python Client, a very important part of InterMine at present. I will begin by creating a link between the InterMine Registry and the Python Client, so that the user can make use of the Registry features on the terminal.
Further, I will build a Query Manager that will be a key tool for performing operations on user queries from the terminal, and lastly, I will add visualisations to the Python Client using matplotlib.
Are there any challenges you anticipate for your project? How do you plan to overcome them?
The biggest challenge is to meet all the needs of the user for the Client in all three subparts of my project. I am planning to make community interactions and feedback the greatest source of review on my work, which, because of the community’s experience in user experience, should help a great deal in overcoming this problem.
Aman Dwivedi will be working on a Cross-InterMine search tool. This will use the registry to allow users to search multiple InterMines at once, and should be a good way to figure out which mine has the data you’re looking for. Aman will be mentored by Nadia Yudina, herself a graduate of last year’s InterMine+GSoC program.
Adrián Rodríguez Bazaga will be working on something we’ve always wanted: an InterMine data browser – hopefully a tool that will allow users to learn a bit more about data inside an InterMine without having to know the data model. Yay for easier learning curves! Adrián’s mentor will be Yo Yehudi.
Arunan Sugunakumar is going to explore hooking InterMine up to a more modern search package, probably Solr or ElasticSearch. Our current version of Lucene is very old, and we know there are better options out there! Daniela Butano will mentor this project.
Jake Macneal is going to work on a prototype to convert natural language questions into InterMine PathQuery – it would be exciting to have a user type “Show me all the genes associated with diabetes” into an InterMine, and get a sensible set of results back! Aaron Golden will mentor Jake.
Nupur Gunwant will be adding features to our Python client, such as registry communication, a query manager, and visualisations. Julie Sullivan will be Nupur’s mentor for this project.
Ankit Kumar Lohani will be working on Buzzbang – a search engine to crawl multiple biological sources including, but not exclusively, InterMine instances. Justin Clark-Casey will be Ankit’s mentor.
We’re also planning to post a short interview series highlighting each student and their plans for the summer. We can’t wait to get started!!
At InterMine, a life sciences data integration platform, we’re working on a BBSRC grant to make data available through InterMine ‘FAIR’. What does this mean? Well, firstly FAIR is an initiative to make data Findable, Accessible, Interoperable and Reusable (I’ve written a lot more about this here).
Taken on its face this is a bit woolly – isn’t InterMine data already FAIR? You can find data (type some text in its general search box or perform a structured query), access it (click the web link), interoperate with it (run a live query on its API) and reuse it (hey the data’s there, download it). Well, one of the great things about FAIR is that it has specific principles and recommendations on how to make data findable, accessible, interoperable and reusable. These place a heavy emphasis on uniformity so that software can much more easily use and combine data across the countless distinct data sources hosted by different organizations across the planet.
So in applying for the grant, how did we propose to apply these recommendations to InterMine? Essentially, we performed a gap analysis between the 15 guiding principles documented in the original FAIR paper and InterMine’s current capabilities, coming up with a plan for how we would bridge this gap.
Let’s take the first findability and accessibility FAIR guiding principles as an example:
F1. (meta)data are assigned a globally unique and persistent identifier
A1. (meta)data are retrievable by their identifier using a standardized communications protocol
One way to fulfil these principles, and something popular in the semantic web world, is to make identifiers be URLs. So great, InterMine already has URLs that have a 1-to-1 mapping to biological data objects! Search for the gene MYH7 in HumanMine for instance, and the report page you get back has this URL (stripping away some non-essential tracking information).
Look at another biological object and that ID number will change, since this is the internal ID used to track objects within an InterMine database.
But there’s a problem here. These ID numbers are not persistent, as required by principle F1. When the data in an InterMine installation like HumanMine is updated, this is not done additively; rather, the entire database is rebuilt, since data sources need to be integrated anew. And on this rebuild, MYH7 is no longer guaranteed to have the internal InterMine ID 1157771. In fact, it’s very likely to be different.
So part of our proposal was to implement a resolution to this problem. For InterMine as a data integration platform rather than a primary data provider it’s a very complex topic, particularly as we’re generic and model driven (so in principle you could host something completely different like a company database in InterMine!). I won’t delve into the possible solutions too much here, but at the moment it looks like a tradeoff between trying to make our internal ID persistent (e.g. by maintaining the mapping to biological objects between database rebuilds) and trying to incorporate external IDs such as MYH7 directly into the InterMine URL as specified by the InterMine instance operator, something like
We’ll be reporting more on this in the future.
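To make the tradeoff described above a little more concrete, here is a minimal Python sketch of the two options. Everything in it is an assumption for illustration: the function names, the `(class, external ID)` key, the example base URL and the URL path layout are all hypothetical, and none of it is InterMine’s actual implementation or planned design.

```python
from urllib.parse import quote


def carry_over_ids(previous_ids, rebuilt_keys, next_free_id):
    """Option 1 (hypothetical): keep internal IDs stable across a rebuild.

    previous_ids maps a stable key such as ("Gene", "MYH7") to the internal
    ID it had in the last build. Objects seen before keep that ID; new
    objects get fresh IDs counting up from next_free_id, which the caller
    must choose to be larger than any ID already in use.
    """
    new_ids = {}
    for key in rebuilt_keys:
        if key in previous_ids:
            new_ids[key] = previous_ids[key]
        else:
            new_ids[key] = next_free_id
            next_free_id += 1
    return new_ids


def external_id_url(base_url, class_name, external_id):
    """Option 2 (hypothetical): put an operator-chosen external ID in the URL.

    The path layout is invented for this example only.
    """
    path = f"{quote(class_name.lower())}/{quote(external_id, safe='')}"
    return f"{base_url.rstrip('/')}/{path}"


# Illustrative data: the previous build knew MYH7 under internal ID 1157771.
previous = {("Gene", "MYH7"): 1157771}
rebuilt = [("Gene", "MYH7"), ("Gene", "MYH6")]

ids = carry_over_ids(previous, rebuilt, next_free_id=2000000)
print(ids[("Gene", "MYH7")])
print(external_id_url("https://example.org/mine", "Gene", "MYH7"))
```

The first option needs a reliable stable key for every object, which is exactly the hard part for a generic, model-driven platform; the second pushes that choice to the instance operator.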
This was a fairly straightforward example. Some of the other principles, such as
I3. (meta)data include qualified references to other (meta)data
required more interpretation, and in our proposal we related actions broadly to the principles (i.e. whether they addressed one or more of findability, accessibility, etc.) rather than specific FAIR clauses.
However, we wrote our proposal some time ago. Things are moving rapidly and many of the original FAIR paper authors are working on the FAIR metrics initiative, which will measure FAIRness with programmatic and quantitative tests. I think this is a great step and now something for anybody looking to FAIRify their data resource to look at closely. We’ll be looking to apply these metrics to our own work as we continue development.
Our theme was around open science, with an activity designed to reinforce the idea that shared data (and therefore more data from different sources) results in better science. For adults we had a couple of great posters about the importance of data sharing, designed by Julie and Rachel. The posters are available freely online for re-use under a CC0 licence.
The Story: A party is rudely interrupted
Meanwhile, for kids (and some adults too!) we had a crime-solving activity. In our scenario, a dastardly fruit villain had stolen the passionfruit in the midst of an otherwise enjoyable soirée. In their haste to flee, the culprit knocked over a tin of blue paint, leaving tracks behind, as well as injuring themselves and leaving DNA evidence behind as they jumped out the window. We had four fruity suspects:
Solving the crime
Step 1: footprints in the paint
In order to solve the crime using science, our young detectives were invited to examine the footprints left by the culprit:
It was usually pretty easy to rule out the apple, and after thinking a little more, the strawberry could be ruled out too, but the orange and the lemon both looked rather similar.
Step 2: Juice found at the scene
Since the devilish thief had hurt themselves, we had samples to analyse. Our criminal investigators took strips of litmus paper and carefully examined the evidence:
Once again, the evidence wasn’t quite conclusive (and was very sticky). Still, it was fun! Let’s move on to the next bit of evidence…
Step 3: the skin
With sample fruits to compare, our enterprising criminologists got a step closer to the solution. Could the skin be from a lemon? Hmmm.
Step 4: We have samples, so let’s sequence the DNA!
Okay, so you may have guessed that we didn’t sequence the DNA of the suspects ourselves – but thankfully the lab had four profiles for us to compare to and they managed to quickly provide a DNA fragment from the crime scene evidence, too. This fragment was far more conclusive than the others, pointing unequivocally to the shadiest character of the bunch – Lithium Lemon.
Step 5: Putting the puzzle pieces together, and sabotage!
As our sleuths solved each different activity, we gave them a puzzle piece. At this stage they had four pieces of the puzzle, but they were still missing a couple of critical bits: the two central pieces. It turns out there had been some CCTV footage – but it had been stolen! After looking around, our vigilant investigators discovered where the crime scene video had been hidden (under the table) and managed to put the entire story together. Once again, shown front and centre of the puzzle was our suspect, Lithium Lemon.
While the shady character was hauled off in cuffs to the county jail, successful detectives were rewarded with candy, some awesome stickers, and a handout that had a child-oriented activity sheet on one side, with a small copy of our open knowledge posters on the other side, for the slightly more grown-up folks.
"Open Science is Better Science" – more stickers for our @camscience booths have arrived! We suspect these ones will appeal to #openscience adults as well as kids! 😄
Our tables were generally very busy, and the kids seemed to have a great time examining the evidence and putting together the puzzle pieces one by one. I’m not sure how many of them quite perceived the data sharing theme, but some of the adults definitely did, and appreciated the posters as well.
I think one of the biggest surprises for us was how busy we all were! Genetics had a steady flow of people, but the Guildhall had even more. We haven’t heard numbers for this year yet, but in 2017 apparently there were around 3,000 people. What that meant in practical terms for us: two tables with identical versions of the activity, two InterMine team members acting as detective wranglers at each table, and often two separate groups of people working through the activity simultaneously at each table. After several hours of this we were all ready for a nap! Next time, six staff might be better to allow people to have a breather.
We also learned to keep a good eye on our puzzles: five puzzles left the office on Sunday morning but only four returned. Hopefully the missing one will be cherished at someone’s house as a memory of a great activity…?
Our materials are open!
Given that our activity was designed to advocate openly sharing your science, we’ve shared our materials online too, and you’re welcome to re-use them.
We’ll be running a GSoC question and answer video call session where students can learn more about the specific projects. Updates about the exact date and time will go out on this blog, our mailing lists, and Twitter.
For today’s blog we’ve interviewed Basheer Becerra, an undergrad bioinformatics researcher at Illinois State University, about the application he’s created, FlyCAGE – (Correlation Analysis on Gene Expression). (Live Demo, GitHub)
Hi Basheer! We love hearing about new and interesting uses of InterMine. Can you give us a brief non-technical intro to FlyCAGE? What inspired you to make it?
FlyCAGE is a web application that allows users to search for genes in Drosophila melanogaster that follow a specific mRNA expression pattern. The user can either enter a known gene name to find other genes with similar expression profiles, or the user can enter a custom expression pattern based on experimental data to find genes that follow the pattern. FlyCAGE would be useful in identifying candidate genes involved in a given process and discovering regulatory interactions in genetic networks.
Tell us a little about yourself.
My name is Basheer Becerra, and I am currently an undergraduate junior at Illinois State University double majoring in Computer Science & Statistics and minoring in Biological Sciences. Programming is something I love doing, specifically web development and data science. What makes me even more excited is using programming and mathematics to answer difficult questions in biology! When I’m not on the computer, you will usually see me reading, training for upcoming marathons, or spending time with friends and family.
You’ve been a friendly presence in our Twitter feed for a while now. How did you hear about InterMine originally?
InterMine was originally introduced to me by my advisor, Dr. Nathan Mortimer, as a tool to help scientists quickly review information about any gene. While InterMine is a helpful tool for scientists to browse gene information, I’ve also realized that InterMine is incredibly useful for developers and data scientists. When my advisor and I came up with the idea of FlyCAGE, InterMine was chosen to be the best solution for retrieving data due to its data integration features and ease of use.
Can you tell us a bit about the technical implementation of FlyCAGE?
The technologies used to implement FlyCAGE include Spring Framework (Java) for the web back-end, Thymeleaf for template resolving, and HTML/CSS/Bootstrap and JS/jQuery for front-end development. After the expression information is extracted from the entered gene or pattern, Pearson’s correlation is performed on every gene stored in FlyBase for Drosophila melanogaster. Genes with the highest Pearson’s correlation coefficient relative to the input expression are returned to the user. InterMine plays a significant role in the operation of FlyCAGE since FlyMine is the only resource used to retrieve the gene information and mRNA expression data. With FlyMine’s modern HTTP API, only a single query is needed to retrieve all the necessary data for FlyCAGE to operate. Without FlyMine’s data integration, FlyCAGE would have to manually integrate data from several different data sources such as FlyBase, FlyAtlas, BDGP, etc., which would slow down development significantly.
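As a rough illustration of the ranking step described above, here is a minimal Python sketch that scores each gene’s expression profile against an input pattern with Pearson’s correlation and returns the best matches. It is not FlyCAGE’s code (which is written in Java): the function names, the gene names, the expression values and the choice to score flat profiles as 0 are all assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A flat profile has no defined correlation; scoring it 0 is a
        # choice made for this sketch.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(target, profiles, top_n=10):
    """Return the top_n (gene, score) pairs, most correlated first."""
    scores = [(gene, pearson(target, values)) for gene, values in profiles.items()]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_n]


# Made-up expression values across four conditions (not real FlyMine data).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [4.0, 3.0, 2.0, 1.0],
    "geneC": [2.0, 4.1, 5.9, 8.0],
}
target = [10.0, 20.0, 30.0, 40.0]

ranked = rank_by_correlation(target, profiles, top_n=2)
print(ranked)
```

Because Pearson’s correlation ignores scale and offset, a gene whose expression rises in step with the input pattern scores highly even if its absolute expression level is very different.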
What are your future plans for FlyCAGE? Are you going to expand to other organisms apart from flies?
As far as the program logic goes, the plan is to include more complex analysis of expression data, such as including other data features to help determine “gene similarity”, predict regulatory interactions and unknown gene functions, and explain subtle differences in gene pairs with correlated expression patterns. There has also been a lot of interest in expanding CAGE towards other organisms such as plant genomes. With InterMine’s standardized API across its several resources, I predict that scaling the functionality of CAGE towards other organisms should be a relatively feasible task.
FlyCAGE is currently in-alpha and can be accessed by this link. However, the link is likely to change as FlyCAGE gets close to releasing. If you would like to stay up-to-date with FlyCAGE or if you would like to help us with usability tests to improve FlyCAGE, please enter your email with this Google Form. If you’ve already looked at FlyCAGE and would like to send some feedback, send me a quick note at firstname.lastname@example.org. Any help is appreciated!
… and all through the lab, not an organism was stirring, not even a… crab?*
Emails and support: Just a quick blog reminder that the office will be pretty empty from around now until the second of January, so don’t be surprised if we take a while to reply to messages. Some of us may be in the office or working from home, but it’s pretty patchy over the holiday season, and I don’t think any of us will be answering emails between the 23rd and 26th of December, nor on the first of January.
Developer calls: There is no developer call this week (normally it would be scheduled for Thursday the 21st). I’m not sure at this point, but the call on the 4th of January may (or may not) be cancelled as well.
Be good, have fun, and we’ll see you next year!
*I can only apologise for the terrible rhyme. Apparently nothing rhymes with “Cambridge” except “drainage” (and even then it’s somewhat weak), so I tried “uni” and that only rhymed with “loony” and “goonie”. Finding rhymes for the word “office” was no better.