
HumanMine – Identifier switch

Currently, genes in HumanMine (humanmine.org), the human InterMine, have the Ensembl gene identifier (e.g. ENSG00000000003) as the “primary” identifier and the NCBI gene identifier (e.g. 7105) as the “secondary” identifier. In the next release of HumanMine, the two will be switched: the NCBI gene identifier will become the “primary” identifier and the Ensembl gene identifier the “secondary” one.

It is a small change, but it may affect your lists of genes. Please contact us if you are worried. We will keep the current version of HumanMine available for the next few months, just in case.

Why not just use both identifier schemes?

This is what we have done, and will continue to do. The problem is that the two organisations do not agree completely on the genome annotation: what Ensembl considers a gene may not be considered a gene by the NCBI. In fact, the relationship is many-to-many. Some NCBI gene identifiers map to zero, one or several Ensembl identifiers. Conversely, some Ensembl identifiers map to zero, one or several NCBI gene identifiers.
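
To make the many-to-many relationship concrete, here is a minimal Python sketch that counts how many identifiers on each side have zero, one or several partners. It is only an illustration, not HumanMine's loader code: the function name `count_cardinalities` is hypothetical, and apart from the two example IDs quoted above, every ID is a made-up placeholder.

```python
from collections import defaultdict


def count_cardinalities(pairs, ncbi_ids, ensembl_ids):
    """Return two dicts counting IDs with zero, one, or several partners."""
    ncbi_to_ensembl = defaultdict(set)
    ensembl_to_ncbi = defaultdict(set)
    for ncbi_id, ensembl_id in pairs:
        ncbi_to_ensembl[ncbi_id].add(ensembl_id)
        ensembl_to_ncbi[ensembl_id].add(ncbi_id)

    def bucket(ids, mapping):
        counts = {"zero": 0, "one": 0, "several": 0}
        for gene_id in ids:
            n = len(mapping.get(gene_id, ()))
            counts["zero" if n == 0 else "one" if n == 1 else "several"] += 1
        return counts

    return bucket(ncbi_ids, ncbi_to_ensembl), bucket(ensembl_ids, ensembl_to_ncbi)


# The first pair reuses the two example IDs quoted in the post;
# the N*/E* IDs are made-up placeholders, not real genes.
pairs = [
    ("7105", "ENSG00000000003"),
    ("N1", "E1"), ("N1", "E2"),  # one NCBI gene, several Ensembl IDs
    ("N2", "E3"), ("N3", "E3"),  # one Ensembl ID, several NCBI genes
]
ncbi_ids = ["7105", "N1", "N2", "N3", "N4"]                 # N4 has no Ensembl partner
ensembl_ids = ["ENSG00000000003", "E1", "E2", "E3", "E4"]   # E4 has no NCBI partner

ncbi_counts, ensembl_counts = count_cardinalities(pairs, ncbi_ids, ensembl_ids)
print(ncbi_counts)
print(ensembl_counts)
```

Counting in both directions matters because an identifier can be unambiguous from one side and ambiguous from the other.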

Why did you pick Ensembl identifiers?

There are a lot of quality data sets that use Ensembl identifiers. Not using Ensembl identifiers means that we may lose information from these valuable studies.

Why did you switch to using NCBI identifiers?

We are part of a BD2K pilot for the NIH Commons project involving six major model organism databases: fly (FlyBase), mouse (MGI), rat (RGD), worm (WormBase), yeast (SGD) and zebrafish (ZFIN). All of these model organism databases use NCBI identifiers for human genes. For interoperability, we decided to use NCBI identifiers as well.

What were the final numbers? How much data was “lost” or gained?

Total genes

Data source | Loaded into HumanMine database | Both NCBI and Ensembl identifier | HGNC symbol
Ensembl     | 61,817                         | 30,137                           | 38,124
NCBI        | 59,613                         | 23,016                           | 39,670

  • Only 36 NCBI genes do not have a corresponding HGNC symbol.
  • There were 94 Ensembl identifiers assigned to more than one NCBI gene.
  • There were approximately 100 NCBI genes associated with more than one Ensembl identifier. In these cases, we did not assign any of the Ensembl IDs as the gene's identifier. Instead, we stored them as “synonyms” so users can still search for and find the relevant genes.
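
The rule in the last bullet can be sketched in Python as follows. This is an assumption-laden illustration rather than the actual HumanMine build code: the `GeneRecord` class, its field names and `build_record` are hypothetical, and the sketch covers only the stated case (an NCBI gene with more than one Ensembl identifier). How Ensembl identifiers shared by several NCBI genes are handled is not described here, so it is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GeneRecord:
    """Hypothetical gene record after the switch (NCBI ID is primary)."""
    primary_identifier: str                      # NCBI gene ID
    secondary_identifier: Optional[str] = None   # Ensembl ID, only when unambiguous
    synonyms: List[str] = field(default_factory=list)


def build_record(ncbi_id, ensembl_ids):
    """Assign the Ensembl ID as an identifier only if there is exactly one."""
    unique_ids = sorted(set(ensembl_ids))
    if len(unique_ids) == 1:
        return GeneRecord(ncbi_id, secondary_identifier=unique_ids[0])
    # Zero or several Ensembl IDs: none becomes the identifier.
    # Several IDs are kept as synonyms so the gene stays searchable.
    return GeneRecord(ncbi_id, synonyms=unique_ids)


# "7105" and "ENSG00000000003" are the example IDs from the post;
# "N1", "E1" and "E2" are made-up placeholders.
unambiguous = build_record("7105", ["ENSG00000000003"])
ambiguous = build_record("N1", ["E2", "E1"])
print(unambiguous)
print(ambiguous)
```

Keeping the ambiguous Ensembl IDs as synonyms means a search for any of them still reaches the gene, even though none of them is treated as its identifier.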

Why do I care?

If you have a saved list that uses Ensembl IDs, some of its data may be lost in the new release. We will keep the current version of HumanMine available for the next few months, so you are not in danger of losing any of your saved data.
