HGNC as use case of multiple identifier complexities #19

@jmcmurry

HGNC is an example collection with four co-occurring identifier complexities:

1. Ambiguity about what $id even is.

[Screenshot, 2016-04-22: identifiers.org record for HGNC]

The identifiers.org record above captures the fact that HGNC records exist in third-party databases, but identifiers.org doesn't have a strong concept of a prefix. Consequently, it isn't possible to get to both "physical locations" of the entity using a single (equivalent) $id: in one case $id is prefixed, and in the other it is not. HGNC, mercifully, honors both forms. However:

  1. Other data providers may not be as forgiving as HGNC is.
  2. More often than not, variation in the local ID pattern is precisely what the data provider relies on to redirect to the right type-specific path.

A stronger notion of prefix is the simplest thing that would help data integrators collapse the following HTTP identifiers as equivalent, since 2674 is the invariant part of the ID.

Given the identifiers.org data model, there is no way to determine whether http://identifiers.org/hgnc/hgnc:2674 points to the same entity as http://identifiers.org/hgnc/2674. This is why I favor developing a bare-CURIE-based resolver like http://n2t.net/hgnc:2674 or, if identifiers.org is interested in doing so, http://identifiers.org/hgnc:2674

This would allow us to determine that all of these are talking about the same entity:

Authoritative sources:
Identifier resolvers:
Third party content providers
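
To make the "collapse as equivalent" idea concrete, here is a minimal sketch of what a data integrator could do once the prefix is treated as a first-class part of the ID. The function name `to_bare_curie` is hypothetical, and the only URL shapes it recognizes are the ones quoted in this issue; it is not how identifiers.org or n2t.net actually work internally.

```python
import re

# Hypothetical helper. Recognizes only the URL shapes quoted in this issue:
#   http://identifiers.org/hgnc/hgnc:2674
#   http://identifiers.org/hgnc/2674
#   http://n2t.net/hgnc:2674
#   http://identifiers.org/hgnc:2674
_HGNC_URI = re.compile(
    r"^https?://(?:identifiers\.org|n2t\.net)/"  # resolver host
    r"(?:hgnc/(?:hgnc:)?|hgnc:)"                 # namespace path and/or prefix
    r"(\d{1,5})$",                               # invariant numeric part
    re.IGNORECASE,
)


def to_bare_curie(uri):
    """Return 'hgnc:<number>' for a recognized HGNC URI, else None."""
    match = _HGNC_URI.match(uri)
    return f"hgnc:{match.group(1)}" if match else None


uris = [
    "http://identifiers.org/hgnc/hgnc:2674",
    "http://identifiers.org/hgnc/2674",
    "http://n2t.net/hgnc:2674",
    "http://identifiers.org/hgnc:2674",
]
# All four forms collapse to the same bare CURIE.
print({to_bare_curie(u) for u in uris})
```

Note that this only works because the sketch hard-codes the knowledge that `hgnc` is a prefix; that is exactly the information the current data model does not expose.
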

2. Multiple entity types (Genes and Gene families)

| Identifiers.org namespace | regex | URI |
| --- | --- | --- |
| hgnc | `^((HGNC or hgnc):)?\d{1,5}$` | http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=$id [Example: 2674] |
| hgnc.family | `^[A-Z0-9-]+(#[A-Z0-9-]+)?$` | http://www.genenames.org/genefamilies/$id [Example: PADI] |
| hgnc.symbol | `^[A-Za-z-0-9_]+(@)?$` | http://www.genenames.org/cgi-bin/gene_symbol_report?match=$id [Example: DAPK1] |

3. Multiple identifier types (alphanumeric symbol and numeric ID)

4. Type-specific URL patterns combined with lack of deterministic typing in local ID

Consequently, you have to know what you're looking at before you can know where to resolve it. Note that the lack of deterministic typing in the local ID is not a problem unless you need type-specific URLs the way HGNC does.
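
A rough sketch of the typing problem, using the three patterns and URL templates listed under point 2. One assumption: the "HGNC or hgnc" in the first pattern is read as the regex alternation `(HGNC|hgnc)`. The function name `candidate_urls` is made up for illustration.

```python
import re

# Patterns and URL templates as listed under point 2.
# Assumption: "HGNC or hgnc" is read as the alternation (HGNC|hgnc).
NAMESPACES = {
    "hgnc": (
        re.compile(r"^((HGNC|hgnc):)?\d{1,5}$"),
        "http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=$id",
    ),
    "hgnc.family": (
        re.compile(r"^[A-Z0-9-]+(#[A-Z0-9-]+)?$"),
        "http://www.genenames.org/genefamilies/$id",
    ),
    "hgnc.symbol": (
        re.compile(r"^[A-Za-z-0-9_]+(@)?$"),
        "http://www.genenames.org/cgi-bin/gene_symbol_report?match=$id",
    ),
}


def candidate_urls(local_id):
    """Map each namespace whose pattern accepts local_id to its expanded URL."""
    return {
        namespace: template.replace("$id", local_id)
        for namespace, (pattern, template) in NAMESPACES.items()
        if pattern.match(local_id)
    }


for local_id in ("2674", "PADI", "DAPK1", "hgnc:2674"):
    print(local_id, "->", sorted(candidate_urls(local_id)))
```

Under these patterns, `PADI` and `DAPK1` each match both hgnc.family and hgnc.symbol, and a bare `2674` matches all three; only the prefixed `hgnc:2674` is unambiguous. So the local ID alone does not tell you which of the type-specific URLs to use.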


Sorry to bug you @KrisGray, you're listed on the HGNC GitHub; could you comment on whether there's a single URL that can be used across the types of IDs in HGNC (family, symbol, numeric ID), so that we can address at least number 4 on the list?

cc: @timclark, @jkunze
