HGNC is an example collection with four co-occurring identifier complexities:
1. Ambiguity about what $id even is.
The identifiers.org record above captures the fact that HGNC records exist in third-party databases, but identifiers.org doesn't have a strong concept of a prefix; consequently, it isn't possible to reach both "physical locations" of the entity using a single (equivalent) $id. In one case $id is prefixed, and in the other it is not. HGNC, mercifully, honors both forms. However:
- Other data providers may not be as forgiving as HGNC is.
- More often than not, variation in the local ID pattern is precisely what the data provider relies on to redirect to the right type-specific path.
A stronger notion of prefix is the simplest thing that would help data integrators collapse the following into equivalent HTTP identifiers, since 2674 is the invariant part of the ID.
Given the identifiers.org data model, there is no way to determine whether http://identifiers.org/hgnc/hgnc:2674 points to the same entity as http://identifiers.org/hgnc/2674. This is why I favor developing a bare-CURIE-based resolver like http://n2t.net/hgnc:2674 or, if identifiers.org is interested in doing so, http://identifiers.org/hgnc:2674
This would allow us to determine that all of these are talking about the same entity:
Authoritative sources:
- http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=hgnc:$localid
- http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=$localid
Identifier resolvers:
- http://identifiers.org/hgnc/$localid
- http://identifiers.org/hgnc/hgnc:$localid
- http://identifiers.org/hgnc/HGNC:$localid
- http://n2t.net/HGNC:$localid
Third-party content providers:
- http://hgnc.bio2rdf.org/describe/?url=http://bio2rdf.org/hgnc:$localid
- https://monarchinitiative.org/resolve/HGNC:$localid
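To make the "collapse as equivalent" idea concrete, here is a minimal Python sketch that maps each of the URL forms listed above onto one bare CURIE. The function name `to_bare_curie`, the choice of `HGNC:` as the canonical prefix casing, the case-insensitive matching, and the `\d{1,5}` constraint on the local ID are all assumptions made for illustration; none of this is an existing resolver's behavior.

```python
import re
from typing import Optional

# URL forms copied from the lists above; "$localid" marks the invariant part.
URI_TEMPLATES = [
    "http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=hgnc:$localid",
    "http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=$localid",
    "http://identifiers.org/hgnc/$localid",
    "http://identifiers.org/hgnc/hgnc:$localid",
    "http://identifiers.org/hgnc/HGNC:$localid",
    "http://n2t.net/HGNC:$localid",
    "http://hgnc.bio2rdf.org/describe/?url=http://bio2rdf.org/hgnc:$localid",
    "https://monarchinitiative.org/resolve/HGNC:$localid",
]


def _template_to_regex(template: str):
    """Turn a URL template into a regex that captures the numeric local ID."""
    head, _, tail = template.partition("$localid")
    # Assumption: the numeric HGNC local ID is 1 to 5 digits.
    return re.compile(re.escape(head) + r"(\d{1,5})" + re.escape(tail),
                      re.IGNORECASE)


_PATTERNS = [_template_to_regex(t) for t in URI_TEMPLATES]


def to_bare_curie(uri: str) -> Optional[str]:
    """Return 'HGNC:<localid>' if the URI matches a known form, else None."""
    candidate = uri.strip()
    for pattern in _PATTERNS:
        match = pattern.fullmatch(candidate)
        if match:
            return f"HGNC:{match.group(1)}"
    return None
```

With this, `to_bare_curie("http://identifiers.org/hgnc/hgnc:2674")` and `to_bare_curie("http://identifiers.org/hgnc/2674")` yield the same string, which is the equivalence the current data model cannot express. Note that the sketch only works because the prefix and the list of URL forms are hard-coded by hand.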
2. Multiple entity types (Genes and Gene families)
| Identifiers.org namespace | regex | URI |
|---|---|---|
| hgnc | ^((HGNC\|hgnc):)?\d{1,5}$ | http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=$id [Example: 2674] |
| hgnc.family | ^[A-Z0-9-]+(#[A-Z0-9-]+)?$ | http://www.genenames.org/genefamilies/$id [Example: PADI] |
| hgnc.symbol | ^[A-Za-z-0-9_]+(@)?$ | http://www.genenames.org/cgi-bin/gene_symbol_report?match=$id [Example: DAPK1] |
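The three patterns in the table overlap, which a short Python check makes visible. This is only a sketch: the regexes are copied from the table (reading "HGNC or hgnc" as a regex alternation), and the helper name `matching_namespaces` is made up for illustration.

```python
import re

# Patterns taken from the table above, keyed by identifiers.org namespace.
NAMESPACE_PATTERNS = {
    "hgnc": re.compile(r"^((HGNC|hgnc):)?\d{1,5}$"),
    "hgnc.family": re.compile(r"^[A-Z0-9-]+(#[A-Z0-9-]+)?$"),
    "hgnc.symbol": re.compile(r"^[A-Za-z-0-9_]+(@)?$"),
}


def matching_namespaces(local_id: str) -> list:
    """List every namespace whose pattern accepts the given local ID."""
    return [
        namespace
        for namespace, pattern in NAMESPACE_PATTERNS.items()
        if pattern.match(local_id)
    ]


for example in ("2674", "PADI", "DAPK1", "HGNC:2674"):
    print(example, "->", matching_namespaces(example))
```

Under these patterns a bare `2674` is accepted by all three namespaces, and `PADI` and `DAPK1` are each accepted by both `hgnc.family` and `hgnc.symbol`; only the prefixed form `HGNC:2674` is unambiguous.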
3. Multiple identifier types (alphanumeric symbol and numeric ID)
4. Type-specific URL patterns combined with lack of deterministic typing in local ID
Consequently, you have to know what you're looking at before you can know where to resolve it. Note that the lack of deterministic typing in the local ID is not a problem unless you need type-specific URLs the way HGNC does.
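As a rough Python illustration of "know what you're looking at first": the resolver below cannot pick a URL from the local ID alone, so the caller has to pass the type in. The URL templates come from the table above; the function name `resolve` and its signature are hypothetical.

```python
# Type-specific URL templates, copied from the table above.
URL_TEMPLATES = {
    "hgnc": "http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=$id",
    "hgnc.family": "http://www.genenames.org/genefamilies/$id",
    "hgnc.symbol": "http://www.genenames.org/cgi-bin/gene_symbol_report?match=$id",
}


def resolve(namespace: str, local_id: str) -> str:
    """Build the provider URL; the namespace must be supplied by the caller."""
    try:
        template = URL_TEMPLATES[namespace]
    except KeyError:
        raise ValueError(f"unknown namespace: {namespace!r}") from None
    return template.replace("$id", local_id)


# The same local ID lands on different pages depending on the declared type.
print(resolve("hgnc.family", "PADI"))
print(resolve("hgnc.symbol", "PADI"))
```

A single URL pattern that worked across all three types would remove the need for the `namespace` argument altogether.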
Sorry to bug you @KrisGray, you're listed on the HGNC GitHub; could you comment on whether there's a single URL that can be used across the types of IDs in HGNC (family, symbol, numeric ID), so that we can address at least number 4 on the list?
