diff --git a/README.md b/README.md index bd124e7..397e26f 100644 --- a/README.md +++ b/README.md @@ -105,7 +105,8 @@ The web server can be started after successful completion of the load. uvicorn --host 0.0.0.0 --port 8000 --workers 1 node_normalizer.server:app ``` -Then navigate to http://localhost:8000/docs to run the application +Then navigate to http://localhost:8000/docs to run the application. +Documentation for the [NodeNorm API](./documentation/API.md) is available. ### Webserver Docker container creation and execution Much like the Redis Docker container noted above, a Docker container can also be created and executed to run the webserver. diff --git a/documentation/API.md b/documentation/API.md new file mode 100644 index 0000000..5231d69 --- /dev/null +++ b/documentation/API.md @@ -0,0 +1,318 @@ +# NodeNorm API + +The NodeNorm API includes many API endpoints that cover normalization of identifiers, TRAPI messages +and identifier sets, as well as endpoints to retrieve allowed conflations, semantic types and CURIE +prefixes. The [NodeNorm FastAPI Documentation](https://nodenormalization-sri.renci.org/docs) includes +information about the parameters for calling each endpoint, but this document will describe the +intended function of each endpoint, suggestions for use and descriptions of the JSON documents returned. + +## Identifier/Node Normalization + +### `/get_normalized_nodes` + +* Method: [GET](https://nodenormalization-sri.renci.org/docs#/default/get_normalized_node_handler_get_normalized_nodes_get) + * Parameters: + * `curie` (e.g. `curie=MESH:D014867&curie=NCIT:C34373`): The identifiers to normalize. + * `conflate` (e.g. `conflate=true`): Whether to apply GeneProtein conflation. + * `drug_chemical_conflate` (e.g. `drug_chemical_conflate=true`): Whether to apply DrugChemical conflation. + * `description` (e.g. `description=false`): Whether to include descriptions for nodes that have descriptions. + * `individual_types` (e.g. 
`individual_types=true`): When returning a conflated result, should Biolink types be + returned for each individual identifier. +* Method: [POST](https://nodenormalization-sri.renci.org/docs#/default/get_normalized_node_handler_post_get_normalized_nodes_post) + * POST Body: A JSON object with the same parameters as the GET method, with a `curies` list instead of individual + `curie` entries. + +Example output: + +```json +{ + "MESH:D014867": { + "id": { + "identifier": "CHEBI:15377", + "label": "Water", + "description": "An oxygen hydride consisting of an oxygen atom that is covalently bonded to two hydrogen atoms" + }, + "equivalent_identifiers": [ + { + "identifier": "CHEBI:15377", + "label": "water", + "description": "An oxygen hydride consisting of an oxygen atom that is covalently bonded to two hydrogen atoms", + "type": "biolink:SmallMolecule" + }, + { + "identifier": "UNII:059QF0KO0R", + "label": "WATER", + "type": "biolink:SmallMolecule" + }, + { + "identifier": "PUBCHEM.COMPOUND:962", + "label": "Water", + "type": "biolink:SmallMolecule" + }, + [...] + ], + "type": [ + "biolink:SmallMolecule", + "biolink:MolecularEntity", + "biolink:ChemicalEntity", + "biolink:PhysicalEssence", + "biolink:ChemicalOrDrugOrTreatment", + "biolink:ChemicalEntityOrGeneOrGeneProduct", + "biolink:ChemicalEntityOrProteinOrPolypeptide", + "biolink:NamedThing", + "biolink:PhysicalEssenceOrOccurrent" + ], + "information_content": 47.7 + } +} +``` + +* Output values: the output is a dictionary with queried CURIEs as the keys and with JSON objects + as the values, containing the following keys: + * `id`: A JSON object that provides the preferred identifier and labels for this clique. + * `identifier`: The preferred CURIE for this clique. Every Biolink class includes a list of + preferred prefixes (e.g. + [valid ID prefixes for SmallMolecule](https://biolink.github.io/biolink-model/SmallMolecule/#valid-id-prefixes)), + and this is used to choose the preferred CURIE for this clique. 
+ * `label`: The preferred label for this clique. Note that this is not necessarily the label associated with the + preferred CURIE: for some classes (such as chemicals), we choose the best label in a different prefix order than + the Biolink Model preferred prefix order, based on which sources tend to have the best labels. + * `description`: One of the descriptions for the identifiers within this clique. + * `equivalent_identifiers`: a list of identifiers that are part of this clique given the conflation options. + Each identifier includes an `identifier` (a CURIE), a `label` (which corresponds to the label of the CURIE as per + its authoritative source), a `description` (currently only taken from UberGraph), and (if `individual_types` is set) + the Biolink type of each identifier. This list is ordered in the Biolink Model's preferred prefix order for this class. + * `type`: The list of Biolink classes for this clique, starting with the most specific type (in this example, + `biolink:SmallMolecule`), and ending with any mixins that include this class. + * `information_content`: the information content value between 0 and 100. This is calculated by retrieving the + [normalized information content value](https://github.com/INCATools/ubergraph/?tab=readme-ov-file#graph-organization) + for each identifier that is present in UberGraph, and then calculating the lowest information content value of + any identifier in this clique for which UberGraph has an identifier value. According to UberGraph's documentation, + the normalized information content value is "Precomputed information content score for each ontology class, based + on the count of terms related to a given term via rdfs:subClassOf or any existential relation. The scores are + xsd:decimal values scaled from 0 to 100 (e.g., a very specific term with no subclasses)." + * Internally, conflation is represented as sets of cliques that should be combined when that conflation is turned on. 
+ This means that a conflated clique will be represented by a single list of equivalent identifiers, starting with the + equivalent identifiers from the first clique, followed by the equivalent identifiers from the second clique, and so + on. There is currently no way to retrieve the clique leaders (although + [this is a requested feature](https://github.com/TranslatorSRI/NodeNormalization/issues/320)), but you can use the + `individual_types` parameter to get a Biolink type for each identifier. + +## Sets + +### `/get_setid` + +This endpoint is used to calculate a `set ID` for a set of CURIEs. CURIEs that can be normalized will +be normalized (using the conflation settings provided), and those that can't be will be left as is. +Duplicate normalized CURIEs will be removed, even if two distinct CURIEs were passed to this endpoint +but were normalized to the same CURIE. CURIEs will then be sorted in alphabetical order and a hash +generated as a set ID for that set of CURIEs. A set ID is therefore unique to the set of normalized CURIEs for the CURIEs +passed in. + +* Method: [GET](https://nodenormalization-sri.renci.org/docs#/default/get_setid_get_setid_get) + * Parameters: + * `curie` (example: `curie=MESH:D014867&curie=NCIT:C34373`): The CURIEs to normalize as a set. + * `conflation` (optional, example: `conflation=GeneProtein&conflation=DrugChemical`): The conflations to apply. +* Method: [POST](https://nodenormalization-sri.renci.org/docs#/default/get_setid_get_setid_post) + * POST Body: a JSON string representing a list of sets, where each set consists of: + * `curies` (e.g. `["MESH:D014867", "NCIT:C34373"]`): A list of CURIEs to normalize as a set. + * `conflations` (optional, e.g. `["GeneProtein", "DrugChemical"]`): A list of conflations to apply. + +Example output: note that the GET method will return a single object, while the POST method will +return a list that corresponds to the list of sets sent to this endpoint for normalization. 
+ +```json +[ + { + "curies": [ + "NCIT:C34373", + "MESH:D014867", + "UNII:63M8RYN44N", + "RUBBISH:1234" + ], + "conflations": [ + "GeneProtein", + "DrugChemical" + ], + "error": null, + "normalized_curies": [ + "CHEBI:15377", + "MONDO:0004976", + "RUBBISH:1234" + ], + "normalized_string": "CHEBI:15377||MONDO:0004976||RUBBISH:1234", + "setid": "uuid:771d3c09-9a8c-5c46-8b85-97f481a90d40" + } +] +``` + +Output values: +* `curies`: The list of CURIEs passed to this endpoint for normalization. +* `conflations`: The list of conflations to apply as passed to this endpoint. +* `error`: Any error that occurred when normalizing this set. Note that a CURIE that cannot be normalized + does not count as an error. +* `normalized_curies`: The list of unique normalized CURIEs used to construct the setid. +* `normalized_string`: The normalized CURIEs joined with `||`, as shown in the example above. +* `setid`: The setid calculated for this set. + +## Status + +### [/status](https://nodenormalization-sri.renci.org/docs#/default/status_get_status_get) + +This endpoint can be used to find out about the NodeNorm service and the underlying Redis databases. +It can be useful to confirm whether the databases are fully loaded and how much memory is being used. + +* Methods: GET only +* No parameters. + +Example output: +```json +{ + "status": "running", + "babel_version": "2025mar31", + "babel_version_url": "https://github.com/TranslatorSRI/Babel/blob/master/releases/2025mar31.md", + "databases": { + "eq_id_to_id_db": { + "dbname": "id-id", + "count": 677731045, + "used_memory_rss_human": "68.83G", + "is_cluster": false + }, + [...] + } +} +``` + +Output values: + +* `status` (example: `running`): Whether or not the service is running. +* `babel_version` (example: `2025mar31`): The version of [Babel](https://github.com/TranslatorSRI/Babel) used to generate + the cliques being presented. These are usually date-based versions indicating approximately when the Babel build was + completed. 
+* `babel_version_url` (example: https://github.com/TranslatorSRI/Babel/blob/master/releases/2025mar31.md): A URL you + can use to learn more about this version of Babel, and how it differs from previous and future versions. +* `databases`: A dictionary of Redis key-value databases used by this NodeNorm instance (currently: 7). Each database + uses the internal name of this database as its key, along with the following information: + * `dbname`: A second name used for this database. + * `count`: The number of keys in this database. + * `used_memory_rss_human`: the `used_memory_rss_human` value returned by this Redis database, described + [in the Redis documentation](https://redis.io/docs/latest/commands/info/) as "Human readable representation of + [Number of bytes that Redis allocated as seen by the operating system (a.k.a resident set size). This is the number reported by tools such as top(1) and + ps(1)]." + * `is_cluster`: Whether this database is being used as part of a cluster or as a single node database. + +## Informational endpoints + +### `/get_allowed_conflations` + +Returns a list of the supported conflations. + +* Method: [GET](https://nodenormalization-sri.renci.org/docs#/default/get_conflations_get_allowed_conflations_get) +* No parameters. + +Example output: + +```json +{ + "conflations": [ + "GeneProtein", + "DrugChemical" + ] +} +``` + +### `/get_semantic_types` + +Returns a list of all the Biolink types/classes that this instance of NodeNorm has at least one identifier for. + +* Method: [GET](https://nodenormalization-sri.renci.org/docs#/default/get_semantic_types_handler_get_semantic_types_get) + +Example output: + +```json +{ + "semantic_types": { + "types": [ + "biolink:NucleicAcidEntity", + "biolink:ActivityAndBehavior", + "biolink:PhysicalEssence", + "biolink:StudyPopulation", + "biolink:PhysicalEssenceOrOccurrent", + "biolink:GenomicEntity", + "biolink:Protein", + "biolink:Event", + [...] 
+ ] + } +} +``` + +### `/get_curie_prefixes` + +Returns a list of CURIE prefixes for zero or more Biolink types, as well as the number of identifiers for each prefix. + +These counts are generated when the Babel compendia are loaded into NodeNorm, and they have not been verified for +accuracy. The Babel reports are likely to be more reliable, but the two have not been checked against each other. + +* Method: [GET](https://nodenormalization-sri.renci.org/docs#/default/get_curie_prefixes_handler_get_curie_prefixes_get) + * Parameters: + * `semantic_type` (optional, e.g. `semantic_type=biolink:ChemicalEntity&semantic_type=biolink:AnatomicalEntity`) + * Without a `semantic_type`, every semantic type is returned. +* Method: [POST](https://nodenormalization-sri.renci.org/docs#/default/get_curie_prefixes_handler_get_curie_prefixes_post) + * POST Body: `{"semantic_types": ["biolink:ChemicalEntity", "biolink:AnatomicalEntity"]}` + +Example output: +```json +{ + "biolink:ChemicalEntity": { + "curie_prefix": { + "PUBCHEM.COMPOUND": "119397095", + "INCHIKEY": "115661650", + "CHEMBL.COMPOUND": "2496527", + "CAS": "4029002", + "CHEBI": "200507", + "HMDB": "217920", + "MESH": "258506", + "UMLS": "668019", + "KEGG.COMPOUND": "16035", + "UNII": "134411", + "DRUGBANK": "16108", + "GTOPDB": "12953", + "DrugCentral": "4995", + "RXCUI": "124852" + } + }, + "biolink:AnatomicalEntity": { + "curie_prefix": { + "UMLS": "159496", + "FMA": "98632", + "MESH": "1992", + "UBERON": "14513", + "NCIT": "10223", + "EMAPA": "968", + "ZFA": "607", + "FBbt": "117", + "WBbt": "18", + "CL": "2865", + "SNOMEDCT": "1421", + "GO": "4022" + } + } +} +``` + +## TRAPI Normalization (deprecated) + +These methods are deprecated. + +### `/query` + +Normalizes all the identifiers in a [TRAPI](https://github.com/NCATSTranslator/ReasonerAPI) message. 
+ +* Method: [POST](https://nodenormalization-sri.renci.org/docs#/default/query_query_post) + +### `/asyncquery` + +Identical to `/query`, but returns a URL that the requester can use to poll for the response +rather than waiting for the request to complete. + +* Method: [POST](https://nodenormalization-sri.renci.org/docs#/default/async_query_asyncquery_post)
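
Returning to `/get_normalized_nodes`, the request and response shapes described earlier can be sketched in Python. This is a minimal illustration and not part of NodeNorm itself: the helper names (`build_get_url`, `build_post_body`, `preferred_identifiers`) are hypothetical, the base URL is assumed to be the host that serves the FastAPI documentation, the default option values are arbitrary choices for the sketch, and the handling of a `null` entry for a CURIE that could not be normalized is an assumption. The sketch only builds requests and reads a response; it does not make any network calls.

```python
import json
from urllib.parse import urlencode

# Assumption: endpoints are served from the same host as the FastAPI docs.
BASE_URL = "https://nodenormalization-sri.renci.org"


def build_get_url(curies, conflate=True, drug_chemical_conflate=False):
    """Build a GET URL with one `curie` parameter per identifier."""
    params = [("curie", curie) for curie in curies]
    params.append(("conflate", str(conflate).lower()))
    params.append(("drug_chemical_conflate", str(drug_chemical_conflate).lower()))
    # safe=":" keeps the CURIEs readable in the query string.
    return f"{BASE_URL}/get_normalized_nodes?{urlencode(params, safe=':')}"


def build_post_body(curies, conflate=True, drug_chemical_conflate=False):
    """Build the POST body: the same options, with a `curies` list."""
    return json.dumps({
        "curies": list(curies),
        "conflate": conflate,
        "drug_chemical_conflate": drug_chemical_conflate,
    })


def preferred_identifiers(response):
    """Map each queried CURIE to its preferred (identifier, label) pair.

    Assumption: a CURIE that could not be normalized maps to null (None).
    """
    result = {}
    for queried, clique in response.items():
        if clique is None:
            result[queried] = None
        else:
            result[queried] = (clique["id"]["identifier"], clique["id"].get("label"))
    return result


if __name__ == "__main__":
    print(build_get_url(["MESH:D014867", "NCIT:C34373"]))
    # Trimmed version of the example output shown for this endpoint.
    sample = {"MESH:D014867": {"id": {"identifier": "CHEBI:15377", "label": "Water"}}}
    print(preferred_identifiers(sample))
```

The GET helper repeats the `curie` parameter once per identifier, while the POST helper sends the same options with a single `curies` list, mirroring the difference between the two methods described for this endpoint.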