An important task is to determine whether the target of an assay is a full-length sequence or a partial sequence. We want to use an LLM for this, giving it access to the assay description and the target name. Because the assay description refers to all targets of an assay, it helps to restrict the analysis to assays that have only a single protein target (that is, no multiple targets and no complex targets). Even if an assay has only a single protein target, the assay description may contain terms that indicate a partial reactant but refer to another reactant of the assay rather than to the target. Therefore, we should give the LLM a list of the target's protein names and synonyms to help it decide whether the target is partial.
To achieve this, here are the steps:
- Extract all potential UniProt identifiers from the fields below using the regular expression "\b[A-Z][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9A-Z](-\d+)?\b":
  - polymer.unpid1
  - polymer.comments

  This can be done with the tool in extract_accessions.py.
- For each of the matches, check whether it is a valid UniProtKB accession number; this can be done with tool_is_uniprot_accession.py. Only one UniProtKB ID should remain (or none). If there is more than one UniProtKB ID or none at all, stop with a corresponding message.
- Get the sequence for the UniProtKB ID; this can be done with the tool in fetch_sequence.py.
- Compare the sequence obtained from UniProt with the sequence in the input data, starting with their lengths. If they are not identical, assume we have a partial sequence or a mutation. Use the tool in check_seq.py to check for this. If the result is that the sequence is not full-length or has any mutations, return this result and finish.
- Get all synonyms for this protein from UniProt with the tool in fetch_synonyms.py. Use them as additional input for the LLM, just in case the assay description uses a different name than the official name.
- Get all domains, including their positional ranges, from the UniProt entry with the tool in fetch_domains.py. Use the domain names as additional input for the LLM because the assay description could just say something like "expressed the kinase domain", or "the intracellular region", etc.
- Let the LLM find out if there is any information anywhere in the input data that would indicate that the protein sequence is not full-length or has any mutations. A partial sequence could be specified as a range of positions, or just a domain name. The LLM gets the protein names and synonyms and the domain names as part of the input. Return whether or not the LLM thinks the sequence is partial, and also return all mutations detected.

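The first steps of the list above can be sketched in plain Python. This is only an illustration and not the code in extract_accessions.py or check_seq.py, whose contents are not shown here. The function names `extract_accessions`, `pick_single_accession` and `compare_to_reference`, the shape of the input record (a dictionary keyed by field name), and the returned status labels are all assumptions. The validation against UniProtKB is left out because it needs a network call.

```python
import re

# Pattern copied from the step list above; the optional isoform suffix
# is made non-capturing so that the whole match is the accession.
ACCESSION_RE = re.compile(
    r"\b[A-Z][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9A-Z](?:-\d+)?\b"
)


def extract_accessions(record, fields=("polymer.unpid1", "polymer.comments")):
    """Return candidate accessions from the given fields, in order, without duplicates.

    `record` is assumed to be a dict mapping field names to strings (hypothetical shape).
    """
    found = []
    for field in fields:
        text = record.get(field) or ""
        for match in ACCESSION_RE.finditer(text):
            if match.group(0) not in found:
                found.append(match.group(0))
    return found


def pick_single_accession(valid_accessions):
    """Return (accession, message); accession is None unless exactly one remains."""
    if not valid_accessions:
        return None, "no valid UniProtKB accession found"
    if len(valid_accessions) > 1:
        return None, "more than one UniProtKB accession found: " + ", ".join(valid_accessions)
    return valid_accessions[0], "ok"


def compare_to_reference(assay_seq, uniprot_seq):
    """Classify the assay sequence relative to the UniProt sequence.

    Handles only the simple cases: identical, same length with substitutions,
    or a contiguous fragment. Everything else is reported as 'different'.
    """
    if assay_seq == uniprot_seq:
        return {"status": "full_length", "mutations": []}
    if len(assay_seq) == len(uniprot_seq):
        # Substitutions in the usual <reference><position><variant> notation, 1-based.
        mutations = [
            f"{ref}{pos}{alt}"
            for pos, (ref, alt) in enumerate(zip(uniprot_seq, assay_seq), start=1)
            if ref != alt
        ]
        return {"status": "mutated", "mutations": mutations}
    start = uniprot_seq.find(assay_seq)
    if start >= 0:
        # 1-based inclusive range of the fragment within the reference.
        return {"status": "partial", "range": (start + 1, start + len(assay_seq)), "mutations": []}
    return {"status": "different", "mutations": []}
```

Note that the pattern as written matches only six-character accessions (optionally with an isoform suffix). A ten-character accession such as A0A024R161 is not matched, because the word boundary required after the sixth character fails. If the input data can contain such accessions, the pattern would need to be extended.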
Below is the current version of the LangGraph graph. Currently, only a single assay is selected in select_assay. I will keep it that way until everything works for a single assay. Afterwards, a loop will be added to process all assays from the input data.
The functions below show examples of how to extract information from a UniProtKB entry. Here, a full entry is a dictionary that was fetched as JSON from the UniProt API for a given accession number. The extracted information is the GO Molecular Function or the protein sequence. There is also an alternative: request that particular information from UniProtKB directly instead of fetching the full entry. In that case, the full UniProtKB entry does not need to be kept in the LangGraph state, and no code is needed to extract the relevant information from it. I keep the examples below anyway, in case they become helpful again.
[(crossref["id"], prop["value"][2:])
 for crossref in result["uniProtKBCrossReferences"]
 if crossref["database"] == "GO"
 for prop in crossref["properties"]
 if prop["key"] == "GoTerm" and prop["value"].startswith("F:")
]

Example Result:
[('GO:0003677', 'DNA binding'),
('GO:0019899', 'enzyme binding'),
('GO:0008201', 'heparin binding'),
('GO:0042802', 'identical protein binding'),
('GO:0120283', 'protein serine/threonine kinase binding'),
('GO:0051425', 'PTB domain binding'),
('GO:0048018', 'receptor ligand activity'),
('GO:0000978', 'RNA polymerase II cis-regulatory region sequence-specific DNA binding'),
('GO:0004867', 'serine-type endopeptidase inhibitor activity'),
('GO:0030546', 'signaling receptor activator activity'),
('GO:0005102', 'signaling receptor binding'),
('GO:0046914', 'transition metal ion binding')]

result["sequence"]["value"]

Example Result:
MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLNMHMNVQNGKWDSDPSGTKTCIDTKEGILQYCQEVYPELQITNVVEANQPVTIQNWCKRGRKQCKTHPHFVIPYRCLVGEFVSDALLVPDKCKFLHQERMDVCETHLHWHTVAKETCSEKSTNLHDYGMLLPCGIDKFRGVEFVCCPLAEESDNVDSADAEEDDSDVWWGGADTDYADGSEDKVVEVAEEEEVAEVEEEEADDDEDDEDGDEVEEEAEEPYEEATERTTSIATTTTTTTESVEEVVREVCSEQAETGPCRAMISRWYFDVTEGKCAPFFYGGCGGNRNNFDTEEYCMAVCGSAMSQSLLKTTQEPLARDPVKLPTTAASTPDAVDKYLETPGDENEHAHFQKAKERLEAKHRERMSQVMREWEEAERQAKNLPKADKKAVIQHFQEKVESLEQEAANERQQLVETHMARVEAMLNDRRRLALENYITALQAVPPRPRHVFNMLKKYVRAEQKDRQHTLKHFEHVRMVDPKKAAQIRSQVMTHLRVIYERMNQSLSLLYNVPAVAEEIQDEVDELLQKEQNYSDDVLANMISEPRISYGNDALMPSLTETKTTVELLPVNGEFSLDDLQPWHSFGADSVPANTENEVEPVDARPAADRGLTTRPGSGLTNIKTEEISEVKMDAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIATVIVITLVMLKKKQYTSIHHGVVEVDAAVTPEERHLSKMQQNGYENPTYKFFEQMQN
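
For the alternative mentioned above, requesting only the sequence instead of the full entry, one option is the FASTA endpoint of the UniProt REST API. The sketch below is an assumption about how this could look and not the content of fetch_sequence.py. The helper names `parse_fasta_sequence` and `fetch_sequence_fasta` are hypothetical, and only the standard library is used.

```python
from urllib.request import urlopen

# FASTA endpoint of the UniProt REST API for a single entry.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta_sequence(fasta_text):
    """Return the sequence of the first record in a FASTA string."""
    parts = []
    seen_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                break  # a second header starts the next record; stop here
            seen_header = True
            continue
        if line:
            parts.append(line)
    return "".join(parts)


def fetch_sequence_fasta(accession, timeout=30):
    """Fetch the FASTA record for one accession and return the bare sequence.

    Needs network access; HTTP errors are raised by urlopen.
    """
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urlopen(url, timeout=timeout) as response:
        return parse_fasta_sequence(response.read().decode("utf-8"))
```

With this approach only the sequence string has to be stored in the LangGraph state, for example via `fetch_sequence_fasta("P05067")`.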
