Add tokenizer for BigSMILES representation of polymers #8
anoushka2000 wants to merge 25 commits into main from
awadell1 left a comment
- missing Rust-level tests
- missing tests for expected unknowns (e.g. 🤗)
- Is there a spec for this? The test set should run against every BigSMILES string in the spec.
awadell1 left a comment
Code looks good; I have some comments on tests and serialization. My main concern is how labels are handled. The spec seems pretty unbounded, and as is, the current tokenizer has an open vocabulary.
Do you plan on handling the special fragment labels?
You need way more tests: things like higher digit counts for bonding descriptors, weird labels, and every facet of the spec. There's no test for labels or for the abstract fragment spec.
    return sorted(out)


def merge_tokens_grouped(tokens):
| is a valid character in the weighted BigSMILES representation. The expression output by merge_tokens would capture it, e.g. A| could be a token, and I didn't want to alter the original merge_tokens function. I also ended up not supporting the G-BigSMILES extension, where the | character occurs, but it is probably still better not to have that be ambiguous.
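The ambiguity described here can be sketched in a few lines. If a vocabulary is merged into one regular expression by joining tokens with |, a token that itself contains | is read as an alternation unless it is escaped. The `join_tokens` helper below is hypothetical and written only for illustration; it is not the PR's merge_tokens, whose implementation is not shown on this page.

```python
import re


def join_tokens(tokens, escape):
    """Hypothetical helper: build one alternation pattern from literal tokens."""
    # Try longer tokens first so that "A|" is attempted before shorter ones.
    ordered = sorted(tokens, key=len, reverse=True)
    parts = [re.escape(t) if escape else t for t in ordered]
    return re.compile("|".join(parts))


tokens = ["A|", "B"]

# Unescaped: the pattern text is "A||B", i.e. "A", the empty string, or "B".
naive = join_tokens(tokens, escape=False)
# Escaped: the pattern text is "A\||B", i.e. the literal "A|" or "B".
escaped = join_tokens(tokens, escape=True)

naive_match = naive.match("A|B").group(0)
escaped_match = escaped.match("A|B").group(0)
```

Under these assumptions the unescaped pattern consumes only "A" from "A|B", while the escaped one consumes "A|", which is why keeping | out of the token set (or escaping it) avoids the ambiguity.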
r"\(|\)|",
r"\{|\}|", // Stochastic object delimiters
r",|;|", // Repeat unit separator and end group separator
r"[A-Z][A-Za-z0-9']*|", // Fragment and abstract spec labels
This can match basically any name.
- not covered by the test set
- demands an unbounded vocabulary
- How does this interact with the rest of the spec? For example, what's the tokenization of AbCl? Aren't labels bracketed?
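As a rough check of the concern above, the label pattern from the diff can be tried on its own in Python. This is only a sketch: in the actual tokenizer the pattern is one alternative among several, and alternatives listed earlier may match first, so the real tokenization may differ from what the isolated pattern suggests.

```python
import re

# Label pattern copied from the diff, tested in isolation.
LABEL = re.compile(r"[A-Z][A-Za-z0-9']*")


def label_matches(text):
    """Return every substring the label pattern matches in `text`."""
    return LABEL.findall(text)


abcl = label_matches("AbCl")    # one greedy match spanning the whole string
ccl = label_matches("CCl")      # ordinary SMILES atoms are swallowed as well
arm = label_matches("[#Arm]")   # the bracketed label from the test set
```

Taken alone, the pattern treats any capitalised run as a single label, which is consistent with the point that the vocabulary is unbounded unless labels are restricted to the bracketed form.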
{[<][>]NC(C)C(=O)[<],[>]NCC(=O)[<][>]}O
{[<]NC(C)C(=O),NCC(=O)[>]}O
{[][$]CC(C)([#R])[$][]}.{#R=C(=O)OCC12CC(C3)CC(C1)CC3C2}
C([#Arm])([#Arm])([#Arm])[#Arm].{#Arm=CO{[<][>]CCO[<][>]}}
This produces unknowns, right? Because Arm isn't a token?
At the Python level this should be used to check for no unknowns
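One way such a Python-level check might look is sketched below. Everything here is an assumption for illustration: the `[UNK]` string, the toy `encode` stand-in, and the small vocabulary are hypothetical and do not reflect the PR's actual tokenizer API.

```python
UNK = "[UNK]"  # assumed unknown-token string; the PR's actual value may differ


def encode(tokens, vocab):
    """Toy stand-in for the tokenizer: map out-of-vocabulary tokens to UNK."""
    return [t if t in vocab else UNK for t in tokens]


def unknown_positions(encoded):
    """Return the indices of every unknown token in an encoded sequence."""
    return [i for i, t in enumerate(encoded) if t == UNK]


# Hypothetical vocabulary that lacks the fragment label "Arm".
vocab = {"C", "(", ")", "[", "]", "#", "O"}
encoded = encode(["C", "(", "[", "#", "Arm", "]", ")"], vocab)
unknowns = unknown_positions(encoded)
```

A test over the example strings could then assert that `unknown_positions(...)` is empty for each one, and report the offending positions when it is not.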
No description provided.