Add tokenizer for BigSMILES representation of polymers#8

Open
anoushka2000 wants to merge 25 commits into main from feat/big-smiles
Conversation

@anoushka2000
Collaborator

No description provided.

@anoushka2000 anoushka2000 requested a review from awadell1 April 23, 2026 23:50
Member

@awadell1 awadell1 left a comment

  • missing Rust-level tests
  • missing tests for expected unknowns (i.e. 🤗)
  • is there a spec for this? The test set should run against every BigSMILES in the spec
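A minimal sketch of the expected-unknowns behavior those tests should pin down. The `lookup` helper and the `vocab` set here are stand-ins for illustration, not smirk's actual API; a real test would call the Python binding and check for the `[UNK]` id:

```python
def lookup(tokens, vocab, unk="[UNK]"):
    """Map each token to itself if in-vocab, otherwise to the unknown token."""
    return [t if t in vocab else unk for t in tokens]

# Hypothetical tiny vocabulary for the sketch
vocab = {"C", "O", "{", "}", "[<]", "[>]"}

print(lookup(["C", "🤗", "[<]"], vocab))  # → ['C', '[UNK]', '[<]']
```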

Comment thread pyproject.toml Outdated
Member

@awadell1 awadell1 left a comment

More comments. Also, this needs an entry in the changelog.

Comment thread python/smirk/__init__.py
Comment thread python/smirk/__init__.py
Comment thread python/smirk/vocab_bigsmiles.json
Comment thread docs/big_smirk_demo.ipynb
Comment thread src/pre_tokenizers/split_bigsmiles.rs
Comment thread src/wrapper.rs
@anoushka2000 anoushka2000 requested a review from awadell1 April 26, 2026 17:40
Member

@awadell1 awadell1 left a comment

Code looks good; some comments on tests and serialization. But my main concern is how labels are handled. The spec seems pretty unbounded, and as-is, the current tokenizer is open.

Do you plan on handling the special fragment labels?

You need way more tests: things like higher digit counts for bonding descriptors, weird labels, and every facet of the spec. There's no test for labels or the abstract fragment spec.
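For the bonding-descriptor cases specifically, something like the following sketch. The regex below is an assumption about the intended grammar (`[$]`, `[<]`, `[>]`, optionally followed by a digit index), not smirk's actual pre-tokenizer pattern:

```python
import re

# Hypothetical bonding-descriptor pattern: $, <, or >, with an optional
# multi-digit index, all inside square brackets.
BOND_DESC = re.compile(r"\[[$<>]\d*\]")

# Higher digit counts should match...
cases = ["[<]", "[>]", "[$]", "[<1]", "[>12]", "[$103]"]
assert all(BOND_DESC.fullmatch(c) for c in cases)

# ...while malformed descriptors should not.
assert BOND_DESC.fullmatch("[<>]") is None
```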

Comment thread opt/build_vocab.py
return sorted(out)


def merge_tokens_grouped(tokens):
Member

This looks like a duplicate of merge_tokens.

Collaborator Author

@anoushka2000 anoushka2000 Apr 28, 2026

| is a valid character in the weighted BigSMILES representation, and the expression output from merge_tokens would capture it, e.g. A| could be a token. I didn't want to alter the original merge_tokens function. I also ended up not supporting the G-BigSMILES extension where the | character occurs, but it's probably still better not to leave that ambiguous.

https://github.com/InnocentBug/G-BigSMILES/blob/2fbbeb7879dc9c15c67178d1399b0a9bc9a21f38/README.md?plain=1#L11
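To illustrate the ambiguity (this is a stdlib sketch, not the actual merge_tokens code): if tokens are joined into an alternation with a bare `"|"`, a token that itself contains `|` silently changes the pattern's meaning.

```python
import re

tokens = ["A|", "B"]

# Naive join: "A||B" — the literal '|' inside "A|" becomes an alternation bar.
naive = "|".join(tokens)
assert re.match(naive, "A|").group(0) == "A"  # intended token "A|" is lost

# Escaping each token first restores the intended match.
escaped = "|".join(re.escape(t) for t in tokens)  # r"A\||B"
assert re.match(escaped, "A|").group(0) == "A|"
```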

Comment thread src/pre_tokenizers/split_bigsmiles.rs Outdated
r"\(|\)|",
r"\{|\}|", // Stochastic object delimiters
r",|;|", // Repeat unit separator and end group separator
r"[A-Z][A-Za-z0-9']*|", // Fragment and abstract spec labels
Member

This can match basically any name:

  1. it's not covered by the test set
  2. it demands an unbounded vocabulary
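Reproducing that alternative in Python shows how permissive it is; any capitalized identifier matches, which is why the vocabulary it implies is unbounded:

```python
import re

# The fragment/label alternative from split_bigsmiles.rs.
LABEL = re.compile(r"[A-Z][A-Za-z0-9']*")

# All of these (and infinitely many others) are accepted as labels.
for name in ["Arm", "R", "AnyArbitraryName123", "X'"]:
    assert LABEL.fullmatch(name)
```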

Member

  1. How does this interact with the rest of the spec? For example, what's the tokenization of AbCl?
  2. Aren't labels bracketed?

Comment thread test/bigsmiles.smi
{[<][>]NC(C)C(=O)[<],[>]NCC(=O)[<][>]}O
{[<]NC(C)C(=O),NCC(=O)[>]}O
{[][$]CC(C)([#R])[$][]}.{#R=C(=O)OCC12CC(C3)CC(C1)CC3C2}
C([#Arm])([#Arm])([#Arm])[#Arm].{#Arm=CO{[<][>]CCO[<][>]}}
Member

This produces unknowns, right? Because Arm isn't a token?

Comment thread test/bigsmiles.smi
Member

At the Python level, this file should be used to check that tokenization produces no unknowns.
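A hedged sketch of that check: tokenize every line of test/bigsmiles.smi and fail loudly on any out-of-vocabulary token. The helper and `vocab` below are stand-ins; a real test would drive the smirk tokenizer instead:

```python
def assert_no_unknowns(tokens, vocab):
    """Fail if any token falls outside the vocabulary."""
    unknown = sorted({t for t in tokens if t not in vocab})
    assert not unknown, f"out-of-vocabulary tokens: {unknown}"

# Hypothetical vocabulary for the sketch.
vocab = {"C", "O", "N", "{", "}", "[<]", "[>]"}

assert_no_unknowns(["C", "O", "[<]"], vocab)  # passes silently
```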

Comment thread src/tokenizer.rs
Comment thread test/test_tokenize_bigsmiles.py
Comment thread opt/build_vocab.py Outdated
Comment thread opt/build_vocab.py
@anoushka2000 anoushka2000 requested a review from awadell1 May 1, 2026 19:07