Add tokenizer for BigSMILES representation of polymers#8

Open
anoushka2000 wants to merge 25 commits into main from feat/big-smiles
Conversation

@anoushka2000
Collaborator

No description provided.

@anoushka2000 anoushka2000 requested a review from awadell1 April 23, 2026 23:50
Member

@awadell1 awadell1 left a comment

  • missing Rust-level tests
  • missing tests for expected unknowns (i.e. 🤗)
  • is there a spec for this? The test set should run against every BigSMILES in the spec
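A minimal sketch of the expected-unknowns behavior those tests should pin down. The `lookup` helper and the `vocab` set here are stand-ins for illustration, not smirk's actual API; a real test would call the Python binding and check for the `[UNK]` id:

```python
def lookup(tokens, vocab, unk="[UNK]"):
    """Map each token to itself if in-vocab, otherwise to the unknown token."""
    return [t if t in vocab else unk for t in tokens]

# Hypothetical tiny vocabulary for the sketch
vocab = {"C", "O", "{", "}", "[<]", "[>]"}

print(lookup(["C", "🤗", "[<]"], vocab))  # → ['C', '[UNK]', '[<]']
```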

Comment thread pyproject.toml Outdated
Member

@awadell1 awadell1 left a comment

More comments. Also, this needs an entry in the changelog.

Comment thread python/smirk/__init__.py
Comment thread python/smirk/__init__.py
Comment thread python/smirk/vocab_bigsmiles.json
Comment thread docs/big_smirk_demo.ipynb
Comment thread src/pre_tokenizers/split_bigsmiles.rs
Comment thread src/wrapper.rs
@anoushka2000 anoushka2000 requested a review from awadell1 April 26, 2026 17:40
Member

@awadell1 awadell1 left a comment

Code looks good; some comments on tests and serialization. But my main concern is how labels are handled. The spec seems pretty unbounded, and as-is, the current tokenizer is open.

Do you plan on handling the special fragment labels?

You need way more tests: things like higher digit counts for bonding descriptors, weird labels, and every facet of the spec. There's no test for labels or the abstract fragment spec.
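For the bonding-descriptor cases specifically, something like the following sketch. The regex below is an assumption about the intended grammar (`[$]`, `[<]`, `[>]`, optionally followed by a digit index), not smirk's actual pre-tokenizer pattern:

```python
import re

# Hypothetical bonding-descriptor pattern: $, <, or >, with an optional
# multi-digit index, all inside square brackets.
BOND_DESC = re.compile(r"\[[$<>]\d*\]")

# Higher digit counts should match...
cases = ["[<]", "[>]", "[$]", "[<1]", "[>12]", "[$103]"]
assert all(BOND_DESC.fullmatch(c) for c in cases)

# ...while malformed descriptors should not.
assert BOND_DESC.fullmatch("[<>]") is None
```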

Comment thread opt/build_vocab.py
return sorted(out)


def merge_tokens_grouped(tokens):
Member

This looks like a duplicate of merge_tokens.

Collaborator Author

@anoushka2000 anoushka2000 Apr 28, 2026

| is a valid character in the weighted BigSMILES representation, and the expression output from merge_tokens would capture it, e.g. A| could be a token. I didn't want to alter the original merge_tokens function. I also ended up not supporting the G-BigSMILES extension where the | character occurs, but it's probably still better not to leave that ambiguous.

https://github.com/InnocentBug/G-BigSMILES/blob/2fbbeb7879dc9c15c67178d1399b0a9bc9a21f38/README.md?plain=1#L11
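To illustrate the ambiguity (this is a stdlib sketch, not the actual merge_tokens code): if tokens are joined into an alternation with a bare `"|"`, a token that itself contains `|` silently changes the pattern's meaning.

```python
import re

tokens = ["A|", "B"]

# Naive join: "A||B" — the literal '|' inside "A|" becomes an alternation bar.
naive = "|".join(tokens)
assert re.match(naive, "A|").group(0) == "A"  # intended token "A|" is lost

# Escaping each token first restores the intended match.
escaped = "|".join(re.escape(t) for t in tokens)  # r"A\||B"
assert re.match(escaped, "A|").group(0) == "A|"
```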

Comment thread src/pre_tokenizers/split_bigsmiles.rs Outdated
r"\(|\)|",
r"\{|\}|", // Stochastic object delimiters
r",|;|", // Repeat unit separator and end group separator
r"[A-Z][A-Za-z0-9']*|", // Fragment and abstract spec labels
Member

This can match basically any name:

  1. it's not covered by the test set
  2. it demands an unbounded vocabulary
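Reproducing that alternative in Python shows how permissive it is; any capitalized identifier matches, which is why the vocabulary it implies is unbounded:

```python
import re

# The fragment/label alternative from split_bigsmiles.rs.
LABEL = re.compile(r"[A-Z][A-Za-z0-9']*")

# All of these (and infinitely many others) are accepted as labels.
for name in ["Arm", "R", "AnyArbitraryName123", "X'"]:
    assert LABEL.fullmatch(name)
```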

Member

  1. How does this interact with the rest of the spec? For example, what's the tokenization of AbCl?
  2. Aren't labels bracketed?

Comment thread test/bigsmiles.smi
{[<][>]NC(C)C(=O)[<],[>]NCC(=O)[<][>]}O
{[<]NC(C)C(=O),NCC(=O)[>]}O
{[][$]CC(C)([#R])[$][]}.{#R=C(=O)OCC12CC(C3)CC(C1)CC3C2}
C([#Arm])([#Arm])([#Arm])[#Arm].{#Arm=CO{[<][>]CCO[<][>]}}
Member

This produces unknowns, right? Because Arm isn't a token?

Comment thread test/bigsmiles.smi
Member

At the Python level, this file should be used to check that tokenization produces no unknowns.
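A hedged sketch of that check: tokenize every line of test/bigsmiles.smi and fail loudly on any out-of-vocabulary token. The helper and `vocab` below are stand-ins; a real test would drive the smirk tokenizer instead:

```python
def assert_no_unknowns(tokens, vocab):
    """Fail if any token falls outside the vocabulary."""
    unknown = sorted({t for t in tokens if t not in vocab})
    assert not unknown, f"out-of-vocabulary tokens: {unknown}"

# Hypothetical vocabulary for the sketch.
vocab = {"C", "O", "N", "{", "}", "[<]", "[>]"}

assert_no_unknowns(["C", "O", "[<]"], vocab)  # passes silently
```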

Comment thread src/tokenizer.rs
Comment thread test/test_tokenize_bigsmiles.py
Comment thread opt/build_vocab.py Outdated
Comment thread opt/build_vocab.py
@anoushka2000 anoushka2000 requested a review from awadell1 May 1, 2026 19:07