Validation improvements (wip) #69

Draft
masklinn wants to merge 2 commits into ua-parser:main from masklinn:fullvalidation

@masklinn masklinn commented Mar 21, 2026

Try to add more validations / cross-impl checks to ensure regex-filtered yields the same result as more naive implementations even on complete / real-world datasets.

  • The python version of matchindex might not be useful as the regex implementation has the exact same model / method and seems to go quite a bit faster already.
  • Unless this gets leveraged for performance work, but in that case re2 and FilteredRE2 implementations should be added.

as well as add stdin or URLs support. Generating regex files out of
`regexes.yaml` is a convenient first step to make subsequent scripts
simpler (e.g. not require every one of them to read yaml).
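
A minimal sketch of that generation step, not the script from this PR. It assumes the layout of uap-core's `regexes.yaml` (top-level `user_agent_parsers`, `os_parsers` and `device_parsers` lists whose entries carry a `regex` key) and takes the already-parsed document, e.g. the result of pyyaml's `yaml.safe_load`. The function name, the `.regexes` file suffix and the handling of `regex_flag` are hypothetical choices made for illustration.

```python
from pathlib import Path

# Top-level keys of regexes.yaml, one per domain.
DOMAINS = ("user_agent_parsers", "os_parsers", "device_parsers")


def write_regex_files(data, outdir):
    """Write one `<domain>.regexes` file per domain, one regex per line.

    `data` is the parsed content of regexes.yaml (a dict of lists of
    dicts); parsing the YAML itself is left to the caller.
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    written = {}
    for domain in DOMAINS:
        lines = []
        for entry in data.get(domain, []):
            regex = entry["regex"]
            # Assumption: fold an optional `regex_flag: 'i'` into the
            # pattern so the output file stays one regex per line.
            if entry.get("regex_flag") == "i":
                regex = "(?i)" + regex
            lines.append(regex)
        path = outdir / f"{domain}.regexes"
        path.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
        written[domain] = path
    return written
```

Subsequent scripts can then read plain text files line by line and never need a YAML parser.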

This script could be a yq command (plus an optional curl first step),
but e.g. nix's `yq` is a python wrapper around `jq` which depends on
pyyaml so the gain is limited.

These three files / scripts are 3 different implementations (python,
regex, regex-filtered) of the same thing: taking a regex set and a
bunch of needles, for each needle find the first matching regex, and
output its index (0-indexed).
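
The naive version of that loop can be sketched in a few lines of Python. This is an illustration rather than the PR's script: the function name is hypothetical, and yielding `None` when no regex matches is an assumption, since the text above does not say what gets output in that case.

```python
import re


def first_match_indexes(regexes, needles):
    """For each needle, yield the 0-based index of the first matching regex.

    Yields None when nothing matches (an assumption of this sketch).
    """
    compiled = [re.compile(r) for r in regexes]
    for needle in needles:
        # Try the regexes in order and stop at the first one that matches.
        yield next(
            (i for i, rx in enumerate(compiled) if rx.search(needle)),
            None,
        )


regexes = [r"Firefox/(\d+)", r"Chrome/(\d+)", r"Mozilla"]
needles = ["Mozilla/5.0 Chrome/120", "Mozilla/5.0 Firefox/115", "curl/8.0"]
print(list(first_match_indexes(regexes, needles)))  # [1, 0, None]
```

Order matters: the first needle also matches the `Mozilla` regex at index 2, but the earlier `Chrome` regex wins.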

This is the core loop of ua-parser, and allows validating that
regex-filtered matches a more naive version of the same process.
Happily I couldn't find any divergence although that means I did a
fair amount of useless work. Also the python version is really slow
compared to even the regex one, so probably don't use that...
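
Checking for divergence amounts to comparing the index streams of two implementations line by line. A hedged sketch, assuming each implementation's output has been captured as an iterable of text lines (the function name is hypothetical):

```python
from itertools import zip_longest


def divergences(reference, candidate):
    """Yield (line_number, reference_value, candidate_value) where outputs differ.

    Both inputs are iterables of strings, one index per needle. A stream
    that ends early shows up as None on its side.
    """
    pairs = zip_longest(reference, candidate)  # pads the shorter side with None
    for lineno, (ref, cand) in enumerate(pairs, start=1):
        if ref != cand:
            yield lineno, ref, cand
```

An empty result means the two implementations agree on every needle; line numbers are 1-based so they map directly onto the needle file.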

Using `paste`, the index extractions of multiple domains can be
combined with the original needles into TSV documents if that's of
use. This could also be expanded to multi-index extraction if that's a
need for anyone and should be checked more extensively.
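
For reference, what `paste` does with those files can be mimicked in Python as below. This is only a sketch of the combination step, with a hypothetical helper name; like `paste`, it pads exhausted columns with empty fields.

```python
from itertools import zip_longest


def paste(*columns):
    """Join corresponding items of each column with tabs, one row per line."""
    for row in zip_longest(*columns, fillvalue=""):
        yield "\t".join(str(field) for field in row)


needles = ["Mozilla/5.0 Chrome/120", "curl/8.0"]
ua_idx = [1, 7]  # made-up indices, for illustration only
os_idx = [4, 0]
for line in paste(needles, ua_idx, os_idx):
    print(line)
```

A needle that itself contains a tab would break the TSV, so this relies on the inputs being tab-free.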

Note that only the python version supports stdin input at this point,
I couldn't be arsed to do that with the Rust ones, but process
substitution ought to work fine anyway? The needles are read on the go
so they should not need to be an actual file.
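
Reading needles on the go, from either a path or stdin, might look like the following sketch. Treating `-` as "read stdin" is a common CLI convention assumed here, not something the PR states, and both helper names are hypothetical.

```python
import sys
from contextlib import nullcontext


def open_needles(path):
    """Return a context manager yielding a text stream of needles.

    "-" means stdin (assumed convention); nullcontext keeps stdin open
    when the `with` block exits.
    """
    if path == "-":
        return nullcontext(sys.stdin)
    return open(path, encoding="utf-8")


def iter_needles(stream):
    """Yield needles one at a time, so the input can be a pipe or a FIFO."""
    for line in stream:
        yield line.rstrip("\n")
```

Because the stream is consumed lazily, the same code works for a regular file, a pipe, or the path a shell hands over for process substitution.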

This may not be in a state fit for performance checking as the output
loop of the rust version is the worst (no buffering, no
stdout-locking).

@masklinn masklinn changed the title from "Validation extensions (wip)" to "Validation improvements (wip)" on Mar 21, 2026