v3 API: general XPath 2.0 mechanism, generateDS true reverse mapping, ocrd-filter#21
bertsky wants to merge 218 commits into new-processor-api
Signed-off-by: Stefan Weil <sw@weilnetz.de>
…e, avoid buggy lxml global registration mechanism
… 'query'), use 'elementpath.XPathParser.external_function' with global registration instead of 'etree.FunctionNamespace' with local extension
Co-authored-by: Konstantin Baierer <kba@users.noreply.github.com>
kba left a comment:
Thanks, LGTM. Also good to be up-to-date with generateDS again. Testing now.
Note to self: we need to know whether refactoring the AlternativeImage selection logic out of |
@kba I don't think we need to break anything here in the future. The methods
closing – see OCR-D#1300 and OCR-D#1301
I initially published the first form of the builtin processor ocrd-filter on OCR-D#1240 directly, but since this is new functionality and involves lots of other changes, I rebased and split it off into this PR for easier reviewing.

The idea behind ocrd-filter is that the user gets to write powerful XPath expressions as runtime parameters, and the processor takes care of the removal from PAGE (including the ReadingOrder update, and optionally saving images of the segments that were removed, for quick visual inspection).

To make this as expressive as possible, we need
pixelarea or concatenated textequiv, but more to come surely)

For the former I initially (see first commits) experimented with lxml's builtin etree.FunctionNamespace, but this turned out to be quite buggy. (It crashes with segmentation errors when using the global namespace registration with a namespace prefix, even in single-threaded mode. It did work using local namespace registration, though.) I briefly looked at SaxonC-HE, but found it does not allow for extension functions in Python (only in Java). So I ended up with pure-Python elementpath, which is slower, but really powerful – and easy to use.

Then I figured it would be really helpful (for ocrd-filter, but also other processors) if our OcrdPage.revmap actually did contain a reverse lookup mechanism (from tree node to generated DOM object). And since generateDS (after v2.40) now supports that, and no longer has the problems with simple type enums, I decided to try and update ocrd_page_generateds again – and it worked. So now we can really do page.xpath() → page.revmap → page.pcgts.

I placed the first two extension functions under ocrd_models.xpath_functions (as we might also want to write some for METS or MODS or whatever), but this is just an idea at the moment.

Besides more XPath extension functions (e.g. a function for the ratio of foreground pixels when binarized derived images are present), I am also planning on extending the generated PcGtsType via user methods directly (e.g. a method for TextEquiv consistency across the hierarchy, and another for Coords projection)...
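
To illustrate the filtering idea described above (select segments, remove them from PAGE, keep ReadingOrder consistent), here is a minimal sketch using only the Python standard library. It is not the code in this PR: `xml.etree.ElementTree` stands in for elementpath/lxml and has no XPath 2.0 or extension function support, so the "pixelarea" test is a plain Python predicate. The names `filter_regions`, `pixelarea` and `PAGE_XML` are hypothetical, and treating pixelarea as the polygon area of `Coords/@points` is an assumption.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PAGE_NS}

# Hypothetical two-region page, just enough structure for the sketch.
PAGE_XML = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageWidth="1000" imageHeight="1000">
    <ReadingOrder>
      <OrderedGroup id="ro">
        <RegionRefIndexed index="0" regionRef="r1"/>
        <RegionRefIndexed index="1" regionRef="r2"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1"><Coords points="0,0 100,0 100,50 0,50"/></TextRegion>
    <TextRegion id="r2"><Coords points="0,0 5,0 5,5 0,5"/></TextRegion>
  </Page>
</PcGts>"""


def pixelarea(region):
    """Polygon area of a region's Coords/@points (shoelace formula).

    Assumption: this is only a stand-in for what an XPath extension
    function of that name might compute.
    """
    coords = region.find("pc:Coords", NS)
    points = [tuple(int(v) for v in pair.split(","))
              for pair in coords.get("points").split()]
    # Pair every vertex with its successor, wrapping around at the end.
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2)
                     in zip(points, points[1:] + points[:1]))
    return abs(twice_area) / 2


def filter_regions(root, min_area):
    """Remove top-level TextRegions smaller than min_area; return their ids."""
    page = root.find("pc:Page", NS)
    removed = []
    for region in list(page.findall("pc:TextRegion", NS)):
        if pixelarea(region) < min_area:
            page.remove(region)
            removed.append(region.get("id"))
    # Drop ReadingOrder references to removed regions, then renumber the rest.
    for group in root.iter(f"{{{PAGE_NS}}}OrderedGroup"):
        for ref in list(group):
            if ref.get("regionRef") in removed:
                group.remove(ref)
        for index, ref in enumerate(group):
            ref.set("index", str(index))
    return removed
```

Calling `filter_regions(ET.fromstring(PAGE_XML), min_area=100)` on the sample page removes the small region `r2` and its ReadingOrder entry. In the processor described here, the selection would instead come from a user-supplied XPath expression, and the removal would presumably operate on the generated DOM objects reached via the reverse map rather than on raw tree nodes.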