🔥 remove everything except the cropper and layout analyzer and convert that to OCR-D/core v3 API #112
bertsky
left a comment
Looks sane.
But I thought we wanted to do that in two separate stages:
- migrate the cropper to v3, keep the others as-is
- drop the others, perhaps by archiving the entire repo and re-introducing the cropper elsewhere
At least for the layout analysis (i.e. the model-driven page classifier), this repo had the only existing processor of that kind (and, to some extent, the only tool that writes logical structMap/div, although docstruct now does so as well). I am not entirely sure the classifier was outright unusable; perhaps we should keep at least that until we have a replacement?
    - restore_cache:
        keys:
          # ocrd-resources depends on the model files registered under ocrd_anybaseocr/ocrd-tool.json
          - v2-models-{{ checksum "ocrd_anybaseocr/ocrd-tool.json" }}
    - run:
        name: download models
        command: make models
    - save_cache:
        paths:
          - ~/.local/share/ocrd-resources
        key: v2-models-{{ checksum "ocrd_anybaseocr/ocrd-tool.json" }}
Already forgot about that pattern. Too bad it needs to go here.
    from .assets import assets, copy_of_directory

    def test_crop():
We should also adopt the scheme employed in other v3 processor tests here: checking with METS Server and page-parallel, too. (Infrastructure in conftest.py, test function just gets processor_kwargs, including the workspace.)
But my fairly trivial test fails for pageparallel and following for test_layout, not sure where my logic is wrong :/
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
    self.write_to_mets(results, page_id)
    return result
This entire approach does not work with the METS Server, which we cannot disallow beforehand, because max_workers only applies to the multiprocessing part. So running with METS Server would always fail in that method (raising something about ClientSideOcrdMets not providing access to some attributes).
But we could do better: as examples like ocrd_pagetopdf or ocrd_cis/ocrd_cis/postcorrect/cli.py illustrate, one can do an extra METS serialization step at the end of process_workspace to ensure we get access to the METS file directly (and then have the METS Server reload itself, if present).
To be more concrete: on each page, you would just record enough information into a processor attribute dict. Then after the page loop (in process_workspace, after the super call), you act on that dict:
- if necessary (ClientSideOcrdMets):
  - self.workspace.mets.save_mets()
  - instantiate a new workspace (to get a direct METS)
  - do the METS update
  - save the new workspace METS
  - self.workspace.mets.reload_mets()
- otherwise: do the METS update
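A minimal sketch of that control flow, for illustration only. All classes here (`DirectMets`, `ClientSideMets`, `Workspace`, `LayoutProcessor`) are hypothetical stand-ins and not the OCR-D API; only the method names `save_mets` / `reload_mets` and the attribute name `page_labels` are taken from this thread, and the "save the new workspace METS" step is reduced to a comment.

```python
# Stand-in classes only: they mimic the control flow, not the OCR-D API.
class DirectMets:
    """Hypothetical METS object with direct access to the document."""
    def __init__(self):
        self.struct_links = []


class ClientSideMets:
    """Hypothetical METS Server client without direct document access."""
    def __init__(self, backing):
        self.backing = backing  # the document the server would serialize
        self.calls = []

    def save_mets(self):
        self.calls.append("save_mets")

    def reload_mets(self):
        self.calls.append("reload_mets")


class Workspace:
    def __init__(self, mets):
        self.mets = mets


class LayoutProcessor:
    def __init__(self, workspace):
        self.workspace = workspace
        self.page_labels = {}  # filled per page, consumed after the page loop

    def process_page(self, page_id, label):
        # Per page: only record what the later METS update will need.
        self.page_labels[page_id] = label

    def process_workspace(self, pages):
        for page_id, label in pages:  # stands in for the super() call
            self.process_page(page_id, label)
        mets = self.workspace.mets
        if isinstance(mets, ClientSideMets):
            mets.save_mets()                   # server serializes its state
            direct = Workspace(mets.backing)   # new workspace, direct METS
            self.write_to_mets(direct.mets)
            # a real implementation would save the new workspace METS here
            mets.reload_mets()                 # server picks up the update
        else:
            self.write_to_mets(mets)

    def write_to_mets(self, mets):
        for page_id, label in self.page_labels.items():
            mets.struct_links.append((label, page_id))


direct = DirectMets()
client = ClientSideMets(direct)
proc = LayoutProcessor(Workspace(client))
proc.process_workspace([("P1", "title_page"), ("P2", "content")])
```

The point of the branch is that the per-page code never touches the METS, so it behaves the same with or without a METS Server.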
Not sure whether the final workspace.mets.reload() or the shutdown are necessary.
> Not sure whether the final workspace.mets.reload() or the shutdown are necessary.
Yes, the reload is necessary so the (next processor in the) workflow can proceed with the correct (new) version of the METS.
shutdown should actually del self.model if it exists. The reset should happen at the end of process_workspace. And log_id / log_links / logID / logIDs do not need to be attributes anymore – they could now be passed in to write_to_mets. In contrast, self.page_labels can be directly used by that function and does not need to be passed in. In fact, that function could itself loop over it.
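The lifecycle described here could look roughly like the following sketch. `StructureAnalysisSketch` is a hypothetical stand-in rather than the processor in this PR, the model is a placeholder object, and the result is collected in a plain list instead of a METS file; only `setup`-style loading, `reset`, `shutdown`, `write_to_mets`, `page_labels` and `log_links` echo names from the thread.

```python
class StructureAnalysisSketch:
    """Hypothetical stand-in illustrating the suggested lifecycle."""

    def setup(self):
        self.model = object()  # placeholder for the loaded classifier
        self.reset()

    def reset(self):
        self.page_labels = {}  # per-run state shared across pages

    def shutdown(self):
        # Free the model only if setup() actually created it.
        if hasattr(self, "model"):
            del self.model

    def process_workspace(self, pages):
        for page_id, label in pages:
            self.page_labels[page_id] = label
        log_links = []  # local instead of an attribute, passed in below
        self.write_to_mets(log_links)
        self.reset()    # reset at the end of process_workspace
        return log_links

    def write_to_mets(self, log_links):
        # Uses self.page_labels directly and loops over it itself.
        for page_id, label in self.page_labels.items():
            log_links.append((label, page_id))


sketch = StructureAnalysisSketch()
sketch.setup()
links = sketch.process_workspace([("P1", "cover"), ("P2", "chapter")])
sketch.shutdown()
sketch.shutdown()  # safe to call again: the hasattr() guard makes it a no-op
```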
e0e63db I moved the self.reset() call and changed self.shutdown() accordingly, but I don't want to mess with the attributes/method params right now. Feel free to push if you have the stomach for that refactoring :)
ok, I'll test locally and see what I can do
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
bertsky
left a comment
Thanks!
(I still think we should make the METS Server scenario workable, though.)
    "format": "uri",
    "content-type": "text/directory",
    "cacheable": true,
    "default": "structure_analysis",
does not work in the way these are currently packaged (see CI failure)
    - "default": "structure_analysis",
    + "default": "models/structure_analysis",
Wait. This would work for resolve, but then resmgr list-installed does not show them anymore (because we just disallowed recursion for module location in the spec).
What we could instead do in the processor class itself:
    def moduledir(self):
        return resource_filename(self.module, 'models')
Or just move the models to the package root. I'll try that first; if it works, I'll rebase before merging so the git repo stays reasonably small.
Of course, that would work as well. But redefining moduledir is the pattern we should generally use, I believe. (Also for preinstalled models in ocrd_froc etc.)
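To illustrate why redefining moduledir lets the parameter default stay a bare name, here is a small sketch. `BaseProcessorSketch`, its `resolve` method, and the path `/opt/pkg/ocrd_anybaseocr` are all hypothetical; the sketch also assumes moduledir is exposed as a property, which the snippet in this thread does not show.

```python
from pathlib import PurePosixPath


class BaseProcessorSketch:
    """Hypothetical stand-in for a framework base class."""
    module_root = PurePosixPath("/opt/pkg/ocrd_anybaseocr")  # made-up path

    @property
    def moduledir(self):
        # Default: resources are looked up directly in the package root.
        return self.module_root

    def resolve(self, name):
        # Non-recursive lookup: only direct children of moduledir count.
        return self.moduledir / name


class StructureAnalysisSketch(BaseProcessorSketch):
    @property
    def moduledir(self):
        # Point the module location at the bundled models/ subdirectory.
        return self.module_root / "models"


default_param = "structure_analysis"  # default stays a bare resource name
resolved = StructureAnalysisSketch().resolve(default_param)
```

With the override, both resolving and listing look in the same directory, so the default does not need a `models/` prefix.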
…bel_mapping Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
- rebase on ocrd v3.3 image
- add labels and variables according to spec
- preinstall ocrd-all-tool.json
- simplify
finish v3 api
Since this project is largely unmaintained and the only functionality currently used sensibly is the cropper and perhaps the layout analysis, this PR removes everything except those two.