Skip to content

🔥 remove everything except the cropper and layout analyzer and convert that to OCR-D/core v3 API#112

Merged
kba merged 41 commits intomasterfrom
v3-api
Apr 10, 2025
Merged

🔥 remove everything except the cropper and layout analyzer and convert that to OCR-D/core v3 API#112
kba merged 41 commits intomasterfrom
v3-api

Conversation

@kba
Copy link
Copy Markdown
Member

@kba kba commented Apr 8, 2025

Since this project is largely unmaintained and the only functionality currently used sensibly is the cropper and perhaps the layout analysis, this PR removes everything except those two.

@kba kba requested a review from bertsky April 8, 2025 16:20
@kba kba marked this pull request as draft April 8, 2025 16:31
@kba kba marked this pull request as ready for review April 8, 2025 17:03
Copy link
Copy Markdown
Collaborator

@bertsky bertsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks sane.

But I thought we wanted to do that in two separate stages:

  1. migrate the cropper to v3, keep the others as-is
  2. drop the others, perhaps by archiving the entire repo and re-introducing the cropper elsewhere

At least for the layout-analysis (i.e. model-driven page classifier) here we had the only existing processor of that kind (and to some extent of a tool that writes logical structMap/div, although docstruct now also does). I am not entirely sure the classifier was outright unusable – perhaps we should keep at least that until we have a replacement?

Comment thread .circleci/config.yml
Comment on lines -25 to -35
- restore_cache:
keys:
# ocrd-resources depends on the model files registered under ocrd_anybaseocr/ocrd-tool.json
- v2-models-{{ checksum "ocrd_anybaseocr/ocrd-tool.json" }}
- run:
name: download models
command: make models
- save_cache:
paths:
- ~/.local/share/ocrd-resources
key: v2-models-{{ checksum "ocrd_anybaseocr/ocrd-tool.json" }}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already forgot about that pattern. Too bad it needs to go here.

Comment thread src/ocrd_anybaseocr/cli/ocrd_anybaseocr_cropping.py Outdated
Comment thread src/ocrd_anybaseocr/ocrd-tool.json
Comment thread tests/test_crop.py Outdated
from .assets import assets, copy_of_directory


def test_crop():
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also adopt the scheme employed in other v3 processor tests here: checking with METS Server and page-parallel, too. (Infrastructure in conftest.py, test function just gets processor_kwargs, including the workspace.)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

9d077e2

But my fairly trivial test fails for pageparallel and following for test_layout, not sure where my logic is wrong :/

@kba kba changed the title 🔥 remove everything except the cropper and convert that to OCR-D/core v3 API 🔥 remove everything except the cropper and layout analyzer and convert that to OCR-D/core v3 API Apr 9, 2025
Comment thread src/ocrd_anybaseocr/cli/ocrd_anybaseocr_cropping.py
Comment thread src/ocrd_anybaseocr/cli/ocrd_anybaseocr_layout_analysis.py Outdated
Comment thread src/ocrd_anybaseocr/cli/ocrd_anybaseocr_layout_analysis.py Outdated
Comment thread src/ocrd_anybaseocr/cli/ocrd_anybaseocr_layout_analysis.py Outdated
Comment thread src/ocrd_anybaseocr/cli/ocrd_anybaseocr_layout_analysis.py Outdated
Comment thread src/ocrd_anybaseocr/cli/ocrd_anybaseocr_layout_analysis.py Outdated
Comment thread src/ocrd_anybaseocr/cli/ocrd_anybaseocr_layout_analysis.py Outdated
Comment on lines +93 to +94
self.write_to_mets(results, page_id)
return result
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entire approach does not work with the METS Server, which we cannot disallow beforehand, because max_workers only applies to the multiprocessing part. So running with METS Server would always fail in that method (raising something about ClientSideOcrdMets not provinding access to some attributes).

But we could do better: as examples like ocrd_pagetopdf or ocrd_cis/ocrd_cis/postcorrect/cli.py illustrate, one can do an extra METS serialization step at the end of process_workspace to ensure we get access to the METS file directly (and then have the METS Server reload itself, if present)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more concrete: on each page, you would just record enough information into a processor attribute dict. Then after the page loop (in process_workspace, after the super call), you act on that dict:

  • if necessary (ClientSideOcrdMets):
    • self.workspace.mets.save_mets(),
    • instantiate a new workspace (to get a direct METS)
    • do the METS update
    • save the new workspace METS
    • self.workspace.mets.reload_mets()
  • otherwise: do the METS update

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e3ba1d7

Not sure whether the final workspace.mets.reload() or the shutdown are necessary.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure whether the final workspace.mets.reload() or the shutdown are necessary.

Yes, the reload is necessary so the (next processor in the) workflow can proceed with the correct (new) version of the METS.

shutdown should actually del self.model if it exists. The reset should happen at the end of process_workspace. And log_id / log_links / logID / logIDs do not need to be attributes anymore – they could now be passed in to write_to_mets. In contrast, self.page_labels can be directly used by that function and does not need to be passed in. In fact, that function could itself loop over it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e0e63db I moved the self.reset() call and changed self.shutdown() accordingly, but I don't want to mess with the attributes/method params right now. Feel free to push if you have the stomach for that refactoring :)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I'll test locally and see what I can do

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Collaborator

@bertsky bertsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

(I still think we should make the METS Server scenario workable, though.)

Comment thread src/ocrd_anybaseocr/ocrd-tool.json Outdated
"format": "uri",
"content-type": "text/directory",
"cacheable": true,
"default": "structure_analysis",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does not work in the way these are currently packaged (see CI failure)

Suggested change
"default": "structure_analysis",
"default": "models/structure_analysis",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait. This would work for resolve, but then resmgr list-installed does not show them anymore (because we just disallowed recursion for module location in the spec).

What we could instead do in the processor class itself:

    def moduledir(self):
        return resource_filename(self.module, 'models')

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or just move the models to the package root. I'll try that first, if it works, I'll rebase before merging, so the git repo stays reasonably small.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, that would work as well. But redefining moduledir is the pattern we should generally use, I believe. (Also for preinstalled models in ocrd_froc etc.)

Comment thread src/ocrd_anybaseocr/ocrd-tool.json
Comment thread src/ocrd_anybaseocr/cli/ocrd_anybaseocr_layout_analysis.py Outdated
@kba kba merged commit fbb5942 into master Apr 10, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants