make install-models to install commonly used models for tesseract, ocropy, calamari #103

Merged
kba merged 11 commits into master from install-models on Jul 31, 2020
Member

@kba kba commented May 30, 2020

Adds targets install-models{,-tesseract,-ocropy,-calamari} to install recognition models by @stweil, @mikegerber, @jze, @chreul to sensible locations in $(VIRTUAL_ENV)/share.

The list of installed models is far from exhaustive and can/should be easily amended.

This is a stopgap solution until we have a proper model repository. I had to re-download models over the last two weeks for different environments and I thought I might as well create a script and ocrd_all seemed like a good place.

I tried to use GitHub locations where possible. If potential traffic spikes from adding model download here are of concern, I can set up a mirror.
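As a rough illustration of the layout described above (models installed per engine under `$(VIRTUAL_ENV)/share` by targets such as `make install-models-tesseract`), here is a minimal Python sketch. The subdirectory names `tessdata`, `ocropus` and `calamari` are an assumption based on the module names mentioned later in this thread, and `model_dir` is a hypothetical helper, not part of the Makefile.

```python
import os
from pathlib import Path
from typing import Optional

# Assumed subdirectory names (taken from the module names mentioned in this
# thread); the actual Makefile may use different ones.
MODEL_SUBDIRS = {
    "tesseract": "tessdata",
    "ocropy": "ocropus",
    "calamari": "calamari",
}


def model_dir(engine: str, venv: Optional[str] = None) -> Path:
    """Return the directory under $VIRTUAL_ENV/share where models for `engine` would go."""
    venv = venv or os.environ.get("VIRTUAL_ENV")
    if not venv:
        raise RuntimeError("VIRTUAL_ENV is not set; activate a virtual environment first")
    return Path(venv) / "share" / MODEL_SUBDIRS[engine]
```

For example, `model_dir("tesseract", "/opt/venv")` resolves to `/opt/venv/share/tessdata` under these assumptions.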

Collaborator

@stweil stweil left a comment

I think that's a good idea, and it will help others including me, too.

kba and others added 2 commits May 30, 2020 16:20
Collaborator

@stweil stweil left a comment

Thanks.

kba added 3 commits May 31, 2020 16:22
Update:

  * `ocrd_olena`: e000f24 (1.1.7)
  * `tesseract`: 62eae84
  * `ocrd_cis`: a6a2ecd

Added:

  * `CHANGELOG.md` to track changes
@kba kba marked this pull request as draft May 31, 2020 18:50
Collaborator

stweil commented May 31, 2020

Did I already mention that I don't like environment variables (with only a handful of exceptions like PATH)? I nearly never use TESSDATA_PREFIX. It is not needed when Tesseract was built for a virtual environment, and I would not propagate its use.

Member Author

kba commented May 31, 2020

I removed the instructions on TESSDATA_PREFIX, they were not working for ocrd_tesserocr anyways.

@kba kba force-pushed the install-models branch from e8fe2af to 48625f7 Compare May 31, 2020 19:10
@mikegerber
Contributor

> I tried to use GitHub locations where possible. If potential traffic spikes from adding model download here are of concern, I can set up a mirror.

In my experience, the GitHub downloads are painfully slow at times, you may want to consider mirroring them. (I mirrored tessdata_best for that reason here: https://qurator-data.de/mirror/github.com/tesseract-ocr/tessdata_best/archive/)
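A minimal sketch of how a download script could derive a mirror location from a GitHub URL. It assumes the mirror keeps the original host and path below a fixed prefix, which matches the single tessdata_best URL given above but is not confirmed for any other repository; `mirror_url` is a hypothetical helper.

```python
from urllib.parse import urlsplit

# Prefix taken from the mirror URL mentioned above.
MIRROR_PREFIX = "https://qurator-data.de/mirror"


def mirror_url(url: str, prefix: str = MIRROR_PREFIX) -> str:
    """Map an upstream URL to <prefix>/<host><path> (assumed mirror layout)."""
    parts = urlsplit(url)
    return f"{prefix}/{parts.netloc}{parts.path}"
```

A script could try the mirror first and fall back to the upstream URL if the mirrored file does not exist.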

@kba kba marked this pull request as ready for review June 17, 2020 08:25
Collaborator

@bertsky bertsky left a comment

Good work @stweil @kba.

Just a few thoughts:

  1. We are in the middle of a discussion about new rules in the spec for how relative filenames (for model files) should be resolved in processors. $VIRTUAL_ENV/share/<processor> is there as well, but surely a processor name is different from a module name like ocropus or calamari or tessdata. Maybe the other options there are more appropriate (like the XDG paths)? – Anyway, this could be changed again in another PR.
  2. How about adding (some of) the Calamari models here? Also, some kind of selection mechanism would be nice, so users can easily download and install a subset of these models by name (like TESSERACT_MODELS for Tesseract)...
  3. How about adding some Docker configs which include some of these, too?
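The selection mechanism suggested in point 2 could look roughly like the following sketch. It assumes a space-separated list of model names (in the style of `TESSERACT_MODELS`) and a name-to-URL catalog; the catalog entries and the `select_models` helper are hypothetical placeholders, not the models this PR installs.

```python
from typing import Dict

# Hypothetical catalog: model name -> download URL (placeholder URLs).
CATALOG: Dict[str, str] = {
    "deu": "https://example.org/models/deu.traineddata",
    "frk": "https://example.org/models/frk.traineddata",
    "eng": "https://example.org/models/eng.traineddata",
}


def select_models(catalog: Dict[str, str], requested: str = "") -> Dict[str, str]:
    """Return the subset of `catalog` named in the space-separated `requested` string.

    An empty selection means "install everything"; unknown names raise KeyError.
    """
    names = requested.split()
    if not names:
        return dict(catalog)
    unknown = [name for name in names if name not in catalog]
    if unknown:
        raise KeyError("unknown model(s): " + ", ".join(unknown))
    return {name: catalog[name] for name in names}
```

The `requested` string could come from a make variable or from `os.environ.get("TESSERACT_MODELS", "")`, depending on how the target is invoked.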

Collaborator

stweil commented Jun 18, 2020

Regarding Tesseract, my idea is that in the future models will be selectable by their URL, downloaded on demand and cached locally. That's one of the reasons why Tesseract uses the curl library.
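The "select by URL, download on demand, cache locally" idea can be sketched in a few lines of Python. This is only an illustration of the general pattern, not how Tesseract (which would use libcurl) or any OCR-D processor implements it; `cached_model` and the cache file naming scheme are assumptions.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit


def cached_model(url: str, cache_dir: Path) -> Path:
    """Return a local path for `url`, downloading the file only if it is not cached yet."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Prefix the file name with a hash of the URL so that equally named
    # models from different sources do not collide in the cache.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    name = Path(urlsplit(url).path).name or "model"
    target = cache_dir / f"{digest}-{name}"
    if not target.exists():
        # Download to a temporary name first so an interrupted transfer
        # never leaves a truncated file under the final name.
        tmp = target.with_name(target.name + ".part")
        with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
            shutil.copyfileobj(response, out)
        tmp.replace(target)
    return target
```

A second call with the same URL returns the cached file without touching the network. A real implementation would also need cache invalidation and checksum verification, which this sketch leaves out.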

Member Author

kba commented Jul 7, 2020

> downloaded on demand and cached locally

That is also what we have in mind for processors in general, hence the cacheable and content-type attributes to mark parameters as (cacheable) remote files.

As I said before, this is merely a stopgap solution to let users benefit from the enhanced models out there now, while model repository infrastructure and processors improve.

Member Author

kba commented Jul 31, 2020

Since we're already describing the mechanism in our documentation, I'll merge. A better solution for bundling data with processors is upcoming.

@kba kba merged commit 11f156e into master Jul 31, 2020
@kba kba deleted the install-models branch July 31, 2020 09:09
@mikegerber
Contributor

> 2. How about adding (some of) the Calamari models [here](https://github.com/Calamari-OCR/calamari_models)?

(Late to the party.) I have never gotten any useful result from those models, but that may have been my mistake. → should be tested before integration.
