make install-models to install commonly used models for tesseract, ocropy, calamari #103
stweil
left a comment
I think that's a good idea, and it will help others including me, too.
Co-authored-by: Stefan Weil <sw@weilnetz.de>
Update:
* `ocrd_olena`: e000f24 (1.1.7)
* `tesseract`: 62eae84
* `ocrd_cis`: a6a2ecd

Added:
* `CHANGELOG.md` to track changes
Did I already mention that I don't like environment variables (with only a handful of exceptions like

I removed the instructions on
In my experience, GitHub downloads are painfully slow at times, so you may want to consider mirroring them. (I mirrored tessdata_best for that reason here: https://qurator-data.de/mirror/github.com/tesseract-ocr/tessdata_best/archive/)
bertsky
left a comment
Just a few thoughts:
- We are in the middle of a discussion about new rules in the spec on how relative filenames (for model files) should be resolved in processors. `$VIRTUAL_ENV/share/<processor>` is there as well, but surely a processor name is different from a module name like `ocropus` or `calamari` or `tessdata`. Maybe the other options there are more appropriate (like the XDG paths)? Anyway, this could be changed again in another PR.
- How about adding (some of) the Calamari models here? Also, some kind of selection mechanism would be nice, so users can easily download and install a subset of these models by name (like `TESSERACT_MODELS` for Tesseract)...
- How about adding some Docker configs which include some of these, too?
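To make the suggested selection mechanism concrete, here is a minimal shell sketch of choosing a subset of models by name, in the spirit of `TESSERACT_MODELS`. Everything in it is hypothetical: the `CATALOGUE` variable, the `select_models` function, the model names and the `example.org` URLs are placeholders for illustration and are not taken from the Makefile in this PR.

```shell
# Hypothetical catalogue: one "name url" pair per line (placeholder URLs).
CATALOGUE='deu https://example.org/models/deu.traineddata
frk https://example.org/models/frk.traineddata
eng https://example.org/models/eng.traineddata'

# Print the URLs of the models named in $1 (space-separated list).
# An empty selection is treated as "all models".
select_models() {
  wanted=$1
  printf '%s\n' "$CATALOGUE" | while read -r name url; do
    case " $wanted " in
      "  " | *" $name "*) printf '%s\n' "$url" ;;
    esac
  done
}

# Example: only the two German models.
select_models "deu frk"
```

A make target could then pass such a user-supplied list through to whatever downloads the files, so that the default stays "install everything" while a subset remains one variable away.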
Regarding Tesseract, my idea is that in the future models will be selectable by their URL, downloaded on demand and cached locally. That's one of the reasons why Tesseract uses the curl library.
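The "download on demand and cache locally" idea can be sketched in a few lines of shell. This is only an illustration using the `curl` command-line tool, not Tesseract's own (future) behaviour; the function name `fetch_model`, the `MODEL_CACHE` variable and the default cache directory are assumptions made for this sketch.

```shell
# Hypothetical helper: download a model once, then reuse the cached copy.
# MODEL_CACHE and the default cache path are assumptions for this sketch.
fetch_model() {
  url=$1
  cache_dir=${MODEL_CACHE:-${XDG_CACHE_HOME:-$HOME/.cache}/ocr-models}
  target=$cache_dir/$(basename "$url")
  if [ ! -s "$target" ]; then
    mkdir -p "$cache_dir" || return 1
    # Download to a temporary name so an interrupted transfer never
    # leaves a truncated file that looks like a valid cache entry.
    if curl --fail --location --silent --show-error \
         --output "$target.part" "$url"; then
      mv "$target.part" "$target"
    else
      rm -f "$target.part"
      return 1
    fi
  fi
  # Print the local path so callers can hand it to the OCR engine.
  printf '%s\n' "$target"
}

# Usage (placeholder URL):
#   model_path=$(fetch_model "https://example.org/models/deu.traineddata")
```

Keying the cache on the file's base name keeps the sketch short; two different URLs ending in the same file name would collide, so a real implementation would more likely key on the full URL or a hash of it.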
That is also what we have in mind for processors in general, hence the As I said before, this is merely a stopgap solution to let users benefit from the enhanced models out there now, while model repository infrastructure and processors improve.
Since we're already describing the mechanism in our documentation, I'll merge. Work on a better solution for bundling data with processors is upcoming.
(Late to the party.) I have never gotten any useful result from those models, but that may have been my mistake. → should be tested before integration.
Adds targets `install-models{,-tesseract,-ocropy,-calamari}` to install recognition models by @stweil, @mikegerber, @jze, @chreul to sensible locations in `$(VIRTUAL_ENV)/share`.

The list of installed models is far from exhaustive and can/should be easily amended.

This is a stopgap solution until we have a proper model repository. I had to re-download models over the last two weeks for different environments and I thought I might as well create a script and `ocrd_all` seemed like a good place.

I tried to use GitHub locations where possible. If potential traffic spikes from adding model download here are of concern, I can set up a mirror.
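For readers unfamiliar with the brace shorthand, `install-models{,-tesseract,-ocropy,-calamari}` stands for four separate make targets. The small sketch below spells them out; the helper name `list_install_targets` is hypothetical, and the commented invocations assume a checkout of `ocrd_all` with `VIRTUAL_ENV` set.

```shell
# Print the four target names denoted by
# install-models{,-tesseract,-ocropy,-calamari}.
list_install_targets() {
  for suffix in "" -tesseract -ocropy -calamari; do
    printf 'install-models%s\n' "$suffix"
  done
}

# Typical use from the ocrd_all checkout (assumed, see above):
#   make install-models            # presumably all model sets
#   make install-models-tesseract  # only the Tesseract models
list_install_targets
```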