chained workflows #25

@bertsky

It would help if workflows could be chained at runtime, e.g. ocrd-make -f pre3.mk -f seg1.mk -f ocr4.mk -f post.mk, where each makefile consumes the last fileGrp of the previous one, so that each stage can be replaced by an alternative configuration independently of the others. This in turn would allow writing very concise, small (sub-)configurations without repetition.

As for implementation, make allows passing multiple makefiles and reads them sequentially, as if they were concatenated (first phase: immediate variables are expanded, and the rules of all files are merged into a single dependency graph). Only then (second phase) does it decide which targets need updating and run their recipes.
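
The read order can be illustrated with a minimal sketch. The file names a.mk and b.mk and the variables a-sees and b-sees-before are hypothetical, chosen only for this illustration; the recipe is written after a semicolon on the rule line to avoid depending on tab characters.

```makefile
# --- a.mk (hypothetical) ---
OUTPUT := A
# simply expanded, so this captures the value OUTPUT has right now: "A"
a-sees := $(OUTPUT)

# --- b.mk (hypothetical) ---
# a.mk was read first, so OUTPUT is still "A" at this point
b-sees-before := $(OUTPUT)
OUTPUT := B

# recipes are expanded in the second phase, when OUTPUT is already "B"
show: ; @echo $(a-sees) $(b-sees-before) $(OUTPUT)
```

With GNU make, running make -f a.mk -f b.mk show should print A A B: each simply expanded assignment sees only what was defined in the files (and lines) read before it, while the recipe sees the final value.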

So, by convention (for chainable configurations), we could allow defining a simply expanded variable, say OUTPUT, for the stage's output fileGrp name, and allow using INPUT for the stage's dynamic input fileGrp name. Internally (i.e. in our Makefile, which always needs to be included), we then predefine override INPUT := $(or $(OUTPUT),$(INPUT)) and .DEFAULT_GOAL := $(OUTPUT). The override is needed because a value of INPUT passed on the command line would otherwise take precedence over any assignment in a makefile. For the very first stage (entry point), we then just have to pass INPUT, either in a separate (stage-zero) config file without rules or as an additional command-line argument.

For example

  • pre3.mk
BIN: $(INPUT)
BIN: TOOL = ocrd-doxa-binarize

DESK: BIN
DESK: TOOL = ocrd-cis-ocropy-deskew
DESK: PARAMS = "level-of-operation": "page"

CROP: DESK
CROP: TOOL = ocrd-anybaseocr-crop
CROP: PARAMS = "rulerAreaMax": 0

OUTPUT := CROP
  • seg1.mk
SEG: $(INPUT)
SEG: TOOL = ocrd-kraken-segment
SEG: PARAMS = "model": "blla.mlmodel"

RESEG: SEG
RESEG: TOOL = ocrd-cis-ocropy-resegment
RESEG: PARAMS = "method": "baseline"

OUTPUT := RESEG
  • ocr4.mk
OCR1: $(INPUT)
OCR2: $(INPUT)
OCR3: $(INPUT)
OCR1 OCR2 OCR3: OPTIONS = -P textequiv_level glyph

OCR1: TOOL = ocrd-tesserocr-recognize
OCR1: OPTIONS += -P model frak2021+deu

OCR2: TOOL = ocrd-calamari-recognize
OCR2: OPTIONS += -P checkpoint_dir qurator-gt4histocr-1.0

OCR3: TOOL = ocrd-kraken-recognize
OCR3: OPTIONS += -P model austriannewspapers.mlmodel

MULTI: OCR1 OCR2 OCR3
MULTI: TOOL = ocrd-cor-asv-ann-align
MULTI: PARAMS = "method": "combined"

OUTPUT := MULTI
  • post.mk
ALTO: $(INPUT)
ALTO: TOOL = ocrd-fileformat-transform
ALTO: OPTIONS = -P from-to "page alto" -P script-args "--no-check-border --dummy-word"

OUTPUT := ALTO
  • in preinstalled Makefile
override INPUT := $(or $(OUTPUT),$(INPUT))
.DEFAULT_GOAL := $(OUTPUT)
...
  • running
make -f pre3.mk -f seg1.mk -f ocr4.mk -f post.mk INPUT=ORIGINAL
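
The hand-over between stages can be tried out with a toy sketch that replaces the OCR-D processors with echo. It assumes that each stage includes the shared makefile as its last line, so that the two lines are re-evaluated after every stage; that detail is not spelled out above. The names chain.mk, stage1.mk and stage2.mk are hypothetical.

```makefile
# --- chain.mk (hypothetical stand-in for the preinstalled Makefile) ---
# after a stage has set OUTPUT, it becomes the INPUT of the next stage;
# before that (OUTPUT empty), the command-line value of INPUT is kept
override INPUT := $(or $(OUTPUT),$(INPUT))
.DEFAULT_GOAL := $(OUTPUT)

# --- stage1.mk (hypothetical) ---
# prerequisites are expanded when the rule is read: INPUT is ORIGINAL here
A: $(INPUT) ; @echo "A <- $^"
OUTPUT := A
include chain.mk

# --- stage2.mk (hypothetical) ---
# INPUT was reset to A by the include at the end of stage1.mk
B: $(INPUT) ; @echo "B <- $^"
OUTPUT := B
include chain.mk
```

Given an existing file or directory named ORIGINAL, make -f stage1.mk -f stage2.mk INPUT=ORIGINAL should print A <- ORIGINAL followed by B <- A, and the default goal ends up being the last stage's OUTPUT. Swapping stage1.mk for an alternative that sets a different OUTPUT leaves stage2.mk untouched.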

Since this only requires these 2 additional lines and does not break existing makefiles, this is more of a documentation issue actually. (And probably, the old makefiles should be removed or updated or split into multi-stage configurations anyway.)

@mikegerber would that fit your need as well?
