There is a regression in 84a4e1a: When passing multiple pages for an image-only input fileGrp, e.g. -g phys_0001,phys_0007 -I OCR-D-IMG, now the logic that tries to prevent mixing derived images with original images is falsely triggered:

    ret = self.workspace.mets.find_all_files(
        fileGrp=self.input_file_grp, pageId=self.page_id, mimetype="//image/.*")
    if self.page_id and len(ret) > 1:
        raise ValueError("No PAGE-XML %s in fileGrp '%s' but multiple images." % (
            "for page '%s'" % self.page_id if self.page_id else '',
            self.input_file_grp
            ))
    return ret

The problem is that self.page_id here is actually a list of page IDs (formatted in comma-join notation), so a multi-page request legitimately yields more than one image, and len(ret) > 1 holds even though no single page has multiple images.
So the correct way of ensuring that no single page gets multiple image file results is by
- either preventing find_all_files from aggregating them like this (which is probably valid in other contexts, though)
- or going through its result ret and checking whether any of its pageIds repeat:

      page_ids = [file.pageId for file in ret]
      if len(page_ids) != len(set(page_ids)):
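
The second option could look roughly like the sketch below. It is self-contained and only illustrative, not the actual patch: ImageFile is a hypothetical stand-in for the file objects returned by find_all_files (only their pageId attribute is used), repeated_page_ids and check_one_image_per_page are made-up helper names, and the error message is merely adapted from the one quoted above.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the objects returned by find_all_files();
# only the pageId attribute matters for this check.
ImageFile = namedtuple("ImageFile", ["ID", "pageId"])


def repeated_page_ids(files):
    """Return the sorted pageIds that occur more than once in files."""
    counts = Counter(f.pageId for f in files)
    return sorted(page_id for page_id, n in counts.items() if n > 1)


def check_one_image_per_page(files, file_grp):
    """Raise ValueError if any single page has several image files."""
    dupes = repeated_page_ids(files)
    if dupes:
        raise ValueError(
            "No PAGE-XML in fileGrp '%s' but multiple images for page(s) %s."
            % (file_grp, ", ".join(dupes)))
    return files
```

With this check, a request such as -g phys_0001,phys_0007 passes as long as each page contributes exactly one image, and only a page that appears twice in the result raises the error.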
The check quoted at the top is from core/ocrd/ocrd/processor/base.py, lines 118 to 125 in edf31fa.