Processor.input_files broke for pageId selector lists #622

There is a regression in 84a4e1ac38479732ef8674981047a6f6a55d0e9f: when multiple pages are selected for an image-only input fileGrp, e.g. `-g phys_0001,phys_0007 -I OCR-D-IMG`, the check that is meant to prevent mixing derived images with original images is now triggered falsely:
https://github.com/OCR-D/core/blob/edf31fae3e88aaa2c6a5bd002618946c8f21df95/ocrd/ocrd/processor/base.py#L118-L125

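The problem is that `self.page_id` here is not a single page ID but actually a _list_ (formatted in comma-join notation), so `len(ret) > 1` is perfectly normal as soon as more than one page is selected.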

So the correct way to ensure that no single page gets multiple image file results is
- either to stop `find_all_files` from aggregating them like this (although that aggregation is probably valid in other contexts)
- or to go through its result `ret` and check whether any of its `pageId`s repeat:
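```python
page_ids = [file.pageId for file in ret]
# a repeated pageId means a single page has more than one image file
if len(page_ids) != len(set(page_ids)):
    raise ValueError("No PAGE-XML in fileGrp '%s' but multiple images for a single page." % self.input_file_grp)
```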
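To make the difference between the two checks concrete, here is a minimal standalone sketch. It does not use the real workspace or METS classes: `FakeFile` and `repeated_page_ids` are hypothetical names introduced only for illustration, and the file IDs are made up. It assumes only that each result object exposes a `pageId` attribute, as in the snippet above.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the objects returned by find_all_files;
# only the pageId attribute matters for this check.
FakeFile = namedtuple("FakeFile", ["ID", "pageId"])


def repeated_page_ids(files):
    """Return the sorted pageIds that occur more than once in files."""
    counts = Counter(f.pageId for f in files)
    return sorted(page_id for page_id, n in counts.items() if n > 1)


# Two pages selected, one image each: this should be accepted.
ok = [FakeFile("IMG_0001", "phys_0001"), FakeFile("IMG_0007", "phys_0007")]
# The same selection, but one page now has a second image: ambiguous.
bad = ok + [FakeFile("IMG_0001_BIN", "phys_0001")]

# The current check only looks at the selector and the result count,
# so it also fires for the valid two-page selection.
page_id = "phys_0001,phys_0007"
old_check_fires = bool(page_id and len(ok) > 1)
```

With this, `repeated_page_ids(ok)` is empty while `repeated_page_ids(bad)` names the offending page, which would also allow a more specific error message than the comma-joined selector.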

	ret = self.workspace.mets.find_all_files(
	    fileGrp=self.input_file_grp, pageId=self.page_id, mimetype="//image/.*")
	if self.page_id and len(ret) > 1:
	    raise ValueError("No PAGE-XML %s in fileGrp '%s' but multiple images." % (
	        "for page '%s'" % self.page_id if self.page_id else '',
	        self.input_file_grp
	        ))
	return ret
