In make_file_id, if the input file's ID does not contain the input fileGrp, no longer attempt to extract the numerical part of the pageId (which might still clash). Before falling back to a purely numerical ID, additionally check whether the ID already contains the pageId; in that case, only append the output fileGrp to that ID.
7ba0454 to f96d3fd
(no numerical pageId extraction any more)
To keep that option alive, it would probably also work to just strip any non-numerical content. Let me know if this is preferable (with lower priority, right before fallback), and I'll add that change.
kba left a comment
Often file IDs have two numbers, one of which will clash. In that case only the numerical fallback works.
To keep that option alive, it would probably also work to just strip any non-numerical content. Let me know if this is preferable (with lower priority, right before fallback), and I'll add that change.
IIUC this seems unnecessary. Do you have an example of an ID where this might be pertinent?
    """https://github.com/OCR-D/core/pull/605"""
    mets = OcrdMets.empty_mets()
    f = mets.add_file('1:!GRP', ID='FOO_0001', pageId='phys0001')
    f = mets.add_file('2:!GRP', ID='FOO_0001', pageId='phys0001')
We should probably disallow fileGrp names starting with a number, because the resulting ID might not be a valid xsd:ID: those must not start with a number.
Yes, if #746 kicks in, such tests should all fail...
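The xsd:ID concern raised in this thread can be illustrated with a small check. This is only a sketch: the helper name `looks_like_xsd_id` is hypothetical, and the pattern is an ASCII-only simplification of the NCName rule behind xsd:ID (the real rule also admits many non-ASCII letters). It is not taken from OCR-D/core or from #746.

```python
import re

# ASCII-only approximation of an XML NCName (the lexical form of xsd:ID):
# first character is a letter or underscore; later characters may also be
# digits, '.', or '-'. Colons are not allowed. (Assumption: non-ASCII name
# characters are deliberately left out to keep the sketch short.)
_ID_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def looks_like_xsd_id(candidate):
    """Return True if `candidate` fits the simplified xsd:ID pattern."""
    return _ID_PATTERN.fullmatch(candidate) is not None


print(looks_like_xsd_id("OCR-D-BIN_0001"))  # letter first, allowed characters only
print(looks_like_xsd_id("1:!GRP_0001"))     # starts with a digit, contains ':' and '!'
```

An ID built from a fileGrp such as '1:!GRP' fails this check on its first character alone, which is the case the comment above is worried about.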
Co-authored-by: Konstantin Baierer <kba@users.noreply.github.com>
I recently saw it in Konzilsprotokolle GT, e.g. |
In make_file_id, if the input file's ID does not contain the input fileGrp, then do not attempt to extract the numerical part of the pageId (which might still clash). But before falling back to a purely numerical ID, additionally check whether the input file's ID already contains the pageId: in that case, only append the output fileGrp to that ID (because it is sufficiently unique already).
(Often file IDs have two numbers, one of which will clash. In that case only the numerical fallback works. On the other hand, often the file IDs from non-OCRD data contain the pageId directly, in which case it's better to stick to that convention.)
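The order of checks described in this commit message can be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the code in OCR-D/core: the function name `sketch_file_id`, its plain-string parameters, the underscore separator, the four-digit padding, and the `position` argument (the 1-based index of the file within its fileGrp) are all hypothetical.

```python
def sketch_file_id(file_id, page_id, input_grp, output_grp, position):
    """Hypothetical stand-in for the ID derivation described above."""
    # 1. Usual case: the input ID embeds the input fileGrp, so swap it
    #    for the output fileGrp.
    if input_grp and input_grp in file_id:
        return file_id.replace(input_grp, output_grp)
    # 2. The ID already contains the pageId, so it is unique enough:
    #    only append the output fileGrp (separator is an assumption).
    if page_id and page_id in file_id:
        return f"{file_id}_{output_grp}"
    # 3. Fallback: purely numerical ID built from the output fileGrp and
    #    the file's position in its group (padding is an assumption).
    return f"{output_grp}_{position:04d}"


# ID contains the input fileGrp: plain replacement.
print(sketch_file_id("OCR-D-IMG_0001", "phys0001", "OCR-D-IMG", "OCR-D-BIN", 1))
# ID contains the pageId but not the fileGrp: append the output fileGrp.
print(sketch_file_id("FOO_phys0001", "phys0001", "OCR-D-IMG", "OCR-D-BIN", 1))
# Neither applies (the test case from this PR): numerical fallback.
print(sketch_file_id("FOO_0001", "phys0001", "1:!GRP", "2:!GRP", 3))
```

Checking for the pageId before the numerical fallback keeps IDs from non-OCR-D data recognizable, while the fallback remains the only safe choice when the ID carries a second number that could clash.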