Although the spec requires fileGrp/@USE names to follow a very strict scheme, we have not enforced this in core (only the workspace validator checks it). However, if fileGrp names are left completely unrestricted, we get follow-up problems. For example, since we normally base file IDs on fileGrp names, some user choices unwittingly result in invalid METS:
element file: Schemas validity error : Element '{http://www.loc.gov/METS/}file', attribute 'ID': 'OCR-D-OCR-TESS-Fraktur+Latin-SEG-LINE-tesseract-ocropy-DEWARP_0005' is not a valid value of the atomic type 'xs:ID'.
element file: Schemas validity error : Element '{http://www.loc.gov/METS/}file', attribute 'ID': 'OCR-D-GT-SEG-PAGE-ſs-sſ-EVAL_0006' is not a valid value of the atomic type 'xs:ID'.
...
element fptr: Schemas validity error : Element '{http://www.loc.gov/METS/}fptr', attribute 'FILEID': 'OCR-D-OCR-TESS-Fraktur+Latin-SEG-LINE-tesseract-ocropy-DEWARP_0005' is not a valid value of the atomic type 'xs:IDREF'.
element fptr: Schemas validity error : Element '{http://www.loc.gov/METS/}fptr', attribute 'FILEID': 'OCR-D-GT-SEG-PAGE-ſs-sſ-EVAL_0006' is not a valid value of the atomic type 'xs:IDREF'.
I therefore suggest extending add_file's

    if not REGEX_FILE_ID.fullmatch(ID):
        raise ValueError("Invalid syntax for mets:file/@ID %s" % ID)

check to add_file_grp.
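
To illustrate what such a check could look like, here is a minimal standalone sketch. It is not the actual core code: `ID_PATTERN` is a hypothetical, ASCII-only stand-in for `REGEX_FILE_ID` (whose definition is not shown here), and `check_file_grp_name` and `is_valid_file_grp_name` are made-up helper names. The pattern is deliberately stricter than xs:ID, so it also rejects the `ſ` case from the validator output above.

```python
import re

# Hypothetical stand-in for REGEX_FILE_ID (assumption, not the core definition):
# an ASCII-only subset of what xs:ID allows. The first character must be a
# letter or underscore; the rest may be letters, digits, '_', '.' or '-'.
ID_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def check_file_grp_name(file_grp):
    """Raise ValueError if file_grp is not usable as the prefix of a mets:file/@ID."""
    if not ID_PATTERN.fullmatch(file_grp):
        raise ValueError("Invalid syntax for mets:fileGrp/@USE %s" % file_grp)
    return file_grp


def is_valid_file_grp_name(file_grp):
    """Boolean convenience wrapper around check_file_grp_name."""
    try:
        check_file_grp_name(file_grp)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for name in (
        "OCR-D-IMG",
        "OCR-D-OCR-TESS-Fraktur+Latin-SEG-LINE-tesseract-ocropy-DEWARP",
        "OCR-D-GT-SEG-PAGE-ſs-sſ-EVAL",
    ):
        print(name, is_valid_file_grp_name(name))
```

Rejecting the name when the fileGrp is created means the user sees the error at the point of their choice, rather than later as a schema error on a derived file ID.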
The add_file check is at core/ocrd_models/ocrd_models/ocrd_mets.py, lines 298 to 299 in d9f660e.