Skip to content

Bare bones add#23

Closed
kba wants to merge 2 commits intobertsky:masterfrom
kba:bare-bones-add
Closed

Bare bones add#23
kba wants to merge 2 commits intobertsky:masterfrom
kba:bare-bones-add

Conversation

@kba
Copy link
Copy Markdown
Contributor

@kba kba commented Aug 5, 2022

These changes are not meant to be taken over as-is, they are just adaptions to make ocrd-import

  1. faster
  2. ensure the generated page ID have a form that results in predictable and make_file_id-comptaible IDs

Just wanted to make a draft PR as a discussion basis.

@kba kba mentioned this pull request Aug 8, 2022
Copy link
Copy Markdown
Owner

@bertsky bertsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am willing to help make ocrd-import more versatile, but don't fully understand your use-case (esp. the P_ change).

Comment thread ocrd-import
Comment on lines -7 to +3
function debug { ocrd log -n ocrd-import debug "$1"; }
function critical { echo critical "$1"; }
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the OCR-D logging facility is too inefficient, I suggest to use a proper mechanism to override it with an additional switch. For example, by aliasing ocrd log -n ocrd-import or echo to a common log backend.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was just a quick hack to reduce the overhead of calling ocrd log.

For example, by aliasing ocrd log -n ocrd-import or echo to a common log backend.

How do you mean, aliasing?

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you mean, aliasing?

I meant

alias log="ocrd log -n ocrd-import"
...
((fastlog)) && alias log=echo
...
function debug { log debug "$1"; }
...
function critical { log critical "$1"; }

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 that is a nice solution and configurable too.

Comment thread ocrd-import
set -e
trap rollback ERR
page=p${zeros:0:$((4-${#num}))}$num
echo "PAGE=$page"
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better use the log fn

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was just debugging by @stefanCCS, can go without replacement.

Comment thread ocrd-import
if ! [[ ${base:0:1} =~ [a-zA-Z] ]]; then
base=f${base}
#base=P_${base}
a=0 # just do something to have a correct syntax for this 'if'
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-XS-compliant page names are a problem regardless of numpageid or not. I don't see why this branch should be abandoned. (But if you do, : is the sh word for no-op.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what the intention was here. @stefanCCS ?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kba , @bertsky : My intention was to create the same result using ocrd-import as with using ocrd workspace add.
Here pageelements looking like this
<mets:div TYPE="page" ID="P_3871_007767869_00004">
having an IDstarting with P_ (instead of starting with f).

Comment thread ocrd-import
Comment on lines -265 to +183
done
#case "$mimetype" in
# ${MIMETYPE_PAGE})
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to have a version with next to no checks, the right approach would be to just add a switch to circumvent the no-clash check for resulting pages/files. Everything else is already fast (like the PAGE vs ALTO check for .xml) or can be deactivated (like the --no-convert switch).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was very broad commenting out. The "offending" call that slows down import is

clashes=($(ocrd workspace find -i "$base" -k local_filename -k mimetype -k pageId))

@bertsky
Copy link
Copy Markdown
Owner

bertsky commented Mar 14, 2025

#34 includes a Pythonic version of ocrd-import that (among many other improvements) handles this via ocrd_utils.make_xml_id, and is much faster.

@bertsky bertsky closed this Mar 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants