Description
With regard to writing file paths in manifest files, the spec states the following:
If a filepath includes a Line Feed (LF), a Carriage Return (CR), a Carriage-Return Line Feed (CRLF), or a percent sign (%), those characters (and only those) MUST be percent-encoded following [RFC3986].
My reading of the intent of the spec is that manifest files should be usable by Unix checksum utilities. However, this percent-encoding requirement breaks that compatibility. While CR and LF are rare in file paths, the requirement becomes a problem because it also necessitates encoding %. It is fairly common to percent-encode a file name if you're worried about special characters. Per the spec, these percent-encoded file names would then be double-encoded when written to the manifest, making the manifest unusable by checksum utilities.
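To make the double-encoding concrete, here is a minimal Python sketch of the rule as I read it from the spec text quoted above. The function name `encode_manifest_path` is hypothetical and not taken from any existing BagIt library; the only thing it assumes is that %, CR, and LF (and nothing else) are percent-encoded.

```python
def encode_manifest_path(path: str) -> str:
    """Percent-encode only %, CR, and LF, per my reading of the spec.

    Hypothetical helper for illustration; not from any BagIt library.
    """
    # Encode % first so the escapes introduced for CR and LF
    # are not themselves re-encoded.
    return (
        path.replace("%", "%25")
        .replace("\r", "%0D")
        .replace("\n", "%0A")
    )


# A file name that was already percent-encoded on disk gets double-encoded.
print(encode_manifest_path("my%20file.txt"))  # my%2520file.txt

# A file name containing an actual newline.
print(encode_manifest_path("new\nline"))  # new%0Aline
```

A checksum utility reading the manifest would look for a file literally named my%2520file.txt, which does not exist on disk.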
I have browsed a large number of the existing BagIt implementations on GitHub, and I have yet to find a single one that implements this requirement correctly. Implementations either 1) do no encoding or 2) encode only CR and LF and do not encode %. The first behavior is broken for file names that contain an LF or CR, and the second is broken for file names that are naturally percent-encoded. Both are also incompatible with an implementation that actually follows the spec.
I am currently working on yet another implementation, and it's hard to decide what to do here. If I implement the spec as written, my bags will be unusable by any other implementation. This seems to suggest that I should ignore the spec and not encode anything, which is the more prevalent behavior and less broken than the partial encoding.
Unix checksum utilities use an entirely different mechanism to handle newlines within file names. When there is a newline in a file name, the newline is represented as \n and a \ is added to the beginning of the line. Additionally, literal \ characters are escaped with another \, and a \ is likewise added to the beginning of the line.
For example, let's say that we have a file named new\nline (importantly, this must contain an actual newline and not the characters \ and n) and one named back\slash, and then execute the following:
# On linux
$ sha256sum *
\e6f805fa5fc041ab4bb7aa119641f77ac3e9f42106bc9f92354080692736c8de back\\slash
\7ba826f0c347f6adc4686c8d1f61aeb2e2e98322749cd4f82204c926f4022cee new\nline
# On mac
$ shasum -a 256 *
\e6f805fa5fc041ab4bb7aa119641f77ac3e9f42106bc9f92354080692736c8de back\\slash
\7ba826f0c347f6adc4686c8d1f61aeb2e2e98322749cd4f82204c926f4022cee new\nline

This seems like a much more reasonable encoding to support, though it is a shame that the output of these utilities is not codified.
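For comparison, the escaping described above can be sketched in a few lines of Python. This is only my approximation of the observed behavior, not a reference implementation: the function name `format_checksum_line` is hypothetical, it handles only the newline and backslash cases discussed here, and the two-space separator is the one GNU coreutils prints in text mode.

```python
def format_checksum_line(digest: str, name: str, sep: str = "  ") -> str:
    """Format a manifest line the way sha256sum appears to.

    Hypothetical helper; covers only the backslash and newline
    escapes described above.
    """
    # Escape backslashes first so the backslash introduced for
    # the newline escape is not doubled.
    escaped = name.replace("\\", "\\\\").replace("\n", "\\n")
    # A leading backslash marks the line as containing escapes.
    prefix = "\\" if escaped != name else ""
    return f"{prefix}{digest}{sep}{escaped}"


print(format_checksum_line("e6f8", "back\\slash"))  # \e6f8  back\\slash
print(format_checksum_line("7ba8", "new\nline"))    # \7ba8  new\nline
print(format_checksum_line("0000", "plain.txt"))    # 0000  plain.txt
```

File names without a newline or backslash, including ones that are already percent-encoded, pass through unchanged.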