Skip to content

Conversation

@prateekm
Copy link
Contributor

@prateekm prateekm commented Mar 22, 2023

Symptom: Blob store state restore fails when verifying restored directory contents due to a mismatch between permissions of local vs remote files.

Cause: Apache Commons 2.11 (at least, more versions may be affected) FileUtils#deleteDirectory() has a bug where it changes the permissions of symlink/hardlink files to read-only after deletion. For more details, see https://issues.apache.org/jira/browse/IO-751. This means that TaskStorageCommitManager#deleteOldCheckpointDirs() causes the permissions for the original store files hard-linked from the checkpoint directory to be changed to read only. This causes the file to be considered different than the one already uploaded on next checkpoint and re-uploaded. Since restore recreates all files in rw mode, they fail post-restore verifications due to permissions mismatch.

Changes:
Since we cannot control which commons-io version the user application is using, this PR changes all uses of FileUtils#deleteDirectory to PathUtils#deleteDirectory() which is not affected. (according to the ticket above, and verifed with the failing test).
Unrelated cleanup: changed state backend integration tests to use a unique temp directory for each run.

Tests: Discovered with the BlobStoreStateBackendIntegrationTest added in PR #1657. Verified fix by re-running test with the bad apache-commons version.

private static final String BLOB_STORE_BASE_DIR;
static {
try {
LOGGED_STORE_BASE_DIR = Files.createTempDirectory("logged-store-").toString();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unrelated cleanup. With this the temp base dir is unique for each test run (createTempDirectory adds a random suffix to the provided prefix).

@prateekm
Copy link
Contributor Author

@shekhars-li FYI, another symptom/side effect of this bug is that a file can appear in both DirIndex#filesPresent and DirIndex#filesRemoved lists, despite the validation in the constructor. This is because validatePresentAndRemovedOverlap compares FileIndexes, whose equals also compares file permissions. So the same file with changed permissions (e.g. due to bug above) will be considered different than the remote file and be present in filesAdded (new local file) and filesRemoved (old remote file).

@prateekm prateekm force-pushed the apache-commons-fileutils-fix branch from 6ef7210 to 237d05d Compare March 22, 2023 04:21
Copy link
Contributor

@mynameborat mynameborat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: update the PR description changes section to reflect the clean up as well.

@prateekm prateekm merged commit 985a604 into apache:master Mar 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants