-
Notifications
You must be signed in to change notification settings - Fork 9.2k
HADOOP-19217. Introduce getTrashPolicy to FileSystem API #8063
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Conversation
ee71aac to
4183ac9
Compare
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
@ayushtkn @KeeProMise Could you please review this PR? Thank you very much! |
|
@steveloughran Please help to take a look when you are available. Thanks in advance. |
|
curious where we will use it, adding this as a FileSystem API looks overkill to me, if it required for some particular use case, maybe part of getServerDefaults()? |
|
@ayushtkn Thanks for taking a look.
The problem that we encountered is that we have a client that can access both hdfs:// and ofs:// but TrashPolicyDefault and OzoneTrashPolicy is not compatible (TrashPolicyDefault was changed that cause issues when applied to Ozone). Despite setting "fs.trash.classname" in both HDFS and Ozone, only one will be picked (usually TrashPolicyDefault) which causes user not being able to move files to trash in Ozone. Other related works like HADOOP-18013 and HADOOP-18893 approached it by using per-scheme configuration (e.g. fs.s3a.trash.classname)
I am not aware with |
|
cc: @sadanand48 |
|
just noticed this. Having it per FS really matters, as it allows people to have an hdfs fs which uses trash and and s3 store where you want fast deletions |
steveloughran
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| TrashPolicy trash = ReflectionUtils.newInstance(trashClass, conf); | ||
| trash.initialize(conf, fs); // initialize TrashPolicy | ||
| return trash; | ||
| TrashPolicy trashPolicy = fs.getTrashPolicy(conf); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for this (legacy) static call, pass in a path of "/" to the getTrashPolicy() call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated.
| */ | ||
| @InterfaceAudience.Public | ||
| @InterfaceStability.Unstable | ||
| public TrashPolicy getTrashPolicy(Configuration conf) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have it take a Path.
this'll allow viewfs to resolve through the mount points to the final value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, updated.
However, the reason I don't put the Path is because the FileSystem resolution logic is already done in Trash#moveToAppropriateTrash. In that case, we might not need to use a Path. This is the reason why I mentioned in the filesystem.md that
FileSystem implementations with multiple child file systems (e.g.
ViewFileSystem) should NOT implement this method since the Hadoop trash mechanism should resolve to the underlying filesystem before invokinggetTrashPolicy.
Please let me know what you think.
| @InterfaceStability.Unstable | ||
| public TrashPolicy getTrashPolicy(Configuration conf) { | ||
| Class<? extends TrashPolicy> trashClass = conf.getClass( | ||
| "fs.trash.classname", TrashPolicyDefault.class, TrashPolicy.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- make this a constant somewhere.
- log at debug the trash policy "default filesysetm trash policy loaded policy "
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated.
| trash.initialize(conf, fs); // initialize TrashPolicy | ||
| return trash; | ||
| TrashPolicy trashPolicy = fs.getTrashPolicy(conf); | ||
| trashPolicy.initialize(conf, fs); // initialize TrashPolicy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about requiring the initialize to be done in the fs instance, rather than here? That would support directly accessing the policy in newer code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I'm not entire sure what you meant, do you mean:
- Invoke
TrashPolicy#initializeinFileSystem#getTrashPolicy - Introduce a new API to initialize the
TrashPolicy - Initialize a
TrashPolicyinFileSystem#initialize
IIRC I tried to do (1), but I think it caused some regressions, and I reverted it.
| // Test plugged TrashPolicy | ||
| conf.setClass("fs.trash.classname", TestTrashPolicy.class, TrashPolicy.class); | ||
| Trash trash = new Trash(conf); | ||
| assertInstanceOf(TestTrashPolicy.class, trash.getTrashPolicy()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use assertJ asserts , as these we can backport to older branches without problems
Assertions.assertThat(trash.getTrashPolicy()).isInstanceOf(TestTrashPolicy.class)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
| if (checkpoints.size() == 4) { | ||
| // The actual contents should be smaller since the last checkpoint | ||
| // should've been deleted and Current might not have been recreated yet | ||
| assertTrue(checkpoints.size() > files.length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use assertj assert on size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
| ContractTestUtils.writeTextFile(fs, myFile, "myFileContent", false); | ||
|
|
||
| // Verify that we succeed in removing the file we re-created | ||
| assertTrue(trash.moveToTrash(myFile)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about factoring this out into an assertMovedToTrash(trash,Path) so the assertTrue and error message can be used everywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
| * </ol> | ||
| * </p> | ||
| */ | ||
| public abstract class AbstractContractTrashTest extends AbstractFSContractTestBase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's a risk here that recycled filesystems have the older trash policy. Have setup invoke a FileSystem.closeAll()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated.
| @AfterEach | ||
| @Override | ||
| public void teardown() throws Exception { | ||
| final FileSystem fs = getFileSystem(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a try/catch here so if there's a failure in the test and teardown, the teardown failure doesn't hide the test failure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated.
| * The path returned is a directory | ||
|
|
||
|
|
||
| ### `TrashPolicy getTrashPolicy(Configuration conf)` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to define what a trash policy is/does. Maybe add a new file trashpolicy.md
trash policies are part of the public API (Hive and iceberg use it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @steveloughran for the review. I have updated some of the reviews (mostly the low hanging ones). Regarding a separate document about trash and trash policy behavior, I'll work on it when I'm available.
Mehakmeet's has code you want to pull in, specifically the "EmptyTrashPolicy" which deletes files; my DeleteFilesTrashPolicy is similar.
Regarding this, I'll also incorporate it to this PR later. The idea is that S3A and ABFS getTrashPolicy will check fs.s3a.trash.classname and fs.abfs.trash.classname and both will default to EmptyTrashPolicy. Let me know if this approach is agreeable.
| */ | ||
| @InterfaceAudience.Public | ||
| @InterfaceStability.Unstable | ||
| public TrashPolicy getTrashPolicy(Configuration conf) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, updated.
However, the reason I don't put the Path is because the FileSystem resolution logic is already done in Trash#moveToAppropriateTrash. In that case, we might not need to use a Path. This is the reason why I mentioned in the filesystem.md that
FileSystem implementations with multiple child file systems (e.g.
ViewFileSystem) should NOT implement this method since the Hadoop trash mechanism should resolve to the underlying filesystem before invokinggetTrashPolicy.
Please let me know what you think.
| @InterfaceStability.Unstable | ||
| public TrashPolicy getTrashPolicy(Configuration conf) { | ||
| Class<? extends TrashPolicy> trashClass = conf.getClass( | ||
| "fs.trash.classname", TrashPolicyDefault.class, TrashPolicy.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated.
| TrashPolicy trash = ReflectionUtils.newInstance(trashClass, conf); | ||
| trash.initialize(conf, fs); // initialize TrashPolicy | ||
| return trash; | ||
| TrashPolicy trashPolicy = fs.getTrashPolicy(conf); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated.
| trash.initialize(conf, fs); // initialize TrashPolicy | ||
| return trash; | ||
| TrashPolicy trashPolicy = fs.getTrashPolicy(conf); | ||
| trashPolicy.initialize(conf, fs); // initialize TrashPolicy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I'm not entire sure what you meant, do you mean:
- Invoke
TrashPolicy#initializeinFileSystem#getTrashPolicy - Introduce a new API to initialize the
TrashPolicy - Initialize a
TrashPolicyinFileSystem#initialize
IIRC I tried to do (1), but I think it caused some regressions, and I reverted it.
| */ | ||
| package org.apache.hadoop.fs.contract; | ||
|
|
||
| import org.apache.hadoop.conf.Configuration; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks. It'll be nice if this can be enforced by the CI.
| fileIndex++; | ||
|
|
||
| // Move the files to trash | ||
| assertTrue(trashPolicy.moveToTrash(myFile)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
| if (checkpoints.size() == 4) { | ||
| // The actual contents should be smaller since the last checkpoint | ||
| // should've been deleted and Current might not have been recreated yet | ||
| assertTrue(checkpoints.size() > files.length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
| ContractTestUtils.writeTextFile(fs, myFile, "myFileContent", false); | ||
|
|
||
| // Verify that we succeed in removing the file we re-created | ||
| assertTrue(trash.moveToTrash(myFile)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
| // Test plugged TrashPolicy | ||
| conf.setClass("fs.trash.classname", TestTrashPolicy.class, TrashPolicy.class); | ||
| Trash trash = new Trash(conf); | ||
| assertInstanceOf(TestTrashPolicy.class, trash.getTrashPolicy()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
| * following order should not result in unexpected results such as files in trash that | ||
| * will never be deleted by trash mechanism. | ||
| * <ol> | ||
| * <li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
Description of PR
JIRA: https://issues.apache.org/jira/browse/HADOOP-19217
Hadoop FileSystem supports multiple FileSystem implementations awareness (e.g. client is aware of both hdfs:// and ofs:// protocols).
However, it seems that currently Hadoop TrashPolicy remains the same regardless of the URI scheme. The TrashPolicy is governed by "fs.trash.classname" configuration and stays the same regardless of the FileSystem implementation. For example, HDFS defaults to TrashPolicyDefault and Ozone defaults to TrashPolicyOzone, but only one will be picked since the the configuration will be overwritten by the other.
Therefore, I propose to tie the TrashPolicy implementation to each FileSystem implementation by introducing a new FileSystem#getTrashPolicy interface. TrashPolicy#getInstance can call FileSystem#getTrashPolicy to get the appropriate TrashPolicy.
How was this patch tested?
Unit and contract tests (HDFS and LocalFS).
Disclosure: FileSystem.md part was initially generated by AI, but majority of it was updated. The other implementations are hand-coded.