
Conversation

@jordepic
Contributor

@jordepic jordepic commented Nov 4, 2025

As of now, HadoopFileIO uses the Java delete
API, which always bypasses any configured trash
directory. If the table's Hadoop configuration
has trash enabled, we should use it.
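
As a rough illustration of the proposal, here is a minimal sketch of trash-aware deletion (the class and method names are made up for the example; this is not the actual HadoopFileIO code):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.Trash;

    // Illustrative only: prefer the configured Hadoop trash when it is enabled,
    // otherwise fall back to a plain FileSystem.delete().
    class TrashAwareDelete {
      static void deletePath(FileSystem fs, Path toDelete, boolean recursive, Configuration conf)
          throws IOException {
        Trash trash = new Trash(fs, conf);
        // moveToTrash() returns true only when the path was actually relocated;
        // when trash is disabled we delete directly, as HadoopFileIO does today.
        if (trash.isEnabled() && trash.moveToTrash(toDelete)) {
          return;
        }
        fs.delete(toDelete, recursive);
      }
    }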

@github-actions github-actions bot added the core label Nov 4, 2025
@jordepic jordepic force-pushed the HADOOP_FILE_IO_CHANGE branch from b0ba8b9 to 8d07d49 Compare November 4, 2025 16:50
@jordepic jordepic force-pushed the HADOOP_FILE_IO_CHANGE branch from 8d07d49 to 5cb16cf Compare November 5, 2025 16:14
Contributor

@anuragmantri anuragmantri left a comment


Thanks for the PR @jordepic. This is very useful.

However, it seems like a behavior change (even if trash was enabled previously, Iceberg was not honoring it). IMO, we should make this configurable using a property to avoid surprises (unexpected storage consumption).

}

private void deletePath(FileSystem fs, Path toDelete, boolean recursive) throws IOException {
Trash trash = new Trash(fs, getConf());
Contributor


I'm concerned about the number of Trash objects we create. Does the Hadoop API ensure we can reuse the Trash object for a given (fs, conf)?
I couldn't tell from https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/Trash.html#Trash-org.apache.hadoop.fs.FileSystem-org.apache.hadoop.conf.Configuration-

Contributor Author


It's a good call. I've added a new toggle that can be put in the Hadoop configuration to determine whether we want to use the trash for Iceberg, following Russell Spitzer's example in other HadoopFileIO changes.

I've taken a look at object reuse. The Trash can change due to lots of configuration settings (meaning I'd have to create a cache based on 5+ configuration values which are susceptible to change in the future), unlike the FileSystem, whose cache key doesn't actually rely on the conf, just on the URI and user group information. With that being said, the change I made to check the Hadoop configuration first means we don't create the Trash object unless someone specifically opts in. I hope that this is good enough for now - an Iceberg user will now have to opt into this change to experience any possible object churn.
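
A minimal sketch of the opt-in check described here, assuming a hypothetical property name "iceberg.hadoop.use-trash" (the property used in the actual change may differ, and later in this thread the toggle is dropped in favor of relying only on the standard Hadoop trash settings):

    import org.apache.hadoop.conf.Configuration;

    // Hypothetical toggle: only when the user explicitly opts in do we go on to
    // construct a Trash instance, so the default path creates no extra objects.
    class TrashOptIn {
      static boolean useTrash(Configuration conf) {
        return conf.getBoolean("iceberg.hadoop.use-trash", false);
      }
    }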

@jordepic jordepic force-pushed the HADOOP_FILE_IO_CHANGE branch from 5cb16cf to efc6a8f Compare November 6, 2025 15:25
@danielcweeks danielcweeks self-requested a review November 10, 2025 19:24
return;
}
Trash trash = new Trash(fs, getConf());
if (!trash.isEnabled()) {
Contributor


If you can configure the trash at the Hadoop Conf level, why are we adding a separate configuration? Shouldn't we just do this when the HadoopFileIO is initialized and rely on the Hadoop trash conf? It feels like we're adding two separate configs to enable trash.

Contributor Author


Fair enough. A previous commenter here wanted to do this in order to make the trash "opt-in", not "opt-out", for those who had already configured it.

I agree though, if trash is configured the normal way, we should use it.

Contributor


Sorry, my understanding was that the current Java API for delete always "ignores" the trash config even if it is set server-side, in which case no longer ignoring it could be unexpected. But I guess that is the right thing to do (honor the trash config if it is set via the Hadoop conf). I'm OK with just using the Hadoop conf. Thanks!

Contributor


I'm not particularly familiar with the way the Hadoop Trash works, but it's quite odd to me. The Trash isn't natively part of the FileSystem (like versioned objects are native to cloud storage). This leaves an awkward situation where, if we respect the core-site.xml config, then we turn it on everywhere like you said @anuragmantri (assuming the deleter has access to the trash?).

I could see this going both ways in that if you configure the core-site.xml and didn't realize there was a second property to set, you would lose data. You also potentially have the issue where deleted data goes to the trash configured for whoever deleted it (so two separate deletes could end up in completely different locations).

I'm not sure what's right here and open to suggestions.

Contributor Author


I could see this going both ways in that if you configure the core-site.xml and didn't realize there was a second property to set, you would lose data.

To be fair, this is the status quo

You also potentially have the issue where deleted data goes to the trash configured for whoever deleted it (so two separate deletes could end up in completely different locations).

I suppose that would be the case for HDFS trash in general - even if writing parquet files without iceberg, each writer might potentially use a different trash based on configuration.

I think these are just limitations of the existing system, but let me know if you disagree.

@danielcweeks
Contributor

@jordepic I'm a little concerned about the utility of this. If we're just relocating arbitrary files into the trash location, how do you know which table it was associated with? In isolation it seems like it would make sense, but across a warehouse, this feels like it would be really difficult to reconstruct anything.

@jordepic jordepic force-pushed the HADOOP_FILE_IO_CHANGE branch from efc6a8f to ba28083 Compare November 12, 2025 21:03
@jordepic
Contributor Author

@jordepic I'm a little concerned about the utility of this. If we're just relocating arbitrary files into the trash location, how do you know which table it was associated with? In isolation it seems like it would make sense, but across a warehouse, this feels like it would be really difficult to reconstruct anything.

@danielcweeks a file at path /iceberg/tablename/data/.... is relocated to /.Trash/current/iceberg/tablename/data/...

It doesn't go to a completely arbitrary path!

@jordepic jordepic force-pushed the HADOOP_FILE_IO_CHANGE branch from ba28083 to d0d62e8 Compare November 12, 2025 21:23
@anuragmantri
Contributor

anuragmantri commented Nov 14, 2025

In isolation it seems like it would make sense, but across a warehouse, this feels like it would be really difficult to reconstruct anything.

I agree that this may not be very useful in isolation, but can we still let the client use the trash in HadoopFileIO (if configured) so that users have the ability to restore the table state? We have had examples where users were able to restore accidentally deleted files via cloud providers' object lifecycle policies but could not do so in Hadoop environments because the client was not using the trash.

@ludlows
Contributor

ludlows commented Nov 15, 2025

It would be nice to provide a table-level parameter to control this behavior.

@jordepic
Contributor Author

It would be nice to provide a table-level parameter to control this behavior.

@ludlows That was basically the first iteration of my change. I think that @danielcweeks felt this level of control was unnecessary, and that those who configure their Hadoop to use the trash should use it.

Open to more discussion here.

@danielcweeks
Contributor

@jordepic and @ludlows After looking a little more into the way trash works, I don't think this is something we want to turn on at a table level (especially considering how this implementation works).

The Trash feature in Hadoop/HDFS is quite strange, as it's a client, config, and cluster-level feature where all the pieces depend on each other. For example, the client has to respect the config, initialize the Trash, and perform a move operation, otherwise it's ignored. The config has to be set and configured properly to a location the user has access to. Finally, if you don't apply the configuration to both the client and the NameNode, then cleanup won't be performed properly.

Given all of that, this feels very much like an administrator-level feature that needs to be configured (this appears to be the case for Cloudera already, though I don't know if engines like Hive/Impala respect the trash settings).

It could be potentially dangerous to allow users to configure this on a per-table basis because cleanup may not be configured, which may result in data that should be deleted persisting in the file system. There's also nothing that appears to prevent the configuration from being applied to other file-system implementations (like S3A), which would be bad (data copy, no cleanup), but I feel like we should discourage that. @jordepic Is there anything we can do to prevent this?

I'm not a huge fan of this approach, but it seems like what we have to work with.

@jordepic
Contributor Author

jordepic commented Nov 18, 2025

It could be potentially dangerous to allow users to configure this on a per-table basis because cleanup may not be configured, which may result in data that should be deleted persisting in the file system.

When you call trash.isEnabled(), it checks TrashPolicy.isEnabled(), and in TrashPolicyDefault, isEnabled() ensures that the deletion interval is > 0. So I think this may be a non-issue. If people override their trash class with something else, it could be an issue.

For reference:
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Trash.java#L62

https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/TrashPolicy.java#L142

https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/TrashPolicyDefault.java#L126

https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Trash.java#L130
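
In other words, a rough sketch of what the default policy's isEnabled() boils down to (not the exact Hadoop source): trash is considered on only when fs.trash.interval resolves to a positive number of minutes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.CommonConfigurationKeysPublic;

    // Approximation of the TrashPolicyDefault check referenced above.
    class TrashEnabledCheck {
      static boolean trashEffectivelyEnabled(Configuration conf) {
        float intervalMinutes =
            conf.getFloat(
                CommonConfigurationKeysPublic.FS_TRASH_INTERVAL_KEY, // "fs.trash.interval"
                CommonConfigurationKeysPublic.FS_TRASH_INTERVAL_DEFAULT); // defaults to 0
        return intervalMinutes > 0;
      }
    }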

Hive also seems to employ the HDFS trash:
https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java#L80

There's also nothing that appears to prevent the configuration from being applied to other file-system implementations (like S3A), which would be bad (data copy, no cleanup), but I feel like we should discourage that. @jordepic Is there anything we can do to prevent this?

I'm less sure what you mean on this one. We aren't making this change in the s3 file IO, but I'm less familiar with the differences between that and s3a.

@danielcweeks
Contributor

danielcweeks commented Nov 18, 2025

When you call trash.isEnabled(), it checks TrashPolicy.isEnabled(), and in TrashPolicyDefault, isEnabled() ensures that the deletion interval is > 0. So I think this may be a non-issue. If people override their trash class with something else, it could be an issue.

The issue is that the config can be different for the client than for the NameNode. So if a client configures interval > 0, but the NameNode does not have that config, then a client will move data files, but they will never be cleaned up.

I'm less sure what you mean on this one. We aren't making this change in the s3 file IO, but I'm less familiar with the differences between that and s3a.

HadoopFileIO is an abstraction for all Hadoop FileSystem implementations (DistributedFileSystem, S3AFileSystem, GCSFileSystem, etc.). That means that if I enable this in core-site.xml and use an s3 mapped scheme, I would trigger the move behavior, which I don't think we want for non-HDFS file systems. The config (fs.trash.interval) is not specific to a scheme, so it appears to be global for all file system implementations.

@jordepic
Contributor Author

The issue is that the config can be different for the client than for the NameNode. So if a client configures interval > 0, but the NameNode does not have that config, then a client will move data files, but they will never be cleaned up.

Good point. Though, at the end of the day, I'm not sure that I see this differently from any other misconfiguration that an Iceberg user might have that would adversely impact them. For example, we misconfigured a table location and then removed an entire Hadoop directory thinking its contents were orphan files, haha!

HadoopFileIO is an abstraction for all Hadoop FileSystem implementations (DistributedFileSystem, S3AFileSystem, GCSFileSystem, etc.). That means that if I enable this in core-site.xml and use an s3 mapped scheme, I would trigger the move behavior, which I don't think we want for non-HDFS file systems. The config (fs.trash.interval) is not specific to a scheme, so it appears to be global for all file system implementations.

Also a fair point. I think that I could resolve this one pretty safely using some instanceof checks on the FileSystem object. Are you at all opposed to that?
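
A minimal sketch of the instanceof guard being proposed (illustrative class and method names, not the merged Iceberg code):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocalFileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    // Only consider the trash for file systems where a rename into a trash
    // directory is cheap and sensible; object-store schemes (s3a, abfs, gs)
    // keep the plain delete path.
    class TrashFileSystemGuard {
      static boolean supportsTrash(FileSystem fs) {
        return fs instanceof LocalFileSystem || fs instanceof DistributedFileSystem;
      }
    }

As raised later in the thread, a direct reference to DistributedFileSystem requires the HDFS client jars on the classpath, which is part of the motivation for the scheme-list alternative discussed below.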

@jordepic jordepic force-pushed the HADOOP_FILE_IO_CHANGE branch from d0d62e8 to e490367 Compare November 18, 2025 22:26
As of now, the HadoopFileIO uses the Java delete
API, which always skips using a configured trash
directory. If the table's hadoop configuration
has trash enabled, we should use it.

We should also only do this for implementations
where trashing files is acceptable. In our case,
this is the LocalFileSystem and the
DistributedFileSystem.
@jordepic jordepic force-pushed the HADOOP_FILE_IO_CHANGE branch from e490367 to 416c041 Compare November 18, 2025 23:01
Contributor

@danielcweeks danielcweeks left a comment


@jordepic I think this looks good. When we get closer to the 1.11 release, please reach out to the release manager to highlight this in the release notes, as it could have an impact on people running HDFS.

Thanks!

@danielcweeks danielcweeks merged commit 06c1e0a into apache:main Nov 20, 2025
44 checks passed

private void deletePath(FileSystem fs, Path toDelete, boolean recursive) throws IOException {
Trash trash = new Trash(fs, getConf());
if ((fs instanceof LocalFileSystem || fs instanceof DistributedFileSystem)
Member


How about ViewFileSystem?

@steveloughran
Contributor

@jordepic @danielcweeks joining in very late here.

That trash API really exists to stop users doing destructive things on the command line, so hadoop fs -rm -rf / doesn't wipe everything (on s3a:// we don't let you delete root as an alternative).
Database operations tend not to go through trash, on the basis that databases can do their own thing and/or you need disaster recovery mechanisms at this point.

I do see Hive has it as a safety check; presumably someone did a DROP TABLE and changed their mind. I suspect it is not used on every file deletion though, more for whole-table operations, because one aspect of trash is that it likes to be atomic: moving a whole table in there gives you that.

S3AFS doesn't like trash, as the PoV there is "S3 versioning may not be atomic but it's a lot faster than renaming".

We've discussed having a plugin policy here where each fs could have its own: HADOOP-18013 (ABFS: add cloud trash policy with per-schema policy selection), superseded by something with active development, apache/hadoop#8063.

I'll see about getting that in.

Regarding this patch:

  • it's going to cause problems in HDInsight, as Microsoft doesn't put the HDFS jars on the classpath, and this has an explicit reference to the classes.
  • it doesn't let people turn on trash on Azure storage or elsewhere if they want it.

What about just a configuration option "iceberg.hadoop.trash.schemas" to take a list of filesystem schemas "hdfs, viewfs, file, abfs" for which trash is enabled?
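
A sketch of how such an option could be consulted, assuming the property name proposed above (the helper itself is illustrative, not shipped Iceberg code):

    import java.util.HashSet;
    import java.util.Locale;
    import java.util.Set;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    // Enable trash only for file system schemes listed in the (proposed)
    // "iceberg.hadoop.trash.schemas" property, so no HDFS classes are referenced.
    class TrashSchemeFilter {
      static boolean trashEnabledForScheme(FileSystem fs, Configuration conf) {
        Set<String> schemes = new HashSet<>();
        for (String s : conf.getTrimmedStrings("iceberg.hadoop.trash.schemas")) {
          schemes.add(s.toLowerCase(Locale.ROOT));
        }
        // Compare against the scheme of the concrete file system (hdfs, viewfs, file, abfs, ...).
        String scheme = fs.getUri().getScheme();
        return scheme != null && schemes.contains(scheme.toLowerCase(Locale.ROOT));
      }
    }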

@manuzhang
Member

What about just a configuration option "iceberg.hadoop.trash.schemas" to take a list of filesystem schemas "hdfs, viewfs, file, abfs" for which trash is enabled?

I like this idea as viewfs was not handled in this PR. @steveloughran Do you plan to open a follow-up PR?

@steveloughran
Contributor

@manuzhang Yes, I also want to get the bulk delete API calls in for cloud delete performance; the changes here are complicating that. I can do this change first. Then, when Ozone adds its own trash policy, it'll be easy to support.

@jordepic
Contributor Author

Hi @steveloughran! I'm sorry for the very late response on my end here. I'm happy to review or take care of the follow-up change - let me know what you prefer.

@steveloughran
Contributor

I'll have a go at the change

thomaschow pushed a commit to thomaschow/iceberg that referenced this pull request Jan 19, 2026
As of now, the HadoopFileIO uses the Java delete
API, which always skips using a configured trash
directory. If the table's hadoop configuration
has trash enabled, we should use it.

We should also only do this for implementations
where trashing files is acceptable. In our case,
this is the LocalFileSystem and the
DistributedFileSystem.

Co-authored-by: Jordan Epstein <jordan.epstein@imc.com>
@steveloughran
Contributor

Started the new PR. Also discovered a regression here: the moveToTrash() code raises an FNFE if there's no file/dir at the end of the path.

Caused by: java.io.FileNotFoundException: File file:/var/folders/4n/w4cjr_d95kg9bxkl6sz3n3ym0000gr/T/junit-2971716127417174714/missing does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:980)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1301)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:970)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
	at org.apache.hadoop.fs.TrashPolicyDefault.moveToTrash(TrashPolicyDefault.java:139)
	at org.apache.hadoop.fs.Trash.moveToTrash(Trash.java:140)
	at org.apache.iceberg.hadoop.HadoopFileIO.deletePath(HadoopFileIO.java:247)
	at org.apache.iceberg.hadoop.HadoopFileIO.deleteFile(HadoopFileIO.java:109)
	... 4 more

This means that deleting a file which has already been deleted fails, whereas FileSystem.delete() is just a no-op.

Going to catch the FNFE when moving a file.
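
A minimal sketch of that fix (illustrative names, not the exact code in the follow-up PR):

    import java.io.FileNotFoundException;
    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.Trash;

    // Swallow the FileNotFoundException from moveToTrash() so that deleting an
    // already-missing path stays a no-op, matching FileSystem.delete() behavior.
    class SafeTrashMove {
      static void moveToTrashIfPresent(FileSystem fs, Path path, Trash trash, boolean recursive)
          throws IOException {
        try {
          if (!trash.moveToTrash(path)) {
            fs.delete(path, recursive);
          }
        } catch (FileNotFoundException e) {
          // The path is already gone; treat the delete as successful.
        }
      }
    }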

@steveloughran
Contributor

The Trash feature in Hadoop/HDFS is quite strange, as it's a client, config, and cluster-level feature where all the pieces depend on each other. For example, the client has to respect the config, initialize the Trash, and perform a move operation, otherwise it's ignored. The config has to be set and configured properly to a location the user has access to.

For HDFS, the client actually asks the service what the policy is:

    public FsServerDefaults getServerDefaults() throws IOException {
        return this.dfs.getServerDefaults();
    }

ViewFS does this for the resolved FS of a path, so it will get it for HDFS there.

PR #15111 uses this info, so the entire trash settings should be picked up from the store. Even if a client config has trash off when working with the local fs, S3, ABFS, etc., when it interacts with HDFS it'll still get the settings from that cluster.
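
A sketch of what picking the setting up from the store can look like on the client side, using the public FileSystem/FsServerDefaults API (illustrative helper, not the PR's exact code):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FsServerDefaults;
    import org.apache.hadoop.fs.Path;

    // FsServerDefaults carries the cluster's fs.trash.interval, so an HDFS client
    // can honor the NameNode's setting even if its local conf differs.
    class ServerSideTrash {
      static boolean serverSideTrashEnabled(FileSystem fs, Path path) throws IOException {
        FsServerDefaults defaults = fs.getServerDefaults(path);
        return defaults != null && defaults.getTrashInterval() > 0;
      }
    }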
