@lchqlchq
Contributor

@lchqlchq lchqlchq commented Aug 10, 2023

Socket::close may only mark the object as closed and leave the
underlying socket fd open if the socket is registered with an NIO selector.
So we have to call Selector::selectNow in cleanup to close the
underlying socket fd.

This can happen if the network service is shut down while a socket is connecting,
for example via "ifdown eth0" or "service network stop".

See:

@lchqlchq lchqlchq changed the title from "fix fd leak" to "ZOOKEEPER-4736 fix fd leak" Aug 15, 2023
@lchqlchq lchqlchq changed the title from "ZOOKEEPER-4736 fix fd leak" to "ZOOKEEPER-4736: fix fd leak" Aug 15, 2023
@lchqlchq
Contributor Author

lchqlchq commented Aug 15, 2023

@eolivelli, @kalmar, @hanm, @jowiho Could you please take a minute to look at this?

Member

@kezhuw kezhuw left a comment

+1 in general and I have verified this with local tests.

I think it is better to write a test for this. You can use ClientCnxnSocketFragilityTest as an example.

void registerAndConnect(SocketChannel sock, InetSocketAddress addr) throws IOException {
sockKey = sock.register(selector, SelectionKey.OP_CONNECT);
boolean immediateConnect = sock.connect(addr);
sockKey = sock.register(selector, SelectionKey.OP_CONNECT);
Member

But because the ClientCnxnSocketNIO::registerAndConnect method registers the socket with the selector first and only then performs the sock.connect operation, the fd of sock can't be closed.

https://github.com/openjdk/jdk/blob/jdk8-b120/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java#L841

            // If this channel is not registered then it's safe to close the fd
            // immediately since we know at this point that no thread is
            // blocked in an I/O operation upon the channel and, since the
            // channel is marked closed, no thread will start another such
            // operation.  If this channel is registered then we don't close
            // the fd since it might be in use by a selector.  In that case
            // closing this channel caused its keys to be cancelled, so the
            // last selector to deregister a key for this channel will invoke
            // kill() to close the fd.
            //
            if (!isRegistered())
                kill();

sock.close will not close the fd because the channel is still registered. It is sad that:

Member

Aside from the above, I think we should skip the register in the case of immediateConnect; otherwise we may prime the connection twice. But that is probably another story and needs verification.

Contributor Author

Thanks for the review. I agree with your view on the cause of the fd leak. If we skip the register, we cannot get the sockKey that primeConnection() relies on in the follow-up actions. Also, this only registers a SelectionKey.OP_CONNECT event; it will not affect the primeConnection() implementation, which alters the interestOps via "clientCnxnSocket.enableReadWriteOnly()".

@ZihuanLing

Why not just invoke selector.select method after cleanup? After selector.select, the canceled sockKey will be removed from selector.keys, and fd will be released.
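
The claim above can be checked in isolation with plain `java.nio`. The following is a minimal sketch, not ZooKeeper code: the class name `CancelledKeyDemo` and the method `observe` are made up for illustration, and the channel is never actually connected. It only shows the documented selector behavior that a key cancelled by closing its channel stays in `selector.keys()` until the next selection operation. It does not measure the fd itself.

```java
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Hypothetical demo class, not part of ZooKeeper.
class CancelledKeyDemo {

    /** Returns {key present right after close, key present after selectNow}. */
    static boolean[] observe() throws IOException {
        try (Selector selector = Selector.open()) {
            SocketChannel sock = SocketChannel.open();
            sock.configureBlocking(false);
            SelectionKey key = sock.register(selector, SelectionKey.OP_CONNECT);

            // Closing a registered channel cancels its key, but the key is
            // only queued for removal at this point.
            sock.close();
            boolean presentAfterClose = selector.keys().contains(key);

            // A selection operation processes the cancelled keys and
            // deregisters the channel from the selector.
            selector.selectNow();
            boolean presentAfterSelectNow = selector.keys().contains(key);

            return new boolean[] {presentAfterClose, presentAfterSelectNow};
        }
    }

    public static void main(String[] args) throws IOException {
        boolean[] r = observe();
        System.out.println("in key set after close:     " + r[0]);
        System.out.println("in key set after selectNow: " + r[1]);
    }
}
```

Per the `Selector` documentation, the first value should be `true` and the second `false`, which is consistent with the fd only being released once the selector has deregistered the key.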

@kezhuw
Member

kezhuw commented Jun 23, 2025

Why not just invoke selector.select method after cleanup? After selector.select, the canceled sockKey will be removed from selector.keys, and fd will be released.

This sounds sensible and is a one-for-all solution. Currently, cleanup cannot close the socket until the next selector.select. By doing an additional select, the socket can be closed instantly.

@lchqlchq What do you think about this? I think we could add selector.selectNow right after sockKey.cancel in `cleanup`.
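
A rough sketch of what that suggestion could look like is below. This is an assumption-laden illustration, not the actual `ClientCnxnSocketNIO` code: the class `CleanupSketch` and its `register` helper are hypothetical, and the real `cleanup` does more work than shown here. The only point is the ordering: cancel the key, run `selectNow` so the selector deregisters the channel, then close the channel.

```java
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Hypothetical stand-in for a client socket wrapper; not ZooKeeper code.
class CleanupSketch {
    final Selector selector;
    SelectionKey sockKey;

    CleanupSketch() throws IOException {
        selector = Selector.open();
    }

    /** Hypothetical helper: registers an unconnected, non-blocking channel. */
    void register() throws IOException {
        SocketChannel sock = SocketChannel.open();
        sock.configureBlocking(false);
        sockKey = sock.register(selector, SelectionKey.OP_CONNECT);
    }

    void cleanup() {
        if (sockKey != null) {
            SocketChannel sock = (SocketChannel) sockKey.channel();
            sockKey.cancel();
            try {
                // Process the cancelled key now instead of waiting for the
                // next regular select, so the channel is deregistered.
                selector.selectNow();
            } catch (IOException e) {
                // Best effort in this sketch: continue and close the channel.
            }
            try {
                // The channel is no longer registered, so close() is free to
                // release the underlying fd immediately.
                sock.close();
            } catch (IOException e) {
                // Ignored in this sketch.
            }
        }
        sockKey = null;
    }
}
```

After `cleanup()` returns, the selector's key set no longer holds the key and the channel reports itself as closed and unregistered, so repeated reconnect attempts should not accumulate stale registrations.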

…addr) method throws "SocketException: Network is unreachable". Because the socket has already been registered with the selector, the cleanup() method can't close the fd, and SendThread keeps calling startConnect() to keep the zk client alive, leading to an fd leak.
@kezhuw kezhuw changed the title from "ZOOKEEPER-4736: fix fd leak" to "ZOOKEEPER-4736: Fix nio socket fd leak if network service is down" Aug 17, 2025
@kezhuw
Member

kezhuw commented Aug 17, 2025

I have taken this over to move it forward. It has been open for two years.

Could you please take a look ? @anmolnar @eolivelli @tisonkun

@kezhuw kezhuw closed this Aug 17, 2025
@kezhuw kezhuw reopened this Aug 17, 2025
`Socket::close` may only mark the object as closed and leave the
underlying socket fd open if the socket is registered with an NIO selector.
So we have to call `Selector::selectNow` in `cleanup` to close the
underlying socket fd.

This can happen if the network service is shut down while a socket is connecting,
for example via "ifdown eth0" or "service network stop".

See:
* https://github.com/openjdk/jdk/blob/jdk8-b120/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java#L841
Contributor

@anmolnar anmolnar left a comment

lgtm.

@anmolnar anmolnar merged commit e8e141b into apache:master Oct 30, 2025
16 checks passed
asf-gitbox-commits pushed a commit that referenced this pull request Oct 30, 2025
Reviewers: kezhuw, anmolnar
Author: lchqlchq
Closes #2047 from lchqlchq/fd

(cherry picked from commit e8e141b)
Signed-off-by: Andor Molnar <andor@cloudera.com>
@anmolnar
Contributor

Merged to master and branch-3.9 branches. Thanks @lchqlchq @kezhuw !
