Skip to content

reconnect to the graphite after transient disconnect#2952

Merged
fjy merged 2 commits intoapache:masterfrom
b-slim:fix_reconnect_graphite
May 12, 2016
Merged

reconnect to the graphite after transient disconnect#2952
fjy merged 2 commits intoapache:masterfrom
b-slim:fix_reconnect_graphite

Conversation

@b-slim
Copy link
Copy Markdown
Contributor

@b-slim b-slim commented May 10, 2016

This is a fix to issue #1978 (comment).
Try to reconnect when the connection is list due to transient problems.

@b-slim b-slim mentioned this pull request May 10, 2016
@l15k4
Copy link
Copy Markdown

l15k4 commented May 10, 2016

I'm 100% sure this won't help because Socket#isConnected remains true, I'm sure of that because it passes all these conditions :

    public OutputStream getOutputStream() throws IOException {
        if (isClosed())
            throw new SocketException("Socket is closed");
        if (!isConnected())
            throw new SocketException("Socket is not connected");
        if (isOutputShutdown())
            throw new SocketException("Socket output is shutdown");

It is a know problem of java Socket api, people work it around by checking :

PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
if (out.checkError())

Another solution would be reconnecting in the com.codahale.metrics.graphite.PickledGraphite#writeMetrics in the catch block in case of SocketException ...

@b-slim
Copy link
Copy Markdown
Contributor Author

b-slim commented May 11, 2016

@l15k4 thanks for your feed back, what you think about this one ?

@l15k4
Copy link
Copy Markdown

l15k4 commented May 11, 2016

@b-slim This LGTM 👍

@b-slim
Copy link
Copy Markdown
Contributor Author

b-slim commented May 12, 2016

@himanshug and @fjy can you please check this in ?

@nishantmonu51
Copy link
Copy Markdown
Member

👍

@fjy fjy added this to the 0.9.1 milestone May 12, 2016
@fjy fjy added the Bug label May 12, 2016
@fjy
Copy link
Copy Markdown
Contributor

fjy commented May 12, 2016

👍

@fjy fjy merged commit abf64a1 into apache:master May 12, 2016
@l15k4
Copy link
Copy Markdown

l15k4 commented Sep 22, 2016

I just upgraded to 0.9.1.1 and after stopping graphite and starting again, I can see 3 more of these in middleManager logs :

2016-09-22T00:29:03,515 ERROR [GraphiteEmitter-1] io.druid.emitter.graphite.GraphiteEmitter - Broken pipe
java.net.SocketException: Broken pipe
    at java.net.SocketOutputStream.socketWrite0(Native Method) ~[?:1.8.0_101]
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) ~[?:1.8.0_101]
    at java.net.SocketOutputStream.write(SocketOutputStream.java:141) ~[?:1.8.0_101]
    at com.codahale.metrics.graphite.PickledGraphite.writeMetrics(PickledGraphite.java:262) ~[metrics-graphite-3.1.2.jar:3.1.2]
    at com.codahale.metrics.graphite.PickledGraphite.flush(PickledGraphite.java:221) ~[metrics-graphite-3.1.2.jar:3.1.2]
    at io.druid.emitter.graphite.GraphiteEmitter$ConsumerRunnable.run(GraphiteEmitter.java:176) [graphite-emitter-0.9.1.1.jar:0.9.1.1]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_101]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_101]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_101]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_101]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_101]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_101]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_101]

So I'm afraid it still happens, I have to restart middleManager for emitter to connect again.

@itaiy
Copy link
Copy Markdown

itaiy commented Nov 8, 2016

Hey,
We have the same problem (using 0.9.2-rc1) :

2016-11-03T15:07:23,532 ERROR [GraphiteEmitter-0] io.druid.emitter.graphite.GraphiteEmitter - Broken pipe
java.net.SocketException: Broken pipe
    at java.net.SocketOutputStream.socketWrite0(Native Method) ~[?:1.7.0_111]
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113) ~[?:1.7.0_111]
    at java.net.SocketOutputStream.write(SocketOutputStream.java:147) ~[?:1.7.0_111]
    at com.codahale.metrics.graphite.PickledGraphite.writeMetrics(PickledGraphite.java:262) ~[metrics-graphite-3.1.2.jar:3.1.2]
    at com.codahale.metrics.graphite.PickledGraphite.send(PickledGraphite.java:215) ~[metrics-graphite-3.1.2.jar:3.1.2]
    at io.druid.emitter.graphite.GraphiteEmitter$ConsumerRunnable.run(GraphiteEmitter.java:165) [graphite-emitter-0.9.2-rc1.jar:0.9.2-rc1]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [?:1.7.0_111]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [?:1.7.0_111]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [?:1.7.0_111]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.7.0_111]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_111]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_111]
    at java.lang.Thread.run(Thread.java:745) [?:1.7.0_111]
2016-11-03T15:07:23,533 ERROR [GraphiteEmitter-0] io.druid.emitter.graphite.GraphiteEmitter - Already connected
java.lang.IllegalStateException: Already connected
    at com.codahale.metrics.graphite.PickledGraphite.connect(PickledGraphite.java:175) ~[metrics-graphite-3.1.2.jar:3.1.2]
    at io.druid.emitter.graphite.GraphiteEmitter$ConsumerRunnable.run(GraphiteEmitter.java:177) [graphite-emitter-0.9.2-rc1.jar:0.9.2-rc1]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [?:1.7.0_111]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [?:1.7.0_111]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [?:1.7.0_111]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.7.0_111]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_111]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_111]
    at java.lang.Thread.run(Thread.java:745) [?:1.7.0_111]

As you can see, GraphiteEmitter tries to reconnect to Graphite, but failes.
Looking at the code of GraphiteEmitter and PickledGraphite, it seems the socket is marked as connected, so it failes to reconnect.

We suggest calling the close() method before calling connect(), like so :

pickledGraphite.close();
log.info("trying to re-connect to graphite server");
pickledGraphite.connect();

And also calling it after calling flush() (in the end of run()), like so :

try {
          pickledGraphite.flush();
} catch ( IOException e) {
          log.error(e, e.getMessage());
          if (e instanceof SocketException) {
            log.warn("closing connection to graphite server");
            pickledGraphite.close();
         }
}

That way, we'll actually be able to reconnect (albeit loosing the current batch of metrics).
We've tried it in our environment, and it seems to work.

@b-slim @l15k4 - what do you think?

@b-slim
Copy link
Copy Markdown
Contributor Author

b-slim commented Nov 8, 2016

@itaiy can you check if #3669 fix the issue ?

@itaiy
Copy link
Copy Markdown

itaiy commented Nov 8, 2016

@b-slim - you're quick, thank you!
I've just edited my comment above, since we need something similar around flush(), can you add this to your new PR as well?
Thanks!!

@b-slim
Copy link
Copy Markdown
Contributor Author

b-slim commented Nov 8, 2016

@itaiy i don't think that's needed ? i will close it instead of flush i guess that should work.

@b-slim
Copy link
Copy Markdown
Contributor Author

b-slim commented Nov 8, 2016

@itaiy i have updated the fix please check it out

@b-slim b-slim deleted the fix_reconnect_graphite branch April 26, 2018 01:28
@l15k4
Copy link
Copy Markdown

l15k4 commented Sep 18, 2018

Guys I'm afraid that graphite emitter only works until graphite server is available. Once it disconnects from the server due to network partition or server restart, it will never reconnect back, so it stops emitting metrics. Druid would have to be restarted so that it would connect again ...

This happens with the plain text graphite protocol

@l15k4
Copy link
Copy Markdown

l15k4 commented Sep 18, 2018

Ok, this finally happened log.warn("Trying to re-connect to graphite server"); after 20 minutes ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants