When multiple stages are working on the same document and one of them discards it, the other stages will still attempt to persist it and fail, because the node can no longer find the document (it answers HTTP 404, "No document found matching your query").
Ideally, stages would know if another stage has discarded the working document, and be able to act on that (perhaps by simply ignoring the document). Documents would need to remain in the documents collection for that to work, I think, and no new stages should be able to fetch the document.
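A rough sketch of that idea, in case it helps the discussion. This is not Hydra's API: TombstoneStore, SaveResult and every method name here are hypothetical, and an in-memory map stands in for the documents collection. The assumption is that a discard only flags the document instead of removing it, so that a late save gets a distinct "discarded" answer rather than a 404, while fetches skip flagged documents.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for the documents collection (not Hydra's API). */
class TombstoneStore {
    /** What a stage is told when it tries to persist a document. */
    enum SaveResult { SAVED, DISCARDED, NOT_FOUND }

    /** Immutable entry: the content plus a flag saying whether it was discarded. */
    private static final class Entry {
        final String content;
        final boolean discarded;

        Entry(String content, boolean discarded) {
            this.content = content;
            this.discarded = discarded;
        }
    }

    private final Map<String, Entry> docs = new ConcurrentHashMap<>();

    void insert(String id, String content) {
        docs.put(id, new Entry(content, false));
    }

    /** Flags the document as discarded but keeps it, so late saves can be answered. */
    void discard(String id) {
        docs.computeIfPresent(id, (key, entry) -> new Entry(entry.content, true));
    }

    /** Discarded documents are invisible to fetches, so no new stage picks them up. */
    Optional<String> fetch(String id) {
        Entry entry = docs.get(id);
        if (entry == null || entry.discarded) {
            return Optional.empty();
        }
        return Optional.of(entry.content);
    }

    /** Saves unless the document was discarded or never existed. */
    SaveResult save(String id, String content) {
        SaveResult[] result = { SaveResult.NOT_FOUND };
        docs.computeIfPresent(id, (key, entry) -> {
            if (entry.discarded) {
                result[0] = SaveResult.DISCARDED;
                return entry; // leave the tombstone untouched
            }
            result[0] = SaveResult.SAVED;
            return new Entry(content, false);
        });
        return result[0];
    }
}
```

With something like this, a stage that gets DISCARDED back from save could drop its work quietly (or log at debug level) instead of treating it as an error and then failing again while trying to persist that error.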
The current behaviour yields logs filled with:
2013-07-05 13:48:53 : INFO : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): DEBUG Saving document to RemotePipeline..
2013-07-05 13:48:53 : INFO : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): ERROR Node gave an unexpected response: HTTP/1.1 404 Not Found
2013-07-05 13:48:53 : INFO : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): ERROR Message: No document found matching your query
2013-07-05 13:48:53 : INFO : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): ERROR Node gave an unexpected response: HTTP/1.1 404 Not Found
2013-07-05 13:48:53 : INFO : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): ERROR Message: No document found matching your query
2013-07-05 13:48:53 : INFO : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): $STACKTRACE$
2013-07-05 13:48:53 : INFO : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): ERROR Unable to persist an error to the database
2013-07-05 13:48:53 : INFO : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): java.io.IOException: Unable to save changes to core
2013-07-05 13:48:53 : INFO : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): at com.findwise.hydra.stage.AbstractProcessStage.run(AbstractProcessStage.java:114)
2013-07-05 13:48:53 : INFO : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin):
2013-07-05 13:48:53 : INFO : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): $STACKTRACE$