-
Notifications
You must be signed in to change notification settings - Fork 331
SAMZA-1768: Handle corrupted OFFSET file #588
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@prateekm : could you please take a look? |
| try { | ||
| offset = FileUtil.readWithChecksum(offsetFileRef); | ||
| } catch (Exception e) { | ||
| LOG.warn("Fail to read offset file of " + storePath, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: For consistency: LOG.warn("Failed to read offset file in storage partition directory: {}", offsetFileName, storepath, e)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
| } | ||
|
|
||
| //atomic swap of tmp and real offset file | ||
| Files.move(tmpFile.toPath, file.toPath, StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need REPLACE_EXISTING if ATOMIC_MOVE is specified? From the javadoc:
ATOMIC_MOVE: The move is performed as an atomic file system operation and all other options are ignored. If the target file exists then it is implementation specific if the existing file is replaced or this method fails by throwing an IOException.
Also, this should be in the try block since this can throw exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The caller already handles the exception by logging the error and throw a runtime exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but would be good to move to try so that oos and fos are closed in finally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, sure, let me do that.
| (configuredMetrics ++ rocksDbMetrics) | ||
| .foreach(property => metrics.newGauge(property, () => rocksDb.getProperty(property))) | ||
| .foreach(property => metrics.newGauge(property, () => | ||
| if (rocksDb.isOwningHandle) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Maybe add a comment explaining that isOwningHandle is false if the db is closed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Added the comment.
prateekm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
| oos.writeUTF(data) | ||
|
|
||
| //atomic swap of tmp and real offset file | ||
| Files.move(tmpFile.toPath, file.toPath, StandardCopyOption.ATOMIC_MOVE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
I think we need to add
StandardCopyOption.REPLACE_EXISTINGto the existing options toFiles.movemethod.Files.movethrows upFileAlreadyExistsExceptionif target file already exists and we don't provide that Replace_existing option as an additional option. -
Atomic move is a file system level primtiive operation and not all file system support it by default. File.move throws up
AtomicMoveNotSupportedExceptionif underlying file system doesn't support it. It would be better to handle it as well.
Please change if it makes sense(Above details are taken from Files.move javadoc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need both REPLACE_EXISTING and ATOMIC_MOVE, but we might need to catch IOException as well where we catch AtomicMoveNotSupportedException. From the javadoc for ATOMIC_MOVE:
If the target file exists then it is implementation specific if the existing file is replaced or this method fails by throwing an IOException.
| LOG.info("Found offset file in storage partition directory: {}", storePath); | ||
| offset = FileUtil.readWithChecksum(offsetFileRef); | ||
| try { | ||
| offset = FileUtil.readWithChecksum(offsetFileRef); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to add a unit test for these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the test for the case of offset already exist.
|
@prateekm : I move the Files.move() out of the try{} block again since we should close the streams before we do the file movement. This is to enforce the write to the offset file before move. The IOExceptions are handled from the callers so we don't need to catch them in this function. The current handling of failing the container looks ok to me since it should happen very rarely, as we haven't seen it happening in production. |
|
@xinyuiscool Makes sense to move this to its own try-catch block. It might still be better to handle IOException explicitly and revert to a normal move + replace existing. The IOException can happen if the implementation doesn't support replacing existing files as a part of the move. My concern is that this might fail on a particular platform consistently and we'll always end up rebootstrapping. Alternatively, if you've tested that it works on OSX and Linux, that'll be good enough for now. Otherwise LGTM, feel free to merge this. |
|
Good point on testing on both OSX and linux. I only tested on linux. Let me try OSX. Thanks. |
|
|
||
| (configuredMetrics ++ rocksDbMetrics) | ||
| .foreach(property => metrics.newGauge(property, () => rocksDb.getProperty(property))) | ||
| .foreach(property => metrics.newGauge(property, () => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this change do?
This patch addresses the following tickets:
SAMZA-1778: SIGSEGV when reading properties (metrics) on a closed RocksDB store
SAMZA-1777: Logged store OFFSET file write during flush should be atomic
SAMZA-1768: Handle corrupted OFFSET file elegantly