Conversation
| { | ||
| if(writerCurrentlyInUse) | ||
| { | ||
| return delegateSmooshedWriter (name , size); |
| private final List<File> outFiles = Lists.newArrayList(); | ||
| private final Map<String, Metadata> internalFiles = Maps.newTreeMap(); | ||
| private List<File> files = Lists.newArrayList(); | ||
| private List<File> createFiles = Lists.newArrayList(); |
There was a problem hiding this comment.
Should this be past-tense?
There was a problem hiding this comment.
Actually, how about a rename to filesInProcess?
| throw new IOException( | ||
| String.format("Expected [%,d] bytes, only saw [%,d], potential corruption?", size, bytesWritten) | ||
| ); | ||
| ); |
There was a problem hiding this comment.
This whitespace appears to be off...
| ); | ||
| } | ||
|
|
||
| if(!writerCurrentlyInUse) |
There was a problem hiding this comment.
Given that you just set writerCurrentlyInUse to false and that this object is not attempting to be thread-safe, I don't think this can ever be false.
| return addToOffset(Ints.checkedCast(java.nio.file.Files.copy(in, tmpFile.toPath()))); | ||
| } | ||
|
|
||
| public int addToOffset(int numBytesWritten) |
There was a problem hiding this comment.
This method has a really weird contract.
- Why is it
public? - Why does it return its incoming argument without any adjustment?
The name of the method makes it seem like this would be returning the total offset instead of the size. The methods in the non-delegate writer all do this call first with the name "verify" which makes it clear that the intent of the method is to do some state checks...
There was a problem hiding this comment.
Agree, It shouldn't be. Made the code changes to make it more meaningful.
|
|
||
| private final List<File> outFiles = Lists.newArrayList(); | ||
| private final Map<String, Metadata> internalFiles = Maps.newTreeMap(); | ||
| private List<File> files = Lists.newArrayList(); |
There was a problem hiding this comment.
How about renaming this to completedFiles.
| { | ||
| open = false; | ||
| out.close(); | ||
| files.add(tmpFile); |
There was a problem hiding this comment.
Why wouldn't we remove the file from the createFiles or the filesInProcess here?
| } | ||
| } | ||
| @Test | ||
| public void testWhenFirstWriterClosedInTheLast() throws Exception |
There was a problem hiding this comment.
Does "InTheLast" mean "AtTheEnd" or something else?
| { | ||
| if(writerCurrentlyInUse) | ||
| { | ||
| return delegateSmooshedWriter(name , size); |
|
|
||
| public SmooshedWriter addWithSmooshedWriter(final String name, final long size) throws IOException | ||
| { | ||
| if(writerCurrentlyInUse) |
There was a problem hiding this comment.
Logic should go after argument check (if block below).
There was a problem hiding this comment.
Here should be a space after if
| } | ||
| } | ||
|
|
||
| private SmooshedWriter delegateSmooshedWriter(final String name,final long size) throws IOException |
| completedFiles.add(tmpFile); | ||
| filesInProcess.remove(tmpFile); | ||
|
|
||
| if(!writerCurrentlyInUse ) { |
| { | ||
| final File tmpFile = new File(baseDir, name); | ||
| filesInProcess.add(tmpFile); | ||
| return new SmooshedWriter () { |
There was a problem hiding this comment.
no space before (), { on the new line
|
|
||
| public SmooshedWriter addWithSmooshedWriter(final String name, final long size) throws IOException | ||
| { | ||
| if(writerCurrentlyInUse) |
There was a problem hiding this comment.
Here should be a space after if
| //get processed elements from the stack and write. | ||
| List<File> fileToProcess = new ArrayList<>(completedFiles); | ||
| completedFiles = Lists.newArrayList(); | ||
| for(File file: fileToProcess) |
There was a problem hiding this comment.
Here should be a space after for
| //book keeping checks on created file. | ||
| if(!completedFiles.isEmpty()) | ||
| { | ||
| for(File file: completedFiles) |
There was a problem hiding this comment.
Here should be a space after for
| public void close() throws IOException | ||
| { | ||
| //book keeping checks on created file. | ||
| if(!completedFiles.isEmpty()) |
There was a problem hiding this comment.
Here should be a space after if
|
@logarithm @leventov @cheddar Thanks for reviewing the PR. I addressed all the comments and updated the pr. My Eclipse auto-formatter was turned off which caused the formatting issues. |
|
👍 |
|
Please don't merge until the end of this week, I want to review this. |
leventov
left a comment
There was a problem hiding this comment.
- Rebase onto the current master.
- There are still many code style problems. Please get this file: https://github.com/druid-io/druid/blob/master/druid_intellij_formatting.xml, import it into idea, and run "Reformat Code". Or, if you use Eclipse: https://github.com/druid-io/druid/blob/master/eclipse_formatting.xml
- I honestly tried to understand this code for some time but failed. I don't understand why we need two new collections of files, why they replace each other, the meaning of methods, etc. I cannot approve the PR in this state because it's too complex to understand (hence to support). You should either add extensive comments on the new fields and methods, explaining what is going on, or (better) find a simpler way to accomplish what you want.
|
@leventov Thanks for looking into this PR, I'm using https://github.com/druid-io/druid/blob/master/eclipse_formatting.xml to format code. I'll update my local settings if that setting got updated. |
|
@leventov Updated the PR and added comments to explain code logic. |
leventov
left a comment
There was a problem hiding this comment.
Ok, I see now. Thanks for the changes. I've also read original issue in Druid.
I think this whole thing should better be implemented using composition over FileSmoosher, rather than adding new logic to this class.
Please create a class like
class MultiSmoosher { // choose a better name please
private final FileSmoosher delegate;
private List<File> completedFiles;
private List<File> filesInProcess;
}And move the logic, which you have added to FileSmoosher in this PR, to this class.
I think this wrapper class could have specialized methods for writing to the delegate fileSmoosher directly and to temp files, because you know what do you need in the call sites in Druid. So the tricky writerCurrentlyInUse logic is not needed, I believe.
This class should be added to Druid codebase directly, not java-util.
|
|
||
| private final List<File> outFiles = Lists.newArrayList(); | ||
| private final Map<String, Metadata> internalFiles = Maps.newTreeMap(); | ||
| // list of files completed writing content using delegated smooshedWriter. |
There was a problem hiding this comment.
I think it is better to write such comments as Javadoc /** ... */, then on automatic refactorings and reformattings IDEs will move comments along with fields. Javadoc comments are allowed for private fields.
| private void mergeWithSmoosher() throws IOException | ||
| { | ||
| // Get processed elements from the stack and write. | ||
| List<File> fileToProcess = new ArrayList<>(completedFiles); |
There was a problem hiding this comment.
Maybe it can be just iteration over completedFiles (currently lines 251-256), then completedFiles.clear().
| { | ||
| for (File file: completedFiles) | ||
| { | ||
| file.delete(); |
There was a problem hiding this comment.
Maybe merge completedFiles here using mergeWithSmoosher(), rather than silently delete?
There was a problem hiding this comment.
Entry into the if statement bounding these for loops causes an exception to be thrown, the deletion is to clean up state before that exception is thrown.
| { | ||
| file.delete(); | ||
| } | ||
| for (File file: filesInProcess) |
There was a problem hiding this comment.
Maybe throw some exception here, if filesToProcess is not empty? Because presence of files in filesInProcess means somebody opened a smoosherWriter but didn't close it before closing FileSmoosher. It seems like a bug.
There was a problem hiding this comment.
There is an exception thrown just as soon as this for loop exits for exactly that reason. It's a bug if any files haven't yet been merged by the time the larger Smoosher gets closed (each of the individual Writers should have been closed and those should have caused everything to get merged).
|
@leventov the idea to do it with composition is definitely interesting. It poses a few problems as well because the only thing that currently has access to the Also, there is a bit of state that the I think that this code belongs inside Can you elaborate on why you are afraid to change |
From the description of apache/druid#3513 I understood that @akashdw is going to pass FileSmoosher somewhere while SmoosherWriter is open. AFAIU it's not recursion, i. e. FileSmoosher is passed to some different place. A wrapper class could essentially have
The only bit of state that new logic accesses is baseDir, to create temp files. However temp files don't need (and actually shouldn't) be created in the smoosher directory itself. They should better be created via Java standard, or Guava's utility methods for creating temp files, which end up placing it in java.io.temp dir or /tmp, which is likely to be mounted to tmpfs, while the smoosher directory is less likely to be there (?) So, I don't see the new logic is tightly coupled with the FileSmoosher state. Also a couple of weeks ago you mentioned that it's better to add code to druid than java-util. This is also the case, I suppose. |
|
@cheddar Also, I think it's quite likely that @akashdw's proposed design of adding functionality to Druid which he wants to add will be changed sooner or later. Maybe even as soon as on the PR review stage. So that this change to FileSmoosher is no longer needed, for whatever reason. But we will need to carry it because java-util is a library. While logic from Druid core could be removed freely, because it is not a library. Also don't forget about boilerplate of having separate PRs and release cycle of java-util and Druid... |
|
Ok, we will just move this to be part of the larger PR in druid. |
|
Btw,
We cannot and should not do that. By writing into random temporary locations, you are placing extra data in arbitrary places that the users do not have strong control over. The directories used inside of Druid for storing these things have a very specific lifecycle and get cleaned up according to that lifecycle. Reusing those directories is important and using tmp locations is a non-starter (and often leads to orphaned files and users not understanding why things have filled up). |
|
@cheddar thanks, this makes sense. So, smooshed file directories are temporal as a whole? And could be located in tmpfs? So if "MultiSmoosher" needs to access FileSmoosher's baseDir, it could be passed along with FileSmoosher in constructor, or |
drcrallen
left a comment
There was a problem hiding this comment.
On a softer note, I suggest checking the result of delete() and at least log a warning if you are unable to delete a file.
| * various "chunk" files will be varying sizes and it is not possible to add a | ||
| * file of size greater than Integer.MAX_VALUE. | ||
| * <p/> | ||
| * This class is not thread safe but allows writing multiple files even if main |
There was a problem hiding this comment.
Please change to a straight forward declaration like This class is no thread safe on its own line.
You can describe the handling of multiple files in another paragraph but such description should in no way mangled with explanations of thread safett.
| if (i==10) | ||
| { | ||
| writer.write(ByteBuffer.wrap(Ints.toByteArray(19))); | ||
| CloseQuietly.close(writer); |
| smoosher.add(String.format("%d", i), tmpFile); | ||
| tmpFile.delete(); | ||
| } | ||
| CloseQuietly.close(writer); |
| } | ||
| } | ||
| } | ||
| } No newline at end of file |
|
Thanks! |
* FileSmoosher requested changes from metamx/java-util#55 * Addressed code review requested changes.
* FileSmoosher requested changes from metamx/java-util#55 * Addressed code review requested changes.
* FileSmoosher requested changes from metamx/java-util#55 * Addressed code review requested changes.
This pr is to support
FileSmoosherchanges proposed in apache/druid#3513 .FileSmoosherdetect when it already has an OutputStream open and redirect newly opened OutputStreams to new files on the file system. When any of the open OutputStream objects are closed, they will also check to see if any of the other files have been closed in the meantime. If there are OutputStreams that have been closed, they will be copied on to the main smoosh file and the extra underlying file will be cleaned up. This has the downside of introducing an extra copy to these files, but it given that the strategy will only be used when absolutely necessary, this shouldn’t result in any noticeable performance degradation during indexing.