Skip to content

File smoosher changes#55

Closed
akashdw wants to merge 8 commits intometamx:masterfrom
akashdw:file_smoosher_changes
Closed

File smoosher changes#55
akashdw wants to merge 8 commits intometamx:masterfrom
akashdw:file_smoosher_changes

Conversation

@akashdw
Copy link
Copy Markdown
Contributor

@akashdw akashdw commented Sep 26, 2016

This pr is to support FileSmoosher changes proposed in apache/druid#3513 .

FileSmoosher detect when it already has an OutputStream open and redirect newly opened OutputStreams to new files on the file system. When any of the open OutputStream objects are closed, they will also check to see if any of the other files have been closed in the meantime. If there are OutputStreams that have been closed, they will be copied on to the main smoosh file and the extra underlying file will be cleaned up. This has the downside of introducing an extra copy to these files, but it given that the strategy will only be used when absolutely necessary, this shouldn’t result in any noticeable performance degradation during indexing.

{
if(writerCurrentlyInUse)
{
return delegateSmooshedWriter (name , size);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weird spacing

private final List<File> outFiles = Lists.newArrayList();
private final Map<String, Metadata> internalFiles = Maps.newTreeMap();
private List<File> files = Lists.newArrayList();
private List<File> createFiles = Lists.newArrayList();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be past-tense?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, how about a rename to filesInProcess?

throw new IOException(
String.format("Expected [%,d] bytes, only saw [%,d], potential corruption?", size, bytesWritten)
);
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whitespace appears to be off...

);
}

if(!writerCurrentlyInUse)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that you just set writerCurrentlyInUse to false and that this object is not attempting to be thread-safe, I don't think this can ever be false.

return addToOffset(Ints.checkedCast(java.nio.file.Files.copy(in, tmpFile.toPath())));
}

public int addToOffset(int numBytesWritten)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method has a really weird contract.

  1. Why is it public?
  2. Why does it return its incoming argument without any adjustment?

The name of the method makes it seem like this would be returning the total offset instead of the size. The methods in the non-delegate writer all do this call first with the name "verify" which makes it clear that the intent of the method is to do some state checks...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, It shouldn't be. Made the code changes to make it more meaningful.


private final List<File> outFiles = Lists.newArrayList();
private final Map<String, Metadata> internalFiles = Maps.newTreeMap();
private List<File> files = Lists.newArrayList();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about renaming this to completedFiles.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

{
open = false;
out.close();
files.add(tmpFile);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wouldn't we remove the file from the createFiles or the filesInProcess here?

}
}
@Test
public void testWhenFirstWriterClosedInTheLast() throws Exception
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does "InTheLast" mean "AtTheEnd" or something else?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

{
if(writerCurrentlyInUse)
{
return delegateSmooshedWriter(name , size);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space before ,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


public SmooshedWriter addWithSmooshedWriter(final String name, final long size) throws IOException
{
if(writerCurrentlyInUse)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic should go after argument check (if block below).

Copy link
Copy Markdown
Contributor

@logarithm logarithm Sep 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here should be a space after if

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}
}

private SmooshedWriter delegateSmooshedWriter(final String name,final long size) throws IOException
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space after ,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

completedFiles.add(tmpFile);
filesInProcess.remove(tmpFile);

if(!writerCurrentlyInUse ) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spacing

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

{
final File tmpFile = new File(baseDir, name);
filesInProcess.add(tmpFile);
return new SmooshedWriter () {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no space before (), { on the new line

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


public SmooshedWriter addWithSmooshedWriter(final String name, final long size) throws IOException
{
if(writerCurrentlyInUse)
Copy link
Copy Markdown
Contributor

@logarithm logarithm Sep 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here should be a space after if

//get processed elements from the stack and write.
List<File> fileToProcess = new ArrayList<>(completedFiles);
completedFiles = Lists.newArrayList();
for(File file: fileToProcess)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here should be a space after for

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

//book keeping checks on created file.
if(!completedFiles.isEmpty())
{
for(File file: completedFiles)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here should be a space after for

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

public void close() throws IOException
{
//book keeping checks on created file.
if(!completedFiles.isEmpty())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here should be a space after if

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@leventov leventov mentioned this pull request Oct 4, 2016
@akashdw
Copy link
Copy Markdown
Contributor Author

akashdw commented Oct 4, 2016

@logarithm @leventov @cheddar Thanks for reviewing the PR. I addressed all the comments and updated the pr. My Eclipse auto-formatter was turned off which caused the formatting issues.

@cheddar
Copy link
Copy Markdown
Contributor

cheddar commented Oct 4, 2016

👍

@leventov
Copy link
Copy Markdown
Contributor

leventov commented Oct 4, 2016

Please don't merge until the end of this week, I want to review this.

Copy link
Copy Markdown
Contributor

@leventov leventov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Rebase onto the current master.
  2. There are still many code style problems. Please get this file: https://github.com/druid-io/druid/blob/master/druid_intellij_formatting.xml, import it into idea, and run "Reformat Code". Or, if you use Eclipse: https://github.com/druid-io/druid/blob/master/eclipse_formatting.xml
  3. I honestly tried to understand this code for some time but failed. I don't understand why we need two new collections of files, why they replace each other, the meaning of methods, etc. I cannot approve the PR in this state because it's too complex to understand (hence to support). You should either add extensive comments on the new fields and methods, explaining what is going on, or (better) find a simpler way to accomplish what you want.

@akashdw
Copy link
Copy Markdown
Contributor Author

akashdw commented Oct 7, 2016

@leventov Thanks for looking into this PR, I'm using https://github.com/druid-io/druid/blob/master/eclipse_formatting.xml to format code. I'll update my local settings if that setting got updated.
I updated the code and addressed all the formatting comments, not sure why github is not showing the updated code in comment section, however files changes section is showing formatted code.

@akashdw
Copy link
Copy Markdown
Contributor Author

akashdw commented Oct 8, 2016

@leventov Updated the PR and added comments to explain code logic.

Copy link
Copy Markdown
Contributor

@leventov leventov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see now. Thanks for the changes. I've also read original issue in Druid.

I think this whole thing should better be implemented using composition over FileSmoosher, rather than adding new logic to this class.

Please create a class like

class MultiSmoosher { // choose a better name please
  private final FileSmoosher delegate;
  private List<File> completedFiles;
  private List<File> filesInProcess;
}

And move the logic, which you have added to FileSmoosher in this PR, to this class.

I think this wrapper class could have specialized methods for writing to the delegate fileSmoosher directly and to temp files, because you know what do you need in the call sites in Druid. So the tricky writerCurrentlyInUse logic is not needed, I believe.

This class should be added to Druid codebase directly, not java-util.


private final List<File> outFiles = Lists.newArrayList();
private final Map<String, Metadata> internalFiles = Maps.newTreeMap();
// list of files completed writing content using delegated smooshedWriter.
Copy link
Copy Markdown
Contributor

@leventov leventov Oct 10, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to write such comments as Javadoc /** ... */, then on automatic refactorings and reformattings IDEs will move comments along with fields. Javadoc comments are allowed for private fields.

private void mergeWithSmoosher() throws IOException
{
// Get processed elements from the stack and write.
List<File> fileToProcess = new ArrayList<>(completedFiles);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it can be just iteration over completedFiles (currently lines 251-256), then completedFiles.clear().

{
for (File file: completedFiles)
{
file.delete();
Copy link
Copy Markdown
Contributor

@leventov leventov Oct 10, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe merge completedFiles here using mergeWithSmoosher(), rather than silently delete?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Entry into the if statement bounding these for loops causes an exception to be thrown, the deletion is to clean up state before that exception is thrown.

{
file.delete();
}
for (File file: filesInProcess)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe throw some exception here, if filesToProcess is not empty? Because presence of files in filesInProcess means somebody opened a smoosherWriter but didn't close it before closing FileSmoosher. It seems like a bug.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an exception thrown just as soon as this for loop exits for exactly that reason. It's a bug if any files haven't yet been merged by the time the larger Smoosher gets closed (each of the individual Writers should have been closed and those should have caused everything to get merged).

@cheddar
Copy link
Copy Markdown
Contributor

cheddar commented Oct 10, 2016

@leventov the idea to do it with composition is definitely interesting. It poses a few problems as well because the only thing that currently has access to the SmooshedWriter to know if it is in use or not is the FileSmoosher. So, the wrapping class would essentially also have to wrap the SmooshedWriter that is returned.

Also, there is a bit of state that the FileSmoosher necessarily owns, namely the directory that files are being created in. Switching to composition would force this state to somehow be shared with the wrapping class (or the wrapping class would have to have yet another directory available to it for the storing of files).

I think that this code belongs inside FileSmoosher and going through the steps to make it work with composition will be just a bunch of code baggage without much benefit.

Can you elaborate on why you are afraid to change FileSmoosher?

@leventov
Copy link
Copy Markdown
Contributor

leventov commented Oct 10, 2016

@cheddar

It poses a few problems as well because the only thing that currently has access to the SmooshedWriter to know if it is in use or not is the FileSmoosher. So, the wrapping class would essentially also have to wrap the SmooshedWriter that is returned.

From the description of apache/druid#3513 I understood that @akashdw is going to pass FileSmoosher somewhere while SmoosherWriter is open. AFAIU it's not recursion, i. e. FileSmoosher is passed to some different place. A wrapper class could essentially have delegateSmooshedWriter() method public (however I would name it differently, somehow involving adjective "delayed"), and call this method from that other place. Don't need to track the FileSmoosher itself. I written this already in other words:

I think this wrapper class could have specialized methods for writing to the delegate fileSmoosher directly and to temp files, because you know what do you need in the call sites in Druid. So the tricky writerCurrentlyInUse logic is not needed, I believe.


Also, there is a bit of state that the FileSmoosher necessarily owns, namely the directory that files are being created in. Switching to composition would force this state to somehow be shared with the wrapping class (or the wrapping class would have to have yet another directory available to it for the storing of files).

The only bit of state that new logic accesses is baseDir, to create temp files. However temp files don't need (and actually shouldn't) be created in the smoosher directory itself. They should better be created via Java standard, or Guava's utility methods for creating temp files, which end up placing it in java.io.temp dir or /tmp, which is likely to be mounted to tmpfs, while the smoosher directory is less likely to be there (?)


So, I don't see the new logic is tightly coupled with the FileSmoosher state. Also a couple of weeks ago you mentioned that it's better to add code to druid than java-util. This is also the case, I suppose.

@leventov
Copy link
Copy Markdown
Contributor

leventov commented Oct 10, 2016

@cheddar Also, I think it's quite likely that @akashdw's proposed design of adding functionality to Druid which he wants to add will be changed sooner or later. Maybe even as soon as on the PR review stage. So that this change to FileSmoosher is no longer needed, for whatever reason. But we will need to carry it because java-util is a library. While logic from Druid core could be removed freely, because it is not a library. Also don't forget about boilerplate of having separate PRs and release cycle of java-util and Druid...

@cheddar
Copy link
Copy Markdown
Contributor

cheddar commented Oct 11, 2016

Ok, we will just move this to be part of the larger PR in druid.

@cheddar
Copy link
Copy Markdown
Contributor

cheddar commented Oct 11, 2016

Btw,

They should better be created via Java standard, or Guava's utility methods for creating temp files, which end up placing it in java.io.temp dir or /tmp, which is likely to be mounted to tmpfs, while the smoosher directory is less likely to be there (?)

We cannot and should not do that. By writing into random temporary locations, you are placing extra data in arbitrary places that the users do not have strong control over. The directories used inside of Druid for storing these things have a very specific lifecycle and get cleaned up according to that lifecycle. Reusing those directories is important and using tmp locations is a non-starter (and often leads to orphaned files and users not understanding why things have filled up).

@leventov
Copy link
Copy Markdown
Contributor

leventov commented Oct 11, 2016

@cheddar thanks, this makes sense. So, smooshed file directories are temporal as a whole? And could be located in tmpfs?

So if "MultiSmoosher" needs to access FileSmoosher's baseDir, it could be passed along with FileSmoosher in constructor, or getBaseDir() accessor could be added to FileSmoosher.

Copy link
Copy Markdown
Contributor

@drcrallen drcrallen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a softer note, I suggest checking the result of delete() and at least log a warning if you are unable to delete a file.

* various "chunk" files will be varying sizes and it is not possible to add a
* file of size greater than Integer.MAX_VALUE.
* <p/>
* This class is not thread safe but allows writing multiple files even if main
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change to a straight forward declaration like This class is no thread safe on its own line.

You can describe the handling of multiple files in another paragraph but such description should in no way mangled with explanations of thread safett.

if (i==10)
{
writer.write(ByteBuffer.wrap(Ints.toByteArray(19)));
CloseQuietly.close(writer);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why quietly?

smoosher.add(String.format("%d", i), tmpFile);
tmpFile.delete();
}
CloseQuietly.close(writer);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why quietly?

}
}
}
} No newline at end of file
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing newline

akashdw added a commit to akashdw/druid that referenced this pull request Nov 9, 2016
@drcrallen
Copy link
Copy Markdown
Contributor

Thanks!

fjy pushed a commit to apache/druid that referenced this pull request Nov 10, 2016
* FileSmoosher requested changes 

from metamx/java-util#55

* Addressed code review requested changes.
@leventov leventov closed this Nov 10, 2016
fundead pushed a commit to fundead/druid that referenced this pull request Dec 7, 2016
* FileSmoosher requested changes 

from metamx/java-util#55

* Addressed code review requested changes.
dgolitsyn pushed a commit to metamx/druid that referenced this pull request Feb 14, 2017
* FileSmoosher requested changes 

from metamx/java-util#55

* Addressed code review requested changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants