Skip to content

fix delegated smoosh writer and some new facilities for segment writeout medium#12132

Merged
clintropolis merged 2 commits intoapache:masterfrom
clintropolis:segment-writeout-improvements
Jan 11, 2022
Merged

fix delegated smoosh writer and some new facilities for segment writeout medium#12132
clintropolis merged 2 commits intoapache:masterfrom
clintropolis:segment-writeout-improvements

Conversation

@clintropolis
Copy link
Copy Markdown
Member

Description

This PR fixes a bug in FileSmoosher when using addWithSmooshedWriter in "delegated" mode , which writes to temporary files until the "parent" SmooshedWriter is closed, in cases file names which look like valid paths, e.g. 'foo/bar'.

The added test case will explode with an error like:

java.nio.file.NoSuchFileException: /var/folders/v0/zc_jt5n12d39kx7dn6xmg73w0000gn/T/junit296384917261983992/base/foo/bar/0

prior to the changes in this PR, which is now using a counter to make a temp file name, and a map of the counter value to the actual file name when merging the delegated files back into the parent file.

There are also a couple of other improvements for some future changes that I would like to make. The first is FileSmoosher.addWithSmooshedWriter, when not delegating, now checks that it is still open when doing stuff inside of close, making it now be a no-op if already closed (allowing column serializers to add additional files and avoid delegated mode if they are finished writing out their own content and ned to add additional files)

Additionally, SegmentWriteOutMedium has a new method that allows callers to free-up WriteOutBytes when using a shared medium and are fully in control of the lifecycle of the WriteOutBytes.

  SegmentWriteOutMedium makeChildWriteOutMedium() throws IOException;

The contract of this method is that the child medium is registered to the closer of the parent, so need not be explicitly closed, but it also may be explicitly closed, to allow freeing the backing resources of WriteOutBytes created by the medium. When using a shared medium, such as is done when serializing columns during segment creation, these resources are typically not able to be released until the medium itself is closed.

The type of use case this method is targeting is actually present in IntermediateColumnarLongsSerializer, which does not actually use the SegmentWriteOutMedium until getSerializedSize/writeTo is called where they are immediately written to the channel. However, there is little benefit to changing to this method, because it is a very small number of additional files that will be created when writing out each column, and so not much to gain by releasing them early. Something like #10628 however, which creates a lot of files and immediately dumps them in the channel, could greatly reduce the number of open files on the system if there were for example a very large number of columns.

Finally, OnHeapMemorySegmentWriteOutMedium previously would never mark its WriteOutBytes as 'closed', meaning that callers could indefinitely write to one of these buffers even after the medium was closed. This seemed like it allows breaking the contract of SegmentWriteOutMedium to use WriteOutBytes after closed, and the underlying ByteBufferWriteOutBytes has methods to check that the medium is open, but the heap implementation has no way to set it as not opened. I have changed this so that all implementations of SegmentWriteOutMedium behave identically by pushing the free method down to ByteBufferWriteOutBytes (and overriding it in the direct implementation).


This PR has:

  • been self-reviewed.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

…out medium

changes:
* fixed issue with delegated `SmooshedWriter` when writing files that look like paths, causing `NoSuchFileException` exceptions when attempting to open a channel to the file
* `FileSmoosher.addWithSmooshedWriter` when _not_ delegating now checks that it is still open when closing, making it a no-op if already closed (allowing column serializers to add additional files and avoid delegated mode if they are finished writing out their own content and ned to add additional files)
* add `makeChildWriteOutMedium` to `SegmentWriteOutMedium` interface, which allows users of a shared medium to clean up `WriteOutBytes` if they fully control the lifecycle. there are no callers of this yet, adding for future functionality
* `OnHeapByteBufferWriteOutBytes` now can be marked as not open so it `OnHeapMemorySegmentWriteOutMedium` can now behave identically to other medium implementations
@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Jan 7, 2022

This pull request fixes 1 alert when merging b09a563 into 2299eb3 - view on LGTM.com

fixed alerts:

  • 1 for Uncontrolled data used in path expression

Copy link
Copy Markdown
Contributor

@cheddar cheddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One kinda nitty comment. Other than that, LGTM

private List<File> filesInProcess = new ArrayList<>();
// delegated smooshedWriter creates a new temporary file per file added. we use a counter to name these temporary
// files, and map the file number to the file name so we don't have to escape the file names (e.g. names with '/')
private int delegateFileCounter = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is likely only ever used in a single-threaded context. The extra overhead of an AtomicLong is basically nothing and it makes for a bit nicer guarantees, so I'd suggest using that instead.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this class definitely isn't thread-safe, but i don't have a strong preference either way so went ahead and changed it

@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Jan 11, 2022

This pull request fixes 1 alert when merging e734b7b into 2a41b7b - view on LGTM.com

fixed alerts:

  • 1 for Uncontrolled data used in path expression

@clintropolis clintropolis merged commit 7cf9192 into apache:master Jan 11, 2022
@clintropolis clintropolis deleted the segment-writeout-improvements branch January 11, 2022 06:25
@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022
sachinsagare pushed a commit to pinterest/druid that referenced this pull request Oct 28, 2022
…out medium (apache#12132)

* fix delegated smoosh writer and some new facilities for segment writeout medium
changes:
* fixed issue with delegated `SmooshedWriter` when writing files that look like paths, causing `NoSuchFileException` exceptions when attempting to open a channel to the file
* `FileSmoosher.addWithSmooshedWriter` when _not_ delegating now checks that it is still open when closing, making it a no-op if already closed (allowing column serializers to add additional files and avoid delegated mode if they are finished writing out their own content and ned to add additional files)
* add `makeChildWriteOutMedium` to `SegmentWriteOutMedium` interface, which allows users of a shared medium to clean up `WriteOutBytes` if they fully control the lifecycle. there are no callers of this yet, adding for future functionality
* `OnHeapByteBufferWriteOutBytes` now can be marked as not open so it `OnHeapMemorySegmentWriteOutMedium` can now behave identically to other medium implementations

* fix to address nit - use AtomicLong

(cherry picked from commit 7cf9192)
sachinsagare pushed a commit to sachinsagare/druid that referenced this pull request Nov 2, 2022
…out medium (apache#12132)

* fix delegated smoosh writer and some new facilities for segment writeout medium
changes:
* fixed issue with delegated `SmooshedWriter` when writing files that look like paths, causing `NoSuchFileException` exceptions when attempting to open a channel to the file
* `FileSmoosher.addWithSmooshedWriter` when _not_ delegating now checks that it is still open when closing, making it a no-op if already closed (allowing column serializers to add additional files and avoid delegated mode if they are finished writing out their own content and ned to add additional files)
* add `makeChildWriteOutMedium` to `SegmentWriteOutMedium` interface, which allows users of a shared medium to clean up `WriteOutBytes` if they fully control the lifecycle. there are no callers of this yet, adding for future functionality
* `OnHeapByteBufferWriteOutBytes` now can be marked as not open so it `OnHeapMemorySegmentWriteOutMedium` can now behave identically to other medium implementations

* fix to address nit - use AtomicLong

(cherry picked from commit 7cf9192)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants