S3 input source #8903

Merged
clintropolis merged 35 commits into apache:master from clintropolis:s3-input-source
Nov 26, 2019

Conversation

@clintropolis
Member

Description

Following up on #8823, this PR adds an S3 InputSource and InputEntity implementation, allowing S3 to be used with the new native batch indexing interfaces. It currently reuses the same configuration options as the S3 static firehose, but as an InputSource:

...
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "uris": ["s3://some/path/file.json"]
      },
      "inputFormat": {
        "type": "json"
      },
      "appendToExisting": false
    },
...

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.
  • added integration tests.
  • been tested in a test Druid cluster.

Key changed/added classes in this PR
  • S3InputSource
  • S3Entity

@ccaominh
Contributor

Travis failure looks related to PR changes:

[ERROR]   Run 1: S3InputSourceTest.testWithPrefixesSplit:131 expected:<[s3://foo/bar/file.gz, s3://bar/foo/file2.gz]> but was:<[s3://foo/bar, s3://bar/foo]>

https://travis-ci.org/apache/incubator-druid/jobs/613814127

}

protected InputSourceReader fixedFormatReader(InputRowSchema inputRowSchema, @Nullable File temporaryDirectory)
throws IOException
Contributor

TeamCity is flagging this: The declared exception IOException is never thrown in this method, nor in its derivables

Member Author

Ah yeah, nothing implements this method yet. I was preemptively adding it because InputSource.createSplits is allowed to throw an IOException and readers will most likely be calling this method. I can remove it for now, though I suspect it will need to be added back later.

@clintropolis
Member Author

Travis failure looks related to PR changes:

Yeah, made a mistake cleaning up the code for PR and pointed the mock to return the wrong uris, fixed.

public Stream<InputSplit<URI>> createSplits(InputFormat inputFormat, @Nullable SplitHintSpec splitHintSpec)
throws IOException
{
if (cacheSplitUris == null) {
Contributor

The javadoc for SplittableInputSource.createSplits() notes that implementations should NOT cache the created splits in memory.

@jihoonson Is my understanding of SplittableInputSource.createSplits() correct?

Contributor

Yes, it shouldn’t be cached in memory.

Member Author

@clintropolis Nov 20, 2019

modified, uris are already in memory so if given an explicit list there is nothing to be chill about.

prefixes are handled with an iterator that was based on a previous iterator implementation in S3Utils. This iterator uses list objects calls on each prefix in batches of 1024 objects (with a fallback to getObjectMetadata if a specific 403 is encountered), and creates an iterator on that set of summaries which is drained to the outer iterator. When the batch (or single summary) is done, it moves on to the next batch or the next prefix and repeats, until all prefixes in the list have been iterated. Retries are baked into each call, so the caller of this method doesn't have to worry about such things.
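
For readers following along, the batching pattern described above might look roughly like the sketch below. It is not the code in this PR: `PrefixBatchIterator`, `Batch`, and `Lister` are hypothetical names, the `Lister` stands in for the S3 client's list objects call, and the retry handling and 403 fallback are left out.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical sketch: lazily walks each prefix, holding only one batch of keys in memory. */
public class PrefixBatchIterator implements Iterator<String>
{
  /** Hypothetical stand-in for one "list objects" response. */
  public static class Batch
  {
    final List<String> keys;
    final String nextToken; // null when the prefix has no more batches

    public Batch(List<String> keys, String nextToken)
    {
      this.keys = keys;
      this.nextToken = nextToken;
    }
  }

  /** Hypothetical stand-in for the client call: (prefix, continuation token) to batch. */
  public interface Lister
  {
    Batch list(String prefix, String token);
  }

  private final Iterator<String> prefixes;
  private final Lister lister;
  private Iterator<String> current = Collections.emptyIterator();
  private String currentPrefix;
  private String token;
  private boolean prefixDone = true;

  public PrefixBatchIterator(Iterator<String> prefixes, Lister lister)
  {
    this.prefixes = prefixes;
    this.lister = lister;
  }

  @Override
  public boolean hasNext()
  {
    // Keep fetching until a non-empty batch is found or every prefix is exhausted.
    while (!current.hasNext()) {
      if (prefixDone) {
        if (!prefixes.hasNext()) {
          return false;
        }
        currentPrefix = prefixes.next();
        token = null;
        prefixDone = false;
      }
      Batch batch = lister.list(currentPrefix, token);
      current = batch.keys.iterator();
      token = batch.nextToken;
      prefixDone = token == null;
    }
    return true;
  }

  @Override
  public String next()
  {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return current.next();
  }
}
```

Because each batch is fetched only when the previous one is drained, nothing beyond the current batch is cached, which is the property the createSplits() javadoc asks for.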

Member Author

@clintropolis Nov 20, 2019

hmm, i should probably add something like this to a javadoc of the iterator method 😢

Contributor

@jihoonson Nov 20, 2019

Ok, I'm fine with this change since it fixes the race between createSplits() and getNumSplits().

this.uri = uri;
}

@Nullable
Contributor

I believe this never returns null. If it does, then uri should be checked for null below in open before calling uri.getAuthority().

Member Author

removed @Nullable

import java.io.InputStream;
import java.net.URI;

public class S3Entity implements InputEntity
Contributor

Can be package-private

);

@Test
public void testSerde() throws Exception
Contributor

Consider splitting this into two separate tests (one for uris and one for prefixes).

Member Author

split

{
final ObjectMapper mapper = createS3ObjectMapper();

final List<URI> uris = Arrays.asList(
Contributor

Consider making uris and prefixes static final variables since they're used in a few tests.

Member Author

i went ahead and did this though i'm not sure I actually like the change, since looking at each test it's a lot less obvious what it is testing

}

@Override
protected InputSourceReader formattableReader(
Contributor

Unit test coverage is missing for this method

Member Author

added coverage through a test of the read method, which mocks the s3 client to do the list and getMetadata operations, plus a mocked get object method that "returns" an S3Object with csv file content

);
objects.addAll(Lists.newArrayList(objectSummaryIterator));
}
catch (AmazonS3Exception outerException) {
Contributor

Unit test coverage is missing for the exception handling here and may be worth adding to verify the retry logic.

Member Author

added tests for this and more

if (s3Object == null) {
throw new ISE("Failed to get an s3 object for bucket[%s] and key[%s]", bucket, key);
}
return CompressionUtils.decompress(s3Object.getObjectContent(), uri.toString());
Contributor

minor: It doesn't matter much, but uri.getPath() would be better here, because uri.toString() URI-encodes its values.

Actually, key would be even better.

Member Author

oops, I meant to switch this to getPath to be consistent with google extension after your review of this there and forgot, but yeah key is better, fixed this and google extension to use key.

* {@link ServerSideEncryptingAmazonS3#getObjectMetadata} to check if the 'prefix' is an object in the event the
* list objects call responds with a 403 http status code
*/
private static Iterator<InputSplit<URI>> objectFetchingIterator(
Contributor

This code looks like it could be shareable between the input source and the firehose. If true, please accomplish that sharing, ideally by having the firehose call into the input source.

Member Author

It is, i was just ignoring the firehose.

I have moved this iterator to S3Utils and changed the signature to be Iterator<S3ObjectSummary>, and modified S3Utils.objectSummaryIterator, used by StaticS3FirehoseFactory and S3TimestampVersionedDataFinder, to take a URI instead of a bucket and key (mildly unfortunate since we will be converting it back to bucket and key, but all the callers have URI available), and defer the logic to this newer iterator.

Contributor

@ccaominh left a comment

LGTM 👍

Comment thread extensions-core/s3-extensions/pom.xml Outdated
<dependency>
<groupId>joda-time</groupId>
<artifactId>joda-time</artifactId>
<version>2.10.5</version>
Contributor

Can remove the version here so it uses the one defined in the root POM.

originalAuthority;
final String path = originalPath.startsWith("/") ? originalPath.substring(1) : originalPath;

return URI.create(StringUtils.format("s3://%s/%s", authority, path));
Contributor

Since authority and path are both strings, string concatenation may be better than StringUtils.format() here

EasyMock.expect(S3_CLIENT.listObjectsV2(EasyMock.anyObject(ListObjectsV2Request.class))).andReturn(result).once();
}

private static void addExpectedNonPrefixObjectsWithNoListPermission(URI uri)
Contributor

Parameter uri is not used in the method

originalAuthority;
final String path = originalPath.startsWith("/") ? originalPath.substring(1) : originalPath;

return URI.create(StringUtils.format("s3://%s/%s", authority, path));
Contributor

This is bad, because it won't encode funny characters in path. Imagine the path has a ? in it. It needs to be URI-encoded, or else pulling the key out later won't work. The tricky characters are / (which you don't want to encode) and ?, #, and others (which you do).

StringUtils.urlEncode might help you here.

Alternatively, don't use URIs internally, instead use bucket/key pairs.

Contributor

How about adding this method to CloudObjectLocation:

public URI toUri()
{
  // Encode path, except leave '/' characters unencoded
  return URI.create(StringUtils.format("s3://%s/%s", bucket, StringUtils.urlEncode(path).replace("%2F", "/")));
}

And using it everywhere that is doing this sort of concatenation today.

It won't handle weird, invalid bucket names but it's better than the simple concatenation happening now, and weird paths are more likely anyway. For extra credit you could include validation for the bucket, throwing an error if it's not valid (AWS, Google, etc all have rules for what's a valid bucket, you could do a loose superset of them).
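
A standalone sketch of the same idea, using only the JDK so it can be tried outside of Druid. `ObjectUriSketch` is a hypothetical name, and `java.net.URLEncoder` is used here as an assumed stand-in for `StringUtils.urlEncode`; since `URLEncoder` emits `+` for spaces, the sketch rewrites those to `%20` as well.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

/** Hypothetical sketch: builds an s3:// URI from a bucket and key without losing odd characters. */
public class ObjectUriSketch
{
  public static URI toUri(String bucket, String key)
  {
    try {
      // Encode everything, then restore '/' so the key keeps its path structure.
      // URLEncoder writes spaces as '+', so turn those into %20 (a literal '+' is already %2B).
      String encodedKey = URLEncoder.encode(key, "UTF-8")
                                    .replace("+", "%20")
                                    .replace("%2F", "/");
      return URI.create("s3://" + bucket + "/" + encodedKey);
    }
    catch (UnsupportedEncodingException e) {
      // UTF-8 is always available on a compliant JVM.
      throw new IllegalStateException(e);
    }
  }
}
```

With this, `toUri("mybucket", "path/to/myobject?question").getPath()` yields the full key back (with a leading slash) instead of truncating at the `?`, because `URI.getPath()` decodes the escaped characters.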

Member Author

FYI this method in S3Utils was only called by StaticS3FirehoseFactory, not the new stuff, so I wasn't worrying about it too much because we presumably will remove it in a future release. That said, I went ahead and did the thing to fix it

@vogievetsky
Contributor

I noticed that the type of this is s3 (vs static-s3). I am 👍 on the change but should there be a release notes tag? My auto firehose to input source converter got caught out by this.

Comment on lines +81 to +87
if (!this.uris.isEmpty() && !this.prefixes.isEmpty()) {
throw new IAE("uris and prefixes cannot be used together");
}

if (this.uris.isEmpty() && this.prefixes.isEmpty()) {
throw new IAE("uris or prefixes must be specified");
}
Contributor

Optionally, simplify to:

if (this.uris.isEmpty() == this.prefixes.isEmpty()) {
  throw new IAE("exactly one of either uris or prefixes must be specified");
}

}

@JsonProperty
@JsonProperty("uris")
Contributor

Is the ("uris") needed?

Comment on lines +119 to +120
@JsonProperty("objects")
public List<CloudObjectLocation> getObject()
Contributor

I think if you rename the getter to getObjects() then you won't need the ("objects").

org.apache.commons.io.FileUtils.forceMkdir(outDir);

final URI uri = URI.create(StringUtils.format("s3://%s/%s", s3Coords.bucket, s3Coords.path));
final URI uri = URI.create(StringUtils.format("s3://%s/%s", s3Coords.getBucket(), s3Coords.getPath()));
Contributor

Is the problem described in https://github.com/apache/incubator-druid/pull/8903/files/a4f6ae9ae2f81381d865e87ec5e1219d275f299c..7125e3e94bd468bc82c95fad68538890a23c69ee#r348869424 possible here when the URI created here gets passed to the CloudObjectLocation constructor?

Is there a test to check the handling of tricky characters?

Member Author

There is no such test afaik. I guess I can look into this, or maybe do it as a follow-up, since this isn't really new code and it sort of feels like the scope of this PR is creeping

Contributor

This code is definitely buggy, please at least include a comment warning future devs as much. (You aren't doing much here besides mechanical refactoring, so I wouldn't insist on fixing it or adding tests, but the comment is nice.)

Example of the bug:

scala> URI.create(String.format("s3://%s/%s", "mybucket", "path/to/myobject?question")).getPath
res1: String = /path/to/myobject

Also:

scala> URI.create(String.format("s3://%s/%s", "mybucket", "path/to/100%myobject")).getPath
java.lang.IllegalArgumentException: Malformed escape pair at index 25: s3://mybucket/path/to/100%myobject
  at java.net.URI.create(URI.java:852)
  ... 28 elided
Caused by: java.net.URISyntaxException: Malformed escape pair at index 25: s3://mybucket/path/to/100%myobject
  at java.net.URI$Parser.fail(URI.java:2848)
  at java.net.URI$Parser.scanEscape(URI.java:2978)
  at java.net.URI$Parser.scan(URI.java:3001)
  at java.net.URI$Parser.checkChars(URI.java:3019)
  at java.net.URI$Parser.parseHierarchical(URI.java:3105)
  at java.net.URI$Parser.parse(URI.java:3053)
  at java.net.URI.<init>(URI.java:588)
  at java.net.URI.create(URI.java:850)
  ... 28 more

Contributor

IMHO we should always leave code better than we found it. Small bugs like this are not worth putting into an issue, and will likely never get worked on, but some poor soul somewhere on the interwebs will run into it and bang their head against it.

Contributor

Fixing bugs is not scope creep...

Contributor

This is not a regression and so doesn't have to be fixed in this PR. It's up to the author in this case.

IMHO we should always leave code better than we found it. Small bugs like this are not worth putting into an issue, and will likely never get worked on, but some poor soul somewhere on the interwebs will run into it and bang their head against it.

I don't think this will be happening for this bug. This bug is pretty critical and should be fixed as soon as possible.

Member Author

I went ahead and fixed this for most of S3 by refactoring to use CloudObjectLocation to ensure URI handling is good, though the footprint of this PR has grown a lot, which is what I was worried about. In some sense, this is sort of related to the work done in #6761. I have opened #8941 to finish the remaining issues.

private static final String MIMETYPE_JETS3T_DIRECTORY = "application/x-directory";
private static final Logger log = new Logger(S3Utils.class);

public static final int MAX_S3_RETRIES = 10;
Contributor

Unused?

Member Author

removed

);
}
return CompressionUtils.decompress(s3Object.getObjectContent(), key);
return s3Object.getObjectContent();
Contributor

Should this decompress the stream like it did before?

Contributor

Oh this is my bad. S3Entity is a RetryingInputEntity and the returned input stream here is wrapped with RetryingInputStream. Decompression logic should be done on RetryingInputStream.

Contributor

May be worth adding a test

Member Author

good catch

|type|This should be `s3`.|N/A|yes|
|uris|JSON array of URIs where s3 files to be ingested are located.|N/A|`uris` or `prefixes` must be set|
|prefixes|JSON array of URI prefixes for the locations of s3 files to be ingested.|N/A|`uris` or `prefixes` must be set|

Contributor

With your latest changes, need to add another row for objects here and update the required value for the other columns based on the presence of objects.

Member Author

Ah, I actually wasn't going to document objects because it's primarily used internally for the splits for parallel subtasks, to avoid converting bucket/path back into a URI. But if people prefer to put in an array of objects instead of an array of uris, I guess there is no harm in documenting it.

Contributor

The only harm will be making it something we can't get rid of in the future. If you want a method to have public visibility but not be a public API, we should note that on the method.

Comment on lines +71 to +73
if (keyString.startsWith("/")) {
keyString = keyString.substring(1);
}
Contributor

This logic is used in a few places (e.g., CloudObjectLocation(URI uri)). It may be useful to add a helper function.

Member Author

Added StringUtils.maybeRemoveLeadingSlash
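
For context, a helper of this kind could be as small as the sketch below. The class name `LeadingSlash` is hypothetical and the body is an assumption about the behavior, modeled on the `startsWith("/")` snippet quoted above rather than copied from the PR.

```java
/** Hypothetical sketch of a helper in the spirit of StringUtils.maybeRemoveLeadingSlash. */
public class LeadingSlash
{
  public static String maybeRemoveLeadingSlash(String s)
  {
    // Strip at most one leading '/'; null and slash-free inputs pass through unchanged.
    return s != null && s.startsWith("/") ? s.substring(1) : s;
  }
}
```

Only a single slash is removed, so a key that genuinely begins with `/` after the URI path separator is preserved.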

getRetryCondition(),
RetryUtils.DEFAULT_MAX_TRIES
);
return CompressionUtils.decompress(retryingInputStream, getDecompressionPath());
Contributor

nit: thank you for fixing this! I think the javadoc of readFrom and readFromStart should now mention that the returned InputStream shouldn't be decompressed. I'm going to raise a PR adding some unit tests for this bug, and I can update the javadoc in my PR unless you want to do it here.

InputRowSchema inputRowSchema,
@Nullable InputFormat inputFormat,
File temporaryDirectory
@Nullable File temporaryDirectory
Contributor

Is this nullable now? I think it shouldn't be nullable if it's null only in unit tests.

Member Author

oops, i don't remember adding this so I suspect I did this on accident, will fix

@clintropolis
Member Author

I noticed that the type of this is s3 (vs static-s3). I am 👍 on the change but should there be a release notes tag? My auto firehose to input source converter got caught out by this.

I think all of the new native batch stuff will have to be addressed in the release notes. However all of the old stuff is still there so static-s3 still actually works for now if you are using the old parser/firehose way of doing things.

Contributor

@ccaominh left a comment

LGTM 👍

@gianm
Contributor

gianm commented Nov 23, 2019

I noticed that the type of this is s3 (vs static-s3). I am 👍 on the change but should there be a release notes tag? My auto firehose to input source converter got caught out by this.

@vogievetsky — "auto firehose to input source converter" sounds a bit scary, I don't think special effort was put into making the sources consistent with the firehoses (rather, I believe the effort was instead put into making them consistent with other input sources). So there may be other pitfalls if you are assuming they can be converted without special knowledge of the specific source type.



/**
* Get path to decompress a compressed stream for the entity
Contributor

I had trouble making sense of what this comment means, could you please consider rewording it?

At first glance it sounds like a path on local disk that the compressed stream will be decompressed to, but looking at implementations, that doesn't seem right.

At second glance it looks like it's the filename corresponding to the input entity, and is just used to figure out if it needs to be decompressed or not. The javadoc should say something like that.

Member Author

redid javadocs, renamed method to more generic getPath since it's also a correct name given usage and maybe this is useful for other things

@@ -32,30 +33,39 @@ public interface RetryingInputEntity extends InputEntity
@Override
default InputStream open() throws IOException
Contributor

Now that readFromStart() and readFrom(long) are clearer, could you please also update the javadocs for InputEntity#open as well, to say whether or not it should decompress?

Member Author

added javadocs mentioning that the default open implementation handles decompression

* Directly opens an {@link InputStream} on the input entity. Decompression should be handled externally, this should
* return the raw stream for the object.
*/
default InputStream readFromStart() throws IOException
Contributor

nit: These seem like "internal" methods that aren't actually meant to be called by users of the interface. In that case RetryingInputEntity probably makes more sense as an abstract class than as an interface, with these methods marked protected. I won't insist it be changed, but if it stays an interface, it'd be nice for the javadocs to say that external callers aren't meant to use these methods.

The reason is a general assumption that any method on an interface is meant for users of the interface.

Member Author

refactored to abstract class

@Override
public String toString()
{
return "CloudObjectLocation {"
Contributor

The formatting is a little weird here.

Member Author

indeed, fixed, not really sure how it got mangled, i used intellij to generate it originally

Comment thread extensions-core/s3-extensions/pom.xml Outdated
<dependency>
<groupId>joda-time</groupId>
<artifactId>joda-time</artifactId>
<version>2.10.5</version>
Contributor

Please do remove the version here, it should get pulled in via dependencyManagement from the parent.

Member Author

removed

}
}

private S3InputSource(ServerSideEncryptingAmazonS3 s3Client, CloudObjectLocation inputSplit)
Contributor

nit: IMO, it's better to have only one constructor that actually does stuff, and have the others call this(...). It makes invariants easier to get right.

Or just have one constructor, period, and use static creator methods for other styles of creation.

Member Author

removed extra constructor

{
this.s3Client = Preconditions.checkNotNull(s3Client, "s3Client");
this.uris = uris == null ? new ArrayList<>() : uris;
this.prefixes = prefixes == null ? new ArrayList<>() : prefixes;
Contributor

Is there any reason for the inconsistency here: uris and prefixes are set to empty lists if they come in as null, but objects isn't?

Member Author

burned again for initially starting by porting over the S3StaticFirehoseFactory, reworked this to use nulls... though as I'm writing this comment I realize I should probably treat empties the same as null and not consider that invalid... will fix PR again

}

for (final URI inputURI : this.uris) {
Preconditions.checkArgument("s3".equals(inputURI.getScheme()), "input uri scheme == s3 (%s)", inputURI);
Contributor

nit: be nice to extract "s3" into a constant like SCHEME.

Member Author

done

org.apache.commons.io.FileUtils.forceMkdir(outDir);

final URI uri = URI.create(StringUtils.format("s3://%s/%s", s3Coords.bucket, s3Coords.path));
final URI uri = URI.create(StringUtils.format("s3://%s/%s", s3Coords.getBucket(), s3Coords.getPath()));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is definitely buggy, please at least include a comment warning future devs as much. (You aren't doing much here besides mechanical refactoring, so I wouldn't insist on fixing it or adding tests, but the comment is nice.)

Example of the bug:

scala> URI.create(String.format("s3://%s/%s", "mybucket", "path/to/myobject?question")).getPath
res1: String = /path/to/myobject

Also:

scala> URI.create(String.format("s3://%s/%s", "mybucket", "path/to/100%myobject")).getPath
java.lang.IllegalArgumentException: Malformed escape pair at index 25: s3://mybucket/path/to/100%myobject
  at java.net.URI.create(URI.java:852)
  ... 28 elided
Caused by: java.net.URISyntaxException: Malformed escape pair at index 25: s3://mybucket/path/to/100%myobject
  at java.net.URI$Parser.fail(URI.java:2848)
  at java.net.URI$Parser.scanEscape(URI.java:2978)
  at java.net.URI$Parser.scan(URI.java:3001)
  at java.net.URI$Parser.checkChars(URI.java:3019)
  at java.net.URI$Parser.parseHierarchical(URI.java:3105)
  at java.net.URI$Parser.parse(URI.java:3053)
  at java.net.URI.<init>(URI.java:588)
  at java.net.URI.create(URI.java:850)
  ... 28 more

originalAuthority;
final String path = originalPath.startsWith("/") ? originalPath.substring(1) : originalPath;

return URI.create(StringUtils.format("s3://%s/%s", authority, path));
Contributor

How about adding this method to CloudObjectLocation:

public URI toUri()
{
  // Encode path, except leave '/' characters unencoded
  return URI.create(StringUtils.format("s3://%s/%s", bucket, StringUtils.urlEncode(path).replace("%2F", "/")));
}

And using it everywhere that is doing this sort of concatenation today.

It won't handle weird, invalid bucket names but it's better than the simple concatenation happening now, and weird paths are more likely anyway. For extra credit you could include validation for the bucket, throwing an error if it's not valid (AWS, Google, etc all have rules for what's a valid bucket, you could do a loose superset of them).
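
For reference, here is a self-contained sketch of the encode-then-concatenate idea using only the JDK. It assumes `java.net.URLEncoder` as a stand-in for Druid's `StringUtils.urlEncode`; the names `ObjectUriSketch` and `encodePath` are hypothetical, and bucket validation is left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

// Hypothetical illustration of the proposed toUri(); not Druid's CloudObjectLocation.
final class ObjectUriSketch
{
  // Percent-encode the object path, but leave '/' separators unencoded.
  static String encodePath(String path)
  {
    try {
      return URLEncoder.encode(path, "UTF-8")
                       .replace("+", "%20")   // URLEncoder emits '+' for spaces
                       .replace("%2F", "/");  // keep path separators readable
    }
    catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e);     // UTF-8 is always available
    }
  }

  static URI toUri(String bucket, String path)
  {
    return URI.create(String.format("s3://%s/%s", bucket, encodePath(path)));
  }
}
```

With this encoding, the two failing inputs from the earlier comment should round-trip: `getPath()` decodes the escaped `?` and `%` back into the original object key instead of truncating the key or throwing.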

@vogievetsky
Contributor

@gianm I made a thing so that if you paste a firehose-based input spec into the data loader, it is automatically converted to an input-source-based one. I need that because the data loader will soon only work with input sources. Do you think this is a bad idea?

I know there are a lot of Druid users who have ingestion specs saved somewhere outside of Druid. What were you imagining these people would do to get onto the new format? Convert them by hand? I figured that the data loader could be helpful there.

@vogievetsky
Contributor

Also, if the data loader does not have that feature, should there still be a paragraph in the release notes that guides people through converting between the specs? Or is it just "here are the new docs, figure it out"?

@gianm
Contributor

gianm commented Nov 23, 2019

@vogievetsky let's continue this conversation in #8933

* implementations. {@link #bucket} and {@link #path} should NOT be URL encoded.
*
* The intention is that this is used as a common representation for storage objects as an alternative to dealing in
* {@link URI} directly, but still provide a mechansim to round-trip with a URI.
Contributor

If you need to push another commit later, there's a typo here: mechansim -> mechanism

Member Author

fixed

Contributor

@gianm gianm left a comment

Just had one more comment on the new stuff, thanks @clintropolis

{
this.bucket = Preconditions.checkNotNull(StringUtils.maybeRemoveTrailingSlash(bucket));
this.path = Preconditions.checkNotNull(StringUtils.maybeRemoveLeadingSlash(path));
Preconditions.checkArgument(this.bucket.equals(StringUtils.urlEncode(this.bucket)));
Contributor

This exception might get thrown in response to user input, so please add a nice error message. As is, the user would get an IllegalArgumentException with no message.
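
One possible shape for such a check, sketched in plain Java. The message wording and the names `BucketCheckSketch` and `checkBucket` are assumptions for illustration, and `java.net.URLEncoder` stands in for `StringUtils.urlEncode`; the wording that was eventually merged is not shown in this thread.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical illustration; the error text is an assumption.
final class BucketCheckSketch
{
  // Returns the bucket unchanged, or throws with a message that names the bad value.
  static String checkBucket(String bucket)
  {
    if (bucket == null) {
      throw new NullPointerException("bucket must not be null");
    }
    final String encoded;
    try {
      encoded = URLEncoder.encode(bucket, "UTF-8");
    }
    catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e); // UTF-8 is always available
    }
    // Same condition as the diff: the bucket must be unchanged by URL encoding.
    if (!bucket.equals(encoded)) {
      throw new IllegalArgumentException(
          String.format("bucket [%s] must not contain characters that require URL encoding", bucket)
      );
    }
    return bucket;
  }
}
```

With Guava, the same message could be passed as the template argument of `Preconditions.checkArgument`, as the scheme check elsewhere in this PR already does.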

Contributor

@gianm gianm left a comment

LGTM, thanks @clintropolis

@clintropolis clintropolis merged commit 4458113 into apache:master Nov 26, 2019
@clintropolis clintropolis deleted the s3-input-source branch November 26, 2019 06:31
@jon-wei jon-wei added this to the 0.17.0 milestone Dec 17, 2019

7 participants