Skip to content

AWS S3 support for Atum#38

Merged
dk1844 merged 18 commits intomasterfrom
feature/37-aws-sdk-poc
Sep 21, 2020
Merged

AWS S3 support for Atum#38
dk1844 merged 18 commits intomasterfrom
feature/37-aws-sdk-poc

Conversation

@dk1844
Copy link
Collaborator

@dk1844 dk1844 commented Aug 31, 2020

AWS support for ATUM

  • new Atum entry point for S3 impl: sparkSession.enableControlMeasuresTrackingForS3 - no breaking changes for existing usage
  • original persistence implementation is moved under za.co.absa.atum.persistence.hdfs, while the s3 one is under za.co.absa.atum.persistence.s3 for the structures to make sense.
  • README updated
  • Unit tests both for HDFS and S3 added - ControlMeasures(Hdfs|S3)(Loader|Storer)JsonSpec
  • Spark-bound listener (SparkQueryExecutionListener) and output info file inference for the S3 case support added.

Test-run support

  • Sample measurement object (runnable locally, but with disabled test) available for S3 Impl as well - SampleMeasurementsS3RunnerSpec
    • to run the loader, one needs to provide s3 location with existing info file and an existing saml profile
    • to run the storer, one needs to provide writable s3 location, KMS key Id string, and an existing saml profile

Test-run

I have successfully run SampleS3Measurements1 and SampleS3Measurements2 with a real S3 location, the data were read from and written to. ✔️

@dk1844 dk1844 linked an issue Aug 31, 2020 that may be closed by this pull request
@@ -0,0 +1 @@
mock-maker-inline
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows mocking final classes for Mockito, which would not be otherwise possible.

README.md update with S3 specific example
@dk1844 dk1844 marked this pull request as ready for review September 1, 2020 07:02
@@ -0,0 +1,12 @@
package za.co.absa.atum.utils
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be an idea to have FileUtils, HdfsFileUtils, S3Utils together in one object?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, they could in theory. I thought the separation clearly communicates the different concerns they solve. 🤷 I will think about it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I think about this, the less I think it would improve things. So unless there is an obvious upside in this (that I am not seeing at the moment), I'd rather keep it separate.

"implicit def StringToPath" changed to an implicit class wrapper.

version update to 0.3.0-SNAPSHOT
@lokm01
Copy link
Collaborator

lokm01 commented Sep 3, 2020

Very cool PR, please include @yruslan to review when ready ;)

…torer (s3 based or not). may expose the kmsKeyId this way, though!
todo = s3Location work, remove extensive logging
…ined storer, no storing otherwise.

toS3Location test added
import za.co.absa.atum.model.Measurement

class BigDecimalToJsonSerializationSpec extends FlatSpec with Matchers {
class BigDecimalToJsonSerializationSpec extends AnyFlatSpec with Matchers {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change (org.scalatest.{FlatSpec, Matchers} -> org.scalatest.{flatspec.AnyFlatSpec, matchers.should.Matchers}) is due to ScalaTest version upgrade.

* Ability to view the storer if set.
* @return
*/
private[atum] def getStorer: Option[ControlMeasuresStorer] = if (isStorerLoaded) Some(storer) else None
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way, the accumulator insides are opened to the rest of the Atum to behave differently based on what kind of storer (HDFS vs S3) is used.

@dk1844 dk1844 requested a review from yruslan September 7, 2020 14:34
@dk1844 dk1844 merged commit 16d3806 into master Sep 21, 2020
@dk1844 dk1844 deleted the feature/37-aws-sdk-poc branch September 21, 2020 07:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PoC _INFO files generation in S3

3 participants