Feature/687 hdfs service #697
Merged
Changes from all commits (37 commits). All commits are by kevinwallimann.
041d34e Implement KafkaService
8e175c9 Implement KafkaService
8166509 formatting
3e2d88c wip
55f02f2 wip
94c0731 rename to HyperdriveOffsetComparisonService, add tests
3a4e8d7 fix tests
845f6f5 fix formatting
e2e7d4d make methods private, remove unnecessary stuff
7661d2a Don't import hyperdrive-ingestor
12215d2 Merge branch 'develop' into feature/687-hdfs-service
69b0403 Exclude conflicting dependency
9ce0843 Fix format
d9ff9e2 PR fixes
bc59081 Use futures, todo: return false if kafka topic doesn't exist
a7cf78f return false if kafka topic doesn't exist. Add tests
7955475 scalafmt
eb57234 scalafmt
da0177e Merge branch 'develop' into feature/687-hdfs-service
15299f0 Remove temp file
cddc675 login explicitly
541f8d1 Change parameter type to JobInstanceParameters
f2e7f45 PR fix: Limit number of kafka consumers
97d8b89 PR fix: Refactor HdfsService -> CheckpointService
6770f85 PR fix: Make methods private
013e825 Add comments / logging
70a8013 Fix formatting
c7f8309 Make kafka consumer cache per thread
cd4c7f4 Add a default deserializer to read from kafka
99769c6 fix formatting
291e832 fix formatting 2
c62bdaa Add HdfsService, additional logging for kafka consumer
b9c89e5 Merge branch 'develop' into feature/687-hdfs-service
b04ac94 PR fix: Move parse method to HdfsService, add tests
61ab80b Undo change for testing
08fea77 Add comment
8fb2aad Fix formatting
src/main/scala/za/co/absa/hyperdrive/trigger/api/rest/services/CheckpointService.scala (152 additions, 0 deletions)
```scala
/*
 * Copyright 2018 ABSA Group Limited
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package za.co.absa.hyperdrive.trigger.api.rest.services

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.hadoop.fs.{Path, PathFilter}
import org.apache.hadoop.security.UserGroupInformation
import org.slf4j.LoggerFactory
import org.springframework.stereotype.Service
import za.co.absa.hyperdrive.trigger.api.rest.utils.ScalaUtil.swap

import javax.inject.Inject
import scala.util.Try

trait CheckpointService {
  type TopicPartitionOffsets = Map[String, Map[Int, Long]]

  def getOffsetsFromFile(path: String)(implicit ugi: UserGroupInformation): Try[Option[TopicPartitionOffsets]]
  def getLatestOffsetFilePath(params: HdfsParameters)(implicit
    ugi: UserGroupInformation
  ): Try[Option[(String, Boolean)]]
}

class HdfsParameters(
  val keytab: String,
  val principal: String,
  val checkpointLocation: String
)

@Service
class CheckpointServiceImpl @Inject() (hdfsService: HdfsService) extends CheckpointService {
  private val logger = LoggerFactory.getLogger(this.getClass)
  private val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
  private val offsetsDirName = "offsets"
  private val commitsDirName = "commits"

  /**
   * See org.apache.spark.sql.execution.streaming.HDFSMetadataLog
   */
  private val batchFilesFilter = new PathFilter {
    override def accept(path: Path): Boolean = {
      try {
        path.getName.toLong
        true
      } catch {
        case _: NumberFormatException =>
          false
      }
    }
  }

  override def getOffsetsFromFile(
    path: String
  )(implicit ugi: UserGroupInformation): Try[Option[TopicPartitionOffsets]] = {
    hdfsService.parseFileAndClose(path, parseKafkaOffsetStream)
  }

  /**
   * @return an Option of a String, Boolean pair. The string contains the path to the latest offset file, while the
   *         boolean is true if the offset is committed (i.e. a corresponding commit file exists), and false otherwise.
   *         None is returned if the offset file does not exist. If the offset file does not exist, the corresponding
   *         commit file is assumed to also not exist.
   */
  override def getLatestOffsetFilePath(
    params: HdfsParameters
  )(implicit ugi: UserGroupInformation): Try[Option[(String, Boolean)]] = {
    getLatestOffsetBatchId(params.checkpointLocation).flatMap { offsetBatchIdOpt =>
      val offsetFilePath = offsetBatchIdOpt.map { offsetBatchId =>
        getLatestCommitBatchId(params.checkpointLocation).map { commitBatchIdOpt =>
          val committed = commitBatchIdOpt match {
            case Some(commitBatchId) => offsetBatchId == commitBatchId
            case None                => false
          }
          val path = new Path(s"${params.checkpointLocation}/${offsetsDirName}/${offsetBatchId}")
          (path.toString, committed)
        }
      }
      if (offsetFilePath.isEmpty) {
        logger.debug(s"No offset files exist under checkpoint location ${params.checkpointLocation}")
      }
      swap(offsetFilePath)
    }
  }

  /**
   * see org.apache.spark.sql.execution.streaming.OffsetSeqLog
   * and org.apache.spark.sql.kafka010.JsonUtils
   * for details on the assumed format
   */
  private def parseKafkaOffsetStream(lines: Iterator[String]): TopicPartitionOffsets = {
    val SERIALIZED_VOID_OFFSET = "-"
    def parseOffset(value: String): Option[TopicPartitionOffsets] = value match {
      case SERIALIZED_VOID_OFFSET => None
      case json                   => Some(mapper.readValue(json, classOf[TopicPartitionOffsets]))
    }
    if (!lines.hasNext) {
      throw new IllegalStateException("Incomplete log file")
    }

    lines.next() // skip version
    lines.next() // skip metadata
    lines
      .map(parseOffset)
      .filter(_.isDefined)
      .map(_.get)
      .toSeq
      .head
  }

  private def getLatestCommitBatchId(checkpointDir: String)(implicit ugi: UserGroupInformation): Try[Option[Long]] = {
    val commitsDir = new Path(s"$checkpointDir/$commitsDirName")
    getLatestBatchId(commitsDir)
  }

  private def getLatestOffsetBatchId(checkpointDir: String)(implicit ugi: UserGroupInformation): Try[Option[Long]] = {
    val offsetsDir = new Path(s"$checkpointDir/$offsetsDirName")
    getLatestBatchId(offsetsDir)
  }

  private def getLatestBatchId(path: Path)(implicit ugi: UserGroupInformation): Try[Option[Long]] = {
    hdfsService.exists(path).flatMap { exists =>
      if (exists) {
        hdfsService.listStatus(path, batchFilesFilter).map { statuses =>
          statuses
            .map { status =>
              status.getPath.getName.toLong
            }
            .sorted
            .lastOption
        }
      } else {
        logger.debug(s"Could not find path $path")
        Try(None)
      }
    }
  }
}
```
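
For context on the file format `parseKafkaOffsetStream` expects: Spark's `OffsetSeqLog` writes a version line, then a streaming-metadata line, then one serialized offset per source, and the Kafka source serializes its offsets as a JSON map of topic to partition to offset. Below is a minimal, self-contained sketch of that parsing step; the topic name, offsets, and metadata values are invented for illustration and are not taken from this PR.

```scala
import com.fasterxml.jackson.databind.ObjectMapper

// Minimal sketch: parse an in-memory copy of a Spark offset log, e.g. a file
// like <checkpointLocation>/offsets/42. All values below are invented.
object OffsetLogFormatExample extends App {
  val offsetFileLines = Iterator(
    "v1",                                                          // version line
    """{"batchWatermarkMs":0,"batchTimestampMs":1622541600000}""", // streaming metadata
    """{"my.topic":{"0":1000,"1":2000}}"""                         // offsets per topic and partition
  )

  offsetFileLines.next() // skip version
  offsetFileLines.next() // skip metadata

  val mapper = new ObjectMapper()
  val offsets = mapper.readTree(offsetFileLines.next())

  // Prints 1000, the recorded offset for partition 0 of "my.topic"
  println(offsets.get("my.topic").get("0").asLong())
}
```

The production code above additionally deserializes the JSON into the `TopicPartitionOffsets` type via jackson-module-scala and treats the serialized void offset `"-"` as absent.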
src/main/scala/za/co/absa/hyperdrive/trigger/api/rest/services/HdfsService.scala (116 additions, 0 deletions)
```scala
/*
 * Copyright 2018 ABSA Group Limited
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package za.co.absa.hyperdrive.trigger.api.rest.services

import org.apache.commons.io.IOUtils
import org.apache.hadoop.fs._
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.deploy.SparkHadoopUtil
import org.slf4j.LoggerFactory
import org.springframework.stereotype.Service

import java.nio.charset.StandardCharsets.UTF_8
import java.security.PrivilegedExceptionAction
import scala.io.Source
import scala.util.Try

trait HdfsService {
  def exists(path: Path)(implicit ugi: UserGroupInformation): Try[Boolean]
  def open(path: Path)(implicit ugi: UserGroupInformation): Try[FSDataInputStream]
  def listStatus(path: Path, filter: PathFilter)(implicit ugi: UserGroupInformation): Try[Array[FileStatus]]
  def parseFileAndClose[R](pathStr: String, parseFn: Iterator[String] => R)(implicit
    ugi: UserGroupInformation
  ): Try[Option[R]]
}

@Service
class HdfsServiceImpl extends HdfsService {
  private val logger = LoggerFactory.getLogger(this.getClass)
  private lazy val conf = SparkHadoopUtil.get.conf

  override def exists(path: Path)(implicit ugi: UserGroupInformation): Try[Boolean] = {
    Try {
      doAs {
        fs.exists(path)
      }
    }
  }

  override def open(path: Path)(implicit ugi: UserGroupInformation): Try[FSDataInputStream] = {
    Try {
      doAs {
        fs.open(path)
      }
    }
  }

  override def listStatus(path: Path, filter: PathFilter)(implicit
    ugi: UserGroupInformation
  ): Try[Array[FileStatus]] = {
    Try {
      doAs {
        fs.listStatus(path, filter)
      }
    }
  }

  /**
   * @param pathStr path to the file as a string
   * @param parseFn function that parses the file line by line. Caution: It must materialize the content,
   *                because the file is closed after the method completes. E.g. it must not return an iterator.
   * @tparam R type of the parsed value
   * @return None if the file doesn't exist, Some with the parsed content
   */
  override def parseFileAndClose[R](pathStr: String, parseFn: Iterator[String] => R)(implicit
    ugi: UserGroupInformation
  ): Try[Option[R]] = {
    for {
      path <- Try(new Path(pathStr))
      exists <- exists(path)
      parseResult <-
        if (exists) {
          open(path).map { input =>
            try {
              val lines = Source.fromInputStream(input, UTF_8.name()).getLines()
              Some(parseFn(lines))
            } catch {
              case e: Exception =>
                // re-throw the exception with the log file path added
                throw new Exception(s"Failed to parse file $path", e)
            } finally {
              IOUtils.closeQuietly(input)
            }
          }
        } else {
          logger.debug(s"Could not find file $path")
          Try(None)
        }
    } yield parseResult
  }

  /**
   * Must not be a lazy val, because different users should get different FileSystems. FileSystem is cached internally.
   */
  private def fs = FileSystem.get(conf)

  private def doAs[T](fn: => T)(implicit ugi: UserGroupInformation) = {
    ugi.doAs(new PrivilegedExceptionAction[T] {
      override def run(): T = {
        fn
      }
    })
  }
}
```
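
To show how the two services are intended to compose, here is a hypothetical usage sketch, assuming it lives in the same package as the services above. The keytab path, principal, and checkpoint location are invented placeholders, and in the application the implementations are Spring beans rather than hand-constructed; the explicit keytab login supplies the implicit UserGroupInformation that every method requires.

```scala
import org.apache.hadoop.security.UserGroupInformation

import scala.util.{Failure, Success}

// Hypothetical wiring: in the application these beans are injected by Spring.
object CheckpointLookupExample extends App {
  // Placeholder credentials and paths, purely illustrative.
  val params = new HdfsParameters(
    keytab = "/etc/security/keytabs/hyperdrive.keytab",
    principal = "hyperdrive@EXAMPLE.COM",
    checkpointLocation = "hdfs:///checkpoints/my-workflow"
  )

  // Log in explicitly; every HdfsService call then runs inside ugi.doAs.
  implicit val ugi: UserGroupInformation =
    UserGroupInformation.loginUserFromKeytabAndReturnUGI(params.principal, params.keytab)

  val checkpointService = new CheckpointServiceImpl(new HdfsServiceImpl)

  checkpointService.getLatestOffsetFilePath(params) match {
    case Success(Some((path, committed))) =>
      println(s"Latest offset file: $path, committed: $committed")
      // Parse the topic -> partition -> offset map out of that file.
      checkpointService.getOffsetsFromFile(path).foreach(println)
    case Success(None) => println("No offset files under the checkpoint location")
    case Failure(e)    => e.printStackTrace()
  }
}
```

Returning Try from every HdfsService method keeps Hadoop and Kerberos failures explicit at the call site, which is why the sketch pattern-matches on Success and Failure instead of catching exceptions.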