[HUDI-3478] Implement CDC Read in Spark #6727
Conversation
@xushiyan @prasannarajaperumal @alexeykudinkin please continue to review this.
Force-pushed from c4f22cd to e674bd8.
Cancelling all Azure CI runs for now to investigate CI flakiness. Will retrigger the build once we are in a stable state. Sorry about the inconvenience.
 * limitations under the License.
 */

package org.apache.spark.sql.avro
Please avoid any changes to the borrowed classes -- we keep changes to them to the absolutely necessary minimum to make sure they do not diverge from the Spark impl, and we're able to cherry-pick and carry these changes forward whenever we backport a new version (from Spark).
It has to. The original logic (using val converter: Any => Any = {) has a bug: it returns the same value when the method is called twice in a row. And HoodieCDCRDD needs these changes.
So this looks like an improvement we can land and fix separately? Better to track it separately, as the CDC impl does not need to know about this fix, right? The APIs remain the same.
Another note: it's not very obvious to reviewers until you explain it as above, so for the sake of faster review, please comment on it yourself and explain proactively.
I'm not sure I understand what the issue is. Can you please create a separate PR with it? If there's a bug, let's make sure we're adding the necessary tests for it.
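For context, here is a minimal, self-contained Scala sketch of the failure mode being described (illustrative only, not the actual Spark/Hudi deserializer code): a cached val converter that closes over a single mutable buffer hands back the same instance on every call, so a second conversion silently overwrites the result of the first unless the caller copies it eagerly.

import scala.collection.mutable.ArrayBuffer

object ConverterReuseSketch {
  // One converter value closing over one shared, mutable buffer.
  private val buffer = ArrayBuffer[Any]()

  val converter: Any => Any = { in =>
    buffer.clear()
    buffer += in
    buffer // every call returns the same ArrayBuffer instance
  }

  def main(args: Array[String]): Unit = {
    val first = converter("row-1")
    val second = converter("row-2")
    println(first)  // ArrayBuffer(row-2) -- overwritten by the second call
    println(second) // ArrayBuffer(row-2)
  }
}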
import scala.collection.JavaConverters._
import scala.util.Try

object LogIteratorUtils {
Let's consolidate this w/ LogFileIterator (let's name this LogFileIterator, there's no need for separate utils object)
done.
// TODO extract to HoodieAvroSchemaUtils
abstract class AvroProjection extends (GenericRecord => GenericRecord)

class SafeAvroProjection(
nit: Please make sure we format params in line w/ the existing style formatting (I believe it's also captured in the style guide):
def foo(a: Int,
        b: String, ...)
done.
)
}

private lazy val mapper: ObjectMapper = {
Do we still need this, given we moved to Avro?
yes, we need this.
    metaClient: HoodieTableMetaClient
) extends Iterator[InternalRow] with SparkAdapterSupport with AvroDeserializerSupport with Closeable {

private val fs = metaClient.getFs.getFileSystem
These vals should be lazy by default
done.
/**
 * * the change type, which decide to how to retrieve the change data. more details see: [[CDCFileTypeEnum]]
 * */
private HoodieCDCLogicalFileType cdcFileType;
Please make all fields final
done.
 * Here define four cdc file types. The different cdc file type will decide which file will be
 * used to extract the change data, and how to do this.
 *
 * CDC_LOG_FILE:
Let's make sure these are in-sync w/ the RFC
@xushiyan did we end up revisiting this terminology in the RFC?
We'll keep this PR and the RFC consistent.
import java.io.IOException;

public class HoodieCDCLogRecordReader implements ClosableIterator<IndexedRecord> {
This is an Iterator rather than Reader
done.
@Override
public boolean hasNext() {
  if (itr == null || !itr.hasNext()) {
nit: If we flip this conditional we can decrease the nesting.
 * Provided w/ instance of [[HoodieMergeOnReadFileSplit]], iterates over all of the records stored in
 * Delta Log files (represented as [[InternalRow]]s)
 */
class LogFileIterator(split: HoodieMergeOnReadFileSplit,
@YannByron are you making any changes to these or just extracting this code practically as is (with minor changes to abstract params)?
Just extracting this code so that CDC can reuse it (minor changes: extended some params to make them common and reusable).
Force-pushed from d11dbf1 to e9bbf49.
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hudi.exception.HoodieException;
import org.apache.log4j.LogManager;
Please fix the import ordering in all the files. The org.apache.hudi package should come first.
done
 * at a single commit.
 * <p>
 * For [[cdcFileType]] = [[CDCFileTypeEnum.ADD_BASE_FILE]], [[cdcFile]] is a current version of
 * the base file in the group, and [[beforeFileSlice]] is None.
[[]] is Scala-style doc; you can use {@code xxx} instead in Java.
ok
 * Then build a [[ChangeFileForSingleFileGroupAndCommit]] object.
 */
private HoodieCDCFileSplit parseWriteStat(
    HoodieFileGroupId fileGroupId,
For ChangeFileForSingleFileGroupAndCommit, do you mean HoodieCDCFileSplit?
// no cdc log files can be used directly. we reuse the existing data file to retrieve the change data.
String path = writeStat.getPath();
if (path.endsWith(HoodieFileFormat.PARQUET.getFileExtension())) {
  // this is a base file
Better to use FSUtils.isBaseFile because we support the ORC format as well.
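A quick sketch of the suggested check (this assumes FSUtils.isBaseFile accepts the file Path, as the comment implies; the helper name below is just for illustration):

import org.apache.hadoop.fs.Path
import org.apache.hudi.common.fs.FSUtils

object BaseFileCheckSketch {
  // Format-agnostic replacement for the hard-coded Parquet extension check,
  // so ORC (and any other registered base-file format) is covered as well.
  def isBaseFilePath(path: String): Boolean = FSUtils.isBaseFile(new Path(path))
}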
HoodieCDCFileSplit changeFile =
    parseWriteStat(fileGroupId, instant, writeStat, commitMetadata.getOperationType());
if (!fgToCommitChanges.containsKey(fileGroupId)) {
  fgToCommitChanges.put(fileGroupId, new ArrayList<>());
Using computeIfAbsent should be fine.
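For illustration (the actual file is Java; this is a small Scala sketch against a java.util.Map, with String stand-ins for the real file-group-id and CDC-file-split types), computeIfAbsent collapses the containsKey/put pair above into a single call:

import java.util.{ArrayList, HashMap, List => JList}

object ComputeIfAbsentSketch {
  private val fgToCommitChanges = new HashMap[String, JList[String]]()

  def record(fileGroupId: String, cdcFileSplit: String): Unit = {
    // Replaces: if (!map.containsKey(k)) map.put(k, new ArrayList<>()); followed by get(k).add(v)
    fgToCommitChanges
      .computeIfAbsent(fileGroupId, _ => new ArrayList[String]())
      .add(cdcFileSplit)
  }
}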
Schema mergedSchema = Schema.createRecord("CDC", null, tableSchema.getNamespace(), false);
Schema mergedSchema = Schema.createRecord("CDC", null, "", false);
mergedSchema.setFields(fields);
Why this change?
will restore it.
if (reader.hasNext()) {
  HoodieDataBlock dataBlock = (HoodieDataBlock) reader.next();
  if (dataBlock.getBlockType() == HoodieLogBlock.HoodieLogBlockType.CDC_DATA_BLOCK) {
    itr = dataBlock.getRecordIterator();
Are there other data blocks here? If not, we can remove this check for efficiency.
yes, it can be simplified.
 * Here we use the debezium format.
 */
val FULL_CDC_SPARK_SCHEMA: StructType = {
  StructType(
Do we need to expose the schema as Debezium for the reader internally? Why not reuse the field _hoodie_operation, which is a Hudi format?
Curious how downstream pipelines handle these records. For SQL users, they declare the table schema, for example with fields (a, b, c, d); now you return an RDD with an Avro schema, so how and when are the records deserialized into (a, b, c, d)?
The returned RDD uses JSON strings, not the Avro format (that is only used inside of Hudi).
op, ts_ms, before and after are the fields defined by Debezium.
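For illustration, a Debezium-style CDC schema along these lines could be sketched in Spark as below (the field names op/ts_ms/before/after come from the reply above; the exact types and nullability of the PR's FULL_CDC_SPARK_SCHEMA may differ, with before/after shown here as JSON strings per the reply):

import org.apache.spark.sql.types._

object CDCSchemaSketch {
  val cdcSparkSchema: StructType = StructType(Seq(
    StructField("op", StringType, nullable = false),    // change type: i / u / d
    StructField("ts_ms", StringType, nullable = false), // commit time of the change
    StructField("before", StringType, nullable = true), // pre-image as a JSON string
    StructField("after", StringType, nullable = true)   // post-image as a JSON string
  ))
}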
List<FileStatus> touchedFiles = new ArrayList<>();
for (String touchedPartition : touchedPartitions) {
  Path partitionPath = FSUtils.getPartitionPath(basePath, touchedPartition);
  touchedFiles.addAll(Arrays.asList(fs.listStatus(partitionPath)));
This can be improved to only add the files that belong to the touched file groups.
danny0405 left a comment
+1, nice work, let's address the remaining comments in subsequent PRs.
// TODO extract to HoodieAvroSchemaUtils
abstract class AvroProjection extends (GenericRecord => GenericRecord)

class SafeAvroProjection(sourceSchema: Schema,
If you moved code around, please annotate it by commenting on the PR yourself and explain where it was moved from and what was modified. Otherwise it'll be hard for reviewers to make a call on whether to approve.
Got it. Moved from HoodieMergeOnReadRDD without any change.
Force-pushed from e1d21aa to e298986.
xushiyan left a comment
+1 let's make sure the in-code TODO can be followed up
} else if (isCdcQuery) {
  CDCRelation.getCDCRelation(sqlContext, metaClient, parameters)
} else {
  (tableType, queryType, isBootstrappedTable) match {
This if-check could be merged with the match below to keep the code aligned.
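A rough sketch of the suggestion (identifiers mirror the snippet above; resolveNonCdcRelation is a hypothetical stand-in for the existing match arms):

// Fold the CDC branch into the existing match via a guard, so relation
// selection becomes a single expression.
(tableType, queryType, isBootstrappedTable) match {
  case _ if isCdcQuery =>
    CDCRelation.getCDCRelation(sqlContext, metaClient, parameters)
  case other =>
    // the existing (tableType, queryType, isBootstrappedTable) arms stay as they are
    resolveNonCdcRelation(other) // hypothetical stand-in
}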
private val isCDCQuery = CDCRelation.isCDCEnabled(metaClient) &&
  parameters.get(DataSourceReadOptions.QUERY_TYPE.key).contains(DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL) &&
  parameters.get(DataSourceReadOptions.INCREMENTAL_FORMAT.key).contains(DataSourceReadOptions.INCREMENTAL_FORMAT_CDC_VAL)
Repeated check logic; it could have been extracted to a util.
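A sketch of the kind of helper being suggested (the object and method names are hypothetical; the predicate body mirrors the snippet above so both call sites could share one definition):

import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.cdc.CDCRelation
import org.apache.hudi.common.table.HoodieTableMetaClient

object CDCQueryUtils {
  // Single definition of "is this a CDC query" for DefaultSource and the relations.
  def isCDCQuery(metaClient: HoodieTableMetaClient, parameters: Map[String, String]): Boolean =
    CDCRelation.isCDCEnabled(metaClient) &&
      parameters.get(DataSourceReadOptions.QUERY_TYPE.key)
        .contains(DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL) &&
      parameters.get(DataSourceReadOptions.INCREMENTAL_FORMAT.key)
        .contains(DataSourceReadOptions.INCREMENTAL_FORMAT_CDC_VAL)
}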
 * file is new-coming, so we can load this, mark all the records with `i`, and treat them as
 * the value of `after`. The value of `before` for each record is null.
 *
 * BASE_FILE_INSERT:
Typo
 * a whole file group. First we find this file group. Then load this, mark all the records with
 * `d`, and treat them as the value of `before`. The value of `after` for each record is null.
 */
public enum HoodieCDCInferCase {
"Infer" is a verb, we should rather call this HoodieCDCInferenceCase
originTableSchema: HoodieTableSchema,
cdcSchema: StructType,
requiredCdcSchema: StructType,
changes: Array[HoodieCDCFileGroupSplit])
@YannByron let's make sure we annotate this as @transient (these shouldn't be serialized and passed down to executor, similar to other RDDs)
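A minimal sketch of the suggestion (the parameter list is abbreviated and the RDD body is stubbed; HoodieCDCFileGroupSplit and HoodieTableMetaClient are the types from this PR, with their imports omitted):

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

// Driver-side handles are marked @transient so they are not serialized with
// the RDD closure; executors only rely on what compute() rebuilds from the
// broadcast configuration.
class HoodieCDCRDD(
    @transient spark: SparkSession,
    @transient metaClient: HoodieTableMetaClient,
    cdcSchema: StructType,
    requiredCdcSchema: StructType,
    @transient changes: Array[HoodieCDCFileGroupSplit])
  extends RDD[InternalRow](spark.sparkContext, Nil) {

  override def compute(split: Partition, context: TaskContext): Iterator[InternalRow] = ???
  override protected def getPartitions: Array[Partition] = ???
}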
changes: Array[HoodieCDCFileGroupSplit])
extends RDD[InternalRow](spark.sparkContext, Nil) with HoodieUnsafeRDD {

@transient private val hadoopConf = spark.sparkContext.hadoopConfiguration
Let's inline this to avoid mistakes
private lazy val fs = metaClient.getFs.getFileSystem

private lazy val conf = new Configuration(confBroadcast.value.value)
Let's avoid copying (and do it only if we modify it) -- it's not cheap
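A sketch of what that could look like as members of the RDD class (confBroadcast as in the surrounding code; the helper name is hypothetical):

import org.apache.hadoop.conf.Configuration

// Reuse the broadcast Configuration directly and defer the defensive copy to
// the point where a mutation is actually needed.
private lazy val conf: Configuration = confBroadcast.value.value

private def confForMutation(): Configuration = {
  val copy = new Configuration(conf) // copy only when we must modify it
  // copy.set(...) as required
  copy
}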
/**
 * Two cases will use this to iterator the records:
 * 1) extract the change data from the base file directly, including 'ADD_BASE_File' and 'REMOVE_BASE_File'.
@YannByron need to update this to align w/ HoodieCDCInferenceCase
private var currentInstant: HoodieInstant = _

// The change file that is currently being processed
private var currentChangeFile: HoodieCDCFileSplit = _
nit: currentCDCFileSplit
FileSlice beforeFileSlice = new FileSlice(fileGroupId, writeStat.getPrevCommit(), beforeBaseFile, new ArrayList<>());
cdcFileSplit = new HoodieCDCFileSplit(BASE_FILE_DELETE, null, Option.empty(), Option.of(beforeFileSlice));
} else if (writeStat.getNumUpdateWrites() == 0L && writeStat.getNumDeletes() == 0
    && writeStat.getNumWrites() == writeStat.getNumInserts()) {
@YannByron there's an issue right now where we undercount inserts (AFAIR), so numWrites != numUpdates + numInserts. We need to be careful with these conditionals (and address the underlying issue as well).
 * Hoodie CDC Relation extends Spark's [[BaseRelation]], provide the schema of cdc
 * and the [[buildScan]] to return the change-data in a specified range.
 */
class CDCRelation(
Let's make sure this is rebased onto HoodieBaseRelation
  (cdcLogRecordIterator == null || !cdcLogRecordIterator.hasNext)
}

@tailrec final def hasNextInternal: Boolean = {
I think we should split CDCFileGroupIterator into N iterators for every HoodieCDCInferenceCase to make it more manageable and easier to understand
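A rough sketch of that direction (the iterator class names and the accessor on the split are hypothetical; the case names echo the ones discussed in this PR):

import org.apache.spark.sql.catalyst.InternalRow

object CDCCaseIterators {
  trait CDCCaseIterator extends Iterator[InternalRow]

  // One focused iterator per inference case instead of a single
  // CDCFileGroupIterator that branches on the case internally.
  def iteratorFor(split: HoodieCDCFileSplit): CDCCaseIterator =
    split.getCdcInferCase match {
      case HoodieCDCInferenceCase.CDC_LOG_FILE     => new CdcLogFileCaseIterator(split)
      case HoodieCDCInferenceCase.BASE_FILE_INSERT => new BaseFileInsertCaseIterator(split)
      case HoodieCDCInferenceCase.BASE_FILE_DELETE => new BaseFileDeleteCaseIterator(split)
      case other                                   => new MergingCaseIterator(split, other)
    }
}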
// - Projected schema
// As such, no particular schema could be assumed, and therefore we rely on the caller
// to correspondingly set the scheme of the expected output of base-file reader
private val baseFileReaderAvroSchema = sparkAdapter.getAvroSchemaConverters.toAvroType(baseFileReader.schema, nullable = false, "record")
Hi @YannByron
Recently, we found an Avro schema issue which is caused by the wrong record name (detail here: #7284).
May I ask if this line could cause the same problem? If so, we can discuss how to fix it in PR: #7297
The code is just moved from another class, so I am not sure whether it works correctly in all cases.
But I have solved a very similar problem caused by the Avro namespace: 60b62fc.
Got it, thanks @YannByron
Reviewed the code recently; it turns out this Avro schema baseFileReaderAvroSchema is only used for resolveNullableType in AvroSerializer. It won't be involved in any serialization/deserialization process, so it's okay to use "record" as the Avro schema name.
Change Logs
This PR is going to support CDC Read in Spark.
The changes are listed:
- classes in hudi-common that will be used to implement CDCReader for different engines;
- CDCReader to respond to the CDC query on Spark.

Impact

Low