[Bug] Spark Loader example schema and struct mismatch #501

@liuxiaocs7

Bug Type

exception / error

The current Spark loader example fails at runtime because its schema and struct files do not match.

Before submitting

  • I have searched the issues and found no similar ones.

Environment

Expected & Actual behavior

java.lang.IllegalStateException: The id field must be empty or null when id strategy is 'PRIMARY_KEY' for vertex label 'software'
        at shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:544)
        at org.apache.hugegraph.util.E.checkState(E.java:64)
        at org.apache.hugegraph.loader.builder.VertexBuilder.checkIdField(VertexBuilder.java:98)
        at org.apache.hugegraph.loader.builder.VertexBuilder.<init>(VertexBuilder.java:46)
        at org.apache.hugegraph.loader.spark.HugeGraphSparkLoader.initPartition(HugeGraphSparkLoader.java:201)
        at org.apache.hugegraph.loader.spark.HugeGraphSparkLoader.lambda$null$18e75a97$1(HugeGraphSparkLoader.java:155)
        at org.apache.spark.sql.Dataset.$anonfun$foreachPartition$2(Dataset.scala:2923)
        at org.apache.spark.sql.Dataset.$anonfun$foreachPartition$2$adapted(Dataset.scala:2923)
        at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1020)
        at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1020)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
23/08/03 23:36:07 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.IllegalStateException: The id field must be empty or null when id strategy is 'PRIMARY_KEY' for vertex label 'person'
        at shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:544)
        at org.apache.hugegraph.util.E.checkState(E.java:64)
        at org.apache.hugegraph.loader.builder.VertexBuilder.checkIdField(VertexBuilder.java:98)
        at org.apache.hugegraph.loader.builder.VertexBuilder.<init>(VertexBuilder.java:46)
        at org.apache.hugegraph.loader.spark.HugeGraphSparkLoader.initPartition(HugeGraphSparkLoader.java:201)
        at org.apache.hugegraph.loader.spark.HugeGraphSparkLoader.lambda$null$18e75a97$1(HugeGraphSparkLoader.java:155)
        at org.apache.spark.sql.Dataset.$anonfun$foreachPartition$2(Dataset.scala:2923)
        at org.apache.spark.sql.Dataset.$anonfun$foreachPartition$2$adapted(Dataset.scala:2923)
        at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1020)
        at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1020)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Vertex/Edge example

No response

Schema [VertexLabel, EdgeLabel, IndexLabel]

From this file: https://github.com/apache/incubator-hugegraph-toolchain/blob/master/hugegraph-loader/assembly/static/example/spark/schema.groovy
Executed via the client.

  // Define schema
  schema.propertyKey("name").asText().ifNotExist().create();
  schema.propertyKey("age").asInt().ifNotExist().create();
  schema.propertyKey("city").asText().ifNotExist().create();
  schema.propertyKey("weight").asDouble().ifNotExist().create();
  schema.propertyKey("lang").asText().ifNotExist().create();
  schema.propertyKey("date").asText().ifNotExist().create();
  schema.propertyKey("price").asDouble().ifNotExist().create();

  schema.vertexLabel("person")
          .properties("name", "age", "city")
          .primaryKeys("name")
          .nullableKeys("age", "city")
          .ifNotExist()
          .create();

  schema.vertexLabel("software")
          .properties("name", "lang", "price")
          .primaryKeys("name")
          .ifNotExist()
          .create();

  schema.edgeLabel("knows")
          .sourceLabel("person")
          .targetLabel("person")
          .properties("date", "weight")
          .ifNotExist()
          .create();

  schema.edgeLabel("created")
          .sourceLabel("person")
          .targetLabel("software")
          .properties("date", "weight")
          .ifNotExist()
          .create();

InputSource from this file: https://github.com/apache/incubator-hugegraph-toolchain/blob/master/hugegraph-loader/assembly/static/example/spark/struct.json

The backendStoreInfo section was removed so that the loader uses the RocksDB backend running in Docker.

{
  "vertices": [
    {
      "label": "person",
      "input": {
        "type": "file",
        "path": "example/spark/vertex_person.json",
        "format": "JSON",
        "header": ["name", "age", "city"],
        "charset": "UTF-8",
        "skipped_line": {
          "regex": "(^#|^//).*"
        }
      },
      "id": "name",
      "null_values": ["NULL", "null", ""]
    },
    {
      "label": "software",
      "input": {
        "type": "file",
        "path": "example/spark/vertex_software.json",
        "format": "JSON",
        "header": ["id","name", "lang", "price","ISBN"],
        "charset": "GBK"
      },
      "id": "name",
      "ignored": ["ISBN"]
    }
  ],
  "edges": [
    {
      "label": "knows",
      "source": ["source_name"],
      "target": ["target_name"],
      "input": {
        "type": "file",
        "path": "example/spark/edge_knows.json",
        "format": "JSON",
        "date_format": "yyyyMMdd",
        "header": ["source_name","target_name", "date", "weight"]
      },
      "field_mapping": {
        "source_name": "name",
        "target_name": "name"
      }
    }
  ]
}
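
Both exceptions above point at the same mismatch: schema.groovy declares `person` and `software` with `primaryKeys("name")`, while struct.json also sets `"id": "name"` on those vertex mappings, and `VertexBuilder.checkIdField` rejects an id field when the strategy is `PRIMARY_KEY`. A minimal sketch of one possible fix, assuming the schema is kept as-is and the `"id"` entry is simply dropped from the mapping (shown for the `person` entry only; this is an untested assumption, not the project's confirmed fix):

```json
{
  "label": "person",
  "input": {
    "type": "file",
    "path": "example/spark/vertex_person.json",
    "format": "JSON",
    "header": ["name", "age", "city"],
    "charset": "UTF-8",
    "skipped_line": {
      "regex": "(^#|^//).*"
    }
  },
  "null_values": ["NULL", "null", ""]
}
```

The `software` entry would presumably need the same change. The other direction would be to keep `"id"` in struct.json and switch the vertex labels in schema.groovy to a customized ID strategy (for example `useCustomizeStringId()`); which of the two the example should adopt is a maintainer decision.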

Metadata

    Labels: bug (Something isn't working)

    Status: ✅ Done
