
Conversation

@liancheng
Contributor

What changes were proposed in this pull request?

This PR implements FileFormat.buildReader() for the LibSVM data source. Besides that, a new interface method prepareRead() is added to FileFormat:

  def prepareRead(
      sqlContext: SQLContext,
      options: Map[String, String],
      files: Seq[FileStatus]): Map[String, String] = options

After migrating from buildInternalScan() to buildReader(), we lost the opportunity to collect necessary global information, since buildReader() works in a per-partition manner. For example, LibSVM needs to infer the total number of features when the numFeatures data source option is not set. Any global information collected this way should be returned via the data source options map. By default, this method simply returns the original options untouched.
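
For illustration, the LibSVM override (sketched below from the snippets quoted later in this thread, with error handling simplified) scans the input once to fill in numFeatures when the option is absent:

  // Sketch of LibSVMFileFormat.prepareRead(), simplified from this PR:
  // when "numFeatures" is missing, parse the data once to infer it and
  // return an options map with the inferred value filled in.
  override def prepareRead(
      sqlContext: SQLContext,
      options: Map[String, String],
      files: Seq[FileStatus]): Map[String, String] = {
    def computeNumFeatures(): Int = {
      // Skip metadata files such as _SUCCESS; exactly one data path is expected.
      val dataFiles = files.filterNot(_.getPath.getName.startsWith("_"))
      require(dataFiles.length == 1, "Exactly one input path is expected for libsvm data")
      val path = dataFiles.head.getPath.toUri.toString
      val sc = sqlContext.sparkContext
      MLUtils.computeNumFeatures(MLUtils.parseLibSVMFile(sc, path, sc.defaultParallelism))
    }

    val numFeatures = options.getOrElse("numFeatures", computeNumFeatures().toString)
    options + ("numFeatures" -> numFeatures)
  }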

An alternative approach is to absorb inferSchema() into prepareRead(), since schema inference is also a form of global information gathering. However, this approach wasn't chosen, because schema inference is optional while prepareRead() must be called whenever a HadoopFsRelation-based data source relation is instantiated.

One unaddressed problem is that, when numFeatures is absent, the input data is now scanned twice. The buildInternalScan() code path doesn't need to do this because it caches the raw parsed RDD in memory before computing the total number of features. With FileScanRDD, however, the raw parsed RDD is created differently (e.g., with different partitioning) from the final RDD, so it can't simply be cached and reused.

How was this patch tested?

Tested using existing test suites.

@liancheng liancheng force-pushed the spark-14295-libsvm-build-reader branch from 64c3efb to 9b0be94 on March 31, 2016 at 14:43
@liancheng
Contributor Author

cc @yhuai @cloud-fan @marmbrus

@SparkQA

SparkQA commented Mar 31, 2016

Test build #54632 has finished for PR 12088 at commit 64c3efb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 31, 2016

Test build #54633 has finished for PR 12088 at commit 9b0be94.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

      requiredSchema: StructType,
      filters: Seq[Filter],
      options: Map[String, String]): (PartitionedFile) => Iterator[InternalRow] = {
    val numFeatures = options("numFeatures").toInt
Contributor

add an assert?

Contributor Author

Thanks, added.
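
For reference, the added check looks roughly like the following; the exact assertion message is an assumption:

  val numFeatures = options("numFeatures").toInt
  assert(numFeatures > 0, s"Unexpected number of features: $numFeatures")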

@cloud-fan
Contributor

Overall LGTM

  def computeNumFeatures(): Int = {
    val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
    val path = if (dataFiles.length == 1) dataFiles.head.getPath.toUri.toString
    else if (dataFiles.isEmpty) throw new IOException("No input path specified for libsvm data")
Contributor

indent

@mengxr
Contributor

mengxr commented Mar 31, 2016

@liancheng My main question is that this PR adds ~100 lines of code without introducing new features to the LibSVM source. The code in buildReader now mixes Tungsten internals with parsing code, so maintaining it requires people who understand both. I'm okay with the changes, but it would be great to think of a way to separate the internals from the data source implementation. Essentially, LIBSVM is a text-based source with a LIBSVM record parser, and it might require two passes over the data.

@SparkQA

SparkQA commented Mar 31, 2016

Test build #54639 has finished for PR 12088 at commit 6336898.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

@mengxr Regarding the extra lines added: we're planning to remove buildInternalScan() once we finish migrating all the HadoopFsRelation data sources, and we'll do the cleanup then.

Separating Tungsten internals from LibSVM format parsing is a good point.

And it's true that we may scan the data twice to compute the total number of features, since we can no longer cache the original RDD: it's constructed in a different way from the final FileScanRDD. I forgot to mention this in the PR description at first. I haven't figured out a good solution for this problem yet.

On the other hand, the original code always caches the RDD in memory. Does this imply we never intend to use the LibSVM data source to load large datasets that don't fit in memory? If so, we may want to special-case LibSVM, since the FileScanRDD code path may not bring much performance improvement to this data source.
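
For context, here is roughly how the old code path avoids the second scan; a simplified sketch of the relevant part of MLUtils.loadLibSVMFile(), where sc, path, minPartitions, and numFeatures are parameters of that method:

  // The parsed RDD is always persisted in memory, so inferring the
  // feature dimension doesn't re-read the input files, at the cost of
  // caching the whole dataset.
  val parsed = parseLibSVMFile(sc, path, minPartitions)
  val d = if (numFeatures > 0) {
    numFeatures
  } else {
    parsed.persist(StorageLevel.MEMORY_ONLY)
    computeNumFeatures(parsed)
  }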


    val sc = sqlContext.sparkContext
    val parsed = MLUtils.parseLibSVMFile(sc, path, sc.defaultParallelism)
    MLUtils.computeNumFeatures(parsed)
Contributor Author

Note that, unlike what we do in MLUtils, we don't cache the parsed RDD here, since it's constructed using a different partitioning strategy than the final FileScanRDD. This is a potential performance regression.

@SparkQA

SparkQA commented Mar 31, 2016

Test build #54648 has finished for PR 12088 at commit 931a29f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Mar 31, 2016

LGTM2

@mengxr
Contributor

mengxr commented Apr 1, 2016

Merged into master. Thanks!

@asfgit asfgit closed this in 1b07063 Apr 1, 2016
@liancheng liancheng deleted the spark-14295-libsvm-build-reader branch April 1, 2016 07:01
@liancheng
Contributor Author

Hm, it seems that this PR broke the master build. I'm looking into it.

asfgit pushed a commit that referenced this pull request Apr 1, 2016
## What changes were proposed in this pull request?

Fixes a compilation failure introduced in PR #12088 under Scala 2.10.

## How was this patch tested?

Compilation.

Author: Cheng Lian <lian@databricks.com>

Closes #12107 from liancheng/spark-14295-hotfix.
asfgit pushed a commit that referenced this pull request Jun 16, 2016
## What changes were proposed in this pull request?

Interface method `FileFormat.prepareRead()` was added in #12088 to handle a special case in the LibSVM data source.

However, the semantics of this interface method aren't intuitive: it returns a modified version of the data source options map. Considering that the LibSVM case can be handled easily using schema metadata inside `inferSchema` (sketched after this commit message), we can remove this interface method to keep the `FileFormat` interface clean.

## How was this patch tested?

Existing tests.

Author: Cheng Lian <lian@databricks.com>

Closes #13698 from liancheng/remove-prepare-read.

(cherry picked from commit 9ea0d5e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
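
For reference, the schema-metadata approach described above looks roughly like this; a simplified sketch of the #13698 change, where computeNumFeatures(files) stands in for the one-pass inference helper:

  // Sketch: record numFeatures as column metadata during inferSchema(),
  // so buildReader() can recover it from requiredSchema instead of from
  // a rewritten options map.
  override def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] = {
    val numFeatures: Int = options.get("numFeatures").map(_.toInt).getOrElse {
      computeNumFeatures(files)  // hypothetical helper, one extra pass over the data
    }
    val featuresMetadata = new MetadataBuilder()
      .putLong("numFeatures", numFeatures)
      .build()
    Some(StructType(
      StructField("label", DoubleType, nullable = false) ::
      StructField("features", new VectorUDT(), nullable = false, featuresMetadata) :: Nil))
  }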