
Conversation

@liancheng
Contributor

What changes were proposed in this pull request?

This PR implements FileFormat.buildReader() for the LibSVM data source. Besides that, a new interface method prepareRead() is added to FileFormat:

  def prepareRead(
      sqlContext: SQLContext,
      options: Map[String, String],
      files: Seq[FileStatus]): Map[String, String] = options

After migrating from buildInternalScan() to buildReader(), we lost the opportunity to collect necessary global information, since buildReader() works in a per-partition manner. For example, LibSVM needs to infer the total number of features when the numFeatures data source option is not set. Any global information collected this way should be returned via the data source options map. By default, this method simply returns the original options untouched.
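
For illustration, the LibSVM override (sketched below from the snippets quoted later in this thread, with error handling simplified) scans the input once to fill in numFeatures when the option is absent:

  // Sketch of LibSVMFileFormat.prepareRead(), simplified from this PR:
  // when "numFeatures" is missing, parse the data once to infer it and
  // return an options map with the inferred value filled in.
  override def prepareRead(
      sqlContext: SQLContext,
      options: Map[String, String],
      files: Seq[FileStatus]): Map[String, String] = {
    def computeNumFeatures(): Int = {
      // Skip metadata files such as _SUCCESS; exactly one data path is expected.
      val dataFiles = files.filterNot(_.getPath.getName.startsWith("_"))
      require(dataFiles.length == 1, "Exactly one input path is expected for libsvm data")
      val path = dataFiles.head.getPath.toUri.toString
      val sc = sqlContext.sparkContext
      MLUtils.computeNumFeatures(MLUtils.parseLibSVMFile(sc, path, sc.defaultParallelism))
    }

    val numFeatures = options.getOrElse("numFeatures", computeNumFeatures().toString)
    options + ("numFeatures" -> numFeatures)
  }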

An alternative approach is to absorb inferSchema() into prepareRead(), since schema inference is also a form of global information gathering. However, this approach wasn't chosen, because schema inference is optional while prepareRead() must be called whenever a HadoopFsRelation-based data source relation is instantiated.

One unaddressed problem is that, when numFeatures is absent, the input data is now scanned twice. The buildInternalScan() code path doesn't need to do this because it caches the raw parsed RDD in memory before computing the total number of features. With FileScanRDD, however, the raw parsed RDD is created differently (e.g., with different partitioning) from the final RDD, so it can't simply be cached and reused.

How was this patch tested?

Tested using existing test suites.

@liancheng liancheng force-pushed the spark-14295-libsvm-build-reader branch from 64c3efb to 9b0be94 on March 31, 2016 at 14:43
@liancheng
Contributor Author

cc @yhuai @cloud-fan @marmbrus

@SparkQA

SparkQA commented Mar 31, 2016

Test build #54632 has finished for PR 12088 at commit 64c3efb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 31, 2016

Test build #54633 has finished for PR 12088 at commit 9b0be94.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

      requiredSchema: StructType,
      filters: Seq[Filter],
      options: Map[String, String]): (PartitionedFile) => Iterator[InternalRow] = {
    val numFeatures = options("numFeatures").toInt
Contributor

add an assert?

Contributor Author

Thanks, added.
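
For reference, the added check looks roughly like the following; the exact assertion message is an assumption:

  val numFeatures = options("numFeatures").toInt
  assert(numFeatures > 0, s"Unexpected number of features: $numFeatures")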

@cloud-fan
Contributor

Overall LGTM

  def computeNumFeatures(): Int = {
    val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
    val path = if (dataFiles.length == 1) dataFiles.head.getPath.toUri.toString
    else if (dataFiles.isEmpty) throw new IOException("No input path specified for libsvm data")
Contributor

indent

@mengxr
Contributor

mengxr commented Mar 31, 2016

@liancheng My main question is that this PR adds ~100 lines of code without introducing new features to the LibSVM source. The code in buildReader now mixes Tungsten internals with parsing code, so maintaining it requires people who understand both. I'm okay with the changes, but it would be great to think of a way to separate the internals from the data source implementation. Essentially, LIBSVM is a text-based source with a LIBSVM record parser, and it might require two passes over the data.

@SparkQA

SparkQA commented Mar 31, 2016

Test build #54639 has finished for PR 12088 at commit 6336898.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

@mengxr Regarding the extra lines added: we're planning to remove buildInternalScan() once we finish migrating all the HadoopFsRelation data sources, and we'll do the cleanup then.

Separating Tungsten internals from LibSVM format parsing is a good point.

And it's true that we may scan the data twice to compute the total number of features, since we can no longer cache the original RDD: it's constructed in a different way from the final FileScanRDD. I forgot to mention this in the PR description at first. I haven't figured out a good solution for this problem yet.

On the other hand, the original code always caches the RDD in memory. Does this imply we never intend to use the LibSVM data source to load large datasets that don't fit in memory? If so, we may want to special-case LibSVM, since the FileScanRDD code path may not bring much performance improvement to this data source.
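
For context, here is roughly how the old code path avoids the second scan; a simplified sketch of the relevant part of MLUtils.loadLibSVMFile(), where sc, path, minPartitions, and numFeatures are parameters of that method:

  // The parsed RDD is always persisted in memory, so inferring the
  // feature dimension doesn't re-read the input files, at the cost of
  // caching the whole dataset.
  val parsed = parseLibSVMFile(sc, path, minPartitions)
  val d = if (numFeatures > 0) {
    numFeatures
  } else {
    parsed.persist(StorageLevel.MEMORY_ONLY)
    computeNumFeatures(parsed)
  }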


    val sc = sqlContext.sparkContext
    val parsed = MLUtils.parseLibSVMFile(sc, path, sc.defaultParallelism)
    MLUtils.computeNumFeatures(parsed)
Contributor Author

Note that, unlike what we do in MLUtils, we don't cache the parsed RDD here, since it's constructed using a different partitioning strategy than the final FileScanRDD. This is a potential performance regression.

@SparkQA

SparkQA commented Mar 31, 2016

Test build #54648 has finished for PR 12088 at commit 931a29f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Mar 31, 2016

LGTM2

@mengxr
Contributor

mengxr commented Apr 1, 2016

Merged into master. Thanks!

@asfgit asfgit closed this in 1b07063 Apr 1, 2016
@liancheng liancheng deleted the spark-14295-libsvm-build-reader branch April 1, 2016 07:01
@liancheng
Contributor Author

Hm, it seems that this PR broke the master build. I'm looking into it.

asfgit pushed a commit that referenced this pull request Apr 1, 2016
## What changes were proposed in this pull request?

Fixes a compilation failure introduced in PR #12088 under Scala 2.10.

## How was this patch tested?

Compilation.

Author: Cheng Lian <lian@databricks.com>

Closes #12107 from liancheng/spark-14295-hotfix.
asfgit pushed a commit that referenced this pull request Jun 16, 2016
## What changes were proposed in this pull request?

Interface method `FileFormat.prepareRead()` was added in #12088 to handle a special case in the LibSVM data source.

However, the semantics of this interface method aren't intuitive: it returns a modified version of the data source options map. Considering that the LibSVM case can be handled easily using schema metadata inside `inferSchema` (sketched after this commit message), we can remove this interface method to keep the `FileFormat` interface clean.

## How was this patch tested?

Existing tests.

Author: Cheng Lian <lian@databricks.com>

Closes #13698 from liancheng/remove-prepare-read.

(cherry picked from commit 9ea0d5e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
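
For reference, the schema-metadata approach described above looks roughly like this; a simplified sketch of the #13698 change, where computeNumFeatures(files) stands in for the one-pass inference helper:

  // Sketch: record numFeatures as column metadata during inferSchema(),
  // so buildReader() can recover it from requiredSchema instead of from
  // a rewritten options map.
  override def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] = {
    val numFeatures: Int = options.get("numFeatures").map(_.toInt).getOrElse {
      computeNumFeatures(files)  // hypothetical helper, one extra pass over the data
    }
    val featuresMetadata = new MetadataBuilder()
      .putLong("numFeatures", numFeatures)
      .build()
    Some(StructType(
      StructField("label", DoubleType, nullable = false) ::
      StructField("features", new VectorUDT(), nullable = false, featuresMetadata) :: Nil))
  }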