[SPARK-14295][SPARK-14274][SQL] Implements buildReader() for LibSVM #12088
Conversation
Force-pushed from 64c3efb to 9b0be94.
Test build #54632 has finished for PR 12088 at commit

Test build #54633 has finished for PR 12088 at commit
    requiredSchema: StructType,
    filters: Seq[Filter],
    options: Map[String, String]): (PartitionedFile) => Iterator[InternalRow] = {
  val numFeatures = options("numFeatures").toInt
add an assert?
Thanks, added.
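The suggested assertion can be sketched in plain Scala. This is an illustrative, self-contained version (the function name and error message are hypothetical; the real code lives in Spark's LibSVM relation and operates on the data source options map):

```scala
// Hypothetical sketch of the validation discussed above: read the
// "numFeatures" data source option and fail fast on invalid values
// instead of producing malformed vectors later.
def parseNumFeatures(options: Map[String, String]): Int = {
  val numFeatures = options("numFeatures").toInt
  require(numFeatures > 0, s"Expected a positive number of features, got $numFeatures")
  numFeatures
}
```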
Overall LGTM
def computeNumFeatures(): Int = {
  val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
  val path = if (dataFiles.length == 1) dataFiles.head.getPath.toUri.toString
  else if (dataFiles.isEmpty) throw new IOException("No input path specified for libsvm data")
indent
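For illustration, here is a self-contained version of the data-file selection logic quoted above, using plain path strings instead of Hadoop `FileStatus` objects. The multiple-paths error branch is an assumption added to make the function total; the quoted snippet only shows the single-file and empty cases:

```scala
import java.io.IOException

// Illustrative sketch: pick the single LibSVM data file from a directory
// listing, skipping metadata files such as "_SUCCESS" (names starting with "_").
def selectDataFile(paths: Seq[String]): String = {
  val dataFiles = paths.filterNot(_.split('/').last.startsWith("_"))
  if (dataFiles.isEmpty) {
    throw new IOException("No input path specified for libsvm data")
  } else if (dataFiles.length > 1) {
    // Assumed behavior, not shown in the quoted snippet.
    throw new IOException("Multiple input paths are not supported for libsvm data")
  } else {
    dataFiles.head
  }
}
```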
@liancheng My main question is that this PR adds 100 lines of code without introducing new features to the LibSVM source. The code in …
Test build #54639 has finished for PR 12088 at commit
@mengxr For the extra lines added, we're planning to remove … Separating Tungsten internals and LibSVM format parsing from each other is a good point. And it's true that we may scan the data twice when computing the total number of features, since now we can't cache the original RDD because it's constructed in a different way from the final `FileScanRDD`.

On the other hand, the original code always caches the RDD in memory; does this imply we never intend to use the LibSVM data source to load large datasets that don't fit in memory? If that's true, we may want to special-case LibSVM, since the …
val sc = sqlContext.sparkContext
val parsed = MLUtils.parseLibSVMFile(sc, path, sc.defaultParallelism)
MLUtils.computeNumFeatures(parsed)
Note that unlike what we do in `MLUtils`, we don't cache the parsed RDD here, since it's constructed using a different partitioning strategy than the final `FileScanRDD`. This is a potential performance regression.
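The feature-count inference itself is simple: for LibSVM data, the number of features is the largest 1-based feature index observed across all records. A self-contained sketch (records are modeled here as `(label, feature indices)` pairs; the real `MLUtils` code operates on an RDD):

```scala
// Illustrative sketch of numFeatures inference over parsed LibSVM records:
// take the maximum feature index seen, or 0 for an empty/all-empty dataset.
def computeNumFeatures(records: Seq[(Double, Array[Int])]): Int =
  records.iterator.map { case (_, indices) =>
    if (indices.isEmpty) 0 else indices.max
  }.foldLeft(0)(math.max)
```

This is the step that forces a second pass over the data when the `numFeatures` option is absent and the parsed RDD isn't cached.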
Test build #54648 has finished for PR 12088 at commit
LGTM2
Merged into master. Thanks!
Hmm, it seems that this PR broke the master build. I'm looking into it.
## What changes were proposed in this pull request?

Fixes a compilation failure introduced in PR #12088 under Scala 2.10.

## How was this patch tested?

Compilation.

Author: Cheng Lian <lian@databricks.com>

Closes #12107 from liancheng/spark-14295-hotfix.
## What changes were proposed in this pull request?

Interface method `FileFormat.prepareRead()` was added in #12088 to handle a special case in the LibSVM data source. However, the semantics of this interface method isn't intuitive: it returns a modified version of the data source options map. Considering that the LibSVM case can be easily handled using schema metadata inside `inferSchema`, we can remove this interface method to keep the `FileFormat` interface clean.

## How was this patch tested?

Existing tests.

Author: Cheng Lian <lian@databricks.com>

Closes #13698 from liancheng/remove-prepare-read.

(cherry picked from commit 9ea0d5e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
This PR implements `FileFormat.buildReader()` for the LibSVM data source. Besides that, a new interface method `prepareRead()` is added to `FileFormat`.

After migrating from `buildInternalScan()` to `buildReader()`, we lost the opportunity to collect necessary global information, since `buildReader()` works in a per-partition manner. For example, LibSVM needs to infer the total number of features if the `numFeatures` data source option is not set. Any necessary global information collected this way should be returned using the data source options map. By default, this method just returns the original options untouched.

An alternative approach is to absorb `inferSchema()` into `prepareRead()`, since schema inference is also a kind of global information gathering. However, this approach wasn't chosen because schema inference is optional, while `prepareRead()` must be called whenever a `HadoopFsRelation`-based data source relation is instantiated.

One unaddressed problem is that, when `numFeatures` is absent, the input data will now be scanned twice. The `buildInternalScan()` code path doesn't need to do this because it caches the raw parsed RDD in memory before computing the total number of features. However, with `FileScanRDD`, the raw parsed RDD is created in a different way (e.g. partitioning) from the final RDD.

How was this patch tested?
Tested using existing test suites.
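The `prepareRead()` contract described in this PR can be sketched in a few lines of self-contained Scala. `SimpleFileFormat` and the string-based file list below are simplified stand-ins for Spark's actual `FileFormat` and `FileStatus` types, and the `LibSVMLikeFormat` class is hypothetical:

```scala
// Hedged sketch of the prepareRead() contract: the default implementation
// returns the data source options untouched, while a LibSVM-style override
// injects an inferred "numFeatures" entry when the user didn't provide one.
trait SimpleFileFormat {
  def prepareRead(options: Map[String, String], files: Seq[String]): Map[String, String] =
    options // by default, there is no global information to collect
}

class LibSVMLikeFormat(inferNumFeatures: Seq[String] => Int) extends SimpleFileFormat {
  override def prepareRead(
      options: Map[String, String],
      files: Seq[String]): Map[String, String] = {
    // Only pay for the extra scan when the option is absent.
    if (options.contains("numFeatures")) options
    else options + ("numFeatures" -> inferNumFeatures(files).toString)
  }
}
```

The per-partition `buildReader()` can then read `numFeatures` from the returned options map without any global coordination, which is the design point the discussion above turns on.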