New Tree based Structure for index metadata #139
Conversation
# Conflicts:
#	src/main/scala/com/microsoft/hyperspace/index/rules/JoinIndexRule.scala
#	src/test/scala/com/microsoft/hyperspace/index/E2EHyperspaceRulesTests.scala
#	src/test/scala/com/microsoft/hyperspace/index/IndexManagerTests.scala
case class Directory(
    name: String,
    var files: Seq[FileInfo] = Seq(),
I think we could do without var by doing something like the following.
def dfs(files: Seq[String]): Directory = {
  val name = getTopDir(files.head)
  val nextFiles = files.map(_.drop(name.length + SEPARATOR.length))
  // Paths that still contain a separator belong to subdirectories; the rest are leaf files.
  // (can be done in groupByTopDir)
  val (subDirFiles, leafFiles) = nextFiles.partition(_.contains(SEPARATOR))
  val groupedPaths: Seq[Seq[String]] = groupByTopDir(subDirFiles)
  val subDirs = groupedPaths.map(dfs)
  // Sketch: getTopDir/groupByTopDir are assumed helpers, and leafFiles would be
  // converted to FileInfo objects in a real implementation.
  Directory(name, files = leafFiles, subDirs = subDirs)
}
This is just a suggestion and I'd like to know your opinion on this.
(edited)
def dfs(files: Seq[String], prefixLen: Int, name: String): Directory = {
  // Group by the next path component; leaf files (no further '/') fall into the "" group.
  val groupedPaths: Map[String, Seq[String]] =
    files.groupBy(p => p.slice(prefixLen, p.indexOf('/', prefixLen)))
  val subDirs = groupedPaths.collect {
    // +1 skips the '/' separator after the component.
    case (k, v) if k.nonEmpty => dfs(v, prefixLen + k.length + 1, k)
  }.toSeq
  Directory(
    name,
    files = groupedPaths.getOrElse("", Seq()).map(_.drop(prefixLen)),
    subDirs = subDirs)
}
Thanks @sezruby, I like the general idea of dfs, but there are some string manipulations here which I am not sure about. Also, I think the current implementation is better perf-wise for creating this structure, as I have mentioned below. But I am OK if others also like this idea and prefer this implementation.
Could you modify this to a Path-based implementation?
Some doubts regarding perf:
- files.drop(name.length + SEPARATOR.length) => Do you mean files.map(_.drop(name.length + SEPARATOR.length))?
  If that is the case, there will be a lot more string operations than we would like (many iterations over the full file list, removing the parent path one level at a time). I guess this would hit performance very badly (let me know if my understanding is wrong).
  This would be O(NM log(NM)) (N = number of files, M = file name length). Worst case: O(NMD), where D is the depth at which all files are present.
- val (leafFiles, subDirFiles) = nextFiles.partition(_.contains(SEPARATOR)): This is also bad for perf, because at every level in the directory tree we iterate over the full file list to partition it into leaf files and subdirectories.
  This would be O(NM log(NM)) (N = number of files, M = file name length). Worst case: O(NMD), where D is the depth at which all files are present.
- String manipulation (drop(n)) is something I don't personally like, but I would be OK with it if the overall implementation is better for perf.
The current implementation is O(NM + number of unique directories) (N = number of files, M = file name length), so I think it's better perf-wise. If var vs. val is a major concern, we can explore changing Seq to ArrayBuffer so that it's expandable and mutable.
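As a rough sketch of that ArrayBuffer alternative (illustrative only, not the PR's actual code):

import scala.collection.mutable.ArrayBuffer

// FileInfo shape as described in this PR (name, size, modified time).
case class FileInfo(name: String, size: Long, modifiedTime: Long)

// The fields stay vals; the buffers themselves remain growable
// while the tree is being built.
case class Directory(
    name: String,
    files: ArrayBuffer[FileInfo] = ArrayBuffer(),
    subDirs: ArrayBuffer[Directory] = ArrayBuffer())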
- Yes, it's files.map(_.drop(name.length + SEPARATOR.length)).
- This can be done with groupByTopDir.
- I agree that string manipulation might be expensive and uncomfortable. But it seems Path.getParent also involves string operations (see the Hadoop source on github).
Regarding performance, I think this could be done in a more efficient way by using sorted file paths(?) and prefixLen for the common prefix (though I'm not sure it's practical in Scala). I added the edited version above. It might still be slower than the HashMap version, but there should be some benefit from using val.
If it's mainly about val vs. var, I would suggest we explore either a Builder pattern or ArrayBuffers instead of immutable Seq(). In both cases we can stick with val and still get the performance of the existing implementation.
The one problematic issue with the suggestion is that it's really depth-dependent. If all files are at depth 50, then we iterate over all files 50 times. This makes me feel the benefit of val doesn't outweigh the cost in perf.
It's possible that I might not have fully understood the implementation. Could you please add an implementation/pseudocode of groupByTopDir?
Actually, it's not all files. Each file is iterated by groupBy [their depth] times, as they are partitioned during the traversal. In the edited version there is no groupByTopDir, just groupBy with a mapping function:
val groupedPaths: Map[String, Seq[String]] =
  files.groupBy(p => p.slice(prefixLen, p.indexOf('/', prefixLen)))
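For a concrete (made-up) example of what that grouping produces:

val files = Seq("data/a.parquet", "data/sub/b.parquet")
val prefixLen = 5 // length of "data/"
val grouped = files.groupBy(p => p.slice(prefixLen, p.indexOf('/', prefixLen)))
// grouped == Map("" -> Seq("data/a.parquet"),        // leaf files at this level
//                "sub" -> Seq("data/sub/b.parquet")) // files under subdirectory "sub"
// (indexOf returns -1 for leaf files, so slice yields "" and they group together)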
We might use sorted file paths to reduce the cost of groupBy, but I think the edited version is not that bad.
Anyway, it's okay to keep the current version if that's preferable.
Sorry for the delay. I will get to this thread this weekend.
I am +1 for using val for this case class.
Regarding the perf, we can address it in a separate PR with a proper benchmark if needed.
The current approach looks good to me.
  IndexLogEntry.schemaString(schemaFromAttributes(indexCols ++ includedCols: _*)),
  10)),
- Content(getIndexDataFilesPath(name).toUri.toString, Seq()),
+ Content.fromLeafFiles(Seq(indexFile)),
nit: does it make sense to test with multiple files (>1) to make sure FileIndex is populated correctly? (e.g., val location = new InMemoryFileIndex(spark, index.content.files, Map(), None))
Thanks @imback82, I updated the tests to check multiple files. Although as of now I haven't added a multi-level directory structure in index files. Please let me know if we can push it for later or need to do it with this PR.
Keeping this comment unresolved pending your suggestion.
imback82 left a comment:
I did one more round of review. The overall approach looks good to me.
imback82 left a comment:
LGTM, thanks @apoorvedave1!
LGTM, Thank you @apoorvedave1

LGTM, thanks for the work!
object FileInfo {
  def apply(s: FileStatus): FileInfo = {
    require(s.isFile, s"${FileInfo.getClass.getName} is applicable for files, not directories.")
    FileInfo(s.getPath.getName, s.getLen, s.getModificationTime)
@apoorvedave1 While resolving the conflict with my change, I found that s.getPath.getName discards the parent dir information. This might cause loss of the location info inadvertently. What do you think?
Yeah, you are correct. This was a conscious choice: the location info is removed to avoid duplication in the metadata.
Is this causing some problem, or do you foresee a problem in the future because of this? If so, we can do something like this: similar to the Content.files API, we can expose a def Content.fileInfos: Seq[FileInfo] (or similar) API which adds the location back to file names. Alternatively, we can discuss keeping full file paths in the FileInfo object.
Please let me know your thoughts, and we can explore either adding the def Content.fileInfos: Seq[FileInfo] API or updating the FileInfo object to hold full paths.
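As a rough sketch of what such an API could look like (hypothetical; it assumes Content holds a root Directory, FileInfo fields as described in this PR, and glosses over separator edge cases):

def fileInfos: Seq[FileInfo] = {
  // Walk the tree, prepending each directory name so the returned
  // FileInfo names carry the full path.
  def rec(prefix: String, dir: Directory): Seq[FileInfo] = {
    val here = dir.files.map(f =>
      FileInfo(prefix + dir.name + "/" + f.name, f.size, f.modifiedTime))
    here ++ dir.subDirs.flatMap(d => rec(prefix + dir.name + "/", d))
  }
  rec("", root)
}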
I added fileInfos() in my PR, though it needs to be revised.
https://github.com/microsoft/hyperspace/pull/123/files#r485951598
And I used FileInfo in this way:

val curFileSet = location.allFiles
  .map(f => FileInfo(f.getPath.toString, f.getLen, f.getModificationTime))
The point is, with this apply function, the parent dir info can be removed unintentionally, without an informative function name. Or at least we need some comment here.
Oh I see, you are saying it's semantically not correct to allow both versions of FileInfo: one with just the name, another with the full file path. Is my understanding correct?
I guess that's a fair point. Ideally we should have two different classes, one with just the name and one with the full file path, e.g.:

class FileInfo(filename: String, modifiedTime: Long, size: Long)
class FullPathFileInfo(path: Path, modifiedTime: Long, size: Long)

Please let me know if my understanding is correct. We can create an issue to fix this limitation in another PR.
How about we just store the full path in the log and compress the index JSON file? Looks like we will always be building the full paths when hybrid scan is enabled.
Never mind for now (it's a different discussion). I was a bit concerned with def rec since we could be creating a lot of Path objects, but we can get some numbers first. (Alternatively, we could simply store the parent's full dir name in Directory if needed, so that we don't have to traverse all the way up.)
I agree that FileInfo.name should only store the final component of the Path (same for Directory.name), and def apply(s: FileStatus): FileInfo can be confusing when the parent directory is stripped away. Looks like we don't really need this helper function and can be explicit about the behavior, i.e., call the constructor? (It's used once in code and many times in tests, but you can define a helper function in IndexLogEntryTest, or add appropriate comments.)
With lazy val instead of def, the performance seems okay once allFileInfos is built.
On second thought, as FileInfo is defined under Content, which is a tree-based structure, it's reasonable to keep only the file name.
I need FileInfo(full path string, size, modification time) to compare and intersect two lists of files. I think it's okay to reuse the current FileInfo for hybrid scan as it's just a simple case class, and we can revisit the structure and API later.
What changes were proposed in this pull request?
The new Content structure, built on a recursive Directory structure, will look like below:
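Sketched roughly in Scala (an outline only; the exact definitions in the PR may differ):

case class Content(root: Directory)

case class Directory(
    name: String,                     // directory name only, not the whole path
    files: Seq[FileInfo] = Seq(),     // leaf files directly under this directory
    subDirs: Seq[Directory] = Seq())  // child directories

case class FileInfo(name: String, size: Long, modifiedTime: Long)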
Here Content represents any data or index content, and holds the file contents (and maybe other info) about files. Directory semantically maps to a file system directory. It is composed of a directory name, a list of children subDirs (Directory objects), and a list of children files. The name is the directory name (and not the whole directory path).

Content APIs:

- files: Seq[Path]: returns the list of leaf files as hadoop Paths representing the files logged in the Content object.
- def fromDirectory(path: Path): Content: creates a Content object from a given directory path. It contains all leaf files in the directory subtree rooted at path.
- def fromLeafFiles(files: Seq[FileStatus]): Content: creates a Content object from a list of leaf files.

Directory APIs:
- def fromDirectory(path: Path): Directory: creates a Directory object from a given directory path. It contains all leaf files in the directory subtree rooted at path.
- def fromLeafFiles(files: Seq[FileStatus]): Directory: creates a Directory object from a list of leaf files.
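As an illustrative usage sketch (method names as listed above; the path is made up):

import org.apache.hadoop.fs.{FileStatus, Path}

// Build a Content object from an on-disk directory tree.
val content1 = Content.fromDirectory(new Path("/root/dir1"))

// Or build it from an explicit list of leaf files.
val leafStatuses: Seq[FileStatus] = Seq() // e.g., from FileSystem.listStatus
val content2 = Content.fromLeafFiles(leafStatuses)

// List all leaf files as hadoop Paths.
val paths: Seq[Path] = content1.files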
This is the output of the directory structure when we create a Content object by passing a directory path. Here's how the new structure looks:

- Print the content object
- Print the JSON conversion of the object
- Convert back from JSON to the content1 object. Note it's the same as the first one.
- Assert the returned object is the same as the original: true
- List leaf files using the content.files API

Why are the changes needed?
Before this PR, the metadata content structure was limited in its expressiveness for partitioned folders. With multi-level Hive partitioning, it is even harder to recognize where the root folder ends and where the partitioning starts.
Solution: To simplify this, we decided to keep a tree structure starting at the root of the file system, with every sub-directory a child of its parent directory. Same for files. The files are still stored as FileInfo objects, maintaining their file name, size, and last modified time.
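As a rough illustration (sizes and times are made up, using the sketched case classes above), a single file /root/dir1/file.parquet becomes a chain of nested Directory nodes rather than one flat path string:

Directory("/",
  subDirs = Seq(
    Directory("root",
      subDirs = Seq(
        Directory("dir1",
          files = Seq(FileInfo("file.parquet", size = 100, modifiedTime = 12345)))))))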
Does this PR introduce any user-facing change?
Yes. This is a breaking change. Earlier, the file paths were stored as complete paths. Now a file path is a combination of its directory chain, e.g. /root/dir1/dir2/.../file.parquet. Indexes created before v0.2.0 may not be readable.
unit tests