This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Conversation


@apoorvedave1 apoorvedave1 commented Aug 26, 2020

What changes were proposed in this pull request?

The new Content structure, built on a recursive Directory structure, will look like this:

Content(
    Directory(
        name: String,
        subDirs: Seq[Directory],
        files: Seq[FileInfo]))

Here, Content represents any data or index content and holds metadata about files (name, size, modified time, and possibly other info). Directory semantically maps to a file system directory. It is composed of a directory name, a list of child Directory objects (subDirs), and a list of child files. The name is the directory name alone, not the whole directory path.

Content APIs:

def files: Seq[Path]: Returns the list of leaf files, as Hadoop Paths, logged in the Content object.

def fromDirectory(path: Path): Content: Creates a Content object from a given directory path. It contains all leaf files in the directory subtree rooted at path.

def fromLeafFiles(files: Seq[FileStatus]): Content: Creates a Content object from a list of leaf files.

Directory APIs:

def fromDirectory(path: Path): Directory: Creates a Directory object from a given directory path. It contains all leaf files in the directory subtree rooted at path.

def fromLeafFiles(files: Seq[FileStatus]): Directory: Creates a Directory object from a list of leaf files.
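To make the semantics concrete, here is a self-contained sketch of the structure and of how a files-style method can rebuild full leaf paths from the local directory names. It uses plain strings instead of Hadoop Path/FileStatus, so the signatures differ from the real API; this is an illustration, not the Hyperspace implementation:

```scala
// Simplified sketch of the structures described above; the real classes
// use Hadoop Path/FileStatus and live in the Hyperspace codebase.
case class FileInfo(name: String, size: Long, modifiedTime: Long)

case class Directory(
    name: String,                     // local directory name, not the full path
    files: Seq[FileInfo] = Seq(),
    subDirs: Seq[Directory] = Seq())

case class Content(root: Directory) {
  // Rebuild full leaf-file paths by prepending each ancestor's name,
  // mirroring the files: Seq[Path] API described above.
  def files: Seq[String] = {
    def rec(prefix: String, dir: Directory): Seq[String] = {
      val path = prefix + dir.name
      dir.files.map(f => s"$path/${f.name}") ++
        dir.subDirs.flatMap(rec(path + "/", _))
    }
    rec("", root)
  }
}
```

For example, a tree rooted at file:/data with one subdirectory holding one file would yield a single path like file:/data/sub/part-0.parquet.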

Here is how the new structure looks when we create a Content object by passing a directory path.

Printing the Content object:

Content(Directory(file:/C:/,List(),List(Directory(Users,List(),List(Directory(apdave,List(),List(Directory(repo2,List(),List(Directory(testdata,List(),List(Directory(sampleparquet,WrappedArray(FileInfo(part-00000-5782bdd3-4729-44e6-b54b-51b557f66792-c000.snappy.parquet,1473,1585280921507), FileInfo(part-00000-e8c8821f-a1b2-4b3b-be9e-687b6fa6d057-c000.snappy.parquet,1878,1585280853559)),List()))))))))))))

Printing the JSON conversion of the object:

{
  "root" : {
    "name" : "file:/C:/",
    "files" : [ ],
    "subDirs" : [ {
      "name" : "Users",
      "files" : [ ],
      "subDirs" : [ {
        "name" : "apdave",
        "files" : [ ],
        "subDirs" : [ {
          "name" : "repo2",
          "files" : [ ],
          "subDirs" : [ {
            "name" : "testdata",
            "files" : [ ],
            "subDirs" : [ {
              "name" : "sampleparquet",
              "files" : [ {
                "name" : "part-00000-5782bdd3-4729-44e6-b54b-51b557f66792-c000.snappy.parquet",
                "size" : 1473,
                "modifiedTime" : 1585280921507
              }, {
                "name" : "part-00000-e8c8821f-a1b2-4b3b-be9e-687b6fa6d057-c000.snappy.parquet",
                "size" : 1878,
                "modifiedTime" : 1585280853559
              } ],
              "subDirs" : [ ]
            } ]
          } ]
        } ]
      } ]
    } ]
  }
}

Converting back from JSON to a Content object. Note it is the same as the original:

Content(Directory(file:/C:/,List(),List(Directory(Users,List(),List(Directory(apdave,List(),List(Directory(repo2,List(),List(Directory(testdata,List(),List(Directory(sampleparquet,List(FileInfo(part-00000-5782bdd3-4729-44e6-b54b-51b557f66792-c000.snappy.parquet,1473,1585280921507), FileInfo(part-00000-e8c8821f-a1b2-4b3b-be9e-687b6fa6d057-c000.snappy.parquet,1878,1585280853559)),List()))))))))))))

Asserting the returned object is the same as the original:
true

Listing leaf files using the Content.files API:

file:/C:/Users/apdave/repo2/testdata/sampleparquet/part-00000-5782bdd3-4729-44e6-b54b-51b557f66792-c000.snappy.parquet
file:/C:/Users/apdave/repo2/testdata/sampleparquet/part-00000-e8c8821f-a1b2-4b3b-be9e-687b6fa6d057-c000.snappy.parquet

Why are the changes needed?

Before this PR, the metadata content structure was limited in its expressiveness for partitioned folders. With multi-level Hive partitioning, it is also difficult to recognize where the root folder ends and the partitioning begins.
Solution: to simplify this, we decided to keep a tree structure starting at the root of the file system, with every subdirectory a child of its parent directory, and likewise for files. Files are still stored as FileInfo objects, which record the file name, size, and last modified time.
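To make the deduplication benefit concrete, here is a toy comparison (the paths are hypothetical) between storing the full path for every file and storing each directory path only once, as the tree does:

```scala
// Hypothetical hive-partitioned leaf files; every full path repeats the prefix.
val fullPaths = Seq(
  "/data/table/year=2020/month=01/part-0.parquet",
  "/data/table/year=2020/month=01/part-1.parquet",
  "/data/table/year=2020/month=02/part-0.parquet")

// Flat representation: every file carries its whole path.
val flatChars = fullPaths.map(_.length).sum

// Tree representation: each distinct directory path is stored once,
// and each file contributes only its own name.
val dirs      = fullPaths.map(p => p.take(p.lastIndexOf('/'))).distinct
val fileNames = fullPaths.map(p => p.drop(p.lastIndexOf('/') + 1))
val treeChars = dirs.map(_.length).sum + fileNames.map(_.length).sum
// treeChars < flatChars whenever several files share a directory.
```

The gap grows with the number of files per partition and the depth of the partitioning scheme.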

Does this PR introduce any user-facing change?

Yes. This is a breaking change. Earlier, the stored file paths were complete paths themselves. Now a file path is reconstructed by combining the directory names along the tree, as in /root/dir1/dir2/.../file.parquet. Indexes created before v0.2.0 may not be readable.

How was this patch tested?

Unit tests.

@apoorvedave1 apoorvedave1 requested a review from imback82 August 26, 2020 23:38
@rapoth rapoth modified the milestones: 0.3.0, 0.4.0 Sep 4, 2020

case class Directory(
    name: String,
    var files: Seq[FileInfo] = Seq(),
@sezruby sezruby Sep 4, 2020

I think we could do without var by doing something like the following.

def dfs(files: Seq[String]): Directory = {
  val name = getTopDir(files.head)
  val nextFiles = files.map(_.drop(name.length + SEPARATOR.length))
  // leaf files have no remaining separator (can be done in groupByTopDir)
  val (subDirFiles, leafFiles) = nextFiles.partition(_.contains(SEPARATOR))
  val groupedPaths: Seq[Seq[String]] = groupByTopDir(subDirFiles)
  val subDirs = groupedPaths.map(dfs)
  Directory(name, files = leafFiles, subDirs = subDirs)
}

This is just a suggestion and I'd like to know your opinion on this.

(edited)

def dfs(files: Seq[String], prefixLen: Int, name: String): Directory = {
  val groupedPaths: Map[String, Seq[String]] =
    files.groupBy(p => p.slice(prefixLen, p.indexOf('/', prefixLen)))
  val subDirs = groupedPaths.collect {
    case (k, v) if k.nonEmpty => dfs(v, prefixLen + k.length + 1, k)
  }.toSeq
  Directory(name, files = groupedPaths.getOrElse("", Seq()).map(_.drop(prefixLen)), subDirs = subDirs)
}

@apoorvedave1 apoorvedave1 Sep 4, 2020

thanks @sezruby, I like the general idea of dfs, but there are some string manipulations here which I am not sure about. Also, I think the existing implementation is better performance-wise for creating this structure, as I have mentioned below. But I am OK if others also like this idea and prefer this implementation.

Could you modify this to a Path-based implementation?

Some doubts regarding perf:

  1. files.drop(name.length + SEPARATOR.length) => Do you mean files.map(_.drop(name.length + SEPARATOR.length))?
    If so, there will be many more string operations than we would like (many iterations over the full file list, removing the parent path one level at a time). I guess this would hurt performance badly (let me know if my understanding is wrong).
    This would be O(NM log(NM)) (M = file name length, N = number of files). Worst case: O(NMD), where D is the depth at which all files are present.

  2. val (subDirFiles, leafFiles) = nextFiles.partition(_.contains(SEPARATOR)): This is also bad for performance, because at every level of the directory tree we iterate over the full file list to partition it into leaf files and subdirectories.
    This would be O(NM log(NM)). Worst case: O(NMD), where D is the depth at which all files are present.

  3. String manipulation (drop(n)) is something I don't personally like, but I would be OK with it if the overall implementation performs better.

The current implementation is O(NM + number of unique directories) (N = number of files, M = file path length), so I think it's better performance-wise. If var vs. val is a major concern, we can explore changing Seq to ArrayBuffer so that it's expandable and mutable.

@sezruby sezruby Sep 4, 2020

  1. Yes, it's files.map(_.drop(name.length + SEPARATOR.length)).
  2. This can be done with groupByTopDir.
  3. I agree that string manipulation might be expensive and uncomfortable. But it seems Path.getParent also involves string operations.

Regarding performance, I think this could be done more efficiently by using sorted file paths(?) and prefixLen for the common prefix (though I'm not sure it's practical in Scala). I added an edited version. It might still be slower than the HashMap version, but there should be some benefit from using val.

Contributor Author

if it's mainly about val vs. var, I would suggest we explore using either a Builder pattern or ArrayBuffers instead of immutable Seq(). In both cases we can stick with val and still get the performance of the existing implementation.

The one problematic issue with the suggestion is that it's really depth dependent. If all files are at depth 50, then we iterate over all files 50 times. This makes me feel the benefit of val doesn't outweigh the performance cost.

It's possible that I have not fully understood the implementation. Could you please add an implementation/pseudocode of groupByTopDir?

@sezruby sezruby Sep 4, 2020

Actually, it's not all files. Each file is iterated [its depth] times by groupBy, as the files are partitioned during the traversal. In the edited version there is no groupByTopDir, just groupBy with a mapping function:

 val groupedPaths: Map[String, Seq[String]] =
   files.groupBy(p => p.slice(prefixLen, p.indexOf('/', prefixLen)))

We might use sorted file paths to reduce the cost of groupBy, but I think the edited version is not that bad.

Anyway, it's okay to keep the current version if that's preferable.

cc. @imback82, @pirz could you give a comment for this?

Contributor

Sorry for the delay. I will get to this thread this weekend.

Contributor

I am +1 for using val for this case class.

Regarding the perf concerns, we can address them in a separate PR with a proper benchmark if needed.

@apoorvedave1 apoorvedave1 Sep 8, 2020

Thanks @imback82,
I edited the code to use ListBuffer instead of var. This way we can still avoid multiple iterations over the file list and keep val. Please let me know if this is OK or if dfs is preferred.
cc @sezruby
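A hypothetical sketch of what such a mutable-buffer construction might look like (the names and helper are illustrative, not the merged Hyperspace code): one node per directory is created lazily in a map walk, so every file is visited a single time.

```scala
import scala.collection.mutable

// Sketch only: single pass over the files, one mutable node per directory.
class DirNode(val name: String) {
  val files = mutable.ListBuffer[String]()
  val subDirs = mutable.LinkedHashMap[String, DirNode]()
}

def buildTree(paths: Seq[String]): DirNode = {
  val root = new DirNode("")
  for (p <- paths) {
    val parts = p.split('/').filter(_.nonEmpty)
    // Walk (and lazily create) the directory chain, then append the leaf file.
    val dir = parts.dropRight(1).foldLeft(root) { (node, d) =>
      node.subDirs.getOrElseUpdate(d, new DirNode(d))
    }
    dir.files += parts.last
  }
  root
}
```

Total work is proportional to the combined path length plus the number of unique directories, matching the O(NM + number of unique directories) figure mentioned earlier in the thread, while every field stays a val.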

Contributor

The current approach looks good to me.

@apoorvedave1 apoorvedave1 requested a review from sezruby September 4, 2020 05:10
IndexLogEntry.schemaString(schemaFromAttributes(indexCols ++ includedCols: _*)),
10)),
Content(getIndexDataFilesPath(name).toUri.toString, Seq()),
Content.fromLeafFiles(Seq(indexFile)),
Contributor

nit: does it make sense to test with multiple files (>1) to make sure FileIndex is populated correctly? (e.g., val location = new InMemoryFileIndex(spark, index.content.files, Map(), None))

Contributor Author

thanks @imback82, I updated the tests to check multiple files. Although, as of now, I haven't added a multi-level directory structure to the index files. Please let me know if we can push that for later or need to do it in this PR.

Contributor Author

keeping this comment unresolved pending your suggestion

@imback82 imback82 left a comment

I did one more round of review. The overall approach looks good to me.

@imback82 imback82 left a comment

LGTM, thanks @apoorvedave1!


imback82 commented Sep 9, 2020

@sezruby @pirz Can you take a look one more time since there have been quite a bit of changes since you last reviewed?


pirz commented Sep 9, 2020

LGTM, Thank you @apoorvedave1


sezruby commented Sep 9, 2020

LGTM, thanks for the work!
nit: remove the limitation part of the PR description.

@imback82 imback82 merged commit 3dc7fe8 into microsoft:master Sep 9, 2020
object FileInfo {
  def apply(s: FileStatus): FileInfo = {
    require(s.isFile, s"${FileInfo.getClass.getName} is applicable for files, not directories.")
    FileInfo(s.getPath.getName, s.getLen, s.getModificationTime)
  }
}
Collaborator

@apoorvedave1 While resolving the conflict with my change, I found that s.getPath.getName discards the parent directory information. This might silently lose the location info. What do you think?

Contributor Author

yeah, you are correct. This was a conscious choice: removing the location info removes duplication in the metadata.

Is this causing a problem somehow, or do you foresee one in the future because of this? If so, similar to the Content.files API, we can expose a def Content.fileInfos: Seq[FileInfo] or similar API which adds the location to file names. Alternatively, we can discuss keeping full file paths in the FileInfo object.

Please let me know your thoughts, and we can explore either adding a def Content.fileInfos: Seq[FileInfo] API or updating the FileInfo object to hold full paths.

Collaborator

I added fileInfos() in my PR though it needs to be revised.
https://github.com/microsoft/hyperspace/pull/123/files#r485951598

And I used FileInfo in this way:

val curFileSet = location.allFiles
  .map(f => FileInfo(f.getPath.toString, f.getLen, f.getModificationTime))

The point is, with this apply function, the parent dir info can be removed unintentionally, without an informative function name. Or at least we need a comment here.

Contributor Author

oh I see, you are saying it's semantically incorrect to allow both versions of FileInfo: one with just the name, another with the full file path. Is my understanding correct?

I guess that's a fair point. Ideally we should have two different classes, one with just the name and one with the full file path. E.g.

case class FileInfo(fileName: String, modifiedTime: Long, size: Long)

case class FullPathFileInfo(path: Path, modifiedTime: Long, size: Long)

Please let me know if my understanding is correct. We can create an issue to fix this limitation in another PR.

@imback82 imback82 Sep 10, 2020

How about we just store the full path in the log and just compress the index json file? Looks like we will always be building the full paths when hybrid scan is enabled.

Never mind for now (it's a different discussion). I was a bit concerned about def rec since we could be creating a lot of Path objects, but we can get some numbers first (alternatively, we could simply store the parent's full directory name in Directory if needed, so that we don't have to traverse all the way up).

@imback82 imback82 Sep 11, 2020

I agree that FileInfo.name should only store the final component of the Path (same for Directory.name), and def apply(s: FileStatus): FileInfo can be confusing when the parent directory is stripped away. Looks like we don't really need this helper function and can be explicit about the behavior, i.e., call the constructor? (It is used once in code and many times in tests, but you can define a helper function in IndexLogEntryTest.) Or add appropriate comments.

Collaborator

With lazy val instead of def, the performance seems okay once allFileInfos is built.
On second thought, as FileInfo is defined under Content, which is a tree-based structure, it's reasonable to keep only the file name.

I need FileInfo(full path string, size, modification time) to compare and intersect 2 lists of files. I think it's okay to reuse the current FileInfo for hybrid scan as it's just a simple case class, and we could revisit the structure and API later.

@apoorvedave1 apoorvedave1 deleted the newstructure branch September 9, 2020 17:26

Labels

breaking changes, enhancement
