This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Conversation


@apoorvedave1 apoorvedave1 commented Aug 26, 2020

What changes were proposed in this pull request?

The new Content structure, built on a recursive Directory structure, will look like this:

Content(
    Directory(
        name: String,
        subDirs: Seq[Directory],
        files: Seq[FileInfo]))

Here, Content represents any data or index content and holds metadata about files (name, size, modified time, and possibly other info). Directory semantically maps to a file system directory. It is composed of a directory name, a list of child Directory objects (subDirs), and a list of child files. The name is the directory name alone, not the whole directory path.

Content APIs:

def files: Seq[Path]: Returns the list of leaf files, as Hadoop Paths, logged in the Content object.

def fromDirectory(path: Path): Content: Creates a Content object from a given directory path. It contains all leaf files in the directory subtree rooted at path.

def fromLeafFiles(files: Seq[FileStatus]): Content: Creates a Content object from a list of leaf files.

Directory APIs:

def fromDirectory(path: Path): Directory: Creates a Directory object from a given directory path. It contains all leaf files in the directory subtree rooted at path.

def fromLeafFiles(files: Seq[FileStatus]): Directory: Creates a Directory object from a list of leaf files.
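To make the semantics concrete, here is a self-contained sketch of the structure and of how a files-style method can rebuild full leaf paths from the local directory names. It uses plain strings instead of Hadoop Path/FileStatus, so the signatures differ from the real API; this is an illustration, not the Hyperspace implementation:

```scala
// Simplified sketch of the structures described above; the real classes
// use Hadoop Path/FileStatus and live in the Hyperspace codebase.
case class FileInfo(name: String, size: Long, modifiedTime: Long)

case class Directory(
    name: String,                     // local directory name, not the full path
    files: Seq[FileInfo] = Seq(),
    subDirs: Seq[Directory] = Seq())

case class Content(root: Directory) {
  // Rebuild full leaf-file paths by prepending each ancestor's name,
  // mirroring the files: Seq[Path] API described above.
  def files: Seq[String] = {
    def rec(prefix: String, dir: Directory): Seq[String] = {
      val path = prefix + dir.name
      dir.files.map(f => s"$path/${f.name}") ++
        dir.subDirs.flatMap(rec(path + "/", _))
    }
    rec("", root)
  }
}
```

For example, a tree rooted at file:/data with one subdirectory holding one file would yield a single path like file:/data/sub/part-0.parquet.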

Here is how the new structure looks when we create a Content object by passing a directory path.

Printing the Content object:

Content(Directory(file:/C:/,List(),List(Directory(Users,List(),List(Directory(apdave,List(),List(Directory(repo2,List(),List(Directory(testdata,List(),List(Directory(sampleparquet,WrappedArray(FileInfo(part-00000-5782bdd3-4729-44e6-b54b-51b557f66792-c000.snappy.parquet,1473,1585280921507), FileInfo(part-00000-e8c8821f-a1b2-4b3b-be9e-687b6fa6d057-c000.snappy.parquet,1878,1585280853559)),List()))))))))))))

Printing the JSON conversion of the object:

{
  "root" : {
    "name" : "file:/C:/",
    "files" : [ ],
    "subDirs" : [ {
      "name" : "Users",
      "files" : [ ],
      "subDirs" : [ {
        "name" : "apdave",
        "files" : [ ],
        "subDirs" : [ {
          "name" : "repo2",
          "files" : [ ],
          "subDirs" : [ {
            "name" : "testdata",
            "files" : [ ],
            "subDirs" : [ {
              "name" : "sampleparquet",
              "files" : [ {
                "name" : "part-00000-5782bdd3-4729-44e6-b54b-51b557f66792-c000.snappy.parquet",
                "size" : 1473,
                "modifiedTime" : 1585280921507
              }, {
                "name" : "part-00000-e8c8821f-a1b2-4b3b-be9e-687b6fa6d057-c000.snappy.parquet",
                "size" : 1878,
                "modifiedTime" : 1585280853559
              } ],
              "subDirs" : [ ]
            } ]
          } ]
        } ]
      } ]
    } ]
  }
}

Converting back from JSON to a Content object. Note it is the same as the original:

Content(Directory(file:/C:/,List(),List(Directory(Users,List(),List(Directory(apdave,List(),List(Directory(repo2,List(),List(Directory(testdata,List(),List(Directory(sampleparquet,List(FileInfo(part-00000-5782bdd3-4729-44e6-b54b-51b557f66792-c000.snappy.parquet,1473,1585280921507), FileInfo(part-00000-e8c8821f-a1b2-4b3b-be9e-687b6fa6d057-c000.snappy.parquet,1878,1585280853559)),List()))))))))))))

Asserting the returned object is the same as the original:
true

Listing leaf files using the Content.files API:

file:/C:/Users/apdave/repo2/testdata/sampleparquet/part-00000-5782bdd3-4729-44e6-b54b-51b557f66792-c000.snappy.parquet
file:/C:/Users/apdave/repo2/testdata/sampleparquet/part-00000-e8c8821f-a1b2-4b3b-be9e-687b6fa6d057-c000.snappy.parquet

Why are the changes needed?

Before this PR, the metadata content structure was limited in its expressiveness for partitioned folders. With multi-level Hive partitioning, it is also difficult to recognize where the root folder ends and the partitioning begins.
Solution: to simplify this, we decided to keep a tree structure starting at the root of the file system, with every subdirectory a child of its parent directory, and likewise for files. Files are still stored as FileInfo objects, which record the file name, size, and last modified time.
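To make the deduplication benefit concrete, here is a toy comparison (the paths are hypothetical) between storing the full path for every file and storing each directory path only once, as the tree does:

```scala
// Hypothetical hive-partitioned leaf files; every full path repeats the prefix.
val fullPaths = Seq(
  "/data/table/year=2020/month=01/part-0.parquet",
  "/data/table/year=2020/month=01/part-1.parquet",
  "/data/table/year=2020/month=02/part-0.parquet")

// Flat representation: every file carries its whole path.
val flatChars = fullPaths.map(_.length).sum

// Tree representation: each distinct directory path is stored once,
// and each file contributes only its own name.
val dirs      = fullPaths.map(p => p.take(p.lastIndexOf('/'))).distinct
val fileNames = fullPaths.map(p => p.drop(p.lastIndexOf('/') + 1))
val treeChars = dirs.map(_.length).sum + fileNames.map(_.length).sum
// treeChars < flatChars whenever several files share a directory.
```

The gap grows with the number of files per partition and the depth of the partitioning scheme.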

Does this PR introduce any user-facing change?

Yes. This is a breaking change. Earlier, the stored file paths were complete paths themselves. Now a file path is reconstructed by combining the directory names along the tree, as in /root/dir1/dir2/.../file.parquet. Indexes created before v0.2.0 may not be readable.

How was this patch tested?

Unit tests.

@apoorvedave1 apoorvedave1 requested a review from imback82 August 26, 2020 23:38
@rapoth rapoth modified the milestones: 0.3.0, 0.4.0 Sep 4, 2020

case class Directory(
    name: String,
    var files: Seq[FileInfo] = Seq(),
@sezruby sezruby Sep 4, 2020

I think we could do without var by doing something like the following.

def dfs(files: Seq[String]): Directory = {
  val name = getTopDir(files.head)
  val nextFiles = files.map(_.drop(name.length + SEPARATOR.length))
  // leaf files have no remaining separator (can be done in groupByTopDir)
  val (subDirFiles, leafFiles) = nextFiles.partition(_.contains(SEPARATOR))
  val groupedPaths: Seq[Seq[String]] = groupByTopDir(subDirFiles)
  val subDirs = groupedPaths.map(dfs)
  Directory(name, files = leafFiles, subDirs = subDirs)
}

This is just a suggestion and I'd like to know your opinion on this.

(edited)

def dfs(files: Seq[String], prefixLen: Int, name: String): Directory = {
  val groupedPaths: Map[String, Seq[String]] =
    files.groupBy(p => p.slice(prefixLen, p.indexOf('/', prefixLen)))
  val subDirs = groupedPaths.collect {
    case (k, v) if k.nonEmpty => dfs(v, prefixLen + k.length + 1, k)
  }.toSeq
  Directory(name, files = groupedPaths.getOrElse("", Seq()).map(_.drop(prefixLen)), subDirs = subDirs)
}

@apoorvedave1 apoorvedave1 Sep 4, 2020

thanks @sezruby, I like the general idea of dfs, but there are some string manipulations here which I am not sure about. Also, I think the existing implementation is better performance-wise for creating this structure, as I have mentioned below. But I am OK if others also like this idea and prefer this implementation.

Could you modify this to a Path-based implementation?

Some doubts regarding perf:

  1. files.drop(name.length + SEPARATOR.length) => Do you mean files.map(_.drop(name.length + SEPARATOR.length))?
    If so, there will be many more string operations than we would like (many iterations over the full file list, removing the parent path one level at a time). I guess this would hurt performance badly (let me know if my understanding is wrong).
    This would be O(NM log(NM)) (M = file name length, N = number of files). Worst case: O(NMD), where D is the depth at which all files are present.

  2. val (subDirFiles, leafFiles) = nextFiles.partition(_.contains(SEPARATOR)): This is also bad for performance, because at every level of the directory tree we iterate over the full file list to partition it into leaf files and subdirectories.
    This would be O(NM log(NM)). Worst case: O(NMD), where D is the depth at which all files are present.

  3. String manipulation (drop(n)) is something I don't personally like, but I would be OK with it if the overall implementation performs better.

The current implementation is O(NM + number of unique directories) (N = number of files, M = file path length), so I think it's better performance-wise. If var vs. val is a major concern, we can explore changing Seq to ArrayBuffer so that it's expandable and mutable.

@sezruby sezruby Sep 4, 2020

  1. Yes, it's files.map(_.drop(name.length + SEPARATOR.length)).
  2. This can be done with groupByTopDir.
  3. I agree that string manipulation might be expensive and uncomfortable. But it seems Path.getParent also involves string operations.

Regarding performance, I think this could be done more efficiently by using sorted file paths(?) and prefixLen for the common prefix (though I'm not sure it's practical in Scala). I added an edited version. It might still be slower than the HashMap version, but there should be some benefit from using val.

Contributor Author

if it's mainly about val vs. var, I would suggest we explore using either a Builder pattern or ArrayBuffers instead of immutable Seq(). In both cases we can stick with val and still get the performance of the existing implementation.

The one problematic issue with the suggestion is that it's really depth dependent. If all files are at depth 50, then we iterate over all files 50 times. This makes me feel the benefit of val doesn't outweigh the performance cost.

It's possible that I have not fully understood the implementation. Could you please add an implementation/pseudocode of groupByTopDir?

@sezruby sezruby Sep 4, 2020

Actually, it's not all files. Each file is iterated [its depth] times by groupBy, as the files are partitioned during the traversal. In the edited version there is no groupByTopDir, just groupBy with a mapping function:

 val groupedPaths: Map[String, Seq[String]] =
   files.groupBy(p => p.slice(prefixLen, p.indexOf('/', prefixLen)))

We might use sorted file paths to reduce the cost of groupBy, but I think the edited version is not that bad.

Anyway, it's okay to keep the current version if that's preferable.

cc. @imback82, @pirz could you give a comment for this?

Contributor

Sorry for the delay. I will get to this thread this weekend.

Contributor

I am +1 for using val for this case class.

Regarding the perf concerns, we can address them in a separate PR with a proper benchmark if needed.

@apoorvedave1 apoorvedave1 Sep 8, 2020

Thanks @imback82,
I edited the code to use ListBuffer instead of var. This way we can still avoid multiple iterations over the file list and keep val. Please let me know if this is OK or if dfs is preferred.
cc @sezruby
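A hypothetical sketch of what such a mutable-buffer construction might look like (the names and helper are illustrative, not the merged Hyperspace code): one node per directory is created lazily in a map walk, so every file is visited a single time.

```scala
import scala.collection.mutable

// Sketch only: single pass over the files, one mutable node per directory.
class DirNode(val name: String) {
  val files = mutable.ListBuffer[String]()
  val subDirs = mutable.LinkedHashMap[String, DirNode]()
}

def buildTree(paths: Seq[String]): DirNode = {
  val root = new DirNode("")
  for (p <- paths) {
    val parts = p.split('/').filter(_.nonEmpty)
    // Walk (and lazily create) the directory chain, then append the leaf file.
    val dir = parts.dropRight(1).foldLeft(root) { (node, d) =>
      node.subDirs.getOrElseUpdate(d, new DirNode(d))
    }
    dir.files += parts.last
  }
  root
}
```

Total work is proportional to the combined path length plus the number of unique directories, matching the O(NM + number of unique directories) figure mentioned earlier in the thread, while every field stays a val.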

Contributor

The current approach looks good to me.

@apoorvedave1 apoorvedave1 requested a review from sezruby September 4, 2020 05:10
IndexLogEntry.schemaString(schemaFromAttributes(indexCols ++ includedCols: _*)),
10)),
Content(getIndexDataFilesPath(name).toUri.toString, Seq()),
Content.fromLeafFiles(Seq(indexFile)),
Contributor

nit: does it make sense to test with multiple files (>1) to make sure FileIndex is populated correctly? (e.g., val location = new InMemoryFileIndex(spark, index.content.files, Map(), None))

Contributor Author

thanks @imback82, I updated the tests to check multiple files. Although, as of now, I haven't added a multi-level directory structure to the index files. Please let me know if we can push that for later or need to do it in this PR.

Contributor Author

keeping this comment unresolved pending your suggestion

@imback82 imback82 left a comment

I did one more round of review. The overall approach looks good to me.

@imback82 imback82 left a comment

LGTM, thanks @apoorvedave1!


imback82 commented Sep 9, 2020

@sezruby @pirz Can you take a look one more time since there have been quite a bit of changes since you last reviewed?


pirz commented Sep 9, 2020

LGTM, Thank you @apoorvedave1


sezruby commented Sep 9, 2020

LGTM, thanks for the work!
nit: remove the limitation part of the PR description.

@imback82 imback82 merged commit 3dc7fe8 into microsoft:master Sep 9, 2020
object FileInfo {
  def apply(s: FileStatus): FileInfo = {
    require(s.isFile, s"${FileInfo.getClass.getName} is applicable for files, not directories.")
    FileInfo(s.getPath.getName, s.getLen, s.getModificationTime)
  }
}
Collaborator

@apoorvedave1 While resolving the conflict with my change, I found that s.getPath.getName discards the parent directory information. This might silently lose the location info. What do you think?

Contributor Author

yeah, you are correct. This was a conscious choice: removing the location info removes duplication in the metadata.

Is this causing a problem somehow, or do you foresee one in the future because of this? If so, similar to the Content.files API, we can expose a def Content.fileInfos: Seq[FileInfo] or similar API which adds the location to file names. Alternatively, we can discuss keeping full file paths in the FileInfo object.

Please let me know your thoughts, and we can explore either adding a def Content.fileInfos: Seq[FileInfo] API or updating the FileInfo object to hold full paths.

Collaborator

I added fileInfos() in my PR though it needs to be revised.
https://github.com/microsoft/hyperspace/pull/123/files#r485951598

And I used FileInfo in this way:

val curFileSet = location.allFiles
  .map(f => FileInfo(f.getPath.toString, f.getLen, f.getModificationTime))

The point is, with this apply function, the parent dir info can be removed unintentionally, without an informative function name. Or at least we need a comment here.

Contributor Author

oh I see, you are saying it's semantically incorrect to allow both versions of FileInfo: one with just the name, another with the full file path. Is my understanding correct?

I guess that's a fair point. Ideally we should have two different classes, one with just the name and one with the full file path. E.g.

case class FileInfo(fileName: String, modifiedTime: Long, size: Long)

case class FullPathFileInfo(path: Path, modifiedTime: Long, size: Long)

Please let me know if my understanding is correct. We can create an issue to fix this limitation in another PR.

@imback82 imback82 Sep 10, 2020

How about we just store the full path in the log and just compress the index json file? Looks like we will always be building the full paths when hybrid scan is enabled.

Never mind for now (it's a different discussion). I was a bit concerned about def rec since we could be creating a lot of Path objects, but we can get some numbers first (alternatively, we could simply store the parent's full directory name in Directory if needed, so that we don't have to traverse all the way up).

@imback82 imback82 Sep 11, 2020

I agree that FileInfo.name should only store the final component of the Path (same for Directory.name), and def apply(s: FileStatus): FileInfo can be confusing when the parent directory is stripped away. Looks like we don't really need this helper function and can be explicit about the behavior, i.e., call the constructor? (It is used once in code and many times in tests, but you can define a helper function in IndexLogEntryTest.) Or add appropriate comments.

Collaborator

With lazy val instead of def, the performance seems okay once allFileInfos is built.
On second thought, as FileInfo is defined under Content, which is a tree-based structure, it's reasonable to keep only the file name.

I need FileInfo(full path string, size, modification time) to compare and intersect 2 lists of files. I think it's okay to reuse the current FileInfo for hybrid scan as it's just a simple case class, and we could revisit the structure and API later.

@apoorvedave1 apoorvedave1 deleted the newstructure branch September 9, 2020 17:26

Labels

breaking changes, enhancement
