@bobbai00 bobbai00 commented Feb 10, 2024

This PR introduces the dataset-related table schema design, as well as a class that provides version control using Git on the local file system.

File service with Git Version Control

I added a class called GitVersionControlLocalFileStorageService, which consists of several static methods:

1. File Write and Delete

The following methods write and delete files in a repository and delete a repository directory, using standard Java IO for the file operations and Git for version tracking:

public static void writeFileToRepo(Path repoPath, Path filePath, InputStream inputStream) throws IOException, GitAPIException;

public static void removeFileFromRepo(Path repoPath, Path filePath) throws IOException, GitAPIException;

public static void deleteRepo(Path directoryPath) throws IOException;

For writeFileToRepo and removeFileFromRepo, the changes are staged through JGit's equivalents of git add and git rm.
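
As a rough illustration of the plain Java IO half of these methods, here is a minimal sketch. It is not the code in this PR: the class RepoFileIoSketch and its method names are hypothetical, the check that rejects paths escaping the repository is an added assumption, and the JGit staging step is only indicated in a comment.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not the PR's GitVersionControlLocalFileStorageService.
final class RepoFileIoSketch {

    // Resolve filePath against the repository root and reject anything that
    // points at the root itself or escapes it (for example "../x").
    static Path resolveInsideRepo(Path repoPath, Path filePath) {
        Path root = repoPath.toAbsolutePath().normalize();
        Path target = root.resolve(filePath).normalize();
        if (!target.startsWith(root) || target.equals(root)) {
            throw new IllegalArgumentException("Path is not inside the repository: " + filePath);
        }
        return target;
    }

    // Write the stream to the target file, creating parent directories as needed.
    // In the real service, staging the file through JGit would follow this step.
    static void writeFile(Path repoPath, Path filePath, InputStream in) throws IOException {
        Path target = resolveInsideRepo(repoPath, filePath);
        Files.createDirectories(target.getParent());
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Delete a directory tree bottom-up so each directory is empty when removed.
    static void deleteDirectory(Path directoryPath) throws IOException {
        if (!Files.exists(directoryPath)) {
            return;
        }
        List<Path> deepestFirst;
        try (Stream<Path> paths = Files.walk(directoryPath)) {
            deepestFirst = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : deepestFirst) {
            Files.delete(p);
        }
    }
}
```

Sorting the walked paths in reverse order puts children before their parents, which is why the deletion loop never hits a non-empty directory.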

2. Version Init and Creation

The following methods handle repository initialization and version creation:

public static String initRepo(Path baseRepoPath) throws IOException, GitAPIException;

This method performs a git init using JGit.

public static String createVersion(Path baseRepoPath, String versionName) throws IOException, GitAPIException;

This method performs a git commit -m {versionName} to create the commit that represents the version.

3. Read File/FileTree of a certain version

Since a repository can have multiple versions, files from different versions may be read frequently. To allow these reads to happen concurrently, we need to avoid checking out a commit during reads.

To avoid switching the working directory between commits, I use git show and git ls-tree, passing the commit hash to these commands to read a file or file tree of a specific commit without checking it out.

public static Set<FileNode> retrieveFileTreeOfVersion(Path baseRepoPath, String versionCommitHashVal) throws Exception;

Utilizes git ls-tree to fetch the repository's file tree at a specific commit, parsed into a Set of FileNode objects representing the file hierarchy.

public static void retrieveFileContentOfVersion(Path baseRepoPath, String commitHash, Path filePath, OutputStream outputStream) throws IOException, GitAPIException;

Leverages git show to output the content of a file at a specific commit directly to an OutputStream, facilitating version-specific file content retrieval without altering the working directory's state.
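
To illustrate how a file tree can be built from git ls-tree output, here is a minimal sketch. It assumes the default line format of git ls-tree -r (mode, type, and object name separated by spaces, then a tab, then the path) and paths that git does not need to quote. FileNodeSketch and LsTreeParserSketch are hypothetical names; the actual FileNode class and the Set-based return type in this PR may be structured differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the PR's FileNode class.
final class FileNodeSketch {
    final String name;
    final boolean isDirectory;
    final Map<String, FileNodeSketch> children = new LinkedHashMap<>();

    FileNodeSketch(String name, boolean isDirectory) {
        this.name = name;
        this.isDirectory = isDirectory;
    }
}

final class LsTreeParserSketch {

    // Parses lines shaped like "<mode> <type> <object>\t<path>" into a tree
    // whose root node has an empty name and represents the repository root.
    static FileNodeSketch parse(List<String> lsTreeLines) {
        FileNodeSketch root = new FileNodeSketch("", true);
        for (String line : lsTreeLines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // skip blank or malformed lines
            }
            String[] meta = line.substring(0, tab).split(" ");
            if (meta.length != 3 || !meta[1].equals("blob")) {
                continue; // keep file entries only
            }
            String[] parts = line.substring(tab + 1).split("/");
            FileNodeSketch current = root;
            for (int i = 0; i < parts.length; i++) {
                // Every path segment except the last one is a directory.
                final boolean dir = i < parts.length - 1;
                current = current.children.computeIfAbsent(
                        parts[i], n -> new FileNodeSketch(n, dir));
            }
        }
        return root;
    }
}
```

Because the lines come from a specific commit hash, the resulting tree describes that version regardless of what is currently checked out in the working directory.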

Dataset-related DB schema

Three tables are added:

  • dataset table
CREATE TABLE IF NOT EXISTS dataset
(
    `did`             INT UNSIGNED AUTO_INCREMENT NOT NULL,
    `owner_uid`       INT UNSIGNED NOT NULL,
    `name`            VARCHAR(128) NOT NULL,
    `is_public`       TINYINT NOT NULL DEFAULT 1,
    `storage_path`    VARCHAR(512) NOT NULL,
    `description`     VARCHAR(512) NOT NULL,
    `creation_time`   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY(`did`),
    FOREIGN KEY (`owner_uid`) REFERENCES `user` (`uid`) ON DELETE CASCADE
) ENGINE = INNODB;
  • dataset_user_access table
CREATE TABLE IF NOT EXISTS dataset_user_access
(
    `did`             INT UNSIGNED NOT NULL,
    `uid`             INT UNSIGNED NOT NULL,
    `privilege`       ENUM('NONE', 'READ', 'WRITE') NOT NULL DEFAULT 'NONE',
    PRIMARY KEY(`did`, `uid`),
    FOREIGN KEY (`did`) REFERENCES `dataset` (`did`) ON DELETE CASCADE,
    FOREIGN KEY (`uid`) REFERENCES `user` (`uid`) ON DELETE CASCADE
) ENGINE = INNODB;
  • dataset_version table
CREATE TABLE IF NOT EXISTS dataset_version
(
    `dvid`            INT UNSIGNED AUTO_INCREMENT NOT NULL,
    `did`             INT UNSIGNED NOT NULL,
    `creator_uid`     INT UNSIGNED NOT NULL,
    `name`            VARCHAR(128) NOT NULL,
    `version_hash`    VARCHAR(64) NOT NULL,
    `creation_time`   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY(`dvid`),
    FOREIGN KEY (`did`) REFERENCES `dataset` (`did`) ON DELETE CASCADE
) ENGINE = INNODB;

I introduce a dataset_version table to store version metadata, instead of relying on git commands to enumerate all versions. The reason for this decision is that it reduces the number of system calls (executing git commands), since APIs such as listing the versions of a dataset will be called very frequently.

The relationship between dataset and dataset_version is 1 to N: one dataset can have multiple versions, but each dataset version belongs to exactly one dataset.
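
The 1-to-N lookup can be sketched in memory as follows. This is only an illustration under stated assumptions: DatasetVersionRow and DatasetVersionLookupSketch are hypothetical names, the row mirrors a subset of the dataset_version columns (creator_uid is omitted for brevity), and a real implementation would query the table rather than filter a list.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical in-memory mirror of a dataset_version row.
final class DatasetVersionRow {
    final int dvid;
    final int did;
    final String name;
    final String versionHash;
    final Instant creationTime;

    DatasetVersionRow(int dvid, int did, String name, String versionHash, Instant creationTime) {
        this.dvid = dvid;
        this.did = did;
        this.name = name;
        this.versionHash = versionHash;
        this.creationTime = creationTime;
    }
}

final class DatasetVersionLookupSketch {

    // Roughly the in-memory counterpart of:
    //   SELECT * FROM dataset_version WHERE did = ? ORDER BY creation_time DESC
    // No git command is involved; the commit hash is read from the metadata.
    static List<DatasetVersionRow> versionsOf(int did, List<DatasetVersionRow> rows) {
        return rows.stream()
                .filter(r -> r.did == did)
                .sorted(Comparator.comparing((DatasetVersionRow r) -> r.creationTime).reversed())
                .collect(Collectors.toList());
    }
}
```

The versionHash of a returned row is what would then be passed to the read methods described earlier to fetch that version's files.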

@bobbai00 bobbai00 self-assigned this Feb 10, 2024
@bobbai00 bobbai00 force-pushed the jiadong-introduce-dataset-schema-and-version-control-fs-service branch from f06d2bf to ab44757 Compare February 11, 2024 19:04
@bobbai00 bobbai00 changed the title Add Dataset-related schema and the file system service with Git version control Add Dataset-related relational schemas and the file system service with Git version control Feb 11, 2024

@Yicong-Huang Yicong-Huang left a comment

Thanks for the PR. It looks really good and clean. Some general comments:

  1. Think about thread-safety issues, and see whether we need to make it thread-safe;
  2. Consider support for different operating systems, and make the design general if possible;
  3. Consider using a standard Git library and parsers to reduce maintenance effort.

@bobbai00 bobbai00 force-pushed the jiadong-introduce-dataset-schema-and-version-control-fs-service branch 2 times, most recently from d070e5c to 87108ee Compare February 14, 2024 06:59
@bobbai00 bobbai00 force-pushed the jiadong-introduce-dataset-schema-and-version-control-fs-service branch 2 times, most recently from 1fbae6e to 7cd9a48 Compare February 16, 2024 00:49

@Yicong-Huang Yicong-Huang left a comment

Left comments in code

@bobbai00 bobbai00 force-pushed the jiadong-introduce-dataset-schema-and-version-control-fs-service branch from b560c23 to 98c9b73 Compare February 16, 2024 23:47
@bobbai00 bobbai00 merged commit 8de5c03 into master Feb 16, 2024
@bobbai00 bobbai00 deleted the jiadong-introduce-dataset-schema-and-version-control-fs-service branch February 16, 2024 23:55
@bobbai00 bobbai00 added the ddl-change Changes to the TexeraDB DDL label Mar 6, 2024

Labels

ddl-change Changes to the TexeraDB DDL webserver
