AustrianDataLAB · toms-place · Apr 23, 2023 · Mar 10, 2023 · Mar 10, 2023 · Mar 16, 2023
diff --git a/.gitmodules b/.gitmodules
@@ -0,0 +1,8 @@
+[submodule "execDAT-CLI"]
+	path = execDAT-CLI
+	url = git@github.com:AustrianDataLAB/execDAT-CLI.git
+	branch = main
+[submodule "execDAT-operator"]
+	path = execDAT-operator
+	url = git@github.com:AustrianDataLAB/execDAT-operator.git
+	branch = main
diff --git a/README.md b/README.md
@@ -1,2 +1,16 @@
 # execDAT
-execDAT - remote code execution for research
+
+execDAT - remote code execution for research 
+
+## Getting Started
+
+### Prerequisites
+
+* k3d
+* docker
+
+### Start k3d cluster
+
+```shell
+k3d cluster create -c k3d-dev.yaml
+```
diff --git a/docs/Index.md b/docs/Index.md
@@ -0,0 +1,5 @@
+# execDAT
+
+Table of Contents
+
+- [Value Proposition](./ValueProposition.md)
diff --git a/docs/ValueProposition.md b/docs/ValueProposition.md
@@ -0,0 +1,91 @@
+# execDAT - Value Proposition
+
+## What is the core value being generated?
+
+The goal of this project is to provide researchers with an easy and efficient way to execute and verify scientific evaluations, regardless of whether the required data and code are local or remote.
+
+This will be achieved through the development of a user-friendly tool, which could take the form of a CLI tool, kubectl plugin, or API endpoint. The tool will enable easy remote execution of code that requires access to remote or local datasets, simplifying the research process and reducing the technical barriers to entry.
+
+## Team
+
+### Project owner / Deputy owner
+
+DAT Team
+
+### Team members
+
+Daniel Hofstätter, Alexander Woda and Thomas Weber
+
+## Problem Space
+
+Why are we doing this? How do we judge success?
+
+### Problem statement
+
+Researchers face difficulties executing code that requires access to remote or local datasets.
+
+Specifically, running scientific evaluations exclusively on local machines can be problematic because datasets are often large and the code may depend on a particular operating system or hardware. Furthermore, coupling code to local hardware limits parallel execution, resulting in long evaluation and iteration times.
+
+### Impact of this problem
+
+The impact of the problem is that it can slow down the progress of research and create barriers to entry for researchers with limited technical expertise. The manual setup and management of research environments can be time-consuming, distracting, and prone to errors. This can limit the ability of researchers to explore and analyze data, and ultimately, hinder the development of new scientific insights and breakthroughs. The impact is especially significant in fields such as data science and machine learning, where access to large and complex datasets is crucial for research.
+
+For example, imagine a scientific paper has been published, or is about to be published, and reviewers want to verify its results, perhaps even on different datasets. Downloading gigabytes of data or spending hours of runtime on limited hardware slows down the review process.
+
+### Who is the customer / target audience?
+
+The target audiences for the proposed software tool are researchers and scientists who require access to remote or local datasets for their research. This includes researchers in fields such as data science, machine learning, and other areas that require extensive data analysis.
+
+For example:
+
+* Everyone interested in research, but with an initial scope limited to universities (professors, students, etc.)
+* Universities that host our service and provide access to their staff
+* Research teams at any organization
+
+### Criteria for Success
+
+We provide simplicity of execution, reusability of environments, verifiable validity of results, and asynchronous evaluation. Our solution is to create a user-friendly software tool that simplifies the process, reduces time and effort, and allows researchers to focus on their research questions.
+
+According to these goals, we define the following criteria:
+
+* Usability: One simple function call should be enough.
+* Scalability: Multiple users should be able to run evaluations in parallel.
+* Flexibility: The tool should support multiple languages and a variety of operating systems.
+* Repeatability: Different users should get the same results for the same evaluation.
+
+## MVP
+
+### What needs to be true in order for a prototype to be ready for release?
+
+We can ship an MVP to researchers and scientists as soon as we have the following MUST-HAVEs:
+
+#### Functional MUST-HAVEs
+
+* Remote code execution: The tool enables remote execution of code that requires access to remote datasets.
+
+* Parallel execution support: The tool supports parallel execution of scientific evaluations to reduce evaluation and iteration times.
+
+* Result size: Result files of up to one gigabyte are supported.
+* User interaction: Only one CLI command and one config file are needed to run a job.
+  E.g., `execDAT <src_code_dir>` or `kubectl apply -f <spec_file>`
+
+#### Non-Functional MUST-HAVEs
+
+* Flexibility: We support at least two different environment configurations.
+* Scalability: The tool scales to at least two concurrent users, each running at least two jobs.
+* Validity of results: Two users executing with the same configuration file get the same result.
+
+### What crucial factors are we missing?
+
+* Definition of work packages
+* Technical overview diagram
+* Cluster access
+
+### What is the key question we would ask to understand if we are on the right track?
+
+* Do we simplify the research process?
+* In a side-by-side comparison, do users prefer our service over running the task locally on their own hardware?
+
+### Who are the alpha testers that we can use for validating our assumptions?
+
+DAT Team
diff --git a/docs/adrs/.gitignore b/docs/adrs/.gitignore
diff --git a/docs/adrs/2023-03-16-public-vs-private-data.md b/docs/adrs/2023-03-16-public-vs-private-data.md
@@ -0,0 +1,18 @@
+# Decide if source code and data of users can be private
+
+Date: 2023-03-16
+
+## Status
+__ACCEPTED__
+
+## Context
+
+The jobs need to access the data and source code of the user in order to create the image and run the task. Private repositories need additional user authentication whereas public ones don't.
+
+## Decision
+
+For now, we only allow public code repositories and data sources. This means that the code and data of the user are public. This is the easiest way to implement the jobs. We can always change this later.
+
+## Consequences
+
+This means that users have to make their code and data public. We expect this to be acceptable because users intend to publish their code and data anyway. The user can always make the code and data private later.
diff --git a/docs/adrs/2023-03-23-branch-naming.md b/docs/adrs/2023-03-23-branch-naming.md
@@ -0,0 +1,25 @@
+# Branch Naming
+
+Date: 2023-03-23
+
+## Status
+__PROPOSED__
+
+## Context
+We want a unified naming scheme for branches. So far, no concrete scheme has been decided on; we discussed the `feature/`, `features/` and `issue/` prefixes for branches besides `dev` and `main`.
+
+## Decision
+We decided on the following names:
+* `main` for the main branch
+* `dev` for the development branch
+* `feature/` for all branches that implement a new functionality or feature
+* `issue/` for all branches that are concerned with a bug fix or issue
+* `testing/` for all branches that fit neither `feature/` nor `issue/`
+
+We researched pre-commit hooks to enforce this; however, they would require a local installation of the CLI tool, and we do not want the added tool requirements and complexity.
+
+Instead, we will use __branch protection rules__ to pattern-match all other names and lock the corresponding branches. This should provide some degree of enforcement.
+
+## Consequences
+* developers need to adhere to the naming scheme for branches
+* tighter control over the branch protection rules because we only have a small set of legal names
diff --git a/docs/adrs/2023-03-23-cicd-solution.md b/docs/adrs/2023-03-23-cicd-solution.md
@@ -0,0 +1,19 @@
+# CI/CD Solution
+
+Date: 2023-03-23
+
+## Status
+
+__PROPOSED__
+
+## Context
+
+We need to decide on a CI/CD solution for our project so that we can automate certain tasks, e.g., building, testing and releasing.
+
+## Decision
+
+Candidate CI/CD platforms include GitHub Actions, Tekton, Jenkins and Argo CD. Solutions like Tekton and Argo CD are built on Kubernetes, are cloud-native and platform agnostic. GitHub Actions workflows are much simpler, have predefined workflow steps and only require a YAML file for configuration. We decided to use GitHub Actions workflows because we already use GitHub for related tasks, such as branch naming and protection rules, and therefore have all of our configuration in one place. Additionally, GitHub Actions is much easier to set up, and there are many existing YAML configurations we can build on.
+
+## Consequences
+
+By choosing GitHub Actions workflows over running custom workflows in a Kubernetes environment with other solutions, we get a much simpler setup. However, we are also more limited in what we can do, since a custom Kubernetes cluster offers more options.
diff --git a/docs/adrs/2023-03-DD-regsitry-solution.md b/docs/adrs/2023-03-DD-regsitry-solution.md
@@ -0,0 +1,26 @@
+# Title
+
+Date: YYYY-MM-DD
+
+## Status
+
+What is the status of the ADR?
+
+Possible options:
+* __PROPOSED__
+* __ACCEPTED__
+* __REJECTED__
+* __DEPRECATED__ (include reference to the superseding ADR)
+* __SUPERSEDED__ (include reference to the deprecating ADR)
+
+## Context
+
+What is the context of this ADR? What is the issue that we are seeing? What is motivating this decision or change?
+
+## Decision
+
+What is the change we are proposing? What do we plan on doing to solve the issue?
+
+## Consequences
+
+What are the consequences of the change? What will be more difficult? What will be easier?
diff --git a/docs/adrs/2023-03-DD-storage-bucket-solution.md b/docs/adrs/2023-03-DD-storage-bucket-solution.md
@@ -0,0 +1,26 @@
+# Title
+
+Date: YYYY-MM-DD
+
+## Status
+
+What is the status of the ADR?
+
+Possible options:
+* __PROPOSED__
+* __ACCEPTED__
+* __REJECTED__
+* __DEPRECATED__ (include reference to the superseding ADR)
+* __SUPERSEDED__ (include reference to the deprecating ADR)
+
+## Context
+
+What is the context of this ADR? What is the issue that we are seeing? What is motivating this decision or change?
+
+## Decision
+
+What is the change we are proposing? What do we plan on doing to solve the issue?
+
+## Consequences
+
+What are the consequences of the change? What will be more difficult? What will be easier?
diff --git a/docs/adrs/YYYY-MM-DD-template.md b/docs/adrs/YYYY-MM-DD-template.md
@@ -0,0 +1,26 @@
+# Title
+
+Date: YYYY-MM-DD
+
+## Status
+
+What is the status of the ADR?
+
+Possible options:
+* __PROPOSED__
+* __ACCEPTED__
+* __REJECTED__
+* __DEPRECATED__ (include reference to the superseding ADR)
+* __SUPERSEDED__ (include reference to the deprecating ADR)
+
+## Context
+
+What is the context of this ADR? What is the issue that we are seeing? What is motivating this decision or change?
+
+## Decision
+
+What is the change we are proposing? What do we plan on doing to solve the issue?
+
+## Consequences
+
+What are the consequences of the change? What will be more difficult? What will be easier?