Merged

Dev #13

29 commits
7ded4a1
added cli submodule
Naexon Mar 10, 2023
2c8581d
Merge pull request #1 from AustrianDataLAB/feature/cli-submodule
Naexon Mar 10, 2023
551b840
use dev branch of execDAT-CLI
toms-place Mar 16, 2023
708cee2
Merge pull request #2 from AustrianDataLAB:issue/move-submodules-branch
toms-place Mar 16, 2023
ce0acc6
use main branch for checkout of cli in main
toms-place Mar 16, 2023
d5a6432
Merge pull request #3 from AustrianDataLAB:issue/inital-setup
toms-place Mar 16, 2023
41983f4
add index and valueproposition
toms-place Mar 16, 2023
eff6fa7
Merge pull request #4 from AustrianDataLAB:issue/valueproposition
toms-place Mar 16, 2023
514bcf7
Template for the ADRs
Naexon Mar 16, 2023
0ba0a9b
add operator as submodule
toms-place Mar 16, 2023
6c5895b
add architecture and local k3d setup
toms-place Mar 16, 2023
1cfb11e
Merge pull request #5 from AustrianDataLAB:issue/operator
toms-place Mar 16, 2023
e13307a
Template for the ADRs (#6)
Naexon Mar 16, 2023
1c75569
Merge remote-tracking branch 'origin/dev' into feature/ADRs
Naexon Mar 16, 2023
52e8a32
public vs private ADR
Mar 16, 2023
af29327
Merge pull request #9 from AustrianDataLAB:feature/ADRs
Sokadyn Mar 16, 2023
190d664
testing for gpg signing
Naexon Mar 21, 2023
40b93d7
accepted public-vs-private-data ADR
Naexon Mar 23, 2023
edc039c
proposed branch-naming ADR
Naexon Mar 23, 2023
707f428
fix: use the port of the devel (#7)
toms-place Mar 23, 2023
215aff6
Start CICD ADR
Mar 23, 2023
01afee7
test signed commit
Mar 23, 2023
d76f550
Fix commit sign
Sokadyn Mar 23, 2023
9f465ad
Finish CICD ADR
Sokadyn Mar 23, 2023
969b92b
Some small changes to cicd ADR
Sokadyn Mar 25, 2023
d749bd2
Merge pull request #11 from AustrianDataLAB:feature/ADRs
Sokadyn Mar 25, 2023
861a225
Start lecture 9 readme
Sokadyn Apr 23, 2023
b18bf08
finish lecture 9
Sokadyn Apr 23, 2023
8e1ce9c
Merge pull request #12 from AustrianDataLAB/lecture/9
Sokadyn Apr 23, 2023
8 changes: 8 additions & 0 deletions .gitmodules
@@ -0,0 +1,8 @@
[submodule "execDAT-CLI"]
path = execDAT-CLI
url = git@github.com:AustrianDataLAB/execDAT-CLI.git
branch = main
[submodule "execDAT-operator"]
path = execDAT-operator
url = git@github.com:AustrianDataLAB/execDAT-operator.git
branch = main
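The submodule entries above can be read back with plain `git config`. The following is a minimal sketch and not part of this PR; the helper name `list_submodule_urls` is an assumption made for illustration.

```shell
# Print "<name> <url>" for every submodule declared in a .gitmodules file.
# list_submodule_urls is a hypothetical helper name, not part of the repository.
list_submodule_urls() {
    gitmodules_file="${1:-.gitmodules}"
    git config --file "$gitmodules_file" --get-regexp '^submodule\..*\.url$' |
        sed -e 's/^submodule\.//' -e 's/\.url / /'
}

# After cloning the parent repository, the submodules themselves are usually
# fetched with:
#   git submodule update --init --recursive
```

Because both URLs use the SSH form (`git@github.com:...`), fetching the submodules presumably requires an SSH key registered with GitHub.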
16 changes: 15 additions & 1 deletion README.md
@@ -1,2 +1,16 @@
# execDAT

execDAT - remote code execution for research

## Getting Started

### Prerequisites

* k3d
* docker

### Start k3d cluster

```shell
k3d cluster create -c k3d-dev.yaml
```
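Before creating the cluster, it can help to confirm that both prerequisites are installed. This is a small sketch under the assumption that `k3d` and `docker` are expected on `PATH`; `missing_tools` is a hypothetical helper name and not part of the repository.

```shell
# Print the name of every tool from the argument list that is not on PATH.
# missing_tools is a hypothetical helper name used only in this sketch.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: check the two prerequisites listed above, then create the cluster.
#   missing=$(missing_tools k3d docker)
#   [ -z "$missing" ] && k3d cluster create -c k3d-dev.yaml
```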
5 changes: 5 additions & 0 deletions docs/Index.md
@@ -0,0 +1,5 @@
# execDAT

Table of Contents

- [Value Proposition](./ValueProposition.md)
91 changes: 91 additions & 0 deletions docs/ValueProposition.md
@@ -0,0 +1,91 @@
# execDAT - Value Proposition

## What is the core value being generated?

The goal of this project is to provide researchers with an easy and efficient way to execute and verify scientific evaluations, regardless of whether the required data and code are local or remote.

This will be achieved through the development of a user-friendly tool, which could take the form of a CLI tool, kubectl plugin, or API endpoint. The tool will enable easy remote execution of code that requires access to remote or local datasets, simplifying the research process and reducing the technical barriers to entry.

## Team

### Project owner / Deputy owner

DAT Team

### Team members

Daniel Hofstätter, Alexander Woda and Thomas Weber

## Problem Space

Why are we doing this? How do we judge success?

### Problem statement

Researchers face difficulties executing code that requires access to remote or local datasets.

In particular, running scientific evaluations exclusively on local machines can be problematic because datasets are large and the code may depend on a specific operating system or hardware. Furthermore, coupling code to local hardware limits parallel execution, which leads to long evaluation and iteration times.

### Impact of this problem

The impact of the problem is that it can slow down the progress of research and create barriers to entry for researchers with limited technical expertise. The manual setup and management of research environments can be time-consuming, distracting, and prone to errors. This can limit the ability of researchers to explore and analyze data, and ultimately, hinder the development of new scientific insights and breakthroughs. The impact is especially significant in fields such as data science and machine learning, where access to large and complex datasets is crucial for research.

For example, imagine a scientific paper that has been published, or is about to be published, and reviewers who want to verify its results, perhaps even on different datasets. Downloading gigabytes of data or spending hours of runtime on limited hardware slows down the review process.

### Who is the customer / target audience?

The target audiences for the proposed software tool are researchers and scientists who require access to remote or local datasets for their research. This includes researchers in fields such as data science, machine learning, and other areas that require extensive data analysis.

For example:

* Everyone interested in research, with an initial scope limited to universities (professors, students, etc.)
* Universities that host our service and provide access to their staff
* Research teams at any organization

### Criteria for Success

We aim to provide simple execution, reusable environments, provably valid results, and an asynchronous evaluation process. Our solution is a user-friendly software tool that simplifies the process, reduces time and effort, and allows researchers to focus on their research questions.

According to these goals, we define the following criteria:

* Usability: One simple function call should be enough.
* Scalability: Multiple users should be able to do evaluations in parallel.
* Flexibility: Should support multiple languages and a variety of operating systems.
* Repeatability: Different users should get the same results for the same evaluation.

## MVP

### What needs to be true in order for a prototype to be ready for release?

We can ship an MVP to researchers and scientists as soon as we have the following MUST-HAVEs:

#### Functional MUST-HAVEs

* Remote code execution: The tool enables remote execution of code that requires access to remote datasets.
* Parallel execution support: The tool supports parallel execution of scientific evaluations to reduce evaluation and iteration times.
* Result size: A result file of at most one gigabyte is supported.
* User interaction: Only one CLI command and one config file are needed to run a job, e.g., `execDAT <src_code_dir>` or `kubectl apply -f <spec_file>`.
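The result size limit above could be checked with a few lines of shell. This is only a sketch: the document does not say whether "one Gigabyte" means 10^9 or 2^30 bytes, so 10^9 is assumed here, and `result_within_limit` is a hypothetical helper name.

```shell
# Assumption: "one Gigabyte" is read as 10^9 bytes; the document does not say
# whether 10^9 or 2^30 is meant.
MAX_RESULT_BYTES=1000000000

# Succeed if the file exists and is no larger than the limit (second argument,
# defaulting to MAX_RESULT_BYTES). Returns 2 if the file does not exist.
# result_within_limit is a hypothetical helper name.
result_within_limit() {
    result_file="$1"
    limit="${2:-$MAX_RESULT_BYTES}"
    [ -f "$result_file" ] || return 2
    size=$(wc -c < "$result_file" | tr -d '[:space:]')
    [ "$size" -le "$limit" ]
}
```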

#### Non-Functional MUST-HAVEs

* Flexibility: We support at least two different environment configurations.
* Scalability: Scales to at least two users each having at least two jobs running.
* Validity of results: Two users executing with the same configuration file get the same result.
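The validity criterion can be illustrated by comparing result files byte for byte. The sketch below assumes that each run writes its result to a single file; `all_results_equal` is a hypothetical helper name, and byte equality is only one possible definition of "the same result".

```shell
# Succeed only if every given result file is byte-identical to the first one.
# all_results_equal is a hypothetical helper name used only in this sketch.
all_results_equal() {
    reference="$1"
    shift
    for candidate in "$@"; do
        cmp -s "$reference" "$candidate" || return 1
    done
    return 0
}
```

For example, `all_results_equal user_a/result.out user_b/result.out` (hypothetical paths) would succeed only when both users obtained identical output.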

### What crucial factors are we missing?

* Definition of work packages
* Technical Overview Diagram
* Cluster Access

### What is the key question we would ask to understand if we are on the right track?

* Do we simplify the research process?
* In a side-by-side comparison, do users prefer our service over running the task locally on their own hardware?

### Who are the alpha testers that we can use for validating our assumptions?

DAT Team
Empty file added docs/adrs/.gitignore
Empty file.
18 changes: 18 additions & 0 deletions docs/adrs/2023-03-16-public-vs-private-data.md
@@ -0,0 +1,18 @@
# Decide if source code and data of users can be private.

Date: 2023-03-16

## Status
__ACCEPTED__

## Context

The jobs need to access the data and source code of the user in order to create the image and run the task. Private repositories need additional user authentication whereas public ones don't.

## Decision

For now, we only allow public code repositories and data sources. This means that the code and data of the user are public. This is the easiest way to implement the jobs. We can always change this later.

## Consequences

This means that the user has to make the code and data public. This is not a problem for the user, because the user wants to publish the code and data anyway. The user can always make the code and data private later.
25 changes: 25 additions & 0 deletions docs/adrs/2023-03-23-branch-naming.md
@@ -0,0 +1,25 @@
# Branch Naming

Date: 2023-03-23

## Status
__PROPOSED__

## Context
We want a unified naming scheme for branches. So far, no concrete scheme has been decided on; we discussed the `feature/`, `features/` and `issue/` prefixes for branches other than `dev` and `main`.

## Decision
We decided on the following names:
* `main` for the main branch
* `dev` for the development branch
* `feature/` for all branches that implement a new functionality or feature
* `issue/` for all branches that are concerned with a bug fix or issue
* `testing/` for all branches that fit neither `feature/` nor `issue/`

We looked into pre-commit hooks to enforce this. However, they would require a local installation of the CLI tool, and we do not want the added tool requirements and complexity.

Instead, we will use the __branch protection rules__ to pattern match all other names and lock the corresponding branches. This should provide some degree of enforcement.

## Consequences
* developers need to adhere to the naming scheme for branches
* tighter control over the branch protection rules because we only have a small set of legal names
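The naming scheme itself is easy to express as a pattern match. The sketch below is a local check for illustration only; it is not the branch protection mechanism chosen in this ADR, and `is_valid_branch_name` is a hypothetical helper name.

```shell
# Succeed if the branch name follows the scheme from the Decision section:
# exactly "main" or "dev", or a non-empty name after one of the three prefixes.
# is_valid_branch_name is a hypothetical helper name used only in this sketch.
is_valid_branch_name() {
    case "$1" in
        main|dev) return 0 ;;
        feature/?*|issue/?*|testing/?*) return 0 ;;
        *) return 1 ;;
    esac
}
```

With this check, `feature/cli-submodule` passes, while `features/cli-submodule` and a bare `feature/` are rejected.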
19 changes: 19 additions & 0 deletions docs/adrs/2023-03-23-cicd-solution.md
@@ -0,0 +1,19 @@
# CI/CD Solution

Date: 2023-03-23

## Status

__PROPOSED__

## Context

We need to decide on a CI/CD solution for our project so that we can automate tasks such as building, testing, and releasing.

## Decision

Some choices for our CI/CD platform would be GitHub Actions, Tekton, Jenkins or Argo CD. Solutions like Tekton or Argo CD are built on Kubernetes, are cloud-native and platform agnostic. GitHub Actions workflows are much simpler, have predefined workflow steps and only require a YAML file for configuration. We decided to use GitHub Actions workflows because we already use GitHub for other related tasks, such as branch naming and protection rules, and therefore have all of our configuration in one place. Additionally, GitHub Actions is much easier to set up, and there are many existing YAML configurations we can build on.

## Consequences

By choosing GitHub Actions workflows over running custom workflows in a Kubernetes environment with one of the other solutions, we get a much simpler setup. However, we are also more limited, since a custom Kubernetes cluster offers more options.
26 changes: 26 additions & 0 deletions docs/adrs/2023-03-DD-regsitry-solution.md
@@ -0,0 +1,26 @@
# Title

Date: YYYY-MM-DD

## Status

What is the status of the ADR?

Possible options:
* __PROPOSED__
* __ACCEPTED__
* __REJECTED__
* __DEPRECATED__ (include reference to the superseding ADR)
* __SUPERSEDED__ (include reference to the deprecating ADR)

## Context

What is the context of this ADR? What is the issue that we are seeing? What is motivating this decision or change?

## Decision

What is the change we are proposing? What do we plan on doing to solve the issue?

## Consequences

What are the consequences of the change? What will be more difficult? What will be easier?
26 changes: 26 additions & 0 deletions docs/adrs/2023-03-DD-storage-bucket-solution.md
@@ -0,0 +1,26 @@
# Title

Date: YYYY-MM-DD

## Status

What is the status of the ADR?

Possible options:
* __PROPOSED__
* __ACCEPTED__
* __REJECTED__
* __DEPRECATED__ (include reference to the superseding ADR)
* __SUPERSEDED__ (include reference to the deprecating ADR)

## Context

What is the context of this ADR? What is the issue that we are seeing? What is motivating this decision or change?

## Decision

What is the change we are proposing? What do we plan on doing to solve the issue?

## Consequences

What are the consequences of the change? What will be more difficult? What will be easier?
26 changes: 26 additions & 0 deletions docs/adrs/YYYY-MM-DD-template.md
@@ -0,0 +1,26 @@
# Title

Date: YYYY-MM-DD

## Status

What is the status of the ADR?

Possible options:
* __PROPOSED__
* __ACCEPTED__
* __REJECTED__
* __DEPRECATED__ (include reference to the superseding ADR)
* __SUPERSEDED__ (include reference to the deprecating ADR)

## Context

What is the context of this ADR? What is the issue that we are seeing? What is motivating this decision or change?

## Decision

What is the change we are proposing? What do we plan on doing to solve the issue?

## Consequences

What are the consequences of the change? What will be more difficult? What will be easier?
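The ADR files in this PR follow a `<date>-<slug>.md` naming pattern (for example `2023-03-23-branch-naming.md`). A new ADR file name could be derived from a date and a title as sketched below; `adr_filename` is a hypothetical helper name, and the slug rule (lowercase, non-alphanumeric runs replaced by a hyphen) is an assumption inferred from the existing file names.

```shell
# Build an ADR file name of the form <date>-<slug>.md from a date and a title.
# The slug is the lowercased title with every run of non-alphanumeric
# characters replaced by a single hyphen.
# adr_filename is a hypothetical helper name used only in this sketch.
adr_filename() {
    adr_date="$1"
    adr_title="$2"
    slug=$(printf '%s' "$adr_title" |
        tr '[:upper:]' '[:lower:]' |
        tr -cs 'a-z0-9' '-' |
        sed -e 's/^-//' -e 's/-$//')
    printf '%s-%s.md\n' "$adr_date" "$slug"
}

# Example: start a new ADR from the template.
#   cp docs/adrs/YYYY-MM-DD-template.md \
#      "docs/adrs/$(adr_filename "$(date +%F)" "Registry solution")"
```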