GitHub - oceanoverflow/alphadb: distributed relational database

Branches Tags
Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
consensus		consensus
db		db
index/b+tree		index/b+tree
io		io
network		network
sql		sql
table		table
transaction		transaction
README		README
configure		configure
main.cc		main.cc
makefile		makefile
Repository files navigation

----------------------------------------------------------------------------------------------------
CONSENSUS:
----------------------------------------------------------------------------------------------------

CAP theorem

In theoretical computer science, the CAP theorem, also named Brewer's theorem after computer scientist Eric Brewer,
states that it is impossible for a distributed data store to simultaneously provide more than two out of the follow-
ing three guarantees:
    * Consistency: Every read receives the most recent write or an error
    * Availability: Every request receives a (non-error) response - without the guarantee that it contains the most 
                    recent write
    * Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped
                    (or delayed) by the network between nodes
In particular, the CAP theorem implies that in the presence of a network partition, one has to choose between consi-
stency and availability. Note that consistency as defined in the CAP theorem is quite different from the consistency
guaranteed in ACID database transactions.

Explanation:

No distributed system is safe from network failures, thus network partitioning generally has to be tolerated. In the
presence of a partition, one is then left with two options: consistency or availability. When choosing consistency
over availability, the system will always process the query and try to return the most recent available version of
the information, even if it cannot gurantee it is up to date due to network partitioning.
In the absence of network failure - that is, when the distributed system is running normally - both availability and
consistency can be satisfied.
CAP is frequently misunderstood as if one has to choose to abandon one of the three guarantees at all times. In fact
, the choice is really between consistency and availability only when a network partition or failure happens; at all
other times, no trade-off has to be made.
Database systems designed with traditional ACID guarantees in mind such as RDBMS choose consistency over availability
, whereas system designed around the BASE philosophy, common in the NoSQL movement for example, choose availability
over consistency.
The PACELC theorem builds on CAP by stating that even in the absence of partitioning, anther trade-off between laten-
cy and consistency occurs.

History

According to University of California, Berkeley computer scientist Eric Brewer, the theorem first appeared in autumn
1998. It was published as the CAP principle in 1999 and presented as a conjecture by Brewer at the 2000 Symposium
on Principles of Distributed Computing (PODC). In 2002, Seth Gilbert and Nancy Lynch of MIT published a formal proof
of Brewer's conjecture, rendering it a theorem.
In 2012, Brewer clarified some of his positions, including why the often-used "two out of three" concept can be mis-
leading or misapplied, and the different definition if consistency used in CAP relative to the one used in ACID.
A similar theorem stating the trade-off between consistency and availability in distributed systems was published
by Birman and Friedman in 1996. The result of Birman and Friedman restricted this lower bound to non-commuting ope-
rations.

                          +--------+
                          |raft    |
                          |Follower|
                          +--------+
                          / storage \
                         /  engine!  \
               +--------+-------------+--------+
               |raft    |             |raft    |
               |Leader  |             |Follower|
               +--------+             +--------+
                    ^
                    |
                    |
                    |
+------+            | 
|client|------------+
+------+


replicate then reply || reply then replicate ?
for read  ==> should be reply then replicate
for write ==> should be replicate then reply
                                                                              ______ raft follower1
                                                                             /
                                                                            /               
  client -----> server(s1) -----> state machine (db) -----> raft leader(s2)
                                                                            \
                                                                             \______ raft follower2

----------------------------------------------------------------------------------------------------
BUFFER:
----------------------------------------------------------------------------------------------------

+----+----+----+----+----+----+
|    |    |    |    |    |    |
+----+----+----+----+----+----+
|    |    |    |    |    |    |
+----+----+----+----+----+----+
|    |    |    |    |    |    |
+----+----+----+----+----+----+
|    |    |    |    |    |    |
+----+----+----+----+----+----+
|    |    |    |    |    |    |
+----+----+----+----+----+----+

----------------------------------------------------------------------------------------------------
TABLE:
----------------------------------------------------------------------------------------------------

   page_id 
      |                          +----- slot_id
      |            base --+      |    
      v                   v      v               4092B
      +----+--------------+------+------+------+------+
      | Nt |table_page_hdr| ptr0 | ptr1 | ptr2 | ptr3 |
      +----+-+--------+---+--+---+------+------+------+
      | ptr4 | .....  | ptrX |<|                      |
      +---+--+--------+---+--+                        |
      |   |               |                           |
      |   |  +------------+                           |
      |   |  |                                        |
      |   +--+--------------+                         |
      |      |              |                  |      |
      |      v              v                  v      |
      |      +--------+-----+------------------+------+
      |    |>|(tupleX)| ... |>    (tuple4)     |tuple3|
      +------+-+------+-----+---------+--------+------+
      |(tuple2)|>     (tuple1)        |>   (tuple0)   |  
      +--------+----------------------+---------------+
                           largest  <--------  smallest
                       tuple_len[i] = ptr[i] - ptr[i-1]
            for example: tuple_len[4] = ptr[3] - ptr[4]
  special case: tuple_len[0] = INDEX_PAGE_SIZE - ptr[0]

----------------------------------------------------------------------------------------------------
INDEX:
----------------------------------------------------------------------------------------------------

Disk structure:

platter -> track -> sector (block=4096Bytes)==> addressable <track_no, sector_no> offset

page layout: 

degree_: N
page_size_ = (4KB-4B)
degree can also be calculated: N = std::floor( page_size_ / (sizeof(RID) + sizeof(KEY)) )

for [internal] node:
KEY[n-1] is a PSEUDO key, RID<page, slot>, for interval node, just p works, s is not important. <===> RID<p, -1>  

         RID0 < KEY0 <= RID1 < KEY1 <= RID2 < KEY2 <= RID3 < KEY3 <= ...... < KEY[n-2] <= RID[n-1] < (<<<<<<<<<<KEY[n-1]>>>>>>>>>)p
          |              |              |              |                                    |
p{0}<-----+              |              |              |                                    +------>p{n-1}
                         |              |              |
                p1<------+              |              +------>p{3}
                                        |
                                        v
                                      p{2}

for   {leaf}   node:
RID[n-1] is the pointer to the next leaf, KEY[n-1] is also a PSEUDO key,  
other RIDi points to the record in the table, and KEYi is the index key, here i < [n-1]

       (RID0, KEY0) <= (RID1, KEY1) <= (RID2, KEY2) <= (RID3, KEY3)  ........... <= (RID[n-1], KEY[n-1]) <= (RID[n-1], KEY[n-1]x)      +------------------------->***nextleaf***
          |               |               |               |                              |                      |                      |
          |               |               |               |                              |                      +----------------------+
          |               |               |               |                              v
          |               |               |               v                       (p{n-2},s{n-2})
          |               |               v            (p3,s3)
          |               v            (p2,s2)
          v            (p1,s1)
       (p0,s0)

                                                logical view:
+--------------+--------------+--------------+--------------+-------------------------------------+----------------------+
| (RID0, KEY0) | (RID1, KEY1) | (RID2, KEY2) | (RID3, KEY3) | ....................................| (RID[n-1], KEY[n-1]) |
+--------------+--------------+--------------+--------------+-------------------------------------+----------------------+



                                                physical view:
+---+------+------+------+----------+------+------+------+----------+------------------------------------------------------+
| n | RID0 | RID1 |......| RID[n-1] | KEY0 | KEY1 |......| KEY[n-1] |                     free_space                       |
+---+------+------+------+----------+------+------+------+----------+------------------------------------------------------+

   

                              +--------------+
                              |p0|k0|p1|k1|p2|    internal_node
                              +--------------+
                             /       |        \
                            /        |         \
                           /         v          \
           +--------------+   +--------------+   +--------------+
 first --> |r0|k0|r1|k1|n2|-->|r3|k3|r4|k4|n2|-->|r5|k5|r6|k6|n3| --> last (nullptr)  leaf_node
           +--------------+   +--------------+   +--------------+
             |     |            |     |            |     |
             |     |            |     |            |     |
             v     v            v     v            v     v
            ***   ***          ***   ***          ***   ***
     
     degree = 7  max_node = 6; min_node = 3

   1.  node->move_half_to(node* receiver)         [when splitting]  
           begin:  |*******|             end: |****| -> |***|                  ***|||insert|||*** 
   2.  node->move_all_to(node* receiver)          [when the node is underflow]   
           begin:  |****| <- |**|        end: |*****|                          ***|||delete|||*** coalesce
   3.  node->move_first_to_end_of(node* receiver) [when the node is overflow, and left neighbor is not]  
           begin:  |*****| <- |*******|  end: |******| <-> |******|            ***|||insert|||*** redistribute
   4.  node->move_last_to_front_of(node* receiver) [same as last one, but this time the right neighbor is not] 
           begin:  |*******| -> |*****|  end: |******| <-> |******|            ***|||insert|||*** redistribute

----------------------------------------------------------------------------------------------------
SQL:
----------------------------------------------------------------------------------------------------

lex is a scanner generator.
    input is a set of regular expressions and associated actions (written in C)
    output is a table-driven scanner (lex.yy.c)

flex: an open source implementation of the original UNIX lex utility

lex: semantic analysis 
    -> splits the input file into tokens
yacc: yet another compiler compiler 
    -> parses and does semantic processing on the stream of tokens produced by lex
bison: GNU parser parser, upward compatibility with yacc.

            lex input                             example(.l) exl.l

        FIRST PART(optional)                   %{ #include "myscanner.h" %}
        %%                                     %%
        keyword first!!!                       "select"        select();
        pattern      action                    "hello world"   printf("GOODBYE\n")
        ....(regular expressions)              :               return COLON;
        %%                                     %%
        THIRD PART                             int yywrap(void) { return 1; }

             yacc input                        
        FIRST PART                              %{ C declarations; yacc definitions;  %}
        %%                                      %%
        production   action                     statements: statement {printf("statement");}
        ....                                        | statement statements {printf("statements");}
                                                    ;
        %%                                      %%
        THIRD PART

$ yacc -d sql.y                // generate y.tab.h && y.tab.c
$ lex sql.l                    // process the lex file to generate a scanner (gets saved as lex.yy.c)
$ gcc lex.yy.c y.tab.c -o sqls // compile the scanner and grab main() from the lex library (-ll option)
$ ./a.out                      // run the scanner taking input from standard input

              lex pattern examples
+-----------+-------------------------------------------------------------------------------+
|abc        | match the string "abc"                                                        |
+-----------+-------------------------------------------------------------------------------+
|[a-zA-Z]   | match any lower or uppercase letter                                           |
+-----------+-------------------------------------------------------------------------------+
|dog.*cat   | match any string starting with dog, and ending with cat.                      |
+-----------+-------------------------------------------------------------------------------+
|(ab)+      | match one or more occurrences of "ab" concatenated.                           |
+-----------+-------------------------------------------------------------------------------+
|[^a-z]+    | matches any string of one or more characters that do not includ lower case a-z|
+-----------+-------------------------------------------------------------------------------+
|[+-]?[0-9]+| match any string of one or more digists with an optional prefix of + or -.    |
+-----------+-------------------------------------------------------------------------------+

  source code         a = b + c * d
                            |
                            v
                   +----------------+         +---+
                   |Lexical Analyzer|<--------|Lex|<------- patterns
                   +----------------+         +---+
                            |
    tokens                  v
                    +---------------+         +----+
                    |Syntax Analyzer|<--------|Yacc|<------ grammers (context free)
                    +---------------+         +----+
                            |
  syntax tree               v
                          /   \
                        id1    +
                             /   \
                           id2    *
                            |   /   \
                            | id3    id4
                            v
                     +--------------+
                     |code generator|
                     +--------------+
                            |
   generated code           v
                        load  id3
                        mul   id4
                        add   id2
                        store id1


                   +------------+
     mylang.y  --> |    yacc    | --> y.tab.c _ 
                   +------------+              \               source
                         |                      \               code
                         |                       \               |
                         v                        +--------+     v
                      y.tab.h                     |  gcc   |--> mylang
                         |                        +--------+     |
                         v                       /               v
                   +-----------+                /           compiled code
     mylang.l -->  |    lex    | --> lex.yy.c _/          interpreter output
                   +-----------+

                   Supported SQL Queries
=====================

## Select Statements

```sql
SELECT name, city, *
    FROM students AS t1 JOIN students AS t2 ON t1.city = t2.city
    WHERE t1.grade < 2.0 AND
          t2.grade > 2.0 AND
          t1.city = 'Frohnau'
    ORDER BY t1.grade DESC;

SELECT city, AVG(grade) AS average,
    MIN(grade) AS best, MAX(grade) AS worst
    FROM students
    GROUP BY city;
```

## Data Definition & Modification

**Create Tables**
```sql
CREATE TABLE students (
    name TEXT,
    student_number INTEGER,
    city TEXT,
    grade DOUBLE
);
```

**Update and Delete**
```sql
UPDATE students SET name='Max Mustermann' WHERE name = 'Ralf Mustermann';

DELETE FROM students WHERE name = 'Max Mustermann';
```

A compilr or interpreter for a progromming language is often decomposed into two parts:

1. Read the source program and discover its structure.
2. Process this structure, e.g. to generate the target program.

Lex and Yacc can generate program fragments that solve the first task.

The task of discovering the source structure again is decomposed into subtasks:

1. Split the source file into tokens (Lex)
2. Find the hierarchical structure of the program (Yacc).

----------------------------------------------------------------------------------------------------
NETWORK:
----------------------------------------------------------------------------------------------------

                              server                  client

                            +--------+              +--------+ 
                            | socket |              | socket |  
                            +--------+              +--------+
                                |                        |
                                v                        |    
                            +--------+                   |
                            |  bind  |                   |
                            +--------+                   |
                                |                        | 
                                v                        |
                            +--------+                   |
                            | listen |                   |
                            +--------+                   |
                                |                        |
                                v                        v
                            +--------+  TCP 3 time  +---------+
      (ESTABLISHED) SYN+ACK | accept |<-------------| connect | SYN, ACK (ESTABLISHED)
                            +--------+   handshake  +---------+
                                |                        |
                                v                        v
                            +--------+              +---------+
                            |  read  |<-------------|  write  |
                            +--------+              +---------+
                                
                            +--------+              +---------+
                            | write  |------------->|  read   |
                            +--------+              +---------+
                                |                        |
                                v                        v
                            +-------+                +-------+
                            | close |                | close |
                            +-------+                +-------+


+------------+     +---------------+       +------+       +----------------+       +----------------+
| user input |---->| child process | ----> | pipe | ----> | parent process | ----> |    server      |
+------------+     +---------------+       +------+       +----------------+       +----------------+
                                                                   ^                        |
                                                                   |                        |
                                                                   +------------------------+


TCP is a finite state machine, it has 11 state.


                 SYNseq=x
  SYN_SENT |------------------>|  LISTEN (listen())
  (connect)| SYNseq=y,ACK=x+1  |  
ESTABLISHED|<------------------|  SYN_RCVD
           |     ACK=y+1       |
           |------------------>|  ESTABLISHED
  write()  |                   |___________________      
           |                   |
           |  seq=x+1,ACK=y+1  |
           |------------------>|  read()
           |     ACK=x+2       |
           |<------------------|
           |                   |___________________
           | FINseq=x+2ACK=y+1 |
FIN_WAIT_1 |------------------>| CLOSE_WAIT
           |      ACK=x+3      |
FIN_WAIT_2 |<------------------| LAST_ACK
TIME_WAIT  |<------------------| (close())
           |     FINseq=y+1    |
           |------------------>|
           |      ACK=y+2      |

CLOSED --> SYN_SENT --> ESTABLISHED 
LISTEN 
SYN_RCVD CLOSE_WAIT LAST_ACK CLOSING FIN_WAIT_1 FIN_WAIT_2 TIME_WAIT

1. synchronize blocking iterative

bottleneck: disk or network I/O 

for (;;) {
    fd = accept(...);
    read(fd, buf, n);
    dosomething(buf);
    write(fd, buf, n);
    close(fd);
}


    +-----------+
    |application|
    +-----------+
          |
         +-+
         | |
         | |  System call - kernel context switch
         +-+ -------------------------------------> +-+
          |                                         | |  initiate read I/O
          |                                         +-+ ---->
          |                                          |
          |                                          |
          |                                          |
          |                                          |
          |                                          |   read response
          |          data movement from             +-+ <----
          |       kernel space to user space        | |
         +-+ <------------------------------------- +-+
         | |                                         |
         | |                                         |
         +-+                                         |
          |                                          |
          |                                          |
          |                                          |
          |                                          |

2. multiple process

for (;;) {
    fd = accept(...);
    ret = fork();
    swtich (ret) {
        case -1:
            do_err_handler();
            break;
        case 0:  // child process
            do_handler_fd(fd);
            break;
        default: // parent process
            continue;
    }
}

// need signal handling to handle zombie child process

pre-fork: process pool

3. multiple thread

void* thread_entry(void* args) {
    int fd = *(int *)args;
    do_handler_fd(fd);
}

for (;;) {
    fd = accept();
    pthread_create(..., thread_entry, &fd);
}

----------------------------------------------------------------------------------------------------
TRANSACTION:
----------------------------------------------------------------------------------------------------
In computer science, ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties of
database transactions intended to guarantee validity even in the event of errors, power failures, etc.
In the context of database, a sequence of database operations that satisfies the ACID properties (and
these can be perceived as a single logical operation on the data) is called a transaction. For example,
a transfer of funds from one bank account to another, even involving multiple changes such as debiting
one account and crediting another, is a single transaction.