
Too much retry pushes clients into livelock #9878

@kosii

Description

I reimplemented the retry harness in Scala using the ScalikeJDBC library, a thin wrapper around JDBC. During our tests we occasionally saw our application go into a livelock, retrying conflicting transactions indefinitely.

I'd appreciate help with either of these:

  1. checking whether the retry harness's code looks correct,
  2. confirming or refuting whether our use case and type of workload are a good fit for CockroachDB.
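For context on what the harness is supposed to do, here is a minimal sketch of a client-side retry loop with exponential backoff and full jitter. This is my own illustration, not the code from the issue; `withRetries`, `RetryableError`, and the delay parameters are all made-up names. Backoff with jitter matters here because clients that retry immediately and in lockstep are one way conflicting transactions can livelock each other.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Random, Success, Try}

// Illustrative marker for errors worth retrying (e.g. serialization
// conflicts); a real harness would inspect the SQLException state instead.
final class RetryableError(msg: String) extends RuntimeException(msg)

def withRetries[A](maxRetries: Int, baseDelayMs: Long = 10L)(op: () => A): A = {
  @tailrec
  def loop(attempt: Int): A =
    Try(op()) match {
      case Success(a) => a
      case Failure(_: RetryableError) if attempt < maxRetries =>
        // Exponential backoff with full jitter: sleep a random amount in
        // [0, base * 2^attempt) so competing clients drift apart instead
        // of retrying in lockstep.
        val cap = baseDelayMs * (1L << attempt)
        Thread.sleep((Random.nextDouble() * cap).toLong)
        loop(attempt + 1)
      case Failure(e) => throw e
    }
  loop(0)
}
```

The key design point is that the delay grows with the attempt count and is randomized, so two clients that conflict once are unlikely to conflict again on the next attempt.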

I created a test application, which tries to emulate our workload and reproduce the livelock. The application works on a table with the following schema:

+-----------+--------+-------+---------+
|   Field   |  Type  | Null  | Default |
+-----------+--------+-------+---------+
| namespace | STRING | false | NULL    |
| key       | STRING | false | NULL    |
| created   | INT    | false | NULL    |
| expires   | INT    | false | NULL    |
+-----------+--------+-------+---------+

And executes the following transactions in parallel:

  1. Producer transaction (with a single statement)

UPSERT INTO ttl (namespace, key, created, expires) VALUES ('click', 'somekey', a_timestamp, a_second_timestamp)

  2. Cleanup transaction

    1. SELECT namespace, max(created) AS created FROM ttl GROUP BY namespace

    2. for each (namespace, maxCreatedTime) pair returned from the first statement:

DELETE FROM ttl WHERE expires < ${maxCreatedTime} AND namespace = ${namespace}
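To make the two-step cleanup concrete, here is a pure in-memory model of its logic in plain Scala. This is only a sketch of what the transaction computes, not the actual ScalikeJDBC code; `Row` and `cleanup` are names I made up, and a `List[Row]` stands in for the `ttl` table.

```scala
// One row of the ttl table from the schema above.
final case class Row(namespace: String, key: String, created: Long, expires: Long)

def cleanup(rows: List[Row]): List[Row] = {
  // Step 1: SELECT namespace, max(created) AS created FROM ttl GROUP BY namespace
  val maxCreated: Map[String, Long] =
    rows.groupBy(_.namespace).map { case (ns, rs) => ns -> rs.map(_.created).max }
  // Step 2: for each (namespace, maxCreatedTime) pair, delete expired rows:
  // DELETE FROM ttl WHERE expires < maxCreatedTime AND namespace = ns
  rows.filterNot(r => maxCreated.get(r.namespace).exists(r.expires < _))
}
```

Note that step 2 reads a value computed in step 1, which is why the two statements must run in one transaction: a concurrent producer can raise `max(created)` between the SELECT and the DELETE.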

To run the application, execute ./sbt run. The application can run in two modes: synchronized and unsynchronized. The synchronized mode deliberately creates an adversarial situation: each thread executing the cleanup job waits for the others on a CyclicBarrier before executing the 2nd step of the transaction. To switch between the two modes, modify the app.synchronized key in src/main/resources/application.conf.

Unfortunately, the app cannot reliably reproduce the livelock, but even in unsynchronized mode it often takes at least 2-3 minutes before a cleanup task can be executed successfully, which disturbs the normal operation of our application. We see around 2-3 inserts per second, and the cleanup jobs run at random intervals, roughly once a minute, but multiple nodes can execute them at the same time.
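Since several nodes can fire the cleanup at the same moment, one mitigation I would consider (my own suggestion, not something the issue implements) is to jitter each node's cleanup schedule so they drift apart. A tiny sketch, with `nextCleanupDelayMs` being a hypothetical helper:

```scala
import scala.util.Random

// Hypothetical scheduling helper: pick the next cleanup delay uniformly in
// [base/2, 3*base/2) so nodes with the same base interval do not stay
// synchronized and collide on every run.
def nextCleanupDelayMs(baseMs: Long, rng: Random = new Random()): Long =
  baseMs / 2 + (rng.nextDouble() * baseMs).toLong
```

With a one-minute base interval, each node would wait between 30 and 90 seconds, which reduces (but does not eliminate) the chance of two nodes running the contended DELETE step concurrently.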

For my tests I used CockroachDB version 20161006.

Metadata

Labels

C-question: A question rather than an issue. No code/spec/doc change needed.
O-community: Originated from the community
