
Too much retry pushes clients into livelock #9878

@kosii

Description

I reimplemented the retry harness in Scala using the ScalikeJDBC library, a thin wrapper around JDBC. During our tests we occasionally saw our application go into a livelock, retrying conflicting transactions indefinitely.

I'd appreciate help with either of these:

  1. checking whether the retry harness's code looks correct,
  2. confirming or refuting whether our use case and type of workload are a good fit for CockroachDB.
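For context on what the harness is supposed to do, here is a minimal sketch of a client-side retry loop with exponential backoff and full jitter. This is my own illustration, not the code from the issue; `withRetries`, `RetryableError`, and the delay parameters are all made-up names. Backoff with jitter matters here because clients that retry immediately and in lockstep are one way conflicting transactions can livelock each other.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Random, Success, Try}

// Illustrative marker for errors worth retrying (e.g. serialization
// conflicts); a real harness would inspect the SQLException state instead.
final class RetryableError(msg: String) extends RuntimeException(msg)

def withRetries[A](maxRetries: Int, baseDelayMs: Long = 10L)(op: () => A): A = {
  @tailrec
  def loop(attempt: Int): A =
    Try(op()) match {
      case Success(a) => a
      case Failure(_: RetryableError) if attempt < maxRetries =>
        // Exponential backoff with full jitter: sleep a random amount in
        // [0, base * 2^attempt) so competing clients drift apart instead
        // of retrying in lockstep.
        val cap = baseDelayMs * (1L << attempt)
        Thread.sleep((Random.nextDouble() * cap).toLong)
        loop(attempt + 1)
      case Failure(e) => throw e
    }
  loop(0)
}
```

The key design point is that the delay grows with the attempt count and is randomized, so two clients that conflict once are unlikely to conflict again on the next attempt.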

I created a test application, which tries to emulate our workload and reproduce the livelock. The application works on a table with the following schema:

+-----------+--------+-------+---------+
|   Field   |  Type  | Null  | Default |
+-----------+--------+-------+---------+
| namespace | STRING | false | NULL    |
| key       | STRING | false | NULL    |
| created   | INT    | false | NULL    |
| expires   | INT    | false | NULL    |
+-----------+--------+-------+---------+

And executes the following transactions in parallel:

  1. Producer transaction (with a single statement)

UPSERT INTO ttl (namespace, key, created, expires) VALUES ('click', 'somekey', a_timestamp, a_second_timestamp)

  2. Cleanup transaction

    1. SELECT namespace, max(created) AS created FROM ttl GROUP BY namespace

    2. for each (namespace, maxCreatedTime) pair returned from the first statement:

DELETE FROM ttl WHERE expires < ${maxCreatedTime} AND namespace = ${namespace}
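To make the two-step cleanup concrete, here is a pure in-memory model of its logic in plain Scala. This is only a sketch of what the transaction computes, not the actual ScalikeJDBC code; `Row` and `cleanup` are names I made up, and a `List[Row]` stands in for the `ttl` table.

```scala
// One row of the ttl table from the schema above.
final case class Row(namespace: String, key: String, created: Long, expires: Long)

def cleanup(rows: List[Row]): List[Row] = {
  // Step 1: SELECT namespace, max(created) AS created FROM ttl GROUP BY namespace
  val maxCreated: Map[String, Long] =
    rows.groupBy(_.namespace).map { case (ns, rs) => ns -> rs.map(_.created).max }
  // Step 2: for each (namespace, maxCreatedTime) pair, delete expired rows:
  // DELETE FROM ttl WHERE expires < maxCreatedTime AND namespace = ns
  rows.filterNot(r => maxCreated.get(r.namespace).exists(r.expires < _))
}
```

Note that step 2 reads a value computed in step 1, which is why the two statements must run in one transaction: a concurrent producer can raise `max(created)` between the SELECT and the DELETE.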

To run the application, execute ./sbt run. The application can run in two modes: synchronized and unsynchronized. The synchronized mode deliberately creates an adversarial situation: each thread executing the cleanup job waits for the others on a CyclicBarrier before executing the 2nd step of the transaction. To switch between the two modes, modify the app.synchronized key in src/main/resources/application.conf.

Unfortunately, the app cannot reliably reproduce the livelock, but even in unsynchronized mode it often takes at least 2-3 minutes before a cleanup task can be executed successfully, which disturbs the normal operation of our application. We see around 2-3 inserts per second, and the cleanup jobs run at random intervals, roughly once a minute, but multiple nodes can execute them at the same time.
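Since several nodes can fire the cleanup at the same moment, one mitigation I would consider (my own suggestion, not something the issue implements) is to jitter each node's cleanup schedule so they drift apart. A tiny sketch, with `nextCleanupDelayMs` being a hypothetical helper:

```scala
import scala.util.Random

// Hypothetical scheduling helper: pick the next cleanup delay uniformly in
// [base/2, 3*base/2) so nodes with the same base interval do not stay
// synchronized and collide on every run.
def nextCleanupDelayMs(baseMs: Long, rng: Random = new Random()): Long =
  baseMs / 2 + (rng.nextDouble() * baseMs).toLong
```

With a one-minute base interval, each node would wait between 30 and 90 seconds, which reduces (but does not eliminate) the chance of two nodes running the contended DELETE step concurrently.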

For my tests I used CockroachDB version 20161006.

Metadata

Labels

C-question: A question rather than an issue. No code/spec/doc change needed.
O-community: Originated from the community
