Skip to content

Enhance Connection Collapse support in ATS core#6028

Closed
sudheerv wants to merge 1 commit intoapache:masterfrom
sudheerv:collapsed_fwd
Closed

Enhance Connection Collapse support in ATS core#6028
sudheerv wants to merge 1 commit intoapache:masterfrom
sudheerv:collapsed_fwd

Conversation

@sudheerv
Copy link
Copy Markdown
Contributor

@sudheerv sudheerv commented Oct 16, 2019

Add an option to support cache open read retry on a write lock
failure. With this option, as long as read-while-writer is set
up following the guidelines in the docs, there should be no need
for any plugins to augment the core. Eventual plan is to deprecate
collapsed_forwarding plugin with this new support.

For more context on this, see
https://cwiki.apache.org/confluence/display/TS/Presentations+-+2019?preview=/112821251/132320653/Collapsed%20Forwarding%20.pdf

@sudheerv sudheerv added this to the 10.0.0 milestone Oct 16, 2019
@sudheerv sudheerv requested review from ezelkow1 and zwoop October 16, 2019 22:59
@sudheerv sudheerv self-assigned this Oct 16, 2019
@scw00
Copy link
Copy Markdown
Member

scw00 commented Oct 18, 2019

Hi Some question for this pr .

Does this increase the CACHE_LOOKUP_COMPLETE hook calls ?
What's the latency do we lost in these retries in worst case ?
IIUC we can only redo read logic for DOC_BUSY ERROR ?

Comment thread proxy/http/HttpSM.cc
Comment thread proxy/http/HttpTransact.cc Outdated
@scw00
Copy link
Copy Markdown
Member

scw00 commented Oct 18, 2019

Please update:

case CACHE_WL_READ_RETRY:
// Write failed but retried and got a vector to read
// We need to clean up our state so that transact does
// not assert later on. Then handle the open read hit

@sudheerv
Copy link
Copy Markdown
Contributor Author

sudheerv commented Oct 18, 2019

Hi Some question for this pr .

Does this increase the CACHE_LOOKUP_COMPLETE hook calls ?

No, the API callbacks with CACHE_LOOKUP_COMPLETE shouldn't be impacted. The call back for CACHE_LOOKUP_COMPLETE happens from HttpSM after HttpCacheSM finishes all retries and returns the final lookup status.

What's the latency do we lost in these retries in worst case ?

The additional latency due to this new feature can be up to (max_cache_open_read_retries X cache_open_read_retry_time) and the total max latency including the first round of read retries will be equal to 2 X (max_cache_open_read_retries X cache_open_read_retry_time).

Typical values used are max_cache_open_read_retries = 5-10, cache_open_read_retry_time = 10-50 msec, so a worst case of 1 sec (2 X 10 X 50 msec)

IIUC we can only redo read logic for DOC_BUSY ERROR ?

Yes, that's to avoid having to penalize simple cache miss scenarios (i.e when there's no concurrent request). The new change uses write_lock_failure as an additional signal to identify the DOC_BUSY scenario. See below for more context on this.

https://cwiki.apache.org/confluence/display/TS/Presentations+-+2019?preview=/112821251/132320653/Collapsed%20Forwarding%20.pdf

@scw00
Copy link
Copy Markdown
Member

scw00 commented Oct 19, 2019

@sudheerv Thanks for your response. Looks great.

@scw00
Copy link
Copy Markdown
Member

scw00 commented Oct 19, 2019

Thread 34 "[ET_NET 31]" received signal SIGABRT, Aborted.
0x00007ffff593d207 in raise () from /lib64/libc.so.6
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y^CQuit
(gdb) bt
#0  0x00007ffff593d207 in raise () from /lib64/libc.so.6
#1  0x00007ffff593e8f8 in abort () from /lib64/libc.so.6
#2  0x00007ffff7aff0da in ink_abort (message_format=0x7ffff7b6c688 "%s:%d: failed assertion `%s`") at ink_error.cc:99
#3  0x00007ffff7afc7d2 in _ink_assert (expression=0xa49b10 "open_write_tries <= master_sm->t_state.txn_conf->max_cache_open_write_retries", file=0xa49868 "HttpCacheSM.cc", line=205) at ink_assert.cc:37
#4  0x00000000007a3bc0 in HttpCacheSM::state_cache_open_write (this=0x7fff7828e630, event=2, data=0x1146000) at HttpCacheSM.cc:205
#5  0x00000000006b05df in Continuation::handleEvent (this=0x7fff7828e630, event=2, data=0x1146000) at /home/scw00/trafficserver/iocore/eventsystem/I_Continuation.h:190
#6  0x000000000099dc27 in EThread::process_event (this=0x32f6810, e=0x1146000, calling_code=2) at UnixEThread.cc:136
#7  0x000000000099e202 in EThread::execute_regular (this=0x32f6810) at UnixEThread.cc:249
#8  0x000000000099e52b in EThread::execute (this=0x32f6810) at UnixEThread.cc:338
#9  0x000000000099cf1d in spawn_thread_internal (a=0x6578010) at Thread.cc:92
#10 0x00007ffff66fedd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007ffff5a04ead in clone () from /lib64/libc.so.6
(gdb) up
(gdb) p master_sm->t_state.txn_conf->max_cache_open_write_retries
$1 = 1
(gdb) p open_write_tries
$2 = 2
(gdb)

@sudheerv
Copy link
Copy Markdown
Contributor Author

Thread 34 "[ET_NET 31]" received signal SIGABRT, Aborted.
0x00007ffff593d207 in raise () from /lib64/libc.so.6
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y^CQuit
(gdb) bt
#0  0x00007ffff593d207 in raise () from /lib64/libc.so.6
#1  0x00007ffff593e8f8 in abort () from /lib64/libc.so.6
#2  0x00007ffff7aff0da in ink_abort (message_format=0x7ffff7b6c688 "%s:%d: failed assertion `%s`") at ink_error.cc:99
#3  0x00007ffff7afc7d2 in _ink_assert (expression=0xa49b10 "open_write_tries <= master_sm->t_state.txn_conf->max_cache_open_write_retries", file=0xa49868 "HttpCacheSM.cc", line=205) at ink_assert.cc:37
#4  0x00000000007a3bc0 in HttpCacheSM::state_cache_open_write (this=0x7fff7828e630, event=2, data=0x1146000) at HttpCacheSM.cc:205
#5  0x00000000006b05df in Continuation::handleEvent (this=0x7fff7828e630, event=2, data=0x1146000) at /home/scw00/trafficserver/iocore/eventsystem/I_Continuation.h:190
#6  0x000000000099dc27 in EThread::process_event (this=0x32f6810, e=0x1146000, calling_code=2) at UnixEThread.cc:136
#7  0x000000000099e202 in EThread::execute_regular (this=0x32f6810) at UnixEThread.cc:249
#8  0x000000000099e52b in EThread::execute (this=0x32f6810) at UnixEThread.cc:338
#9  0x000000000099cf1d in spawn_thread_internal (a=0x6578010) at Thread.cc:92
#10 0x00007ffff66fedd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007ffff5a04ead in clone () from /lib64/libc.so.6
(gdb) up
(gdb) p master_sm->t_state.txn_conf->max_cache_open_write_retries
$1 = 1
(gdb) p open_write_tries
$2 = 2
(gdb)

Ah, that assert needs to be moved down. Thanks for testing and catching it!

@sudheerv
Copy link
Copy Markdown
Contributor Author

Thread 34 "[ET_NET 31]" received signal SIGABRT, Aborted.
0x00007ffff593d207 in raise () from /lib64/libc.so.6
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y^CQuit
(gdb) bt
#0  0x00007ffff593d207 in raise () from /lib64/libc.so.6
#1  0x00007ffff593e8f8 in abort () from /lib64/libc.so.6
#2  0x00007ffff7aff0da in ink_abort (message_format=0x7ffff7b6c688 "%s:%d: failed assertion `%s`") at ink_error.cc:99
#3  0x00007ffff7afc7d2 in _ink_assert (expression=0xa49b10 "open_write_tries <= master_sm->t_state.txn_conf->max_cache_open_write_retries", file=0xa49868 "HttpCacheSM.cc", line=205) at ink_assert.cc:37
#4  0x00000000007a3bc0 in HttpCacheSM::state_cache_open_write (this=0x7fff7828e630, event=2, data=0x1146000) at HttpCacheSM.cc:205
#5  0x00000000006b05df in Continuation::handleEvent (this=0x7fff7828e630, event=2, data=0x1146000) at /home/scw00/trafficserver/iocore/eventsystem/I_Continuation.h:190
#6  0x000000000099dc27 in EThread::process_event (this=0x32f6810, e=0x1146000, calling_code=2) at UnixEThread.cc:136
#7  0x000000000099e202 in EThread::execute_regular (this=0x32f6810) at UnixEThread.cc:249
#8  0x000000000099e52b in EThread::execute (this=0x32f6810) at UnixEThread.cc:338
#9  0x000000000099cf1d in spawn_thread_internal (a=0x6578010) at Thread.cc:92
#10 0x00007ffff66fedd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007ffff5a04ead in clone () from /lib64/libc.so.6
(gdb) up
(gdb) p master_sm->t_state.txn_conf->max_cache_open_write_retries
$1 = 1
(gdb) p open_write_tries
$2 = 2
(gdb)

Ah, that assert needs to be moved down. Thanks for testing and catching it!

Fixed it

@sudheerv
Copy link
Copy Markdown
Contributor Author

Please update:

case CACHE_WL_READ_RETRY:
// Write failed but retried and got a vector to read
// We need to clean up our state so that transact does
// not assert later on. Then handle the open read hit

Done

@sudheerv sudheerv closed this Oct 19, 2019
@sudheerv
Copy link
Copy Markdown
Contributor Author

Please update:

case CACHE_WL_READ_RETRY:
// Write failed but retried and got a vector to read
// We need to clean up our state so that transact does
// not assert later on. Then handle the open read hit

@sudheerv sudheerv reopened this Oct 19, 2019
@sudheerv
Copy link
Copy Markdown
Contributor Author

Please update:

case CACHE_WL_READ_RETRY:
// Write failed but retried and got a vector to read
// We need to clean up our state so that transact does
// not assert later on. Then handle the open read hit

Done

@sudheerv sudheerv closed this Oct 19, 2019
@sudheerv sudheerv reopened this Oct 19, 2019
@scw00
Copy link
Copy Markdown
Member

scw00 commented Oct 21, 2019

Fatal: HttpSM.cc:2460: failed assertion `t_state.cache_open_write_fail_action == HttpTransact::CACHE_WL_FAIL_ACTION_READ_RETRY`

Thread 34 "[ET_NET 31]" received signal SIGABRT, Aborted.
[Switching to Thread 0x65b1700 (LWP 139755)]
0x00007ffff593d207 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.176-2.el7.x86_64 elfutils-libs-0.176-2.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-9.el7.x86_64 libgcc-4.8.5-36.el7.x86_64 libstdc++-4.8.5-36.el7.x86_64 pcre-8.32-17.el7.x86_64 sssd-client-1.16.2-13.el7_6.5.x86_64 systemd-libs-219-62.el7_6.5.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x00007ffff593d207 in raise () from /lib64/libc.so.6
#1  0x00007ffff593e8f8 in abort () from /lib64/libc.so.6
#2  0x00007ffff7aff0da in ink_abort (message_format=0x7ffff7b6c688 "%s:%d: failed assertion `%s`") at ink_error.cc:99
#3  0x00007ffff7afc7d2 in _ink_assert (expression=0xa63208 "t_state.cache_open_write_fail_action == HttpTransact::CACHE_WL_FAIL_ACTION_READ_RETRY", file=0xa61c52 "HttpSM.cc", line=2460)
    at ink_assert.cc:37
#4  0x000000000075d5ee in HttpSM::state_cache_open_write (this=0x7fff7828d0a0, event=1102, data=0x117ffa0) at HttpSM.cc:2460
#5  0x000000000075e1e4 in HttpSM::main_handler (this=0x7fff7828d0a0, event=1102, data=0x117ffa0) at HttpSM.cc:2614
#6  0x00000000006cddaf in Continuation::handleEvent (this=0x7fff7828d0a0, event=1102, data=0x117ffa0) at /home/scw00/trafficserver/iocore/eventsystem/I_Continuation.h:190
#7  0x00000000007c357b in HttpCacheSM::state_cache_open_write (this=0x7fff7828ee10, event=2, data=0x117ffa0) at HttpCacheSM.cc:209
#8  0x00000000006cddaf in Continuation::handleEvent (this=0x7fff7828ee10, event=2, data=0x117ffa0) at /home/scw00/trafficserver/iocore/eventsystem/I_Continuation.h:190
#9  0x00000000009bf4f1 in EThread::process_event (this=0x3330430, e=0x117ffa0, calling_code=2) at UnixEThread.cc:136
#10 0x00000000009bfacc in EThread::execute_regular (this=0x3330430) at UnixEThread.cc:249
#11 0x00000000009bfdf5 in EThread::execute (this=0x3330430) at UnixEThread.cc:338
#12 0x00000000009be7e7 in spawn_thread_internal (a=0x65b2010) at Thread.cc:92
#13 0x00007ffff66fedd5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007ffff5a04ead in clone () from /lib64/libc.so.6

Add an option to support cache open read retry on a write lock
failure. With this option, as long as read-while-writer is set
up following the guidelines in the docs, there should be no need
for any plugins to augment the core. Eventual plan is to deprecate
collapsed_forwarding plugin with this new support

Fix the assert and comments

init t_state.cache_open_write_fail_action to fix the assert
@sudheerv
Copy link
Copy Markdown
Contributor Author

che_open_wri

Fixed the assert. Thanks @scw00 for catching it!

Comment thread proxy/http/HttpCacheSM.cc
@scw00
Copy link
Copy Markdown
Member

scw00 commented Oct 22, 2019

I have some concerns :

  • It involves COMPLETE_LOOK_UP_HOOK twice and it might leading some unexpected behaviours in plugins. Please update doc if it is expected.
  • What should we do for update scenario. It seems we do return wrong object directly once we failed to get DOC_BUSY error (without retry).

Comment thread proxy/http/HttpCacheSM.cc
@zwoop
Copy link
Copy Markdown
Contributor

zwoop commented Oct 23, 2019

@sudheerv Are you making another PR for this? :)

@zwoop zwoop removed this from the 10.0.0 milestone Oct 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants