
2.3.1 Release Proposal#1908

Merged
janl merged 26 commits into 2.3.x from 2.3.1-draft on Feb 17, 2019

Conversation

@janl
Member

@janl janl commented Feb 7, 2019

Mainly to get a CI build status for this set of cherry-picked commits between 2.3.0 and master.

nickva and others added 19 commits February 7, 2019 10:49
This avoids needlessly making cross-cluster fabric:update_docs(Db, [], Opts)
calls.
Fix function_clause error on invalid DB security objects when the
request body of the PUT /db/_security endpoint is not valid JSON.

Closes #1384
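As a reference point, a minimal Python sketch (hypothetical host, credentials, and database name) of a well-formed PUT to the `_security` endpoint; the fix above is about rejecting a malformed (non-JSON) body cleanly instead of crashing with a function_clause error:

```python
import json
import urllib.request

# Hypothetical host/credentials/db; the security object shape follows
# the documented CouchDB format of admins/members names and roles.
security = {
    "admins":  {"names": ["bob"], "roles": []},
    "members": {"names": [], "roles": ["readers"]},
}

req = urllib.request.Request(
    "http://admin:secret@127.0.0.1:5984/mydb/_security",
    data=json.dumps(security).encode(),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
# urllib.request.urlopen(req)  # with this fix, a non-JSON body yields a
#                              # proper error response, not a server crash
```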
Previously `end_time` was generated by converting the start_time to universal
time, then passing that to `httpd_util:rfc1123_date/1`. However, `rfc1123_date/1`
also translates its argument from local to UTC time; that is, it expects its
input to be in local time.

Fixes #1841
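The double-conversion pitfall can be sketched with a Python stand-in (made-up fixed offset; the real code is Erlang's `httpd_util:rfc1123_date/1`, which assumes its input is local time):

```python
from datetime import datetime, timedelta

LOCAL_UTC_OFFSET = timedelta(hours=2)  # pretend the server runs at UTC+2

def rfc1123_from_local(dt_local):
    # Like rfc1123_date/1: assumes LOCAL input, shifts to UTC, formats.
    return (dt_local - LOCAL_UTC_OFFSET).strftime("%a, %d %b %Y %H:%M:%S GMT")

start_local = datetime(2019, 2, 7, 12, 0, 0)  # 10:00 UTC

# Buggy: converting to universal time first means the formatter
# subtracts the offset a second time.
buggy = rfc1123_from_local(start_local - LOCAL_UTC_OFFSET)
# Fixed: pass the local time straight through.
fixed = rfc1123_from_local(start_local)
```

Here `buggy` comes out as 08:00 GMT (off by the offset) while `fixed` is the correct 10:00 GMT.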
There was a subtle bug when opening specific revisions in
fabric_doc_open_revs due to a race condition between updates being
applied across a cluster.

The underlying cause was the stemming that happens after a document has
been updated more than revs_limit times, combined with concurrent
reads to a node that had not yet made the update. To illustrate, let's
consider a document A which has a revision history from `{N, RevN}` to
`{N+1000, RevN+1000}` (assuming revs_limit is the default 1000). From a
single node's perspective, when an update comes in we add the new
revision and stem the oldest one. The revisions on the node would then
be `{N+1, RevN+1}` to `{N+1001, RevN+1001}`.

The bug exists when we attempt to open revisions on a different node
that has yet to apply the new update. In this case
fabric_doc_open_revs could be called with `{N+1000, RevN+1000}`. This
results in a response from fabric_doc_open_revs that includes two
different `{ok, Doc}` results instead of the expected one instance. The
reason for this is that one document has revisions `{N+1, RevN+1}` to
`{N+1000, RevN+1000}` from the node that has applied the update, while
the node without the update responds with revisions `{N, RevN}` to
`{N+1000, RevN+1000}`.

To rephrase that, a node that has applied an update can end up returning
a revision path that contains `revs_limit - 1` revisions while a node
without the update returns all `revs_limit` revisions. This slight
change in the path prevented the responses from being properly combined
into a single response.
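The mismatch can be sketched with a toy model in Python (made-up revision tuples, revs_limit shrunk to 3 for brevity):

```python
REVS_LIMIT = 3  # CouchDB's default is 1000; reduced here for brevity

def apply_update(path, new_rev):
    """Prepend the new revision and stem anything beyond REVS_LIMIT."""
    return ([new_rev] + path)[:REVS_LIMIT]

# Both nodes start with the same revision path (newest first).
node_a = [(3, "r3"), (2, "r2"), (1, "r1")]
node_b = [(3, "r3"), (2, "r2"), (1, "r1")]

# Node A applies an update; node B has not seen it yet.
node_a = apply_update(node_a, (4, "r4"))

# Opening rev (3, "r3") now yields different ancestor paths:
path_a = node_a[node_a.index((3, "r3")):]  # 2 revs - stemmed
path_b = node_b[node_b.index((3, "r3")):]  # all 3 revs
assert path_a != path_b  # the responses no longer combine into one result
```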

This bug has existed for many years. However, read repair effectively
prevents it from being a significant issue by immediately fixing the
revision history discrepancy. This was discovered due to the recent bug
in read repair during a mixed cluster upgrade to a release including
clustered purge. In this situation we end up crashing the design
document cache which then leads to all of the design document requests
being direct reads which can end up causing cluster nodes to OOM and
die. The conditions require a significant number of design document
edits coupled with already significant load to those modified design
documents. The most direct example observed was a cluster that had a
significant number of filtered replications in and out of the cluster.
This server admin-only endpoint forces an n-way sync of all shards
across all nodes on which they are hosted.

This can be useful for an administrator adding a new node to the
cluster, after updating _dbs so that the new node hosts an existing db
with content, to force the new node to sync all of that db's shards.

Users may want to bump their `[mem3] sync_concurrency` value to a
larger figure for the duration of the shards sync.

Closes #1807
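A hedged sketch of invoking the new endpoint from Python (hypothetical host, credentials, and database name; assumes the endpoint is exposed as `POST /{db}/_sync_shards` and, per the above, is server-admin only):

```python
import urllib.request

# Hypothetical host/credentials/db; endpoint path is an assumption
# based on the description above - check the release docs.
base = "http://admin:secret@127.0.0.1:5984"
db = "mydb"

req = urllib.request.Request(f"{base}/{db}/_sync_shards", method="POST")
# urllib.request.urlopen(req)  # uncomment to actually trigger the n-way sync
```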
It has a fix that reverts the user socket buffer size to 8192 and also
allows setting this buffer value directly (not necessarily
via `{recbuf, ...}`).

Fixes #1810

Warning:

2.19.0 blacklists a series of OTP releases: 21.2, 21.2.1, 21.2.2
This is done via a runtime check of the ssl application version.

The blacklist seems valid, as there is a bug which prevents data from
being delivered on TLS sockets. That could affect either the CouchDB
server side (chttpd) or the replication client side (ibrowse).
This restricts _purge and _purged_infos_limit to server admins
in terms of the security level required to run them.

Fixes #1799
This commit introduces a new option `snooze_period_ms` (measured in
milliseconds), and deprecates `snooze_period` while still supporting it
for obvious legacy reasons.
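A sketch of the corresponding ini configuration, assuming this option lives in the smoosh channel settings (channel name shown is the default ratio_dbs channel; value is illustrative):

```ini
; snooze_period (seconds) still works but is deprecated in favor of
; snooze_period_ms (milliseconds).
[smoosh.ratio_dbs]
snooze_period_ms = 30000
```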
The Makefile target builds a python3 venv at .venv and installs
black if possible. Since black is Python 3.6 and up only, we
skip the check on systems with an older Python 3.x.
@janl
Member Author

janl commented Feb 7, 2019

needs apache/couchdb-documentation#392

@janl
Member Author

janl commented Feb 7, 2019

and apache/couchdb-fauxton#1180

@wohali
Member

wohali commented Feb 7, 2019

Hi @janl ,

c5d9cfe (#1766) <-- you missed this one
(you already pulled 33e3625)

Would you consider also cherry-picking these minor fixes? These are all small in scope but moderate in covering some corner cases, especially upgrade-related, so I think they'd be good fits for a patch release.

c68863a (#1808)
c6b095b (#1860, fixes mixed-cluster situation)
17f05b7 (#1874, reported by user in Slack/IRC)

#1794 which is:

#1824 which is:

@janl
Member Author

janl commented Feb 8, 2019

Heya @wohali most of these are in. Can you double check and dedupe? :)

@wohali
Member

wohali commented Feb 8, 2019

@janl updated to dedupe, sorry about that, was going off the wrong info

@janl
Member Author

janl commented Feb 8, 2019

@wohali thanks for the dedupe, sorry it wasn’t clearer what was included already.

c6b095b (#1860, fixes mixed-cluster situation)

I may have read things wrong when skimming, but I assumed this was only relevant past the partitioned databases commit. Happy to reconsider if this is generally useful cc @davisp.

17f05b7 (#1874, reported by user in Slack/IRC)

I had ruled dep ups for other than critical things to be out of scope for a .1, but on reread I do agree we should do this one.

#1794 which is: …

felt a bit risky for a .1, but happy to include.

#1824 which is: …

Also thought this was moving around things too much for a .1, but am happy to be convinced otherwise (cc @nickva)


Any I didn’t comment on I agree on adding. Will do so over the weekend while Fauxton gets into shape.

@wohali
Member

wohali commented Feb 8, 2019

@jan thanks. The other we should think about is #1803, but @jaydoane may need help.

@nickva
Contributor

nickva commented Feb 8, 2019

@janl

#1824 which is: …
Also thought this was moving around things too much for a .1, but am happy to be convinced otherwise (cc @nickva)

Most of the changes in the PR were a code move, copying the streams logic into its own fabric module in a separate commit:

19048fd

The main logic was here:

41757cd

I think it would mostly affect larger clusters with many requests timing out and being cleaned up improperly, so they'd leak their rexi workers. The others it might affect are smaller embedded systems with restrictive resources (a low max_dbs_open value). But it's maybe not as critical for average CouchDB deployments, and it's a bug that's been there for years, so I can see keeping it back to reduce the .1 commit set.

Oh and thank you for helping with 2.3.1!

davisp and others added 2 commits February 12, 2019 12:36
This enables backwards compatibility with nodes still running the old
version of fabric_rpc when a cluster is upgraded to master. This has no
effect once all nodes are upgraded to the latest version.
This fixes the inability to set config keys with regex symbols in them
This adds an API call for looking up a single design doc regardless of
whether the database is clustered or not.
@janl
Member Author

janl commented Feb 12, 2019

#1766 and #1824 end up being non-trivial merges, so I’ll leave those out for now.

I’ve added everything else.

The underlying clustered _all_docs call can cause significant extra load
during compaction.
@janl
Member Author

janl commented Feb 12, 2019

#1803 doesn’t look ready yet. cc @jaydoane @iilyak

I won’t have time to review it, but if it lands in master in the next ~48 hours, I can hold 2.3.1 until then.

@jaydoane
Contributor

@janl #1803 has landed in master

This ensures that admin password hashes are the same on all nodes when
passwords are set directly on each node rather than through the
coordinator node.
@janl
Member Author

janl commented Feb 17, 2019

@jaydoane merged, thanks!

@janl janl merged commit d8c29c4 into 2.3.x Feb 17, 2019