Skip to content

Limit nut-scanner thread count#1158

Merged
jimklimov merged 29 commits intonetworkupstools:masterfrom
jimklimov:fix-nut-scanner-threadcount
Nov 6, 2021
Merged

Limit nut-scanner thread count#1158
jimklimov merged 29 commits intonetworkupstools:masterfrom
jimklimov:fix-nut-scanner-threadcount

Conversation

@jimklimov
Copy link
Copy Markdown
Member

@jimklimov jimklimov commented Nov 3, 2021

If extremely many threads are spawned, e.g. scanning a very large IP address range, the nut-scanner program can crash due to resource exhaustion or hitting some constraints in pthreads implementation (various effects were seen on different generations of OSes) or third-party libraries used in particular scanners.

This PR adds a (CLI-configurable as --thread N) limit to amount of threads that would be spawned by different parallelized scanners (snmp, nut, netxml, serial), and if the limit gets hit - it would pthread_tryjoin_np() to wait for some of the threads to complete their work, and only then proceed to spawn another.

Note that while there is locking to increase or decrease the overall thread counter, checks are not mutex'ed and I anticipate that small accounting errors are possible (e.g. launching up to +1 "extra" thread for each scan type).

Given that scanning usually means sending a request and waiting if any reply comes within a timeout, the processing load of each item scan is negligible while considerable wall-clock time is spent. Lack of parallelism does add up: in tests, an SNMP scan of a /24 subnet took ~5 seconds without hitting a limit, 20 seconds with a limit of 64 threads, and ~1200 seconds with a limit of 1 thread.

The default limit in this PR is arbitrarily set to 1024 threads overall, to accomodate fast scans of a typical /24 subnet with several protocols at once.
UPDATE1: The default can get reduced if allowed file descriptor ulimit is smaller.
UPDATE2: Also individual protocol scanners can be subjected to their own limits; whichever is smaller (and gets hit first) wins.

PS: This is a separate issue from broken netmask processing, addressed in PR #1157

@jimklimov jimklimov requested review from aquette, clepple and zykh November 3, 2021 21:15
@aquette
Copy link
Copy Markdown
Member

aquette commented Nov 4, 2021

side note:
years ago, I uncovered a limitation in Net SNMP shared lib: when spawning more than 1024 handles, you crash it.
Worth to track this

Comment thread tools/nut-scanner/scan_snmp.c Outdated
Copy link
Copy Markdown
Member

@aquette aquette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, as discussed

@jimklimov
Copy link
Copy Markdown
Member Author

jimklimov commented Nov 4, 2021

Thanks, though trying to confirm that (in a Debian 11 derived enviromnent) by scanning 8 * /24 subnets and a huge limit of jobs, I did not get a crash and got many responses. Need to check with older systems...

It did however not work well from the first try: needed to bump ulimit -n 131072 in the shell (default 1024) at least for net-snmp to be able to search for its global and per-host config files, probably for net sockets, etc. It did complain otherwise.

Notably, with ulimit -n 2048 and nut-scanner -T 2048 ... and a scan of those 8 * /24 subnets (so 2048 IPs) there were no stderr complaints. With ulimit -n 1536 there were, e.g.:

...
Failed to open SNMP session for 10.0.7.252.
/usr/share/snmp/hosts/10.0.7.254.conf: Too many open files
/usr/share/snmp/hosts/10.0.7.254.local.conf: Too many open files
/usr/share/snmp/hosts/10.0.7.253.local.conf: Too many open files
/var/lib/snmp/hosts/10.0.7.254.conf: Too many open files
/var/lib/snmp/hosts/10.0.7.254.local.conf: Too many open files
/var/lib/snmp/hosts/10.0.7.253.conf: Too many open files
/var/lib/snmp/hosts/10.0.7.253.local.conf: Too many open files
/etc/snmp/hosts/10.0.7.255.conf: Too many open files
/etc/snmp/hosts/10.0.7.255.local.conf: Too many open files
Failed to open SNMP session for 10.0.7.254.
/usr/share/snmp/hosts/10.0.7.255.conf: Too many open files
/usr/share/snmp/hosts/10.0.7.255.local.conf: Too many open files
/var/lib/snmp/hosts/10.0.7.255.conf: Too many open files
/var/lib/snmp/hosts/10.0.7.255.local.conf: Too many open files
Failed to open SNMP session for 10.0.7.255.
Failed to open SNMP session for 10.0.7.253.

... so as far as netsnmp constraints are concerned, whether files or sockets, a 1:1 ulimit/host ratio is okay.

NetXML scans however crashed with a jobs allowance over 1021, and failed to create sockets (Error creating socket) with job allowance greater than ulimit - 3 (so also nearly 1:1):

*** buffer overflow detected ***: terminated
Aborted (core dumped)

...one "buffer overflow" (comes from libc AFAIK) for each thread above 1021 (so starting with nut-scanner --thread 1022); also with ulimit -n 1024 it fails to create sockets starting with --thread 1022 (again, ulimit minus 3 fires here).

It seems, the next improvement here would be to not hardcode the default thread count limit, but to derive it from detected current file descriptor ulimit (probably not the other way around - forcing the ulimit bump to match, since this should not generally run as root to be able to bump).

And probably add a (hard? configurable?) limit on thread-count of individual scanners as impacted by their tech :\

@jimklimov jimklimov force-pushed the fix-nut-scanner-threadcount branch 4 times, most recently from 860e455 to ee403d5 Compare November 4, 2021 17:18
@jimklimov
Copy link
Copy Markdown
Member Author

jimklimov commented Nov 4, 2021

After recent commits, I can no longer crash nut-scanner -M for NetXML scans alone (even for larger subnets); however when scanning with several protocols (e.g. oldnut and/or netxml + snmp), and using over 1000 threads, I get buffer overflow detected and core-dumps again.

Double-checked that similar crash is possible in 42ITy fork of NUT with its different mechanism of limiting the thread count.

@jimklimov jimklimov force-pushed the fix-nut-scanner-threadcount branch 2 times, most recently from b1ee48b to a5f4cf4 Compare November 4, 2021 20:16
jimklimov and others added 16 commits November 4, 2021 23:52
…anner.c: not all glibc versions HAVE_PTHREAD_TRYJOIN
@jimklimov jimklimov force-pushed the fix-nut-scanner-threadcount branch from 475534f to 05b5b2b Compare November 4, 2021 22:52
@jimklimov jimklimov changed the title Fix nut-scanner thread count Limit nut-scanner thread count Nov 5, 2021
@jimklimov
Copy link
Copy Markdown
Member Author

TODO later: replicate 42ity approach with semaphores (where available) instead of simple counters (possibly not too thread-safe)?

@jimklimov jimklimov merged commit 2502ab8 into networkupstools:master Nov 6, 2021
@jimklimov jimklimov deleted the fix-nut-scanner-threadcount branch November 6, 2021 19:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants