Description
The `.tpf` function currently stores all Atoms in memory. This leads to OOM issues (out-of-memory crashes) for large collections.
Some thoughts on this:
Rust OOM tools
Currently, we don't get any useful errors in the log. We can't do a stack trace and there's no unwind, which makes debugging OOM issues hard.
This may also have something to do with Linux overcommitting memory.
- The RFC for `try_reserve` may help prevent panics / the OS killing Atomic-Server. `oom=panic` might help give prettier error messages, but it's not implemented in stable Rust yet. A sketch of fallible allocation follows below.
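For reference, a minimal sketch of what fallible allocation with `Vec::try_reserve` could look like. The `append_batch` function and its arguments are illustrative names, not actual atomic_lib code; the point is that a failed reservation becomes a normal `Err` that can be logged or turned into an error response instead of the process getting killed.

```rust
use std::collections::TryReserveError;

/// Hypothetical helper: grow a buffer fallibly instead of aborting on
/// allocation failure. `atoms` and `batch` are illustrative names only.
fn append_batch(atoms: &mut Vec<String>, batch: Vec<String>) -> Result<(), TryReserveError> {
    // try_reserve returns Err(TryReserveError) instead of aborting the
    // process, so the caller can return a proper error rather than being
    // killed by the kernel's OOM killer.
    atoms.try_reserve(batch.len())?;
    atoms.extend(batch);
    Ok(())
}

fn main() {
    let mut atoms = Vec::new();
    match append_batch(&mut atoms, vec!["atom1".into(), "atom2".into()]) {
        Ok(()) => println!("stored {} atoms", atoms.len()),
        Err(e) => eprintln!("could not allocate: {e}"),
    }
}
```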
Index all the TPF queries
Let's go over the types of TPF queries we use, and how we can index these:
- All the queries with a known `subject` are not relevant.
- By far most queries have a known `property` and `value`.
- The queries with a known `property` probably need a `property-value-subject` index. We don't have that as of now. That would also help us create really performant queries for new, unindexed query filters (see the sketch after this list).
- The queries with only a known `value` are indexed by the `reference_index`.
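To make that idea concrete, here is a minimal sketch of a composite `property-value-subject` key in an ordered index. An in-memory `BTreeSet` stands in for an on-disk tree (e.g. sled); the key layout, separator, and data are illustrative, not the actual atomic_lib index format.

```rust
use std::collections::BTreeSet;

/// Illustrative composite key: property, value and subject joined by a
/// separator byte that cannot occur in the parts themselves.
fn pvs_key(property: &str, value: &str, subject: &str) -> String {
    format!("{property}\0{value}\0{subject}")
}

fn main() {
    // An ordered index keyed on property-value-subject. A real store would
    // use an on-disk tree instead of an in-memory BTreeSet.
    let mut index = BTreeSet::new();
    index.insert(pvs_key("parent", "/commits", "/commits/1"));
    index.insert(pvs_key("parent", "/commits", "/commits/2"));
    index.insert(pvs_key("parent", "/agents", "/agents/1"));

    // A TPF query with a known property and value becomes a prefix scan:
    // matching subjects stream out one by one instead of being collected
    // into memory all at once.
    let prefix = format!("{}\0{}\0", "parent", "/commits");
    for key in index
        .range(prefix.clone()..)
        .take_while(|k| k.starts_with(prefix.as_str()))
    {
        let subject = key.rsplit('\0').next().unwrap();
        println!("match: {subject}");
    }
}
```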
How I found the issue
- Go to atomicdata.dev/collections
- Scroll down
- See `loading...`
The problem is that the WebSocket requests get no response.
Sometimes (but not always) the WebSocket connection seems to fail:
The connection to wss://atomicdata.dev/ws was interrupted while the page was loading. [websockets.js:23:19](https://atomicdata.dev/lib/dist/src/websockets.js)
websocket error:
error { target: WebSocket, isTrusted: true, srcElement: WebSocket, currentTarget: WebSocket, eventPhase: 2, bubbles: false, cancelable: false, returnValue: true, defaultPrevented: false, composed: false, … }
[bugsnag.js:2579:15](https://atomicdata.dev/node_modules/.pnpm/@bugsnag+browser@7.16.5/node_modules/@bugsnag/browser/dist/bugsnag.js)
On the server, I see this every time:
Oct 29 10:50:49 vultr.guest atomic-server[2965299]: Visit https://atomicdata.dev
Oct 29 10:50:49 vultr.guest atomic-server[2965299]: 2022-10-29T10:50:49.596753Z INFO actix_server::builder: Starting 1 workers
Oct 29 10:50:49 vultr.guest atomic-server[2965299]: 2022-10-29T10:50:49.596978Z INFO actix_server::server: Actix runtime found; starting in Actix runtime
Oct 29 10:51:13 vultr.guest systemd[1]: atomic.service: Main process exited, code=killed, status=9/KILL
Oct 29 10:51:13 vultr.guest systemd[1]: atomic.service: Failed with result 'signal'.
Oct 29 10:51:14 vultr.guest systemd[1]: atomic.service: Scheduled restart job, restart counter is at 27.
Oct 29 10:51:14 vultr.guest systemd[1]: Stopped Atomic-Server.
Oct 29 10:51:14 vultr.guest systemd[1]: Started Atomic-Server.
What killed our process?
dmesg -T | grep -E -i -B100 'killed process'
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/atomic.service,task=atomic-server,pid=2965353,uid=0
[Sat Oct 29 10:51:59 2022] Out of memory: Killed process 2965353 (atomic-server) total-vm:891908kB, anon-rss:278920kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:776kB oom_score_adj:0
An out-of-memory issue...
Since we can correctly see most of the Collections, but not all, I think one specific collection is causing this.
After checking them one by one, the culprit seems to be /commits. That makes sense: it is by far the largest collection!
I think the problem has to do with `.tpf` not being iterable.
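As a rough illustration of that point, here is a minimal sketch of the difference between collecting every match and returning an iterator. The `Atom` struct and function names are simplified stand-ins, not the actual atomic_lib types.

```rust
/// Simplified stand-in for an Atom; not the actual atomic_lib type.
struct Atom {
    subject: String,
    property: String,
    value: String,
}

/// Collecting every match keeps the whole result set in memory at once,
/// which is what blows up on a huge collection like /commits.
fn tpf_collected<'a>(all: &'a [Atom], property: &str, value: &str) -> Vec<&'a Atom> {
    all.iter()
        .filter(|a| a.property == property && a.value == value)
        .collect()
}

/// Returning an iterator lets the caller stream matches (e.g. build one
/// page of a Collection) without materializing the full set.
fn tpf_iter<'a>(
    all: &'a [Atom],
    property: &'a str,
    value: &'a str,
) -> impl Iterator<Item = &'a Atom> + 'a {
    all.iter()
        .filter(move |a| a.property == property && a.value == value)
}

fn main() {
    let atoms = vec![Atom {
        subject: "/commits/1".into(),
        property: "parent".into(),
        value: "/commits".into(),
    }];

    // The collected version allocates a Vec holding every match.
    let all_matches = tpf_collected(&atoms, "parent", "/commits");
    println!("collected {} matches", all_matches.len());

    // The iterator version can stop early, e.g. after one page of results.
    for atom in tpf_iter(&atoms, "parent", "/commits").take(30) {
        println!("streamed {}", atom.subject);
    }
}
```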
