Using stat and the like can block dlopen when a large enough thread-local storage is used #16152

@lifthrasiir

Zig Version

0.11.0-dev.3382+c16d4ab9e

Steps to Reproduce and Observed Behavior

Given the following C code:

#include <sys/stat.h>

static __thread char BALLAST[20000];

char *get_ballast(void) {
    struct stat st;
    if (stat("/etc/localtime", &st) != 0) return 0;
    return BALLAST;
}

A shared object built from this file should be safe to dlopen:

$ uname -sm
Linux x86_64
$ zig cc -target x86_64-linux-gnu -fPIC -shared example.c -o example.so
$ python -c 'import ctypes; print(ctypes.CDLL("./example.so").get_ballast())'
<random pointer value>

Expected Behavior

In reality the following happens, at least on my machine with glibc 2.31:

$ python -c 'import ctypes; print(ctypes.CDLL("./example.so").get_ballast())'
Traceback (most recent call last):
    ...
OSError: ./example.so: cannot allocate memory in static TLS block

Removing the stat call above restores the ability to dlopen this shared object.
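For anyone who would rather reproduce this without Python, the same load can be attempted from C. This is only a minimal sketch: `try_load` is a hypothetical helper name, and it assumes the shared object exports `get_ballast` as in the snippet above.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: try to dlopen `path` and call get_ballast().
 * Returns 0 on success, 1 if loading or symbol lookup fails. */
int try_load(const char *path) {
    void *handle = dlopen(path, RTLD_NOW);
    if (handle == NULL) {
        /* On an affected system this is where the
         * "cannot allocate memory in static TLS block" message shows up. */
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* POSIX allows converting the void * from dlsym to a function pointer. */
    char *(*get_ballast)(void) = (char *(*)(void))dlsym(handle, "get_ballast");
    if (get_ballast == NULL) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return 1;
    }

    printf("get_ballast() = %p\n", (void *)get_ballast());
    dlclose(handle);
    return 0;
}
```

Calling `try_load("./example.so")` from a small `main` should hit the same failure path as the ctypes one-liner. On glibc older than 2.34 the program has to be linked with `-ldl`.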

Investigation

It seems that the libc_nonshared.a built by Zig contains an extraneous errno.o:

$ ar t ~/.cache/zig/o/*/libc_nonshared.a | xargs -L 1 basename
atexit.o
at_quick_exit.o
pthread_atfork.o
stack_chk_fail_local.o
errno.o
elf-init-2.33.o
stat.o
fstat.o
lstat.o
stat64.o
fstat64.o
lstat64.o
fstatat.o
fstatat64.o
mknodat.o
mknod.o
stat_t64_cp.o

This is significant because errno.o contains a thread-local symbol that uses the Initial Exec (IE) TLS model [1]:

$ readelf -r ~/.cache/zig/o/*/libc_nonshared.a | grep R_X86_64_GOTTPOFF
00000000001f  000600000016 R_X86_64_GOTTPOFF 0000000000000000 __libc_errno - 4
00000000001c  000400000016 R_X86_64_GOTTPOFF 0000000000000000 __libc_errno - 4
00000000002b  000400000016 R_X86_64_GOTTPOFF 0000000000000000 __libc_errno - 4

As noted by Drepper, a shared object that uses the IE model needs its TLS block allocated immediately at load time, so that later TLS accesses can be done with a fixed offset into that allocation. This "static TLS surplus" is a scarce resource: by default only 1,664 bytes are available [2]. glibc prior to 2.32 also opportunistically allocated all dynamic TLS at load time, causing an unexpected linkage error when IE is somehow mixed in.
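To get a feel for how much TLS each loaded object declares, one option is to walk the program headers with glibc's `dl_iterate_phdr` and look at `PT_TLS` segments. This is a rough diagnostic sketch, not something from the investigation above; `loaded_tls_bytes` and `sum_tls` are hypothetical names, and the sum is only the declared segment sizes, not what the dynamic linker actually reserves.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stddef.h>
#include <stdio.h>

/* Callback for dl_iterate_phdr: add up the PT_TLS sizes of one object. */
static int sum_tls(struct dl_phdr_info *info, size_t size, void *data) {
    size_t *total = data;
    (void)size;
    for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type == PT_TLS) {
            /* dlpi_name is an empty string for the main program. */
            printf("%s: %zu bytes of TLS\n",
                   info->dlpi_name[0] ? info->dlpi_name : "(main program)",
                   (size_t)ph->p_memsz);
            *total += (size_t)ph->p_memsz;
        }
    }
    return 0; /* returning 0 continues the iteration */
}

/* Hypothetical helper: total declared TLS bytes across loaded objects. */
size_t loaded_tls_bytes(void) {
    size_t total = 0;
    dl_iterate_phdr(sum_tls, &total);
    return total;
}
```

An object built from the example above would declare at least the 20,000 bytes of BALLAST, far more than the 1,664-byte surplus, which is why it cannot be placed in static TLS after startup.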

I'm not sure how to solve this. Properly passing -DSHARED might fix the problem (I haven't checked). But the root of the problem is not that errno is compiled with the IE model (errno should have lived within libc.so), but that errno is present in libc_nonshared.a at all and so affects dynamic loading. Patching errno.c and errno.h to avoid IE (e.g. with __attribute__((tls_model("global-dynamic")))) would not be a solution either, because it would break errno for some functions.
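For reference, this is what the `tls_model` attribute looks like in isolation. It is only a syntax sketch with hypothetical variable and function names, and as said above it is not a fix for the libc_nonshared.a problem. The two models differ in the relocations and access sequences the compiler emits in position-independent code, not in observable behavior.

```c
/* Hypothetical thread-local counters, one per TLS model.
 * initial-exec: access via a fixed offset from the thread pointer,
 *               which requires static TLS at load time.
 * global-dynamic: access via __tls_get_addr, which works for dlopen'ed objects. */
static __thread int counter_ie __attribute__((tls_model("initial-exec")));
static __thread int counter_gd __attribute__((tls_model("global-dynamic")));

/* Each function increments its thread's copy and returns the new value. */
int bump_ie(void) { return ++counter_ie; }
int bump_gd(void) { return ++counter_gd; }
```

Compiling this with `-fPIC` and running `readelf -r` on the object, as done above for libc_nonshared.a, is one way to compare the relocations each model produces.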

Footnotes

  1. Ulrich Drepper, ELF Handling For Thread-Local Storage, section 4.3.6 (https://c9x.me/compile/bib/tls.pdf)

  2. https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ffb17e7ba3a5ba9632cee97330b325072fbe41dd
