Post-hoc allocation site analysis #99

It would be very useful to be able to run a `dumpallocs`-style analysis for a built binary, given its debug info and source tree but without rebuilding it -- which takes time.

This is particularly important for introspecting on the ld.so (#98) because we don't want to have to rebuild the whole ld.so, but we could reasonably expect a source tree to be available.

An obvious problem is that `dumpallocs` requires a complete `.i` file to analyse, but DWARF does not currently include enough information to reconstruct the `.i` for any compilation unit. It does, however, contain various hints: many header files will be mentioned in the line table (but many won't!) and we have information about the compiler including its version (if that matters) and some of its command-line arguments (apparently just the `cc1`-style ones, though).

(One interesting question is to what extent the macro information, enabled at `-g3` with GCC, fills these gaps. I'm not pursuing this because it's almost never used in the field.)

I had a go at creating a simple awk script that mimics the preprocessor and tries to reconstruct where the included files came from. On simple examples this works*, but on realistic examples (e.g. compilation units from glibc's ld.so), even with maximum guesswork, it falls down for a number of reasons:

- generated headers
- computed includes, i.e. where macro expansion is used to generate the include spec
- ambiguity across include paths, i.e. the absence of information about include paths' ordering
- `#include_next`, again in the absence of information about include paths' ordering
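The ambiguity problem can be made concrete with a small sketch. This is not the awk script mentioned above; it is a hypothetical Python illustration, and the names `classify_include`, `include_dirs` and `line_table_files` are assumptions introduced here. Given an unordered set of candidate include directories, it reports whether an include name resolves uniquely, ambiguously, or not at all, and uses the headers named in the DWARF line table to break ties where it can.

```python
import os

def classify_include(name, include_dirs, line_table_files=frozenset(),
                     exists=os.path.isfile):
    """Classify how an include name resolves against unordered include dirs.

    Returns (status, candidates), where status is one of:
      "confirmed" - exactly one candidate is also named in the line table
      "unique"    - exactly one candidate exists in the source tree
      "ambiguous" - several candidates exist and ordering is unknown
      "missing"   - no candidate exists (e.g. a generated header)
    `exists` is injectable so the logic can be exercised without a real tree.
    """
    hits = []
    for d in include_dirs:
        path = os.path.join(d, name)
        if exists(path):
            hits.append(path)
    # The line table is the only hard evidence we have, so consult it first.
    confirmed = [h for h in hits if h in line_table_files]
    if len(confirmed) == 1:
        return ("confirmed", confirmed)
    if not hits:
        return ("missing", [])
    if len(hits) == 1:
        return ("unique", hits)
    return ("ambiguous", hits)
```

Note that a header absent from the line table is not thereby ruled out, since many included headers never appear there. So an "ambiguous" result cannot be narrowed by elimination, only by positive confirmation.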

A better way forward might be to reconstitute a 'good enough' `.i` file even in the absence of such information. That would require a forgiving parser. C is already forgiving about missing function prototypes, so we'd mostly be worried about type information (which of course affects the parse tree!). We could even use `dwarfidl` to generate a rendering of all the type information up-front, and then our parser would just have to be forgiving of duplicates.

Some problems might remain, e.g. function-like macros used as syntax generators -- if we don't manage to slurp the definition of such a macro, we will choke. However, we expect to get most headers, so it could still work most of the time.

\* The definition of 'works' here is already a bit generous. Since we don't have information about `-D` options on the command line, or builtin macros defined by the compiler, we don't know which `#if` / `#ifdef` branches are taken. So the tool follows both branches while keeping track of a 'path condition'. The idea would then be to synthesise an environment of `-D` options that could have generated the output file's DWARF, i.e. one under which the set of included header files covers every header that the line table reports (it may include others too, but must hit all of those). That is already SMT-solver territory.
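To illustrate the 'path condition' idea, here is a minimal Python sketch (again, not the awk script itself). The function name `includes_with_path_conditions` and the header names in the usage example are made up for illustration. It walks the preprocessor directives of one file, follows every conditional branch, and records for each `#include` the conjunction of conditions under which it would be reached. It assumes directives fit on one line and carry no trailing comments, and it does not evaluate or expand the conditions.

```python
import re

DIRECTIVE = re.compile(
    r'^\s*#\s*(ifdef|ifndef|if|elif|else|endif|include_next|include)\b\s*(.*?)\s*$')

def includes_with_path_conditions(lines):
    """Return [(include_spec, path_condition)] for every #include seen.

    path_condition is a tuple of condition strings that must all hold.
    Each stack frame is [prior, current]: `prior` holds the conditions of
    earlier branches in the same #if chain (which must be false), and
    `current` is this branch's own condition (None inside #else).
    """
    stack = []
    found = []
    for line in lines:
        m = DIRECTIVE.match(line)
        if not m:
            continue
        kind, rest = m.groups()
        if kind == "if":
            stack.append([[], rest])
        elif kind == "ifdef":
            stack.append([[], f"defined({rest})"])
        elif kind == "ifndef":
            stack.append([[], f"!defined({rest})"])
        elif kind in ("elif", "else") and stack:
            prior, current = stack[-1]
            if current is not None:
                prior.append(current)
            stack[-1][1] = rest if kind == "elif" else None
        elif kind == "endif" and stack:
            stack.pop()
        elif kind in ("include", "include_next"):
            terms = []
            for prior, current in stack:
                terms.extend(f"!({p})" for p in prior)
                if current is not None:
                    terms.append(current)
            found.append((rest, tuple(terms)))
    return found

example = """\
#include <stddef.h>
#ifdef SHARED
# include "shared-only.h"
#else
# include "static-only.h"
#endif
""".splitlines()

for spec, cond in includes_with_path_conditions(example):
    print(spec, "when", " && ".join(cond) or "always")
```

The conditions collected this way are the constraints a solver would need to satisfy simultaneously for every header the line table reports, which is where the SMT-solver territory begins.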


