Skip to content

fix(profiling): fix SystemError when collecting memory profiler events [forwardport 3.x]#12140

Merged
erikayasuda merged 2 commits into3.x-stagingfrom
backport-12075-to-3.x
Jan 30, 2025
Merged

fix(profiling): fix SystemError when collecting memory profiler events [forwardport 3.x]#12140
erikayasuda merged 2 commits into3.x-stagingfrom
backport-12075-to-3.x

Conversation

@erikayasuda
Copy link
Copy Markdown
Collaborator

@erikayasuda erikayasuda commented Jan 29, 2025

Forwardporting #12075

We added locking to the memory profiler to address crashes. These locks
are mostly "try" locks, meaning we bail out if we can't acquire them
right away. This was done defensively to mitigate the possibility of
deadlock until we fully understood why the locks are needed and could
guarantee their correctness. But as a result of using try locks, the
iter_events function in particular can fail if the memory profiler lock
is contended when it tries to collect profiling events. The function
then returns NULL, leading to SystemError exceptions because we don't
set an error.

Even if we set an error, returning NULL isn't the right thing to do.
It'll basically mean we wait until the next profile iteration, still
accumulating events in the same buffer, and try again to upload the
events. So we're going to get multiple iteration's worth of events. The
right thing to do is take the lock unconditionally in iter_events. We
can allocate the new tracker outside the memory allocation profiler lock
so that we don't need to worry about reentrancy/deadlock issues if
we start profiling that allocation. Then, the only thing we do under the
lock is swap out the global tracker, so it's safe to take the lock
unconditionally.

Fixes #11831

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

#12075)

We added locking to the memory profiler to address crashes. These locks
are mostly "try" locks, meaning we bail out if we can't acquire them
right away. This was done defensively to mitigate the possibility of
deadlock until we fully understood why the locks are needed and could
guarantee their correctness. But as a result of using try locks, the
`iter_events` function in particular can fail if the memory profiler lock
is contended when it tries to collect profiling events. The function
then returns NULL, leading to SystemError exceptions because we don't
set an error.

Even if we set an error, returning NULL isn't the right thing to do.
It'll basically mean we wait until the next profile iteration, still
accumulating events in the same buffer, and try again to upload the
events. So we're going to get multiple iteration's worth of events. The
right thing to do is take the lock unconditionally in `iter_events`. We
can allocate the new tracker outside the memory allocation profiler lock
so that we don't need to worry about reentrancy/deadlock issues if
we start profiling that allocation. Then, the only thing we do under the
lock is swap out the global tracker, so it's safe to take the lock
unconditionally.

Fixes #11831

TODO - regression test?

(cherry picked from commit e3045a1)
@erikayasuda erikayasuda requested review from a team as code owners January 29, 2025 15:05
@erikayasuda erikayasuda changed the base branch from main to 3.x-staging January 29, 2025 15:05
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jan 29, 2025

CODEOWNERS have been resolved as:

releasenotes/notes/profiling-memalloc-iter-events-null-780fd50bbebbf616.yaml  @DataDog/apm-python
ddtrace/profiling/collector/_memalloc.c                                 @DataDog/profiling-python

@datadog-dd-trace-py-rkomorn
Copy link
Copy Markdown

Datadog Report

Branch report: backport-12075-to-3.x
Commit report: 96bf9d1
Test service: dd-trace-py

✅ 0 Failed, 130 Passed, 1468 Skipped, 5m 26.23s Total duration (35m 44.29s time saved)

@pr-commenter
Copy link
Copy Markdown

pr-commenter Bot commented Jan 29, 2025

Benchmarks

Benchmark execution time: 2025-01-29 17:37:27

Comparing candidate commit 579c406 in PR branch backport-12075-to-3.x with baseline commit 3337560 in branch 3.x-staging.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 394 metrics, 2 unstable metrics.

@erikayasuda erikayasuda merged commit 26a275b into 3.x-staging Jan 30, 2025
@erikayasuda erikayasuda deleted the backport-12075-to-3.x branch January 30, 2025 10:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]: SystemError: <class 'iter_events'> returned NULL without setting an exception

4 participants