
@bkietz (Member) commented Oct 26, 2020

Allocation may fail spuriously since the garbage collector is lazy. A custom MemoryPool can force it to run when allocation may have failed spuriously then try again in case any reusable allocations have been freed.
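The retry idea can be sketched in isolation. This is a minimal stand-in, not Arrow's actual MemoryPool interface: `Pool`, `GcPool`, and `FlakyPool` are illustrative names, and the gc callback stands in for invoking R's garbage collector.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Stand-in allocator interface; the real arrow::MemoryPool differs.
struct Pool {
  virtual ~Pool() = default;
  virtual bool Allocate(std::size_t size, void** out) = 0;
};

// Wraps a delegate pool; on failure, runs a gc callback and retries once,
// mirroring the "gc and try again" idea from this PR.
class GcPool : public Pool {
 public:
  GcPool(Pool* delegate, std::function<void()> gc)
      : delegate_(delegate), gc_(std::move(gc)) {}

  bool Allocate(std::size_t size, void** out) override {
    if (delegate_->Allocate(size, out)) return true;
    gc_();  // allocation may have failed spuriously; free reusable memory
    return delegate_->Allocate(size, out);
  }

 private:
  Pool* delegate_;
  std::function<void()> gc_;
};

// Toy delegate for demonstration: fails until memory is "reclaimed".
struct FlakyPool : Pool {
  bool reclaimed = false;
  bool Allocate(std::size_t, void** out) override {
    if (!reclaimed) return false;
    *out = &reclaimed;  // dummy non-null pointer for the sketch
    return true;
  }
};
```

The wrapper never loops: one forced collection, one retry, so a genuine out-of-memory condition still surfaces as a failure.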

@bkietz (Member, Author) commented Oct 27, 2020

I've replaced explicit calls to arrow::default_memory_pool() with gc_memory_pool() to pick up this change. However, many Arrow APIs don't require an explicit MemoryPool and will invoke arrow::default_memory_pool() if none is provided. The leak will persist if we fail to identify one of these and it keeps using the default memory pool. I've fixed two implicit default pool usages: dataset___Dataset__NewScan and compute__CallFunction. I'll look for others, but please call out any I miss.

@bkietz (Member, Author) commented Oct 27, 2020

Not in scope for this PR, may result in further spurious OOMs later:

  • Feather APIs don't accept a MemoryPool
  • LocalFileSystem::OpenInput{Stream,File} use the default memory pool if use_mmap is not set. This affects dataset scanning transitively since we use those methods to open files for reading
  • Any file opened with FileSystem$OpenInputStream is also affected
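The hazard with these APIs can be shown with stand-in signatures (not Arrow's real declarations): a defaulted pool parameter silently falls back to the process-wide pool unless every call site forwards the custom one.

```cpp
#include <cassert>

struct Pool { int id; };  // stand-in for arrow::MemoryPool

Pool* default_pool() {
  static Pool instance{0};
  return &instance;
}

// Stand-in for an Arrow-style API with an implicit pool default.
// Call sites that omit `pool` quietly use the process-wide default,
// which is what makes these usages easy to miss in an audit.
Pool* OpenStream(Pool* pool = default_pool()) { return pool; }
```

Auditing for the leak described above amounts to finding every call that hits the defaulted path.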

@romainfrancois (Contributor) left a comment

There's a lot I don't understand on the C++ side, but that otherwise looks good to me.


```cpp
// ARROW-10080: Allocation may fail spuriously since the garbage collector is lazy.
// Force it to run then try again in case any reusable allocations have been freed.
static cpp11::function gc = cpp11::package("base")["gc"];
```
@romainfrancois (Contributor):

Maybe make gc() a member of GcMemoryPool instead of repeating it for each GcAndTryAgain?

@bkietz (Member, Author):

The instance is static so it will only be initialized in the first call to GcAndTryAgain().

I had gc as a member of GcMemoryPool initially but it resulted in an instance of gc which didn't show up in trace(). I assume this is due to a subtlety with the timing of the constructor of g_pool but I didn't debug further.

@romainfrancois (Contributor):

In the first call of each instantiation of the template method GcAndTryAgain, right? Probably no big deal anyway.
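The point being made can be demonstrated in isolation: a function-local static inside a function template is initialized once per instantiation, not once for the whole template. This is a self-contained illustration, unrelated to the cpp11 gc handle itself.

```cpp
// Each instantiation of Counter<T> gets its own function-local static,
// so initialization happens once per T, not once globally.
template <typename T>
int Counter() {
  static int calls = 0;  // one distinct `calls` per instantiation
  return ++calls;
}
```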

@romainfrancois (Contributor) commented Oct 28, 2020

I also had, in a branch that builds on top of #8256, a way to prematurely invalidate objects when we know they won't be used anymore. For example, in this function:

```r
collect.arrow_dplyr_query <- function(x, as_data_frame = TRUE, ...) {
  x <- ensure_group_vars(x)
  # Pull only the selected rows and cols into R
  if (query_on_dataset(x)) {
    # See dataset.R for Dataset and Scanner(Builder) classes
    tab <- Scanner$create(x)$ToTable()
  } else {
    # This is a Table/RecordBatch. See record-batch.R for the [ method
    tab <- x$.data[x$filtered_rows, x$selected_columns, keep_na = FALSE]
  }
  if (as_data_frame) {
    df <- as.data.frame(tab)
    tab$invalidate() # HERE <<<<<<-------------
    restore_dplyr_features(df, x)
  } else {
    restore_dplyr_features(tab, x)
  }
}
```

Inside the if (as_data_frame) branch, as soon as tab is converted to a data.frame we no longer need or use tab. Calling $invalidate() on it runs the destructor of the shared pointer held by the external pointer that lives in tab, so the memory is freed right now instead of later when the gc runs.
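The mechanism can be sketched with a plain shared_ptr; the R external-pointer plumbing is omitted, and `Handle`/`Invalidate` are stand-ins for the wrapper object and its $invalidate() method.

```cpp
#include <cassert>
#include <memory>

// Stand-in for the wrapper an R external pointer would hold.
struct Handle {
  std::shared_ptr<int> data;
  // Dropping the shared_ptr now releases the C++ object immediately,
  // rather than waiting for R's garbage collector to run the finalizer.
  void Invalidate() { data.reset(); }
};
```

Eager release like this is exactly the "performance hint" described below: the object dies deterministically instead of at the collector's convenience.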

Is this still worth having? And if so, should I push it to #8256? cc @nealrichardson

@bkietz (Member, Author) commented Oct 28, 2020

I think it's worthwhile to invalidate individual objects as soon as we know we won't need them, since it removes unnecessary burden from the garbage collector; it's essentially a performance hint (and therefore removing it is also justifiable if you want to keep things as simple as possible).

@romainfrancois (Contributor):
Agree. So I cherry-picked that commit into the #8256 pull request so that we have $invalidate() for Table and RecordBatch.

@bkietz (Member, Author) commented Oct 29, 2020

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
@bkietz (Member, Author) commented Oct 30, 2020

CI failures are unrelated. Merging

@bkietz bkietz closed this in b109195 Oct 30, 2020
@bkietz bkietz deleted the 10080-Arrow-does-not-release-un branch October 30, 2020 14:09
@terencehonles (Contributor) commented Nov 4, 2020

> CI failures are unrelated. Merging

@bkietz I've bisected the commit history, and it does look like this change

```diff
diff --git a/cpp/src/arrow/io/memory.cc b/cpp/src/arrow/io/memory.cc
index 1cde5e64e..88f006fe0 100644
--- a/cpp/src/arrow/io/memory.cc
+++ b/cpp/src/arrow/io/memory.cc
@@ -54,7 +54,7 @@ BufferOutputStream::BufferOutputStream(const std::shared_ptr<ResizableBuffer>& b
 Result<std::shared_ptr<BufferOutputStream>> BufferOutputStream::Create(
     int64_t initial_capacity, MemoryPool* pool) {
   // ctor is private, so cannot use make_shared
+  DCHECK_NE(pool, nullptr);
   auto ptr = std::shared_ptr<BufferOutputStream>(new BufferOutputStream);
   RETURN_NOT_OK(ptr->Reset(initial_capacity, pool));
   return ptr;
```

is "responsible" for breaking the JNI status check (you can also see it was passing on the first commit in this PR https://github.com/apache/arrow/runs/1310578738 and then failing at the end https://github.com/apache/arrow/runs/1329149428).

I am not familiar with the code, and I was hoping you could help me understand why the check was added and why it might be crashing the test java/adapter/orc/src/test/java/org/apache/arrow/adapter/orc/OrcReaderTest.java. I have experimented with removing the check, and the test passes as expected, but if the check was added to catch something I'd rather not cover it up. I noticed there were two follow-ups, ARROW-10420 and ARROW-10421, and I was wondering if this might be related?

Edit: since the runs above were not merged and their branch was deleted, it looks like GitHub dropped the logs. This is the specific test error I am referring to (grabbed from a run on the master branch): https://github.com/apache/arrow/runs/1351819845#step:8:7374

@terencehonles (Contributor):

Here's some more context that could be helpful: the first snippet comes from the run logs, and the second from a file generated after the crash.

Test logs:

```
[INFO] Running org.apache.arrow.adapter.orc.OrcReaderTest
16:21:58.149 [main] INFO org.apache.arrow.memory.BaseAllocator - Debug mode enabled.
16:21:58.172 [main] INFO org.apache.arrow.memory.DefaultAllocationManagerOption - allocation manager type not specified, using netty as the default type
16:21:58.175 [main] INFO org.apache.arrow.memory.CheckAllocator - Using DefaultAllocationManager at memory-netty/3.0.0-SNAPSHOT/arrow-memory-netty-3.0.0-SNAPSHOT.jar!/org/apache/arrow/memory/DefaultAllocationManagerFactory.class
```

Test crash logs:

```
# Created at 2020-11-04T16:22:01.347
WARNING: Logging before InitGoogleLogging() is written to STDERR

# Created at 2020-11-04T16:22:01.347
F1104 16:22:01.345963  7719 memory.cc:57]  Check failed: (pool) != (nullptr)

# Created at 2020-11-04T16:22:01.347
*** Check failure stack trace: ***

# Created at 2020-11-04T16:22:04.748
Aborted (core dumped)
```

@bkietz (Member, Author) commented Nov 4, 2020

@terencehonles Sorry for the breakage! I added this check when debugging injection of the customized memory pool. I didn't see that PoolBuffer::MakeUnique() coerces null MemoryPool* to default_memory_pool(), so I assumed leaving it in would be harmless. It can be removed, though it's still slightly suspicious to me that null would be passed for a memory pool given that the test logs seem to indicate a different memory manager (allocation manager type not specified, using netty as the default type) was selected.
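The coercion described for PoolBuffer::MakeUnique is the common null-defaulting idiom; a stand-in version (not Arrow's code, with `Pool`, `default_pool`, and `EnsurePool` as illustrative names):

```cpp
#include <cassert>

struct Pool { /* stand-in for arrow::MemoryPool */ };

Pool* default_pool() {
  static Pool instance;
  return &instance;
}

// A null pool silently becomes the process default, which is why a
// caller passing nullptr worked fine until the DCHECK was added.
Pool* EnsurePool(Pool* pool) { return pool ? pool : default_pool(); }
```

The downside of the idiom is exactly what this PR hit: a check that assumes callers always pass a pool breaks any caller that was relying on the silent fallback.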

@terencehonles (Contributor):

No prob, I'll try to see if I can figure out what the Java code is doing, but thanks for the input!

@terencehonles (Contributor):
It looks like the offending line is https://github.com/apache/arrow/blob/master/cpp/src/jni/orc/jni_wrapper.cpp#L229, and I'm updating it not to pass null. But since passing a null pointer was previously supported, I'm also going to remove the check in case it's relied on elsewhere.

kou pushed a commit that referenced this pull request Nov 5, 2020
`OrcStripeReaderJniWrapper::getSchema` previously used `nullptr` for the memory pool parameter to `arrow::ipc::SerializeSchema` to implicitly call `arrow::default_memory_pool()`.

As part of ARROW-10080 (#8533), a check was placed to fail on `nullptr` being provided. This change removes the check, but also explicitly calls `arrow::default_memory_pool()` which is used elsewhere in the JNI wrapper.

Closes #8595 from terencehonles/arrow-10499

Authored-by: Terence D. Honles <terence@honles.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
bkietz pushed a commit that referenced this pull request Apr 8, 2021
Fixes the cases mentioned in #8533 (comment).

Closes #9939 from lidavidm/arrow-10421

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>