ARROW-14330: [C++] Create DataHolder #11426
westonpace left a comment:
I like parts of this PR. I'm not sure the node's scheduling is going to work. I think the scheduling prioritization logic and the trigger to resume operation are missing. "Something" needs to be checking memory on a regular basis and, if it frees up, resuming work (prioritizing in-memory data before on-disk data according to the JIRA).
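For illustration, a minimal sketch of the kind of periodic check that seems to be missing, assuming a pool-based threshold (the `MemoryMonitor` class and the resume callbacks are hypothetical, not names from this PR):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

#include "arrow/memory_pool.h"

// Hypothetical polling loop; the class, the threshold, and the resume
// callbacks are assumptions, not part of this PR.
class MemoryMonitor {
 public:
  MemoryMonitor(arrow::MemoryPool* pool, int64_t threshold_bytes)
      : pool_(pool), threshold_bytes_(threshold_bytes) {}

  // Re-check the pool on an interval; when usage drops below the threshold,
  // resume held work, preferring in-memory batches over spilled ones.
  void Run(const std::function<void()>& resume_in_memory,
           const std::function<void()>& resume_spilled) {
    while (!stopped_.load()) {
      if (pool_->bytes_allocated() < threshold_bytes_) {
        resume_in_memory();
        resume_spilled();
      }
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
  }

  void Stop() { stopped_.store(true); }

 private:
  arrow::MemoryPool* pool_;
  int64_t threshold_bytes_;
  std::atomic<bool> stopped_{false};
};
```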
```cpp
std::shared_ptr<Schema> GenerateRandomSchema(size_t num_inputs) {
  static std::vector<std::shared_ptr<DataType>> some_arrow_types = {
      arrow::null(),    arrow::boolean(), arrow::int8(),    arrow::int16(),
      arrow::int32(),   arrow::int64(),   arrow::float16(), arrow::float32(),
      arrow::float64(), arrow::utf8(),    arrow::binary(),  arrow::date32()};

  std::vector<std::shared_ptr<Field>> fields(num_inputs);
  std::default_random_engine gen(42);
  std::uniform_int_distribution<int> types_dist(
      0, static_cast<int>(some_arrow_types.size()) - 1);
  for (size_t i = 0; i < num_inputs; i++) {
    int random_index = types_dist(gen);
    auto col_type = some_arrow_types.at(random_index);
    fields[i] =
        field("column_" + std::to_string(i) + "_" + col_type->ToString(), col_type);
  }
  return schema(fields);
}

void GenerateBatchesFromSchema(const std::shared_ptr<Schema>& schema,
                               size_t num_batches, BatchesWithSchema* out_batches,
                               int multiplicity = 1, int64_t batch_size = 4) {
  if (num_batches == 0) {
    auto empty_record_batch = ExecBatch(*rng_.BatchOf(schema->fields(), 0));
    out_batches->batches.push_back(empty_record_batch);
  } else {
    for (size_t j = 0; j < num_batches; j++) {
      out_batches->batches.push_back(
          ExecBatch(*rng_.BatchOf(schema->fields(), batch_size)));
    }
  }

  size_t batch_count = out_batches->batches.size();
  for (int repeat = 1; repeat < multiplicity; ++repeat) {
    for (size_t i = 0; i < batch_count; ++i) {
      out_batches->batches.push_back(out_batches->batches[i]);
    }
  }
  out_batches->schema = schema;
}
```
While this is an interesting concept, I don't see why we need random batches to test a data holder.
I guess instead of random we could just be comprehensive, so that we cover all the types listed here every time.
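For example, a comprehensive variant could reuse the same type list and emit one column per type (`GenerateComprehensiveSchema` is an illustrative name, not code from this PR):

```cpp
// Comprehensive variant: one column per type in the list, so every type is
// covered on every run instead of being sampled at random.
std::shared_ptr<Schema> GenerateComprehensiveSchema() {
  static std::vector<std::shared_ptr<DataType>> some_arrow_types = {
      arrow::null(),    arrow::boolean(), arrow::int8(),    arrow::int16(),
      arrow::int32(),   arrow::int64(),   arrow::float16(), arrow::float32(),
      arrow::float64(), arrow::utf8(),    arrow::binary(),  arrow::date32()};
  std::vector<std::shared_ptr<Field>> fields;
  fields.reserve(some_arrow_types.size());
  for (size_t i = 0; i < some_arrow_types.size(); i++) {
    const auto& col_type = some_arrow_types[i];
    fields.push_back(
        field("column_" + std::to_string(i) + "_" + col_type->ToString(), col_type));
  }
  return schema(fields);
}
```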
```cpp
  ::arrow::random::RandomArrayGenerator rng_;
};

TEST_F(TestDataHolderNode, TestNonEmpty) {
```
This will eventually need tests that cover the case where limits are hit and batches are actually persisted.
```cpp
std::string RandomString(std::size_t length) {
  const std::string characters =
      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
  std::random_device random_device;
  std::mt19937 generator(random_device());
  std::uniform_int_distribution<> distribution(0, characters.size() - 1);
  std::string random_string;
  for (std::size_t i = 0; i < length; ++i) {
    random_string += characters[distribution(generator)];
  }
  return random_string;
}
```
This probably fits better in a utility class somewhere so it can be shared. It's pretty similar to the existing function in bloom_filter_test.cc.
```cpp
  arrow::fs::FileSystemFromUri(cache_storage_root_path, &root_path).ValueOrDie();

  file_path_ = root_path + file_name;
  status_ = StoreRecordBatch(record_batch, filesystem_, file_path_);
```
Complex logic shouldn't happen in constructors. Maybe there needs to be a Put method or something.
So there are a few options here, but what we are saying is that we have these different DataHolders that are created from a record batch. We could make a function like
template <typename DataHolderType> make_data_holder_from_batch(...)
and it would invoke the constructor with something like the filesystem and a path pointing to a file that has already persisted the record batch. This would allow the constructor to be simpler while still having a common API for creating the data holders from a record batch.
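A rough sketch of that factory, under the assumption that `StoreRecordBatch` and the `DiskDataHolder`-style constructor look roughly like they do in this diff:

```cpp
#include <memory>
#include <string>

#include "arrow/filesystem/filesystem.h"
#include "arrow/record_batch.h"
#include "arrow/result.h"
#include "arrow/status.h"
#include "arrow/util/make_unique.h"

// Hypothetical factory: persist the batch first, then hand the constructor a
// filesystem and a path to the already-written file, so no I/O happens inside
// the constructor. StoreRecordBatch is the helper visible in this diff; the
// constructor signature is an assumption.
template <typename DataHolderType>
arrow::Result<std::unique_ptr<DataHolderType>> MakeDataHolderFromBatch(
    const std::shared_ptr<arrow::RecordBatch>& record_batch,
    const std::shared_ptr<arrow::fs::FileSystem>& filesystem,
    const std::string& file_path) {
  RETURN_NOT_OK(StoreRecordBatch(record_batch, filesystem, file_path));
  return ::arrow::internal::make_unique<DataHolderType>(filesystem, file_path);
}
```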
```cpp
  return batch;
}

class DiskDataHolder : public DataHolder {
```
Nit: These DataHolder implementations might fit better elsewhere. I'm not sure we want the compute module to depend on the ipc module.
I was wondering the same thing. Maybe the best place is arrow/dataset/, like DatasetWritingSinkNodeConsumer.
Datasets also seems like an odd home.
I asked on Zulip: https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Compute.20module.20.2F.20IPC.20module.20dependency
```cpp
  ARROW_DCHECK(executor_ != nullptr);
  return executor_->Submit(this->stop_source_.token(), [this] {
    auto generator = this->data_holder_manager_->generator();
    auto iterator = MakeGeneratorIterator(std::move(generator));
```
Maybe we can use a condition variable that waits until there are things to iterate on. When a generator has something to push, it can notify that cv.
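A minimal sketch of that wait, purely illustrative (none of these names exist in the PR):

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>

// A queue the generator pushes into and the iterating task blocks on,
// instead of polling the iterator.
template <typename T>
class NotifyingQueue {
 public:
  void Push(T item) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(std::move(item));
    }
    cv_.notify_one();  // wake a waiter once there is something to iterate on
  }

  T WaitAndPop() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !queue_.empty(); });
    T item = std::move(queue_.front());
    queue_.pop_front();
    return item;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<T> queue_;
};
```

The submitted task above could then block in WaitAndPop rather than spinning on the iterator.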
```cpp
  data_holder_manager_ =
      ::arrow::internal::make_unique<DataHolderManager>(plan->exec_context());

  auto status = task_group_.AddTask([this]() -> Result<Future<>> {
```
I don't think this is how we should push into the next node. We need a way for a node to be able to request this data from the DataHolder. Setting up a task that basically loops through inputs to push them forward doesn't allow for intelligent scheduling of these things in the future.
Two options I see are:
- Have the nodes that follow the DataHolderManager pull from the DataHolderManager. This is relatively straightforward but lacks control.
- Have some kind of Simple Scheduler which pulls from the DataHolderManagers that have accumulated the most HeldData and pushes those tasks forward (sketched below). This is a bit more work but provides more value in my opinion.
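A rough sketch of the second option, where `HeldBytes()` and `PushNextBatchDownstream()` are hypothetical methods, not part of this PR:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Stand-in for the PR's DataHolderManager, reduced to the two hypothetical
// methods the scheduler needs.
class DataHolderManager {
 public:
  virtual ~DataHolderManager() = default;
  virtual int64_t HeldBytes() const = 0;
  virtual void PushNextBatchDownstream() = 0;
};

// Simple scheduler: repeatedly pick the manager holding the most data and
// push its next batch forward.
class SimpleScheduler {
 public:
  explicit SimpleScheduler(std::vector<DataHolderManager*> managers)
      : managers_(std::move(managers)) {}

  // Returns false once no manager has held data left to schedule.
  bool ScheduleOne() {
    auto it = std::max_element(managers_.begin(), managers_.end(),
                               [](DataHolderManager* a, DataHolderManager* b) {
                                 return a->HeldBytes() < b->HeldBytes();
                               });
    if (it == managers_.end() || (*it)->HeldBytes() == 0) return false;
    (*it)->PushNextBatchDownstream();
    return true;
  }

 private:
  std::vector<DataHolderManager*> managers_;
};
```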
```cpp
struct ExecBatch;

enum class MemoryLevel : int { kGpuLevel, kCpuLevel, kDiskLevel, kNumLevels };
```
Do we want the memory levels to be more extensible than the ones that we statically define? I think this would be fine for a first pass, but we should think about it later, when maybe the user wants to define their own memory levels.
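Speculatively, extensibility could mean replacing the fixed enum with an ordered abstraction that users can subclass, for example:

```cpp
#include <string>

// Speculative sketch: an abstract level with an ordering instead of a fixed
// enum. The built-in GPU/CPU/disk levels would become three instances.
class MemoryLevel {
 public:
  virtual ~MemoryLevel() = default;
  virtual std::string name() const = 0;
  // Lower order means "pull from here first" when draining held data.
  virtual int order() const = 0;
};

class DiskMemoryLevel : public MemoryLevel {
 public:
  std::string name() const override { return "disk"; }
  int order() const override { return 2; }
};
```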
I'm going to close this as I don't think it is being pursued any longer and we are implementing these features in other PRs.
The purpose of this task is to make an ExecNode that can provide the following functionality:
- Be able to obtain heuristics about our memory consumption and have a memory consumption threshold.
- Be able to write incoming ExecBatches to disk if memory consumption is above the threshold, storing either the ExecBatch or a handle to a file in a queue.
- Provide an API for pulling an ExecBatch from the queue. It should favor pulling all of the batches that are in memory first and then the ones that are handles to files (see the sketch below).
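A minimal sketch of that pull order, assuming the queue is split into in-memory and on-disk holders (the two-queue layout is an illustration, not the PR's actual data structure):

```cpp
#include <deque>
#include <memory>

class DataHolder;  // the PR's holder type (in-memory or disk-backed)

// Sketch of the pull order: drain in-memory holders first, then fall back to
// holders that are handles to files. Returns nullptr when both are empty.
std::shared_ptr<DataHolder> PullNext(
    std::deque<std::shared_ptr<DataHolder>>* in_memory,
    std::deque<std::shared_ptr<DataHolder>>* on_disk) {
  auto pop_front = [](std::deque<std::shared_ptr<DataHolder>>* queue) {
    auto holder = queue->front();
    queue->pop_front();
    return holder;
  };
  if (!in_memory->empty()) return pop_front(in_memory);
  if (!on_disk->empty()) return pop_front(on_disk);
  return nullptr;  // caller decides whether to wait or treat as end-of-stream
}
```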