ARROW-3316: [R] Multi-threaded conversion from R data.frame to Arrow table / record batch #9615
Conversation
```cpp
bool CanExtendParallel(SEXP x, const std::shared_ptr<arrow::DataType>& type) {
  // TODO: identify when it's ok to do things in parallel
  return false;
}
```

Rather than just turning this to `true`, it would probably be better to move the "can this be done in parallel" check to a virtual method of the converter.
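For illustration, here is a hedged sketch of that idea; the class, method, and type checks below are hypothetical, not the PR's actual code:

```cpp
// Hypothetical sketch: parallel eligibility as a virtual method, so each
// converter decides for its own R type. Names are illustrative only.
#include <Rinternals.h>

class RConverter {
 public:
  virtual ~RConverter() = default;
  // Conservative default: convert on the main R thread.
  virtual bool CanExtendParallel(SEXP x) const { return false; }
};

class DoubleConverter : public RConverter {
 public:
  // A plain REALSXP vector with no class attribute can be read without
  // calling back into the R API, so it is safe off the main thread.
  bool CanExtendParallel(SEXP x) const override {
    return TYPEOF(x) == REALSXP && !OBJECT(x);
  }
};
```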
The issue with doing R things in parallel is that you can't, really. Maybe we can have an R-specific mutex:

```cpp
std::mutex& get_r_mutex() {
  static std::mutex m;
  return m;
}
```

that we can lock whenever we do need to call something in the R API, including making a cpp11 vector, hence this wrapper:

```cpp
template <class vector>
class synchronized {
 public:
  synchronized(SEXP x) {
    std::lock_guard<std::mutex> lock(get_r_mutex());
    data_ = new vector(x);
  }

  vector& data() { return *data_; }

  ~synchronized() {
    std::lock_guard<std::mutex> lock(get_r_mutex());
    delete data_;
  }

 private:
  vector* data_;
};
```

so that we can have something like this:

```cpp
// [[arrow::export]]
int parallel_test(int n) {
  auto tasks = arrow::internal::TaskGroup::MakeThreaded(arrow::internal::GetCpuThreadPool());
  SEXP x = PROTECT(Rf_allocVector(REALSXP, 100));
  std::atomic<int> count(0);

  for (int i = 0; i < n; i++) {
    tasks->Append([x, &count] {
      synchronized<cpp11::doubles> dx(x);
      int nx = dx.data().size();
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
      count += nx;
      return arrow::Status::OK();
    });
  }

  auto status = tasks->Finish();
  UNPROTECT(1);
  return count;
}
```

Of course this only makes sure that the construction and destruction of the cpp11 vector happen under the lock.
Force-pushed from 06a4d75 to 28baff4, then from 28baff4 to 6e41082.
Marking this as ready to review. I've changed the approach this week so that it does not need to resort to locking: a first scheduling pass appends the column conversions that can safely run off the main R thread to a threaded task group, and the remaining conversions are then handled serially.
@github-actions crossbow submit -g r
Revision: 299c34f94c61c7017f4a9e32437ddd0d9bbd50ee

Submitted crossbow builds: ursacomputing/crossbow @ actions-362
westonpace left a comment:
Neal asked me to take a look at some of the parallel stuff since I've been working on some parallel code in the C++ code base as well. I think this is a very clever approach. You basically take a quick scheduling pass through all of the data to spawn as much parallel work as you can, and then tackle the rest serially.

One thing I would watch out for with `DelayedExtend` is iterating through the data both before and after the delay, since you will most likely lose your CPU cache between the iterations and be forced to load the data out of RAM twice. I'm pretty sure you are not doing that here, so I don't think it is a problem. Future `DelayedExtend` implementations will need to keep an eye out, though.
I'm going to try and take a bit more of a look tomorrow but here are some initial comments.
r/src/r_to_arrow.cpp (outdated):

```cpp
}

// then wait for the parallel tasks to finish
status &= parallel_tasks->Finish();
```
It would be nice if there were a good way to trigger the `parallel_tasks` to fail early if `status` was not ok here (and we broke out of the serial loop above).
I'm not sure how to do this
Scanning through quickly I just thought "it would be nice" but now thinking about implementing it I think "but it sure would be tricky" 😆
I think you would need to use a stop token of some kind: either `arrow::StopToken` in arrow/util/cancel.h, or an atomic bool of some kind that is set by the serial tasks.

`TaskGroup::MakeThreaded` can take in a stop token as well. That might be easiest:

- Create a `StopSource`
- Get a `StopToken` from the source
- Pass the `StopToken` into `TaskGroup::MakeThreaded`
- After doing the serial work, if there is an error, call `RequestStop` on the source.
That would cancel any conversion tasks that are scheduled but not yet executing. Any tasks currently executing would simply have to run until completion. If you really wanted to make it responsive then you could pass the StopToken into your conversion functions and check it periodically to see if a stop was requested and, if so, bail early.
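For concreteness, a hedged sketch of those four steps, using the names from arrow/util/cancel.h and arrow/util/task_group.h (exact signatures may vary across Arrow versions):

```cpp
#include <arrow/status.h>
#include <arrow/util/cancel.h>
#include <arrow/util/task_group.h>
#include <arrow/util/thread_pool.h>

arrow::Status RunWithCancellation() {
  arrow::StopSource stop_source;                      // 1. create a StopSource
  arrow::StopToken stop_token = stop_source.token();  // 2. get its StopToken

  // 3. the threaded task group observes the token
  auto parallel_tasks = arrow::internal::TaskGroup::MakeThreaded(
      arrow::internal::GetCpuThreadPool(), stop_token);
  parallel_tasks->Append([] { return arrow::Status::OK(); });  // stand-in task

  arrow::Status status = arrow::Status::OK();  // stand-in for the serial work
  if (!status.ok()) {
    stop_source.RequestStop();  // 4. cancel tasks that have not started yet
  }
  return status & parallel_tasks->Finish();
}
```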
r/src/r_to_arrow.cpp (outdated):

```cpp
template <typename Iterator, typename AppendNull, typename AppendValue>
Status VisitVector(Iterator it, int64_t n, AppendNull&& append_null,
                   AppendValue&& append_value) {
  for (R_xlen_t i = 0; i < n; i++, ++it) {
```
Minor: You're pretty close to being able to use a range-based for loop here. I'm not sure how difficult it would be to create an end() pointer and an iterator equality function.
We might revisit when we tackle chunking; iterator equality and `end()` should be straightforward, as the iterator classes we use are just wrappers around either pointers or cpp11 iterators. The only thing is that when we chunk, we might not iterate from start to end.
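As a hedged illustration of the end()/equality idea (this adaptor is hypothetical, and as noted, chunking could use a sub-range rather than the full start-to-end span):

```cpp
#include <cstdint>

// Hypothetical range adaptor: an iterator plus a length, so VisitVector's
// manual loop could become a range-based for. A chunk is just a different
// first/n pair.
template <typename Iterator>
struct VectorRange {
  Iterator first;
  int64_t n;
  Iterator begin() const { return first; }
  Iterator end() const { return first + n; }  // needs operator+ and operator==
};

// usage sketch, e.g. over a raw pointer:
//   for (double value : VectorRange<const double*>{ptr, n}) { ... }
```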
Ok, finally got these benchmarks re-run and this report put together, with a TL;DR for both multi-core and single-core operation. Here's a zip of the report.
Ok, I've added in a run from the commit that this branch is based off of (16a0739), to be a closer comparison, and things are murkier.
I'm a little skeptical that these differences, with the exception of the big change on the data.frames of factor columns, are anything but noise. I don't think there have been any other changes to the data.frame-to-Arrow code between latest master and where this branch is based. For the sake of argument, let's assume that "a little better or a little worse" is really just no change. What surprises me more is that there seems to be only that one improvement: the fannie mae dataset has 31 columns, so with 8 cores, why is performance essentially the same as before / with 1 core?
Absolutely: aside from factors, all of these differences are compatible with being pure noise / no real change. If we don't see any speed-up with any types other than factors, I'm not totally surprised that the naturalistic datasets aren't seeing an improvement, since fannie + nyctaxi, when read in as data.frames, don't result in any factors, and the chi traffic dataset (which starts as a parquet) only has two columns which are factors.
Also, I should have been more careful with my words: that "worse" really should have been "not convincingly better".
Force-pushed from 299c34f to feaf577.
What is the schema of the Fannie Mae data set? Does it have some missing values? Maybe the code goes through this branch:

```cpp
if (arrow::r::can_reuse_memory(x, options.type)) {
  columns[j] = std::make_shared<arrow::ChunkedArray>(
      arrow::r::vec_to_arrow__reuse_memory(x));
}
```

which for now does not benefit from parallelization, and perhaps should, at least when there are some NAs to deal with:

```cpp
// this is only used on some special cases when the arrow Array can just use the memory of
// the R object, via an RBuffer, hence be zero copy
template <int RTYPE, typename RVector, typename Type>
std::shared_ptr<Array> MakeSimpleArray(SEXP x) {
  using value_type = typename arrow::TypeTraits<Type>::ArrayType::value_type;
  RVector vec(x);
  auto n = vec.size();
  auto p_vec_start = reinterpret_cast<const value_type*>(DATAPTR_RO(vec));
  auto p_vec_end = p_vec_start + n;
  std::vector<std::shared_ptr<Buffer>> buffers{nullptr,
                                               std::make_shared<RBuffer<RVector>>(vec)};

  int null_count = 0;
  auto first_na = std::find_if(p_vec_start, p_vec_end, is_NA<value_type>);
  if (first_na < p_vec_end) {
    auto null_bitmap =
        ValueOrStop(AllocateBuffer(BitUtil::BytesForBits(n), gc_memory_pool()));
    internal::FirstTimeBitmapWriter bitmap_writer(null_bitmap->mutable_data(), 0, n);

    // first loop to set all the bits before the first NA
    auto j = std::distance(p_vec_start, first_na);
    int i = 0;
    for (; i < j; i++, bitmap_writer.Next()) {
      bitmap_writer.Set();
    }

    auto p_vec = first_na;
    // then finish
    for (; i < n; i++, bitmap_writer.Next(), ++p_vec) {
      if (is_NA<value_type>(*p_vec)) {
        bitmap_writer.Clear();
        null_count++;
      } else {
        bitmap_writer.Set();
      }
    }
    bitmap_writer.Finish();
    buffers[0] = std::move(null_bitmap);
  }

  auto data = ArrayData::Make(std::make_shared<Type>(), LENGTH(x), std::move(buffers),
                              null_count, 0 /*offset*/);

  // return the right Array class
  return std::make_shared<typename TypeTraits<Type>::ArrayType>(data);
}
```

Looking at this in the next few days.
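To make that concrete: once DATAPTR_RO() has been taken on the main thread, the NA scan above is plain C++, which is why it could move onto worker threads. A minimal, hedged sketch of that idea, with std::isnan standing in for R's actual NA check and std::async standing in for Arrow's thread pool:

```cpp
#include <cmath>
#include <cstdint>
#include <future>

// Count missing values in a double buffer. R encodes NA_real_ as a specific
// NaN payload; std::isnan is a conservative stand-in for this sketch.
static int64_t CountNulls(const double* begin, const double* end) {
  int64_t n = 0;
  for (const double* p = begin; p != end; ++p) n += std::isnan(*p) ? 1 : 0;
  return n;
}

// Split the scan in two: one half on another thread, one half here.
int64_t ParallelNullCount(const double* data, int64_t n) {
  const double* mid = data + n / 2;
  auto left = std::async(std::launch::async, CountNulls, data, mid);
  int64_t right = CountNulls(mid, data + n);
  return left.get() + right;
}
```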
Oh, knowing about missing values is helpful; lemme dig more into that and see if I can replicate performance differences on those.

I also have been digging into differences across types. Factors seem to parallelize really well, so I tried to convert the chitraffic data frame, which is a mix of strings + numerics + 2 factor columns. When I do that (with 12 CPU cores available), the most I'm seeing the CPU get to is ~140%, and even that is only briefly; most of the time the process is at 100%.

I then created a silly version of this dataset where I converted each of the columns into a factor (totally naively, with as.factor()), and converting that takes about half the time, plus the CPU usage peaks at ~300%, though it drops down to 100% and then bumps back up a few times.
Thanks. The special case for zero copy is likely involved here; I'll take a look.
Here's another example of trying a data.frame of strings and not seeing parallelization, but converting those strings to factors and, boom, we get parallelization.
@jonkeane I believe the last commit will improve things. The zero-copy cases are now handled in parallel, as it appears these cases might actually represent some work when dealing with missing values.
Yes! I reran the benchmarks again, comparing the last commit here with the base commit. The naturalistic datasets aren't seeing much (if any) speed-up: they are all within the noise range for variability that we see here. I'm going to dig into those separately and see if I see any funny patterns there that might explain it.
I wonder if the logic for "parallelize what you can, then do the rest in serial" isn't working right. Maybe the natural datasets all have at least one column (string, most likely) that can't be parallel, and instead of parallelizing the integer/double/factor columns and then handling the strings, it just keeps them all serial. |
Yeah, I'm going to try testing that exactly and see if I can duplicate this behavior (probably tomorrow) |
Some drastic improvements for those cases. Looking into strings now, hoping to be able to leverage parallelism there too; it's currently not the case:

```cpp
void DelayedExtend(SEXP values, int64_t size, RTasks& tasks) override {
  auto task = [this, values, size]() { return this->Extend(values, size); };

  // TODO: refine this, e.g. extract setup from Extend()
  tasks.Append(false, std::move(task));
}
```
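A hedged sketch of that TODO; PrepareExtend, ExtendPrepared, and StringState are invented names, not this PR's API. The point is to do the R-API-touching setup on the main thread and hand only plain C++ state to the task group:

```cpp
// Hypothetical refinement of the string converter's DelayedExtend():
void DelayedExtend(SEXP values, int64_t size, RTasks& tasks) override {
  // main thread: e.g. materialize ALTREP and collect the CHARSXP pointers
  auto state = std::make_shared<StringState>(this->PrepareExtend(values, size));

  // worker thread: append into the builder without touching the R API
  tasks.Append(/*parallel=*/true,
               [this, state] { return this->ExtendPrepared(*state); });
}
```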
Force-pushed from 599efdc to af3d42b.
After all the conversion tasks are done, there is this loop:

```cpp
for (int j = 0; j < num_fields; j++) {
  auto& converter = converters[j];
  if (converter != nullptr) {
    auto maybe_array = converter->ToArray();
    StopIfNotOk(maybe_array.status());
    columns[j] = std::make_shared<arrow::ChunkedArray>(maybe_array.ValueUnsafe());
  }
}
```

I don't think there's any R involved there, so I suppose this could be done in parallel, with some care.
Done. This probably won't have much impact, because I guess by the time the converter gets there, the builder has already done all the work:

```cpp
virtual Result<std::shared_ptr<Array>> ToArray() { return builder_->Finish(); }
```
@westonpace can you have a look at the updated version?
@romainfrancois @westonpace @jonkeane Is this ready to merge? (The rtools35 error is spurious) |
westonpace left a comment:
Really sorry for the delay, totally my mistake (I saw the ping, made a mental note, and then let the mental note get pushed out of my brain).
What you have should work fine. I think you could simplify it but if you wanted to do that in a follow-up that should be fine.
```cpp
// run the delayed tasks now
for (auto& task : delayed_serial_tasks_) {
  status &= std::move(task)();
```
Rather than wrapping all of your tasks in `StoppingTask`, you really only need the `StopSource` as a way to send a signal to `parallel_tasks_`. Everywhere else you could handle stopping logic on your own. So I think you could change this loop to:
```cpp
for (auto& task : delayed_serial_tasks_) {
  status &= std::move(task)();
  if (!status.ok()) {
    stop_source_.RequestStop();
    break;
  }
}
```
...then you can get rid of `StoppingTask`. If an error happens in a parallel task, the `ThreadedTaskGroup` will already take care of stopping everything.
Thanks @westonpace, I'll merge now and make a followup (edit: I made ARROW-12939) |