Skip to content

ARROW-9001: [R] Box outputs as correct type in call_function#8256

Closed
romainfrancois wants to merge 43 commits intoapache:masterfrom
romainfrancois:ARROW-9001/unbox
Closed

ARROW-9001: [R] Box outputs as correct type in call_function#8256
romainfrancois wants to merge 43 commits intoapache:masterfrom
romainfrancois:ARROW-9001/unbox

Conversation

@romainfrancois
Copy link
Contributor

call_function() internally unbox to the right R6 class.

This probably needs some more work, e.g. not sure how to deal with this function:

.guess_result_class <- function(arg) {
  # HACK HACK HACK delete this when call_function returns an ArrowObject itself
  if (inherits(arg, "ArrowObject")) {
    return(class(arg)[1])
  } else if (inherits(arg, "array_expression")) {
    return(arg$result_class)
  } else {
    stop("Not implemented")
  }
}

@github-actions
Copy link

@romainfrancois
Copy link
Contributor Author

Turns out we just don't need to keep track of the result_type anymore I think.

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this and I want this everywhere :) (#8246 (comment))

Just one note, I think there's more code that can be deleted. You're right, with this change you don't need to track result_type, that was a total workaround.

r/R/compute.R Outdated
Comment on lines 134 to 139
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be match_arrow.ChunkedArray <- match_arrow.Array now since they're the same. There are probably a few others like this (I see Ops also, for example).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can go further when tackling this: https://issues.apache.org/jira/browse/ARROW-10089

@romainfrancois
Copy link
Contributor Author

I love this and I want this everywhere :)

I sort of assumed you would go there, as I was working on #8246 too. So in short we would create the R6 object in C++ and don't have to manipulate the external pointers in R ? I can see the appeal, but I need to work out some logistics.

We would retain the same signatures ? e.g.

// [[arrow::export]]
std::shared_ptr<arrow::DataType> Int8__initialize() { return arrow::int8(); }

presently before this goes to R, this is wrapped in the external pointer, so we have a cpp11::external_pointer<std::shared_ptr<arrow::DataType>>, then in the R side it gets wrapped into a (subclass of) DataType, but I suppose that we could have that the std::shared_ptr<arrow::DataType> is directly wrapped to the correct R6 instance or NULL, this is essentially moving the weird ..dispatch method down C++ layer.

@nealrichardson
Copy link
Member

So in short we would create the R6 object in C++ and don't have to manipulate the external pointers in R ?

Ideally, right?

We would retain the same signatures ?

Right, I think the most developer-friendly experience would be to pack the externalptr -> R6 logic into the as_sexp method. And yeah ..dispatch would probably go away entirely because you'd directly instantiate the correct subclass; that dispatch is because the R side didn't know exactly which subclass it was getting returned.

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits

Comment on lines 323 to 316
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like something we'd like to use cpp11::function for, but among other things that always calls the function in R_GlobalEnv

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this is more of a hypothetical cpp11::call thing, e.g.

cpp11::call call(cpp11::call(R_DollarSymbol, symbol, fun_symbol), xp);
cpp11::sexp result = call.eval(arrow::r::ns::arrow);

Then I guess cpp11::function could factor out some of its logic in cpp11::call

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Someday this should probably be

SEXP as_sexp(arrow::Datum datum)

so that we don't need to wrap datums in from_datum

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that still do the dispatch internally before reaching the R side or would there be an R6 class for arrow::Datum. Probably the latter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think an R6 class for Datum has value, since SEXP already fills the role of a discriminated union of the members of Datum.

With the addition of R6 I amend my earlier comment: from_datum probably be replaced by R6::R6(Datum)

@romainfrancois
Copy link
Contributor Author

romainfrancois commented Sep 25, 2020

library(tidyverse)

brio::read_lines("~/git/apache/arrow/r/src/arrowExports.cpp") %>% 
  str_subset("^std::[a-z]+_ptr<") %>% 
  str_remove(" .*$") %>% 
  unique() 
#>  [1] "std::shared_ptr<arrow::Array>"                     
#>  [2] "std::shared_ptr<arrow::DataType>"                  
#>  [3] "std::shared_ptr<arrow::ArrayData>"                 
#>  [4] "std::shared_ptr<arrow::ChunkedArray>"              
#>  [5] "std::shared_ptr<arrow::Buffer>"                    
#>  [6] "std::shared_ptr<arrow::util::Codec>"               
#>  [7] "std::shared_ptr<arrow::io::CompressedOutputStream>"
#>  [8] "std::shared_ptr<arrow::io::CompressedInputStream>" 
#>  [9] "std::shared_ptr<arrow::compute::CastOptions>"      
#> [10] "std::shared_ptr<arrow::RecordBatch>"               
#> [11] "std::shared_ptr<arrow::Table>"                     
#> [12] "std::shared_ptr<arrow::csv::ReadOptions>"          
#> [13] "std::shared_ptr<arrow::csv::ParseOptions>"         
#> [14] "std::shared_ptr<arrow::csv::ConvertOptions>"       
#> [15] "std::shared_ptr<arrow::csv::TableReader>"          
#> [16] "std::shared_ptr<ds::ScannerBuilder>"               
#> [17] "std::shared_ptr<arrow::Schema>"                    
#> [18] "std::shared_ptr<ds::Dataset>"                      
#> [19] "std::shared_ptr<ds::UnionDataset>"                 
#> [20] "std::shared_ptr<ds::InMemoryDataset>"              
#> [21] "std::shared_ptr<ds::FileFormat>"                   
#> [22] "std::shared_ptr<fs::FileSystem>"                   
#> [23] "std::shared_ptr<ds::DatasetFactory>"               
#> [24] "std::shared_ptr<ds::ParquetFileFormat>"            
#> [25] "std::shared_ptr<ds::IpcFileFormat>"                
#> [26] "std::shared_ptr<ds::CsvFileFormat>"                
#> [27] "std::shared_ptr<ds::Partitioning>"                 
#> [28] "std::shared_ptr<ds::PartitioningFactory>"          
#> [29] "std::shared_ptr<ds::Scanner>"                      
#> [30] "std::shared_ptr<arrow::Field>"                     
#> [31] "std::shared_ptr<ds::Expression>"                   
#> [32] "std::shared_ptr<arrow::ipc::feather::Reader>"      
#> [33] "std::shared_ptr<fs::FileSelector>"                 
#> [34] "std::shared_ptr<arrow::io::InputStream>"           
#> [35] "std::shared_ptr<arrow::io::RandomAccessFile>"      
#> [36] "std::shared_ptr<arrow::io::OutputStream>"          
#> [37] "std::shared_ptr<fs::LocalFileSystem>"              
#> [38] "std::shared_ptr<fs::SubTreeFileSystem>"            
#> [39] "std::shared_ptr<fs::S3FileSystem>"                 
#> [40] "std::shared_ptr<arrow::io::MemoryMappedFile>"      
#> [41] "std::shared_ptr<arrow::io::ReadableFile>"          
#> [42] "std::shared_ptr<arrow::io::BufferReader>"          
#> [43] "std::shared_ptr<arrow::io::FileOutputStream>"      
#> [44] "std::shared_ptr<arrow::io::BufferOutputStream>"    
#> [45] "std::shared_ptr<arrow::json::ReadOptions>"         
#> [46] "std::shared_ptr<arrow::json::ParseOptions>"        
#> [47] "std::shared_ptr<arrow::json::TableReader>"         
#> [48] "std::shared_ptr<arrow::MemoryPool>"                
#> [49] "std::shared_ptr<arrow::ipc::MessageReader>"        
#> [50] "std::shared_ptr<arrow::ipc::Message>"              
#> [51] "std::shared_ptr<parquet::ArrowReaderProperties>"   
#> [52] "std::shared_ptr<parquet::arrow::FileReader>"       
#> [53] "std::shared_ptr<parquet::ArrowWriterProperties>"   
#> [54] "std::shared_ptr<parquet::WriterPropertiesBuilder>" 
#> [55] "std::shared_ptr<parquet::WriterProperties>"        
#> [56] "std::shared_ptr<parquet::arrow::FileWriter>"       
#> [57] "std::shared_ptr<arrow::RecordBatchReader>"         
#> [58] "std::shared_ptr<arrow::ipc::RecordBatchFileReader>"
#> [59] "std::shared_ptr<arrow::ipc::RecordBatchWriter>"    
#> [60] "std::shared_ptr<arrow::Scalar>"

Created on 2020-09-25 by the reprex package (v0.3.0.9001)

@romainfrancois romainfrancois force-pushed the ARROW-9001/unbox branch 2 times, most recently from c79c48f to cb60e7b Compare September 25, 2020 15:04
@romainfrancois
Copy link
Contributor Author

started to get away from the R function shared_ptr(). Not handling all cases yet, eventually this:

template <typename T>
SEXP as_sexp(const std::shared_ptr<T>& ptr) {
  return cpp11::external_pointer<std::shared_ptr<T>>(new std::shared_ptr<T>(ptr));
}

will disappear and we'll have individual functions for each as generated by some macro.

I'm not clear yet about the classes that are currently handled with $..dispatch() methods in R.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this (big) header here now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was because of these classes that were not available in the fw header:

R6_HANDLE(arrow::dataset::DirectoryPartitioning, "DirectoryPartitioning")
R6_HANDLE(arrow::dataset::HivePartitioning, "HivePartitioning")

forward declaring them seems to do the trick.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 We should probably add the forward declarations to arrow/dataset/type_fwd.h

@nealrichardson
Copy link
Member

I'm not clear yet about the classes that are currently handled with $..dispatch() methods in R.

I can think of a few possibilities (as I'm sure you can):

  1. List out all of the base classes that have the dispatch methods explicitly, just as you are with the class to R6 mapping, and do it directly (there are only 6 ..dispatch methods defined)
  2. Detect whether the R6 object created has a ..dispatch method and if so, call it (though we'd need to be careful that this doesn't trigger recursively)
  3. Presumably a function that has a return signature e.g. std::shared_ptr<arrow::Array> doesn't actually reveal that the returned object is std::shared_ptr<arrow::StructArray> or whatever subclass, but if it did, then ..dispatch goes away entirely and every class gets enumerated explicitly.

@romainfrancois
Copy link
Contributor Author

I'd like to make the ..dispatch() methods go away. So I guess the question is when a C++ function returns an Array that in fact is a StructArray, should the R object be Array or StructArray.

Now, with ..dispatch() we do resolve them so that we get a StructArray

@romainfrancois
Copy link
Contributor Author

Finished rebasing this, now seeing, not sure what this is about:

> test_check("arrow")
3491
── 1. Error: write_parquet with filesystem arg (@test-s3-minio.R#102)  ─────────
3492
NotImplemented: It is not possible to append efficiently to S3 objects
3493
Backtrace:
3494
 1. testthat::expect_length(fs$ls(minio_path("hive_dir")), 2)
3495
 4. fs$ls(minio_path("hive_dir"))
3496
 5. arrow:::fs___FileSystem__OpenAppendStream(self, clean_path_rel(path))
3497

3498
── 2. Error: open_dataset with fs (@test-s3-minio.R#113)  ──────────────────────
3499
Can't subset columns that don't exist.
3500
✖ Column `dbl` doesn't exist.
3501
Backtrace:
3502
  1. testthat::expect_identical(...)
3503
  9. dplyr::select(., dbl, lgl)
3504
 13. arrow:::column_select(arrow_dplyr_query(.data), !!!enquos(...))
3505
 14. tidyselect:::.FUN(names(.data), !!!enquos(...))
3506
 15. tidyselect:::eval_select_impl(...)
3507
 23. tidyselect:::vars_select_eval(...)
3508
 24. tidyselect:::walk_data_tree(expr, data_mask, context_mask)
3509
 25. tidyselect:::eval_c(expr, data_mask, context_mask)
3510
 26. tidyselect:::reduce_sels(node, data_mask, context_mask, init = init)
3511
 27. tidyselect:::walk_data_tree(new, data_mask, context_mask)
3512
 28. tidyselect:::as_indices_sel_impl(...)
3513
 29. tidyselect:::as_indices_impl(x, vars, strict = strict)
3514
 30. tidyselect:::chr_as_locations(x, vars)
3515
 31. vctrs::vec_as_location(x, n = length(vars), names = vars)
3516
 33. vctrs:::stop_subscript_oob(...)
3517
 34. vctrs:::stop_subscript(...)
3518

3519
── 3. Error: write_dataset with fs (@test-s3-minio.R#122)  ─────────────────────
3520
IOPath does not exist '1603877944.93906/new_dataset_dir'
3521
Backtrace:
3522
 1. testthat::expect_length(...)
3523
 4. fs$ls(minio_path("new_dataset_dir"))
3524
 5. self$GetFileInfo(selector)
3525
 6. arrow:::fs___FileSystem__GetTargetInfos_FileSelector(self, x)
3526

3527
── 4. Error: Let's test copy_files too (@test-s3-minio.R#136)  ─────────────────
3528
Can't subset columns that don't exist.
3529
✖ Column `dbl` doesn't exist.
3530
Backtrace:
3531
  1. testthat::expect_identical(...)
3532
  9. dplyr::select(., dbl, lgl)
3533
 13. arrow:::column_select(arrow_dplyr_query(.data), !!!enquos(...))
3534
 14. tidyselect:::.FUN(names(.data), !!!enquos(...))
3535
 15. tidyselect:::eval_select_impl(...)
3536
 23. tidyselect:::vars_select_eval(...)
3537
 24. tidyselect:::walk_data_tree(expr, data_mask, context_mask)
3538
 25. tidyselect:::eval_c(expr, data_mask, context_mask)
3539
 26. tidyselect:::reduce_sels(node, data_mask, context_mask, init = init)
3540
 27. tidyselect:::walk_data_tree(new, data_mask, context_mask)
3541
 28. tidyselect:::as_indices_sel_impl(...)
3542
 29. tidyselect:::as_indices_impl(x, vars, strict = strict)
3543
 30. tidyselect:::chr_as_locations(x, vars)
3544
 31. vctrs::vec_as_location(x, n = length(vars), names = vars)
3545
 33. vctrs:::stop_subscript_oob(...)
3546
 34. vctrs:::stop_subscript(...)

@nealrichardson
Copy link
Member

It's a test file that I added after you started this PR, and it's not run on every CI job, and you aren't running it locally unless you set up minio. See https://github.com/apache/arrow/blob/master/r/README.md#running-tests. On macOS you can brew install minio

@nealrichardson
Copy link
Member

If that one fix isn't enough, 883eb57 is probably where the relevant code came in, so see if you see anything there that needs updating.

@romainfrancois
Copy link
Contributor Author

This last commit is related to https://issues.apache.org/jira/browse/ARROW-10080 and gives a way to release memory immediately for RecordBatch and Table. If we want more classes (maybe Array etc ... ?) we can add them to generated code in codegen.R.

$invalidate() can be used on any arrow object but it only release the memory (i.e. call the destructor) if the method is generated:

invalidate = function() {
      cl <- class(self)[1L]
      # if there is a Reset function for that class, call it
      reset <- get(paste0("_arrow_", cl, "__Reset"), ns_arrow)
      if (!is.null(reset)) {
        get(".Call")(reset, self)
      }
      # but in any case, set the external pointer to NULL
      assign(".:xp:.", NULL, envir = self)
    }

otherwise it just sets the external pointer to NULL which is less useful because the memory will be reclaimed only later as part of a garbage collection.

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things I noticed, mostly rebase cleanup.

It kinda feels like we've just moved the shared_ptr() business down a level since we still have to explicitly wrap and declare types for the output, though that is an improvement. I had envisioned something, either as as_sexp methods or as some custom wrapping in codegen.R, that would handle the mapping from the return signature (e.g. std::shared_ptr<arrow::Table>) to R6 class name (Table). Maybe that was wishful thinking.

@romainfrancois
Copy link
Contributor Author

A few things I noticed, mostly rebase cleanup.

It kinda feels like we've just moved the shared_ptr() business down a level since we still have to explicitly wrap and declare types for the output, though that is an improvement. I had envisioned something, either as as_sexp methods or as some custom wrapping in codegen.R, that would handle the mapping from the return signature (e.g. std::shared_ptrarrow::Table) to R6 class name (Table). Maybe that was wishful thinking.

Let me try to push this further.

Now all the functions return an R6, which for now is just an alias to SEXP but may become some other thing, and it's our job to make the right R6 based on the shared pointer we mean to wrap.

This is indeed just moving the needle and pain down on level.

What I can try to do is that R6 be a class that:

  • can be converted to SEXP, i.e. it has a operator SEXP()
  • can be constructed from the various types, e.g. we'd have many constructors like R6(const std::shared_ptr<T>&) for all the T we handle, and some of these (e.g. R6(const std::shared_ptr<arrow::Array>&) do some dispatch (what r6_Array() does now).

@romainfrancois
Copy link
Contributor Author

Last commit goes in that direction with:

namespace cpp11 {

template <typename T>
std::string r6_class_name(const std::shared_ptr<T>& x) ;

template <typename T>
SEXP to_r6(const std::shared_ptr<T>& x) {
  if (x == nullptr) return R_NilValue;

  auto r_class_name = cpp11::r6_class_name<T>(x);
  cpp11::external_pointer<std::shared_ptr<T>> xp(new std::shared_ptr<T>(x));
  SEXP r6_class = Rf_install(r_class_name.c_str());

  // make call:  <symbol>$new(<x>)
  SEXP call = PROTECT(Rf_lang3(R_DollarSymbol, r6_class, arrow::r::symbols::new_));
  SEXP call2 = PROTECT(Rf_lang2(call, xp));

  // and then eval in arrow::
  SEXP r6 = PROTECT(Rf_eval(call2, arrow::r::ns::arrow));

  UNPROTECT(3);
  return r6;
}
}

class R6 {
public:

  template <typename T>
  R6(const std::shared_ptr<T>& x) : data_(cpp11::to_r6<T>(x)){}

  template <typename T>
  R6(std::unique_ptr<T> x) : data_(cpp11::to_r6<T>(std::shared_ptr<T>(x.release()))){}

  R6(SEXP data) : data_(data){}

  operator SEXP() const {
    return data_;
  }

private:
  SEXP data_;
};

so that the functions "just" return R6 and the the R6 constructor finds the right R6 class name by calling one of the r6_class_name methods.

This however means defining many of these, most of them being trivial:

template <> inline std::string r6_class_name<arrow::Field>(const std::shared_ptr<arrow::Field>& array_data) {
  return "Field";
}

but some of them with some notion of dispatch:

  auto type_name = dataset->type_name();

  if (type_name == "union") {
    return "UnionDataset";
  } else if (type_name == "filesystem") {
    return "FileSystemDataset";
  } else if(type_name == "in-memory"){
    return "InMemoryDataset";
  } else {
    return "Dataset";
  }
}

@bkietz
Copy link
Member

bkietz commented Nov 11, 2020

@nealrichardson @romainfrancois I've rebased and

  • moved forward declarations to c++ headers, so it's no longer necessary to forward declare things in arrow_exports.h
  • #include all type_fwd.h headers in arrow_types.h so all forward declarations are visible to r/src/*
  • moved specializations of r6_class_name to arrow_types.h so they'll also be visible to r/src/*
  • deleted arrow_exports.h; arrowExports.cpp now includes arrow_types.h

@bkietz
Copy link
Member

bkietz commented Nov 11, 2020

@nealrichardson
Copy link
Member

@github-actions crossbow submit -g r

@github-actions
Copy link

Revision: 2c3dd42

Submitted crossbow builds: ursa-labs/crossbow @ actions-701

Task Status
homebrew-r-autobrew TravisCI
test-r-linux-as-cran Github Actions
test-r-rhub-ubuntu-gcc-release Azure
test-r-rocker-r-base-latest Azure
test-r-rstudio-r-base-3.6-bionic Azure
test-r-rstudio-r-base-3.6-centos6 Azure
test-r-rstudio-r-base-3.6-centos8 Azure
test-r-rstudio-r-base-3.6-opensuse15 Azure
test-r-rstudio-r-base-3.6-opensuse42 Azure
test-ubuntu-18.04-r-sanitizer Azure

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Will merge when the crossbow jobs finish, then will rebase any outstanding R pull requests because this will probably conflict.

@nealrichardson
Copy link
Member

@github-actions crossbow submit test-r-linux-as-cran

@github-actions
Copy link

Revision: 2c3dd42

Submitted crossbow builds: ursa-labs/crossbow @ actions-702

Task Status
test-r-linux-as-cran Github Actions

@nealrichardson
Copy link
Member

nealrichardson commented Nov 11, 2020

@kszucs For some reason the test-r-linux-as-cran job (branch) isn't being created on crossbow. Unfortunately the crossbow output (e.g. https://github.com/apache/arrow/runs/1386912319?check_suite_focus=true) doesn't provide any information about it.

FWIW it is still being triggered on the nightly builds: https://github.com/ursa-labs/crossbow/branches/all?query=test-r-linux-as-cran

@kou
Copy link
Member

kou commented Nov 11, 2020

FYI: #8386 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants