
[R] Problem with Join in apache arrow in R #30880

Description

Hi dear Arrow developers. I tested inner_join() with the arrow R package, but R crashed. Here is my example with the toy iris dataset (for comparison, a purely in-memory version of the same join is sketched after the reproduction steps):

data(iris)
write.csv(iris, "iris.csv") # write the CSV file; the row names become the unnamed first column that vroom later reads back as ...1

  1. Write the Parquet files with the write_chunk_data() function from step 2 below, then run the join:

    library(arrow)
    library(dplyr)
    library(purrr)

    walk("C:/Users/Stats/Desktop/ejemplo_join/iris.csv",
         write_chunk_data, "C:/Users/Stats/Desktop/ejemplo_join/parquet", chunk_size = 50)

    iris_arrow <- open_dataset("parquet")

    df1_arrow <- iris_arrow %>% select(...1, Sepal.Length, Sepal.Width, Petal.Length)
    df2_arrow <- iris_arrow %>% select(...1, Petal.Width, Species)

    df <- df1_arrow %>%
      inner_join(df2_arrow, by = "...1") %>%
      group_by(Species) %>%
      summarise(prom = mean(Sepal.Length)) %>%
      collect()
    print(df)

     

     

  2. Run this function first so the Parquet files for the example get written (a built-in alternative using arrow's own dataset writer is sketched right after this list):

     write_chunk_data <- function(data_path, output_dir, chunk_size = 1000000) {
       # If the output_dir does not exist, create it
       if (!fs::dir_exists(output_dir)) fs::dir_create(output_dir)
       # Get the file name without its extension
       data_name <- fs::path_ext_remove(fs::path_file(data_path))
       # Start the chunk counter at 0
       chunk_num <- 0
       # Read the file using vroom
       data_chunk <- vroom::vroom(data_path)
       # Get the variable names
       data_names <- names(data_chunk)
       # Get the number of rows
       rows <- nrow(data_chunk)

       # The following loop writes a Parquet file for every chunk_size rows
       repeat {
         # Check whether more rows remain beyond the current chunk
         if (rows > (chunk_num + 1) * chunk_size) {
           arrow::write_parquet(
             data_chunk[(chunk_num * chunk_size + 1):((chunk_num + 1) * chunk_size), ],
             fs::path(output_dir, glue::glue("{data_name}-{chunk_num}.parquet"))
           )
         } else {
           arrow::write_parquet(
             data_chunk[(chunk_num * chunk_size + 1):rows, ],
             fs::path(output_dir, glue::glue("{data_name}-{chunk_num}.parquet"))
           )
           break
         }
         chunk_num <- chunk_num + 1
       }

       # Free memory and remove vroom's temporary files from disk
       rm(data_chunk)
       tmp_file <- tempdir()
       files <- list.files(tmp_file, full.names = TRUE, pattern = "^vroom")
       file.remove(files)
     }
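
For what it's worth, newer arrow releases can do this chunked write without a manual loop. A minimal sketch, assuming a version of arrow in which write_dataset() supports the max_rows_per_file and max_rows_per_group options:

    library(arrow)

    # Read the CSV once, then let arrow split it into ~50-row Parquet files
    iris_df <- vroom::vroom("C:/Users/Stats/Desktop/ejemplo_join/iris.csv")
    arrow::write_dataset(iris_df,
                         "C:/Users/Stats/Desktop/ejemplo_join/parquet",
                         format = "parquet",
                         max_rows_per_file = 50,   # cap the rows per output file
                         max_rows_per_group = 50)  # row groups must fit within a file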
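
For comparison, here is the same join with both sides collect()ed into plain data frames first, so that dplyr rather than Arrow's execution engine performs it; if this runs while the Dataset version crashes, that points at the Arrow join itself. A minimal sketch, assuming the same ...1 key column as above:

    library(arrow)
    library(dplyr)

    iris_arrow <- open_dataset("parquet")

    # Pull both projections into memory before joining
    df1 <- iris_arrow %>% select(...1, Sepal.Length, Sepal.Width, Petal.Length) %>% collect()
    df2 <- iris_arrow %>% select(...1, Petal.Width, Species) %>% collect()

    # Plain dplyr join on data frames; Arrow's execution engine is not involved
    df1 %>%
      inner_join(df2, by = "...1") %>%
      group_by(Species) %>%
      summarise(prom = mean(Sepal.Length))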

     

Reporter: José F

Note: This issue was originally created as ARROW-15397. Please see the migration documentation for further details.
