Skip to content

as.data.frame but keep row names#5320

Merged
mattdowle merged 11 commits intoRdatatable:masterfrom
dereckmezquita:as-data-frame-move-col-to-rownames
Mar 16, 2022
Merged

as.data.frame but keep row names#5320
mattdowle merged 11 commits intoRdatatable:masterfrom
dereckmezquita:as-data-frame-move-col-to-rownames

Conversation

@dereckmezquita
Copy link
Copy Markdown
Member

@dereckmezquita dereckmezquita commented Jan 30, 2022

Closes #5319

Hello, I created an issue in the repo a few hours ago. I want to contribute and thus read the documentation and guidelines for pull requests.

This is a new feature I'm adding here to the s3 method as.data.frame.data.table; allows the user to specify a column to move to the row names of the resulting data.frame.

My code looks like this:

as.data.frame.data.table = function(x, row.names = NULL, ...)
{
  ans = copy(x)
  setattr(ans,"row.names",.set_row_names(nrow(x)))   # since R 2.4.0, data.frames can have non-character row names
  setattr(ans,"class","data.frame")
  setattr(ans,"sorted",NULL)  # remove so if you convert to df, do something, and convert back, it is not sorted
  setattr(ans,"index",NULL)  #4889 #5042
  setattr(ans,".internal.selfref",NULL)
  # leave tl intact, no harm,
  if(!is.null(row.names)) {
      rownames(ans) =  ans[, row.names]
      ans[, row.names] = NULL
  }
  ans
}

I also added this feature to the NEWS.md document. I hope I've satisfied all requests and am always open to critique.

@ben-schwen
Copy link
Copy Markdown
Member

Hey,
dropping a column shouldn't be done IMO because that's undocumented behavior. I would rather mimic the behavior of as.data.frame, hence, the row.names argument would have to be a character vector.

Assigning row.names should be done with setattr() and also not twice.

Might be also worth to take a look at setDT and de-duplicate code here.

NEWS item misses your name.
Tests are missing.

@dereckmezquita
Copy link
Copy Markdown
Member Author

Hi thanks for your feedback.

That is true, data.table's philosophy is to not alter the data as much as possible. I updated my code to set rownames from a character vector instead of a column name, and done with settattr only once:

as.data.frame.data.table = function(x, row.names = NULL, ...) {
    ans = copy(x)
    setattr(ans,"class","data.frame")
    setattr(ans,"sorted",NULL)  # remove so if you convert to df, do something, and convert back, it is not sorted
    setattr(ans,"index",NULL)  #4889 #5042
    setattr(ans,".internal.selfref",NULL)
    # leave tl intact, no harm,
    if(!is.null(row.names) && is.character(row.names)) {
        setattr(ans, "row.names", row.names)
    } else {
        setattr(ans,"row.names",.set_row_names(nrow(x)))   # since R 2.4.0, data.frames can have non-character row names
    }
    ans
}

Could you help me understand what this line of code is doing: setattr(ans,"row.names",.set_row_names(nrow(x))).

I tried it and it returns this which I don't understand the purpose of:

test <- data.table::data.table(iris)
test$sample <- paste("sample", 1:nrow(test), sep = "_")
.set_row_names(nrow(test))
[1]   NA -150

You also mentioned looking at setDT to de-duplicate code; would that mean that my modifications to as.data.frame.data.table would cause redundancies? I did find a similar line in setDT: setattr(x, "row.names", .set_row_names(nrow(x)))

@ben-schwen
Copy link
Copy Markdown
Member

The result for .set_row_names(nrow(x)) makes perfectly sense if you take a look how rownames are actually represented inside R if not explicitly overwritten.

.Internal(inspect(iris))
# @55fa91c185d8 19 VECSXP g1c4 [OBJ,MARK,REF(27),ATT] (len=5, tl=0)
#   @55fa8ecbe2e0 14 REALSXP g1c7 [MARK,REF(7)] (len=150, tl=0) 5.1,4.9,4.7,4.6,5,...
#   @55fa910bca50 14 REALSXP g1c7 [MARK,REF(7)] (len=150, tl=0) 3.5,3,3.2,3.1,3.6,...
#   @55fa90e857e0 14 REALSXP g1c7 [MARK,REF(7)] (len=150, tl=0) 1.4,1.4,1.3,1.5,1.4,...
#   @55fa90c821e0 14 REALSXP g1c7 [MARK,REF(7)] (len=150, tl=0) 0.2,0.2,0.2,0.2,0.2,...
#   @55fa8e5a9a70 13 INTSXP g1c7 [OBJ,MARK,REF(11),ATT] (len=150, tl=0) 1,1,1,1,1,...
#   ATTRIB:
#     @55fa90729090 02 LISTSXP g1c0 [MARK,REF(1)] 
#       TAG: @55fa8e27a330 01 SYMSXP g1c0 [MARK,REF(334),LCK,gp=0x4000] "levels" (has value)
#       @55fa91c50d48 16 STRSXP g1c3 [MARK,REF(65535)] (len=3, tl=0)
#         @55fa91c54d20 09 CHARSXP g1c1 [MARK,REF(1588),gp=0x60] [ASCII] [cached] "setosa"
#         @55fa913c5f48 09 CHARSXP g1c2 [MARK,REF(1498),gp=0x60] [ASCII] [cached] "versicolor"
#         @55fa91355d88 09 CHARSXP g1c2 [MARK,REF(1498),gp=0x60] [ASCII] [cached] "virginica"
#       TAG: @55fa8e27a720 01 SYMSXP g1c0 [MARK,REF(65535),LCK,gp=0x6000] "class" (has value)
#       @55fa91c54d58 16 STRSXP g1c1 [MARK,REF(65535)] (len=1, tl=0)
#         @55fa8e31dad8 09 CHARSXP g1c1 [MARK,REF(409),gp=0x61] [ASCII] [cached] "factor"
# ATTRIB:
#   @55fa90729100 02 LISTSXP g1c0 [MARK,REF(1)] 
#     TAG: @55fa8e27a1e0 01 SYMSXP g1c0 [MARK,REF(65535),LCK,gp=0x6000] "names" (has value)
#     @55fa91c18648 16 STRSXP g1c4 [MARK,REF(65535)] (len=5, tl=0)
#       @55fa91355dc8 09 CHARSXP g1c2 [MARK,REF(169),gp=0x61] [ASCII] [cached] "Sepal.Length"
#       @55fa91355e08 09 CHARSXP g1c2 [MARK,REF(161),gp=0x61] [ASCII] [cached] "Sepal.Width"
#       @55fa91355e48 09 CHARSXP g1c2 [MARK,REF(169),gp=0x61] [ASCII] [cached] "Petal.Length"
#       @55fa91355e88 09 CHARSXP g1c2 [MARK,REF(161),gp=0x61] [ASCII] [cached] "Petal.Width"
#       @55fa91c54d90 09 CHARSXP g1c1 [MARK,REF(162),gp=0x61] [ASCII] [cached] "Species"
#     TAG: @55fa8e27a720 01 SYMSXP g1c0 [MARK,REF(65535),LCK,gp=0x6000] "class" (has value)
#     @55fa91c54dc8 16 STRSXP g1c1 [MARK,REF(65535)] (len=1, tl=0)
#       @55fa8e324408 09 CHARSXP g1c2 [MARK,REF(1722),gp=0x61,ATT] [ASCII] [cached] "data.frame"
#     TAG: @55fa8e279fb0 01 SYMSXP g1c0 [MARK,REF(65535),LCK,gp=0x4000] "row.names" (has value)
#     @55fa91c54e00 13 INTSXP g1c1 [MARK,REF(65535)] (len=2, tl=0) -2147483648,-150

The interesting part is the last line where we can see that the rownames are represented by c(-2147483648,-150). Now you only need to know that R internally represents NA_integer_ by INT_MIN of 32-bit integer, which is exactly -2147483648 and therefore explains the result of .set_row_names(nrow(x)).

setDF

The pointer towards setDF was that setDF already supports rownames so it could be used to deduplicate code by using it for as.data.frame.data.table as e.g. with

as.data.frame.data.table = function(x, row.names = NULL, ...) {
    ans = setDF(copy(x), rownames=row.names)
    ans
}
DT = as.data.table(head(iris))

as.data.frame(DT)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

as.data.frame(DT, row.names=paste0("n", seq(nrow(DT))))
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# n1          5.1         3.5          1.4         0.2  setosa
# n2          4.9         3.0          1.4         0.2  setosa
# n3          4.7         3.2          1.3         0.2  setosa
# n4          4.6         3.1          1.5         0.2  setosa
# n5          5.0         3.6          1.4         0.2  setosa
# n6          5.4         3.9          1.7         0.4  setosa

@dereckmezquita
Copy link
Copy Markdown
Member Author

Thank you for that, now some of that is still foggy but I do have a better understanding of rownames now.

You originally pointed me to setDT. I see now what you meant with setDF. However, are you suggesting that I replace all code in as.data.frame.data.table with the use of setDF?

Looking at as.data.frame.data.table I realise that indeed we could make a deep copy of the input and setDF and return this. I suppose that the current code for as.data.frame.data.table is in essence doing this, making a deep copy and stripping away data.table attributes. I ran a test and indeed this works, but is the other code that was previously in as.data.frame.data.table not necessary?

My current iteration thanks to you is now as you stated above:

as.data.frame.data.table = function(x, row.names = NULL, ...)
{
    ans = setDF(copy(x), rownames = row.names)
    ans
}
test <- data.table::data.table(iris)
test$sample <- paste("sample", 1:nrow(test), sep = "_")

test2 <- as.data.frame(test, row.names = test$sample)

@ben-schwen
Copy link
Copy Markdown
Member

Well if you take a look at the source of setDF it is essentially the same code for a data.table as as.data.frame.data.table had before (which is also mentioned via the # copied from as.data.frame.data.table). The only difference is that setDF is capable of really handling rownames.

Updating the NEWS item with the fix (no longer silently ignoring row.names argument in as.data.frame.data.table and adding some tests to inst/tests/tests.Rraw as also mentioned in the contributing guide and we should be good to go.

@dereckmezquita
Copy link
Copy Markdown
Member Author

dereckmezquita commented Jan 31, 2022

Hi thank you again. I made the necessary changes to my pull request:

  1. Moved the NEWS note to the bugs section.
  2. Updated the code to use the solution we discussed; setDF.
  3. Added tests to the tests.Rraw file

I simply added these lines I believe this should be sufficient (I updated my test, I believe this is how one writes a test):

# test for #5320 `as.data.table(x)` `s3` method fixed, no longer ignoring `row.names` argument and simplified with use of `setDT`
DT = data.table::data.table(iris)
DF = data.frame(iris)

test(2235.1, as.data.frame(DT, row.names = paste("sample", 1:nrow(DT), sep = "_")), as.data.frame(DF, row.names = paste("sample", 1:nrow(DF), sep = "_")))

Testing issues

I did have issues when running my data.table::test.data.table(); I am not sure if my not being able to complete testing is an impediment to my PR.

First of all thank you for your patience; this is my first PR to data.table. I hope not my last.

My system and sessionInfo

First of all this is my system:

MacOS 11.4 MacBook Air (M1, 2020) 16GB RAM. And my sessionInfo():

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: aarch64-apple-darwin20.6.0 (64-bit)
Running under: macOS Big Sur 11.4

Matrix products: default
LAPACK: /opt/homebrew/Cellar/r/4.1.2/lib/R/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] data.table_1.14.3 R.utils_2.11.0    R.oo_1.24.0       R.methodsS3_1.8.1 bit64_4.0.5      
 [6] bit_4.0.4         yaml_2.2.1        xts_0.12.1        zoo_1.8-9         nanotime_0.3.5   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8      lattice_0.20-45 plyr_1.8.6      grid_4.1.2      magrittr_2.0.2  stringi_1.7.6  
 [7] reshape2_1.4.4  tools_4.1.2     stringr_1.4.0   RcppCCTZ_0.2.10 compiler_4.1.2 

Testing is failing for me; zlib missing

I installed the master branch as is with no modifications and tried to run the tests using data.table::test.data.table() from within RStudio. I got a failure; here is what it said exactly:

> data.table::test.data.table()
getDTthreads(verbose=TRUE):
  OpenMP version (_OPENMP)       201811
  omp_get_num_procs()            8
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          8
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 7 threads with throttle==1024. See ?setDTthreads.
test.data.table() running: /opt/homebrew/lib/R/4.1/site-library/data.table/tests/tests.Rraw

**** Full long double accuracy is not available. Tests using this will be skipped.

Running test id 1658.421          Test 1658.421 produced 1 errors but expected 0
Expected: 
Observed: Compression in fwrite uses zlib library. Its header files were not found at the time data.table was compiled. To enable fwrite compression, please reinstall data.table and study the output for further guidance.
Output captured before unexpected warning/error/message:
  OpenMP version (_OPENMP)       201811
  omp_get_num_procs()            8
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          8
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 7 threads with throttle==1024. See ?setDTthreads.
No list columns are present. Setting sep2='' otherwise quote='auto' would quote fields containing sep2.
Column writers: 3 3 
args.doRowNames=0 args.rowNames=0x0 args.rowNameFun=0 doQuote=-128 args.nrow=200 args.ncol=2 eolLen=1
maxLineLen=51. Found in 0.000s
<simpleError in fwrite(DT, file = f1 <- tempfile(fileext = ".gz"), verbose = TRUE): Compression in fwrite uses zlib library. Its header files were not found at the time data.table was compiled. To enable fwrite compression, please reinstall data.table and study the output for further guidance.>
Running test id 1658.43          Test 1658.43 produced 1 warnings but expected 0
Expected: 
Observed: File '/var/folders/hv/3571hdb159zggq6whkpdnq400000gq/T//RtmpiwTL8v/file9b8051840767.gz' has size 0. Returning a NULL data.table.
Error in fwrite(DT, file = f3 <- tempfile(), compress = "gzip") : 
  Compression in fwrite uses zlib library. Its header files were not found at the time data.table was compiled. To enable fwrite compression, please reinstall data.table and study the output for further guidance.

Sun Jan 30 21:45:15 2022  endian==little, sizeof(long double)==8, longdouble.digits==, sizeof(pointer)==8, TZ==unset, Sys.timezone()=='America/Cancun', Sys.getlocale()=='en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8', l10n_info()=='MBCS=TRUE; UTF-8=TRUE; Latin-1=FALSE; codeset=UTF-8', getDTthreads()=='OpenMP version (_OPENMP)==201811; omp_get_num_procs()==8; R_DATATABLE_NUM_PROCS_PERCENT==unset (default 50); R_DATATABLE_NUM_THREADS==unset; R_DATATABLE_THROTTLE==unset (default 1024); omp_get_thread_limit()==2147483647; omp_get_max_threads()==8; OMP_THREAD_LIMIT==unset; OMP_NUM_THREADS==unset; RestoreAfterFork==true; data.table is using 7 threads with throttle==1024. See ?setDTthreads.', zlib header files were not found when data.table was compiled
 Error: Failed after test 1658.43 before the next test() call in /opt/homebrew/lib/R/4.1/site-library/data.table/tests/tests.Rraw

I noted that it mentions not being able to find headers for zlib at compile time; here is the Makevars I used for build/compilation:

# if you downloaded llvm manually above, replace with your chosen NEW_PATH/clang
LLVM_LOC = /opt/homebrew/opt/llvm
CC=$(LLVM_LOC)/bin/clang -fopenmp
CXX=$(LLVM_LOC)/bin/clang++ -fopenmp
# -O3 should be faster than -O2 (default) level optimisation ..
CFLAGS=-g -O3 -Wall -pedantic -std=gnu99 -mtune=native -pipe
CXXFLAGS=-g -O3 -Wall -pedantic -std=c++11 -mtune=native -pipe
LDFLAGS=-L/opt/homebrew/opt/gettext/lib -L$(LLVM_LOC)/lib -Wl,-rpath,$(LLVM_LOC)/lib
CPPFLAGS=-I/opt/homebrew/opt/gettext/include -I$(LLVM_LOC)/include

This is one I have used myself and is modified from the suggested one on the installation guide so I tried the default one; with the supplemented line since I am on a Mac:

# Newly installed Homebrew is located in
#  - /opt/homebrew for ARM Macs (M1 and its successors)
#  - /usr/local for Intel Macs
HOMEBREW_LOC=/opt/homebrew
# If you downloaded llvm manually above, replace with your chosen NEW_PATH/clang
LLVM_LOC=$(HOMEBREW_LOC)/opt/llvm
CC=$(LLVM_LOC)/bin/clang -fopenmp
CXX=$(LLVM_LOC)/bin/clang++ -fopenmp
# -O3 should be faster than -O2 (default) level optimisation ..
CFLAGS=-g -O3 -Wall -pedantic -std=gnu99 -mtune=native -pipe
CXXFLAGS=-g -O3 -Wall -pedantic -std=c++11 -mtune=native -pipe
LDFLAGS=-L$(HOMEBREW_LOC)/opt/gettext/lib -L$(LLVM_LOC)/lib -Wl,-rpath,$(LLVM_LOC)/lib
CPPFLAGS=-I$(HOMEBREW_LOC)/opt/gettext/include -I$(LLVM_LOC)/include

This also resulted in the same issue; zlib is missing. I am not sure how to solve this.

I also ran the tests using: R CMD build . and R CMD check data.table_1.14.3.tar.gz. This also resulted in a similar failure.

@ben-schwen
Copy link
Copy Markdown
Member

Regarding the test cases, there should be at least two test cases, 1. where we assign rownames to non-standard values and 2. where row.names argument is NULL). Moreover, testing should be done against fixed objects if possible.

So the first one could look like this.

dt = data.table(a=1:2, b=3:4)
df = structure(list(a = 1:2, b = 3:4), row.names = c("x", "y"), class = "data.frame")
test(2235.1, as.data.frame(dt, row.names=c("x","y")), df)

Regarding the testing issues there are some issues with installing on MAC, but in any case devtools:::load_all might also work to load the updated version.

I clicked now on approve to allow Github CI to run, so you can also check the test cases when you changed them (simply push them to the branch).

@dereckmezquita
Copy link
Copy Markdown
Member Author

Hi thank you, so the tests look as such after your guidance:

dt = data.table(a=1:2, b=3:4)
df = structure(list(a=1:2, b=3:4), row.names=c("x", "y"), class="data.frame")

test(2235.1, as.data.frame(dt, row.names=c("x", "y")), df)

df = data.frame(a=1:2, b=3:4)
test(2235.2, as.data.frame(dt, row.names=NULL), df)

I found that structure does not return a valid df with row.names=NULL:

structure(list(a=1:2, b=3:4), row.names=NULL, class="data.frame")
[1] a b
<0 rows> (or 0-length row.names)

I found that these were TRUE so this should not present an issue for my second test in my opinion and is shorter than having to do rownames(df)=NULL after creating the df via the structure method.

Here I tested the equivalence of all these ways of creating a data.frame with row.names=NULL:

df = data.frame(a=1:2, b=3:4)
df2 = data.frame(a=1:2, b=3:4, row.names = NULL)
df3 = structure(list(a=1:2, b=3:4), row.names=c("x", "y"), class="data.frame")

rownames(df1) = NULL

sapply(list(df, df1, df2), function(x) { identical(df, x)})
[1] TRUE TRUE TRUE

For my education; why did you create a data.frame with a structure call? Why not data.frame? Is there a deeper change or is this a style difference? My identical check does yield TRUE on all counts.

@ben-schwen
Copy link
Copy Markdown
Member

There is something wrong with your equivalence checking code, since you never create an object for df1..

The reason for creating the data.frame with structure is that this way we can create the data.frame and also assign row.names in one call and not with df=data.frame(a=1:2, b=3:4); rownames(df)=c("x","y") as one would do in a script. Ofc I initially created it the script way and only afterwards used dput(df) to put it into a single call.

Comment thread inst/tests/tests.Rraw Outdated
Comment thread R/data.table.R Outdated
@dereckmezquita
Copy link
Copy Markdown
Member Author

You're right this is the correct equivalency code; pardon that:

df1 = data.frame(a=1:2, b=3:4)
df2 = data.frame(a=1:2, b=3:4, row.names = NULL)
df3 = structure(list(a=1:2, b=3:4), row.names=c("x", "y"), class="data.frame")

rownames(df3) = NULL

sapply(list(df1, df2, df3), function(x) { identical(df1, x) })

As for the structure call thank you, I didn't know I could do that. However, I checked and note that I can indeed create a data.frame with row.names in one call?

data.frame(a=1:2, b=3:4, row.names = c("x", "y"))
  a b
x 1 3
y 2 4

And no way that is really cool; I will be using this for sure: dput(df)

@codecov
Copy link
Copy Markdown

codecov bot commented Feb 1, 2022

Codecov Report

Merging #5320 (b79f68e) into master (0ed63cd) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head b79f68e differs from pull request most recent head 934c1ea. Consider uploading reports for the commit 934c1ea to get more accurate results

@@            Coverage Diff             @@
##           master    #5320      +/-   ##
==========================================
- Coverage   99.51%   99.51%   -0.01%     
==========================================
  Files          78       78              
  Lines       14761    14756       -5     
==========================================
- Hits        14689    14684       -5     
  Misses         72       72              
Impacted Files Coverage Δ
R/data.table.R 99.89% <100.00%> (-0.01%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0ed63cd...934c1ea. Read the comment docs.

@ben-schwen
Copy link
Copy Markdown
Member

Ah true, didn't remember that data.frame() has an argument for row.names. In general structure is nice for creating objects while already setting attributes.

@ben-schwen
Copy link
Copy Markdown
Member

@mattdowle Codecov decrease is spurious since duplicated code is deleted.

@mattdowle mattdowle added this to the 1.14.3 milestone Mar 16, 2022
@mattdowle
Copy link
Copy Markdown
Member

Great PR, thanks @dereckdemezquita and welcome to the project. Have invited you to be project member; the invite should be a button in your profile or projects page that you need to click to accept. Then you can create branches in the main project in future. Thanks @ben-schwen for guiding here.

@mattdowle mattdowle merged commit 2f67531 into Rdatatable:master Mar 16, 2022
@dereckmezquita
Copy link
Copy Markdown
Member Author

@mattdowle thank for you the invite, I am a big fan so naturally very happy to contribute!

@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
@dereckmezquita dereckmezquita deleted the as-data-frame-move-col-to-rownames branch April 1, 2024 18:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

as.data.frame but keep row names?

5 participants