Allow the argument type.convert of tstrsplit to accept a named list. #5094

@Kamgang-B

Description

This is not really an issue but a feature request.
When using the function tstrsplit, neither type.convert=FALSE nor type.convert=TRUE always returns the desired output.
We may want some columns in the result to have a type that neither setting produces.
Here is an illustrative example:

library(data.table)
options(datatable.print.class=TRUE)

dt <- data.table(x = c("00531725 Male 2021 Neg", "07640613 Female 2020 Pos"))

                          x
                     <char>
1: 00531725   Male 2021 Neg
2: 07640613 Female 2020 Pos

Splitting the variable x:

cols <- c("personID", "gender", "year", "covidTest")

dt[, tstrsplit(x, split=" ", names=cols, type.convert=FALSE)]

   personID gender   year covidTest
     <char> <char> <char>    <char>
1: 00531725   Male   2021       Neg
2: 07640613 Female   2020       Pos

Setting type.convert to TRUE:

dt[, tstrsplit(x, split=" ", names=cols, type.convert=TRUE)]

   personID gender  year covidTest
      <int> <char> <int>    <char>
1:   531725   Male  2021       Neg
2:  7640613 Female  2020       Pos

With type.convert=FALSE, all columns are of type character. With type.convert=TRUE, the columns gender and covidTest stay character while personID and year are converted to integer. In practice, the desired type for gender and covidTest is quite likely factor, while personID should stay character (especially because removing the leading zeros leads to IDs that are likely not valid).
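With the current interface, getting these types takes a separate conversion step after the split. A minimal sketch of that workaround, using only existing data.table features (the variable name out is just for illustration):

```r
library(data.table)

dt <- data.table(x = c("00531725 Male 2021 Neg", "07640613 Female 2020 Pos"))
cols <- c("personID", "gender", "year", "covidTest")

# Split without any conversion so personID keeps its leading zeros.
out <- dt[, tstrsplit(x, split = " ", names = cols, type.convert = FALSE)]

# Convert the other columns by reference, one target type at a time.
fct_cols <- c("gender", "covidTest")
out[, (fct_cols) := lapply(.SD, as.factor), .SDcols = fct_cols]
out[, year := as.integer(year)]

sapply(out, class)
```

This works, but the column types end up spread over several statements instead of being declared in the tstrsplit call itself.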

My suggestion is to allow the type.convert argument to accept a named list where the names are functions and the values (integer positions or column names) specify the columns to apply the functions on.
Using the same example, we would obtain something like:

dt[, tstrsplit(x, split=" ", names=cols, type.convert=list(as.character=1, as.factor=c(2, 4), as.integer=3))]

   personID gender   year covidTest
     <char> <fctr>  <int>    <fctr>
1: 00531725   Male   2021       Neg
2: 07640613 Female   2020       Pos
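To make the intended semantics concrete, here is a rough sketch of a wrapper around the existing tstrsplit. The helper name tstrsplit_convert and its arguments are hypothetical and not part of data.table; it only illustrates how a named list of conversion functions could be applied, not how the feature would be implemented internally.

```r
library(data.table)

# Hypothetical helper (NOT part of data.table): split with tstrsplit(),
# then apply each named conversion function to the columns listed for it.
tstrsplit_convert <- function(x, split, col_names, convert) {
  out <- tstrsplit(x, split = split, names = col_names, type.convert = FALSE)
  for (fname in names(convert)) {
    fun <- match.fun(fname)
    idx <- convert[[fname]]
    # Accept either integer positions or column names.
    if (is.character(idx)) idx <- match(idx, col_names)
    stopifnot(!anyNA(idx))
    for (j in idx) out[[j]] <- fun(out[[j]])
  }
  out
}

dt <- data.table(x = c("00531725 Male 2021 Neg", "07640613 Female 2020 Pos"))
cols <- c("personID", "gender", "year", "covidTest")

res <- dt[, tstrsplit_convert(x, " ", cols,
                              list(as.character = 1, as.factor = c(2, 4), as.integer = 3))]
sapply(res, class)
```

Columns that are not mentioned in the list simply stay character in this sketch; whether the real argument should behave that way is an open design choice.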

This idea is closely related to the built-in function strcapture (package utils), which splits a variable into several columns by extracting regex capture groups. However, it is usually quite slow, and having to type the same thing several times (like factor below) makes it less attractive.
Further, the conversion function name must start with the prefix as. (like as.factor, etc.).

strcapture(pattern="(\\d+) ([a-zA-Z]+) (\\d+) ([a-zA-Z]+)", 
           x=dt$x, 
           proto=list(personID=character(), gender=factor(), year=integer(), covidTest=factor()))

  personID gender year covidTest
1 00531725   Male 2021       Neg
2 07640613 Female 2020       Pos
