This is not really an issue but a feature request.
Sometimes, when using the function tstrsplit, neither type.convert=FALSE nor type.convert=TRUE
returns the desired output: we may want specific columns of the result to have specific types.
Here is an illustrative example:
library(data.table)
options(datatable.print.class=TRUE)
dt <- data.table(x = c("00531725 Male 2021 Neg", "07640613 Female 2020 Pos"))
dt
                          x
                     <char>
1:   00531725 Male 2021 Neg
2: 07640613 Female 2020 Pos
Splitting the variable x:
cols <- c("personID", "gender", "year", "covidTest")
dt[, tstrsplit(x, split=" ", names=cols, type.convert=FALSE)]
   personID gender   year covidTest
     <char> <char> <char>    <char>
1: 00531725   Male   2021       Neg
2: 07640613 Female   2020       Pos
Setting type.convert to TRUE:
dt[, tstrsplit(x, split=" ", names=cols, type.convert=TRUE)]
   personID gender  year covidTest
      <int> <char> <int>    <char>
1:   531725   Male  2021       Neg
2:  7640613 Female  2020       Pos
All columns are of type character when type.convert=FALSE. When it is set to TRUE, the columns gender and covidTest stay character, while personID and year are converted to integer. In practice, the desired type for gender and covidTest is quite likely factor, while personID should stay character (especially because removing leading zeros leads to IDs that are likely not valid).
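For reference, these types can be obtained today by splitting with type.convert=FALSE and converting afterwards. A minimal sketch using the data above (the object name res is only for illustration):

```r
library(data.table)

dt <- data.table(x = c("00531725 Male 2021 Neg", "07640613 Female 2020 Pos"))
cols <- c("personID", "gender", "year", "covidTest")

# Keep everything as character first, so the leading zeros in personID survive
res <- dt[, tstrsplit(x, split = " ", names = cols, type.convert = FALSE)]

# Then convert the remaining columns by reference
fct_cols <- c("gender", "covidTest")
res[, (fct_cols) := lapply(.SD, as.factor), .SDcols = fct_cols]
res[, year := as.integer(year)]
```

This works, but it takes several extra statements for what is conceptually one split.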
My suggestion is to allow the type.convert argument to accept a named list where the names are conversion functions and the values (integer positions or column names) specify the columns to apply each function to.
Using the same example, we would obtain something like:
dt[, tstrsplit(x, split=" ", names=cols, type.convert=list(as.character=1, as.factor=c(2, 4), as.integer=3))]
   personID gender  year covidTest
     <char> <fctr> <int>    <fctr>
1: 00531725   Male  2021       Neg
2: 07640613 Female  2020       Pos
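To make the proposed interface concrete, here is a rough sketch of a wrapper that emulates it on top of the existing tstrsplit. The helper name tstrsplit_typed and its convert argument are hypothetical and only for illustration; this is not how data.table would necessarily implement it.

```r
library(data.table)

# Hypothetical helper (not part of data.table): split first, then apply the
# requested conversions. `convert` is a named list whose names are conversion
# function names and whose values are column positions or column names.
tstrsplit_typed <- function(x, split, names, convert, ...) {
  # Split without automatic conversion, so every piece stays character
  out <- tstrsplit(x, split = split, names = names, type.convert = FALSE, ...)
  for (fn_name in base::names(convert)) {
    fn <- match.fun(fn_name)            # look up e.g. as.factor by name
    for (col in convert[[fn_name]]) {
      out[[col]] <- fn(out[[col]])      # works for positions and names alike
    }
  }
  out
}

dt <- data.table(x = c("00531725 Male 2021 Neg", "07640613 Female 2020 Pos"))
cols <- c("personID", "gender", "year", "covidTest")

res <- dt[, tstrsplit_typed(x, split = " ", names = cols,
                            convert = list(as.character = 1,
                                           as.factor = c(2, 4),
                                           as.integer = 3))]
```

Columns that are not mentioned in convert simply stay character, which seems like a reasonable default for the real argument too.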
This idea is closely related to the built-in function strcapture, which splits a variable into several columns by extracting regex capture groups. However, it is usually quite slow, and having to type the same thing several times (like factor() below) makes it less attractive.
Further, the conversion function name must start with the prefix as. (like as.factor, etc.).
strcapture(pattern="(\\d+) ([a-zA-Z]+) (\\d+) ([a-zA-Z]+)",
           x=dt$x,
           proto=list(personID=character(), gender=factor(), year=integer(), covidTest=factor()))
  personID gender year covidTest
1 00531725   Male 2021       Neg
2 07640613 Female 2020       Pos