
utilities

Ronald W. Buie edited this page Feb 11, 2026 · 6 revisions

Utility Functions

Introduction

The rads package includes a suite of small utility functions designed to simplify common data manipulation tasks and accelerate analysis workflows.

The core utility functions we’ll cover in this vignette are: calc_age(), convert_to_date(), create_dictionary(), data_modeler(), format_time() and format_time_simple(), generate_yaml(), multi_t_test(), round2(), string_clean(), std_error(), substrRight(), tsql_validate_field_types(), tsql_chunk_loader(), and validate_hhsaw_key().

Let’s explore each of these functions with examples to demonstrate their usage and benefits.

Setting up the environment

First, let’s load the necessary packages:

library(rads)
library(data.table)

Functions

calc_age()

The calc_age() function calculates age in complete years between two dates. In practice, it has identified numerous errors in the ages recorded in DOH vital statistics.

calc_age() is better than subtracting years

birth_date <- as.Date("1990-08-02")
current_date <- as.Date("2024-08-01")
simple_age <- year(current_date) - year(birth_date)
proper_age <- calc_age(from = birth_date, to = current_date)

head(data.table(birth_date, current_date, simple_age, proper_age))
birth_date current_date simple_age proper_age
1990-08-02 2024-08-01 34 33

calc_age() is better than dividing days by 365.25

dates <- data.table(
  birth = seq.Date(as.Date('2000-02-25'), as.Date('2000-03-02'), 1), 
  death = seq.Date(as.Date('2024-03-02'), as.Date('2024-02-25'), -1))
dates[, simple_age := round(as.integer(death - birth)/365.25)]
dates[, proper_age := calc_age(birth, death)]

head(dates)
birth death simple_age proper_age
2000-02-25 2024-03-02 24 24
2000-02-26 2024-03-01 24 24
2000-02-27 2024-02-29 24 24
2000-02-28 2024-02-28 24 24
2000-02-29 2024-02-27 24 23
2000-03-01 2024-02-26 24 23
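
The completed-years logic behind calc_age() can be sketched in base R. This is only an illustration of the idea, not the package's actual implementation, which may differ:

```r
# Minimal sketch: count completed years, subtracting one when the
# 'birthday' has not yet occurred in the 'to' year.
age_in_years <- function(from, to) {
  from <- as.POSIXlt(from)
  to   <- as.POSIXlt(to)
  before_birthday <- (to$mon < from$mon) |
    (to$mon == from$mon & to$mday < from$mday)
  to$year - from$year - before_birthday
}

age_in_years(as.Date("1990-08-02"), as.Date("2024-08-01"))  # 33
```

Note how this reproduces the leap-day example above: a person born 2000-02-29 has not yet had their birthday on 2024-02-27, so they are 23, not 24.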

convert_to_date()

convert_to_date() attempts to convert various US-style date formats to R Dates.

dates <- data.table(orig = c("2024-07-25", "07/26/2024", 
                             "07-27-2024", "20240728", 
                             "29 July, 2024", "July 30, 2024"))
dates[, rads_dates := convert_to_date(orig)]
dates[, simple_dates := as.Date(orig)]

head(dates)
orig rads_dates simple_dates
2024-07-25 2024-07-25 2024-07-25
07/26/2024 2024-07-26 NA
07-27-2024 2024-07-27 NA
20240728 57317-03-24 NA
29 July, 2024 2024-07-29 NA
July 30, 2024 2024-07-30 NA

This function is particularly useful when dealing with inconsistent date formats within a single vector or column of your data. Note, however, that the undelimited “20240728” above was converted to an implausible date, so always sanity-check the results.
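
This kind of mixed-format parsing can be approximated in base R by trying a sequence of formats per element. The sketch below is a hedged illustration only (convert_to_date()'s internals may differ):

```r
# Try each candidate format in turn, keeping the first successful parse.
parse_mixed_dates <- function(x) {
  fmts <- c("%Y-%m-%d", "%m/%d/%Y", "%m-%d-%Y", "%Y%m%d",
            "%d %B, %Y", "%B %d, %Y")
  out <- rep(as.Date(NA), length(x))
  for (f in fmts) {
    idx <- is.na(out)                      # only re-try unparsed elements
    out[idx] <- as.Date(x[idx], format = f)
  }
  out
}

parse_mixed_dates(c("2024-07-25", "07/26/2024", "20240728"))
```

The order of formats matters: unambiguous formats should be tried before looser ones, and month-name formats ("%B") are locale dependent.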

create_dictionary()

The create_dictionary() function generates a comprehensive data dictionary for a given data.table or data.frame. It provides details about variable types, sample values, and can incorporate descriptions and notes from a reference table. The function can be used to both understand a new dataset and document the products of your ETL processes.

Basic usage

Let’s create a sample dataset and generate a dictionary:

dt <- data.table(
  name = c("Alicia", "Bianca", "Clarence", "Dwane", "Eve"),
  gender = factor(c("F", "M", "M", "M", "F")),
  age = c(25, 30, 35, 40, 45),
  score = c(85.5, 92.0, 78.5, 88.0, 95.5)
)

dict <- create_dictionary(ph.data = dt, source = "sample_data")

head(dict, 10)
source varname vartype values factor_labels desc notes dict_updated
sample_data name character Alicia NA NA NA 2026-02-11
sample_data name character Bianca NA NA NA 2026-02-11
sample_data name character Clarence NA NA NA 2026-02-11
sample_data name character Dwane NA NA NA 2026-02-11
sample_data name character Eve NA NA NA 2026-02-11
sample_data gender factor 1 F NA NA 2026-02-11
sample_data gender factor 2 M NA NA 2026-02-11
sample_data age integer 25 NA NA NA 2026-02-11
sample_data age integer 30 NA NA NA 2026-02-11
sample_data age integer 35 NA NA NA 2026-02-11

Suppressing sensitive information

You can suppress sample values for sensitive columns:

dict_suppressed <- create_dictionary(ph.data = dt, 
                                     source = "sample_data", 
                                     suppress = c("name"))

head(dict_suppressed, 10)
source varname vartype values factor_labels desc notes dict_updated
sample_data name character suppressed NA NA NA 2026-02-11
sample_data gender factor 1 F NA NA 2026-02-11
sample_data gender factor 2 M NA NA 2026-02-11
sample_data age integer 25 NA NA NA 2026-02-11
sample_data age integer 30 NA NA NA 2026-02-11
sample_data age integer 35 NA NA NA 2026-02-11
sample_data age integer 40 NA NA NA 2026-02-11
sample_data age integer 45 NA NA NA 2026-02-11
sample_data score numeric 78.5 NA NA NA 2026-02-11
sample_data score numeric 85.5 NA NA NA 2026-02-11

Incorporating reference information

You can include additional descriptions and notes using a reference table:

ref_table <- data.table(
  source = rep("sample_data", 4),
  varname = c("name", "age", "gender", "score"),
  desc = c("Full name", "Age in years", "Gender", "Test score"),
  notes = c("May contain special characters", 
            "Rounded to nearest year", "Binary classification", 
            "Range: 0-100")
)

dict_with_ref <- create_dictionary(ph.data = dt, 
                                   source = "sample_data", 
                                   ph.ref = ref_table)

head(dict_with_ref, 8)
source varname vartype values factor_labels desc notes dict_updated
sample_data name character Alicia NA Full name May contain special characters 2026-02-11
sample_data name character Bianca NA Full name May contain special characters 2026-02-11
sample_data name character Clarence NA Full name May contain special characters 2026-02-11
sample_data name character Dwane NA Full name May contain special characters 2026-02-11
sample_data name character Eve NA Full name May contain special characters 2026-02-11
sample_data gender factor 1 F Gender Binary classification 2026-02-11
sample_data gender factor 2 M Gender Binary classification 2026-02-11
sample_data age integer 25 NA Age in years Rounded to nearest year 2026-02-11

Handling numerous unique values

For datasets with many unique values, create_dictionary() provides summaries:

set.seed(98104)
large_dt <- data.table(
  id = 1:1000,
  category = sample(letters[1:10], 1000, replace = TRUE),
  value = round2(rnorm(1000), 2)
)

large_dict <- create_dictionary(ph.data = large_dt, 
                                source = "large_sample")

head(large_dict, 8)
source varname vartype values factor_labels desc notes dict_updated
large_sample id integer min = 1, max = 1000 NA NA NA 2026-02-11
large_sample category character a NA NA NA 2026-02-11
large_sample category character b NA NA NA 2026-02-11
large_sample category character c NA NA NA 2026-02-11
large_sample category character d NA NA NA 2026-02-11
large_sample category character e NA NA NA 2026-02-11
large_sample category character NA NA NA 2026-02-11
large_sample value numeric min = -3.33, max = 3.73 NA NA NA 2026-02-11

data_modeler()

data_modeler() receives a data.table object and returns a data.table of similar structure and content, but with uncorrelated data across columns. For example, a column of dates will become a column of dates of a similar range, and distribution, but randomly distributed across observations. Common use cases include:

  • demonstrating a statistic or visualization without exposing the audience to “real” data or results
  • sharing code that generates a data set for minimally viable examples used in requesting help from, or teaching, others
  • creating data sets programmatically for (unit) testing or training

There are two kinds of output that data_modeler() returns, either the data table itself, or a list object of strings that can be evaluated to create the data table. In addition, data_modeler() can output-to-terminal a human readable version of the generation code that the user can copy and paste into their own scripts for sharing publicly.

data_modeler() makes some guesses as to whether a column’s content should be randomized (such as person names or blood pressures) or kept but randomly redistributed (such as levels of response to a question or city names). This is intended to generate random data, or to retain original categories, where appropriate. It bases these guesses on the data type (integer or string) and complexity (e.g., 75% of values being unique), but not on the meaning of the data (i.e., it doesn’t know that a number is a “zip code” or that a string is a “human name”). Variables identified as non-categorical will, in most cases, be replaced with noisy but similarly shaped data (e.g., floating point numbers of the same range and precision, but in a random uniform distribution, or a “lorem ipsum” string of similar length). Variables identified as categorical will, in most cases, be replaced with identical categories in a similar distribution to the original, but randomly assigned across observations (a data set of 95% males and 5% females would still generate a data set of roughly 95% males and 5% females).
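
For the categorical case, the redistribution described above amounts to resampling a column from its own empirical distribution, which preserves category proportions while breaking any row-level correlation. A simplified sketch (not data_modeler()'s actual code):

```r
set.seed(42)
x <- c(rep("M", 95), rep("F", 5))          # a 95% / 5% split
y <- sample(x, length(x), replace = TRUE)  # similar proportions in expectation,
                                           # but uncorrelated with row order
prop.table(table(y))
```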

Two parameters are available to adjust the assumptions data_modeler() uses to identify categorical variables: “categorical_threshold” and “proportional_threshold”.

  • “categorical_threshold”: the number of unique categories that can exist in a variable before data_modeler() assumes the variable isn’t categorical; above this number, the variable is not treated as categorical. This currently defaults to 61.
  • “proportional_threshold”: the proportion of values that can be unique before data_modeler() assumes the variable isn’t categorical. This currently defaults to 0.75.
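
A sketch of how these two thresholds might jointly classify a column, using the documented defaults (the package's actual heuristic may be more involved):

```r
# Hedged sketch of the categorical heuristic described above.
looks_categorical <- function(x, categorical_threshold = 61,
                              proportional_threshold = 0.75) {
  vals <- x[!is.na(x)]
  n_unique <- length(unique(vals))
  n_unique <= categorical_threshold &&          # few enough distinct levels
    n_unique / length(vals) <= proportional_threshold  # mostly repeated values
}

looks_categorical(rep(c("a", "b", "c"), 100))  # TRUE: few repeated levels
looks_categorical(as.character(1:1000))        # FALSE: every value is unique
```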

data privacy caveats

uncorrelated may not always equal non-informative

data_modeler() creates an uncorrelated column of data to substitute for the original by randomly reassigning values across that column. However, this does not ensure that meaningful information is not passed along. In most cases, categorical observations are distributed in proportion to the source data. So, if the provided data set contains categorical outcomes for a single population, the uncorrelated result would still show a similar rate of that outcome, because all observations belong to the same population group. The analyst should consider whether this is acceptable and, if necessary, correct it by changing the input data or modifying the synthetic data. A simple approach in the above example would be to add one or more additional population groups so that the outcomes are distributed across groups, removing the ability to estimate the rate for the smaller population (in an uncorrelated result, the rate of the categorical outcome will tend to be similar in all populations).

walkthrough

For the examples below, we create a “real” data set and show how to use data_modeler() and how the synthetic and “real” data differ.

Our demo “real” data has one indicator variable representing an outcome that is never present in the White population, sometimes present in the Asian population, and always present in other races. In an uncorrelated (and so non-informative) synthetic data set, we would expect this indicator to no longer be biased by race.

DT.real.data <- data.table(
  id = 1:1000,
  chi_geo_kc = sample(c('King County', NA), 1000, replace = TRUE),
  chi_race_7 = factor(sample(c("Asian", "AIAN", "Black", "Hispanic", "NHPI", "White", "Other", "Multiple", NA), 1000, replace = TRUE, prob = c(.19, .01, .07, .11, .01, .35, .07, .14, .02)), levels = c("Asian", "AIAN", "Black", "Hispanic", "NHPI", "White", "Other", "Multiple", NA)),
  chi_sex = as.factor(sample(c("Male", "Female"), 1000, replace = TRUE)),
  chi_geo_region = factor(sample(c("South", "North", "Seattle", "East"), 1000, replace = TRUE), levels = c("South", "North", "Seattle", "East")))

# Attach a biased indicator:
# indicator1 is "never" for the White population, "sometimes" for the Asian population, and "always" for everyone else
DT.real.data[, indicator1 := ifelse(chi_race_7 == "White", "never",
                             ifelse(chi_race_7 == "Asian", "sometimes",
                                    "always"))]

# Observe that modeled data have the same structure as the original
DT.synthetic.data <- data_modeler(ph.data = DT.real.data, # our "real" data
                                  number_of_observations = 100, # number of observations to create in the synthetic data
                                  comments = TRUE, # whether or not to include comments in the code (only matters when return_code = TRUE or print_code = TRUE)
                                  return_code = FALSE, # return a list object of strings for creating the synthetic data instead of the data itself; leave FALSE to receive synthetic data
                                  print_code = FALSE) # print the generation code to the terminal; useful for review and for copying into other contexts, such as a shareable script

You can observe that the data structures are the same for both data frames:

str(DT.real.data)
str(DT.synthetic.data)
Classes 'data.table' and 'data.frame':  1000 obs. of  6 variables:
 $ id            : int  1 2 3 4 5 6 7 8 9 10 ...
 $ chi_geo_kc    : chr  NA "King County" NA NA ...
 $ chi_race_7    : Factor w/ 8 levels "Asian","AIAN",..: 8 NA 6 3 8 7 1 6 6 1 ...
 $ chi_sex       : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 2 1 2 1 1 ...
 $ chi_geo_region: Factor w/ 4 levels "South","North",..: 2 2 2 2 2 2 1 2 2 4 ...
 $ indicator1    : chr  "always" NA "never" "always" ...
 - attr(*, ".internal.selfref")=<externalptr> 
Classes 'data.table' and 'data.frame':  100 obs. of  6 variables:
 $ id            : int  1 2 3 4 5 6 7 8 9 10 ...
 $ chi_geo_kc    : chr  "King County" "King County" "King County" "King County" ...
 $ chi_race_7    : Factor w/ 8 levels "Asian","AIAN",..: 1 5 8 5 1 2 5 1 6 8 ...
 $ chi_sex       : Factor w/ 2 levels "Female","Male": 2 1 1 1 1 1 2 1 1 1 ...
 $ chi_geo_region: Factor w/ 4 levels "South","North",..: 1 2 1 1 2 1 3 3 4 1 ...
 $ indicator1    : chr  "never" "always" NA "always" ...
 - attr(*, ".internal.selfref")=<externalptr> 

But we can see that the original bias (Whites in the never group, Asians in the sometimes group, and all others in the always group) no longer holds.

prop.table(table(DT.real.data[,.(indicator1,chi_race_7)]))
           chi_race_7
indicator1        Asian        AIAN       Black    Hispanic        NHPI
  always    0.000000000 0.013251784 0.082568807 0.114169215 0.007135576
  never     0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
  sometimes 0.191641182 0.000000000 0.000000000 0.000000000 0.000000000
           chi_race_7
indicator1        White       Other    Multiple
  always    0.000000000 0.071355759 0.161060143
  never     0.358817533 0.000000000 0.000000000
  sometimes 0.000000000 0.000000000 0.000000000
prop.table(table(DT.synthetic.data[,.(indicator1,chi_race_7)]))
          chi_race_7
indicator1      Asian       AIAN      Black   Hispanic       NHPI      White
    always 0.31666667 0.03333333 0.03333333 0.06666667 0.11666667 0.06666667
    never  0.11666667 0.01666667 0.03333333 0.01666667 0.01666667 0.03333333
          chi_race_7
indicator1      Other   Multiple
    always 0.00000000 0.13333333
    never  0.00000000 0.00000000

Notice that you may see extreme, apparently biased, results in the outcomes. These are still randomly distributed, and you can observe them change if you recreate the synthetic data.

You can review, share, and execute the code used to create the synthetic data in two ways: manually, by printing it to the terminal and copying it into your R script, or by receiving and processing the text object. The former is more useful for review and sharing with others; the latter is more useful for programmatic handling of the instruction set.

DT.synthetic.data <- data_modeler(ph.data = DT.real.data,
                                  number_of_observations = 100, 
                                  comments = TRUE, 
                                  return_code = FALSE,
                                  print_code = TRUE) # print the generation code so you can copy it from your terminal output
DT <- data.table( 
`id` = as.integer(1:100), # as an identifying (all unique) integer 
`chi_geo_kc` = sample(c(NA, "King County"), 100, replace = TRUE, prob = c(0.507, 0.493)), # as categorical character (non factor) 
`chi_race_7` = factor(sample(c("Multiple", NA, "White", "Black", "Other", "Asian", "Hispanic", "NHPI", "AIAN"), 100, replace = TRUE, prob = c(0.188, 0.013, 0.081, 0.112, 0.007, 0.352, 0.07, 0.158, 0.019)), levels = c("Asian", "AIAN", "Black", "Hispanic", "NHPI", "White", "Other", "Multiple"), ordered = FALSE), # as factor 
`chi_sex` = factor(sample(c("Male", "Female"), 100, replace = TRUE, prob = c(0.493, 0.507)), levels = c("Female", "Male"), ordered = FALSE), # as factor 
`chi_geo_region` = factor(sample(c("North", "South", "East", "Seattle"), 100, replace = TRUE, prob = c(0.233, 0.253, 0.256, 0.258)), levels = c("South", "North", "Seattle", "East"), ordered = FALSE), # as factor 
`indicator1` = sample(c("always", NA, "never", "sometimes"), 100, replace = TRUE, prob = c(0.441, 0.352, 0.188, 0.019))) # as categorical character (non factor)

The code shown is what is used to generate the synthetic data. Using this approach you can see the sampling approach and probabilities for each variable. Because we ran data_modeler() with comments = TRUE, we also see a human-readable string at the end of each line describing how each variable was classified by data_modeler(). This allows you to review the logic, make changes, and share the code with others.

DT <- data.table( 
`id` = as.integer(1:100), # as an identifying (all unique) integer 
`chi_geo_kc` = sample(c(NA, "King County"), 100, replace = TRUE, prob = c(0.494, 0.506)), # as categorical character (non factor) 
`chi_race_7` = factor(sample(c("Black", "White", "AIAN", NA, "Asian", "Multiple", "Other", "NHPI", "Hispanic"), 100, replace = TRUE, prob = c(0.193, 0.014, 0.067, 0.11, 0.012, 0.36, 0.064, 0.157, 0.023)), levels = c("Asian", "AIAN", "Black", "Hispanic", "NHPI", "White", "Other", "Multiple"), ordered = FALSE), # as factor 
`chi_sex` = factor(sample(c("Male", "Female"), 100, replace = TRUE, prob = c(0.518, 0.482)), levels = c("Female", "Male"), ordered = FALSE), # as factor 
`chi_geo_region` = factor(sample(c("Seattle", "South", "North", "East"), 100, replace = TRUE, prob = c(0.241, 0.229, 0.258, 0.272)), levels = c("South", "North", "Seattle", "East"), ordered = FALSE), # as factor 
`indicator1` = sample(c("always", "never", NA, "sometimes"), 100, replace = TRUE, prob = c(0.424, 0.36, 0.193, 0.023))) # as categorical character (non factor)

Note that the above approach requires the user to copy and paste output from their terminal. If you wish to check, modify, and extract the code without human intervention, you need to receive the text as a list object using return_code = TRUE:

DT.synthetic.data.instructions <- data_modeler(ph.data = DT.real.data,
                                number_of_observations =  100, 
                                comments = FALSE, 
                                return_code = TRUE, #setting this to "TRUE" will cause data_modeler() to return a list object of the text instead of a DT of synthetic data
                                print_code = FALSE)

head(DT.synthetic.data.instructions)
[[1]]
[1] "DT <- data.table("

[[2]]
[1] "`id` = as.integer(1:100),"

[[3]]
[1] "`chi_geo_kc` = sample(c(NA, \"King County\"), 100, replace = TRUE, prob = c(0.507, 0.493)),"

[[4]]
[1] "`chi_race_7` = factor(sample(c(\"Multiple\", NA, \"White\", \"Black\", \"Other\", \"Asian\", \"Hispanic\", \"NHPI\", \"AIAN\"), 100, replace = TRUE, prob = c(0.188, 0.013, 0.081, 0.112, 0.007, 0.352, 0.07, 0.158, 0.019)), levels = c(\"Asian\", \"AIAN\", \"Black\", \"Hispanic\", \"NHPI\", \"White\", \"Other\", \"Multiple\"), ordered = FALSE),"

[[5]]
[1] "`chi_sex` = factor(sample(c(\"Male\", \"Female\"), 100, replace = TRUE, prob = c(0.493, 0.507)), levels = c(\"Female\", \"Male\"), ordered = FALSE),"

[[6]]
[1] "`chi_geo_region` = factor(sample(c(\"North\", \"South\", \"East\", \"Seattle\"), 100, replace = TRUE, prob = c(0.233, 0.253, 0.256, 0.258)), levels = c(\"South\", \"North\", \"Seattle\", \"East\"), ordered = FALSE),"

After processing the text as desired, you can execute it, creating the desired synthetic data object, as follows:

## Manually convert received code into data
codeText <- paste(unlist(DT.synthetic.data.instructions), collapse = "\n")
DT.synthetic.data.manual <- eval(parse(text = codeText))

Now you can visually confirm that the class structure of each of the tables is the same:

str(DT.real.data)
Classes 'data.table' and 'data.frame':  1000 obs. of  6 variables:
 $ id            : int  1 2 3 4 5 6 7 8 9 10 ...
 $ chi_geo_kc    : chr  NA "King County" NA NA ...
 $ chi_race_7    : Factor w/ 8 levels "Asian","AIAN",..: 8 NA 6 3 8 7 1 6 6 1 ...
 $ chi_sex       : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 2 1 2 1 1 ...
 $ chi_geo_region: Factor w/ 4 levels "South","North",..: 2 2 2 2 2 2 1 2 2 4 ...
 $ indicator1    : chr  "always" NA "never" "always" ...
 - attr(*, ".internal.selfref")=<externalptr> 
str(DT.synthetic.data)
Classes 'data.table' and 'data.frame':  100 obs. of  6 variables:
 $ id            : int  1 2 3 4 5 6 7 8 9 10 ...
 $ chi_geo_kc    : chr  "King County" NA NA "King County" ...
 $ chi_race_7    : Factor w/ 8 levels "Asian","AIAN",..: 8 5 1 1 6 5 1 1 1 3 ...
 $ chi_sex       : Factor w/ 2 levels "Female","Male": 1 2 1 2 2 1 2 2 1 2 ...
 $ chi_geo_region: Factor w/ 4 levels "South","North",..: 4 1 4 2 3 1 4 1 3 3 ...
 $ indicator1    : chr  NA NA "always" "always" ...
 - attr(*, ".internal.selfref")=<externalptr> 
str(DT.synthetic.data.manual)
Classes 'data.table' and 'data.frame':  100 obs. of  6 variables:
 $ id            : int  1 2 3 4 5 6 7 8 9 10 ...
 $ chi_geo_kc    : chr  NA "King County" "King County" "King County" ...
 $ chi_race_7    : Factor w/ 8 levels "Asian","AIAN",..: 3 1 1 5 8 1 1 1 8 8 ...
 $ chi_sex       : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 1 2 2 2 2 ...
 $ chi_geo_region: Factor w/ 4 levels "South","North",..: 3 1 4 1 1 3 4 2 2 3 ...
 $ indicator1    : chr  "always" "never" NA "always" ...
 - attr(*, ".internal.selfref")=<externalptr> 
sapply(DT.real.data, class)
            id     chi_geo_kc     chi_race_7        chi_sex chi_geo_region 
     "integer"    "character"       "factor"       "factor"       "factor" 
    indicator1 
   "character" 
sapply(DT.synthetic.data, class)
            id     chi_geo_kc     chi_race_7        chi_sex chi_geo_region 
     "integer"    "character"       "factor"       "factor"       "factor" 
    indicator1 
   "character" 
sapply(DT.synthetic.data.manual, class)
            id     chi_geo_kc     chi_race_7        chi_sex chi_geo_region 
     "integer"    "character"       "factor"       "factor"       "factor" 
    indicator1 
   "character" 

Finally, each time data_modeler() is executed, there is randomness in how the data are created. If you wish to create identical data sets, be sure to set a seed before executing:

DT.synthetic.data.random <- data_modeler(ph.data = DT.real.data,
                                number_of_observations =  100, 
                                comments = TRUE, 
                                return_code = FALSE,
                                print_code = FALSE) 

set.seed(1000)
DT.synthetic.data.s1000a <- data_modeler(ph.data = DT.real.data,
                                number_of_observations =  100, 
                                comments = TRUE, 
                                return_code = FALSE,
                                print_code = FALSE) 

set.seed(1000)
DT.synthetic.data.s1000b <- data_modeler(ph.data = DT.real.data,
                                number_of_observations =  100, 
                                comments = TRUE, 
                                return_code = FALSE,
                                print_code = FALSE) 

Comparing a table created without a fixed seed to one created with a fixed seed, we expect the resulting tables to differ:

identical(DT.synthetic.data.random, DT.synthetic.data.s1000a)
[1] FALSE

When using the same seed for distinct rounds, we would expect the resulting tables to be identical:

identical(DT.synthetic.data.s1000a, DT.synthetic.data.s1000b)
[1] TRUE

format_time() and format_time_simple()

These functions format numeric values (including dates) into human-readable time chunks. They are often used for summarizing years and can help identify unexpected gaps.

years <- c(2001:2003, 
           2005, 
           2008, 
           2011:2012)

dates <- as.Date(c("2024-01-01", "2024-01-02", "2024-01-03",
                   "2024-01-06", 
                   "2024-01-09", "2024-01-10", "2024-01-11", "2024-01-12"))

dt <- data.table(desc = c('format_time', 'format_time_simple', '',
                          'format_time', 'format_time_simple'), 
                 data = c(format_time(years), format_time_simple(years), '', 
                          format_time(dates), format_time_simple(dates)))
head(dt)
desc data
format_time 2001-2003, 2005, 2008, 2011-2012
format_time_simple 2001-2012
format_time 2024-01-01 to 2024-01-03, 2024-01-06, 2024-01-09 to 2024-01-12
format_time_simple 2024-01-01 to 2024-01-12
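
The run-collapsing behavior of format_time() can be sketched in base R by grouping consecutive values (an illustration only; the rads implementation may differ):

```r
# Collapse consecutive integers into "a-b" ranges, leaving singletons alone.
collapse_years <- function(y) {
  y <- sort(unique(y))
  run <- cumsum(c(1, diff(y) != 1))   # start a new run wherever a gap appears
  fmt <- function(g) if (length(g) == 1) as.character(g)
                     else paste0(min(g), "-", max(g))
  paste(vapply(split(y, factor(run, levels = unique(run))),
               fmt, character(1)),
        collapse = ", ")
}

collapse_years(c(2001:2003, 2005, 2008, 2011:2012))
# "2001-2003, 2005, 2008, 2011-2012"
```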

generate_yaml()

generate_yaml() generates a YAML file for SQL loading based on a data.frame or data.table.

data <- data.table(
  id = 1:5,
  name = c("Alicia", "Bianca", "Clarence", "Dwane", "Eve"),
  age = c(25.1, 30.2, 35.3, 40.4, 45.5)
)

yaml_output <- generate_yaml(data, schema = "dbo", table = "example_table", datasource = 'WADOH_CHS')
print(yaml_output)
$datasource
[1] "WADOH_CHS"

$schema
[1] "dbo"

$table
[1] "example_table"

$vars
$vars$id
[1] "INT"

$vars$name
[1] "NVARCHAR(18)"

$vars$age
[1] "NUMERIC(38,5)"

The output can be used to specify the field_types for the tsql_validate_field_types() and tsql_chunk_loader() functions described below.

field_types <- unlist(yaml_output$vars)
print(field_types)
             id            name             age 
          "INT"  "NVARCHAR(18)" "NUMERIC(38,5)" 
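
The class-to-type translation at the heart of generate_yaml() can be sketched as a simple lookup. This mapping is an illustrative assumption, not the package's actual rules; for example, the widths generate_yaml() chose above (such as NVARCHAR(18)) are derived from the data by logic not shown here:

```r
# Hypothetical mapping from R classes to TSQL types (assumed for illustration).
guess_tsql_type <- function(x) {
  if (is.integer(x))        "INT"
  else if (is.numeric(x))   "NUMERIC(38,5)"
  else if (is.character(x)) paste0("NVARCHAR(", max(nchar(x), na.rm = TRUE), ")")
  else                      "NVARCHAR(255)"
}

sapply(list(id = 1:5, age = c(25.1, 30.2)), guess_tsql_type)
```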

multi_t_test()

The multi_t_test() function performs t-tests comparing multiple groups against a reference group using means, standard errors, and the sample size (when available).

Basic usage

Let’s create a sample dataset and use multi_t_test() to compare birthweights across different maternal age groups:

# Sample data
age_groups <- c("18-24", "25-29", "30-34", "35-39", "40+")
birthweight_means <- c(3150, 3450, 3400, 3250, 3100)  # in grams
birthweight_ses <- c(50, 45, 40, 55, 60)
sample_sizes <- c(500, 800, 750, 400, 200)
reference_group <- 3  # comparing all groups to the 30-34 age group

# Perform multi_t_test
birthweight_comparison <- multi_t_test(means = birthweight_means,
                                       ses = birthweight_ses,
                                       reference_index = reference_group,
                                       n = sample_sizes,
                                       alternative = 'two.sided',
                                       df_method = "estimated", 
                                       alpha = 0.05)

# Add age group labels to the results
birthweight_comparison[, Age_Group := age_groups]

head(birthweight_comparison)
comparison diff_means ci_lower ci_upper p.value significant t.statistic df df_method Age_Group
Group 1 vs Reference -250 -375.64316 -124.3568 0.0001005 TRUE -3.9043440 1054.3852 estimated 18-24
Group 2 vs Reference 50 -68.09846 168.0985 0.4064107 FALSE 0.8304548 1536.8996 estimated 25-29
Group 3 - Referent 0 NA NA NA NA NA NA estimated 30-34
Group 4 vs Reference -150 -283.49100 -16.5090 0.0276885 TRUE -2.2056439 811.7334 estimated 35-39
Group 5 vs Reference -300 -441.76996 -158.2300 0.0000390 TRUE -4.1602515 394.4939 estimated 40+

Key features

  1. Flexible degrees of freedom calculation: The function offers four methods for calculating degrees of freedom:

    • “estimated” (default): uses the Welch–Satterthwaite equation
    • “conservative”: uses the minimum possible degrees of freedom (df = 2); lower Type I error (false positives) and higher Type II error (false negatives)
    • “moderate”: uses the number of groups minus 1 (df = k - 1)
    • “liberal”: assumes infinite degrees of freedom (df = Inf); higher Type I error (false positives) and lower Type II error (false negatives)
  2. Sample size estimation: If sample sizes are not provided, the function can estimate them based on the distribution of mean values.

  3. Adjustment for multiple comparisons (optional): If desired, it will adjust p-values for multiple comparisons using one of two methods (Holm-Bonferroni or Benjamini-Hochberg).

  4. Comprehensive output: The function returns a data.table containing comparison results, including difference in means, confidence intervals, p-values, and significance indicators.
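
The per-group computation reduces to a Welch-style t-test built from summary statistics. A base-R sketch using the standard Welch–Satterthwaite formulas (multi_t_test() itself offers more options, such as alternative hypotheses and p-value adjustment):

```r
# Compare each group's mean against a reference group using only
# means, standard errors, and sample sizes.
welch_vs_ref <- function(means, ses, ns, ref) {
  t  <- (means - means[ref]) / sqrt(ses^2 + ses[ref]^2)
  df <- (ses^2 + ses[ref]^2)^2 /                     # Welch-Satterthwaite df
        (ses^4 / (ns - 1) + ses[ref]^4 / (ns[ref] - 1))
  data.frame(diff = means - means[ref], t = t, df = df,
             p = 2 * pt(-abs(t), df))
}

res <- welch_vs_ref(means = c(3150, 3450, 3400, 3250, 3100),
                    ses   = c(50, 45, 40, 55, 60),
                    ns    = c(500, 800, 750, 400, 200),
                    ref   = 3)
round(res$t, 2)  # group 1 vs reference is approximately -3.90, as above
```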

round2()

round2() rounds numbers the way we learned in elementary school: if the digit to the right of the rounding position is 5 or greater, round up; otherwise, round down. Note! Base R’s round() function follows a different logic (the IEC 60559 / IEEE 754 standard), which specifies rounding halves to the nearest even number.

set.seed(98104)
dt <- data.table(orig = seq(0.5, 5.5, 1))
dt[, round2 := round2(orig)]
dt[, baseR := round(orig)]
dt[round2 != baseR, different := '*']
head(dt)
orig round2 baseR different
0.5 1 0 *
1.5 2 2 NA
2.5 3 2 *
3.5 4 4 NA
4.5 5 4 *
5.5 6 6 NA

Like the base R round() function, round2() can also be used to round large numbers to convey uncertainty in estimates:

# round to hundreds
round2(123456, -2)
[1] 123500
# round to thousands
round2(123456, -3)
[1] 123000
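
The half-away-from-zero rule can be written in a few lines of base R. This is a sketch of the behavior; rads’ round2() may handle edge cases such as floating-point representation differently:

```r
# Round half away from zero, in contrast to base round()'s half-to-even.
round_half_up <- function(x, digits = 0) {
  scale <- 10^digits
  sign(x) * floor(abs(x) * scale + 0.5) / scale
}

round_half_up(2.5)         # 3  (base round(2.5) gives 2)
round_half_up(123456, -2)  # 123500
```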

string_clean()

string_clean() cleans string and factor columns in a data.frame or data.table. It eliminates random white spaces and pseudo-white spaces and sets blanks to true NA.

# create table 
dt <- data.table(
  name = c(" John   ", "Jane  Doe", "Bob Smith", "  ", "  Edward  Jenner"),
  category = c("A ", "   B", "", "D", "E      ")
)
print(dt$category)
[1] "A "      "   B"    ""        "D"       "E      "
head(dt)
name category
John A
Jane Doe B
Bob Smith
D
Edward Jenner E
# clean table
string_clean(dt)
print(dt$category)
[1] "A" "B" NA  "D" "E"
head(dt)
name category
John A
Jane Doe B
Bob Smith NA
NA D
Edward Jenner E
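
For a single character vector, the cleaning steps amount to trimming, collapsing internal whitespace runs, and promoting blanks to NA. A minimal sketch (string_clean() additionally handles factors, pseudo-whitespace characters, and whole tables by reference):

```r
clean_strings <- function(x) {
  x <- gsub("\\s+", " ", trimws(x))  # trim ends, squeeze internal runs
  x[x == ""] <- NA                   # set blanks to true NA
  x
}

clean_strings(c(" John   ", "Jane  Doe", "  "))
# "John" "Jane Doe" NA
```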

std_error()

std_error() calculates the standard error of the mean for a vector, the columns of a table, or a list. R does not have a built-in standard error of the mean function, so we created one for rads.

Use with a vector

set.seed(98104)
x <- rnorm(1000)
se <- std_error(x)
print(paste("Standard Error:", round2(se, 5)))
[1] "Standard Error: 0.03207"

Use with a data.table / data.frame

set.seed(98104)
dt <- data.table(x = rnorm(1000), y = rnorm(1000))
se <- std_error(dt)
se <- round2(se, 5)
print(se)
      x       y 
0.03207 0.03235 

Use with a list

set.seed(98104)
mylist <- list(x = rnorm(1000), y = rnorm(1000))
se <- std_error(mylist) # returns a list with the standard errors
se <- lapply(se, round2, n = 5)
print(se)
$x
[1] 0.03207

$y
[1] 0.03235
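
The underlying formula is the familiar sd(x) / sqrt(n). A one-function sketch for a single vector (std_error() extends this to tables and lists, and its NA handling may differ from the assumption below):

```r
# Standard error of the mean, dropping missing values (an assumption).
sem <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

set.seed(98104)
round(sem(rnorm(1000)), 5)
```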

substrRight()

substrRight() extracts a substring from the right side of a string. It complements the base R function substr.

text <- "Hello, World!"
substr_right <- substrRight(text, 2, 6)
print(substr_right)
[1] "World"

This function is useful when you need to extract the right-hand portion of strings, e.g., the last 4 digits of a phone number.
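
Counting positions from the right can be translated into an ordinary substr() call. A sketch of the idea (the rads implementation may differ in argument handling):

```r
# Extract characters 'start' through 'stop', counted from the right end.
substr_right <- function(x, start, stop) {
  n <- nchar(x)
  substr(x, n - stop + 1, n - start + 1)
}

substr_right("Hello, World!", 2, 6)  # "World"
substr_right("555-867-5309", 1, 4)  # "5309", the last 4 digits
```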

tsql_validate_field_types()

tsql_validate_field_types() validates whether a named vector of TSQL data types is compatible with a data.table. This function is useful when you have a yaml file specifying the TSQL field types to be used when uploading your data to a server.

A successful validation

dt <- data.table(
  id = 1:5,
  name = c("Alicia", "Bianca", "Clarence", "Dwane", "Eve"),
  age = c(25, 30, 35, 40, 45), 
  time = Sys.time()
)

field_types <- c(id = "int", 
                 name = "varchar(50)", 
                 age = "int", 
                 time = 'datetime')
validation_result <- tsql_validate_field_types(dt, field_types)
🙂 Success! Your desired TSQL data types are suitable for your dataset.
head(validation_result)
colname R_type tsql_type is_valid issue
age numeric int TRUE NA
id integer int TRUE NA
name character varchar TRUE NA
time POSIXct datetime TRUE NA

A failed validation

dt <- data.table(
  id = 1:5,
  name = c("Alicia", "Bianca", "Clarence", "Dwane", "Eve"),
  age = as.character(c(25, 30, 35, 40, 45)), 
  time = Sys.Date()
)

field_types <- c(id = "int", 
                 name = "varchar(50)", 
                 age = "int", # should generate an error
                 time = 'datetime') # should generate an error
validation_result <- tsql_validate_field_types(dt, field_types)
Error in `tsql_validate_field_types()`:
! 
🛑👿 The following columns in your dataset did not align with the proposed TSQL field types:
     column: age, R Type: character, TSQL Type: int, issue: Incompatible types
     column: time, R Type: Date, TSQL Type: datetime, issue: Incompatible types
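
The validation reduces to checking each column's R class against a compatibility table for its proposed TSQL type. The sketch below is a hedged illustration; the type names and rules are assumptions, and rads’ actual rules are broader:

```r
# Hypothetical R-class vs TSQL-type compatibility check.
compatible_type <- function(r_class, tsql_type) {
  base_type <- tolower(sub("\\(.*$", "", tsql_type))   # strip "(50)" etc.
  ok <- list(integer   = c("int", "bigint", "smallint", "tinyint"),
             numeric   = c("float", "numeric", "decimal", "int"),
             character = c("varchar", "nvarchar", "char", "nchar"),
             Date      = c("date"),
             POSIXct   = c("datetime", "datetime2"))
  base_type %in% ok[[r_class]]
}

compatible_type("character", "varchar(50)")  # TRUE
compatible_type("Date", "datetime")          # FALSE, as in the failure above
```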

tsql_chunk_loader()

tsql_chunk_loader() loads large datasets to Microsoft SQL Server in ‘chunks’. It is suitable for datasets too large for a single upload. For extremely large datasets, try using bcp.

Benefits include:

  • Loads large datasets in chunks, preventing memory and batch size issues
  • Reduces risk of network timeouts during data transfer
  • Attempts multiple uploads per chunk for error resilience
  • Provides progress tracking
  • Allows SQL data type specification and validation
  • Verifies complete data upload

 mydt = data.table(col1 = 1:10000L,  # create integer
                   col2 = 1:10000/3) # create float
 mydt[, col3 := as.Date(Sys.Date()) - col1] # create date
 mydt[, col4 := as.character(col3)] # create string

 myfieldtypes <- c(col1 = 'int', col2 = 'float', 
                   col3 = 'date', col4 = 'nvarchar(20)')

 tsql_chunk_loader(
   ph.data = mydt,
   db_conn = rads::validate_hhsaw_key(), # connect to Azure 16 hhs_analytics_workspace
   chunk_size = 3333,
   schema_name = Sys.getenv("USERNAME"),
   table_name = 'JustTesting',
   overwrite = TRUE,
   append = FALSE,
   field_types = myfieldtypes,
   validate_field_types = TRUE, # uses tsql_validate_field_types()
   validate_upload = TRUE
 )
Warning: `validate_hhsaw_key()` was deprecated in rads 1.5.2.
ℹ Please use `apde.data::authenticate_hhsaw()` instead.
ℹ 

🙂 Success! Your desired TSQL data types are suitable for your dataset.

2026-02-11 12:09:57.361791: Loading chunk 1 of 4: rows 1-3,333

2026-02-11 12:09:57.830526: Loading chunk 2 of 4: rows 3,334-6,666

2026-02-11 12:09:58.226583: Loading chunk 3 of 4: rows 6,667-9,999

2026-02-11 12:09:58.63213: Loading chunk 4 of 4: rows 10,000-10,000

🙂 🎉 🎊 🌈 
 Congratulations! All the rows in {ph.data} were successfully uploaded.
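
The chunk boundaries in the log above follow from simple index arithmetic. A sketch of how the rows might be partitioned (illustrative only; tsql_chunk_loader()'s internals may differ):

```r
# Partition 1:n_rows into chunks of at most chunk_size rows.
chunk_bounds <- function(n_rows, chunk_size) {
  starts <- seq(1, n_rows, by = chunk_size)
  data.frame(chunk = seq_along(starts),
             start = starts,
             end   = pmin(starts + chunk_size - 1, n_rows))
}

chunk_bounds(10000, 3333)
# 4 chunks: rows 1-3333, 3334-6666, 6667-9999, 10000-10000
```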

validate_hhsaw_key()

validate_hhsaw_key() validates HHSAW keys and opens a connection to HHSAW when possible.

 myConnection <- validate_hhsaw_key(hhsaw_key = 'hhsaw')
 print(myConnection)
<OdbcConnection> rbuie@kingcounty.gov@kcitazrhpasqlprp16
  Database: hhs_analytics_workspace
  Microsoft SQL Server Version: 12.00.1017

Conclusion

These utility functions provide tools for common data manipulation and analysis tasks. By leveraging these functions, you can streamline your workflow, improve code readability, and handle various data-related challenges more efficiently.

Remember to consult the individual function documentation for more detailed information on usage and parameters. Happy coding!

Updated February 11, 2026 (rads v1.5.3)
