
utilities

Ronald W. Buie edited this page Feb 11, 2026 · 6 revisions

Utility Functions

Introduction

The rads package includes a suite of small utility functions designed to simplify common data manipulation tasks and accelerate analysis workflows.

The core utility functions we’ll cover in this vignette are: calc_age(), convert_to_date(), create_dictionary(), data_modeler(), format_time() and format_time_simple(), generate_yaml(), multi_t_test(), round2(), string_clean(), std_error(), substrRight(), tsql_validate_field_types(), tsql_chunk_loader(), and validate_hhsaw_key().

Let’s explore each of these functions with examples to demonstrate their usage and benefits.

Setting up the environment

First, let’s load the necessary packages:

library(rads)
library(data.table)

Functions

calc_age()

The calc_age() function calculates age in complete years between two dates. In practice, it has identified numerous errors in the ages recorded in DOH vital statistics.

calc_age() is better than subtracting years

birth_date <- as.Date("1990-08-02")
current_date <- as.Date("2024-08-01")
simple_age <- year(current_date) - year(birth_date)
proper_age <- calc_age(from = birth_date, to = current_date)

head(data.table(birth_date, current_date, simple_age, proper_age))
birth_date current_date simple_age proper_age
1990-08-02 2024-08-01 34 33

calc_age() is better than dividing days by 365.25

dates <- data.table(
  birth = seq.Date(as.Date('2000-02-25'), as.Date('2000-03-02'), 1), 
  death = seq.Date(as.Date('2024-03-02'), as.Date('2024-02-25'), -1))
dates[, simple_age := round(as.integer(death - birth)/365.25)]
dates[, proper_age := calc_age(birth, death)]

head(dates)
birth death simple_age proper_age
2000-02-25 2024-03-02 24 24
2000-02-26 2024-03-01 24 24
2000-02-27 2024-02-29 24 24
2000-02-28 2024-02-28 24 24
2000-02-29 2024-02-27 24 23
2000-03-01 2024-02-26 24 23
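
The completed-years logic behind calc_age() can be sketched in base R. This is only an illustration of the idea, not the package's actual implementation, which may differ:

```r
# Minimal sketch: count completed years, subtracting one when the
# 'birthday' has not yet occurred in the 'to' year.
age_in_years <- function(from, to) {
  from <- as.POSIXlt(from)
  to   <- as.POSIXlt(to)
  before_birthday <- (to$mon < from$mon) |
    (to$mon == from$mon & to$mday < from$mday)
  to$year - from$year - before_birthday
}

age_in_years(as.Date("1990-08-02"), as.Date("2024-08-01"))  # 33
```

Note how this reproduces the leap-day example above: a person born 2000-02-29 has not yet had their birthday on 2024-02-27, so they are 23, not 24.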

convert_to_date()

convert_to_date() attempts to convert various US-style date formats to R Dates.

dates <- data.table(orig = c("2024-07-25", "07/26/2024", 
                             "07-27-2024", "20240728", 
                             "29 July, 2024", "July 30, 2024"))
dates[, rads_dates := convert_to_date(orig)]
dates[, simple_dates := as.Date(orig)]

head(dates)
orig rads_dates simple_dates
2024-07-25 2024-07-25 2024-07-25
07/26/2024 2024-07-26 NA
07-27-2024 2024-07-27 NA
20240728 57317-03-24 NA
29 July, 2024 2024-07-29 NA
July 30, 2024 2024-07-30 NA

This function is particularly useful when dealing with inconsistent date formats within a single vector or column of your data. Note, however, that the undelimited “20240728” above was converted to an implausible date, so always sanity-check the results.
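
This kind of mixed-format parsing can be approximated in base R by trying a sequence of formats per element. The sketch below is a hedged illustration only (convert_to_date()'s internals may differ):

```r
# Try each candidate format in turn, keeping the first successful parse.
parse_mixed_dates <- function(x) {
  fmts <- c("%Y-%m-%d", "%m/%d/%Y", "%m-%d-%Y", "%Y%m%d",
            "%d %B, %Y", "%B %d, %Y")
  out <- rep(as.Date(NA), length(x))
  for (f in fmts) {
    idx <- is.na(out)                      # only re-try unparsed elements
    out[idx] <- as.Date(x[idx], format = f)
  }
  out
}

parse_mixed_dates(c("2024-07-25", "07/26/2024", "20240728"))
```

The order of formats matters: unambiguous formats should be tried before looser ones, and month-name formats ("%B") are locale dependent.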

create_dictionary()

The create_dictionary() function generates a comprehensive data dictionary for a given data.table or data.frame. It provides details about variable types, sample values, and can incorporate descriptions and notes from a reference table. The function can be used to both understand a new dataset and document the products of your ETL processes.

Basic usage

Let’s create a sample dataset and generate a dictionary:

dt <- data.table(
  name = c("Alicia", "Bianca", "Clarence", "Dwane", "Eve"),
  gender = factor(c("F", "M", "M", "M", "F")),
  age = c(25, 30, 35, 40, 45),
  score = c(85.5, 92.0, 78.5, 88.0, 95.5)
)

dict <- create_dictionary(ph.data = dt, source = "sample_data")

head(dict, 10)
source varname vartype values factor_labels desc notes dict_updated
sample_data name character Alicia NA NA NA 2026-02-11
sample_data name character Bianca NA NA NA 2026-02-11
sample_data name character Clarence NA NA NA 2026-02-11
sample_data name character Dwane NA NA NA 2026-02-11
sample_data name character Eve NA NA NA 2026-02-11
sample_data gender factor 1 F NA NA 2026-02-11
sample_data gender factor 2 M NA NA 2026-02-11
sample_data age integer 25 NA NA NA 2026-02-11
sample_data age integer 30 NA NA NA 2026-02-11
sample_data age integer 35 NA NA NA 2026-02-11

Suppressing sensitive information

You can suppress sample values for sensitive columns:

dict_suppressed <- create_dictionary(ph.data = dt, 
                                     source = "sample_data", 
                                     suppress = c("name"))

head(dict_suppressed, 10)
source varname vartype values factor_labels desc notes dict_updated
sample_data name character suppressed NA NA NA 2026-02-11
sample_data gender factor 1 F NA NA 2026-02-11
sample_data gender factor 2 M NA NA 2026-02-11
sample_data age integer 25 NA NA NA 2026-02-11
sample_data age integer 30 NA NA NA 2026-02-11
sample_data age integer 35 NA NA NA 2026-02-11
sample_data age integer 40 NA NA NA 2026-02-11
sample_data age integer 45 NA NA NA 2026-02-11
sample_data score numeric 78.5 NA NA NA 2026-02-11
sample_data score numeric 85.5 NA NA NA 2026-02-11

Incorporating reference information

You can include additional descriptions and notes using a reference table:

ref_table <- data.table(
  source = rep("sample_data", 4),
  varname = c("name", "age", "gender", "score"),
  desc = c("Full name", "Age in years", "Gender", "Test score"),
  notes = c("May contain special characters", 
            "Rounded to nearest year", "Binary classification", 
            "Range: 0-100")
)

dict_with_ref <- create_dictionary(ph.data = dt, 
                                   source = "sample_data", 
                                   ph.ref = ref_table)

head(dict_with_ref, 8)
source varname vartype values factor_labels desc notes dict_updated
sample_data name character Alicia NA Full name May contain special characters 2026-02-11
sample_data name character Bianca NA Full name May contain special characters 2026-02-11
sample_data name character Clarence NA Full name May contain special characters 2026-02-11
sample_data name character Dwane NA Full name May contain special characters 2026-02-11
sample_data name character Eve NA Full name May contain special characters 2026-02-11
sample_data gender factor 1 F Gender Binary classification 2026-02-11
sample_data gender factor 2 M Gender Binary classification 2026-02-11
sample_data age integer 25 NA Age in years Rounded to nearest year 2026-02-11

Handling numerous unique values

For datasets with many unique values, create_dictionary() provides summaries:

set.seed(98104)
large_dt <- data.table(
  id = 1:1000,
  category = sample(letters[1:10], 1000, replace = TRUE),
  value = round2(rnorm(1000), 2)
)

large_dict <- create_dictionary(ph.data = large_dt, 
                                source = "large_sample")

head(large_dict, 8)
source varname vartype values factor_labels desc notes dict_updated
large_sample id integer min = 1, max = 1000 NA NA NA 2026-02-11
large_sample category character a NA NA NA 2026-02-11
large_sample category character b NA NA NA 2026-02-11
large_sample category character c NA NA NA 2026-02-11
large_sample category character d NA NA NA 2026-02-11
large_sample category character e NA NA NA 2026-02-11
large_sample category character NA NA NA 2026-02-11
large_sample value numeric min = -3.33, max = 3.73 NA NA NA 2026-02-11

data_modeler()

data_modeler() receives a data.table object and returns a data.table of similar structure and content, but with uncorrelated data across columns. For example, a column of dates will become a column of dates of a similar range, and distribution, but randomly distributed across observations. Common use cases include:

  • demonstrating a statistic or visualization without exposing the audience to “real” data or results
  • sharing code that generates a data set for minimally viable examples used in requesting help from, or teaching, others
  • creating data sets programmatically for (unit) testing or training

There are two kinds of output that data_modeler() returns, either the data table itself, or a list object of strings that can be evaluated to create the data table. In addition, data_modeler() can output-to-terminal a human readable version of the generation code that the user can copy and paste into their own scripts for sharing publicly.

data_modeler() makes some guesses as to whether a column’s content should be randomized (such as person names or blood pressures) or kept but randomly redistributed (such as levels of response to a question or city names). This is intended to generate random data, or to retain original categories, where appropriate. It bases these guesses on the data type (integer or string) and complexity (e.g., 75% of values being unique), but not on the meaning of the data (i.e., it doesn’t know that a number is a “zip code” or that a string is a “human name”). Variables identified as non-categorical will, in most cases, be replaced with noisy but similarly shaped data (e.g., floating point numbers of the same range and precision, but in a random uniform distribution, or a “lorem ipsum” string of similar length). Variables identified as categorical will, in most cases, be replaced with identical categories in a similar distribution to the original, but randomly assigned across observations (a data set of 95% males and 5% females would still generate a data set of roughly 95% males and 5% females).
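
For the categorical case, the redistribution described above amounts to resampling a column from its own empirical distribution, which preserves category proportions while breaking any row-level correlation. A simplified sketch (not data_modeler()'s actual code):

```r
set.seed(42)
x <- c(rep("M", 95), rep("F", 5))          # a 95% / 5% split
y <- sample(x, length(x), replace = TRUE)  # similar proportions in expectation,
                                           # but uncorrelated with row order
prop.table(table(y))
```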

Two parameters are available to adjust the assumptions data_modeler() uses to identify categorical variables: “categorical_threshold” and “proportional_threshold”.

  • “categorical_threshold”: the number of unique categories that can exist in a variable before data_modeler() assumes the variable isn’t categorical; above this number, the variable is not treated as categorical. This currently defaults to 61.
  • “proportional_threshold”: the proportion of values that can be unique before data_modeler() assumes the variable isn’t categorical. This currently defaults to 0.75.
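
A sketch of how these two thresholds might jointly classify a column, using the documented defaults (the package's actual heuristic may be more involved):

```r
# Hedged sketch of the categorical heuristic described above.
looks_categorical <- function(x, categorical_threshold = 61,
                              proportional_threshold = 0.75) {
  vals <- x[!is.na(x)]
  n_unique <- length(unique(vals))
  n_unique <= categorical_threshold &&          # few enough distinct levels
    n_unique / length(vals) <= proportional_threshold  # mostly repeated values
}

looks_categorical(rep(c("a", "b", "c"), 100))  # TRUE: few repeated levels
looks_categorical(as.character(1:1000))        # FALSE: every value is unique
```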

data privacy caveats

uncorrelated may not always equal non-informative

data_modeler() creates an uncorrelated column of data to substitute for the original by randomly reassigning values across that column. However, this does not ensure that meaningful information is not passed along. In most cases, categorical observations are distributed in proportion to the source data. So, if the provided data set contains categorical outcomes for a single population, the uncorrelated result would still show a similar rate of that outcome, because all observations belong to the same population group. The analyst should consider whether this is acceptable and, if necessary, correct it by changing the input data or modifying the synthetic data. A simple approach in the above example would be to add one or more additional population groups so that the outcomes are distributed across groups, removing the ability to estimate the rate for the smaller population (in an uncorrelated result, the rate of the categorical outcome will tend to be similar in all populations).

walkthrough

For the examples below, we create a “real” data set and show how to use data_modeler() and how the synthetic and “real” data differ.

Our demo “real” data has one indicator variable representing an outcome that is never present in the White population, sometimes present in the Asian population, and always present in other races. In an uncorrelated (and so non-informative) synthetic data set, we would expect this indicator to no longer be biased by race.

DT.real.data <- data.table(
  id = 1:1000,
  chi_geo_kc = sample(c('King County', NA), 1000, replace = TRUE),
  chi_race_7 = factor(sample(c("Asian", "AIAN", "Black", "Hispanic", "NHPI", "White", "Other", "Multiple", NA), 1000, replace = TRUE, prob = c(.19, .01, .07, .11, .01, .35, .07, .14, .02)), levels = c("Asian", "AIAN", "Black", "Hispanic", "NHPI", "White", "Other", "Multiple", NA)),
  chi_sex = as.factor(sample(c("Male", "Female"), 1000, replace = TRUE)),
  chi_geo_region = factor(sample(c("South", "North", "Seattle", "East"), 1000, replace = TRUE), levels = c("South", "North", "Seattle", "East")))

# Attach a biased indicator:
# indicator1 is "never" for the White population, "sometimes" for the Asian population, and "always" for everyone else
DT.real.data[, indicator1 := ifelse(chi_race_7 == "White", "never",
                             ifelse(chi_race_7 == "Asian", "sometimes",
                                    "always"))]

# Observe that modeled data have the same structure as the original
DT.synthetic.data <- data_modeler(ph.data = DT.real.data, # our "real" data
                                  number_of_observations = 100, # number of observations to create in the synthetic data
                                  comments = TRUE, # whether or not to include comments in the code (only matters when return_code = TRUE or print_code = TRUE)
                                  return_code = FALSE, # return a list object of strings for creating the synthetic data instead of the data itself; leave FALSE to receive synthetic data
                                  print_code = FALSE) # print the generation code to the terminal; useful for review and for copying into other contexts, such as a shareable script

You can observe that the data structures are the same for both data frames:

str(DT.real.data)
str(DT.synthetic.data)
Classes 'data.table' and 'data.frame':  1000 obs. of  6 variables:
 $ id            : int  1 2 3 4 5 6 7 8 9 10 ...
 $ chi_geo_kc    : chr  NA "King County" NA NA ...
 $ chi_race_7    : Factor w/ 8 levels "Asian","AIAN",..: 8 NA 6 3 8 7 1 6 6 1 ...
 $ chi_sex       : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 2 1 2 1 1 ...
 $ chi_geo_region: Factor w/ 4 levels "South","North",..: 2 2 2 2 2 2 1 2 2 4 ...
 $ indicator1    : chr  "always" NA "never" "always" ...
 - attr(*, ".internal.selfref")=<externalptr> 
Classes 'data.table' and 'data.frame':  100 obs. of  6 variables:
 $ id            : int  1 2 3 4 5 6 7 8 9 10 ...
 $ chi_geo_kc    : chr  "King County" "King County" "King County" "King County" ...
 $ chi_race_7    : Factor w/ 8 levels "Asian","AIAN",..: 1 5 8 5 1 2 5 1 6 8 ...
 $ chi_sex       : Factor w/ 2 levels "Female","Male": 2 1 1 1 1 1 2 1 1 1 ...
 $ chi_geo_region: Factor w/ 4 levels "South","North",..: 1 2 1 1 2 1 3 3 4 1 ...
 $ indicator1    : chr  "never" "always" NA "always" ...
 - attr(*, ".internal.selfref")=<externalptr> 

But we can see that the original bias (Whites in the never group, Asians in the sometimes group, and all others in the always group) no longer holds.

prop.table(table(DT.real.data[,.(indicator1,chi_race_7)]))
           chi_race_7
indicator1        Asian        AIAN       Black    Hispanic        NHPI
  always    0.000000000 0.013251784 0.082568807 0.114169215 0.007135576
  never     0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
  sometimes 0.191641182 0.000000000 0.000000000 0.000000000 0.000000000
           chi_race_7
indicator1        White       Other    Multiple
  always    0.000000000 0.071355759 0.161060143
  never     0.358817533 0.000000000 0.000000000
  sometimes 0.000000000 0.000000000 0.000000000
prop.table(table(DT.synthetic.data[,.(indicator1,chi_race_7)]))
          chi_race_7
indicator1      Asian       AIAN      Black   Hispanic       NHPI      White
    always 0.31666667 0.03333333 0.03333333 0.06666667 0.11666667 0.06666667
    never  0.11666667 0.01666667 0.03333333 0.01666667 0.01666667 0.03333333
          chi_race_7
indicator1      Other   Multiple
    always 0.00000000 0.13333333
    never  0.00000000 0.00000000

Notice that you may see extreme, apparently biased, results in the outcomes. These are still randomly distributed, and you can observe them change if you recreate the synthetic data.

You can review, share, and execute the code used to create the synthetic data in two ways: manually, by printing it to the terminal and copying it into your R script, or by receiving and processing the text object. The former is more useful for review and sharing with others; the latter is more useful for programmatic handling of the instruction set.

DT.synthetic.data <- data_modeler(ph.data = DT.real.data,
                                  number_of_observations = 100, 
                                  comments = TRUE, 
                                  return_code = FALSE,
                                  print_code = TRUE) # print the generation code so you can copy it from your terminal output
DT <- data.table( 
`id` = as.integer(1:100), # as an identifying (all unique) integer 
`chi_geo_kc` = sample(c(NA, "King County"), 100, replace = TRUE, prob = c(0.507, 0.493)), # as categorical character (non factor) 
`chi_race_7` = factor(sample(c("Multiple", NA, "White", "Black", "Other", "Asian", "Hispanic", "NHPI", "AIAN"), 100, replace = TRUE, prob = c(0.188, 0.013, 0.081, 0.112, 0.007, 0.352, 0.07, 0.158, 0.019)), levels = c("Asian", "AIAN", "Black", "Hispanic", "NHPI", "White", "Other", "Multiple"), ordered = FALSE), # as factor 
`chi_sex` = factor(sample(c("Male", "Female"), 100, replace = TRUE, prob = c(0.493, 0.507)), levels = c("Female", "Male"), ordered = FALSE), # as factor 
`chi_geo_region` = factor(sample(c("North", "South", "East", "Seattle"), 100, replace = TRUE, prob = c(0.233, 0.253, 0.256, 0.258)), levels = c("South", "North", "Seattle", "East"), ordered = FALSE), # as factor 
`indicator1` = sample(c("always", NA, "never", "sometimes"), 100, replace = TRUE, prob = c(0.441, 0.352, 0.188, 0.019))) # as categorical character (non factor)

The code shown is what is used to generate the synthetic data. Using this approach you can see the sampling approach and probabilities for each variable. Because we ran data_modeler() with comments = TRUE, we also see a human-readable string at the end of each line describing how each variable was classified by data_modeler(). This allows you to review the logic, make changes, and share the code with others.

DT <- data.table( 
`id` = as.integer(1:100), # as an identifying (all unique) integer 
`chi_geo_kc` = sample(c(NA, "King County"), 100, replace = TRUE, prob = c(0.494, 0.506)), # as categorical character (non factor) 
`chi_race_7` = factor(sample(c("Black", "White", "AIAN", NA, "Asian", "Multiple", "Other", "NHPI", "Hispanic"), 100, replace = TRUE, prob = c(0.193, 0.014, 0.067, 0.11, 0.012, 0.36, 0.064, 0.157, 0.023)), levels = c("Asian", "AIAN", "Black", "Hispanic", "NHPI", "White", "Other", "Multiple"), ordered = FALSE), # as factor 
`chi_sex` = factor(sample(c("Male", "Female"), 100, replace = TRUE, prob = c(0.518, 0.482)), levels = c("Female", "Male"), ordered = FALSE), # as factor 
`chi_geo_region` = factor(sample(c("Seattle", "South", "North", "East"), 100, replace = TRUE, prob = c(0.241, 0.229, 0.258, 0.272)), levels = c("South", "North", "Seattle", "East"), ordered = FALSE), # as factor 
`indicator1` = sample(c("always", "never", NA, "sometimes"), 100, replace = TRUE, prob = c(0.424, 0.36, 0.193, 0.023))) # as categorical character (non factor)

Note that the above approach requires the user to copy and paste output from their terminal. If you wish to check, modify, and extract the code without human intervention, you need to receive the text as a list object using return_code = TRUE:

DT.synthetic.data.instructions <- data_modeler(ph.data = DT.real.data,
                                number_of_observations =  100, 
                                comments = FALSE, 
                                return_code = TRUE, #setting this to "TRUE" will cause data_modeler() to return a list object of the text instead of a DT of synthetic data
                                print_code = FALSE)

head(DT.synthetic.data.instructions)
[[1]]
[1] "DT <- data.table("

[[2]]
[1] "`id` = as.integer(1:100),"

[[3]]
[1] "`chi_geo_kc` = sample(c(NA, \"King County\"), 100, replace = TRUE, prob = c(0.507, 0.493)),"

[[4]]
[1] "`chi_race_7` = factor(sample(c(\"Multiple\", NA, \"White\", \"Black\", \"Other\", \"Asian\", \"Hispanic\", \"NHPI\", \"AIAN\"), 100, replace = TRUE, prob = c(0.188, 0.013, 0.081, 0.112, 0.007, 0.352, 0.07, 0.158, 0.019)), levels = c(\"Asian\", \"AIAN\", \"Black\", \"Hispanic\", \"NHPI\", \"White\", \"Other\", \"Multiple\"), ordered = FALSE),"

[[5]]
[1] "`chi_sex` = factor(sample(c(\"Male\", \"Female\"), 100, replace = TRUE, prob = c(0.493, 0.507)), levels = c(\"Female\", \"Male\"), ordered = FALSE),"

[[6]]
[1] "`chi_geo_region` = factor(sample(c(\"North\", \"South\", \"East\", \"Seattle\"), 100, replace = TRUE, prob = c(0.233, 0.253, 0.256, 0.258)), levels = c(\"South\", \"North\", \"Seattle\", \"East\"), ordered = FALSE),"

After processing the text as desired, you can execute it, creating the desired synthetic data object, as follows:

## Manually convert received code into data
codeText <- paste(unlist(DT.synthetic.data.instructions), collapse = "\n")
DT.synthetic.data.manual <- eval(parse(text = codeText))

Now you can visually confirm that the class structure of each of the tables is the same:

str(DT.real.data)
Classes 'data.table' and 'data.frame':  1000 obs. of  6 variables:
 $ id            : int  1 2 3 4 5 6 7 8 9 10 ...
 $ chi_geo_kc    : chr  NA "King County" NA NA ...
 $ chi_race_7    : Factor w/ 8 levels "Asian","AIAN",..: 8 NA 6 3 8 7 1 6 6 1 ...
 $ chi_sex       : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 2 1 2 1 1 ...
 $ chi_geo_region: Factor w/ 4 levels "South","North",..: 2 2 2 2 2 2 1 2 2 4 ...
 $ indicator1    : chr  "always" NA "never" "always" ...
 - attr(*, ".internal.selfref")=<externalptr> 
str(DT.synthetic.data)
Classes 'data.table' and 'data.frame':  100 obs. of  6 variables:
 $ id            : int  1 2 3 4 5 6 7 8 9 10 ...
 $ chi_geo_kc    : chr  "King County" NA NA "King County" ...
 $ chi_race_7    : Factor w/ 8 levels "Asian","AIAN",..: 8 5 1 1 6 5 1 1 1 3 ...
 $ chi_sex       : Factor w/ 2 levels "Female","Male": 1 2 1 2 2 1 2 2 1 2 ...
 $ chi_geo_region: Factor w/ 4 levels "South","North",..: 4 1 4 2 3 1 4 1 3 3 ...
 $ indicator1    : chr  NA NA "always" "always" ...
 - attr(*, ".internal.selfref")=<externalptr> 
str(DT.synthetic.data.manual)
Classes 'data.table' and 'data.frame':  100 obs. of  6 variables:
 $ id            : int  1 2 3 4 5 6 7 8 9 10 ...
 $ chi_geo_kc    : chr  NA "King County" "King County" "King County" ...
 $ chi_race_7    : Factor w/ 8 levels "Asian","AIAN",..: 3 1 1 5 8 1 1 1 8 8 ...
 $ chi_sex       : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 1 2 2 2 2 ...
 $ chi_geo_region: Factor w/ 4 levels "South","North",..: 3 1 4 1 1 3 4 2 2 3 ...
 $ indicator1    : chr  "always" "never" NA "always" ...
 - attr(*, ".internal.selfref")=<externalptr> 
sapply(DT.real.data, class)
            id     chi_geo_kc     chi_race_7        chi_sex chi_geo_region 
     "integer"    "character"       "factor"       "factor"       "factor" 
    indicator1 
   "character" 
sapply(DT.synthetic.data, class)
            id     chi_geo_kc     chi_race_7        chi_sex chi_geo_region 
     "integer"    "character"       "factor"       "factor"       "factor" 
    indicator1 
   "character" 
sapply(DT.synthetic.data.manual, class)
            id     chi_geo_kc     chi_race_7        chi_sex chi_geo_region 
     "integer"    "character"       "factor"       "factor"       "factor" 
    indicator1 
   "character" 

Finally, each time data_modeler() is executed, there is randomness in how the data are created. If you wish to create identical data sets, be sure to set a seed before executing:

DT.synthetic.data.random <- data_modeler(ph.data = DT.real.data,
                                number_of_observations =  100, 
                                comments = TRUE, 
                                return_code = FALSE,
                                print_code = FALSE) 

set.seed(1000)
DT.synthetic.data.s1000a <- data_modeler(ph.data = DT.real.data,
                                number_of_observations =  100, 
                                comments = TRUE, 
                                return_code = FALSE,
                                print_code = FALSE) 

set.seed(1000)
DT.synthetic.data.s1000b <- data_modeler(ph.data = DT.real.data,
                                number_of_observations =  100, 
                                comments = TRUE, 
                                return_code = FALSE,
                                print_code = FALSE) 

Comparing a table created without a fixed seed to one created with a fixed seed, we expect the resulting tables to differ:

identical(DT.synthetic.data.random, DT.synthetic.data.s1000a)
[1] FALSE

When using the same seed for distinct rounds, we would expect the resulting tables to be identical:

identical(DT.synthetic.data.s1000a, DT.synthetic.data.s1000b)
[1] TRUE

format_time() and format_time_simple()

These functions format numeric values (including dates) into human-readable time chunks. They are often used for summarizing years and can help identify unexpected gaps.

years <- c(2001:2003, 
           2005, 
           2008, 
           2011:2012)

dates <- as.Date(c("2024-01-01", "2024-01-02", "2024-01-03",
                   "2024-01-06", 
                   "2024-01-09", "2024-01-10", "2024-01-11", "2024-01-12"))

dt <- data.table(desc = c('format_time', 'format_time_simple', '',
                          'format_time', 'format_time_simple'), 
                 data = c(format_time(years), format_time_simple(years), '', 
                          format_time(dates), format_time_simple(dates)))
head(dt)
desc data
format_time 2001-2003, 2005, 2008, 2011-2012
format_time_simple 2001-2012
format_time 2024-01-01 to 2024-01-03, 2024-01-06, 2024-01-09 to 2024-01-12
format_time_simple 2024-01-01 to 2024-01-12
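
The run-collapsing behavior of format_time() can be sketched in base R by grouping consecutive values (an illustration only; the rads implementation may differ):

```r
# Collapse consecutive integers into "a-b" ranges, leaving singletons alone.
collapse_years <- function(y) {
  y <- sort(unique(y))
  run <- cumsum(c(1, diff(y) != 1))   # start a new run wherever a gap appears
  fmt <- function(g) if (length(g) == 1) as.character(g)
                     else paste0(min(g), "-", max(g))
  paste(vapply(split(y, factor(run, levels = unique(run))),
               fmt, character(1)),
        collapse = ", ")
}

collapse_years(c(2001:2003, 2005, 2008, 2011:2012))
# "2001-2003, 2005, 2008, 2011-2012"
```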

generate_yaml()

generate_yaml() generates a YAML file for SQL loading based on a data.frame or data.table.

data <- data.table(
  id = 1:5,
  name = c("Alicia", "Bianca", "Clarence", "Dwane", "Eve"),
  age = c(25.1, 30.2, 35.3, 40.4, 45.5)
)

yaml_output <- generate_yaml(data, schema = "dbo", table = "example_table", datasource = 'WADOH_CHS')
print(yaml_output)
$datasource
[1] "WADOH_CHS"

$schema
[1] "dbo"

$table
[1] "example_table"

$vars
$vars$id
[1] "INT"

$vars$name
[1] "NVARCHAR(18)"

$vars$age
[1] "NUMERIC(38,5)"

The output can be used to specify the field_types for the tsql_validate_field_types() and tsql_chunk_loader() functions described below.

field_types <- unlist(yaml_output$vars)
print(field_types)
             id            name             age 
          "INT"  "NVARCHAR(18)" "NUMERIC(38,5)" 
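
The class-to-type translation at the heart of generate_yaml() can be sketched as a simple lookup. This mapping is an illustrative assumption, not the package's actual rules; for example, the widths generate_yaml() chose above (such as NVARCHAR(18)) are derived from the data by logic not shown here:

```r
# Hypothetical mapping from R classes to TSQL types (assumed for illustration).
guess_tsql_type <- function(x) {
  if (is.integer(x))        "INT"
  else if (is.numeric(x))   "NUMERIC(38,5)"
  else if (is.character(x)) paste0("NVARCHAR(", max(nchar(x), na.rm = TRUE), ")")
  else                      "NVARCHAR(255)"
}

sapply(list(id = 1:5, age = c(25.1, 30.2)), guess_tsql_type)
```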

multi_t_test()

The multi_t_test() function performs t-tests comparing multiple groups against a reference group using means, standard errors, and the sample size (when available).

Basic usage

Let’s create a sample dataset and use multi_t_test() to compare birthweights across different maternal age groups:

# Sample data
age_groups <- c("18-24", "25-29", "30-34", "35-39", "40+")
birthweight_means <- c(3150, 3450, 3400, 3250, 3100)  # in grams
birthweight_ses <- c(50, 45, 40, 55, 60)
sample_sizes <- c(500, 800, 750, 400, 200)
reference_group <- 3  # comparing all groups to the 30-34 age group

# Perform multi_t_test
birthweight_comparison <- multi_t_test(means = birthweight_means,
                                       ses = birthweight_ses,
                                       reference_index = reference_group,
                                       n = sample_sizes,
                                       alternative = 'two.sided',
                                       df_method = "estimated", 
                                       alpha = 0.05)

# Add age group labels to the results
birthweight_comparison[, Age_Group := age_groups]

head(birthweight_comparison)
comparison diff_means ci_lower ci_upper p.value significant t.statistic df df_method Age_Group
Group 1 vs Reference -250 -375.64316 -124.3568 0.0001005 TRUE -3.9043440 1054.3852 estimated 18-24
Group 2 vs Reference 50 -68.09846 168.0985 0.4064107 FALSE 0.8304548 1536.8996 estimated 25-29
Group 3 - Referent 0 NA NA NA NA NA NA estimated 30-34
Group 4 vs Reference -150 -283.49100 -16.5090 0.0276885 TRUE -2.2056439 811.7334 estimated 35-39
Group 5 vs Reference -300 -441.76996 -158.2300 0.0000390 TRUE -4.1602515 394.4939 estimated 40+

Key features

  1. Flexible degrees of freedom calculation: The function offers four methods for calculating degrees of freedom:

    • “estimated” (default): uses the Welch–Satterthwaite equation
    • “conservative”: uses the minimum possible degrees of freedom (df = 2); lower Type I error (false positives) and higher Type II error (false negatives)
    • “moderate”: uses the number of groups minus 1 (df = k - 1)
    • “liberal”: assumes infinite degrees of freedom (df = Inf); higher Type I error (false positives) and lower Type II error (false negatives)
  2. Sample size estimation: If sample sizes are not provided, the function can estimate them based on the distribution of mean values.

  3. Adjustment for multiple comparisons (optional): If desired, it will adjust p-values for multiple comparisons using one of two methods (Holm-Bonferroni or Benjamini-Hochberg).

  4. Comprehensive output: The function returns a data.table containing comparison results, including difference in means, confidence intervals, p-values, and significance indicators.
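
The per-group computation reduces to a Welch-style t-test built from summary statistics. A base-R sketch using the standard Welch–Satterthwaite formulas (multi_t_test() itself offers more options, such as alternative hypotheses and p-value adjustment):

```r
# Compare each group's mean against a reference group using only
# means, standard errors, and sample sizes.
welch_vs_ref <- function(means, ses, ns, ref) {
  t  <- (means - means[ref]) / sqrt(ses^2 + ses[ref]^2)
  df <- (ses^2 + ses[ref]^2)^2 /                     # Welch-Satterthwaite df
        (ses^4 / (ns - 1) + ses[ref]^4 / (ns[ref] - 1))
  data.frame(diff = means - means[ref], t = t, df = df,
             p = 2 * pt(-abs(t), df))
}

res <- welch_vs_ref(means = c(3150, 3450, 3400, 3250, 3100),
                    ses   = c(50, 45, 40, 55, 60),
                    ns    = c(500, 800, 750, 400, 200),
                    ref   = 3)
round(res$t, 2)  # group 1 vs reference is approximately -3.90, as above
```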

round2()

round2() rounds numbers the way we learned in elementary school: if the digit to the right of the rounding position is 5 or greater, round up; otherwise, round down. Note! Base R’s round() function follows a different logic (the IEC 60559 / IEEE 754 standard), which specifies rounding halves to the nearest even number.

set.seed(98104)
dt <- data.table(orig = seq(0.5, 5.5, 1))
dt[, round2 := round2(orig)]
dt[, baseR := round(orig)]
dt[round2 != baseR, different := '*']
head(dt)
orig round2 baseR different
0.5 1 0 *
1.5 2 2 NA
2.5 3 2 *
3.5 4 4 NA
4.5 5 4 *
5.5 6 6 NA

Like the base R round() function, round2() can also be used to round large numbers to convey uncertainty in estimates:

# round to hundreds
round2(123456, -2)
[1] 123500
# round to thousands
round2(123456, -3)
[1] 123000
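
The half-away-from-zero rule can be written in a few lines of base R. This is a sketch of the behavior; rads’ round2() may handle edge cases such as floating-point representation differently:

```r
# Round half away from zero, in contrast to base round()'s half-to-even.
round_half_up <- function(x, digits = 0) {
  scale <- 10^digits
  sign(x) * floor(abs(x) * scale + 0.5) / scale
}

round_half_up(2.5)         # 3  (base round(2.5) gives 2)
round_half_up(123456, -2)  # 123500
```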

string_clean()

string_clean() cleans string and factor columns in a data.frame or data.table. It eliminates random white spaces and pseudo-white spaces and sets blanks to true NA.

# create table 
dt <- data.table(
  name = c(" John   ", "Jane  Doe", "Bob Smith", "  ", "  Edward  Jenner"),
  category = c("A ", "   B", "", "D", "E      ")
)
print(dt$category)
[1] "A "      "   B"    ""        "D"       "E      "
head(dt)
name category
John A
Jane Doe B
Bob Smith
D
Edward Jenner E
# clean table
string_clean(dt)
print(dt$category)
[1] "A" "B" NA  "D" "E"
head(dt)
name category
John A
Jane Doe B
Bob Smith NA
NA D
Edward Jenner E
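
For a single character vector, the cleaning steps amount to trimming, collapsing internal whitespace runs, and promoting blanks to NA. A minimal sketch (string_clean() additionally handles factors, pseudo-whitespace characters, and whole tables by reference):

```r
clean_strings <- function(x) {
  x <- gsub("\\s+", " ", trimws(x))  # trim ends, squeeze internal runs
  x[x == ""] <- NA                   # set blanks to true NA
  x
}

clean_strings(c(" John   ", "Jane  Doe", "  "))
# "John" "Jane Doe" NA
```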

std_error()

std_error() calculates the standard error of the mean for a vector, the columns of a table, or a list. R does not have a built-in standard error of the mean function, so we created one for rads.

Use with a vector

set.seed(98104)
x <- rnorm(1000)
se <- std_error(x)
print(paste("Standard Error:", round2(se, 5)))
[1] "Standard Error: 0.03207"

Use with a data.table / data.frame

set.seed(98104)
dt <- data.table(x = rnorm(1000), y = rnorm(1000))
se <- std_error(dt)
se <- round2(se, 5)
print(se)
      x       y 
0.03207 0.03235 

Use with a list

set.seed(98104)
mylist <- list(x = rnorm(1000), y = rnorm(1000))
se <- std_error(mylist) # returns a list with the standard errors
se <- lapply(se, round2, n = 5)
print(se)
$x
[1] 0.03207

$y
[1] 0.03235
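
The underlying formula is the familiar sd(x) / sqrt(n). A one-function sketch for a single vector (std_error() extends this to tables and lists, and its NA handling may differ from the assumption below):

```r
# Standard error of the mean, dropping missing values (an assumption).
sem <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

set.seed(98104)
round(sem(rnorm(1000)), 5)
```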

substrRight()

substrRight() extracts a substring from the right side of a string. It complements the base R function substr.

text <- "Hello, World!"
substr_right <- substrRight(text, 2, 6)
print(substr_right)
[1] "World"

This function is useful when you need to extract the right-hand portion of strings, e.g., the last 4 digits of a phone number.
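
Counting positions from the right can be translated into an ordinary substr() call. A sketch of the idea (the rads implementation may differ in argument handling):

```r
# Extract characters 'start' through 'stop', counted from the right end.
substr_right <- function(x, start, stop) {
  n <- nchar(x)
  substr(x, n - stop + 1, n - start + 1)
}

substr_right("Hello, World!", 2, 6)  # "World"
substr_right("555-867-5309", 1, 4)  # "5309", the last 4 digits
```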

tsql_validate_field_types()

tsql_validate_field_types() validates whether a named vector of TSQL data types is compatible with a data.table. This function is useful when you have a yaml file specifying the TSQL field types to be used when uploading your data to a server.

A successful validation

dt <- data.table(
  id = 1:5,
  name = c("Alicia", "Bianca", "Clarence", "Dwane", "Eve"),
  age = c(25, 30, 35, 40, 45), 
  time = Sys.time()
)

field_types <- c(id = "int", 
                 name = "varchar(50)", 
                 age = "int", 
                 time = 'datetime')
validation_result <- tsql_validate_field_types(dt, field_types)
🙂 Success! Your desired TSQL data types are suitable for your dataset.
head(validation_result)
colname R_type tsql_type is_valid issue
age numeric int TRUE NA
id integer int TRUE NA
name character varchar TRUE NA
time POSIXct datetime TRUE NA

A failed validation

dt <- data.table(
  id = 1:5,
  name = c("Alicia", "Bianca", "Clarence", "Dwane", "Eve"),
  age = as.character(c(25, 30, 35, 40, 45)), 
  time = Sys.Date()
)

field_types <- c(id = "int", 
                 name = "varchar(50)", 
                 age = "int", # should generate an error
                 time = 'datetime') # should generate an error
validation_result <- tsql_validate_field_types(dt, field_types)
Error in `tsql_validate_field_types()`:
! 
🛑👿 The following columns in your dataset did not align with the proposed TSQL field types:
     column: age, R Type: character, TSQL Type: int, issue: Incompatible types
     column: time, R Type: Date, TSQL Type: datetime, issue: Incompatible types
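
The validation reduces to checking each column's R class against a compatibility table for its proposed TSQL type. The sketch below is a hedged illustration; the type names and rules are assumptions, and rads’ actual rules are broader:

```r
# Hypothetical R-class vs TSQL-type compatibility check.
compatible_type <- function(r_class, tsql_type) {
  base_type <- tolower(sub("\\(.*$", "", tsql_type))   # strip "(50)" etc.
  ok <- list(integer   = c("int", "bigint", "smallint", "tinyint"),
             numeric   = c("float", "numeric", "decimal", "int"),
             character = c("varchar", "nvarchar", "char", "nchar"),
             Date      = c("date"),
             POSIXct   = c("datetime", "datetime2"))
  base_type %in% ok[[r_class]]
}

compatible_type("character", "varchar(50)")  # TRUE
compatible_type("Date", "datetime")          # FALSE, as in the failure above
```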

tsql_chunk_loader()

tsql_chunk_loader() loads large datasets to Microsoft SQL Server in ‘chunks’. It is suitable for datasets too large for a single upload. For extremely large datasets, try using bcp.

Benefits include:

  • Loads large datasets in chunks, preventing memory and batch size issues
  • Reduces risk of network timeouts during data transfer
  • Attempts multiple uploads per chunk for error resilience
  • Provides progress tracking
  • Allows SQL data type specification and validation
  • Verifies complete data upload

 mydt = data.table(col1 = 1:10000L,  # create integer
                   col2 = 1:10000/3) # create float
 mydt[, col3 := as.Date(Sys.Date()) - col1] # create date
 mydt[, col4 := as.character(col3)] # create string

 myfieldtypes <- c(col1 = 'int', col2 = 'float', 
                   col3 = 'date', col4 = 'nvarchar(20)')

 tsql_chunk_loader(
   ph.data = mydt,
   db_conn = rads::validate_hhsaw_key(), # connect to Azure 16 hhs_analytics_workspace
   chunk_size = 3333,
   schema_name = Sys.getenv("USERNAME"),
   table_name = 'JustTesting',
   overwrite = TRUE,
   append = FALSE,
   field_types = myfieldtypes,
   validate_field_types = TRUE, # uses tsql_validate_field_types()
   validate_upload = TRUE
 )
Warning: `validate_hhsaw_key()` was deprecated in rads 1.5.2.
ℹ Please use `apde.data::authenticate_hhsaw()` instead.
ℹ 

🙂 Success! Your desired TSQL data types are suitable for your dataset.

2026-02-11 12:09:57.361791: Loading chunk 1 of 4: rows 1-3,333

2026-02-11 12:09:57.830526: Loading chunk 2 of 4: rows 3,334-6,666

2026-02-11 12:09:58.226583: Loading chunk 3 of 4: rows 6,667-9,999

2026-02-11 12:09:58.63213: Loading chunk 4 of 4: rows 10,000-10,000

🙂 🎉 🎊 🌈 
 Congratulations! All the rows in {ph.data} were successfully uploaded.
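
The chunk boundaries in the log above follow from simple index arithmetic. A sketch of how the rows might be partitioned (illustrative only; tsql_chunk_loader()'s internals may differ):

```r
# Partition 1:n_rows into chunks of at most chunk_size rows.
chunk_bounds <- function(n_rows, chunk_size) {
  starts <- seq(1, n_rows, by = chunk_size)
  data.frame(chunk = seq_along(starts),
             start = starts,
             end   = pmin(starts + chunk_size - 1, n_rows))
}

chunk_bounds(10000, 3333)
# 4 chunks: rows 1-3333, 3334-6666, 6667-9999, 10000-10000
```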

validate_hhsaw_key()

validate_hhsaw_key() validates HHSAW keys and opens a connection to HHSAW when possible.

 myConnection <- validate_hhsaw_key(hhsaw_key = 'hhsaw')
 print(myConnection)
<OdbcConnection> rbuie@kingcounty.gov@kcitazrhpasqlprp16
  Database: hhs_analytics_workspace
  Microsoft SQL Server Version: 12.00.1017

Conclusion

These utility functions provide tools for common data manipulation and analysis tasks. By leveraging these functions, you can streamline your workflow, improve code readability, and handle various data-related challenges more efficiently.

Remember to consult the individual function documentation for more detailed information on usage and parameters. Happy coding!

Updated February 11, 2026 (rads v1.5.3)
