utilities
The rads package includes a suite of small utility functions designed
to simplify common data manipulation tasks and accelerate analysis
workflows.
The core utility functions we’ll cover in this vignette are:

- calc_age(): Calculate age based on two dates
- convert_to_date(): Convert various formats to Date objects
- create_dictionary(): Generate a comprehensive data dictionary for a data.table or data.frame
- data_modeler(): Create uncorrelated data and code that recreates data for sharing publicly
- format_time(): Format numeric values into human-readable time chunks
- format_time_simple(): Format numeric values into a single time chunk
- generate_yaml(): Generate a YAML file for loading R data to SQL
- multi_t_test(): Perform t-tests for multiple comparisons with aggregate estimates
- round2(): Improved rounding function
- string_clean(): Clean string and factor columns
- std_error(): Calculate standard error of the mean
- substrRight(): Extract a substring from the right
- tsql_validate_field_types(): Validate R data against field types for SQL Server
- tsql_chunk_loader(): Load large datasets to SQL Server in chunks
- validate_hhsaw_key(): Validate HHSAW keys and, if possible, open a connection to HHSAW
Let’s explore each of these functions with examples to demonstrate their usage and benefits.
First, let’s load the necessary packages:
library(rads)
library(data.table)

The calc_age() function calculates age in years based on two dates. It has caught numerous mistakes in the ages recorded in DOH vital statistics data.
birth_date <- as.Date("1990-08-02")
current_date <- as.Date("2024-08-01")
simple_age <- year(current_date) - year(birth_date)
proper_age <- calc_age(from = birth_date, to = current_date)
head(data.table(birth_date, current_date, simple_age, proper_age))

| birth_date | current_date | simple_age | proper_age |
|---|---|---|---|
| 1990-08-02 | 2024-08-01 | 34 | 33 |
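The gap between simple_age and proper_age arises because simple year subtraction ignores whether the birthday has occurred yet. A base-R sketch of that logic (a hypothetical re-implementation for illustration, not the actual calc_age() source):

```r
# Full elapsed years between two dates, decrementing when this year's
# birthday has not yet occurred (hypothetical re-implementation)
age_in_years <- function(from, to) {
  yrs <- as.integer(format(to, "%Y")) - as.integer(format(from, "%Y"))
  # subtract one if the month-day of 'to' falls before the month-day of 'from'
  yrs - as.integer(format(to, "%m%d") < format(from, "%m%d"))
}
age_in_years(as.Date("1990-08-02"), as.Date("2024-08-01"))  # 33
```

This also handles leap-day birthdays correctly, matching the 23-year result for the 2000-02-29 row below.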
dates <- data.table(
birth = seq.Date(as.Date('2000-02-25'), as.Date('2000-03-02'), 1),
death = seq.Date(as.Date('2024-03-02'), as.Date('2024-02-25'), -1))
dates[, simple_age := round(as.integer(death - birth)/365.25)]
dates[, proper_age := calc_age(birth, death)]
head(dates)

| birth | death | simple_age | proper_age |
|---|---|---|---|
| 2000-02-25 | 2024-03-02 | 24 | 24 |
| 2000-02-26 | 2024-03-01 | 24 | 24 |
| 2000-02-27 | 2024-02-29 | 24 | 24 |
| 2000-02-28 | 2024-02-28 | 24 | 24 |
| 2000-02-29 | 2024-02-27 | 24 | 23 |
| 2000-03-01 | 2024-02-26 | 24 | 23 |
convert_to_date() attempts to convert various US-style date formats to R Date objects.
dates <- data.table(orig = c("2024-07-25", "07/26/2024",
"07-27-2024", "20240728",
"29 July, 2024", "July 30, 2024"))
dates[, rads_dates := convert_to_date(orig)]
dates[, simple_dates := as.Date(orig)]
head(dates)

| orig | rads_dates | simple_dates |
|---|---|---|
| 2024-07-25 | 2024-07-25 | 2024-07-25 |
| 07/26/2024 | 2024-07-26 | NA |
| 07-27-2024 | 2024-07-27 | NA |
| 20240728 | 57317-03-24 | NA |
| 29 July, 2024 | 2024-07-29 | NA |
| July 30, 2024 | 2024-07-30 | NA |
This function is particularly useful when dealing with inconsistent date formats within a single vector or column of your data.
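The general try-formats-in-turn approach can be sketched in a few lines of base R (illustrative only; the real convert_to_date() handles more formats, and as the table shows, all-numeric strings like "20240728" can still trip up naive parsing):

```r
# Hedged sketch: try several US-style formats in turn, keeping the first
# successful parse per element (the real convert_to_date() may differ)
parse_us_dates <- function(x) {
  fmts <- c("%Y-%m-%d", "%m/%d/%Y", "%m-%d-%Y", "%d %B, %Y", "%B %d, %Y")
  out <- structure(rep(NA_real_, length(x)), class = "Date")
  for (f in fmts) {
    idx <- which(is.na(out))            # only elements not yet parsed
    out[idx] <- as.Date(x[idx], format = f)
  }
  out
}
parse_us_dates(c("2024-07-25", "07/26/2024", "July 30, 2024"))
```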
The create_dictionary() function generates a comprehensive data
dictionary for a given data.table or data.frame. It provides details
about variable types, sample values, and can incorporate descriptions
and notes from a reference table. The function can be used to both
understand a new dataset and document the products of your ETL
processes.
Let’s create a sample dataset and generate a dictionary:
dt <- data.table(
name = c("Alicia", "Bianca", "Clarence", "Dwane", "Eve"),
gender = factor(c("F", "M", "M", "M", "F")),
age = c(25, 30, 35, 40, 45),
score = c(85.5, 92.0, 78.5, 88.0, 95.5)
)
dict <- create_dictionary(ph.data = dt, source = "sample_data")
head(dict, 10)

| source | varname | vartype | values | factor_labels | desc | notes | dict_updated |
|---|---|---|---|---|---|---|---|
| sample_data | name | character | Alicia | NA | NA | NA | 2026-02-11 |
| sample_data | name | character | Bianca | NA | NA | NA | 2026-02-11 |
| sample_data | name | character | Clarence | NA | NA | NA | 2026-02-11 |
| sample_data | name | character | Dwane | NA | NA | NA | 2026-02-11 |
| sample_data | name | character | Eve | NA | NA | NA | 2026-02-11 |
| sample_data | gender | factor | 1 | F | NA | NA | 2026-02-11 |
| sample_data | gender | factor | 2 | M | NA | NA | 2026-02-11 |
| sample_data | age | integer | 25 | NA | NA | NA | 2026-02-11 |
| sample_data | age | integer | 30 | NA | NA | NA | 2026-02-11 |
| sample_data | age | integer | 35 | NA | NA | NA | 2026-02-11 |
You can suppress sample values for sensitive columns:
dict_suppressed <- create_dictionary(ph.data = dt,
source = "sample_data",
suppress = c("name"))
head(dict_suppressed, 10)

| source | varname | vartype | values | factor_labels | desc | notes | dict_updated |
|---|---|---|---|---|---|---|---|
| sample_data | name | character | suppressed | NA | NA | NA | 2026-02-11 |
| sample_data | gender | factor | 1 | F | NA | NA | 2026-02-11 |
| sample_data | gender | factor | 2 | M | NA | NA | 2026-02-11 |
| sample_data | age | integer | 25 | NA | NA | NA | 2026-02-11 |
| sample_data | age | integer | 30 | NA | NA | NA | 2026-02-11 |
| sample_data | age | integer | 35 | NA | NA | NA | 2026-02-11 |
| sample_data | age | integer | 40 | NA | NA | NA | 2026-02-11 |
| sample_data | age | integer | 45 | NA | NA | NA | 2026-02-11 |
| sample_data | score | numeric | 78.5 | NA | NA | NA | 2026-02-11 |
| sample_data | score | numeric | 85.5 | NA | NA | NA | 2026-02-11 |
You can include additional descriptions and notes using a reference table:
ref_table <- data.table(
source = rep("sample_data", 4),
varname = c("name", "age", "gender", "score"),
desc = c("Full name", "Age in years", "Gender", "Test score"),
notes = c("May contain special characters",
"Rounded to nearest year", "Binary classification",
"Range: 0-100")
)
dict_with_ref <- create_dictionary(ph.data = dt,
source = "sample_data",
ph.ref = ref_table)
head(dict_with_ref, 8)

| source | varname | vartype | values | factor_labels | desc | notes | dict_updated |
|---|---|---|---|---|---|---|---|
| sample_data | name | character | Alicia | NA | Full name | May contain special characters | 2026-02-11 |
| sample_data | name | character | Bianca | NA | Full name | May contain special characters | 2026-02-11 |
| sample_data | name | character | Clarence | NA | Full name | May contain special characters | 2026-02-11 |
| sample_data | name | character | Dwane | NA | Full name | May contain special characters | 2026-02-11 |
| sample_data | name | character | Eve | NA | Full name | May contain special characters | 2026-02-11 |
| sample_data | gender | factor | 1 | F | Gender | Binary classification | 2026-02-11 |
| sample_data | gender | factor | 2 | M | Gender | Binary classification | 2026-02-11 |
| sample_data | age | integer | 25 | NA | Age in years | Rounded to nearest year | 2026-02-11 |
For datasets with many unique values, create_dictionary() provides summaries:
set.seed(98104)
large_dt <- data.table(
id = 1:1000,
category = sample(letters[1:10], 1000, replace = TRUE),
value = round2(rnorm(1000), 2)
)
large_dict <- create_dictionary(ph.data = large_dt,
source = "large_sample")
head(large_dict, 8)

| source | varname | vartype | values | factor_labels | desc | notes | dict_updated |
|---|---|---|---|---|---|---|---|
| large_sample | id | integer | min = 1, max = 1000 | NA | NA | NA | 2026-02-11 |
| large_sample | category | character | a | NA | NA | NA | 2026-02-11 |
| large_sample | category | character | b | NA | NA | NA | 2026-02-11 |
| large_sample | category | character | c | NA | NA | NA | 2026-02-11 |
| large_sample | category | character | d | NA | NA | NA | 2026-02-11 |
| large_sample | category | character | e | NA | NA | NA | 2026-02-11 |
| large_sample | category | character | … | NA | NA | NA | 2026-02-11 |
| large_sample | value | numeric | min = -3.33, max = 3.73 | NA | NA | NA | 2026-02-11 |
data_modeler() receives a data.table object and returns a data.table of similar structure and content, but with uncorrelated data across columns. For example, a column of dates will become a column of dates with a similar range and distribution, but randomly assigned across observations. Common use cases include:
- demonstrating a statistic or visualization without exposing the audience to “real” data or results
- sharing code that generates a data set for minimally viable examples used in requesting help from, or teaching, others
- creating data sets programmatically for (unit) testing or training
There are two kinds of output that data_modeler() returns, either the
data table itself, or a list object of strings that can be evaluated to
create the data table. In addition, data_modeler() can
output-to-terminal a human readable version of the generation code that
the user can copy and paste into their own scripts for sharing publicly.
data_modeler() makes some guesses as to whether a column’s content should
be randomized (such as person names or blood pressures) or kept but
randomly distributed (such as levels of response to a question or city
names). This is intended to generate random data or to retain original
categories where appropriate. It bases these guesses on the data type
(integer or string), and complexity (e.g. 75% being unique data), but
not the meaning of the data (i.e. it doesn’t know that a number is a
“zip code” and doesn’t know that a string is a “human name”). Variables
that are identified as non-categorical will, in most cases, be replaced
with a uniform distribution of noisy, but similarly shaped, data
(e.g. floating point numbers of the same range and precision, but in a
random uniform distribution, or a “lorem ipsum” string of similar
lengths). Variables identified as categorical will, in most cases, be
replaced with identical categories in a similar distribution to the
original, but randomly assigned across observations (a data set of 95%
males and 5% females would still generate a data set of roughly 95%
males and 5% females).
Two parameters adjust the assumptions data_modeler() uses to identify categorical values: “categorical_threshold” and “proportional_threshold”.
- “categorical_threshold”: the maximum number of unique categories a variable may contain before data_modeler() assumes it isn’t categorical. Currently defaults to 61.
- “proportional_threshold”: the maximum proportion of values that may be unique before data_modeler() assumes the variable isn’t categorical. Currently defaults to 0.75.
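The two thresholds combine roughly like this (an illustrative sketch of the heuristic described above, not the actual rads implementation):

```r
# Illustrative heuristic: treat a variable as categorical only if it has
# few enough unique values, both absolutely and proportionally
is_categorical <- function(x, categorical_threshold = 61,
                           proportional_threshold = 0.75) {
  u <- length(unique(x))
  u <= categorical_threshold && (u / length(x)) <= proportional_threshold
}
is_categorical(sample(letters[1:5], 100, replace = TRUE))  # TRUE: few levels
is_categorical(rnorm(100))  # FALSE: ~100 unique values out of 100
```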
data_modeler() creates an uncorrelated column of data to substitute
the original by randomly reassigning values across that column. However,
this does not ensure that meaningful information is not passed along.
Note that, in most cases, categorical observations are distributed in
proportion to the source data. So, if the provided data set is of
categorical outcomes in a single population, the uncorrelated result
would still show a similar rate of that outcome in that population
because all observations would be for the same population group. The
analyst should consider whether this is acceptable and, if not, correct it
by either changing the input data or modifying the synthetic data. A simple
fix in the example above would be to add one or more additional population
groups, so that the outcomes are distributed across groups and the rate for
the smaller population can no longer be estimated (in an uncorrelated
result, the rate of the categorical outcome would tend to be similar in all
populations).
For the examples below, we create a “real” data set and show how to use
data_modeler() and how the synthetic and “real” data differ.
Our demo “real” data set has one indicator variable representing an outcome that never occurs in the White population, sometimes occurs in the Asian population, and always occurs in the other race groups. In an uncorrelated (and so non-informative) synthetic data set, we would expect this indicator to no longer be biased by race.
DT.real.data <- data.table(
id = 1:1000,
chi_geo_kc = sample(c('King County',NA), 1000, replace = T),
chi_race_7 = factor(sample(c("Asian", "AIAN", "Black", "Hispanic", "NHPI", "White", "Other", "Multiple", NA), 1000, replace = T, prob = c(.19,.01,.07,.11,.01,.35,.07,.14,.02)), levels = c("Asian", "AIAN", "Black", "Hispanic", "NHPI", "White", "Other", "Multiple", NA)),
chi_sex = as.factor(sample(c("Male","Female"), 1000, replace = T)),
chi_geo_region = factor(sample(c("South", "North", "Seattle", "East"), 1000, replace = T), levels = c("South","North","Seattle","East")))
# Attach a biased indicator:
# indicator1: the White population is always "never", the Asian population
# always "sometimes", and everyone else always "always"
DT.real.data[, indicator1 := ifelse(chi_race_7 == "White", "never",
                             ifelse(chi_race_7 == "Asian", "sometimes",
                                    "always"))]
# Observe that the modeled data have the same structure as the original
DT.synthetic.data <- data_modeler(ph.data = DT.real.data, #our "real" data
number_of_observations = 100, #number of observations to create in the synthetic data
comments = TRUE, #whether or not to include comments in code (only matters for return_code == TRUE or print_code == TRUE)
return_code = FALSE, #return a list object of strings for creating the synthetic data instead of the data itself. leave FALSE to receive synthetic data
                                  print_code = FALSE) # print the generation code to the terminal; useful for review and for copying into other contexts, such as a shareable script or a chat with an LLM/AI

You can observe that the data structures are the same for both data frames:
```{r}
str(DT.real.data)
str(DT.synthetic.data)
```

Classes 'data.table' and 'data.frame': 1000 obs. of 6 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ chi_geo_kc : chr NA "King County" NA NA ...
$ chi_race_7 : Factor w/ 8 levels "Asian","AIAN",..: 8 NA 6 3 8 7 1 6 6 1 ...
$ chi_sex : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 2 1 2 1 1 ...
$ chi_geo_region: Factor w/ 4 levels "South","North",..: 2 2 2 2 2 2 1 2 2 4 ...
$ indicator1 : chr "always" NA "never" "always" ...
- attr(*, ".internal.selfref")=<externalptr>
Classes 'data.table' and 'data.frame': 100 obs. of 6 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ chi_geo_kc : chr "King County" "King County" "King County" "King County" ...
$ chi_race_7 : Factor w/ 8 levels "Asian","AIAN",..: 1 5 8 5 1 2 5 1 6 8 ...
$ chi_sex : Factor w/ 2 levels "Female","Male": 2 1 1 1 1 1 2 1 1 1 ...
$ chi_geo_region: Factor w/ 4 levels "South","North",..: 1 2 1 1 2 1 3 3 4 1 ...
$ indicator1 : chr "never" "always" NA "always" ...
- attr(*, ".internal.selfref")=<externalptr>
But we can see that the original bias (Whites in the “never” group, Asians in the “sometimes” group, and all others in the “always” group) no longer holds.
prop.table(table(DT.real.data[,.(indicator1,chi_race_7)]))

            chi_race_7
indicator1 Asian AIAN Black Hispanic NHPI
always 0.000000000 0.013251784 0.082568807 0.114169215 0.007135576
never 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
sometimes 0.191641182 0.000000000 0.000000000 0.000000000 0.000000000
chi_race_7
indicator1 White Other Multiple
always 0.000000000 0.071355759 0.161060143
never 0.358817533 0.000000000 0.000000000
sometimes 0.000000000 0.000000000 0.000000000
prop.table(table(DT.synthetic.data[,.(indicator1,chi_race_7)]))

            chi_race_7
indicator1 Asian AIAN Black Hispanic NHPI White
always 0.31666667 0.03333333 0.03333333 0.06666667 0.11666667 0.06666667
never 0.11666667 0.01666667 0.03333333 0.01666667 0.01666667 0.03333333
chi_race_7
indicator1 Other Multiple
always 0.00000000 0.13333333
never 0.00000000 0.00000000
Notice that you may see extreme, apparently biased, results in the outcomes. These are still randomly distributed, and you can watch them change if you recreate the synthetic data.
You can review, share, and execute the code used to create the synthetic data in two ways: manually, by printing it to the terminal and copying it into your R script, or programmatically, by receiving and processing the text object. The first is more useful for review and sharing with others; the latter is more useful for programmatic handling of the instruction set.
DT.synthetic.data <- data_modeler(ph.data = DT.real.data,
number_of_observations = 100,
comments = TRUE,
return_code = FALSE,
                               print_code = TRUE) # set to TRUE to copy code from your terminal output

DT <- data.table(
`id` = as.integer(1:100), # as an identifying (all unique) integer
`chi_geo_kc` = sample(c(NA, "King County"), 100, replace = TRUE, prob = c(0.507, 0.493)), # as categorical character (non factor)
`chi_race_7` = factor(sample(c("Multiple", NA, "White", "Black", "Other", "Asian", "Hispanic", "NHPI", "AIAN"), 100, replace = TRUE, prob = c(0.188, 0.013, 0.081, 0.112, 0.007, 0.352, 0.07, 0.158, 0.019)), levels = c("Asian", "AIAN", "Black", "Hispanic", "NHPI", "White", "Other", "Multiple"), ordered = FALSE), # as factor
`chi_sex` = factor(sample(c("Male", "Female"), 100, replace = TRUE, prob = c(0.493, 0.507)), levels = c("Female", "Male"), ordered = FALSE), # as factor
`chi_geo_region` = factor(sample(c("North", "South", "East", "Seattle"), 100, replace = TRUE, prob = c(0.233, 0.253, 0.256, 0.258)), levels = c("South", "North", "Seattle", "East"), ordered = FALSE), # as factor
`indicator1` = sample(c("always", NA, "never", "sometimes"), 100, replace = TRUE, prob = c(0.441, 0.352, 0.188, 0.019))) # as categorical character (non factor)
The code below is what was used to generate the synthetic data. With this
approach you can see the sampling approach and rates for each variable.
Because we ran data_modeler() with comments = TRUE, we also see a
human-readable string at the end of each line describing how each variable
was classified by data_modeler(). This lets you review the logic, make
changes, and share the code with others.
DT <- data.table(
`id` = as.integer(1:100), # as an identifying (all unique) integer
`chi_geo_kc` = sample(c(NA, "King County"), 100, replace = TRUE, prob = c(0.494, 0.506)), # as categorical character (non factor)
`chi_race_7` = factor(sample(c("Black", "White", "AIAN", NA, "Asian", "Multiple", "Other", "NHPI", "Hispanic"), 100, replace = TRUE, prob = c(0.193, 0.014, 0.067, 0.11, 0.012, 0.36, 0.064, 0.157, 0.023)), levels = c("Asian", "AIAN", "Black", "Hispanic", "NHPI", "White", "Other", "Multiple"), ordered = FALSE), # as factor
`chi_sex` = factor(sample(c("Male", "Female"), 100, replace = TRUE, prob = c(0.518, 0.482)), levels = c("Female", "Male"), ordered = FALSE), # as factor
`chi_geo_region` = factor(sample(c("Seattle", "South", "North", "East"), 100, replace = TRUE, prob = c(0.241, 0.229, 0.258, 0.272)), levels = c("South", "North", "Seattle", "East"), ordered = FALSE), # as factor
`indicator1` = sample(c("always", "never", NA, "sometimes"), 100, replace = TRUE, prob = c(0.424, 0.36, 0.193, 0.023))) # as categorical character (non factor)

Note that the above approach requires the user to copy and paste output from
their terminal. If you wish to check, modify, and extract the code
without human intervention, you need to receive the text as a list
object using return_code = TRUE:
DT.synthetic.data.instructions <- data_modeler(ph.data = DT.real.data,
number_of_observations = 100,
comments = FALSE,
return_code = TRUE, #setting this to "TRUE" will cause data_modeler() to return a list object of the text instead of a DT of synthetic data
print_code = FALSE)
head(DT.synthetic.data.instructions)

[[1]]
[1] "DT <- data.table("
[[2]]
[1] "`id` = as.integer(1:100),"
[[3]]
[1] "`chi_geo_kc` = sample(c(NA, \"King County\"), 100, replace = TRUE, prob = c(0.507, 0.493)),"
[[4]]
[1] "`chi_race_7` = factor(sample(c(\"Multiple\", NA, \"White\", \"Black\", \"Other\", \"Asian\", \"Hispanic\", \"NHPI\", \"AIAN\"), 100, replace = TRUE, prob = c(0.188, 0.013, 0.081, 0.112, 0.007, 0.352, 0.07, 0.158, 0.019)), levels = c(\"Asian\", \"AIAN\", \"Black\", \"Hispanic\", \"NHPI\", \"White\", \"Other\", \"Multiple\"), ordered = FALSE),"
[[5]]
[1] "`chi_sex` = factor(sample(c(\"Male\", \"Female\"), 100, replace = TRUE, prob = c(0.493, 0.507)), levels = c(\"Female\", \"Male\"), ordered = FALSE),"
[[6]]
[1] "`chi_geo_region` = factor(sample(c(\"North\", \"South\", \"East\", \"Seattle\"), 100, replace = TRUE, prob = c(0.233, 0.253, 0.256, 0.258)), levels = c(\"South\", \"North\", \"Seattle\", \"East\"), ordered = FALSE),"
After processing the text as desired, you can execute it, creating the desired synthetic data object, as follows:
## Manually convert received code into data
codeText <- paste(unlist(DT.synthetic.data.instructions), collapse = "\n")
DT.synthetic.data.manual <- eval(parse(text = codeText))

Now you can visually confirm that the class structure of each of the tables is the same:
str(DT.real.data)

Classes 'data.table' and 'data.frame': 1000 obs. of 6 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ chi_geo_kc : chr NA "King County" NA NA ...
$ chi_race_7 : Factor w/ 8 levels "Asian","AIAN",..: 8 NA 6 3 8 7 1 6 6 1 ...
$ chi_sex : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 2 1 2 1 1 ...
$ chi_geo_region: Factor w/ 4 levels "South","North",..: 2 2 2 2 2 2 1 2 2 4 ...
$ indicator1 : chr "always" NA "never" "always" ...
- attr(*, ".internal.selfref")=<externalptr>
str(DT.synthetic.data)

Classes 'data.table' and 'data.frame': 100 obs. of 6 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ chi_geo_kc : chr "King County" NA NA "King County" ...
$ chi_race_7 : Factor w/ 8 levels "Asian","AIAN",..: 8 5 1 1 6 5 1 1 1 3 ...
$ chi_sex : Factor w/ 2 levels "Female","Male": 1 2 1 2 2 1 2 2 1 2 ...
$ chi_geo_region: Factor w/ 4 levels "South","North",..: 4 1 4 2 3 1 4 1 3 3 ...
$ indicator1 : chr NA NA "always" "always" ...
- attr(*, ".internal.selfref")=<externalptr>
str(DT.synthetic.data.manual)

Classes 'data.table' and 'data.frame': 100 obs. of 6 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ chi_geo_kc : chr NA "King County" "King County" "King County" ...
$ chi_race_7 : Factor w/ 8 levels "Asian","AIAN",..: 3 1 1 5 8 1 1 1 8 8 ...
$ chi_sex : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 1 2 2 2 2 ...
$ chi_geo_region: Factor w/ 4 levels "South","North",..: 3 1 4 1 1 3 4 2 2 3 ...
$ indicator1 : chr "always" "never" NA "always" ...
- attr(*, ".internal.selfref")=<externalptr>
sapply(DT.real.data, class)

            id     chi_geo_kc     chi_race_7        chi_sex chi_geo_region
"integer" "character" "factor" "factor" "factor"
indicator1
"character"
sapply(DT.synthetic.data, class)

            id     chi_geo_kc     chi_race_7        chi_sex chi_geo_region
"integer" "character" "factor" "factor" "factor"
indicator1
"character"
sapply(DT.synthetic.data.manual, class)

            id     chi_geo_kc     chi_race_7        chi_sex chi_geo_region
"integer" "character" "factor" "factor" "factor"
indicator1
"character"
Finally, each time data_modeler() is executed, there is randomness in how the data are created. If you wish to create identical data sets, be sure to set a seed before executing:
DT.synthetic.data.random <- data_modeler(ph.data = DT.real.data,
number_of_observations = 100,
comments = TRUE,
return_code = FALSE,
print_code = FALSE)
set.seed(1000)
DT.synthetic.data.s1000a <- data_modeler(ph.data = DT.real.data,
number_of_observations = 100,
comments = TRUE,
return_code = FALSE,
print_code = FALSE)
set.seed(1000)
DT.synthetic.data.s1000b <- data_modeler(ph.data = DT.real.data,
number_of_observations = 100,
comments = TRUE,
return_code = FALSE,
                                  print_code = FALSE)

Comparing a table generated with a random seed to one with a fixed seed, we expect the resulting tables to differ:
identical(DT.synthetic.data.random, DT.synthetic.data.s1000a)

[1] FALSE
When using the same seed for distinct rounds, we would expect the resulting tables to be identical:
identical(DT.synthetic.data.s1000a, DT.synthetic.data.s1000b)

[1] TRUE
format_time() and format_time_simple() format numeric values (including dates) into human-readable time chunks. They are often used for summarizing years and can help identify unexpected gaps.
years <- c(2001:2003,
2005,
2008,
2011:2012)
dates <- as.Date(c("2024-01-01", "2024-01-02", "2024-01-03",
"2024-01-06",
"2024-01-09", "2024-01-10", "2024-01-11", "2024-01-12"))
dt <- data.table(desc = c('format_time', 'format_time_simple', '',
'format_time', 'format_time_simple'),
data = c(format_time(years), format_time_simple(years), '',
format_time(dates), format_time_simple(dates)))
head(dt)

| desc | data |
|---|---|
| format_time | 2001-2003, 2005, 2008, 2011-2012 |
| format_time_simple | 2001-2012 |
| format_time | 2024-01-01 to 2024-01-03, 2024-01-06, 2024-01-09 to 2024-01-12 |
| format_time_simple | 2024-01-01 to 2024-01-12 |
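A simplified base-R sketch of the year-collapsing logic (format_time() itself also handles dates and other details):

```r
# Collapse a sorted vector of years into "a-b" chunks wherever the values
# are consecutive, leaving isolated years as-is
collapse_years <- function(y) {
  grp <- cumsum(c(1, diff(y) != 1))   # start a new group whenever the gap > 1
  chunks <- vapply(split(y, grp), function(g) {
    if (length(g) == 1) as.character(g) else paste0(min(g), "-", max(g))
  }, character(1))
  paste(chunks, collapse = ", ")
}
collapse_years(c(2001:2003, 2005, 2008, 2011:2012))
# "2001-2003, 2005, 2008, 2011-2012"
```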
generate_yaml() generates a YAML file for SQL loading based on a
data.frame or data.table.
data <- data.table(
id = 1:5,
name = c("Alicia", "Bianca", "Clarence", "Dwane", "Eve"),
age = c(25.1, 30.2, 35.3, 40.4, 45.5)
)
yaml_output <- generate_yaml(data, schema = "dbo", table = "example_table", datasource = 'WADOH_CHS')
print(yaml_output)

$datasource
[1] "WADOH_CHS"
$schema
[1] "dbo"
$table
[1] "example_table"
$vars
$vars$id
[1] "INT"
$vars$name
[1] "NVARCHAR(18)"
$vars$age
[1] "NUMERIC(38,5)"
The output can be used to specify the field_types for the
tsql_validate_field_types() and tsql_chunk_loader() functions
described below.
field_types <- unlist(yaml_output$vars)
print(field_types)

            id           name            age
"INT" "NVARCHAR(18)" "NUMERIC(38,5)"
The multi_t_test() function performs t-tests comparing multiple groups
against a reference group using means, standard errors, and the sample
size (when available).
Let’s create a sample dataset and use multi_t_test() to compare
birthweights across different maternal age groups:
# Sample data
age_groups <- c("18-24", "25-29", "30-34", "35-39", "40+")
birthweight_means <- c(3150, 3450, 3400, 3250, 3100) # in grams
birthweight_ses <- c(50, 45, 40, 55, 60)
sample_sizes <- c(500, 800, 750, 400, 200)
reference_group <- 3 # comparing all groups to the 30-34 age group
# Perform multi_t_test
birthweight_comparison <- multi_t_test(means = birthweight_means,
ses = birthweight_ses,
reference_index = reference_group,
n = sample_sizes,
alternative = 'two.sided',
df_method = "estimated",
alpha = 0.05)
# Add age group labels to the results
birthweight_comparison[, Age_Group := age_groups]
head(birthweight_comparison)

| comparison | diff_means | ci_lower | ci_upper | p.value | significant | t.statistic | df | df_method | Age_Group |
|---|---|---|---|---|---|---|---|---|---|
| Group 1 vs Reference | -250 | -375.64316 | -124.3568 | 0.0001005 | TRUE | -3.9043440 | 1054.3852 | estimated | 18-24 |
| Group 2 vs Reference | 50 | -68.09846 | 168.0985 | 0.4064107 | FALSE | 0.8304548 | 1536.8996 | estimated | 25-29 |
| Group 3 - Referent | 0 | NA | NA | NA | NA | NA | NA | estimated | 30-34 |
| Group 4 vs Reference | -150 | -283.49100 | -16.5090 | 0.0276885 | TRUE | -2.2056439 | 811.7334 | estimated | 35-39 |
| Group 5 vs Reference | -300 | -441.76996 | -158.2300 | 0.0000390 | TRUE | -4.1602515 | 394.4939 | estimated | 40+ |
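You can verify the first row by hand: a Welch t-test computed directly in base R from the same summary statistics reproduces the Group 1 ("18-24") results above.

```r
# Group 1 ("18-24") vs the reference ("30-34"), from means and SEs alone
m1 <- 3150; se1 <- 50; n1 <- 500   # 18-24
m0 <- 3400; se0 <- 40; n0 <- 750   # 30-34 (reference)
diff_means <- m1 - m0                       # -250
se_diff    <- sqrt(se1^2 + se0^2)           # SE of the difference in means
t_stat     <- diff_means / se_diff          # -3.9043
# Welch-Satterthwaite degrees of freedom computed from the standard errors
df <- se_diff^4 / (se1^4 / (n1 - 1) + se0^4 / (n0 - 1))  # 1054.385
p  <- 2 * pt(-abs(t_stat), df)              # two-sided p-value, ~0.0001
```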
multi_t_test() offers several notable features:

- Flexible degrees of freedom calculation: the function offers four methods for calculating degrees of freedom:
  - “estimated” (default): Uses the Welch–Satterthwaite equation
  - “conservative”: Uses the minimum possible degrees of freedom (df = 2); lower Type I error (false +) and higher Type II error (false -)
  - “moderate”: Uses the number of groups minus 1 (df = k - 1)
  - “liberal”: Assumes infinite degrees of freedom (df = Inf); higher Type I error (false +) and lower Type II error (false -)
- Sample size estimation: If sample sizes are not provided, the function can estimate them based on the distribution of mean values.
- Adjustment for multiple comparisons (optional): If desired, it will adjust p-values for multiple comparisons using one of two methods (Holm-Bonferroni or Benjamini-Hochberg).
- Comprehensive output: The function returns a data.table containing comparison results, including the difference in means, confidence intervals, p-values, and significance indicators.
round2() rounds numbers the way we learned in elementary school: if
the digit to the right of the rounding position is 5 or greater, round
up; otherwise, round down. Note! Base R’s round function
follows a different logic (IEC 60559 / IEEE 754 standards), which
specifies rounding to the nearest even number.
set.seed(98104)
dt <- data.table(orig = seq(0.5, 5.5, 1))
dt[, round2 := round2(orig)]
dt[, baseR := round(orig)]
dt[round2 != baseR, different := '*']
head(dt)

| orig | round2 | baseR | different |
|---|---|---|---|
| 0.5 | 1 | 0 | * |
| 1.5 | 2 | 2 | NA |
| 2.5 | 3 | 2 | * |
| 3.5 | 4 | 4 | NA |
| 4.5 | 5 | 4 | * |
| 5.5 | 6 | 6 | NA |
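The half-up rule can be sketched in a couple of lines of base R (an illustration of the behavior, not the actual round2() source code):

```r
# Round half away from zero, the "elementary school" rule
round_half_up <- function(x, digits = 0) {
  scale <- 10^digits
  sign(x) * floor(abs(x) * scale + 0.5) / scale
}
round_half_up(2.5)   # 3
round_half_up(-2.5)  # -3
round(2.5)           # 2 (base R rounds halves to the nearest even number)
```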
Like the base R round function, round2 can also be used to round large numbers to convey uncertainty in estimates:
# round to hundreds
round2(123456, -2)
[1] 123500
# round to thousands
round2(123456, -3)
[1] 123000
string_clean() cleans string and factor columns in a data.frame or
data.table. It eliminates random white spaces and pseudo-white spaces
and sets blanks to true NA.
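A rough base-R equivalent of these cleaning steps (illustration only; string_clean() operates on whole tables and handles more kinds of pseudo-whitespace):

```r
# Trim whitespace, normalize non-breaking spaces, and convert blanks to NA
clean_strings <- function(x) {
  x <- gsub("\u00a0", " ", x, fixed = TRUE)   # non-breaking space -> space
  x <- trimws(x)                              # strip leading/trailing spaces
  x[!is.na(x) & x == ""] <- NA_character_     # blanks become true NA
  x
}
clean_strings(c(" John ", "", "E "))  # "John" NA "E"
```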
# create table
dt <- data.table(
name = c(" John ", "Jane Doe", "Bob Smith", " ", " Edward Jenner"),
category = c("A ", " B", "", "D", "E ")
)
print(dt$category)

[1] "A "  " B"  ""    "D"   "E "

head(dt)

| name | category |
|---|---|
| John | A |
| Jane Doe | B |
| Bob Smith | |
| | D |
| Edward Jenner | E |
# clean table
string_clean(dt)
print(dt$category)

[1] "A" "B" NA  "D" "E"

head(dt)

| name | category |
|---|---|
| John | A |
| Jane Doe | B |
| Bob Smith | NA |
| NA | D |
| Edward Jenner | E |
std_error() calculates the standard error of the mean for a vector, the
columns of a table, or a list. R does not have a built-in standard error
of the mean function, so we created one for rads.
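The computation is the sample standard deviation divided by the square root of the number of non-missing observations. A minimal base-R version (assuming std_error() drops NAs similarly; check its documentation for the exact behavior):

```r
# Standard error of the mean, ignoring missing values
sem <- function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))
sem(c(2, 4, 4, 4, 5, 5, 7, 9))  # 0.7559289
```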
set.seed(98104)
x <- rnorm(1000)
se <- std_error(x)
print(paste("Standard Error:", round2(se, 5)))

[1] "Standard Error: 0.03207"
set.seed(98104)
dt <- data.table(x = rnorm(1000), y = rnorm(1000))
se <- std_error(dt)
se <- round2(se, 5)
print(se)

      x       y
0.03207 0.03235
set.seed(98104)
mylist <- list(x = rnorm(1000), y = rnorm(1000))
se <- std_error(mylist) # returns a list with the standard errors
se <- lapply(se, round2, n = 5)
print(se)

$x
[1] 0.03207
$y
[1] 0.03235
substrRight() extracts a substring from the right side of a string. It
complements the base R function substr.
text <- "Hello, World!"
substr_right <- substrRight(text, 2, 6)
print(substr_right)

[1] "World"

This function is useful when you need to extract the right-hand portions of strings, e.g., the last 4 digits of a phone number.
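A plausible base-R sketch of right-anchored extraction (an assumption about the interface: that substrRight(x, start, stop) counts character positions from the right-hand end, consistent with the example above):

```r
# Extract characters counted from the right: position 1 is the last character
substr_right <- function(x, start, stop) {
  n <- nchar(x)
  substr(x, n - stop + 1, n - start + 1)
}
substr_right("Hello, World!", 2, 6)  # "World"
substr_right("555-867-5309", 1, 4)   # "5309" -- last 4 digits
```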
tsql_validate_field_types() validates whether a named vector of TSQL
data types is compatible with a data.table. This function is useful when
you have a yaml file specifying the TSQL field types to be used when
uploading your data to a server.
dt <- data.table(
id = 1:5,
name = c("Alicia", "Bianca", "Clarence", "Dwane", "Eve"),
age = c(25, 30, 35, 40, 45),
time = Sys.time()
)
field_types <- c(id = "int",
name = "varchar(50)",
age = "int",
time = 'datetime')
validation_result <- tsql_validate_field_types(dt, field_types)

🙂 Success! Your desired TSQL data types are suitable for your dataset.

head(validation_result)

| colname | R_type | tsql_type | is_valid | issue |
|---|---|---|---|---|
| age | numeric | int | TRUE | NA |
| id | integer | int | TRUE | NA |
| name | character | varchar | TRUE | NA |
| time | POSIXct | datetime | TRUE | NA |
dt <- data.table(
id = 1:5,
name = c("Alicia", "Bianca", "Clarence", "Dwane", "Eve"),
age = as.character(c(25, 30, 35, 40, 45)),
time = Sys.Date()
)
field_types <- c(id = "int",
name = "varchar(50)",
age = "int", # should generate an error
time = 'datetime') # should generate an error
validation_result <- tsql_validate_field_types(dt, field_types)

Error in `tsql_validate_field_types()`:
!
🛑👿 The following columns in your dataset did not align with the proposed TSQL field types:
column: age, R Type: character, TSQL Type: int, issue: Incompatible types
column: time, R Type: Date, TSQL Type: datetime, issue: Incompatible types
tsql_chunk_loader() loads large datasets to Microsoft SQL Server in
‘chunks’. It is suitable for datasets that are too large for a single
upload. For extreme sizes, try using bcp.
Benefits include:
- Loads large datasets in chunks, preventing memory and batch size issues
- Reduces risk of network timeouts during data transfer
- Attempts multiple uploads per chunk for error resilience
- Provides progress tracking
- Allows SQL data type specification and validation
- Verifies complete data upload
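The chunk boundaries are simple to compute. A base-R illustration for the 10,000-row example below (the real function also retries failed chunks and validates the upload):

```r
# Row ranges for uploading 10,000 rows in chunks of 3,333
n_rows <- 10000; chunk_size <- 3333
starts <- seq(1, n_rows, by = chunk_size)
ends   <- pmin(starts + chunk_size - 1, n_rows)
data.frame(chunk = seq_along(starts), start = starts, end = ends)
# chunks: 1-3333, 3334-6666, 6667-9999, 10000-10000
```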
mydt = data.table(col1 = 1:10000L, # create integer
col2 = 1:10000/3) # create float
mydt[, col3 := as.Date(Sys.Date()) - col1] # create date
mydt[, col4 := as.character(col3)] # create string
myfieldtypes <- c(col1 = 'int', col2 = 'float',
col3 = 'date', col4 = 'nvarchar(20)')
tsql_chunk_loader(
ph.data = mydt,
db_conn = rads::validate_hhsaw_key(), # connect to Azure 16 hhs_analytics_workspace
chunk_size = 3333,
schema_name = Sys.getenv("USERNAME"),
table_name = 'JustTesting',
overwrite = TRUE,
append = FALSE,
field_types = myfieldtypes,
validate_field_types = TRUE, # uses tsql_validate_field_types()
validate_upload = TRUE
)

Warning: `validate_hhsaw_key()` was deprecated in rads 1.5.2.
ℹ Please use `apde.data::authenticate_hhsaw()` instead.
ℹ
🙂 Success! Your desired TSQL data types are suitable for your dataset.
2026-02-11 12:09:57.361791: Loading chunk 1 of 4: rows 1-3,333
2026-02-11 12:09:57.830526: Loading chunk 2 of 4: rows 3,334-6,666
2026-02-11 12:09:58.226583: Loading chunk 3 of 4: rows 6,667-9,999
2026-02-11 12:09:58.63213: Loading chunk 4 of 4: rows 10,000-10,000
🙂 🎉 🎊 🌈
Congratulations! All the rows in {ph.data} were successfully uploaded.
validate_hhsaw_key() validates HHSAW keys and opens a connection
to HHSAW when possible.
myConnection <- validate_hhsaw_key(hhsaw_key = 'hhsaw')
print(myConnection)

<OdbcConnection> rbuie@kingcounty.gov@kcitazrhpasqlprp16
Database: hhs_analytics_workspace
Microsoft SQL Server Version: 12.00.1017
These utility functions provide tools for common data manipulation and analysis tasks. By leveraging these functions, you can streamline your workflow, improve code readability, and handle various data-related challenges more efficiently.
Remember to consult the individual function documentation for more detailed information on usage and parameters. Happy coding!
– Updated February 11, 2026 (rads v1.5.3)