-
Notifications
You must be signed in to change notification settings - Fork 1
calc
calc() is the analytic workhorse of rads. It provides a standardized method for obtaining most of what we usually want to calculate: means, medians, counts, confidence intervals, standard errors, relative standard errors (RSE), numerators, denominators, the number missing, and the proportion missing. calc() can be used with record data (e.g., vital statistics, census, enrollment numbers, etc.) as well as survey data (e.g., BRFSS, ACS PUMS, etc.).
calc() provides a unified interface built on top of common R packages, including data.table, which can make it faster than implementing custom functions with base R or dplyr in some cases. This makes it a convenient and efficient option, allowing APDE staff to use consistent syntax across different data sources. While all calc() functionality could be achieved with other packages directly, the primary benefit is standardization and ease of use—even if some specialized packages might occasionally offer more efficient implementations for specific analyses.
This vignette will provide some examples to introduce the calc() function by walking through basic analyses with vital statistics (birth data) and survey data (ACS PUMS). To get the most out of this vignette, we recommend that you type each and every bit of code into R. Doing so will almost definitely help you learn the syntax much faster than just reading the vignette or copying and pasting the code.
Arguments are the values that we send to a function when it is called. The standard arguments for calc() are:
-
ph.data<- the name of the data.table/data.frame or survey object that you want to analyze. -
what<- a character vector of the variable(s) for which you want to calculate the metrics. E.g.,what = c("uninsured") -
where<- think of this as a SQLWHEREclause, i.e., a filtering or subsetting of data. E.g.,ages %in% c(0:17) & gender == "Female" -
by<- a character vector of the variables that you want to computer thewhatby, i.e., the cross-tab variable(s). E.g.,by = c("gender")would stratify results by gender -
metrics<- a character vector of the metrics that you want returned. E.g.,metrics = c("mean", "rse"). You can see a complete list of available metrics by typingmetrics()and get detailed descriptions of what each metric means by typing?metrics(). -
per<- an integer, which is the denominator whenrateis selected as a metric. Metrics will be multiplied by this value. E.g.,per = 1000. NOTE this is just a scalar. At present this does not calculate a true rate (i.e., the relevant population denominator is not used). -
win<- an integer, which is the number of consecutive units of time (e.g., years, months, etc.) over which the metrics will be calculated, i.e., the 'window' for a rolling average, sum, etc. E.g.win = 5will perform calculations over every 5 time unit window. -
time_var<- a character, which is the name of the time variable in the dataset. Used in combination with thewinargument to generate time windowed calculations. -
fancy_time<- a logical (i.e.,TorF) flag. If TRUE, a record of all the years going into the data is returned. If FALSE, just a simple range (where certain years within the range might not be represented in your data). -
proportion<- a logical (i.e.,TorF) flag determining whether the metrics should be calculated as a proportion. Currently only relevant for survey data. The default is FALSE. -
ci<- a numeric value between 0 & 1 for the confidence level returned in the estimates. E.g., 0.95 == 95% confidence interval -
verbose<- a logical used to toggle on/off printed warnings
There is no need to specify all of the arguments listed above. As you can see in the following example, the calc function simply needs a specified dataset (i.e., ph.data) and at least one what variable to return an intelligible result.
library(rads)
library(data.table)
data(mtcars)
calc(ph.data = mtcars, what = c("mpg"))[]| variable | mean | numerator | denominator | level | mean_se | mean_lower | mean_upper |
|---|---|---|---|---|---|---|---|
| mpg | 20.09062 | 642.9 | 32 | NA | 1.065424 | 18.00243 | 22.17882 |
Note: The use of [] after calc() is used to print the output to the console. Typically, you would not print the results but would save them as an object. E.g., my.est <- calc().
First get the birth data (cf. get_data vignette)
birth <- get_data_birth(cols = c("chi_year", "chi_sex", "chi_race_eth8",
"preterm", "birth_weight_grams", "mother_birthplace_state"),
year = c(2013:2019),
kingco = T)Mean (proportion) for a binary over multiple years
calc(ph.data = birth,
what = c("preterm"),
metrics = c("mean", "rse", "numerator", "denominator"),
time_var = "chi_year")[]| chi_year | variable | mean | numerator | denominator | level | mean_se | mean_lower | mean_upper | rse |
|---|---|---|---|---|---|---|---|---|---|
| 2013-2019 | preterm | 0.0905233 | 15806 | 174607 | NA | 0.0006867 | 0.0891774 | 0.0918691 | 0.7585532 |
Mean (proportion) for a binary over multiple years -- where newborn is male |
calc(ph.data = birth,
what = c("preterm"),
where = chi_sex == "Male",
metrics = c("mean", "rse", "numerator", "denominator"),
time_var = "chi_year")[]| chi_year | variable | mean | numerator | denominator | level | mean_se | mean_lower | mean_upper | rse |
|---|---|---|---|---|---|---|---|---|---|
| 2013-2019 | preterm | 0.097169 | 8694 | 89473 | NA | 0.0009902 | 0.0952282 | 0.0991097 | 1.019051 |
Mean (proportion) for a binary over multiple years -- where newborn is male born to a Hispanic mother
calc(ph.data = birth,
what = c("preterm"),
where = chi_sex == "Male" & chi_race_eth8 == "Hispanic",
metrics = c("mean", "rse", "numerator", "denominator"),
time_var = "chi_year")[]| chi_year | variable | mean | numerator | denominator | level | mean_se | mean_lower | mean_upper | rse |
|---|---|---|---|---|---|---|---|---|---|
| 2013-2019 | preterm | 0.1105628 | 1277 | 11550 | NA | 0.002918 | 0.1048435 | 0.116282 | 2.639253 |
Mean (proportion) for a binary over individual years
birth[, cy := chi_year]
calc(ph.data = birth,
what = c("preterm"),
metrics = c("mean", "rse", "numerator", "denominator"),
time_var = "chi_year",
by = "cy")[]| cy | chi_year | variable | mean | numerator | denominator | level | mean_se | mean_lower | mean_upper | rse |
|---|---|---|---|---|---|---|---|---|---|---|
| 2013 | 2013 | preterm | 0.0925978 | 2293 | 24763 | NA | 0.0018421 | 0.0889874 | 0.0962082 | 1.989329 |
| 2014 | 2014 | preterm | 0.0884756 | 2231 | 25216 | NA | 0.0017884 | 0.0849704 | 0.0919808 | 2.021357 |
| 2015 | 2015 | preterm | 0.0882979 | 2241 | 25380 | NA | 0.0017810 | 0.0848072 | 0.0917886 | 2.017038 |
| 2016 | 2016 | preterm | 0.0895684 | 2320 | 25902 | NA | 0.0017744 | 0.0860907 | 0.0930461 | 1.981016 |
| 2017 | 2017 | preterm | 0.0912052 | 2296 | 25174 | NA | 0.0018146 | 0.0876487 | 0.0947617 | 1.989553 |
| 2018 | 2018 | preterm | 0.0894487 | 2166 | 24215 | NA | 0.0018340 | 0.0858541 | 0.0930433 | 2.050369 |
| 2019 | 2019 | preterm | 0.0942939 | 2259 | 23957 | NA | 0.0018881 | 0.0905933 | 0.0979946 | 2.002371 |
Mean (proportion) for a binary over windowed years
calc(ph.data = birth,
what = c("preterm"),
metrics = c("mean", "rse", "numerator", "denominator"),
time_var = "chi_year",
win = 3)[]| chi_year | variable | mean | numerator | denominator | level | mean_se | mean_lower | mean_upper | rse |
|---|---|---|---|---|---|---|---|---|---|
| 2013-2015 | preterm | 0.0897703 | 6765 | 75359 | NA | 0.0010413 | 0.0877294 | 0.0918112 | 1.159964 |
| 2014-2016 | preterm | 0.0887866 | 6792 | 76498 | NA | 0.0010284 | 0.0867710 | 0.0908023 | 1.158281 |
| 2015-2017 | preterm | 0.0896856 | 6857 | 76456 | NA | 0.0010334 | 0.0876602 | 0.0917109 | 1.152210 |
| 2016-2018 | preterm | 0.0900772 | 6782 | 75291 | NA | 0.0010434 | 0.0880322 | 0.0921221 | 1.158314 |
| 2017-2019 | preterm | 0.0916342 | 6721 | 73346 | NA | 0.0010653 | 0.0895462 | 0.0937221 | 1.162563 |
Mean for a continuous over windowed years
calc(ph.data = birth,
what = c("birth_weight_grams"),
metrics = c("mean", "rse"),
time_var = "chi_year",
win = 3)[]| chi_year | variable | mean | level | mean_se | mean_lower | mean_upper | rse |
|---|---|---|---|---|---|---|---|
| 2013-2015 | birth_weight_grams | 3339.469 | NA | 2.122024 | 3335.310 | 3343.628 | 0.0635437 |
| 2014-2016 | birth_weight_grams | 3336.555 | NA | 2.087116 | 3332.465 | 3340.646 | 0.0625530 |
| 2015-2017 | birth_weight_grams | 3333.507 | NA | 2.068735 | 3329.452 | 3337.562 | 0.0620588 |
| 2016-2018 | birth_weight_grams | 3327.129 | NA | 2.073889 | 3323.065 | 3331.194 | 0.0623327 |
| 2017-2019 | birth_weight_grams | 3320.961 | NA | 2.088449 | 3316.868 | 3325.055 | 0.0628869 |
Mean for a continuous in 2019, by gender and race/eth
calc(ph.data = birth,
what = c("birth_weight_grams"),
chi_year == 2019,
metrics = c("mean", "rse"),
by = c("chi_race_eth8", "chi_sex"))[]| chi_race_eth8 | chi_sex | variable | mean | level | mean_se | mean_lower | mean_upper | rse |
|---|---|---|---|---|---|---|---|---|
| AIAN | Female | birth_weight_grams | 3251.965 | NA | 79.070604 | 3096.989 | 3406.940 | 2.4314716 |
| AIAN | Male | birth_weight_grams | 3438.641 | NA | 84.306018 | 3273.404 | 3603.877 | 2.4517252 |
| Asian | Female | birth_weight_grams | 3145.902 | NA | 9.419157 | 3127.441 | 3164.363 | 0.2994104 |
| Asian | Male | birth_weight_grams | 3231.952 | NA | 9.784437 | 3212.775 | 3251.129 | 0.3027408 |
| Black | Female | birth_weight_grams | 3196.872 | NA | 18.683442 | 3160.253 | 3233.491 | 0.5844288 |
| Black | Male | birth_weight_grams | 3306.213 | NA | 19.794051 | 3267.418 | 3345.009 | 0.5986925 |
| Hispanic | Female | birth_weight_grams | 3270.074 | NA | 13.703834 | 3243.215 | 3296.933 | 0.4190681 |
| Hispanic | Male | birth_weight_grams | 3301.784 | NA | 14.310447 | 3273.736 | 3329.832 | 0.4334156 |
| Multiple | Female | birth_weight_grams | 3251.312 | NA | 24.365976 | 3203.556 | 3299.069 | 0.7494198 |
| Multiple | Male | birth_weight_grams | 3369.178 | NA | 23.359494 | 3323.394 | 3414.962 | 0.6933292 |
| NHPI | Female | birth_weight_grams | 3287.541 | NA | 49.322668 | 3190.871 | 3384.212 | 1.5002905 |
| NHPI | Male | birth_weight_grams | 3444.178 | NA | 43.324387 | 3359.264 | 3529.092 | 1.2579021 |
| Other | Female | birth_weight_grams | 3223.599 | NA | 30.670381 | 3163.487 | 3283.712 | 0.9514328 |
| Other | Male | birth_weight_grams | 3315.557 | NA | 33.357232 | 3250.178 | 3380.936 | 1.0060823 |
| White | NA | birth_weight_grams | 1873.500 | NA | 1321.500000 | -14917.750 | 18664.750 | 70.5364291 |
| White | Female | birth_weight_grams | 3340.295 | NA | 7.280580 | 3326.026 | 3354.565 | 0.2179622 |
| White | Male | birth_weight_grams | 3460.298 | NA | 7.489711 | 3445.618 | 3474.977 | 0.2164470 |
Proportion of 2017-2019 births among each race/eth group
calc(ph.data = birth,
what = c("chi_race_eth8"),
chi_year %in% 2017:2019,
metrics = c("mean", "rse", "obs", "numerator", "denominator"))[]| level | variable | denominator | obs | mean | mean_se | mean_lower | mean_upper | numerator | rse |
|---|---|---|---|---|---|---|---|---|---|
| AIAN | chi_race_eth8 | 73701 | 73701 | 0.0053052 | 0.0002676 | 0.0048059 | 0.0058561 | 391 | 5.0438189 |
| Asian | chi_race_eth8 | 73701 | 73701 | 0.2291149 | 0.0015481 | 0.2260950 | 0.2321631 | 16886 | 0.6756696 |
| Black | chi_race_eth8 | 73701 | 73701 | 0.0903651 | 0.0010561 | 0.0883165 | 0.0924564 | 6660 | 1.1686901 |
| Hispanic | chi_race_eth8 | 73701 | 73701 | 0.1300796 | 0.0012391 | 0.1276703 | 0.1325275 | 9587 | 0.9525797 |
| Multiple | chi_race_eth8 | 73701 | 73701 | 0.0420347 | 0.0007392 | 0.0406097 | 0.0435075 | 3098 | 1.7584788 |
| NHPI | chi_race_eth8 | 73701 | 73701 | 0.0162956 | 0.0004664 | 0.0154064 | 0.0172352 | 1201 | 2.8619613 |
| Other | chi_race_eth8 | 73701 | 73701 | 0.0186700 | 0.0004986 | 0.0177176 | 0.0196726 | 1376 | 2.6705534 |
| White | chi_race_eth8 | 73701 | 73701 | 0.4681348 | 0.0018380 | 0.4645341 | 0.4717388 | 34502 | 0.3926283 |
2017-2019 rate per 100k of births among each race/eth group
calc(ph.data = birth,
what = c("chi_race_eth8"),
chi_year %in% 2017:2019,
metrics = c("obs", "numerator", "denominator", "rate"),
per = 100000)[]| level | variable | denominator | obs | numerator | rate | rate_se | rate_lower | rate_upper | rate_per |
|---|---|---|---|---|---|---|---|---|---|
| AIAN | chi_race_eth8 | 73701 | 73701 | 391 | 530.522 | 26.75857 | 480.5929 | 585.6077 | 1e+05 |
| Asian | chi_race_eth8 | 73701 | 73701 | 16886 | 22911.494 | 154.80599 | 22609.4981 | 23216.3131 | 1e+05 |
| Black | chi_race_eth8 | 73701 | 73701 | 6660 | 9036.512 | 105.60883 | 8831.6537 | 9245.6411 | 1e+05 |
| Hispanic | chi_race_eth8 | 73701 | 73701 | 9587 | 13007.965 | 123.91123 | 12767.0314 | 13252.7538 | 1e+05 |
| Multiple | chi_race_eth8 | 73701 | 73701 | 3098 | 4203.471 | 73.91714 | 4060.9678 | 4350.7475 | 1e+05 |
| NHPI | chi_race_eth8 | 73701 | 73701 | 1201 | 1629.557 | 46.63730 | 1540.6391 | 1723.5175 | 1e+05 |
| Other | chi_race_eth8 | 73701 | 73701 | 1376 | 1867.003 | 49.85932 | 1771.7604 | 1967.2633 | 1e+05 |
| White | chi_race_eth8 | 73701 | 73701 | 34502 | 46813.476 | 183.80296 | 46453.4068 | 47173.8775 | 1e+05 |
Number and proportion of missing gender by year
calc(ph.data = birth,
what = c("chi_sex"),
chi_year %in% 2017:2019,
metrics = c("obs", "missing", "missing.prop"),
by = "chi_year")[]| chi_year | variable | obs | missing | missing.prop | level |
|---|---|---|---|---|---|
| 2017 | chi_sex | 25274 | 0 | 0.0e+00 | Female |
| 2017 | chi_sex | 25274 | 0 | 0.0e+00 | Male |
| 2018 | chi_sex | 24337 | 0 | 0.0e+00 | Female |
| 2018 | chi_sex | 24337 | 0 | 0.0e+00 | Male |
| 2019 | chi_sex | 24090 | 2 | 8.3e-05 | NA |
| 2019 | chi_sex | 24090 | 2 | 8.3e-05 | Female |
| 2019 | chi_sex | 24090 | 2 | 8.3e-05 | Male |
Survey data must be 'set' to use the proper survey design. However, survey data accessed through rads::get_data() functions have already been set as data.table/dtsurvey objects and are ready for analysis with calc().
pums <- get_data_pums(year = 2021, kingco = F, records = 'person')## Your data was survey set with the following parameters is ready for rads::calc():
## - record type = person
## - valid years = 2021
## - replicate weights = pwgtp, pwgtp1 ... pwgtp9
Let's get on the same page re: the meaning of the columns from calc()
First, create a simple tabulation of the population of Seattle after sub-setting PUMS to King County.
test1 <- calc(ph.data = pums,
what = "nonseattle",
metrics = c('mean', 'numerator', 'denominator', 'obs', 'total'),
where = chi_geo_kc == "King County")
print(test1)| level | variable | denominator | obs | mean | mean_se | mean_lower | mean_upper | total | total_se | total_lower | total_upper | numerator |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NonSeattle | nonseattle | 21506 | 21506 | 0.6742587 | 0.0002685 | 0.6737242 | 0.6747932 | 1518730 | 539.2718 | 1517656.6 | 1519803.4 | 14962 |
| Seattle | nonseattle | 21506 | 21506 | 0.3257413 | 0.0002685 | 0.3252068 | 0.3262758 | 733714 | 783.3842 | 732154.7 | 735273.3 | 6544 |
-
mean: The proportion of the total population (i.e., King County) that lives in Seattle -
total: The survey weighted population Seattle and NonSeattle residents in King County -
numerator: The number of rows with people living in Seattle, compare tonrow(pums[nonseattle == 'Seattle]) -
denominator: The number of rows in the the dataset where the 'what' variable can be assessed because it is not missing, compare tonrow(pums[!is.na(nonseattle)]) -
obs: The number of rows in the dataset after filtering by the 'where' argument.
To highlight the difference between denominator and obs, in this next example we introduce 100 missing values to our 'what' variable (nonseattle), thereby lowering the denominator by 100 but leaving the obs the same as above.
pums2 <- copy(pums)
pums2 <- pums2[nonseattle == 'Seattle', nonseattle := ifelse(rowid(nonseattle) <= 100, NA, nonseattle)]
test2 <- calc(ph.data = pums2,
what = "nonseattle",
metrics = c('mean', 'numerator', 'denominator', 'obs', 'total'),
where = chi_geo_kc == "King County")
print(test2)| level | variable | denominator | obs | mean | mean_se | mean_lower | mean_upper | total | total_se | total_lower | total_upper | numerator |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NonSeattle | nonseattle | 21406 | 21506 | 0.6754666 | 0.0002912 | 0.6748871 | 0.6760462 | 1518730 | 539.2718 | 1517656.6 | 1519803.4 | 14962 |
| Seattle | nonseattle | 21406 | 21506 | 0.3245334 | 0.0002912 | 0.3239538 | 0.3251129 | 729686 | 876.1572 | 727942.1 | 731429.9 | 6444 |
As expected, the obs value is the same as in table test1 above (21,506), but the denominator value decreased by 100 to 21,406.
Mean (proportion) of those near poverty or disabled, by King County (vs. remainder of WA)
calc(ph.data = pums,
what = c("disability", "GEpov200"),
metrics = c("mean", "rse", "obs", "numerator", "denominator"),
proportion = T,
by = "chi_geo_kc")[]| chi_geo_kc | level | variable | denominator | obs | mean | mean_se | mean_lower | mean_upper | numerator | rse |
|---|---|---|---|---|---|---|---|---|---|---|
| NA | With a disability | disability | 57022 | 57022 | 0.1474778 | 0.0018915 | 0.1437525 | 0.1512827 | 9374 | 1.2825730 |
| NA | Without a disability | disability | 57022 | 57022 | 0.8525222 | 0.0018915 | 0.8487173 | 0.8562475 | 47648 | 0.2218723 |
| King County | With a disability | disability | 21506 | 21506 | 0.0966941 | 0.0024221 | 0.0919792 | 0.1016236 | 2404 | 2.5048703 |
| King County | Without a disability | disability | 21506 | 21506 | 0.9033059 | 0.0024221 | 0.8983764 | 0.9080208 | 19102 | 0.2681330 |
| NA | NA | GEpov200 | 55189 | 57022 | 0.7536707 | 0.0035095 | 0.7466186 | 0.7605894 | 42256 | 0.4656592 |
| King County | NA | GEpov200 | 20903 | 21506 | 0.8172007 | 0.0050456 | 0.8069427 | 0.8270304 | 17450 | 0.6174295 |
Proportion in each CHNA age group, by disability status
calc(ph.data = pums,
what = c("age6"),
metrics = c("mean", "rse", "obs", "numerator", "denominator"),
proportion = F,
by = "disability")[]| disability | level | variable | denominator | obs | mean | mean_se | mean_lower | mean_upper | numerator | rse |
|---|---|---|---|---|---|---|---|---|---|---|
| Without a disability | 18-24 | age6 | 66750 | 66750 | 0.0900336 | 0.0006471 | 0.0887455 | 0.0913217 | 5545 | 0.7187710 |
| Without a disability | 25-44 | age6 | 66750 | 66750 | 0.3084901 | 0.0011385 | 0.3062239 | 0.3107563 | 18734 | 0.3690608 |
| Without a disability | 45-64 | age6 | 66750 | 66750 | 0.2398575 | 0.0009444 | 0.2379776 | 0.2417373 | 16968 | 0.3937522 |
| Without a disability | 65-74 | age6 | 66750 | 66750 | 0.0886149 | 0.0006455 | 0.0873300 | 0.0898998 | 7519 | 0.7284496 |
| Without a disability | 75+ | age6 | 66750 | 66750 | 0.0348890 | 0.0005355 | 0.0338231 | 0.0359550 | 3197 | 1.5349388 |
| Without a disability | <18 | age6 | 66750 | 66750 | 0.2381149 | 0.0006944 | 0.2367327 | 0.2394971 | 14787 | 0.2916310 |
| With a disability | 18-24 | age6 | 11778 | 11778 | 0.0602630 | 0.0025721 | 0.0551434 | 0.0653827 | 610 | 4.2681494 |
| With a disability | 25-44 | age6 | 11778 | 11778 | 0.1871306 | 0.0056229 | 0.1759384 | 0.1983228 | 1760 | 3.0048262 |
| With a disability | 45-64 | age6 | 11778 | 11778 | 0.2668707 | 0.0053801 | 0.2561619 | 0.2775796 | 2959 | 2.0160022 |
| With a disability | 65-74 | age6 | 11778 | 11778 | 0.1850895 | 0.0040475 | 0.1770332 | 0.1931457 | 2390 | 2.1867618 |
| With a disability | 75+ | age6 | 11778 | 11778 | 0.2273166 | 0.0033096 | 0.2207290 | 0.2339043 | 3342 | 1.4559469 |
| With a disability | <18 | age6 | 11778 | 11778 | 0.0733296 | 0.0037293 | 0.0659066 | 0.0807525 | 717 | 5.0856353 |
Mean & median age in King County, by disability status
calc(ph.data = pums,
what = c("agep"),
chi_geo_kc == "King County",
metrics = c("mean", "median", "rse", "obs", "numerator", "denominator"),
by = "disability")[]## disability variable mean median numerator denominator obs level mean_se mean_lower mean_upper rse
## <fctr> <char> <num> <num> <int> <int> <int> <lgcl> <num> <num> <num> <num>
## 1: Without a disability agep 36.57783 38 727379 19102 19102 NA 0.06941677 36.43966 36.71600 0.1897783
## 2: With a disability agep 56.32190 64 142658 2404 2404 NA 0.66249953 55.00323 57.64057 1.1762734
In the examples above we used standard public health data (birth vital statistics and the Census Bureau's ACS survey). However, since calc() is a generalized function, you can use it with nearly any dataset, as long as it is a data.frame, a data.table, or a survey object. To demonstrate this, we will use calc() with synthetic data that we will generate below.
Create the dataset
library(data.table)
set.seed(98121)
mydt <- data.table(
school = as.factor(sample(c("Alpha", "Beta", "Gamma", "Delta"), 2000, replace = T)),
grades = as.factor(sample(c("A", "B", "C", "D"), 2000, replace = T)),
year = sample(2016:2021, 2000, replace = T))
head(mydt)| school | grades | year |
|---|---|---|
| Alpha | A | 2017 |
| Delta | B | 2019 |
| Gamma | C | 2017 |
| Beta | A | 2016 |
| Delta | C | 2018 |
| Alpha | D | 2019 |
We see that we created a dataset of 2000 rows with grades in four schools between with 2016 and 2021.
Calculate the proportion of A's and B's in the Alpha and Beta schools
grades.distribution <- calc(
ph.data = mydt,
school %in% c("Alpha", "Beta"),
what = "grades",
by = "school",
time_var = "year",
metrics = c("numerator", "denominator", "mean"), proportion = F)
print(grades.distribution[level %in% c("A", "B")])| school | level | year | variable | denominator | mean | mean_se | mean_lower | mean_upper | numerator |
|---|---|---|---|---|---|---|---|---|---|
| Alpha | A | 2016-2021 | grades | 497 | 0.2857143 | 0.0202844 | 0.2477598 | 0.3269560 | 142 |
| Alpha | B | 2016-2021 | grades | 497 | 0.2555332 | 0.0195842 | 0.2191639 | 0.2956526 | 127 |
| Beta | A | 2016-2021 | grades | 491 | 0.2484725 | 0.0195215 | 0.2123012 | 0.2885490 | 122 |
| Beta | B | 2016-2021 | grades | 491 | 0.2566191 | 0.0197311 | 0.2199795 | 0.2970375 | 126 |
These results show that the Alpha School has a higher proportion of A's (0.286 vs 0.248) and a similar proportion of B's (0.256 vs 0.257).
Wait! We forgot the survey weights!
A colleague just reminded you that you forgot to add the survey weights. Let's add weights and survey set the data.
# create weights
set.seed(98121)
mydt[, mywghts := sample(50:1300, 2000, replace = T)]
# survey set the data
# This uses the dtsurvey package
# similar logic applies for survey and srvyr package
mydt[, `_id` := NULL] # remove id to make things play nice
mysvy <-dtsurvey::dtsurvey(data.table(mydt), weight = 'mywghts')Using the survey information, again calculate the proportion of A's and B's in the Alpha and Beta schools
grades.distribution2 <- calc(
ph.data = mysvy,
school %in% c("Alpha", "Beta"),
what = "grades",
by = "school",
time_var = "year",
metrics = c("numerator", "denominator", "mean"), proportion = FALSE)
print(grades.distribution2[level %in% c("A", "B")])| school | level | year | variable | denominator | mean | mean_se | mean_lower | mean_upper | numerator |
|---|---|---|---|---|---|---|---|---|---|
| Alpha | A | 2016-2021 | grades | 497 | 0.2819952 | 0.0228060 | 0.2371869 | 0.3268036 | 142 |
| Alpha | B | 2016-2021 | grades | 497 | 0.2460211 | 0.0215780 | 0.2036254 | 0.2884167 | 127 |
| Beta | A | 2016-2021 | grades | 491 | 0.2361366 | 0.0214262 | 0.1940380 | 0.2782351 | 122 |
| Beta | B | 2016-2021 | grades | 491 | 0.2685865 | 0.0229556 | 0.2234831 | 0.3136900 | 126 |
You'll note that using the survey design caused small changes in the results. For example, the proportion of A's in the Alpha school changed from 0.286 to 0.282.
There may be a few instances where a variable you want to compute some metrics using (multiple) imputed data. calc can work with these sorts of situations.
The following code creates a fake survey of 100 people and then creates 10 different iterations of the survey with differing values of v (to represent the imputed variable).
# Create some fake data
library('data.table')
base = data.table::data.table(id = 1:100, bin = sample(0:1, 100, T), psu = sample(1:3, 100, T), weight = runif(100, 0,2))
base = dtsurvey::dtsurvey(base, psu = 'psu', weight = 'weight')
midat = lapply(1:10, function(i){
r = data.table::copy(base)[, v := sample(1:3, 100, T)]
})To use the imputation version of calc, the ph.data argument must be an imputationList object. You can turn your list of plausible datasets into an imputationList via the mitools::imputationList function. You may need to install the mitools package for this to work.
midat = mitools::imputationList(midat)
class(midat)## [1] "imputationList"
Now that midat is the right type/class of object, you can pass it to the calc function as you would any other dataset. The slight exception is that the metrics argument must be explicitly specified. For fun, a version of the results from one imputation is also computed.
Results using MI combining methods
withmi = calc(ph.data =midat, what = 'bin', by = 'v', metrics = 'mean', proportion = T)
print(withmi)| v | variable | mean | mean_se | mean_lower | mean_upper |
|---|---|---|---|---|---|
| 1 | bin | 0.5051381 | 0.1390337 | 0.2071490 | 0.8031272 |
| 2 | bin | 0.5369530 | 0.1380729 | 0.2437939 | 0.8301120 |
| 3 | bin | 0.5218571 | 0.1491685 | 0.2154041 | 0.8283100 |
Results of one iteration
nomi = calc(ph.data = midat$imputations[[1]], what = 'bin', by = 'v')
print(nomi)| v | variable | mean | mean_se | mean_lower | mean_upper |
|---|---|---|---|---|---|
| 1 | bin | 0.4273156 | 0.0143969 | 0.3653709 | 0.4892602 |
| 2 | bin | 0.6505122 | 0.0025634 | 0.6394829 | 0.6615416 |
| 3 | bin | 0.4560479 | 0.0659792 | 0.1721622 | 0.7399336 |
Some additional notes:
-
If there is no variation (or imputation) in the
whatorbyvariables, you should get the same result as doing just one iteration. -
The
proportionargument is ignored whenph.datais animputationListobject. This is because theproportionargument governs what methods are used to create confidence intervals – and the special proportion methods are not currently implemented for imputed data estimates. As such, you may find confidence intervals that are outside the normal [0-1] constraints.
You've been introduced to calc(), but you'll only become competent at using it by using it. Try it out with your favorite dataset. Play with it. Try to break it and, if you're successful, submit a GitHub issue. Enjoy!
-- `Updated April 16, 2025 (rads v1.3.5)