brfss_functions
⚠️ DEPRECATED: The functions in this vignette have been migrated to apde.data. Please use that package instead.
The Behavioral Risk Factor Surveillance System (BRFSS) is a gold mine of public health data – but like any mine, you need the right tools to extract the value. Since BRFSS is a complex survey, analyses need to account for the survey design and weights to get accurate results. The survey design includes stratification and weighting to ensure the sample represents the full population, accounting for who was more or less likely to be included in the survey. When we want to analyze multiple years together (which we often do to increase precision), we need to adjust those survey weights to avoid overestimating our population.
This vignette will show you how to easily work with King County BRFSS data while properly handling all these survey design considerations. Don’t worry - the functions do the heavy lifting for you! We’ll cover everything from finding available variables to getting properly weighted estimates.
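To see why pooled weights need rescaling, consider a toy example (the numbers below are hypothetical, not real BRFSS weights): each year's weights sum to roughly that year's population, so naively summing weights across pooled years would represent several populations stacked on top of each other.

```r
# Toy example with hypothetical weights; each year's weights sum to
# that year's (made-up) population total.
wts_2021 <- c(500, 700)  # sums to 1200
wts_2022 <- c(600, 650)  # sums to 1250
wts_2023 <- c(640, 660)  # sums to 1300

# Naively pooling three years triples the "population" being represented
sum(c(wts_2021, wts_2022, wts_2023))      # 3750

# One simple rescaling divides by the number of pooled years, yielding
# an "average" population that sits between the single-year totals
sum(c(wts_2021, wts_2022, wts_2023) / 3)  # 1250
```

The functions described in this vignette handle this kind of rescaling for you automatically.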
Note that the BRFSS ETL process has its own repository and questions regarding the data should be directed to the data steward.
library(rads)
library(data.table)

One quirk of BRFSS data is that not all questions are asked every year.
Before diving into analysis, it’s helpful to check which variables are
available for your time period of interest. The list_dataset_columns()
function makes this easy.
Since the Washington State and King County datasets are distinct, you
can specify which one you want with the kingco argument. When
kingco = TRUE, you will receive the list of columns in the King County
dataset. When kingco = FALSE, you will receive the list of columns in
the Washington State dataset. The default is kingco = TRUE.
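For example, you can request both column lists and compare them directly (a sketch; it assumes the returned table has the var.names column shown in the output below):

```r
# Sketch: request the Washington State column list with kingco = FALSE
# and compare it against the King County list (kingco = TRUE by default)
vars_kc <- list_dataset_columns("brfss", year = 2023)
vars_wa <- list_dataset_columns("brfss", year = 2023, kingco = FALSE)

# Variables available in King County but not Washington State
setdiff(vars_kc$var.names, vars_wa$var.names)
```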
vars_2023 <- list_dataset_columns("brfss", year = 2023)
head(vars_2023)
nrow(vars_2023)

| var.names | year(s) |
|---|---|
| addepev3 | 2023 |
| age | 2023 |
| age_f | 2023 |
| age_m | 2023 |
| age5_v1 | 2023 |
| age5_v2 | 2023 |
[1] 209
# Check variables across multiple years
vars_2019_2023 <- list_dataset_columns("brfss", year = 2019:2023)
head(vars_2019_2023)
nrow(vars_2019_2023)

| var.names | year(s) |
|---|---|
| aceindx1 | 2019-2021 |
| aceindx2 | 2019-2021 |
| acescor1 | 2019-2021 |
| acescor2 | 2019-2021 |
| addepev3 | 2019-2023 |
| age | 2019-2023 |
[1] 299
Notice that the year(s) column is not constant because BRFSS does not
ask every question in every year.
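Because the returned table has var.names and year(s) columns, you can filter it to find variables asked throughout your period of interest (a sketch; the `year(s)` column name is taken from the output above):

```r
# Sketch: keep only variables available across the full 2019-2023 span
vars_all_years <- vars_2019_2023[`year(s)` == "2019-2023"]
head(vars_all_years)
```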
There are two equivalent ways to get BRFSS data: using
get_data('brfss') or get_data_brfss(). Both functions will:
- Load the data you request into memory
- Automatically adjust weights if you’re analyzing multiple years
- Survey-set the data so it’s ready for analysis
- Return a dtsurvey object, a data.table-friendly survey object that can be analyzed with rads::calc()
As with list_dataset_columns(), by default you will receive King
County data. You can specify Washington State data with the
kingco = FALSE argument.
Let’s see both methods in action:
This is the general interface that you can use to access any of APDE’s analytic-ready data.
brfss_full <- get_data(
dataset = "brfss",
cols = c("chi_year", "age", "race4", "chi_sex", "prediab1"),
year = 2019:2023
)

Your data was survey set with the following parameters and is ready for rads::calc():
- valid years = 2019-2023
- original survey weight = `finalwt1`
- adjusted survey weight = `default_wt`
- strata = `x_ststr`
brfss_full_alt <- get_data_brfss(
cols = c("chi_year", "age", "race4", "chi_sex", "prediab1"),
year = 2019:2023
)

Your data was survey set with the following parameters and is ready for rads::calc():
- valid years = 2019-2023
- original survey weight = `finalwt1`
- adjusted survey weight = `default_wt`
- strata = `x_ststr`
Both methods return an identical dtsurvey object that’s ready for analysis with calc().
Notice that the functions provide an informative message regarding the survey object parameters. These will be hidden in the examples below, but are always produced when getting or survey setting BRFSS data.
Since BRFSS weights are designed to represent the population, we can verify that our multi-year weight adjustments are working properly by comparing population sizes. We expect the adjusted weights to represent an “average” population that falls between the earliest and latest years’ populations since King County’s population has been growing. Let’s verify our weight adjustments are working as expected:
pop_2019 <- sum(brfss_full[chi_year == 2019]$finalwt1)
pop_2023 <- sum(brfss_full[chi_year == 2023]$finalwt1)
pop_adjusted <- sum(brfss_full$default_wt)

pop_2023 > pop_adjusted & pop_adjusted > pop_2019

[1] TRUE
BRFSS data presents a unique challenge when analyzing Health Reporting Areas (HRAs) because it comes with ZIP codes rather than HRA assignments. Since ZIP codes don’t perfectly align with HRA boundaries, we need to account for this uncertainty in our analyses.
To handle this, we use a statistical technique called multiple
imputation. When you request HRA-related columns (hra20_id,
hra20_name, or chi_geo_region), the function returns an
imputationList
object containing 10 different versions of the data. Each version
represents a different possible way that ZIP codes could be assigned to
HRAs based on their overlap. This approach allows us to capture the
uncertainty in our geographic assignments and incorporate it into our
statistical estimates.
Note: APDE decided to use 10 imputations based on an extensive empirical assessment to balance between statistical accuracy and computational efficiency. This is fixed in the ETL process and is not configurable.
brfss_hra <- get_data_brfss(
cols = c("chi_year", "age", "race4", "chi_sex", "prediab1", "obese", "hra20_name"),
year = 2019:2023
)

inherits(brfss_hra, "imputationList") &
  length(brfss_hra$imputations) == 10 &
  inherits(brfss_hra$imputations[[1]], "dtsurvey")

[1] TRUE
Don’t worry if this seems complex - the calc() function automatically handles these imputationList objects.
There are times when you might need to modify BRFSS data. For example, you might want to create a new variable. Before making any modifications, first consider whether your changes should be standardized. If you’re creating variables that will be used across multiple projects (CHI, CHNA, Communities Count, etc.) or repeatedly year after year, contact the BRFSS ETL steward and politely request the addition of these changes to the analytic ready dataset.
For truly custom analyses, your modification approach will depend on whether you’re working with a simple dtsurvey object or an imputationList. Let’s look at each case:
Modifying a dtsurvey
You can modify a dtsurvey object using data.table commands without disrupting its survey settings. If you use dplyr commands, you may break the internals of the dtsurvey and would be wise to survey set it again following the instructions in the “Survey Setting and Creating Custom Weights” section below.
Regardless of whether you use data.table or dplyr commands, you are encouraged to create new variables as needed rather than overwriting and deleting existing ones.
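As a minimal sketch of a data.table modification (the age_group variable name is hypothetical), you could add a new column to the brfss_full object from earlier without disturbing its survey settings:

```r
# Create a new variable rather than overwriting an existing one;
# data.table's := modifies the dtsurvey in place and preserves the
# survey design settings
brfss_full[, age_group := fifelse(age < 65, "under 65", "65+")]
```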
Modifying an imputationList
When working with HRA or region data, modifications become more complex since we need to maintain consistency across all 10 imputed datasets. Here’s a step-by-step example to guide you through the process:
1. Get a BRFSS imputationList (by requesting HRA or region columns)
brfss <- get_data_brfss(
cols = c("age", "hra20_id"),
year = 2019:2023
)

2. Convert it to a regular dtsurvey/data.table
brfss <- as_table_brfss(brfss)

Successfully converted an imputationList to a single dtsurvey/data.table.
Remember to use as_imputed_brfss() after making modifications.
3. Make your modifications

brfss[, age_category := fifelse(age < 67, 'working age', 'retirement age')]

4. Convert back to an imputationList
brfss <- as_imputed_brfss(brfss)

Successfully created an imputationList with 10 imputed datasets.
Data is now ready for analysis with rads::calc().
You might need to use pool_brfss_weights() in two scenarios:
- When analyzing specific years where certain questions were asked
- When you need to restore proper survey settings after using non-data.table commands for data manipulation
While get_data_brfss() automatically creates weights and survey sets
imported data, you can create new weights and re-survey set the data
using pool_brfss_weights(). Here are brief argument descriptions, see
the pool_brfss_weights() help file for details:
- ph.data: Your BRFSS dataset (can be a data.frame, data.table, dtsurvey, or imputationList)
- years: Vector of years you want to analyze together
- year_var: Name of the year column (defaults to ‘chi_year’)
- old_wt_var: Name of the original weight variable (defaults to ‘finalwt1’)
- new_wt_var: Name for your new weight variable
- wt_method: Method used to rescale your weights; options include ‘obs’, ‘pop’, and ‘simple’ (defaults to ‘obs’)
- strata: Name of the strata variable (defaults to ‘x_ststr’)
Let’s see it in action:
brfss_odd_years <- pool_brfss_weights(
ph.data = brfss_full,
years = c(2019, 2021),
new_wt_var = "odd_year_wt" # Name for the new weight variable
)

pop_2019 <- sum(brfss_odd_years[chi_year == 2019]$finalwt1)
pop_2021 <- sum(brfss_odd_years[chi_year == 2021]$finalwt1)
pop_2019_2021 <- sum(brfss_odd_years$odd_year_wt)

pop_2019 < pop_2019_2021 & pop_2019_2021 < pop_2021

[1] TRUE
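The example above uses the default wt_method = 'obs'. Switching methods only requires changing one argument (a sketch; the new_wt_var name below is hypothetical, and the help file describes how ‘obs’, ‘pop’, and ‘simple’ differ):

```r
# Sketch: rescale the same pooled years with the 'pop' method instead
# of the default 'obs'
brfss_pop_wt <- pool_brfss_weights(
  ph.data = brfss_full,
  years = c(2019, 2021),
  new_wt_var = "odd_year_wt_pop",
  wt_method = "pop"
)
```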
Analyzing BRFSS Data with calc()
Now for the fun part - analyzing our data! The
calc() function
handles all the survey design considerations for us. Let’s look at some
examples:
Calculate prediabetes prevalence by sex and race (using a dtsurvey object)
prediab_by_group <- calc(
ph.data = brfss_full,
what = "prediab1",
by = c("chi_sex", "race4"),
metrics = c("mean", "rse"),
proportion = TRUE # Since prediab is binary
)
head(prediab_by_group)

| chi_sex | race4 | variable | mean | level | mean_se | mean_lower | mean_upper | rse |
|---|---|---|---|---|---|---|---|---|
| Male | NA | prediab1 | 0.0503843 | NA | 0.0248165 | 0.0185461 | 0.1296589 | 49.25433 |
| Male | AIAN | prediab1 | 0.2284831 | NA | 0.1021132 | 0.0839714 | 0.4889464 | 44.69182 |
| Male | Black | prediab1 | 0.1187814 | NA | 0.0235966 | 0.0796585 | 0.1734967 | 19.86554 |
| Male | Asian | prediab1 | 0.1548570 | NA | 0.0167577 | 0.1247503 | 0.1906468 | 10.82137 |
| Male | NHPI | prediab1 | 0.2287657 | NA | 0.1063999 | 0.0810248 | 0.4994791 | 46.51043 |
| Male | Hispanic | prediab1 | 0.1363339 | NA | 0.0177889 | 0.1050209 | 0.1751559 | 13.04807 |
Calculate prediabetes prevalence by HRA20 (using an imputationList)
prediab_by_hra20 <- calc(
ph.data = brfss_hra,
what = "prediab1",
by = c("hra20_name"),
metrics = c("mean", "rse"),
proportion = TRUE
)
head(prediab_by_hra20)

| hra20_name | variable | level | mean | mean_se | mean_lower | mean_upper | rse |
|---|---|---|---|---|---|---|---|
| Auburn - North | prediab1 | NA | 0.1097051 | 0.0314104 | 0.0473378 | 0.1720724 | 23.87216 |
| Auburn - South | prediab1 | NA | 0.1084990 | 0.0449626 | 0.0181626 | 0.1988354 | 31.81374 |
| Bear Creek and Greater Sammamish | prediab1 | NA | 0.1392452 | 0.0439130 | 0.0511826 | 0.2273078 | 23.56366 |
| Bellevue - Central | prediab1 | NA | 0.1382798 | 0.0480590 | 0.0419377 | 0.2346218 | 26.44273 |
| Bellevue - Northeast | prediab1 | NA | 0.1117511 | 0.0400927 | 0.0318859 | 0.1916163 | 28.59449 |
| Bellevue - South | prediab1 | NA | 0.1624832 | 0.0359512 | 0.0919890 | 0.2329773 | 21.48270 |
As noted in the calc() wiki, when working with an imputationList, the proportion argument is ignored. However, we include it here to maintain consistent calc() usage regardless of whether you’re working with a dtsurvey object or an imputationList.
Calculate prediabetes & obesity prevalence by HRA20 & sex (using an imputationList)
We will do this in two parts since only one value of what can be specified when ph.data is an imputationList.
prediab_obese_hra20_sex <- rbind(
calc(
ph.data = brfss_hra,
what = c("prediab1"),
by = c("hra20_name", "chi_sex"),
metrics = c("mean", "rse"),
proportion = TRUE
),
calc(
ph.data = brfss_hra,
what = c("obese"),
by = c("hra20_name", "chi_sex"),
metrics = c("mean", "rse"),
proportion = TRUE
)
)
head(prediab_obese_hra20_sex)

| hra20_name | chi_sex | variable | level | mean | mean_se | mean_lower | mean_upper | rse |
|---|---|---|---|---|---|---|---|---|
| Auburn - North | Male | prediab1 | NA | 0.0908839 | 0.0404166 | 0.0101633 | 0.1716045 | 36.37217 |
| Auburn - North | Female | prediab1 | NA | 0.1275012 | 0.0521343 | 0.0230377 | 0.2319647 | 31.66348 |
| Auburn - South | Male | prediab1 | NA | 0.0890185 | 0.0542901 | -0.0206931 | 0.1987302 | 46.01217 |
| Auburn - South | Female | prediab1 | NA | 0.1292806 | 0.0749252 | -0.0219830 | 0.2805441 | 43.67390 |
| Bear Creek and Greater Sammamish | Male | prediab1 | NA | 0.1473026 | 0.0460883 | 0.0566849 | 0.2379204 | 28.97528 |
| Bear Creek and Greater Sammamish | Female | prediab1 | NA | 0.1317566 | 0.0646126 | 0.0020176 | 0.2614955 | 34.99476 |
Those with experience using calc() might be wondering, “Why would we
need to use pool_brfss_weights() to analyze a subset of years when we
could just use the where argument in calc?” The short answer is
that the methods are identical – as long as you are only interested in
the mean, standard error, RSE, and confidence intervals. However, if you
want to know the survey weighted number of people within a given
demographic or with a condition, you need to use pool_brfss_weights().
The following example analyzing data for 2022 compares the results from
the two methods.
brfss_where <- get_data_brfss(cols = c('chi_year', 'obese'), year = 2019:2023)
method_where <- calc(ph.data = brfss_where,
                     what = 'obese',
                     where = chi_year == 2022,
                     metrics = c("mean", "rse", "total"),
                     proportion = TRUE)

brfss_pooled <- get_data_brfss(cols = c('chi_year', 'obese'), year = 2019:2023)
brfss_pooled <- pool_brfss_weights(ph.data = brfss_pooled, years = 2022, new_wt_var = 'wt_2022')
method_pooled <- calc(ph.data = brfss_pooled,
                      what = 'obese',
                      metrics = c("mean", "rse", "total"),
                      proportion = TRUE)

all.equal(method_where[, .(variable, mean, mean_se, mean_lower, mean_upper, rse)],
          method_pooled[, .(variable, mean, mean_se, mean_lower, mean_upper, rse)])

[1] TRUE

all.equal(method_where[, .(variable, total, total_se, total_lower, total_upper)],
          method_pooled[, .(variable, total, total_se, total_lower, total_upper)])

[1] "Column 'total': Mean relative difference: 2.30656"
The mean, standard error, RSE, and CI are identical for the two methods, but the totals differ. Please remember: to get the correct survey-weighted population, you must use pool_brfss_weights().
Please refer to the APDE_SmallNumberUpdate.xlsx file on SharePoint for details.
Working with BRFSS data requires careful attention to survey weights and design, but the functions we’ve covered make this process straightforward. Remember:
- Check variable availability with list_dataset_columns()
- Get data with get_data_brfss() or get_data()
- Modify dtsurvey objects using data.table syntax
- Modify imputationList objects by first using as_table_brfss(), then modifying your object with data.table syntax, then converting it back with as_imputed_brfss()
- Create custom weights if needed with pool_brfss_weights()
- Analyze using calc()
Happy analyzing!
Updated April 16, 2025 (rads v1.3.5)