brfss_functions
⚠️ DEPRECATED: The functions in this vignette have been migrated to apde.data. Please use that package instead.
The Behavioral Risk Factor Surveillance System (BRFSS) is a gold mine of public health data – but like any mine, you need the right tools to extract the value. Since BRFSS is a complex survey, analyses need to account for the survey design and weights to get accurate results. The survey design includes stratification and weighting to ensure the sample represents the full population, accounting for who was more or less likely to be included in the survey. When we want to analyze multiple years together (which we often do to increase precision), we need to adjust those survey weights to avoid overestimating our population.
This vignette will show you how to easily work with King County BRFSS data while properly handling all these survey design considerations. Don’t worry - the functions do the heavy lifting for you! We’ll cover everything from finding available variables to getting properly weighted estimates.
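To see why pooled weights need rescaling, consider a toy example (the numbers below are hypothetical, not real BRFSS weights): each year's weights sum to roughly that year's population, so naively summing weights across pooled years would represent several populations stacked on top of each other.

```r
# Toy example with hypothetical weights; each year's weights sum to
# that year's (made-up) population total.
wts_2021 <- c(500, 700)  # sums to 1200
wts_2022 <- c(600, 650)  # sums to 1250
wts_2023 <- c(640, 660)  # sums to 1300

# Naively pooling three years triples the "population" being represented
sum(c(wts_2021, wts_2022, wts_2023))      # 3750

# One simple rescaling divides by the number of pooled years, yielding
# an "average" population that sits between the single-year totals
sum(c(wts_2021, wts_2022, wts_2023) / 3)  # 1250
```

The functions described in this vignette handle this kind of rescaling for you automatically.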
Note that the BRFSS ETL process has its own repository and questions regarding the data should be directed to the data steward.
library(rads)
library(data.table)

One quirk of BRFSS data is that not all questions are asked every year.
Before diving into analysis, it’s helpful to check which variables are
available for your time period of interest. The list_dataset_columns()
function makes this easy.
Since the Washington State and King County datasets are distinct, you
can specify which one you want with the kingco argument. When
kingco = TRUE, you will receive the list of columns in the King County
dataset. When kingco = FALSE, you will receive the list of columns in
the Washington State dataset. The default is kingco = TRUE.
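For example, you can request both column lists and compare them directly (a sketch; it assumes the returned table has the var.names column shown in the output below):

```r
# Sketch: request the Washington State column list with kingco = FALSE
# and compare it against the King County list (kingco = TRUE by default)
vars_kc <- list_dataset_columns("brfss", year = 2023)
vars_wa <- list_dataset_columns("brfss", year = 2023, kingco = FALSE)

# Variables available in King County but not Washington State
setdiff(vars_kc$var.names, vars_wa$var.names)
```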
vars_2023 <- list_dataset_columns("brfss", year = 2023)
head(vars_2023)
nrow(vars_2023)

| var.names | year(s) |
|---|---|
| addepev3 | 2023 |
| age | 2023 |
| age_f | 2023 |
| age_m | 2023 |
| age5_v1 | 2023 |
| age5_v2 | 2023 |
[1] 209
# Check variables across multiple years
vars_2019_2023 <- list_dataset_columns("brfss", year = 2019:2023)
head(vars_2019_2023)
nrow(vars_2019_2023)

| var.names | year(s) |
|---|---|
| aceindx1 | 2019-2021 |
| aceindx2 | 2019-2021 |
| acescor1 | 2019-2021 |
| acescor2 | 2019-2021 |
| addepev3 | 2019-2023 |
| age | 2019-2023 |
[1] 299
Notice that the year(s) column is not constant because BRFSS does not
ask every question in every year.
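Because the returned table has var.names and year(s) columns, you can filter it to find variables asked throughout your period of interest (a sketch; the `year(s)` column name is taken from the output above):

```r
# Sketch: keep only variables available across the full 2019-2023 span
vars_all_years <- vars_2019_2023[`year(s)` == "2019-2023"]
head(vars_all_years)
```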
There are two equivalent ways to get BRFSS data: using
get_data('brfss') or get_data_brfss(). Both functions will:
- Load the data you request into memory
- Automatically adjust weights if you’re analyzing multiple years
- Survey-set the data so it’s ready for analysis
- Return a dtsurvey object, a data.table-friendly survey object that can be analyzed with rads::calc()
As with list_dataset_columns(), by default you will receive King
County data. You can specify Washington State data with the
kingco = FALSE argument.
Let’s see both methods in action:
This is the general interface that you can use to access any of APDE’s analytic-ready data.
brfss_full <- get_data(
dataset = "brfss",
cols = c("chi_year", "age", "race4", "chi_sex", "prediab1"),
year = 2019:2023
)

Your data was survey set with the following parameters and is ready for rads::calc():
- valid years = 2019-2023
- original survey weight = `finalwt1`
- adjusted survey weight = `default_wt`
- strata = `x_ststr`
brfss_full_alt <- get_data_brfss(
cols = c("chi_year", "age", "race4", "chi_sex", "prediab1"),
year = 2019:2023
)

Your data was survey set with the following parameters and is ready for rads::calc():
- valid years = 2019-2023
- original survey weight = `finalwt1`
- adjusted survey weight = `default_wt`
- strata = `x_ststr`
Both methods return an identical dtsurvey object that’s ready for analysis with calc().
Notice that the functions provide an informative message regarding the survey object parameters. These will be hidden in the examples below, but are always produced when getting or survey setting BRFSS data.
Since BRFSS weights are designed to represent the population, we can verify that our multi-year weight adjustments are working properly by comparing population sizes. We expect the adjusted weights to represent an “average” population that falls between the earliest and latest years’ populations since King County’s population has been growing. Let’s verify our weight adjustments are working as expected:
pop_2019 <- sum(brfss_full[chi_year == 2019]$finalwt1)
pop_2023 <- sum(brfss_full[chi_year == 2023]$finalwt1)
pop_adjusted <- sum(brfss_full$default_wt)

pop_2023 > pop_adjusted & pop_adjusted > pop_2019

[1] TRUE
BRFSS data presents a unique challenge when analyzing Health Reporting Areas (HRAs) because it comes with ZIP codes rather than HRA assignments. Since ZIP codes don’t perfectly align with HRA boundaries, we need to account for this uncertainty in our analyses.
To handle this, we use a statistical technique called multiple
imputation. When you request HRA-related columns (hra20_id,
hra20_name, or chi_geo_region), the function returns an
imputationList
object containing 10 different versions of the data. Each version
represents a different possible way that ZIP codes could be assigned to
HRAs based on their overlap. This approach allows us to capture the
uncertainty in our geographic assignments and incorporate it into our
statistical estimates.
Note: APDE decided to use 10 imputations based on an extensive empirical assessment to balance between statistical accuracy and computational efficiency. This is fixed in the ETL process and is not configurable.
brfss_hra <- get_data_brfss(
cols = c("chi_year", "age", "race4", "chi_sex", "prediab1", "obese", "hra20_name"),
year = 2019:2023
)

inherits(brfss_hra, "imputationList") &
  length(brfss_hra$imputations) == 10 &
  inherits(brfss_hra$imputations[[1]], "dtsurvey")

[1] TRUE
Don’t worry if this seems complex - the calc() function automatically handles these imputationList objects.
There are times when you might need to modify BRFSS data. For example, you might want to create a new variable. Before making any modifications, first consider whether your changes should be standardized. If you’re creating variables that will be used across multiple projects (CHI, CHNA, Communities Count, etc.) or repeatedly year after year, contact the BRFSS ETL steward and politely request the addition of these changes to the analytic ready dataset.
For truly custom analyses, your modification approach will depend on whether you’re working with a simple dtsurvey object or an imputationList. Let’s look at each case:
Modifying a dtsurvey
You can modify a dtsurvey object using data.table commands without disrupting its survey settings. If you use dplyr commands, you may break the internals of the dtsurvey and would be wise to survey set it again following the instructions in the “Survey Setting and Creating Custom Weights” section below.
Regardless of whether you use data.table or dplyr commands, you are encouraged to create new variables as needed rather than overwriting and deleting existing ones.
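As a minimal sketch of a data.table modification (the age_group variable name is hypothetical), you could add a new column to the brfss_full object from earlier without disturbing its survey settings:

```r
# Create a new variable rather than overwriting an existing one;
# data.table's := modifies the dtsurvey in place and preserves the
# survey design settings
brfss_full[, age_group := fifelse(age < 65, "under 65", "65+")]
```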
Modifying an imputationList
When working with HRA or region data, modifications become more complex since we need to maintain consistency across all 10 imputed datasets. Here’s a step-by-step example to guide you through the process:
1. Get a BRFSS imputationList (by requesting HRA or region columns)
brfss <- get_data_brfss(
cols = c("age", "hra20_id"),
year = 2019:2023
)

2. Convert it to a regular dtsurvey/data.table
brfss <- as_table_brfss(brfss)

Successfully converted an imputationList to a single dtsurvey/data.table.
Remember to use as_imputed_brfss() after making modifications.
3. Make your modifications

brfss[, age_category := fifelse(age < 67, 'working age', 'retirement age')]

4. Convert back to an imputationList
brfss <- as_imputed_brfss(brfss)

Successfully created an imputationList with 10 imputed datasets.
Data is now ready for analysis with rads::calc().
You might need to use pool_brfss_weights() in two scenarios:
- When analyzing specific years where certain questions were asked
- When you need to restore proper survey settings after using non-data.table commands for data manipulation
While get_data_brfss() automatically creates weights and survey sets
imported data, you can create new weights and re-survey set the data
using pool_brfss_weights(). Here are brief argument descriptions, see
the pool_brfss_weights() help file for details:
- ph.data: Your BRFSS dataset (can be a data.frame, data.table, dtsurvey, or imputationList)
- years: Vector of years you want to analyze together
- year_var: Name of the year column (defaults to ‘chi_year’)
- old_wt_var: Name of the original weight variable (defaults to ‘finalwt1’)
- new_wt_var: Name for your new weight variable
- wt_method: Method used to rescale your weights; options include ‘obs’, ‘pop’, and ‘simple’ (defaults to ‘obs’)
- strata: Name of the strata variable (defaults to ‘x_ststr’)
Let’s see it in action:
brfss_odd_years <- pool_brfss_weights(
ph.data = brfss_full,
years = c(2019, 2021),
new_wt_var = "odd_year_wt" # Name for the new weight variable
)

pop_2019 <- sum(brfss_odd_years[chi_year == 2019]$finalwt1)
pop_2021 <- sum(brfss_odd_years[chi_year == 2021]$finalwt1)
pop_2019_2021 <- sum(brfss_odd_years$odd_year_wt)

pop_2019 < pop_2019_2021 & pop_2019_2021 < pop_2021

[1] TRUE
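The example above uses the default wt_method = 'obs'. Switching methods only requires changing one argument (a sketch; the new_wt_var name below is hypothetical, and the help file describes how ‘obs’, ‘pop’, and ‘simple’ differ):

```r
# Sketch: rescale the same pooled years with the 'pop' method instead
# of the default 'obs'
brfss_pop_wt <- pool_brfss_weights(
  ph.data = brfss_full,
  years = c(2019, 2021),
  new_wt_var = "odd_year_wt_pop",
  wt_method = "pop"
)
```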
Analyzing BRFSS Data with calc()
Now for the fun part - analyzing our data! The
calc() function
handles all the survey design considerations for us. Let’s look at some
examples:
Calculate prediabetes prevalence by sex and race (using a dtsurvey object)
prediab_by_group <- calc(
ph.data = brfss_full,
what = "prediab1",
by = c("chi_sex", "race4"),
metrics = c("mean", "rse"),
proportion = TRUE # Since prediab is binary
)
head(prediab_by_group)

| chi_sex | race4 | variable | mean | level | mean_se | mean_lower | mean_upper | rse |
|---|---|---|---|---|---|---|---|---|
| Male | NA | prediab1 | 0.0503843 | NA | 0.0248165 | 0.0185461 | 0.1296589 | 49.25433 |
| Male | AIAN | prediab1 | 0.2284831 | NA | 0.1021132 | 0.0839714 | 0.4889464 | 44.69182 |
| Male | Black | prediab1 | 0.1187814 | NA | 0.0235966 | 0.0796585 | 0.1734967 | 19.86554 |
| Male | Asian | prediab1 | 0.1548570 | NA | 0.0167577 | 0.1247503 | 0.1906468 | 10.82137 |
| Male | NHPI | prediab1 | 0.2287657 | NA | 0.1063999 | 0.0810248 | 0.4994791 | 46.51043 |
| Male | Hispanic | prediab1 | 0.1363339 | NA | 0.0177889 | 0.1050209 | 0.1751559 | 13.04807 |
Calculate prediabetes prevalence by HRA20 (using an imputationList)
prediab_by_hra20 <- calc(
ph.data = brfss_hra,
what = "prediab1",
by = c("hra20_name"),
metrics = c("mean", "rse"),
proportion = TRUE
)
head(prediab_by_hra20)

| hra20_name | variable | level | mean | mean_se | mean_lower | mean_upper | rse |
|---|---|---|---|---|---|---|---|
| Auburn - North | prediab1 | NA | 0.1097051 | 0.0314104 | 0.0473378 | 0.1720724 | 23.87216 |
| Auburn - South | prediab1 | NA | 0.1084990 | 0.0449626 | 0.0181626 | 0.1988354 | 31.81374 |
| Bear Creek and Greater Sammamish | prediab1 | NA | 0.1392452 | 0.0439130 | 0.0511826 | 0.2273078 | 23.56366 |
| Bellevue - Central | prediab1 | NA | 0.1382798 | 0.0480590 | 0.0419377 | 0.2346218 | 26.44273 |
| Bellevue - Northeast | prediab1 | NA | 0.1117511 | 0.0400927 | 0.0318859 | 0.1916163 | 28.59449 |
| Bellevue - South | prediab1 | NA | 0.1624832 | 0.0359512 | 0.0919890 | 0.2329773 | 21.48270 |
As noted in the calc() wiki, when working with an imputationList, the proportion argument is ignored. However, we include it here to maintain consistent calc() usage regardless of whether you’re working with a dtsurvey object or an imputationList.
Calculate prediabetes & obesity prevalence by HRA20 & sex (using an imputationList)
We will do this in two parts since only one value of what can be specified when ph.data is an imputationList.
prediab_obese_hra20_sex <- rbind(
calc(
ph.data = brfss_hra,
what = c("prediab1"),
by = c("hra20_name", "chi_sex"),
metrics = c("mean", "rse"),
proportion = TRUE
),
calc(
ph.data = brfss_hra,
what = c("obese"),
by = c("hra20_name", "chi_sex"),
metrics = c("mean", "rse"),
proportion = TRUE
)
)
head(prediab_obese_hra20_sex)

| hra20_name | chi_sex | variable | level | mean | mean_se | mean_lower | mean_upper | rse |
|---|---|---|---|---|---|---|---|---|
| Auburn - North | Male | prediab1 | NA | 0.0908839 | 0.0404166 | 0.0101633 | 0.1716045 | 36.37217 |
| Auburn - North | Female | prediab1 | NA | 0.1275012 | 0.0521343 | 0.0230377 | 0.2319647 | 31.66348 |
| Auburn - South | Male | prediab1 | NA | 0.0890185 | 0.0542901 | -0.0206931 | 0.1987302 | 46.01217 |
| Auburn - South | Female | prediab1 | NA | 0.1292806 | 0.0749252 | -0.0219830 | 0.2805441 | 43.67390 |
| Bear Creek and Greater Sammamish | Male | prediab1 | NA | 0.1473026 | 0.0460883 | 0.0566849 | 0.2379204 | 28.97528 |
| Bear Creek and Greater Sammamish | Female | prediab1 | NA | 0.1317566 | 0.0646126 | 0.0020176 | 0.2614955 | 34.99476 |
Those with experience using calc() might be wondering, “Why would we
need to use pool_brfss_weights() to analyze a subset of years when we
could just use the where argument in calc?” The short answer is
that the methods are identical – as long as you are only interested in
the mean, standard error, RSE, and confidence intervals. However, if you
want to know the survey weighted number of people within a given
demographic or with a condition, you need to use pool_brfss_weights().
The following example analyzing data for 2022 compares the results from
the two methods.
brfss_where <- get_data_brfss(cols = c('chi_year', 'obese'), year = 2019:2023)
method_where <- calc(ph.data = brfss_where,
                     what = 'obese',
                     where = chi_year == 2022,
                     metrics = c("mean", "rse", "total"),
                     proportion = TRUE)

brfss_pooled <- get_data_brfss(cols = c('chi_year', 'obese'), year = 2019:2023)
brfss_pooled <- pool_brfss_weights(ph.data = brfss_pooled, years = 2022, new_wt_var = 'wt_2022')
method_pooled <- calc(ph.data = brfss_pooled,
                      what = 'obese',
                      metrics = c("mean", "rse", "total"),
                      proportion = TRUE)

all.equal(method_where[, .(variable, mean, mean_se, mean_lower, mean_upper, rse)],
          method_pooled[, .(variable, mean, mean_se, mean_lower, mean_upper, rse)])

[1] TRUE

all.equal(method_where[, .(variable, total, total_se, total_lower, total_upper)],
          method_pooled[, .(variable, total, total_se, total_lower, total_upper)])

[1] "Column 'total': Mean relative difference: 2.30656"
The mean, standard error, RSE, and CI are identical for the two methods, but the totals differ. Please remember: to get the correct survey-weighted population, you must use pool_brfss_weights().
Please refer to the APDE_SmallNumberUpdate.xlsx file on SharePoint for details.
Working with BRFSS data requires careful attention to survey weights and design, but the functions we’ve covered make this process straightforward. Remember:
- Check variable availability with list_dataset_columns()
- Get data with get_data_brfss() or get_data()
- Modify dtsurvey objects using data.table syntax
- Modify imputationList objects by first using as_table_brfss(), then modifying your object with data.table syntax, then converting it back with as_imputed_brfss()
- Create custom weights if needed with pool_brfss_weights()
- Analyze using calc()
Happy analyzing!
Updated April 16, 2025 (rads v1.3.5)