spatagg provides functions for the aggregation of estimates between non-nesting geographies. For example, “converting data at the census tract level to the ZIP code level.
Two main approaches are provided:
-
Geographic/Fractional Approach: A target geography value is constructed from overlapping source geographies based on fractional overlap. If target T is covered 60/40 by geographies A and B, then:
$T = .6A + .4B$ -
Point weighted population: Instead of using geographic overlap as the weights for aggregation (e.g. #1), the population overlap is used instead. THIS IS THE ONE YOU USUALLY WANT TO USE!
- BEWARE OF EDGE EFFECTS, especially when using the point pop option with the parcel population from the kcparcelpop package. Things like ZIP codes cross county boundaries. For ZIP codes specifically, I think this generally means all estimates in ZIPs that cross the boundary will be fully assigned to intersecting targets within King County. This may or may not be what you want. I encourage folks to double check the crosswalk values/weights/xwalk_df when using geographies that are outside King County and you are using the parcel pop option.
- BEWARE OF WATER, especially when using geographic overlap. In general, make sure source and target have similar ideas of what land is.
- THE
proportionARGUMENT IN CROSSWALK IS NOT JUST FOR proportion data. Basically, if your data is not count data, setproportion = T. Or at a minimum, try both and see which makes sense. The validation vignette describes this quandary in greater detail (and so does?crosswalk). - Don’t forget the rescale option exists. Review
?crosswalkto figure out if you want it toggled or not.
- The function
assign_casescan take a set of crosswalk instructions and probabilistically apply it to microdata. This is a relatively new function, although it is tested. That said, best to manually review its results for sensibility. - The function
reduce_overlapscan be used to fix up geographies (in some cases) so thatcheck_internal_consistencyquiets down. This function does slightly weird things to the output geography, so don’t use it for mapping or anything. If you encounter a geography that you got from the APDE shapefile repository that fails acheck_internal_consistencytest, please let Daniel Casey know. - Review the validation vignette to see how geographic overlap compares to population based ones
# You will need the remotes package
remotes::install_github('https://github.com/PHSKC-APDE/spatagg')
# You may also want to install the kcparcelpop package
# Useful for population weighted aggregation
remotes::install_github('https://github.com/PHSKC-APDE/kcparcelpop/')The code below creates a 3 x 3 grid of squares as a set of “source” polygons. The source polygon contains the information that needs to be translated/aggregated to “target” polygon. In “real” instances, source polygons will be stuff like census tracts or ZIP codes.
Two target polygons are created for demonstration purposes. The first nests nicely with the source polygons (e.g. can be exactly constructed by source polygons) whereas the second overlaps the source polygons but does not nest. In this example, each “target” is just one polygon, but it could be a collection of polygons– e.g. cities, regions, or ZIP codes.
In general, you want the source polygons to be smaller than the target polygons.
library('spatagg')
library('sf', quietly = TRUE)Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library('ggplot2')
library('knitr', quietly = TRUE)
library('data.table')
set.seed(1)
#create a test grid of bottom left points
grid_size = 3
grid = expand.grid(x = seq_len(grid_size)-1, y = seq_len(grid_size)-1)
# given a bottom left corner, create a square
poly_points = function(x, y, step =1){
mat = matrix(c(x, y, # bottom left
x+step, y, # bottom right
x+step, y+step, # top right
x, y+step, # top left
x, y # close
),ncol = 2, byrow = TRUE)
}
# remove the outbounds of the grid as they'll get computed
source_poly = lapply(seq_len(nrow(grid)), function(i) {
x = grid[i,'x']
y = grid[i, 'y']
r = data.frame(id = i)
r$geom = sf::st_sfc(sf::st_polygon(list(poly_points(x,y,1))))
sf::st_sf(r, sf_column_name = 'geom')
})
source_poly = do.call(rbind, source_poly)
#add some random data to source_poly
source_poly$denom = round(runif(nrow(source_poly), 100,200))
source_poly$numer = round(runif(nrow(source_poly), 50,100))
source_poly$frac = with(source_poly, numer/denom)
# total overlap target
tp1 = data.frame(id = 1)
tp1$geom = sf::st_sfc(sf::st_polygon(list(poly_points(0,0,2))))
tp1 = sf::st_sf(tp1, sf_column_name = 'geom')
# partial overlap target
tp2 = data.frame(id = 1)
tp2$geom = sf::st_sfc(sf::st_polygon(list(poly_points(.5,.5,2))))
tp2 = sf::st_sf(tp2, sf_column_name = 'geom')
# graph them
ggplot() + geom_sf(data = source_poly, fill = NA) +
#geom_sf(data = target_poly, fill = NA, color = 'red') +
geom_sf(data = tp1, fill = NA, color = 'purple', size = 1.2) +
geom_sf_label(data = source_poly, aes(label = id)) +
theme_classic() +
ggtitle('Source polygons nest within target',
'Source = black, target = purple') ggplot() + geom_sf(data = source_poly, fill = NA) +
geom_sf(data = tp2, fill = NA, color = 'purple', size = 1.2) +
geom_sf_label(data = source_poly, aes(label = id)) +
#geom_sf(data = overlap_poly, fill = NA, color = 'purple') +
theme_classic() +
ggtitle('Source polygons do not cleanly nest within target',
'Source = black, target = purple')The translation of a set of estimates/data from source to target occurs
in two main steps. First, a set of weights that describe how to convert
from source to target are generated via create_xwalk. Second, those
weights are applied via crosswalk.
The following command creates a set of geographic overlap weights. By
default, create_xwalk expects that source and target overlap by at
least 75%. This default is set high because I expect most instances of
source/target overlap to broadly cover the same area (e.g. two different
ways of carving up KC). However, tp1 and source_poly only overlap by
44%, so the min_overlap option is reduced to prevent an error.
cw1 = create_xwalk(source_poly, tp1,
source_id = 'id', target_id = 'id', min_overlap = .4)
knitr::kable(cw1)| source_id | target_id | s2t_fraction | isect_amount | tcoverage_amount | target_amount |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 4 | 4 |
| 2 | 1 | 1 | 1 | 4 | 4 |
| 4 | 1 | 1 | 1 | 4 | 4 |
| 5 | 1 | 1 | 1 | 4 | 4 |
The resulting crosswalk data.frame returns a row per each source:target
pair. s2t_fraction represents the percent of a source polygon that
falls within the specified target polygon. isect_amount is the area of
the intersection (in the units squared of the CRS of source_poly).
tcoverage_amount is the area of target covered by source polygons. For
example, in this example, 4 units^2 of target is covered by source
polygons. target_amount is the total area of the target polygon (by
target_id). Since tcoverage_amount == target_amount we can infer
that tp1 is fully covered by source_poly (which was already known, but
hey, its fun to confirm).
The crosswalk function has the following arguments:
-
source: the polygons with estimates to be converted/aggregated/translated to the target polygon(s) from thecreate_xwalkstep -
source_id: Column name of the id column forsource -
est: column name of the estimate to be converted -
proportion: Logical flag denoting whetherestrepresents proportion data -
se: An optional name of the column containing the standard error ofest -
by: An optional vector of column names to repeat the translation by. See?crosswalkfor more details -
xwalk_df: The result ofcreate_xwalkstep -
rescale: Logical flag to determine whether the conversion/aggregation weights should be scaled to approximate full coverage of target bysource
# for counts
r = crosswalk(source_poly, source_id = 'id',
est = 'numer',proportion = FALSE, xwalk_df = cw1)
# it should be equal to the four polygons tp1 overlaps with
r$est[1] 266
r$est == sum(source_poly$numer[c(1,2,4,5)])[1] TRUE
# for proportions
r2 = crosswalk(source_poly, source_id = 'id',
est = 'frac',proportion = TRUE, xwalk_df = cw1)
r2$est[1] 0.4675174
r2$est == mean(source_poly$frac[c(1,2,4,5)])[1] TRUE
In general, and where possible, its better to convert counts to a new geography rather than metrics with the denominator already included. Or at least, its worth thinking about what you are actually converting from source to target. Using counts implicitly has the aggregation weights combine source population with the geographic overlaps. In the case below, things are close, but not the same.
r$est/sum(source_poly$denom[c(1,2,4,5)])[1] 0.4626087
r2$est[1] 0.4675174
r$est/sum(source_poly$denom[c(1,2,4,5)]) == r2$est[1] FALSE
Unlike the previous example (using tp1), tp2 does not cleanly
intersect with source_poly. However, the general process is the same.
cw2 = create_xwalk(source_poly, tp2,
source_id = 'id', target_id = 'id', min_overlap = .4)
knitr::kable(cw2)| source_id | target_id | s2t_fraction | isect_amount | tcoverage_amount | target_amount |
|---|---|---|---|---|---|
| 1 | 1 | 0.25 | 0.25 | 4 | 4 |
| 2 | 1 | 0.50 | 0.50 | 4 | 4 |
| 3 | 1 | 0.25 | 0.25 | 4 | 4 |
| 4 | 1 | 0.50 | 0.50 | 4 | 4 |
| 5 | 1 | 1.00 | 1.00 | 4 | 4 |
| 6 | 1 | 0.50 | 0.50 | 4 | 4 |
| 7 | 1 | 0.25 | 0.25 | 4 | 4 |
| 8 | 1 | 0.50 | 0.50 | 4 | 4 |
| 9 | 1 | 0.25 | 0.25 | 4 | 4 |
# for counts
r3 = crosswalk(source_poly, source_id = 'id',
est = 'numer',proportion = FALSE, xwalk_df = cw2)
# it should be equal to the four polygons tp1 overlaps with
r3$est[1] 299.75
compare = merge(cw2[,c('source_id', 's2t_fraction')], source_poly,
by.x = 'source_id', by.y = 'id')
r3$est == sum(compare$numer * compare$s2t_fraction)[1] TRUE
# for proportions
r4 = crosswalk(source_poly, source_id = 'id',
est = 'frac',proportion = TRUE, xwalk_df = cw1)
r4$est[1] 0.4675174
In addition to converting data between source and target based on how
they geographically interact, spatagg provides routines to do
aggregation based on how population falls within both source and target.
The kcparcelpop package provides a few datasets to facilitate this for
King County. However, for this example, we will generate some population
of our own.
ptpop = st_sample(source_poly, 100)
ptpop = data.frame(id = seq_len(100),
pop = round(runif(100, 100,200)),
ptpop)
ptpop = st_sf(ptpop, sf_column_name = 'geometry')
ggplot() + geom_sf(data = source_poly, fill = NA) +
#geom_sf(data = target_poly, fill = NA, color = 'red') +
geom_sf(data = ptpop, aes(color = pop)) +
theme_minimal() +
ggtitle('Population points')# "Nested"
ppcw1 = create_xwalk(source_poly, tp1, source_id = 'id', target_id = 'id',
method = 'point pop', point_pop = ptpop, min_overlap = .1)
knitr::kable(ppcw1)| source_id | target_id | s2t_fraction | isect_amount | tcoverage_amount | target_amount |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 669 | 6584 | 6584 |
| 2 | 1 | 1 | 2208 | 6584 | 6584 |
| 4 | 1 | 1 | 1949 | 6584 | 6584 |
| 5 | 1 | 1 | 1758 | 6584 | 6584 |
# "Not nested"
ppcw2 = create_xwalk(source_poly, tp2, source_id = 'id', target_id = 'id',
method = 'point pop', point_pop = ptpop, min_overlap = .1)
knitr::kable(ppcw2)| source_id | target_id | s2t_fraction | isect_amount | tcoverage_amount | target_amount |
|---|---|---|---|---|---|
| 1 | 1 | 0.1704036 | 114 | 7691 | 7691 |
| 2 | 1 | 0.5493659 | 1213 | 7691 | 7691 |
| 3 | 1 | 0.4249292 | 600 | 7691 | 7691 |
| 4 | 1 | 0.5987686 | 1167 | 7691 | 7691 |
| 5 | 1 | 1.0000000 | 1758 | 7691 | 7691 |
| 6 | 1 | 0.7321867 | 1192 | 7691 | 7691 |
| 7 | 1 | 0.3865415 | 494 | 7691 | 7691 |
| 8 | 1 | 0.5329341 | 890 | 7691 | 7691 |
| 9 | 1 | 0.1619458 | 263 | 7691 | 7691 |
The output of a “point pop” crosswalk is structured similarly to the
fractional geographic overlap ones computed above. The main difference
is the units of isect_amount, tcoverage_amount and target_amount
are population instead of area (e.g. unit^2)
Use the crosswalk function to convert from source to target.
# for counts
# nesting for fun
r5 = crosswalk(source_poly, source_id = 'id',
est = 'numer',proportion = FALSE, xwalk_df = ppcw1)
# it should be equal to the four polygons tp1 overlaps with
r5$est[1] 266
compare = merge(ppcw1[,c('source_id', 's2t_fraction')], source_poly,
by.x = 'source_id', by.y = 'id')
r5$est == sum(compare$numer * compare$s2t_fraction)[1] TRUE
# for proportions
# non-nesting, just for fun
r6 = crosswalk(source_poly, source_id = 'id',
est = 'frac',proportion = TRUE, xwalk_df = ppcw2)
r6$est[1] 0.4802856
Sometimes source estimates might include some metric of uncertainty that
a user wishes to propagate to the target geography. Passing a column of
standard errors via the se argument will compute, via simulation, a
se for the target.
# create an se for source_poly
source_poly$se_frac = runif(nrow(source_poly), .01, .10)
r7 = crosswalk(source_poly, source_id = 'id',
est = 'frac',proportion = TRUE, xwalk_df = ppcw2,
se = 'se_frac')
print(r7) target_id est se
1 1 0.4806777 0.02674552
sp = source_poly[c(1,2,4,5),]
ggplot() + geom_sf(data = sp, fill = NA, color = 'black', size = 1.1) +
geom_sf(data = tp2, fill = NA, color = 'purple', size = 1.2) +
#geom_sf(data = overlap_poly, fill = NA, color = 'purple') +
theme_minimal() +
ggtitle('Source polygons do not totally encompass target',
'Source = black, target = purple')When scaling counts, the rescale argument (set to TRUE) can adjust
the counts upwards to approximate full overlap.
# Geographic overlap
non_geo = create_xwalk(sp, tp2,
source_id = 'id', target_id = 'id', min_overlap = .1)
# no scaling
r8 = crosswalk(sp, source_id = 'id',
est = 'numer',proportion = FALSE, xwalk_df = non_geo,
rescale = FALSE)
print(r8) target_id est
1 1 154.25
#yes scaling
r9 = crosswalk(sp, source_id = 'id',
est = 'numer',proportion = FALSE, xwalk_df = non_geo,
rescale = TRUE)
print(r9) target_id est
1 1 274.2222
# Population overlap
non_pop = create_xwalk(sp, tp2, source_id = 'id',
target_id = 'id', min_overlap = .1,
method = 'point pop', point_pop = ptpop)
# no scaling
r10 = crosswalk(sp, source_id = 'id',
est = 'numer',proportion = FALSE, xwalk_df = non_pop,
rescale = FALSE)
print(r10) target_id est
1 1 161.2899
#yes scaling
r11 = crosswalk(sp, source_id = 'id',
est = 'numer',proportion = FALSE, xwalk_df = non_pop,
rescale = TRUE)
print(r11) target_id est
1 1 291.7405
It should be noted that rescaling is not relevant when
proportion = TRUE.
Whereas the crosswalk function translates estimates from one areal
unit to another via the output of a create_xwalk call, assign_cases
does the same for microdata by probabilistically assigning each case
from a source geography to a target geography.
For commonly used datasets, assign_cases should be run centrally and
have its results saved/attached to the microdata for reproducibility
across analysts. assign_cases is best used for aggregation or
translation between similarly sized geographies (e.g. ZIP to HRAs).
While it can be used for dis-aggregation, do so with caution.
# create some fake data
fake_micro = data.frame(rid = 1:10000)
fake_micro$id = sample(source_poly$id, nrow(fake_micro), replace = T)
# two split down the middle polygons
m1 = t(matrix(c(0,0,1.5,0,1.5,3,0,3,0,0), ncol = 5))
m2 = m1
m2[,1] <- m2[,1] + 1.5
tp3 = sf::st_sfc(st_polygon(list(m1)), st_polygon(list(m2)))
tp3 = st_sf(data.frame(id = 1:2, geom = tp3))
ggplot() + geom_sf(data = source_poly, fill = NA) +
geom_sf(data = tp3, fill = NA, color = 'purple', size = 1.2) +
geom_sf_label(data = source_poly, aes(label = id)) +
#geom_sf(data = overlap_poly, fill = NA, color = 'purple') +
theme_classic() +
ggtitle('Two target polyons with some down the middle splits',
'Source = black, target = purple')# create the crosswalk instructions
xw3 = create_xwalk(source_poly, tp3, 'id', 'id')
# Use assign cases
fake_micro$target_id = assign_cases(source = fake_micro, 'id', xw3)
# summarise results (using data.table because I can' bother with base R)
setDT(fake_micro)
fake_micro = fake_micro[, .N, keyby = .(id, target_id)]
fake_micro[, percent := round(100 *N/sum(N)), id]
# should be about 50/50
knitr::kable(fake_micro[id %in% c(2,5,8),])| id | target_id | N | percent |
|---|---|---|---|
| 2 | 1 | 556 | 51 |
| 2 | 2 | 541 | 49 |
| 5 | 1 | 524 | 49 |
| 5 | 2 | 549 | 51 |
| 8 | 1 | 518 | 48 |
| 8 | 2 | 566 | 52 |
# The rest should be 100%
knitr::kable(fake_micro[!id %in% c(2,5,8),])| id | target_id | N | percent |
|---|---|---|---|
| 1 | 1 | 1068 | 100 |
| 3 | 2 | 1118 | 100 |
| 4 | 1 | 1142 | 100 |
| 6 | 2 | 1133 | 100 |
| 7 | 1 | 1178 | 100 |
| 9 | 2 | 1107 | 100 |




