The Crossmaps Framework and {xmap} workflow
This package is an implementation of the Crossmaps Framework for unified specification, verification, implementation and documentation of operations involved in transforming aggregate statistics between related measurement instruments (e.g. classification codes).
The framework conceptualises the aggregation or redistribution of numeric masses between related taxonomic structures as an operation which applies a graph-based representation of mapping and redistribution logic between source and target keys (the crossmap) to conformable key-value pairs (the shared mass array).
A crossmap specifies:
- related pairs of source and target keys (e.g. states in country)
- weights between 0 and 1 for distributing numeric mass between each related pair of source and target keys (e.g. 25% of country-level GDP -> state-A)
A shared mass array is a collection of key-value pairs, where the values form a shared numeric mass and the keys are parts of a shared conceptual whole (e.g. GDP by state -> country).
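To make these two objects concrete, the following is a minimal base R sketch (not the {xmap} implementation) that stores a crossmap as a data frame of weighted links and applies it to a shared mass array with a join, a multiplication and a grouped sum. The column names, the 25/75 weights and the GDP figure are all made up for illustration.

```r
# Hypothetical crossmap: one country split across two states.
crossmap <- data.frame(
  from   = c("country", "country"),
  to     = c("state-A", "state-B"),
  weight = c(0.25, 0.75)
)

# Hypothetical shared mass array: key-value pairs at the source level.
mass <- data.frame(key = "country", gdp = 1000)

# Attach each source value to its outgoing links, scale by the link weight,
# then sum the pieces arriving at each target key.
linked <- merge(mass, crossmap, by.x = "key", by.y = "from")
linked$gdp_part <- linked$gdp * linked$weight
redistributed <- aggregate(gdp_part ~ to, data = linked, FUN = sum)

redistributed
```

Because the weights leaving the single source key sum to one, the redistributed parts add back up to the original total.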
The crossmaps framework is an alternative approach to data transformation that removes the need for bespoke code to handle data preparation involving many-to-one or one-to-many operations.
The framework gives rise to assertions on input crossmaps and shared mass arrays which ensure that transformations are valid and implemented exactly as specified. Valid and well-documented transformation workflows should have the following properties:
- preservation of the shared total mass before and after transformation. For example, country-level GDP should remain constant regardless of disaggregation method or granularity (e.g. state vs county).
- explicit handling of missing values, without any implicit missing-value arithmetic (e.g. aggregating ‘missing’ state-level mass to country-level by treating the values as zeros via expressions like sum(state, na.rm = TRUE)).
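The second property is easiest to see with a small base R sketch. The state values below are invented; the point is only that na.rm = TRUE returns a country total that looks complete even though one state is unobserved.

```r
# Hypothetical state-level values, with one state unobserved.
state_gdp <- c(A = 120, B = NA, C = 80)

# Implicit missing-value arithmetic: the NA is silently treated as zero.
implicit_total <- sum(state_gdp, na.rm = TRUE)

# Without na.rm the missingness propagates and stays visible.
explicit_total <- sum(state_gdp)

implicit_total  # 200
explicit_total  # NA
```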
See the related paper, A Unified Statistical And Computational Framework For Ex-Post Harmonisation Of Aggregate Statistics, for further details on the conditions which guarantee the above properties. This package implements workflow warnings and errors to ensure relevant conditions are met.
Example: Country-State Mappings
Consider data transformations which reference relations between hierarchical administrative regions.
In the following example, we use some basic data manipulation operations from dplyr to generate mapping weights for transforming numeric mass (e.g. GDP):
- aggregating from state-level to country-level,
- redistributing from country-level to state-level
Aggregation, Coverage, and Missing Value Checks
For aggregation, we use unit weights:
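library(dplyr) # mutate(), group_by(), n(), ...
library(xmap)  # as_xmap_tbl(), apply_xmap(); assumed to also provide the `demo` data
aus_state_agg_links <- demo$aus_state_pairs |>
  mutate(ones = 1L)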
Links are validated when coercing them into crossmaps, and some additional information about the transformation is computed (i.e. how many unique keys are in the source and target taxonomies):
(agg_xmap <- aus_state_agg_links |>
  as_xmap_tbl(from = state, to = ctry, weight_by = ones)
)
#> # A crossmap tibble: 8 × 3
#> # with unique keys: [8] state -> [1] ctry
#> .from$state .to$ctry .weight_by$ones
#> <chr> <chr> <int>
#> 1 AU-ACT AUS 1
#> 2 AU-NSW AUS 1
#> 3 AU-NT AUS 1
#> 4 AU-QLD AUS 1
#> 5 AU-SA AUS 1
#> 6 AU-TAS AUS 1
#> 7 AU-VIC AUS 1
#> 8 AU-WA AUS 1
The unit weights represent a “transfer” of 100% of the source values indexed by .from keys to the target .to keys.
Let’s generate some dummy state-level data to apply our aggregation to:
set.seed(1395)
(aus_state_data <- demo$aus_state_pairs |>
  mutate(
    gdp = runif(n(), 100, 2000),
    ref = 100
  ))
#> # A tibble: 8 × 4
#> ctry state gdp ref
#> <chr> <chr> <dbl> <dbl>
#> 1 AUS AU-ACT 1626. 100
#> 2 AUS AU-NSW 1244. 100
#> 3 AUS AU-NT 703. 100
#> 4 AUS AU-QLD 239. 100
#> 5 AUS AU-SA 1388. 100
#> 6 AUS AU-TAS 1192. 100
#> 7 AUS AU-VIC 1535. 100
#> 8 AUS AU-WA 306. 100
Now we apply the crossmap to aggregate our data:
(aus_ctry_data <- aus_state_data |>
  apply_xmap(
    .xmap = agg_xmap,
    values_from = c(gdp, ref),
    keys_from = state
  )
)
#> # A tibble: 1 × 3
#> ctry gdp ref
#> <chr> <dbl> <dbl>
#> 1 AUS 8233. 800
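As a quick check of the mass-preservation property, the sketch below redoes the aggregation in base R using the rounded gdp values printed above, so it needs neither {xmap} nor the original unrounded data. The vector names are illustrative.

```r
# Rounded state-level GDP values copied from the printed tibble above.
state_gdp <- c(
  "AU-ACT" = 1626, "AU-NSW" = 1244, "AU-NT" = 703, "AU-QLD" = 239,
  "AU-SA" = 1388, "AU-TAS" = 1192, "AU-VIC" = 1535, "AU-WA" = 306
)

# Unit weights send 100% of every state value to the single target key.
unit_weights <- rep(1, length(state_gdp))
country_gdp <- sum(state_gdp * unit_weights)

country_gdp  # 8233, matching the aggregated gdp shown above
```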
What happens if our crossmap were missing instructions for multiple states?
## dropping links
agg_xmap[1:3, ]
#> # A crossmap tibble: 3 × 3
#> # with unique keys: [3] state -> [1] ctry
#> .from$state .to$ctry .weight_by$ones
#> <chr> <chr> <int>
#> 1 AU-ACT AUS 1
#> 2 AU-NSW AUS 1
#> 3 AU-NT AUS 1
## will lead to an error!
apply_xmap(
  .data = aus_state_data,
  .xmap = agg_xmap[1:3, ],
  values_from = c(gdp, ref),
  keys_from = state
)
#> Error in `apply_xmap()`:
#> ✖ One or more keys in `.data` do not have corresponding links in `.xmap`
#> ℹ Add missing links to `.xmap` or subset `.data`
This error prevents observations from being dropped accidentally because the transformation instructions were incompletely specified.
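The coverage condition behind this error can be sketched in base R as a set difference between the keys in the data and the source keys in the crossmap. This illustrates the idea only and is not how apply_xmap() is implemented; data_keys and xmap_from are illustrative names.

```r
# Keys present in the state-level data.
data_keys <- c("AU-ACT", "AU-NSW", "AU-NT", "AU-QLD",
               "AU-SA", "AU-TAS", "AU-VIC", "AU-WA")

# Source keys left in the crossmap after keeping only the first three links.
xmap_from <- data_keys[1:3]

# Any key in the data without a link would be silently dropped by a plain join.
not_covered <- setdiff(data_keys, xmap_from)

not_covered
length(not_covered) == 0  # FALSE: the crossmap does not cover the data
```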
To inspect and remedy this issue, we can use diagnose_apply_xmap() to find out which keys in .data are not covered by the .xmap:
diagnose_apply_xmap(
  .data = aus_state_data,
  .xmap = agg_xmap[1:3, ],
  values_from = c(gdp, ref)
)
#> Warning: ✖ Found 8 keys in `.data` without corresponding match in `.xmap$.from`
#> See .$not_covered
#> $not_covered
#> # A tibble: 8 × 2
#> .key .value$gdp $ref
#> <tibble[,0]> <dbl> <dbl>
#> 1 1626. 100
#> 2 1244. 100
#> 3 703. 100
#> 4 239. 100
#> 5 1388. 100
#> 6 1192. 100
#> 7 1535. 100
#> 8 306. 100
Missing values are also flagged, to encourage handling them explicitly before the apply_xmap() mapping transformation:
# add some `NA`
aus_state_data_na <- aus_state_data
aus_state_data_na[c(1, 3, 5), "gdp"] <- NA
apply_xmap(
  .data = aus_state_data_na,
  .xmap = agg_xmap,
  values_from = gdp,
  keys_from = state
)
#> Error in `apply_xmap()`:
#> ✖ Missing values not allowed in `.data` columns: gdp
#> ℹ Remove or replace missing values.
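One way to make the handling explicit is to record the decision in the data before transforming it. The base R sketch below uses made-up values and shows two options: dropping unobserved rows, or replacing them while keeping a flag column. The column name gdp_imputed and the choice of zero as a replacement are assumptions for illustration; which option is appropriate depends on the analysis.

```r
# Hypothetical state-level data with missing GDP values.
state_data <- data.frame(
  state = c("AU-ACT", "AU-NSW", "AU-NT"),
  gdp   = c(NA, 1244, NA)
)

# Option 1: drop unobserved rows, so the omission is visible in the row count.
observed_only <- state_data[!is.na(state_data$gdp), ]

# Option 2: replace the NAs, but keep a flag recording which rows were changed.
imputed <- state_data
imputed$gdp_imputed <- is.na(imputed$gdp)
imputed$gdp[imputed$gdp_imputed] <- 0

anyNA(imputed$gdp)  # FALSE: no missing values remain, and the flag documents why
```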
Redistribution, valid weights and preserving totals
For redistributing, we can choose any weights as long as the sum of weights on outgoing links from each source key totals one (or dplyr::near() enough). This ensures that we only split source values into percentage parts that sum to 100%.
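The condition can be sketched in base R as a grouped sum compared against one within a tolerance. The tolerance below mirrors the default of dplyr::near(); the link table is invented, and the check is an illustration rather than the validation code used by as_xmap_tbl().

```r
# Hypothetical links from two source keys, with a weight on each link.
links <- data.frame(
  from   = c("X", "X", "X", "Y"),
  to     = c("a", "b", "c", "d"),
  weight = c(0.5, 0.25, 0.25, 0.9)
)

# Total outgoing weight for each source key.
outgoing <- tapply(links$weight, links$from, sum)

# Same default tolerance as dplyr::near().
tol <- .Machine$double.eps^0.5
valid <- abs(outgoing - 1) < tol

valid  # X is TRUE; Y is FALSE because its weights only sum to 0.9
```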
A common naive strategy is to distribute equally amongst related target keys:
demo$aus_state_pairs |>
  group_by(ctry) |>
  mutate(equal = 1 / n_distinct(state)) |>
  ungroup() |>
  as_xmap_tbl(from = ctry, to = state, weight_by = equal)
#> # A crossmap tibble: 8 × 3
#> # with unique keys: [1] ctry -> [8] state
#> .from$ctry .to$state .weight_by$equal
#> <chr> <chr> <dbl>
#> 1 AUS AU-ACT 0.125
#> 2 AUS AU-NSW 0.125
#> 3 AUS AU-NT 0.125
#> 4 AUS AU-QLD 0.125
#> 5 AUS AU-SA 0.125
#> 6 AUS AU-TAS 0.125
#> 7 AUS AU-VIC 0.125
#> 8 AUS AU-WA 0.125
If we use invalid weights, such as unit weights, as_xmap_tbl() will error:
demo$aus_state_pairs |>
  mutate(ones = 1) |>
  as_xmap_tbl(from = ctry, to = state, weight_by = ones)
#> Error in `xmap_tbl()`:
#> ! Invalid `.weight_by` found for some links
#> ✖ The total outgoing `.weight_by` for some `.from` nodes are not near enough to
#> 1
#> ℹ Modify `.weight_by` or adjust `tol` and try again.
#> ℹ Use `diagnose_xmap_tbl() for more information.
Except in the case of one-to-one mappings, crossmaps are generally lateral (one-way), and have different weights in each direction.
A more sophisticated strategy for generating weights is to use reference information. For example, we can use population shares to redistribute GDP between states:
(split_xmap_pop <- demo$aus_state_pop_df |>
  group_by(ctry) |>
  mutate(pop_share = pop / sum(pop)) |>
  ungroup() |>
  as_xmap_tbl(
    from = ctry, to = state, weight_by = pop_share
  ))
#> # A crossmap tibble: 8 × 3
#> # with unique keys: [1] ctry -> [8] state
#> .from$ctry .to$state .weight_by$pop_share
#> <chr> <chr> <dbl>
#> 1 AUS AU-ACT 0.0176
#> 2 AUS AU-NSW 0.314
#> 3 AUS AU-NT 0.00965
#> 4 AUS AU-QLD 0.205
#> 5 AUS AU-SA 0.0701
#> 6 AUS AU-TAS 0.0220
#> 7 AUS AU-VIC 0.255
#> 8 AUS AU-WA 0.107
Let’s redistribute the country-level data we aggregated above back to state level using our calculated population weights:
aus_state_data2 <- aus_ctry_data |>
  mutate(ref = 10000) |>
  apply_xmap(split_xmap_pop,
    values_from = c(gdp, ref),
    keys_from = ctry
  )
Note that the values in the transformed ref column do not exactly match the float values in .weight_by$pop_share used as transformation weights. This is due to floating point inaccuracies. Over larger transformations with more keys, this may result in slight mismatches between the total numeric mass before and after transformation.
#> # A tibble: 8 × 5
#> .from$ctry state gdp ref .weight_by$pop_share
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 AUS AU-ACT 145. 176. 0.0176
#> 2 AUS AU-NSW 2584. 3139. 0.314
#> 3 AUS AU-NT 79.4 96.5 0.00965
#> 4 AUS AU-QLD 1687. 2049. 0.205
#> 5 AUS AU-SA 577. 701. 0.0701
#> 6 AUS AU-TAS 181. 220. 0.0220
#> 7 AUS AU-VIC 2096. 2546. 0.255
#> 8 AUS AU-WA 883. 1072. 0.107
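Because of the floating point inaccuracies noted above, checks that the total mass is preserved are better written with a tolerance than with exact equality. A minimal base R sketch follows, using the classic 0.1 + 0.2 example rather than the data above; totals_match() is a hypothetical helper and not part of {xmap}.

```r
# Exact comparison of doubles can fail even when the arithmetic is "right".
exact <- (0.1 + 0.2) == 0.3

# Hypothetical helper: compare totals before and after, up to a tolerance.
totals_match <- function(before, after, tol = .Machine$double.eps^0.5) {
  abs(sum(before) - sum(after)) < tol
}

tolerant <- totals_match(before = 0.3, after = c(0.1, 0.2))

exact     # FALSE
tolerant  # TRUE
```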