File: sids.Rmd

package info (click to toggle)
r-cran-spdep 1.2-7%2Bdfsg-1
links: PTS, VCS
area: main
in suites: bookworm
size: 3,308 kB
sloc: ansic: 871; sh: 16; makefile: 2
file content (466 lines) | stat: -rw-r--r-- 20,230 bytes
parent folder | download | duplicates (2)
---
title: "Introduction to the North Carolina SIDS data set (re-revised)"
author: "Roger Bivand"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
    toc_depth: 2
bibliography: refs.bib
vignette: >
  %\VignetteIndexEntry{Introduction to the North Carolina SIDS data set (re-revised)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```

## Introduction

This data set was presented first in @symonsetal:1983, analysed with
reference to the spatial nature of the data in @cressie+read:1985,
expanded in @cressie+chan:1989, and used in detail in @cressie:1991. It
is for the 100 counties of North Carolina, and includes counts of
numbers of live births (also non-white live births) and numbers of
sudden infant deaths, for the July 1, 1974 to June 30, 1978 and July 1,
1979 to June 30, 1984 periods. In @cressie+read:1985, a listing of
county neighbours based on shared boundaries (contiguity) is given, and
in @cressie+chan:1989, and in @cressie:1991 [pp. 386–389], a different
listing based on the criterion of distance between county seats, with a
cutoff at 30 miles. The county seat location coordinates are given in
miles in a local (unknown) coordinate reference system. The data are
also used to exemplify a range of functions in the spatial statistics
module user’s manual [@kaluznyetal:1996].

## Getting the data into R 

```{r, echo=FALSE,eval=TRUE,warning=FALSE, message=FALSE}
library(spdep)
```

We will be using the **spdep** and **spreg** packages, here version: `r spdep()[1]`, the **sf** package and the **tmap** package. The data from the sources
referred to above is documented in the help page for the `nc.sids`
data set in **spData**. The actual data, included in a shapefile of the county boundaries for North Carolina were made available in the **maptools** package [^1]. These data are known to be geographical coordinates (longitude-latitude in decimal degrees) and are assumed to use the NAD27 datum.

```{r echo=TRUE,eval=TRUE}
library(spdep)
nc <- st_read(system.file("shapes/sids.shp", package="spData")[1], quiet=TRUE)
st_crs(nc) <- "+proj=longlat +datum=NAD27"
row.names(nc) <- as.character(nc$FIPSNO)
```

The shapefile format presupposes that you have three files with
extensions `.shp`, `.shx`, and `.dbf`, where the first
contains the geometry data, the second the spatial index, and the third
the attribute data. They are required to have the same name apart from
the extension, and are read here using `sf::st_read()` into the
`sf` object `nc`; the class is defined in **sf**. The centroids of the largest polygon in each county are available
using the `st_centroid` method from **sf** as an **sfc** POINT object, and can
be used to place labels after the extraction of the coordinate matrix:

```{r echo=TRUE}
sf_use_s2(FALSE)
plot(st_geometry(nc), axes=TRUE)
text(st_coordinates(st_centroid(st_geometry(nc), of_largest_polygon=TRUE)), label=nc$FIPSNO, cex=0.5)
```

We can examine the names of the columns of the data frame to see what it
contains — in fact some of the same columns that we will be examining
below, and some others which will be useful in cleaning the data set.

```{r echo=TRUE,eval=TRUE}
names(nc)
summary(nc)
```

Let's check the different versions of the data against each other - **sf** and **spData** have NC SIDS files, as does GeoDa Center in two forms:

```{r echo=TRUE,eval=TRUE}
library(sf)
nc_sf <- st_read(system.file("shape/nc.shp", package="sf"),
                 quiet=TRUE)
st_crs(nc_sf)
nc <- st_read(system.file("shapes/sids.shp",
                 package="spData"), quiet=TRUE)
st_crs(nc)
```

As the actual CRS is unknown, **spData** reports missing, although it may very well be `+proj=longlat +datum=NAD27`

```{r echo=TRUE,eval=TRUE}
st_crs(nc) <- "+proj=longlat +datum=NAD27"
```

Next, are the geometries the same? `sf::st_equals` returns a logical matrix, so we'll check that the diagonal values are all `TRUE`, and that only those values are `TRUE` by summing and recalling that `n` is `100`:

```{r echo=TRUE,eval=TRUE}
suppressWarnings(st_crs(nc_sf) <- st_crs(nc))
xx <- st_equals(nc, nc_sf, sparse=FALSE)
all(diag(xx)) && sum(xx) == 100L
```

Next, let's download the GeoDa files and repeat the comparisons:

```{r echo=TRUE,eval=TRUE}
td <- tempdir()
#download.file("https://geodacenter.github.io/data-and-lab//data/sids.zip", file.path(td, "sids.zip"), quiet=TRUE) 
# local copy (2020-10-22) as repository sometimes offline
file.copy(system.file("etc/misc/sids.zip", package="spdep"), td)
unzip(file.path(td, "sids.zip"), c("sids/sids.dbf", "sids/sids.prj", "sids/sids.shp", "sids/sids.shx"), exdir=td)
sids_sf <- st_read(file.path(td, "sids/sids.shp"), quiet=TRUE)
#download.file("https://geodacenter.github.io/data-and-lab//data/sids2.zip", file.path(td, "sids2.zip"), quiet=TRUE)
file.copy(system.file("etc/misc/sids2.zip", package="spdep"), td)
unzip(file.path(td, "sids2.zip"), c("sids2/sids2.dbf", "sids2/sids2.prj", "sids2/sids2.shp", "sids2/sids2.shx"), exdir=td)
sids2_sf <- st_read(file.path(td, "sids2/sids2.shp"), quiet=TRUE)
```

```{r echo=TRUE,eval=TRUE}
st_crs(sids_sf)
st_crs(sids2_sf)
```

It looks as though the external files are assuming WGS84/NAD83 for the datum, but also contain the same geometries.

```{r echo=TRUE,eval=TRUE}
suppressWarnings(st_crs(sids_sf) <- st_crs(nc_sf))
xx <- st_equals(sids_sf, nc_sf, sparse=FALSE)
all(diag(xx)) && sum(xx) == 100L
```

```{r echo=TRUE,eval=TRUE}
suppressWarnings(st_crs(sids2_sf) <- st_crs(nc_sf))
xx <- st_equals(sids2_sf, nc_sf, sparse=FALSE)
all(diag(xx)) && sum(xx) == 100L
```

Now for the contents of the files - `sids2` also contains rates, while the file in `spData` contains the coordinates as given in @cressie:1991, and the parcels of contiguous counties on p. 554, and the aggregations used for median polishing.

```{r echo=TRUE, eval=TRUE}
all.equal(as.data.frame(nc_sf)[,1:14], as.data.frame(sids_sf)[,1:14])
all.equal(as.data.frame(nc_sf)[,1:14], as.data.frame(sids2_sf)[,1:14])
```

The **spData** data set has some columns reordered and a surprise:

```{r echo=TRUE, eval=TRUE}
all.equal(as.data.frame(nc_sf)[,1:14], as.data.frame(nc)[,c(2,3,4,1,5:14)])
```

so a difference in `NWBIR74`:

```{r echo=TRUE, eval=TRUE}
which(!(nc_sf$NWBIR74 == nc$NWBIR74))
c(nc$NWBIR74[21], nc_sf$NWBIR74[21])
```

where **spData** follows @cressie:1991 and **sf** and Geoda follow @cressie+chan:1989 for NWBIR74 in Chowan county.

We will now examine the data set reproduced from Cressie and
collaborators, included in **spData** (formerly in **spdep**), and add the neighbour relationships used in
@cressie+chan:1989 to the background map as a graph shown in Figure
\ref{plot-CC89.nb}:

```{r echo=TRUE, eval=TRUE}
gal_file <- system.file("weights/ncCR85.gal", package="spData")[1]
ncCR85 <- read.gal(gal_file, region.id=nc$FIPSNO)
ncCR85
gal_file <- system.file("weights/ncCC89.gal", package="spData")[1]
ncCC89 <- read.gal(gal_file, region.id=nc$FIPSNO)
ncCC89
```
```{r label=plot-CC89.nb, echo=TRUE}
plot(st_geometry(nc), border="grey")
plot(ncCC89, st_centroid(st_geometry(nc), of_largest_polygon), add=TRUE, col="blue")
```

Printing the neighbour object shows that it is a neighbour list object,
with a very sparse structure — if displayed as a matrix, only 3.94% of
cells would be filled. Objects of class [`nb`]{} contain a list as long
as the number of counties; each component of the list is a vector with
the index numbers of the neighbours of the county in question, so that
the neighbours of the county with [`region.id`]{} of [`37001`]{} can be
retreived by matching against the indices. More information can be
obtained by using [`summary()`]{} on an [`nb`]{} object. Finally, we
associate a vector of names with the neighbour list, through the
[`row.names`]{} argument. The names should be unique, as with data frame
row names.

```{r echo=TRUE}
r.id <- attr(ncCC89, "region.id")
ncCC89[[match("37001", r.id)]]
r.id[ncCC89[[match("37001", r.id)]]]
``` 


The neighbour list object records neighbours by their order in relation
to the list itself, so the neighbours list for the county with
[`region.id`]{} “37001” are the seventeenth, nineteenth, thirty-second,
forty-first and sixty-eighth in the list. We can retreive their codes by
looking them up in the [`region.id`]{} attribute.

```{r echo=TRUE}
as.character(nc$NAME)[card(ncCC89) == 0]
```

We should also note that this neighbour criterion generates two counties
with no neighbours, Dare and Hyde, whose county seats were more than 30
miles from their nearest neighbours. The [`card()`]{} function returns
the cardinality of the neighbour set. We need to return to methods for
handling no-neighbour objects later on. We will also show how new
neighbours lists may be constructed in , and compare these with those
from the literature.

### Probability mapping

Rather than review functions for measuring and modelling spatial
dependence in the **spdep** package, we will focus on probability mapping for
disease rates data. Typically, we have counts of the incidence of some
disease by spatial unit, associated with counts of populations at risk.
The task is then to try to establish whether any spatial units seem to
be characterised by higher or lower counts of cases than might have been
expected in general terms [@bailey+gatrell:1995].

An early approach by @choynowski:1959, described by @cressie+read:1985
and @bailey+gatrell:1995, assumes, given that the true rate for the
spatial units is small, that as the population at risk increases to
infinity, the spatial unit case counts are Poisson with mean value equal
to the population at risk times the rate for the study area as a whole.
Choynowski’s approach folds the two tails of the measured probabilities
together, so that small values, for a chosen $\alpha$, occur for spatial
units with either unusually high or low rates. For this reason, the high
and low counties are plotted separately below. Note that `cut` returns a `factor` labeled with cut intervals.

```{r echo=TRUE}
ch <- choynowski(nc$SID74, nc$BIR74)
nc$ch_pmap_low <- ifelse(ch$type, ch$pmap, NA)
nc$ch_pmap_high <- ifelse(!ch$type, ch$pmap, NA)
prbs <- c(0,.001,.01,.05,.1,1)
nc$high = cut(nc$ch_pmap_high, prbs)
nc$low = cut(nc$ch_pmap_low,prbs )
```

```{r}
is_tmap <- FALSE
if (require(tmap, quietly=TRUE)) is_tmap <- TRUE
is_tmap
```


```{r choymap, echo=TRUE, eval=is_tmap}
library(tmap)
tm_shape(nc) + tm_fill(c("low", "high"), palette="Set1", title="p-values") +
  tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("low", "high"))
```

For more complicated thematic maps, it may be helpful to use ColorBrewer
(<https://colorbrewer2.org>) colour palettes. Here we use 
palettes accessed through **tmap**, available in R in the **RColorBrewer** package.

While the [`choynowski()`]{} function only provides the probability map
values required, the [`probmap()`]{} function returns raw (crude) rates,
expected counts (assuming a constant rate across the study area),
relative risks, and Poisson probability map values calculated using the
standard cumulative distribution function [`ppois()`]{}. This does not
fold the tails together, so that counties with lower observed counts
than expected, based on population size, have values in the lower tail,
and those with higher observed counts than expected have values in the
upper tail, as we can see.

```{r echo=TRUE}
pmap <- probmap(nc$SID74, nc$BIR74)
nc$pmap <- pmap$pmap
``` 

```{r, eval=is_tmap, echo=TRUE}
brks <- c(0,0.001,0.01,0.025,0.05,0.95,0.975,0.99,0.999,1)
tm_shape(nc) + tm_fill("pmap", breaks=brks, midpoint=0.5, palette="RdBu") + tm_layout(legend.outside=TRUE)
```


Marilia Carvalho (personal communication) and Virgilio Gómez Rubio
[@gomez-rubio+ferrandiz+lopez:2003] have pointed to the unusual shape of
the distribution of the Poisson probability values (histogram below), repeating the doubts about probability mapping voiced by
@cressie:1991 [p. 392]: “an extreme value $\ldots$ may be more due to
its lack of fit to the Poisson model than to its deviation from the
constant rate assumption”. There are many more high values than one
would have expected, suggesting perhaps overdispersion, that is that the
ratio of the variance and mean is larger than unity.

```{r label=poishist, echo=TRUE}
hist(nc$pmap, main="")
```

One ad-hoc way to assess the impact of the possible failure of our
assumption that the counts follow the Poisson distribution is to
estimate the dispersion by fitting a generalized linear model of the
observed counts including only the intercept (null model) and offset by
the observed population at risk (suggested by Marilia Carvalho and
associates):

```{r echo=TRUE}
res <- glm(SID74 ~ offset(log(BIR74)), data=nc, family="quasipoisson")
nc$stdres <- rstandard(res)
```

```{r, eval=is_tmap, echo=TRUE}
brks <- c(-4, -3, -2, -1.5, -1, -0.5, 0.5, 1, 1.5, 2, 3, 4)
tm_shape(nc) + tm_fill("stdres", breaks=brks, midpoint=0, palette="RdBu") + tm_layout(legend.outside=TRUE)
```

The dispersion is equal to `r summary(res)$dispersion`, much greater than unity; we calculate
the corrected probability map values by taking the standardised
residuals of the model, taking the size of the dispersion into account;
the results are shown above. Many fewer counties appear
now to have unexpectedly large or small numbers of cases. This is an
ad-hoc adjustment made because R provides access to a wide range of
model-fitting functions that can be used to help check our assumptions.
@gomez-rubio+ferrandiz+lopez:2003 chose rather to construct a
probability map under the hypothesis that data are drawn from a Negative
Binomial distribution.

So far, none of the maps presented have made use of the spatial
dependence possibly present in the data. A further elementary step that
can be taken is to map Empirical Bayes estimates of the rates, which are
smoothed in relation to the raw rates. The underlying question here is
linked to the larger variance associated with rate estimates for
counties with small populations at risk compared with counties with
large populations at risk. Empirical Bayes estimates place more credence
on the raw rates of counties with large populations at risk, and modify
them much less than they modify rates for small counties. In the case of
small populations at risk, more confidence is placed in either the
global rate for the study area as a whole, or for local Empirical Bayes
estimates, in rates for a larger moving window including the neighbours
of the county being estimated. The function used for this in **spdep** is
[`EBlocal()`]{}, initially contributed by Marilia Carvalho. It parallels
a similar function in GeoDa, but uses the @bailey+gatrell:1995
interpretation of @marshall:1991, rather than that in GeoDa
[@anselin+syabri+smirnov:2002].

```{r echo=TRUE}
global_rate <- sum(nc$SID74)/sum(nc$BIR74)
nc$Expected <- global_rate * nc$BIR74
res <- EBlocal(nc$SID74, nc$Expected, ncCC89, zero.policy=TRUE)
nc$EB_loc <- res$est
```

```{r, eval=is_tmap}
brks <- c(0, 0.25, 0.5, 0.75, 1, 2, 3, 4, 5)
nc_miss <- st_centroid(st_geometry(nc[card(ncCC89) == 0,]), of_largest_polygon)
tm_shape(nc) + tm_fill("EB_loc", breaks=brks, midpoint=1, palette="RdBu") + tm_layout(legend.outside=TRUE) + tm_shape(nc_miss) + tm_symbols(shape=8, size=0.5)

```

The results are shown in Figure \ref{EBlocal}. Like other relevant
functions in **spdep**,`EBlocal()` takes a `zero.policy`
argument to allow missing values to be passed through. In this case,
no local estimate is available for the two counties with no neighbours,
marked by stars.

In addition to Empirical Bayes smoothing globally, used both for disease
mapping and the Assun[c]{}ão and Reis correction to Moran’s $I$ for
rates data (to shrink towards the global rate when the population at
risk is small, here as a Monte Carlo test), lists of local neighbours
can be used to shrink towards a local rate.

```{r echo=TRUE}
set.seed(1)
EBImoran.mc(nc$SID74, nc$BIR74, nb2listw(ncCC89, style="B", zero.policy=TRUE), nsim=999, zero.policy=TRUE)
```

## Exploration and modelling of the data

One of the first steps taken by @cressie+read:1985 is to try to bring
out spatial trends by dividing North Carolina up into $4\times4$ rough
rectangles. Just to see how this works, let us map these rough
rectangles before proceeding further.

```{r echo=TRUE}
nc$both <- factor(paste(nc$L_id, nc$M_id, sep=":"))
nboth <- length(table(unclass(nc$both)))
``` 

```{r, eval=is_tmap}
tm_shape(nc) + tm_fill("both", palette="Set1", title="rough\nrectangles") + tm_layout(legend.outside=TRUE)
```

Cressie constructs a transformed SIDS rates variable, 1974–78, for his
analyses (with co-workers). We can replicate his stem-and-leaf figure on
p. 396 in the book, taken from @cressie+read:1989:

```{r echo=TRUE}
nc$ft.SID74 <- sqrt(1000)*(sqrt(nc$SID74/nc$BIR74) + sqrt((nc$SID74+1)/nc$BIR74))
stem(round(nc$ft.SID74, 1), scale=2)
```

### Median polish smoothing {#medpol}

@cressie:1991 [pp. 46–48, 393–400] discusses in some detail how
smoothing may be used to partition the variation in the data into smooth
and rough. In order to try it out on the North Carolina SIDS data set,
we will use a coarse gridding into four columns and four rows given by
@cressie:1991 [pp. 553–554], where four grid cells are empty; these are
given by variables [`L_id`]{} and [`M_id`]{} in object [`nc`]{}. Next we
aggregate the number of live births and the number of SIDS cases
1974–1978 for the grid cells:

```{r echo=TRUE,eval=TRUE}
mBIR74 <- tapply(nc$BIR74, nc$both, sum)
mSID74 <- tapply(nc$SID74, nc$both, sum)
```

Using the same Freeman-Tukey transformation as is used for the county
data, we coerce the data into a correctly configured matrix, some of the
cells of which are empty. The [`medpolish`]{} function is applied to the
matrix, being told to remove empty cells; the function iterates over the
rows and columns of the matrix using [`median`]{} to extract an overall
effect, row and column effects, and residuals:

```{r echo=TRUE,eval=TRUE}
mFT <- sqrt(1000)*(sqrt(mSID74/mBIR74) + sqrt((mSID74+1)/mBIR74))
# mFT1 <- t(matrix(mFT, 4, 4, byrow=TRUE))
# wrong assignment of 12 elements to a 4x4 matrix detected by CRAN test 2021-05-22
rc <- do.call("rbind", lapply(strsplit(names(mFT), ":"), as.integer))
mFT1 <- matrix(as.numeric(NA), 4, 4)
for (i in 1:nrow(rc)) mFT1[rc[i,1], rc[i,2]] <- mFT[i]
med <- medpolish(mFT1, na.rm=TRUE, trace.iter=FALSE)
med
```

Returning to the factors linking rows and columns to counties, and
generating matrices of dummy variables using [`model.matrix`]{}, we can
calculate fitted values of the Freeman-Tukey adjusted rate for each
county, and residuals by subtracting the fitted value from the observed
rate. Naturally, the fitted value will be the same for counties in the
same grid cell:

```{r echo=TRUE,eval=TRUE}
mL_id <- model.matrix(~ as.factor(nc$L_id) -1)
mM_id <- model.matrix(~ as.factor(nc$M_id) -1)
nc$pred <- c(med$overall + mL_id %*% med$row + mM_id %*% med$col)
nc$mp_resid <- nc$ft.SID74 - nc$pred
```

```{r, eval=is_tmap}
out1 <- tm_shape(nc) + tm_fill(c("ft.SID74", "pred")) + tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("Observed", "Median polish prediction"))
out2 <- tm_shape(nc) + tm_fill("mp_resid", midpoint=0) + tm_layout(legend.outside=TRUE)
tmap_arrange(out1, out2, ncol=1)
```


The figure shows the median polish smoothing results as
three maps, the observed Freeman-Tukey transformed SIDS rates, the
fitted smoothed values, and the residuals. In addition, a plot for the
median polish object is also shown, plotting the smooth residuals
against the outer product of the row and column effects divided by the
overall effect, which would indicate a lack of additivity between row
and column if this was the case — this is more relevant for analysis of
tables of covariates rather than geographical grids.

## References

[^1]: These data were taken with permission from a now-offline link:
[sal.agecon.uiuc.edu/datasets/sids.zip]; see also [GeoDa Center](https://geodacenter.github.io/data-and-lab/) for a contemporary source.