This book is in Open Review. I want your feedback to make the book better for you and other readers. To add your annotation, select some text and then click the on the pop-up menu. To see the annotations of others, click the in the upper right hand corner of the page

Chapter 2 Basic spatial analysis in R: Point pattern analysis

Phil Riris (Institute of Archaeology, UCL)

2.0.1 Introduction

This demo is prepared for the SAA 2017 forum at Vancouver “How to do archaeological science using R”. This demo has three goals:

Demonstrate how to load two types of spatial data (vectors and rasters) and plot them
Demonstrate data coercion into useful formats for spatial analysis
Explore spatial dependency based on a) between-point interaction, b) external covariates

For more examples of spatial stats applied to archaeological questions, see:

Riris, P. 2017. “Towards an artefact’s-eye view: Non-site analysis of discard patterns and lithic technology in Neotropical settings with a case from Misiones province, Argentina” Journal of Archaeological Science: Reports 11 DOI: http://dx.doi.org/10.1016/j.jasrep.2017.01.002

2.0.1.1 Loading and plotting spatial data

Suppose we have surveyed a prehistoric settlement landscape in full, and we are reasonably certain all major sites have been located by our awesome field team. We want to understand the empirical pattern in front of us, specifically how smaller sites not only outnumber, but also seem to group around larger and more important sites. How strong is this pattern? Does it hold equally across the whole study area? Can we explain the distribution of a type of site in terms of that of another?

Here are the R packages that we will need to answer these questions:

require(spatstat) # Load packages
require(sp)
require(maptools)
require(rgdal)
require(raster)

With time and money limited and excavation difficult, we can’t return to the field, so we have to resort to the data we have: in this case a basic distribution map! We will import our field data into R, briefly investigate it to ensure that it looks right:

main.sites <- readShapeSpatial("main.sites.shp") # Load our sites from shapefile

secondary.sites <- readShapeSpatial("secondary.sites.shp")

elev <- raster("elevation.tif") # Load elevation map

str(main.sites) # Investigate and check that the files have loaded correctly.

## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
##   ..@ data       :'data.frame':  21 obs. of  1 variable:
##   .. ..$ marks: Factor w/ 1 level "Primary": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "data_types")= chr "C"
##   ..@ coords.nrs : num(0) 
##   ..@ coords     : num [1:21, 1:2] 0.0442 0.4661 0.8143 0.3285 0.7165 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:21] "0" "1" "2" "3" ...
##   .. .. ..$ : chr [1:2] "coords.x1" "coords.x2"
##   ..@ bbox       : num [1:2, 1:2] 0.0314 0.0483 0.9766 0.9296
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:2] "coords.x1" "coords.x2"
##   .. .. ..$ : chr [1:2] "min" "max"
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
##   .. .. ..@ projargs: chr NA

str(secondary.sites)

## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
##   ..@ data       :'data.frame':  105 obs. of  1 variable:
##   .. ..$ marks: Factor w/ 1 level "Secondary": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "data_types")= chr "C"
##   ..@ coords.nrs : num(0) 
##   ..@ coords     : num [1:105, 1:2] 0.0432 0.1002 0.1007 0.0414 0.4581 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:105] "0" "1" "2" "3" ...
##   .. .. ..$ : chr [1:2] "coords.x1" "coords.x2"
##   ..@ bbox       : num [1:2, 1:2] 0.00799 0.01278 0.9964 0.98552
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:2] "coords.x1" "coords.x2"
##   .. .. ..$ : chr [1:2] "min" "max"
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
##   .. .. ..@ projargs: chr NA

With the data imported and checked, we can make a simple exploratory plot:

# png(filename = paste0(getwd(), "/", "Sites.png"), 
#     width = 600, 
#     height = 600, 
#     units = "px")

plot(elev, 
     main = "Study Area", 
     axes = FALSE) # Inspect data

plot(main.sites, 
     col = "red", 
     pch = 16, 
     add = TRUE)

plot(secondary.sites, 
     col = "blue", 
     pch = 17, 
     add = TRUE)

legend("bottomleft", 
       title = "Site types", 
              c("Primary", "Secondary"), 
       pch = c(16, 17), 
       col = c("red", "blue"))

# dev.off()

2.0.1.2 Querying spatial data

For visualisation of raw patterns and classic exploratory tools we must coerce our data into a ‘ppp’ (planar point pattern) object for spatial statistics using the spatstat library, and then compute the intensity of the data (or ‘stuff per unit area’):

main.pp <- as.ppp(main.sites) # Coerce into 'ppp' object for spatial statistics
secondary.pp <- as.ppp(secondary.sites)

sites.pp <- superimpose(main.pp, secondary.pp)

intensity(sites.pp) # Measure intensity of different patterns

##   Primary Secondary 
##  21.84156 109.20781

We can also compute the intesity for all sites together:

allsites <- unmark(sites.pp) # removes "marks" (i.e. descriptors) from point pattern

intensity(allsites) # Intensity (points/unit of area) = 131.0494

## [1] 131.0494

This is not very informative on its own - an abstract number. Let’s see if we can get more insights from additional visualisations and stats:

d.sites <- density.ppp(sites.pp) # Overall density of all points in our data

plot(d.sites, main = "Density of all sites") # Plot interpolated values
plot(sites.pp, add = T)

For numeric marks (e.g. number of structures per settlement) we would use smooth.ppp() function. But is this informative on its own? We have two different types of site.

par(mfrow=c(1,2)) # Set number of plotting rows to 1 and number of columns to 2
 
plot(density(main.pp), main = "Density of primary settlements")
plot(density(secondary.pp), main = "Density of satellite settlements")

Visualisations alone can be misleading. The patterns are clearly related, but how do we formally characterise this? First, it’s important to make note of any significant spatial relationships:

clarkevans.test(secondary.pp, correction = c("Donnelly"), nsim = 999)

## 
##  Clark-Evans test
##  Donnelly correction
##  Monte Carlo test based on 999 simulations of CSR with fixed n
## 
## data:  secondary.pp
## R = 0.65946, p-value = 0.002
## alternative hypothesis: two-sided

R < 1 among secondary settlements

clarkevans.test(main.pp, correction = c("Donnelly"), nsim = 999)

## 
##  Clark-Evans test
##  Donnelly correction
##  Monte Carlo test based on 999 simulations of CSR with fixed n
## 
## data:  main.pp
## R = 1.053, p-value = 0.624
## alternative hypothesis: two-sided

R > 1 among primary sites, but not significant

Tentative conclusion: there are two different processes creating the observed distribution of sites

2.0.1.3 Exploring spatial relationships

More formally, we can state our null hypothesis as: Primary and secondary sites are realisations of two independent point processes (different cultural strategies?)

A point process is simply a process (random or otherwise) that generates an empirical pattern.

We know in this case that they are almost certainly related, but it is always worthwhile to go through the motions in case of anything unexpected cropping up.

A good place to start is bivariate spatial statistics such as the bivariate pair-correlation function. It tests for clustering or dispersion between two point patterns, i.e. the influence of one on the other.

The “classic” pair-correlation function and its ancestor Ripley’s K are both univariate, but still worth using. However, here, we focus on interaction and dependency between two site-types.

plot(pcfcross(
  sites.pp, 
  divisor = "d", 
  correction = "Ripley"), 
  main = "Bivariate pair-correlation function")

Empirical pattern obviously differs greatly from function under spatial randomness: g(pois), but how greatly?

We can use the same approach and simulate envelopes of significance to compare agains the empirical pattern.

bivar <- envelope(
  fun = pcfcross, 
  sites.pp, 
  divisor = "d", 
  nsim = 99, 
  correction = c("Ripley")
  )

## Generating 99 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
## 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
## 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,  99.
## 
## Done.

plot(bivar, main = "Pair-correlation function, with simulation envelope")

This function shows us the following: very strong association at short ranges (up to 10 km), but around 12 km and 16-17 km there are actually fewer secondary sites than expected. A univariate function shows this dispersion to be present in the whole dataset. In fact, this matches the findings of the Clark-Evans NN test.

What may have caused secondary sites to be clustered to primary sites, but primary sites to be dispersed?

Now we know: a) the distribution of sites varies in space, b) the presence of one type of site is tied to the presence of another.

We can now take the step from exploratory spatial analysis to confirmatory.

2.0.1.4 Point process modelling

Define a spatstat window object for primary sites, and compute a kernel density estimate of primary settlement sites, to act as covariate, then fit point process model with the covariate:

pwin <- as.owin(c(-0.1, 1.1, -0.1, 1.1)) # 

main.pp$window <- pwin # Assign window

pden <- density(main.pp) # Kernel density estimate of primary settlement sites, to act as covariate

fit.sec <- ppm(unmark(secondary.pp), ~ Cov1, covariates = list(Cov1 = pden)) # Fit point process model with the covariate

This appears to be somewhat acceptable model; locations of secondary sites seem to depend on those of primary sites. But how good is this fit?

# png(filename = "density.png", width = 600, height = 600, units = "px")

plot(predict(fit.sec), 
     main="Predicted distribution of secondary settlements", 
     col = rainbow(255))

points(secondary.pp) # Appears to be somewhat acceptable model;

                  # locations of secondary sites seem to depend
                  # on those of primary sites! But how good is this fit?

#dev.off()

The Residual K function indicates that the locations of satellite settlements, although spatially autocorrelated with the principal settlements, are poorly explained by using only this covariate. Although a good fit is found between ~ 14 and ~18 km (the purple lines), at scales above and below this narrow band there is both significantly more dispersion and clustering, respectively. What may explain this statistical pattern?

fit.res <- envelope(
  fun = Kres, 
  secondary.pp, 
  model = fit.sec, 
  nsim = 19, 
  correction = c("Ripley")) # Examine

## Generating 19 simulations of CSR  ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,  19.
## 
## Done.

par(mfrow = c(1,2)) # Plot with two columns

# png(filename = "statplot.png", width = 600, height = 600, units = "px")

plot(fit.res, main="Residual K")

abline(v = 0.14, col = "purple")
abline(v = 0.18, col = "purple")

plot(predict(fit.sec), main = "Predicted distribution", col = rainbow(255))
points(secondary.pp)

# dev.off()

plot(fit.res, main = "Residual K")

abline(v = 0.14, col = "purple")
abline(v = 0.18, col = "purple")

plot(predict(fit.sec), main = "Predicted distribution", col = rainbow(255))
points(secondary.pp)

2.0.2 Summary

This exercise invites us to:

Think critically about the relationships we think we see in our data,
Formalise our hypotheses about the archaeological record and test them,
Pursue further data collection and analysis when we are proven wrong.

sessionInfo()

## R version 3.3.3 (2017-03-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
## [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.1252    
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
## [1] raster_2.5-8    rgdal_1.2-5     maptools_0.9-2  sp_1.2-4       
## [5] spatstat_1.50-0 rpart_4.1-10    nlme_3.1-131   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10         knitr_1.15.17        tensor_1.5          
##  [4] magrittr_1.5         spatstat.utils_1.4-1 lattice_0.20-34     
##  [7] stringr_1.2.0        tools_3.3.3          grid_3.3.3          
## [10] mgcv_1.8-17          deldir_0.1-12        htmltools_0.3.5     
## [13] yaml_2.1.14          abind_1.4-5          goftest_1.0-4       
## [16] rprojroot_1.2        digest_0.6.12        bookdown_0.3.16     
## [19] Matrix_1.2-8         fftwtools_0.9-8      evaluate_0.10       
## [22] rmarkdown_1.4        polyclip_1.6-1       stringi_1.1.3       
## [25] backports_1.0.5      foreign_0.8-67

Want to know when the book is for sale? Enter your email so we can let you know.