
Chapter 8 Predicting site location with simple additive raster sensitivity analysis using R

Matthew Harris

8.0.1 Simple Arbitrary Additive Weight Sensitivity Model

This code demonstrates the process of building a simple arbitrary-weight archaeological sensitivity model, selecting a model threshold, and validating the model. This type of model is by no means the 'best' model for archaeological predictive modeling, but it is very common, it is a useful approach to sensitivity stratification, and it makes a good example of working with raster data and model validation.

The primary use for this approach is:

  • Stratify areas of a landscape based on the additive arbitrary sensitivity weights of one or more variables.

Some advantages include:

  • Conceptually simple; easy to explain
  • Based on expert opinion
  • Computationally simple
  • Not based on a sample of known sites
  • Sidesteps some bias in site sampling

Some drawbacks include:

  • Central Limit Theorem challenges statistical basis
  • Difficult to optimize
  • Not based on a sample of known sites
  • No method for assessing variable importance
  • Does not include the variation evident in known site locations

Primary R techniques used in this script:

  • Reading/writing raster files
  • Using the raster package in R (e.g. reclassify, mask, etc…)
  • Interactive maps with the mapview package
  • Creating your own functions
  • Plotting with ggplot2

8.0.2 Study Area

The study area is relatively large (897,336 raster cells at 10.63425-meter resolution, for a total of 101,477,305 square meters) and includes 120 pre-contact archaeological sites from somewhere in the United States. The coordinates of the data have been altered to protect the locations of the sites; for the purpose of mapping, the data have been relocated to map over Fargo, North Dakota. In an ideal world, more site location data would be free…

8.0.3 Code

The code below follows the sequence of defining functions, loading packages, loading raster data, establishing weights, classifying rasters based on the weights, summing the weighted rasters, validating against known site locations, and plotting the results.

8.0.4 Define functions

I typically put all my functions at the beginning of my code once I have them in working order. Here I define a function that takes the predicted sensitivity and presence/absence of known sites and returns a series of performance metrics to demonstrate how well the model classifies known sites. This function utilizes the ROCR package to define performance metrics across all thresholds.

# predict == predicted value from model raster
# response == site present (1) or site absent (0) at each predicted cell location
balance_threshold <- function(predict, response) {
  pred <- ROCR::prediction(predict, response)
  perf <- ROCR::performance(pred, "sens", "spec")
  auc  <- ROCR::performance(pred, "auc")
  auc  <- round(auc@y.values[[1]], 3)
  df <- data.frame(Weight = perf@alpha.values[[1]],
                   Spec = perf@x.values[[1]],
                   Sens = perf@y.values[[1]],
                   Back_pcnt = 1 - perf@x.values[[1]],
                   Xover = abs((perf@y.values[[1]] + (1 - perf@x.values[[1]])) - 1))
  df$kg <- 1 - ((1 - df$Spec) / df$Sens)     # 1 - (proportion of background captured / proportion of sites captured)
  df$reach <- 1 - ((1 - df$Sens) / df$Spec)
  df$reach <- ifelse(df$reach == 1, 0, df$reach) # reset reach == 1 so it is not selected below
  df <- data.frame(apply(df, 2, function(x) round(x, 3)))
  sens_spec <- df[which.max(df$Sens + df$Spec), "Weight"]
  xover <- df[which.min(df$Xover), "Weight"]
  kg <- df[which.max(df$kg), "Weight"]
  reach <- df[which.max(df$reach), "Weight"]   # max reach (exact 1s already reset to 0)
  list(df = df, sens_spec = sens_spec, xover = xover,
       auc = auc, kg = kg, reach = reach)
}
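
For illustration, the function can be exercised on a small simulated example (made-up numbers, not the project data) before it is applied to the rasters later on:

# toy example: 50 'site' cells with generally higher weights than 500 background cells
set.seed(717)
toy_pred <- c(runif(50, 40, 145), runif(500, 0, 145))
toy_resp <- c(rep(1, 50), rep(0, 500))
toy_val  <- balance_threshold(toy_pred, toy_resp)
toy_val$auc        # area under the ROC curve
toy_val$sens_spec  # weight that maximizes sensitivity + specificity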

8.0.5 Load packages

I also typically put all of the necessary packages at the top of my code and note in a comment why I used each package; otherwise, I will forget why I needed some of the less common ones. Using R typically requires a series of packages for any analysis. This is a strength of R, but it requires one to learn how various packages work and to find the right package for the task at hand. Don't be worried about using a bunch of packages, but try to keep them updated!

library("raster")       # for raster manipulation
library("rgdal")        # for raster processing
library("dplyr")        # for data processing
library("mapview")      # for interactive map plots
library("ggplot2")      # for plotting results
library("ROCR")         # for model validation functions
library("RColorBrewer") # for raster color scale
library("knitr")        # for printing nicer tables
library("viridis")      # for color scale in ggplot

8.0.6 Load raster files

The raster package is the workhorse for dealing with raster data in R. It is a large and complex package, but it has good documentation and is fairly easy to learn. The files used here were saved as .tif files from ArcGIS. A critical part of making this work is to set your environments in ArcGIS or QGIS so that the grid cells of each raster align, all rasters use the same projection, and the rasters are masked consistently so they have the same number of rows and columns. This can all be done in R, but to simplify this script I did it in ArcGIS and exported to .tif files.

The raster stack is a data structure that combines rasters of the same dimensions into a single object for ease of use. Among other things, this lets a single operation run on every raster in the stack, as we will see.

Variables:

  • slope.tif is the percent slope of the landscape derived from a 1/3rd arc-second DEM
  • ed_h2.tif is the euclidean distance to National Hydrology Dataset (NHD) stream features
  • ed_h4.tif is the euclidean distance to National Wetland Dataset (NWD) wetland polygons
  • sites.tif is the location of 120 known archaeological sites as rasterized polygons
data_loc <- "clip_raster/"
slope <- raster(paste0(data_loc, "slope.tif"))
ed_h2 <- raster(paste0(data_loc, "ed_h2.tif"))
ed_h4 <- raster(paste0(data_loc, "ed_h4.tif"))
sites <- raster(paste0(data_loc, "sites.tif"))
raster_vars <- stack(slope, ed_h2, ed_h4, sites)
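
Because everything downstream assumes the layers line up cell for cell, a quick check in R is worthwhile; the calls below are illustrative and were not part of the original workflow:

# stops with an error if the layers differ in extent, rows/columns, or projection
compareRaster(slope, ed_h2, ed_h4, sites)

# a single call applied to the stack summarizes every layer at once
cellStats(raster_vars, stat = "mean")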

8.0.7 Construct weights

The core component of this model is establishing the arbitrary weights assigned to classes of each raster variable. These weights can be based on educated guesses, on empirical evidence, or set to test a theory. Likewise, the manner in which each raster is classified into regions to be weighted is equally arbitrary or empirically based. What is not happening here is using known site data to find the optimum set of weights and class breaks that best separate site locations from non-sites. That would be a model that discriminates based on known sites, with weights derived from a metric such as information value or from a statistical classification method such as logistic regression, random forests, or any number of other models.

Here, the weights and splits are based on the regional literature for how micro-social camp type sites may be distributed relative to these variables. Known site locations obviously influence this understanding, but they are not directly used to calculate the weights. Each weighting table needs three columns giving the from value, the to value, and the weight: the raster values to classify from and to, and the weight to assign to that class.

### Slope weighting model ###
slp_from <- c(0, 3, 5, 8, 15)
slp_to   <- c(3, 5, 8, 15, 99999)
slp_wght <- c(50, 30, 15, 5, 0)
slp_rcls <- cbind(slp_from, slp_to, slp_wght)

### Dist to h20 weighting model ###
h20_from <- c(0, 100, 200, 400, 800)
h20_to   <- c(100, 200, 400, 800, 9999)
h20_wght <- c(60, 25, 10, 4, 1)
h20_rcls <- cbind(h20_from, h20_to, h20_wght)

### Dist to wetland weighting model ###
wtl_from <- c(0, 100, 200, 400, 800)
wtl_to   <- c(100, 200, 400, 800, 9999)
wtl_wght <- c(35, 25, 20, 15, 5)
wtl_rcls <- cbind(wtl_from, wtl_to, wtl_wght)

print(slp_rcls)
##      slp_from slp_to slp_wght
## [1,]        0      3       50
## [2,]        3      5       30
## [3,]        5      8       15
## [4,]        8     15        5
## [5,]       15  99999        0
print(h20_rcls)
##      h20_from h20_to h20_wght
## [1,]        0    100       60
## [2,]      100    200       25
## [3,]      200    400       10
## [4,]      400    800        4
## [5,]      800   9999        1
# an example of a more fully formatted table
knitr::kable(wtl_rcls, digits = 0,
             col.names = c("From", "To", "Weight"),
             caption = "Sensitivity Weights for Distance (m) to Wetlands (NWD)")
Table: Sensitivity Weights for Distance (m) to Wetlands (NWD)

 From     To   Weight
    0    100       35
  100    200       25
  200    400       20
  400    800       15
  800   9999        5

8.0.8 Reclassify rasters

The code to reclassify the rasters is very straightforward. For each of the three variables, we select that raster from the stack using list-style indexing and pass it the corresponding weight table. The reclassify() function in the raster package does the work for us.

raster_vars[["slope"]] <- reclassify(raster_vars[["slope"]], slp_rcls)
raster_vars[["ed_h2"]] <- reclassify(raster_vars[["ed_h2"]], h20_rcls)
raster_vars[["ed_h4"]] <- reclassify(raster_vars[["ed_h4"]], wtl_rcls)

8.0.9 Summing rasters

Because the weighted rasters are stored in the raster stack, the sum() function will add together the layers that we indicate; in this case layers 1 through 3: the weighted slope, ed_h2, and ed_h4 rasters.

model_sum <- sum(raster_vars[[1:3]])

A base plot of model_sum:
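
The original text shows only the resulting figure; a minimal call to reproduce it, assuming the RColorBrewer scale loaded earlier, might be:

plot(model_sum,
     col = rev(RColorBrewer::brewer.pal(9, "Spectral")),  # assumed color ramp
     main = "Summed Sensitivity Weights")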

A ggplot of model_sum:

coords <- coordinates(model_sum)
x <- data.frame(x = coords[,"x"], y = coords[,"y"], value = as.vector(model_sum))
ggplot(x, aes(x = x, y = y, fill = value)) +
  geom_raster(hjust = 0, vjust = 0) +
  theme_minimal() +
  viridis::scale_fill_viridis() +
  labs(title = "Summed Sensitivity Weights",
       caption = "Projection: Albers Conic Equal Area")

8.0.10 Clip sites

In order to validate the weighting scheme and assess model performance, it is necessary to find the summed weight values at known site locations. With this information, one can see whether the summed weights discriminate the parts of the landscape known to contain sites from the summed weights of the study area overall. Ideally, a model will assign high summed weights where sites are known and lower weights on average where there are no known sites, i.e. the background. However, since the purpose of this model is to identify areas of site potential/sensitivity, we need it to include some areas of high summed weights with no known sites. The trick of any model type is to balance the size of this false-positive area (until survey proves otherwise) against the false-negative areas where we misclassify sites as absent. Further along we will use performance metrics to try to find the weight threshold that achieves this balance.

The mask() function of the raster package is used to clip the summed weight raster model_sum by the known site locations.

sites_sum <- mask(model_sum, raster_vars[["sites"]])
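
To get from these two rasters to the vectors that balance_threshold() expects, one possible sketch is below; the chapter's own validation code appears further along, so treat this as an assumed outline rather than the author's implementation:

# summed weights at known site cells vs. across the whole study area
site_vals <- na.omit(values(sites_sum))
back_vals <- na.omit(values(model_sum))
pred <- c(site_vals, back_vals)
resp <- c(rep(1, length(site_vals)), rep(0, length(back_vals)))
val  <- balance_threshold(pred, resp)   # can be slow on ~900k cells
val$auc        # overall discrimination
val$sens_spec  # candidate sensitivity threshold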

8.0.10.1 Mapview

Spatial data can be easily rendered in a 'slippy map' format with the mapview package, which allows for overlays, base maps, legends, zooming, and panning with a very easy-to-use function.

m <-  mapview(model_sum, 
              col.regions = viridisLite::viridis, 
              alpha = 0.75, maxpixels =  897336) +
      mapview(sites_sum, 
              col.regions = viridisLite::magma, 
              maxpixels =  897336)
m