Fit a recursive partitioning model (classification tree) to data from sources

Usage

ps_tree(
  doc = "ps_tree",
  data,
  GroupVar,
  Groups = "All",
  AnalyticVars,
  wts = NA,
  Seed = 11111,
  CpDigits = 3,
  plotTree = TRUE,
  plotCp = TRUE,
  Model,
  ModelTitle,
  minSplit = 20,
  cP = 0.01,
  predictSources = TRUE,
  predictUnknowns = FALSE,
  unknownData,
  ID = " ",
  unknownID = " ",
  folder = " "
)

Arguments

doc

A string with documentation added to the definition of usage; default is ps_tree (the function name)

data

A data frame with the source data to be analyzed

GroupVar

The name of the variable defining groups; grouping is required

Groups

A vector of codes for the groups to be used; 'All' (the default) uses all groups

AnalyticVars

A vector with the names (character values) of the analytic variables

wts

Optional weights for the observations: if used, a vector with length nrow(data); if NA (the default), equal weights are assumed

Seed

A positive integer, to produce a reproducible analysis

CpDigits

The number of significant digits to display in the Cp table, default value is 3

plotTree

Logical. If TRUE (the default), plot the recursive partitioning tree

plotCp

Logical. If TRUE (the default), plot the Cp table values

Model

A character string containing the names of the variables (characters) to be considered, separated by + signs

ModelTitle

The parameter Model as a single character value

minSplit

The minimum size of a group for splitting, default is 20 (the default in rpart())

cP

The required improvement in Cp for a group to be split, default is .01 (the default in rpart())

predictSources

Logical: if TRUE, use the tree to predict sources for the source data; default is TRUE

predictUnknowns

Logical: if TRUE, use the tree to predict sources for observations in unknownData; default is FALSE

unknownData

Data frame with data used to predict sources, must contain all variables in AnalyticVars

ID

If not " " (the default), the name of a variable identifying a sample in data

unknownID

If not " " (the default), the name of a variable identifying a sample in unknownData

folder

The path to the folder in which data frames will be saved; default is " "

Value

The function returns a list with the following components:

  • usage: A string with the contents of the argument doc, the date run, and the version of R used

  • dataUsed: The contents of the argument data restricted to the groups used

  • params_grouping: A list with the values of the arguments GroupVar and Groups

  • analyticVars: A vector with the value of the argument AnalyticVars

  • params: A list with the values of the grouping, logical, and splitting parameters

  • Seed: A positive integer to set the random number generator

  • model: A character string with the value of the argument ModelTitle

  • treeFit: A list with details of the tree construction

  • classification: A data frame showing the cross-classification of sources and predicted sources. Rows represent sources, columns represent predicted sources

  • CpTable: A data frame showing the decrease in Cp with increasing numbers of splits

  • predictedSource: If predictSources = TRUE, a data frame with the predicted source for each source sample, plus the known source, the sample ID (if given), and the analytic variable values

  • predictedProbs: If predictSources = TRUE, a data frame with the set of prediction probabilities for each source sample, plus the known source and sample ID (if given)

  • predictedSourceUnknowns: If predictUnknowns = TRUE, a data frame with the predicted source for each unknown sample, plus the sample ID (if given) and the analytic variable values

  • predictedProbsUnknowns: If predictUnknowns = TRUE, a data frame with the set of prediction probabilities for each unknown sample, plus the sample ID (if given)

  • errorRate: If predictSources = TRUE, the proportion of misassigned source samples

  • errorCount: If predictSources = TRUE, a vector with the number of misassigned sources and total number of sources

  • predictedTotalsUnknowns: If predictUnknowns = TRUE, a vector with the number of objects predicted to be from each source

  • location: The value of the argument folder
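
The arithmetic behind the classification, errorCount, and errorRate components can be illustrated with a small sketch. The known and predicted vectors below are invented toy data, and the code is an assumption about how such summaries are typically computed in base R, not the package's own implementation.

```r
# Toy known and predicted sources (invented for illustration only)
known     <- factor(c("A", "A", "A", "B", "B", "C", "C", "C"),
                    levels = c("A", "B", "C"))
predicted <- factor(c("A", "A", "B", "B", "B", "C", "A", "C"),
                    levels = c("A", "B", "C"))

# Cross-classification: rows are known sources, columns are predicted sources
classification <- table(source = known, predicted = predicted)

# Correct assignments lie on the diagonal; everything else is misassigned
n_total <- sum(classification)
n_wrong <- n_total - sum(diag(classification))

error_count <- c(misassigned = n_wrong, total = n_total)
error_rate  <- n_wrong / n_total

print(classification)
print(error_count)
print(error_rate)
```

In this toy data two of the eight samples are misassigned, so the error rate is 0.25.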

Details

The function fits a classification tree model using the R function rpart(). The variables in AnalyticVars are considered in the order in which they appear in the Model argument (from left to right). See the vignette for more details.
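
As a rough illustration of the kind of rpart() call described here, the sketch below fits a classification tree to the built-in iris data rather than the package's obsidian data. This page does not show how ps_tree() assembles its formula internally, so the as.formula() step, the names model_string and group_var, and the choice of iris are all assumptions made for illustration.

```r
library(rpart)

# Assumed stand-ins for ps_tree() arguments (illustrative, not the package internals)
model_string <- "Petal.Length + Petal.Width + Sepal.Length"  # role of the model statement
group_var    <- "Species"                                    # role of GroupVar

# Same value as the documented Seed default; rpart's cross-validation is random
set.seed(11111)

# Build the formula Species ~ Petal.Length + Petal.Width + Sepal.Length
tree_formula <- as.formula(paste(group_var, "~", model_string))

# minsplit and cp mirror the documented minSplit and cP defaults
fit <- rpart(tree_formula, data = iris, method = "class",
             control = rpart.control(minsplit = 20, cp = 0.01))

# Cp table rounded to 3 significant digits, as CpDigits = 3 would suggest
cp_table <- signif(fit$cptable, digits = 3)
print(cp_table)
```

Setting the seed before the fit matters because the cross-validated error columns of the Cp table depend on random fold assignment.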

Examples

# Analyze the obsidian source data with variables in the model statement in order of
# importance from a random forest analysis
data(ObsidianSources)
analyticVars<-c("Rb","Sr","Y","Zr","Nb")
save_tree <- ps_tree(data=ObsidianSources, GroupVar="Code",Groups="All",
 AnalyticVars=analyticVars, Model = "Rb"+"Sr"+"Y"+"Zr"+"Nb",
 ModelTitle = "Sr + Nb + Rb + Y + Zr", predictSources=TRUE, predictUnknowns=FALSE,
 ID="ID")



# Predict the sources of the obsidian artifacts
data(ObsidianSources)
data(ObsidianArtifacts)
analyticVars<-c("Rb","Sr","Y","Zr","Nb")
save_tree <- ps_tree(data=ObsidianSources, GroupVar="Code",Groups="All",
 AnalyticVars=analyticVars, Model = "Rb"+"Sr"+"Y"+"Zr"+"Nb",
 ModelTitle = "Sr + Nb + Rb + Y + Zr", predictSources=FALSE, predictUnknowns=TRUE,
 unknownData=ObsidianArtifacts, unknownID="ID")
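
The prediction step in the second example can be sketched with rpart alone. The code below is a minimal illustration, assuming a hypothetical split of the built-in iris data into "source" and "unknown" rows; it is not how ps_tree() is implemented, and the object names are invented for the example.

```r
library(rpart)

# Hypothetical split of iris into source rows and unknown rows (illustration only)
unknown_rows <- seq(5, nrow(iris), by = 5)
sources  <- iris[-unknown_rows, ]
unknowns <- iris[unknown_rows, names(iris) != "Species"]  # group label withheld

# Fit the tree on the source rows only
fit <- rpart(Species ~ Petal.Length + Petal.Width, data = sources,
             method = "class")

# Predicted group for each unknown, and the matching class probabilities
pred_class <- predict(fit, newdata = unknowns, type = "class")
pred_probs <- predict(fit, newdata = unknowns, type = "prob")

# Number of unknowns assigned to each group
predicted_totals <- table(pred_class)
print(predicted_totals)
print(head(pred_probs))
```

The unknown data frame only needs the predictor columns used in the formula, which parallels the requirement that unknownData contain all variables in AnalyticVars.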