ps_randomForest
ps_randomForest.Rd
Implements a random forest analysis of source data, and predicts sources of unknowns if requested
Usage
ps_randomForest(
doc = "ps_randomForest",
data,
GroupVar,
Groups = "All",
AnalyticVars,
sourceID = " ",
Ntrees = 500,
NvarUsed = NA,
Seed = 11111,
digitsImportance = 1,
plotErrorRate = TRUE,
plotImportance = TRUE,
predictSources = FALSE,
predictData = NA,
unknownID = " ",
plotSourceProbs = TRUE,
folder = " "
)
Arguments
- doc
Documentation for the function use added to model usage, default value is the function name
- data
A data frame with the data used to grow trees (source data if predictions are made)
- GroupVar
The name of variable defining groups, grouping is required
- Groups
A vector of codes for groups to be used, 'All' if use all groups
- AnalyticVars
A vector with names (character-valued) of the analytic variables
- sourceID
If not " " (the default), the name of the variable with sample ID for source data
- Ntrees
The number of trees grown, default value of 500 is that for the randomForest function
- NvarUsed
If not NA (the default), the number of variables to use in each random forest call to rpart; if NA, rpart uses the default value for randomForest() (the square root of the number of candidate variables)
- Seed
If not NA, a random number generator seed to produce reproducible results; default value is 11111
- digitsImportance
The number of significant digits for the importance measure, default is 1
- plotErrorRate
Logical, whether to show the error rate plot, default is TRUE
- plotImportance
Logical, whether to show the plot of variable importance, default is TRUE
- predictSources
Logical; if T, predict sources for the data in predictData; default is FALSE
- predictData
A data frame or matrix with data used to predict sources for observations, must contain all variables in AnalyticVars_
- unknownID
if not " " (the default), the name of the variable with the sample ID for artifact data
- plotSourceProbs
Logical, if TRUE (the default) and predictSources=TRUE, show box plots of source probabilities
- folder
The path to the folder in which data frames will be saved; default is " "
Value
The function returns a list with the following components:
usage: A string with the contents of the argument doc, the date run, the version of R used
dataUsed: The contents of the argument data restricted to the groups used
sourcesNA: A data frame with data from the data frame data with missing values, NÁ if no missing values
analyticVars: A vector with the value of the argument AnalyticVars
params: A list with the values of the grouping, logical, and numeric arguments
formulaRf: The formula used in the analysis (the variables specified in the argument AnalyticVars separated by + signs)
forest: A summary of the random forest call, estimated error rate, and confusion matrix
importance: A data frame with information on the importance of each variable in AnalyticVars
confusion: A data frame with the estimate of the confusion matrix
predictedData: A data frame with the artifact data used for predictions; if there is missing data, after imputation of the missing data
predictedNA: A data frame with the observations for which missing data were imputed; NA if there are no missing data
predictedSources: A data frame with prediction information, sample ID (if requested), and values of AnalyticVars
predictedTotals: A vector with the predicted totals for each group (source)
impError: The estimated OOB (out of bag) error for imputed predictor data; NA if no imputed data
location: The value of the parameter folder
Details
The function implements a random forest analysis using the R function randomForest(). If predictSources and plotSourceProbs are TRUE, the function creates two box plots. The first plot shows, for each source, the set of probabilities of assignment to that source for the observations assigned to that source (all of these probabilities should be large). The second plot shows, for each source, the set of probabilities of assignment to that source for the observations not assigned to that source (for each source, there is one such probability for observation); these probabilities should be relatively small, and some should be zero. See the vignette for more details and examples of these plots.
Examples
data(ObsidianSources)
analyticVars<-c("Rb","Sr","Y","Zr","Nb")
save_randomForest <- ps_randomForest(data=ObsidianSources, GroupVar="Code",Groups="All",
sourceID="ID", AnalyticVars=analyticVars, NvarUsed=3, plotSourceProbs=FALSE)
#
# predicted sources for artifacts
data(ObsidianSources)
data(ObsidianArtifacts)
analyticVars<-c("Rb","Sr","Y","Zr","Nb")
save_randomForest <- ps_randomForest(data=ObsidianSources, GroupVar="Code",Groups="All",
AnalyticVars=analyticVars, sourceID="ID", NvarUsed=3, plotErrorRate=FALSE,
plotImportance=FALSE, predictSources=TRUE, predictData=ObsidianArtifacts, unknownID="ID",
plotSourceProbs=TRUE)
#> Warning: some notches went outside hinges ('box'): maybe set notch=FALSE
#> Warning: some notches went outside hinges ('box'): maybe set notch=FALSE