Acquire Split Statistic Information — stat.split.rfsrc • Fast Unified Random Forests with randomForestSRC

Extracts split statistic information from the forest. The function returns a list of length ntree, in which each element corresponds to a tree. Element [[b]] is itself a list with one element per x-variable, each identified by its x-variable name (see xvar.names). Each element [[b]]$xvar contains the complete list of splits on xvar in tree b, with associated identifying information. The information is as follows:

treeID Tree identifier.
nodeID Node identifier.
parmID Variable identifier.
contPT Value at which the node was split in the case of a continuous variable.
mwcpSZ Size of the multi-word complementary pair in the case of a factor split.
dpthID Zero (0) based depth of split.
spltTY Split type for parent node:

bit 1   bit 0   meaning
-----   -----   -------
  0       0     0 = both daughters have valid splits
  0       1     1 = only the right daughter is terminal
  1       0     2 = only the left daughter is terminal
  1       1     3 = both daughters are terminal
spltEC End cut statistic for real valued variables, taking values in [0, 0.5]. It is small when the split is towards the edge and large when the split is towards the middle. Subtracting this value from 0.5 yields the end cut statistic studied in Ishwaran (2015), which is a way to identify end cut preference (ECP) behavior.
spltST Split statistic:
1. For objects of class (rfsrc, grow), this is the split statistic that resulted in the variable being chosen for the split.
2. For objects of class (rfsrc, predict), this is the variance of the response within the node for the test data. This value is relevant only for real valued responses. In classification and survival, it is not relevant.
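
The spltTY and spltEC fields described above can be decoded with base R alone. The following is a minimal sketch on made-up values, not output from a real forest. The helper decode.split.type is hypothetical (it is not part of randomForestSRC), and the sketch assumes spltTY is stored as the integer 0 to 3 shown in the table.

```r
## Made-up values for illustration only; not output from a real forest.
spltTY <- c(0L, 1L, 2L, 3L)
spltEC <- c(0.05, 0.25, 0.40, 0.50)

## decode.split.type is a hypothetical helper, not part of randomForestSRC.
## Per the table: bit 1 set means the left daughter is terminal,
## bit 0 set means the right daughter is terminal.
decode.split.type <- function(ty) {
  data.frame(
    spltTY         = ty,
    left.terminal  = bitwAnd(ty, 2L) != 0L,
    right.terminal = bitwAnd(ty, 1L) != 0L
  )
}

decoded <- decode.split.type(spltTY)

## End cut statistic: 0.5 - spltEC.
## Values near 0.5 correspond to splits near the edge.
ecp <- 0.5 - spltEC

print(cbind(decoded, spltEC, ecp))
```

Both daughters have valid splits exactly when left.terminal and right.terminal are both FALSE.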

# S3 method for rfsrc
stat.split(object, ...)

Arguments

object: An object of class (rfsrc, grow), (rfsrc, synthetic) or (rfsrc, predict)
...: Further arguments passed to or from other methods.

Value

Invisibly, a list with the following components:

...: ...

Author

Hemant Ishwaran and Udaya B. Kogalur

References

Ishwaran H. (2015). The effect of splitting on random forests. Machine Learning, 99:75-118.

Examples

# \donttest{
## run a forest, then make a call to stat.split
grow.obj <- rfsrc(mpg ~., data = mtcars, membership=TRUE, statistics=TRUE)
stat.obj <- stat.split(grow.obj)

## nice wrapper to extract split-statistic for desired variable
## for continuous variables plots ECP data
get.split <- function(splitObj, xvar, inches = 0.1, ...) {
  which.var <- which(names(splitObj[[1]]) == xvar)
  ntree <- length(splitObj)
  stat <- data.frame(do.call(rbind, sapply(1:ntree, function(b) {
    splitObj[[b]][which.var]})))
  dpth <- stat$dpthID
  ecp <- 1/2 - stat$spltEC
  sp <- stat$contPT
  if (!all(is.na(sp))) {
    fgC <- function(x) {
      as.numeric(as.character(cut(x, breaks = c(-1, 0.2, 0.35, 0.5),
      labels = c(1, 4, 2))))
    }
    symbols(jitter(sp), jitter(dpth), ecp, inches = inches, bg = fgC(ecp),
      xlab = xvar, ylab = "node depth", ...)
    legend("topleft", legend = c("low ecp", "med ecp", "high ecp"),
      fill = c(1, 4, 2))
  }
  invisible(stat)
}

## use get.split to investigate ECP behavior of variables
get.split(stat.obj, "disp")
# }
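
As a complement to get.split, the sketch below tabulates how often, and at what average depth, each variable is split on across the forest. The helper split.summary is hypothetical, and mock.obj is invented data that only mimics the tree-by-variable list layout described on this page. The sketch assumes each per-variable element is a matrix or data frame with a dpthID column, and that a variable with no splits in a tree appears as NULL or as an element with zero rows.

```r
## split.summary is a hypothetical helper, not part of randomForestSRC.
## Assumes splitObj[[b]][[xvar]] is a matrix or data frame with a dpthID column.
split.summary <- function(splitObj) {
  xvars <- names(splitObj[[1]])
  out <- lapply(xvars, function(xv) {
    ## collect the split depths of this variable over all trees
    dpth <- unlist(lapply(splitObj, function(tree) {
      s <- tree[[xv]]
      if (is.null(s) || NROW(s) == 0) return(NULL)
      as.data.frame(s)$dpthID
    }))
    data.frame(xvar = xv,
               n.split = length(dpth),
               mean.depth = if (length(dpth) > 0) mean(dpth) else NA_real_)
  })
  do.call(rbind, out)
}

## Mock object with made-up numbers: two trees, two variables.
mock.obj <- list(
  list(disp = data.frame(treeID = 1, nodeID = c(1, 2), dpthID = c(0, 1)),
       wt   = data.frame(treeID = 1, nodeID = 3, dpthID = 1)),
  list(disp = data.frame(treeID = 2, nodeID = 2, dpthID = 1),
       wt   = data.frame(treeID = 2, nodeID = c(1, 3), dpthID = c(0, 2)))
)

smry <- split.summary(mock.obj)
print(smry)
```

With a fitted forest, split.summary(stat.obj) would be the corresponding call, provided the assumptions above hold for the returned object.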
